Google Cloud: Professional Cloud Architect (PCA) Exam Notes – Part V

Review Compute Options

Google Backup and DR is a managed service that provides resilience and reliability for data generated on Google Cloud or on-premises infrastructure. Backups let you recover from user error, malicious activity, and other data-loss events.

  • This service offers a centralized backup management solution across services
  • Uses Cloud Storage to efficiently manage storage costs (only incremental data is saved, and you can select the appropriate storage class)
  • Minimizes recovery times

Virtual Machines

  • Enabling the Compute Engine API creates a default service account (IAM) and also enables the OS Login API
  • You can set the default region and zone in your gcloud config settings
  • GCP offers rightsizing recommendations; if you want to analyze metrics yourself, you will need the Cloud Monitoring (formerly Stackdriver) agent installed
    • GCE will tell you if you’re underutilizing your VM
    • You must stop the VM to change CPU/RAM
    • Use custom machine types if your VM’s CPU/RAM requirements don’t fit a predefined machine type
  • Use VMs if you need to add GPUs to workloads, good for ML and data processing – GKE can now support GPUs
  • NoOps solutions are products like Google App Engine and Google Kubernetes Engine; any compute instance VM will involve some ops work and is not “no-ops”
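The default region and zone mentioned above can be set once in the gcloud configuration so later commands can omit the flags; a quick sketch (the region and zone values are examples):

```shell
# Set defaults so later gcloud commands can omit --region/--zone
gcloud config set compute/region us-central1
gcloud config set compute/zone us-central1-a

# Verify the active configuration
gcloud config list
```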

Special Types of VMs

  • Shielded VMs are a security feature designed to offer a verifiable integrity of your VM instances so that you can be sure your instances are not compromised by boot- or kernel-level malware or rootkits.
    • Leverage Secure Boot, with a virtual Trusted Platform Module (vTPM), and integrity monitoring to ensure that your virtual machine has not been tampered with.
  • Confidential VMs encrypt data in use. Encryption and decryption happen within the CPU, so unencrypted data never leaves the CPU when it’s written to RAM, and thus unencrypted data is not visible to Google. Without this feature, plaintext data would sit in the VM’s RAM.
    • Enabling Confidential VM has little or no impact on most workloads, with only a 0-6% degradation in performance.
  • Preemptible VM instances are available at much lower price—a 60-91% discount—compared to the price of standard VMs. However, Compute Engine might stop (preempt) these instances if it needs to reclaim those resources for other tasks. Preemptible instances are excess Compute Engine capacity, so their availability varies with usage.
  • Spot VMs are the newer generation of preemptible VMs; unlike preemptible VMs, Spot VMs have no 24-hour maximum runtime.
  • Tau VMs offer the best price-performance ratio and are optimized for scale-out workloads.
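Spot capacity is requested at instance creation time via the provisioning model; a sketch (the instance name and zone are illustrative):

```shell
# Create a Spot VM that is stopped (not deleted) when Compute Engine
# reclaims the capacity. Name and zone are example values.
gcloud compute instances create spot-worker-1 \
    --zone=us-central1-a \
    --provisioning-model=SPOT \
    --instance-termination-action=STOP
```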

vCPUs

  • n1-highcpu means a higher proportion of CPU to memory
  • n1-highmem means a higher proportion of memory to CPU
  • n1-standard means CPU and memory resources are balanced
  • 1 vCPU equals 1 hyperthread; 2 vCPUs equal 1 physical core
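When a predefined shape doesn’t fit, a custom machine type can be requested by name; a sketch (instance name, series, and sizes are illustrative):

```shell
# Custom machine type: 6 vCPUs / 32 GB (32768 MB) RAM on the N2 series,
# using the SERIES-custom-VCPUS-MEMORY_MB naming scheme.
gcloud compute instances create custom-vm-1 \
    --zone=us-central1-a \
    --machine-type=n2-custom-6-32768
```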

Startup Scripts

  • Use startup scripts to perform any action on your VM after boot such as installing software, running updates, turning on/off services, and anything else you can script.
  • From a local file – Use gcloud (--metadata-from-file startup-script=PATH) or copy/paste into GCP Console (256 KB limit on script)
  • From GCS – in Metadata section specify startup-script-url as the metadata key and gs://path/to/bucket/and/file as the value (No size limit on script)
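The two options above can be sketched with gcloud (instance names, the script path, and the bucket URL are illustrative):

```shell
# Local file (subject to the 256 KB metadata limit)
gcloud compute instances create web-1 \
    --metadata-from-file=startup-script=./startup.sh

# From Cloud Storage (no size limit; bucket path is an example)
gcloud compute instances create web-2 \
    --metadata=startup-script-url=gs://my-bucket/startup.sh
```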

Shutdown Scripts

  • Create and run shutdown scripts that execute commands right before an instance is terminated or restarted, on a best-effort basis. This is useful if you rely on automated scripts to startup and shutdown instances, allowing instances time to clean up or perform tasks, such as exporting logs, or syncing with other systems. Very useful when combined with managed instance groups and autoscaling or pre-emptible VMs.
  • From a local file – Use gcloud (--metadata-from-file shutdown-script=PATH) or copy/paste into GCP Console (256 KB limit on script)
  • From GCS – in Metadata section specify shutdown-script-url as the metadata key and gs://path/to/bucket/and/file as the value (No size limit on script)
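The same metadata mechanism also works on an instance that already exists; a sketch with hypothetical instance and bucket names:

```shell
# Attach a shutdown script to an existing VM from a local file
gcloud compute instances add-metadata worker-1 \
    --metadata-from-file=shutdown-script=./cleanup.sh

# Or reference a script stored in Cloud Storage
gcloud compute instances add-metadata worker-1 \
    --metadata=shutdown-script-url=gs://my-bucket/cleanup.sh
```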

Storage Options

Feature             | PD HDD                     | PD SSD                           | Local SSD                | RAM disk
Data Redundancy     | Yes                        | Yes                              | No                       | No
Encryption at Rest  | Yes                        | Yes                              | Yes                      | N/A
Snapshotting        | Yes                        | Yes                              | No                       | No
Bootable            | Yes                        | Yes                              | No                       | No
Use case            | General, bulk file storage | Very random IOPS, lower latency  | High IOPS + low latency  | Low latency + risk of data loss

Local SSD – Zonal 375GB SSD physically attached to servers, attached to a single instance, can be striped across up to 24 drives for 9TB

  • All data will be lost when the instance shuts down!
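Local SSDs are attached at instance creation; each `--local-ssd` flag adds one 375 GB device. A sketch (the instance name and zone are illustrative):

```shell
# Attach two 375 GB Local SSDs over the NVMe interface; the guest OS
# can then stripe them (e.g. with mdadm) for higher throughput.
gcloud compute instances create ssd-vm-1 \
    --zone=us-central1-a \
    --local-ssd=interface=NVME \
    --local-ssd=interface=NVME
```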

Persistent Disk – Zonal or regional block storage attached to an instance over the network (network-attached block storage, closer to a SAN than a NAS file share). PDs can be either a boot disk or attached to an instance for additional storage.

  • PD Types:
    • Standard (pd-standard) – backed by mechanical HDDs
    • Balanced (pd-balanced) – an SSD that balances performance and cost
    • SSD (pd-ssd) – an NVMe-based SSD for performance
    • Extreme (pd-extreme) – an SSD optimized for high-end database workloads
  • To move a PD to another zone, snapshot it and create a new disk/VM from the snapshot
  • Network pipe is shared for network and disk I/O because these are network attached block devices
  • The larger the volume, the better the performance, up to a certain point
    • Max disk size for PD HDD or PD SSD is 65,536 GB (64 TB)
    • Most instances can have up to 128 attached PDs (see restrictions), for a maximum of 257 TB total
  • Replicated for durability, stick around after you shut down an instance
  • Can resize while they’re in use, you can resize up but never down
    • Use gcloud or console to resize
    • Log in (or escalate to root), use lsblk and fdisk to grow the partition, then run partprobe /dev/sda and resize2fs /dev/sda1 (or simply reboot the VM)
  • Can take snapshots and make machine images
  • Can mount to multiple instances if all are read-only
    • Possible to have 2 VMs in multi-writer mode with N2 instances, with many caveats such as requiring a specialized filesystem, see documentation
    • In general, use Filestore if you need writing across multiple VMs
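The resize steps above can be sketched end to end; this assumes an ext4 filesystem on /dev/sda1 and uses growpart (from cloud-utils) in place of the manual fdisk/partprobe dance. Disk name, zone, and size are illustrative:

```shell
# 1. Grow the disk itself (can be done while the VM is running)
gcloud compute disks resize my-data-disk --zone=us-central1-a --size=200GB

# 2. Inside the VM: grow the partition and filesystem online
sudo lsblk                 # confirm the new disk size is visible
sudo growpart /dev/sda 1   # extend partition 1 to fill the disk
sudo resize2fs /dev/sda1   # grow the ext4 filesystem in place
df -h                      # verify the new capacity
```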

Network and Disk I/O

  • 2 Gbps of network I/O per vCPU
    • Up to 16 Gbps of total bandwidth, so an 8 vCPU VM is required to reach the cap
  • Need more than 50K IOPS? Go to iSCSI or NVMe Local SSD with caveats
    • Only certain GCE images have optimized drivers for this, also look at partners like Nasuni or Dell PowerScale

Virtual Machine Best Practices

  • You can dynamically increase the size of a persistent SSD to get better throughput and IOPS performance, even while a VM is running. This can improve database performance if you are running your own database on a VM and can’t restart the machine.
  • Distribute instances across zones or regions for high availability, use global HTTPS load balancer with health checks
  • VMs in Managed Instance Groups (MIGs) have auto scaling capabilities, and also auto healing capabilities which will recreate a VM if it is deemed “unhealthy”
  • Use zone-specific internal DNS names, format is NAME.ZONE.c.PROJECT_ID.internal (this is separate from private Cloud DNS zones)
  • Create persistent disk snapshots as a backup mechanism, or replicate your data to a persistent disk in another region or zone with regional persistent disks. As expected, regional persistent disks are slower on writes because they have to write the data to multiple locations, but they are more durable
    • The user is responsible for a backup strategy!
  • VMs that are backends for HTTPS and SSL Proxy load balancers do not need external IP addresses to be accessed privately through the load balancer
  • Host keys can be stored as guest attributes to add a layer of security, guest attributes are part of the underlying metadata service
  • Each instance has a unique JSON web token (JWT) that includes details about the instance. Your applications can verify the signature against Google’s public OAuth2 certs to confirm the identity of the instance. This is good in case you want to confirm the identity of the instance before transmitting sensitive information including credentials
  • SQL Server Best Practices:
    • Move data files and log files to a separate SSD persistent disk
    • Use a local SSD to improve IOPS for the tempdb and Windows paging files
    • Use Windows default firewall and network settings, install antivirus software and have a backup strategy
    • Set max degree of parallelism to 8
    • Install logging agent
  • Image Management Best Practices:
    • An image is a bundle of the raw bytes used to create a prepopulated disk; to be bootable it needs a master boot record and a bootable partition. Custom images cannot simply be stored in Cloud Storage by the end user; they must live in the Custom Images service
    • There are Public Images for use at no extra cost, and also premium images like RHEL and Windows that incur hourly fees
    • If you use a startup script to deploy your applications as the instances boot, make sure the script is idempotent to avoid partially configured states or inconsistent configurations, the startup script can start a tool like Chef or Ansible
    • The process of creating a custom image is called baking. It can be manual, automated, or imported. Manual means starting with a public image, customizing it, then creating a custom image from the boot disk. Packer is a good tool for automated baking, and importing typically comes up as part of a migration.
    • Recommended to shut down instances before creating new custom image
    • Images are encrypted by default but you can also bring your own key.
    • Image Families help you manage images in your project by grouping related images together to make rolling forward and backwards easier.
    • Images are good candidates to span multiple projects, you can use a shared set of images to meet best practices for security, etc.
  • Image Families Best Practices:
    • Allows users to keep track of the image-family name, not an exact image, which can be useful for easier versioning, kind of like a Docker image.
    • Public images are grouped into image families, and a family always points to the latest image version available in your VM’s zone
    • You can create your own custom image families and deprecate old images, test your latest referenced images from the image families before using it in production
  • Use Cloud Trace to help you diagnose latency issues caused by application-serving requests
  • Leverage Cloud CDN for cacheable resources, host static content on GCS buckets to reduce web server load, deploy across regions if possible to bring apps closer to users
  • Use Cloud Load Balancers and a Managed Instance Group instead of floating IP addresses – Review differences from Floating IP Addresses
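The instance identity JWT mentioned above is fetched from the metadata server on the VM itself; a sketch (the audience URL is an example value that the verifying party checks):

```shell
# Run on the VM: fetch a signed identity token from the metadata server.
# The verifier validates the signature against Google's public certs and
# checks that the audience matches what it expects.
curl -s -H "Metadata-Flavor: Google" \
  "http://metadata.google.internal/computeMetadata/v1/instance/service-accounts/default/identity?audience=https://example.com&format=full"
```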

Disk Snapshots

  • Snapshots are a project-level resource, but can now be shared across projects. Like Images, they are stored in GCS but only visible to users through the Snapshot interface
  • They are incremental, the first snapshot is full and the subsequent ones only contain differences
  • You can take a snapshot of a VM in one zone and create a new VM off of it in a different zone, it is the basis for the gcloud compute instances move command
  • On Linux, if it’s a boot disk, halt the system first. If it’s a secondary disk, unmount it. If you can’t unmount, stop apps from writing, flush pending writes with sudo sync, and suspend writes with sudo fsfreeze -f /path/to/mountpoint
  • On Windows, use Volume Shadow Service (VSS)
  • Determine whether you need crash consistent snapshots or application consistent snapshots. Crash consistent snapshots are used when applications are running, but you’ll likely need to replay file system and application-level journals before use. Application consistent snapshots require pausing your applications, potentially between multiple persistent disks. You don’t need to stop the VM instance to do this, but it will require pausing your apps
  • You can snapshot a disk at most once every 10 minutes; best practice is to snapshot once per hour, using a snapshot schedule
  • You can also only create new zonal persistent disks from a snapshot at most once every 10 minutes
  • If you have existing snapshots of a persistent disk, the system automatically uses them as a baseline for any subsequent snapshots that you create from that same disk
  • Snapshot creation usually peaks at midnight. Do it off hours for faster speeds.
  • Organize your data on separate persistent disks so you’re not snapshotting excessive data, and use discard or fstrim on Linux before you create a snapshot so you don’t bring in files you don’t need
  • If you use a snapshot frequently, you can save on networking costs by creating a custom image of the snapshot.
    • A custom image can be created from a running disk, you don’t need to snapshot it first

Unmanaged Instance Groups

  • Unmanaged instance groups are collections of instances that are not necessarily identical and do not share a common instance template
  • Best for load balancing dissimilar instances, which you can add and remove arbitrarily
  • Autoscaling, autohealing, and rolling updates are not supported
  • Each unmanaged instance group can contain a maximum of 500 instances
  • Unmanaged instance groups have to be in one zone
  • Deleting an unmanaged instance group leaves the underlying instances behind
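Unmanaged groups are built by hand from existing VMs; a sketch (group, zone, and instance names are illustrative):

```shell
# Create an unmanaged instance group and add existing, dissimilar VMs
gcloud compute instance-groups unmanaged create legacy-group \
    --zone=us-central1-a
gcloud compute instance-groups unmanaged add-instances legacy-group \
    --zone=us-central1-a \
    --instances=vm-a,vm-b
```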

Managed Instance Groups

  • You can’t change Instance templates after they’re created, they’re immutable
    • Immutable: Rather than tweaking resources, update the infrastructure code or settings and redeploy. Minimizes config drift and snowflake servers.
  • Monitoring tab aggregates data across instances in an instance group
  • Managed instance groups can be in multi-zones and you can allow and disallow specific zones
  • Start with an instance template, set autoscaling rules with max/min number of VMs, cool-down period excludes newly created instances (should be longer than a new instance’s startup time)
    • If you update the instance template, it will create dissimilar instances
    • Instances in a MIG can be deleted or abandoned for troubleshooting
  • Auto-healing uses health checks to automatically restart VMs if necessary. Health checks use HTTP, HTTPS, or TCP health probes
  • Deleting a managed instance group deletes the underlying VMs
  • Supports autoscaling and rolling updates
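The pieces above (template, health check for autohealing, autoscaling with a cool-down period) fit together roughly like this; all names, sizes, and thresholds are illustrative:

```shell
# Instance template (immutable once created)
gcloud compute instance-templates create web-template \
    --machine-type=e2-medium

# HTTP health check used for autohealing
gcloud compute health-checks create http web-hc --port=80

# Regional (multi-zone) MIG with autohealing; --initial-delay gives new
# instances time to boot before health checks count against them
gcloud compute instance-groups managed create web-mig \
    --region=us-central1 \
    --template=web-template \
    --size=2 \
    --health-check=web-hc \
    --initial-delay=300

# Autoscaling with min/max replicas and a cool-down period that should
# exceed a new instance's startup time
gcloud compute instance-groups managed set-autoscaling web-mig \
    --region=us-central1 \
    --min-num-replicas=2 \
    --max-num-replicas=10 \
    --target-cpu-utilization=0.6 \
    --cool-down-period=120
```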