Reliability in Azure Virtual Machine Scale Sets

Azure Virtual Machine Scale Sets is an Azure compute resource that you can use to create and manage a group of virtual machine (VM) instances. The number of VM instances can automatically increase or decrease in response to demand or a defined schedule. Virtual machine scale sets help make applications highly available and resilient by distributing VMs across multiple availability zones and fault domains.

When you use Azure, reliability is a shared responsibility. Microsoft provides a range of capabilities to support resiliency and recovery. You're responsible for understanding how those capabilities work within all of the services you use, and selecting the capabilities you need to meet your business objectives and uptime goals.

This article describes how to make Virtual Machine Scale Sets resilient to various potential outages and problems, including transient faults, availability zone outages, region outages, VM reconfiguration, and service maintenance. It also describes how you can use backups to recover from other types of problems, and it highlights key information about the Virtual Machine Scale Sets service-level agreement (SLA).

Important

When you consider the reliability of a scale set and its VMs, you also need to consider the reliability of your disks, network infrastructure, and applications that run on your VMs. Improving the resiliency of the VMs alone might have limited effect if the other components aren't equally resilient. Depending on your resiliency requirements, you might need to make configuration changes across multiple areas.

Production deployment recommendations

The Azure Well-Architected Framework provides recommendations for reliability, performance, security, cost, and operations. To learn how these areas influence each other and contribute to a reliable Virtual Machine Scale Sets solution, see Architecture best practices for Azure Virtual Machines and scale sets in the Well-Architected Framework.

Reliability architecture overview

A scale set groups multiple VM instances together and applies centralized configuration, autoscale rules, and rolling upgrades.

Scale sets support two distinct orchestration modes:

  • Flexible scale sets (recommended) give you more flexibility to deploy and manage individual VM instances.
  • Uniform scale sets deploy VMs that have identical configuration, and you manage them as a group.

Fault domain spreading

Fault domains are fault isolation groups within a datacenter. Each fault domain is like a server rack, which is a collection of hardware nodes that share the same power, networking, cooling, and platform maintenance schedule. Because the VM instances of each scale set are spread across multiple fault domains, a planned or unplanned outage that occurs in one fault domain likely doesn't affect the VM instances in other fault domains.

When you deploy a scale set, you can control how many fault domains the VMs are spread across. For most scenarios, use the max spreading behavior, which uses as many fault domains as possible. For more information, see Choose the right number of fault domains for Virtual Machine Scale Sets.

In regions that have availability zones, each zone has a distinct set of fault domains. When you create a zone-spanning scale set, instances are spread across fault domains in each zone that your scale set uses.
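The effect of spreading on the impact of a single fault domain outage can be estimated with simple arithmetic. The following is a minimal sketch, assuming that instances are spread as evenly as possible across the fault domains. The function name `worst_case_loss` is hypothetical and isn't part of any Azure SDK, and the actual placement that Azure performs might differ.

```python
import math


def worst_case_loss(instance_count: int, fault_domains: int) -> int:
    """Most instances that one fault domain outage can affect.

    Assumption: instances are spread as evenly as possible, so the
    fullest fault domain holds ceil(instance_count / fault_domains).
    """
    if instance_count < 0 or fault_domains < 1:
        raise ValueError("instance_count must be >= 0 and fault_domains >= 1")
    return math.ceil(instance_count / fault_domains)


# Compare how many of 10 instances a single outage could affect.
for domains in (1, 2, 5):
    lost = worst_case_loss(10, domains)
    print(f"{domains} fault domain(s): up to {lost} of 10 instances affected")
```

Under these assumptions, using more fault domains reduces the share of instances that a single outage can affect, which is why max spreading suits most scenarios.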

Load balancing

Scale sets can integrate with Azure load balancing services, including Azure Load Balancer and Azure Application Gateway. When the scale set adds or removes instances, the built-in load balancer integration automatically updates the load balancer configuration. For more information, see Networking for Virtual Machine Scale Sets.

Scale sets include many other controls and capabilities that affect how you deploy, scale, distribute, and update instances. For more information, see Virtual Machine Scale Sets overview.

Resilience to transient faults

Transient faults are short, intermittent failures in components. They occur frequently in a distributed environment like the cloud, and they're a normal part of operations. Transient faults correct themselves after a short period of time. It's important that your applications can handle transient faults, usually by retrying affected requests.

All cloud-hosted applications should follow the Azure transient fault handling guidance when they communicate with any cloud-hosted APIs, databases, and other components. For more information, see Recommendations for handling transient faults.

Applications that run on your VMs should implement appropriate fault-handling strategies to ensure that any temporary interruptions in service don't affect your workload.
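One common way to retry affected requests is exponential backoff with jitter. The following is a minimal sketch, not a definitive implementation. The `TransientError` class, the `call_with_retry` function, and the default delay values are all assumptions made for illustration. In a real application, you'd retry only the error types that your client library documents as transient.

```python
import random
import time


class TransientError(Exception):
    """Hypothetical marker for failures that are worth retrying."""


def call_with_retry(operation, max_attempts=4, base_delay=0.5,
                    max_delay=8.0, sleep=time.sleep):
    """Call operation(), retrying on TransientError with backoff.

    The delay cap doubles after each failed attempt (up to max_delay),
    and the actual wait is a random value between 0 and that cap
    ("full jitter") so that many clients don't retry in lockstep.
    """
    for attempt in range(1, max_attempts + 1):
        try:
            return operation()
        except TransientError:
            if attempt == max_attempts:
                raise  # Out of attempts: surface the error to the caller.
            cap = min(max_delay, base_delay * 2 ** (attempt - 1))
            sleep(random.uniform(0, cap))
```

The `sleep` parameter exists so that tests can pass a no-op function instead of waiting. Any error that isn't a `TransientError` propagates immediately, because retrying a permanent failure only adds delay.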

Resilience to instance problems

When a scale set initiates a VM instance creation or deletion task, the operation might fail. To automatically retry failed VM instance creation or deletion tasks, consider using the resilient create and delete feature for Virtual Machine Scale Sets (preview).

Problems might occur while instances run. For example, an instance might become unresponsive because of application crashes or resource exhaustion. Use automatic instance repairs to monitor your application's health and automatically restart, reimage, or replace a VM instance when needed.

Resilience to availability zone failures

Availability zones are physically separate groups of datacenters within an Azure region. When one zone fails, services can fail over to one of the remaining zones.

Virtual Machine Scale Sets supports availability zones in both zone-spanning and zonal configurations.

If you don't specify availability zones for your scale set, it's nonzonal or regional. In this scenario, instances might be placed in any zone within the region and might not be evenly distributed or located in the same zone. When you use a nonzonal scale set, disk colocation in the same zone is guaranteed for Ultra and Premium v2 disks. Colocation is provided on a best-effort basis for Premium v1 disks and isn't guaranteed for Standard SKU disks, including both solid-state drive (SSD) and hard disk drive (HDD) disks. If any zone in the region fails, your scale set might experience downtime.

Requirements

  • Region support: You can deploy zone-spanning and zonal scale sets into any region that supports availability zones.

    However, some VM types and sizes are only available in specific regions, or in specific zones within a region. Before you deploy, check which regions and zones support the VM types that you need.

    If a specific VM SKU isn't available in any of the zones that you select for your scale set, then your scale set might not be able to scale out to meet your capacity requirements.

  • Dedicated hosts: Azure Dedicated Host deployments don't support zone-spanning or zonal scale sets.

  • Types: Availability zone support is available for all types of scale sets, including flexible and uniform scale sets.

Considerations

  • Fault domain spreading: When your scale set uses availability zones, you must select from specific fault domain spreading approaches. We recommend that you use max spreading, which uses as many fault domains as possible, for most workloads. For more information, see Choose the right number of fault domains for Virtual Machine Scale Sets.

  • Zone balancing: Zone balancing determines whether VM instances in a scale set are evenly distributed across the zones that you select. A scale set is considered balanced if each zone has the same number of VMs, plus or minus one VM. You can set the zone balancing mode to best effort or strict. This setting controls whether the scale set can scale out unevenly, including in zone outage scenarios.

  • Placement groups: For uniform scale sets, if you configure multiple placement groups, Azure deploys multiple placement groups in each zone that your scale set uses.
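The zone balancing rule in the preceding list can be expressed as a short check. The following is a minimal sketch, assuming that "plus or minus one VM" means that the fullest and emptiest zones differ by at most one instance. The function name `is_balanced` and the zone labels are hypothetical.

```python
def is_balanced(instances_per_zone: dict[str, int]) -> bool:
    """Return True if zone counts differ by at most one instance."""
    if not instances_per_zone:
        raise ValueError("at least one zone is required")
    counts = instances_per_zone.values()
    return max(counts) - min(counts) <= 1


print(is_balanced({"1": 3, "2": 3, "3": 2}))  # True
print(is_balanced({"1": 4, "2": 2, "3": 3}))  # False
```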

Cost

There's no cost difference among zone-spanning, zonal, and nonzonal scale sets that have the same number and type of VM instances.

Configure availability zone support

This section explains how to configure availability zone support for your scale set.

  • Create a zone-spanning or zonal scale set. You can configure availability zones when you create a new scale set. For more information, see Create a virtual machine scale set that uses availability zones.

    Note

    When you select which availability zones to use, you're actually selecting the logical availability zone. If you deploy other workload components in a different Azure subscription, they might use a different logical availability zone number to access the same physical availability zone. For more information, see Physical and logical availability zones.

  • Convert existing scale sets to use availability zones. You can convert an existing nonzonal (regional) scale set to use availability zones. For more information, see Update scale sets to add availability zones.

  • Change the availability zone configuration of an existing scale set. You can add zones to an existing scale set, but you can't remove zones. For more information, see Update scale sets to add availability zones.

    Important

    When you expand a scale set to more zones, the original VM instances don't immediately migrate or change. When you scale out, new instances are created and spread evenly across the selected availability zones. If you need data from the original instances, you're responsible for migrating the data to instances in the new zones. When you scale in the scale set, any regional instances are prioritized for removal first. Then, instances are removed based on the scale set's scale-in policy. For more information, see How to manually balance your scale set.

Capacity planning and management

To prepare for availability zone failure, consider over-provisioning the number of VM instances in your scale set. Over-provisioning allows the solution to tolerate some capacity loss and continue to function without degraded performance. It also ensures that the remaining zones have sufficient capacity to handle the full production load. For more information, see Manage capacity by using over-provisioning.
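As a rough illustration of the arithmetic behind over-provisioning, the following sketch estimates how many instances to provision so that the surviving zones can still carry the full load. It assumes that instances are balanced evenly across zones and that every instance handles an equal share of load. The function name `instances_to_survive_zone_loss` and its parameters are hypothetical.

```python
import math


def instances_to_survive_zone_loss(required: int, zones: int,
                                   zones_lost: int = 1) -> int:
    """Total instances to provision across all zones.

    required: instances needed to serve full production load.
    zones: number of availability zones the scale set spans.
    zones_lost: number of simultaneous zone failures to tolerate.
    """
    surviving = zones - zones_lost
    if required < 1 or surviving < 1:
        raise ValueError("need required >= 1 and at least one surviving zone")
    # Each surviving zone must hold an equal share of the required load.
    per_zone = math.ceil(required / surviving)
    return per_zone * zones


print(instances_to_survive_zone_loss(12, 3))  # 18
print(instances_to_survive_zone_loss(10, 2))  # 20
```

Under these assumptions, a workload that needs 12 instances across three zones would provision 18, so that the 12 instances in the two surviving zones can handle the full load.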

Behavior when all zones are healthy

This section describes what to expect when scale sets are configured with availability zone support and all availability zones are operational.

  • Traffic routing between zones: You're responsible for routing traffic between VMs in the scale set, including VMs that are in different availability zones. Common approaches include Load Balancer and Application Gateway, which provide built-in integration with scale sets. For more information, see Networking for Virtual Machine Scale Sets.

  • Data replication between zones: You're responsible for any data replication that needs to happen between VMs, including across VMs in different availability zones. Databases and other similar stateful applications that run on VMs often provide capabilities to replicate data.

Behavior during a zone failure

This section describes what to expect when scale sets are configured with availability zone support and there's an outage in one of their availability zones.

  • Detection and response: You're responsible for detecting the loss of an availability zone and deciding how to respond.

    For zone-spanning scale sets, any VM instances in the affected zone might be unavailable. Instances in the healthy zones remain operational.

    For zonal scale sets deployed in the affected zone, all of the VM instances might be unavailable. You need to plan how you respond to a zone failure. For example, you might redirect traffic to another scale set in a different zone or region.

  • Notification: Microsoft doesn't automatically notify you when a zone is down. However, you can use Azure Resource Health to monitor the health of an individual resource, and you can set up Resource Health alerts to notify you of problems. You can also use Azure Service Health to understand the overall health of the service, including any zone failures, and you can set up Service Health alerts to notify you of problems.

  • Active requests: Any active requests or other work that occurs on VMs in the affected availability zone are likely to be terminated.

  • Expected data loss: Zonal VM disks might be unavailable during a zone failure.

    If you use zone-redundant storage (ZRS) disks and an outage affects your VM, you can force detach your ZRS disks from the failed VM. This approach allows you to attach the ZRS disks to another VM.

  • Expected downtime: Any VMs in the affected zone remain down until the availability zone recovers. When you use zone-spanning scale sets, VMs located in healthy zones continue to work.

  • Traffic rerouting: You're responsible for rerouting traffic to other VMs in healthy zones.

    If you configure a zone-resilient load balancer that does health checks, the load balancer typically detects failed VMs and can route traffic to other VM instances in healthy zones.

  • Instance replacement: Virtual Machine Scale Sets isn't guaranteed to automatically add new instances into healthy zones.

    If you have a zone-spanning scale set, you can scale out to add more instances. If the zone failure is restricted to specific sets of servers within the zone, the scale-out operation might add healthy instances into the same zone, or it might add instances into other zones. However, if the scale set uses strict zone balancing, the scale set blocks scale-out operations that cause an imbalance.

    Tip

    It's a good practice to configure autoscale rules based on CPU or memory usage. The autoscale rules can allow the scale set to respond to a loss of the VM instances in a zone by scaling out to add new instances in the remaining operational zones.
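The interaction between a zone outage and the zone balancing mode can be illustrated with a deliberately simplified model. The following sketch is not Azure's placement logic. It assumes that each new instance goes to the healthy zone with the fewest instances, and that strict balancing rejects any operation whose result differs by more than one instance between zones. The function name `scale_out` and the zone labels are hypothetical.

```python
def scale_out(counts: dict[str, int], healthy_zones: list[str],
              new_instances: int, strict: bool) -> dict[str, int]:
    """Return per-zone counts after a scale-out, or raise if blocked."""
    if not healthy_zones:
        raise ValueError("no healthy zone is available for new instances")
    result = dict(counts)
    for _ in range(new_instances):
        # Place each new instance in the least-populated healthy zone.
        target = min(healthy_zones, key=lambda zone: result[zone])
        result[target] += 1
    if strict and max(result.values()) - min(result.values()) > 1:
        raise RuntimeError("blocked: scale-out would leave zones unbalanced")
    return result


before = {"1": 2, "2": 2, "3": 2}  # Zone 3 is assumed to be down.
print(scale_out(before, ["1", "2"], 4, strict=False))
# {'1': 4, '2': 4, '3': 2}
```

In this model, the same request with `strict=True` raises an error, because the healthy zones would end up with two more instances than the failed zone. Best-effort balancing lets the scale set replace lost capacity at the cost of a temporary imbalance.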

Zone recovery

After the zone recovers, VMs in the zone restart. You're responsible for any zone recovery procedures and data synchronization that your workloads require.

If you add temporary instances to your scale set during a zone failure, you might need to scale in your scale set to its original capacity after the zone is restored.

Test for zone failures

You can use Azure Chaos Studio to simulate the loss of VMs in one or more availability zones as part of an experiment. Chaos Studio provides built-in faults for scale sets, including the ability to shut down VMs in specific zones. You can use these capabilities to simulate zone-level failures and test your failover processes.

Resilience to region-wide failures

Scale sets are single-region resources. If the region is unavailable, any scale sets in the region are also unavailable.

Custom multi-region solutions for resiliency

You can deploy multiple scale sets into different regions, but you need to implement replication, load balancing, and failover processes. For example, you might deploy identical scale sets in multiple regions and use Azure Front Door or Azure Traffic Manager with health probes to route traffic. You're responsible for replicating state by using application mechanisms or managed data services.
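The routing decision in such a design can be sketched as a priority list combined with a health probe. The following is a minimal illustration of priority-based failover, not the behavior of any specific Azure service. The function name `pick_region`, the region list, and the probe results are assumptions, and a real probe would call a health endpoint that your application exposes.

```python
from typing import Callable, Optional, Sequence


def pick_region(regions: Sequence[str],
                is_healthy: Callable[[str], bool]) -> Optional[str]:
    """Return the first healthy region in priority order, or None."""
    for region in regions:
        if is_healthy(region):
            return region
    return None


# Hypothetical probe results: the primary region is unavailable.
probe_results = {"westeurope": False, "northeurope": True}
target = pick_region(["westeurope", "northeurope"],
                     lambda region: probe_results.get(region, False))
print(target)  # northeurope
```

Routing traffic only solves part of the problem. The secondary scale set also needs access to replicated state before it can serve requests correctly.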

Backup and restore

Azure Backup provides native backup support for VMs. Azure Backup creates and manages backups, and provides application-consistent protection for the entire VM, including all attached disks. A VM backup solution with Azure Backup is ideal when you need coordinated backup of multiple disks or application-aware backups. However, for database workloads, consider application-specific backup solutions that provide transaction-consistent protection and faster recovery options.

With Azure Backup for VMs, you can customize the backup frequency, retention duration, and storage configuration to suit your needs. For more information, see Azure Backup for VMs.

Backup also supports disks that are attached to VMs. For more information, see Overview of Azure Disk Backup.

For most solutions, you shouldn't rely exclusively on backups. Instead, use the other capabilities described in this guide to support your resiliency requirements. However, backups protect against some risks that other approaches don't. For more information, see What are redundancy, replication, and backup?.

Resilience to VM reconfiguration

Scale sets let you control how you apply configuration changes to your VMs, like changing your VM SKU, changing the image that each VM uses, and adding or removing VM extensions. You can control the upgrade policy mode, which determines how upgrades are applied. For more information, see Upgrade policy modes for Virtual Machine Scale Sets.

Some upgrade types require reimaging or redeploying an instance. To exclude specific instances from automatic upgrades, consider using instance protection. You might exclude instances that contain state that you need to preserve or configuration that you can't replicate on other instances.

Resilience to service maintenance

Azure periodically performs updates to improve the reliability, performance, and security of the host infrastructure for VMs. Scale sets provide multiple ways to understand and control planned maintenance.

Service-level agreement

The service-level agreement (SLA) for Azure services describes the expected availability of each service and the conditions that your solution must meet to achieve that availability expectation. For more information, see SLAs for online services.

Virtual machine scale sets share the availability SLA for VMs. You can achieve a higher uptime percentage for your VMs by using a scale set that meets both of the following criteria:

  • The scale set contains two or more instances.
  • The scale set spreads those instances across two or more availability zones.
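The two criteria in the preceding list can be checked against an inventory of instances. The following is a minimal sketch, assuming that you have a list that contains the zone label of each instance. The function name `meets_zone_criteria` is hypothetical, and the check doesn't replace the terms of the SLA itself.

```python
def meets_zone_criteria(instance_zones: list[str]) -> bool:
    """Check for two or more instances across two or more zones.

    instance_zones holds one zone label per VM instance.
    """
    return len(instance_zones) >= 2 and len(set(instance_zones)) >= 2


print(meets_zone_criteria(["1", "2", "2"]))  # True
print(meets_zone_criteria(["1", "1", "1"]))  # False
```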