Improve Availability Zone Resiliency for the Mission Critical Online Application #1391

Open
markti opened this issue May 13, 2024 · 0 comments

The Mission Critical Online application, published by the Well Architected Framework team, deploys a workload to an active-active multi-region architecture, with Kubernetes clusters managed by Azure Kubernetes Service in both regions, to achieve high availability and resiliency during a regional outage. The solution also provisions Azure resources with Availability Zones enabled where possible.

For example, the following Azure services are configured for zone redundancy through these settings in the Terraform codebase (sketched briefly below the list):

• Azure Container Registry: zone_redundancy_enabled = true
• Event Hub: zone_redundant = true
• Public IP Addresses: zones = [1,2,3]
• Azure Kubernetes Service: zones = [1,2,3]
• Cosmos DB: zone_redundant = true
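
For illustration, here is a minimal sketch of how a few of these attributes appear on the corresponding azurerm resources; the resource names, SKUs, and resource group references are placeholders, not values taken from the Mission Critical Online codebase:

  # Illustrative sketch only: names, SKUs, and the resource group reference are placeholders.
  resource "azurerm_container_registry" "stamp" {
    name                    = "acrstamp"
    resource_group_name     = azurerm_resource_group.stamp.name
    location                = azurerm_resource_group.stamp.location
    sku                     = "Premium"  # zone redundancy requires the Premium SKU
    zone_redundancy_enabled = true
  }

  resource "azurerm_eventhub_namespace" "stamp" {
    name                = "evhns-stamp"
    resource_group_name = azurerm_resource_group.stamp.name
    location            = azurerm_resource_group.stamp.location
    sku                 = "Standard"
    zone_redundant      = true
  }

  resource "azurerm_public_ip" "stamp" {
    name                = "pip-stamp"
    resource_group_name = azurerm_resource_group.stamp.name
    location            = azurerm_resource_group.stamp.location
    allocation_method   = "Static"
    sku                 = "Standard"     # zonal distribution requires the Standard SKU
    zones               = ["1", "2", "3"]
  }

  # For azurerm_cosmosdb_account, zone_redundant is set per region inside the geo_location block.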

Some services had zone redundancy explicitly disabled through their configuration settings:

• Azure Container Registry (Stamp environment): zone_redundancy_enabled = false
• Azure App Service: zone_balancing_enabled = false

In the Stamp environment, Azure Container Registry zone redundancy was disabled because it is not supported in all regions. However, we recommend enabling it by default and guiding users to provision in Azure regions that support Availability Zones.

Azure App Service Zone Redundancy was disabled because the underlying database used by the workload hosted on the Azure App Service was not using Availability Zones.

Azure PostgreSQL Server was provisioned without zone redundancy. The Terraform resource “azurerm_postgresql_server” was used, which provisions an “Azure Database for PostgreSQL Single Server”. To achieve zone resiliency with Azure’s PostgreSQL offerings, you need to provision an “Azure Database for PostgreSQL Flexible Server”, which is created with the “azurerm_postgresql_flexible_server” Terraform resource. Because the PostgreSQL server was provisioned as a Single Server without zone redundancy, the Azure App Service was provisioned without it as well.
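
As a sketch of the recommended change, assuming a region with three Availability Zones (the names, credentials, SKUs, and sizing below are placeholders rather than values from the codebase), the Flexible Server resource exposes zone-redundant high availability directly, which in turn removes the blocker for enabling zone balancing on the App Service plan:

  # Illustrative sketch: names, credentials, SKUs, and sizing are placeholders.
  resource "azurerm_postgresql_flexible_server" "stamp" {
    name                   = "psql-stamp"
    resource_group_name    = azurerm_resource_group.stamp.name
    location               = azurerm_resource_group.stamp.location
    version                = "14"
    administrator_login    = "psqladmin"
    administrator_password = var.psql_admin_password
    sku_name               = "GP_Standard_D2ds_v4"  # HA requires General Purpose or Memory Optimized
    storage_mb             = 32768

    zone = "1"                                      # primary zone
    high_availability {
      mode                      = "ZoneRedundant"
      standby_availability_zone = "2"               # standby placed in a different zone
    }
  }

  # With a zone-redundant database in place, zone balancing can also be enabled on the App Service plan.
  resource "azurerm_service_plan" "stamp" {
    name                   = "asp-stamp"
    resource_group_name    = azurerm_resource_group.stamp.name
    location               = azurerm_resource_group.stamp.location
    os_type                = "Windows"
    sku_name               = "P1v3"                 # zone balancing requires a Premium plan
    worker_count           = 3                      # at least one instance per zone
    zone_balancing_enabled = true
  }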

The way AKS enables zone resiliency is a bit nuanced. Like Virtual Machine Scale Sets, AKS Node Pools can be configured with a “zones” attribute containing one or more Availability Zones. However, when AKS provisions the underlying Virtual Machine Scale Set, zone balancing is not enabled by default. As a result, the VMSS that AKS provisions makes only a “best effort” to distribute the Virtual Machines that make up the Node Pool across Availability Zones. To guarantee that nodes are distributed across all Availability Zones at all times, the AKS team recommends creating a single-zone Node Pool for each Availability Zone.
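
A minimal sketch of that pattern, assuming three Availability Zones (the cluster reference, node pool names, VM size, and node counts are placeholders):

  # Illustrative sketch: one user node pool pinned to each Availability Zone, so node
  # distribution across zones is guaranteed rather than left to "best effort" balancing.
  resource "azurerm_kubernetes_cluster_node_pool" "workload" {
    for_each = toset(["1", "2", "3"])

    name                  = "npzone${each.value}"
    kubernetes_cluster_id = azurerm_kubernetes_cluster.stamp.id
    vm_size               = "Standard_D4s_v3"
    node_count            = 2
    zones                 = [each.value]  # a single zone per node pool
  }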

The Kubernetes configuration of the application components is defined in Helm charts. In each of these Helm charts, the Kubernetes Deployment does not use a Topology Spread Constraint to ensure that pods are distributed evenly across the nodes in each Availability Zone. As a result, there is no guarantee that Kubernetes will place pods in every Azure Availability Zone, even when the Virtual Machines in the AKS Node Pool’s VMSS happen to span all Availability Zones.
You should add a Topology Spread Constraint such as the following to the Pod template of each Deployment:

  topologySpreadConstraints:
    - maxSkew: 1
      topologyKey: topology.kubernetes.io/zone
      whenUnsatisfiable: DoNotSchedule
      labelSelector:
        matchLabels:
          app: <component-label>  # placeholder: match the labels on the Deployment's pods

This ensures that, as long as nodes in each Availability Zone are available to the Deployment, the Kubernetes scheduler will spread its pods evenly across all of the Azure Availability Zones available to the Node Pool.

In our efforts to test Azure Availability Zone resiliency failure modes, we utilized the .NET reliable web application's codebase. We modified the solution to host it on Azure Kubernetes Service (AKS) while maintaining the same .NET code framework but added enhanced telemetry to provide deeper insights into how and where requests failed. The testing involved both Redis and SQL Database, and we extended our evaluation to include SQL Managed Instance and Cosmos DB. This document details the findings related to the .NET code implementation and the design changes we made during this process, offering insights into improving the resilience and reliability of the application.

Design Guidance
Retry mechanisms should be put in place where appropriate. Certain types of transactions should not be retried, in particular client errors such as authentication and authorization failures.

To avoid thundering herd situations, retry policies should use exponential backoff with jitter to prevent resource exhaustion. It is also important to set a maximum number of retries, based on the expected service level agreements of the underlying infrastructure. Retry policies also need to be aligned between client and server, taking server response times into account, to avoid amplified load and reduced throughput.

Inconsistent Backoff Strategies
If the server uses an exponential backoff strategy with jitter while clients use a constant backoff interval, then during an outage many clients may end up retrying in a synchronized fashion, creating unpredictable spikes of load on the server. This can reduce throughput or even overwhelm the system in a thundering herd situation.

Misaligned Retry Policy
If the client is configured to retry 10 times but the server is configured to retry only 3 times, the server may have already given up while the client keeps retrying. This wastes system resources, reduces overall throughput, and can lead to service failures through resource exhaustion.

In conclusion, we recommend that the Well Architected Framework (WAF) team improve this codebase by ensuring that AKS and the associated Kubernetes deployments are evenly distributed across Availability Zones. Additionally, components of the solution where Availability Zone resiliency is currently disabled should be reconfigured to enable it. These enhancements will significantly increase the solution's resilience, both regionally and across Availability Zones within each region.

@rspott @dave-read @Jerryp11 @gitforkash @shashwatchandra 👀
