The Azure Mission-Critical reference implementation follows a layered and modular approach. This approach achieves the following goals:
- Cleaner, more manageable deployment design
- Ability to swap a service for another that provides similar capabilities, depending on requirements
- Separation between layers, which makes it easier to implement RBAC when multiple teams are responsible for different aspects of Azure Mission-Critical application deployment and operations
The Azure Mission-Critical reference implementation is composed of three distinct layers:
- Infrastructure
- Configuration
- Application
The infrastructure layer contains all infrastructure components and underlying foundational services required for the Azure Mission-Critical reference implementation. It is deployed using Terraform.
Note: Bicep (ARM DSL) was considered during the early stages as part of a proof-of-concept. Please refer to the following (archived stub) for more details.
The configuration layer applies the initial configuration and additional services on top of the infrastructure components deployed as part of the infrastructure layer.
The application layer contains all components and dependencies related to the application workload itself.
Every stamp - which usually corresponds to a deployment to one Azure Region - is considered independent. Stamps are designed to work without relying on components in other regions (i.e. "share nothing").
The main shared component between stamps which requires synchronization at runtime is the database layer. Azure Cosmos DB was chosen for this because it supports multi-region writes, i.e. each stamp can write locally while Cosmos DB handles data replication and synchronization between the stamps.
Aside from the database, a geo-replicated Azure Container Registry (ACR) is shared between the stamps. The ACR is replicated to every region which hosts a stamp to ensure fast and resilient access to the images at runtime.
Stamps can be added and removed dynamically as needed to provide more resiliency, scale and proximity to users.
A global load balancer is used to distribute and load balance incoming traffic to the stamps (see Networking for details).
As much as possible, no state should be stored on the compute clusters; all state is externalized to the database. This allows users to start a user journey in one stamp and continue it in another.
In addition to stamp independence and stateless compute clusters, each "stamp" is considered to be a Scale Unit (SU) following the Deployment stamps pattern. All components and services within a given stamp are configured and tested to serve requests in a given range. This includes auto-scaling capabilities for each service as well as proper minimum and maximum values and regular evaluation.
An example Scale Unit design in Azure Mission-Critical starts from a set of scalability requirements, i.e. the minimum expected capacity that a single unit must be able to serve:
Scalability requirements
Metric | max |
---|---|
Users | 25k |
New games/sec. | 200 |
Get games/sec. | 5000 |
This definition is used to evaluate the capabilities of a SU on a regular basis and then needs to be translated into a Capacity Model. The Capacity Model in turn informs the configuration of a SU that is able to serve the expected demand:
Configuration
Component | min | max |
---|---|---|
AKS nodes | 3 | 12 |
Ingress controller replicas | 3 | 24 |
Game Service replicas | 3 | 24 |
Result Worker replicas | 3 | 12 |
Event Hub throughput units | 1 | 10 |
Cosmos DB RUs | 4000 | 40000 |
Note: Cosmos DB RUs are scaled in all regions simultaneously.
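As a rough illustration of how the scalability requirements feed into capacity planning, the following Python sketch estimates how many scale units a given load would need. The class and function names are hypothetical and not part of the reference implementation. The per-unit limits are taken from the requirements table, and the calculation deliberately ignores any headroom for regional failover.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class ScaleUnitCapacity:
    """Maximum load one scale unit is expected to serve (from the requirements table)."""
    users: int = 25_000
    new_games_per_sec: int = 200
    get_games_per_sec: int = 5_000


def scale_units_needed(users, new_games_per_sec, get_games_per_sec,
                       capacity=ScaleUnitCapacity()):
    """Return how many scale units keep every metric within its per-unit maximum."""
    ratios = (
        users / capacity.users,
        new_games_per_sec / capacity.new_games_per_sec,
        get_games_per_sec / capacity.get_games_per_sec,
    )
    # The most constrained metric decides the count; always keep at least one unit.
    return max(1, math.ceil(max(ratios)))
```

For example, 60,000 users with 300 new games/sec and 9,000 reads/sec is constrained by the user count (60,000 / 25,000 = 2.4), so `scale_units_needed(60_000, 300, 9_000)` yields three units.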
Each SU is deployed into an Azure region and is therefore primarily handling traffic from that given area (although it can take over traffic from other regions when needed). This geographic spread will likely result in load patterns and business hours that might vary from region to region and as such, every SU is designed to scale-in/-down when idle.
The reference implementation of Azure Mission-Critical deploys a set of Azure services. These services are not available across all Azure regions. In addition, only regions which offer Availability Zones (AZs) are considered for a stamp. AZs are gradually being rolled-out and are not yet available across all regions. Due to these constraints, the reference implementation cannot be deployed to all Azure regions.
As of February 2022, the following regions have been successfully tested with the reference implementation of Azure Mission-Critical:
Europe/Africa
- northeurope
- westeurope
- germanywestcentral
- francecentral
- uksouth
- norwayeast
- southafricanorth
Americas
- westus2
- eastus
- eastus2
- centralus
- southcentralus
- brazilsouth
- canadacentral
Asia Pacific
- australiaeast
- southeastasia
- eastasia
- japaneast
- koreacentral
Note: Depending on which regions you select, you might need to first request quota with Azure Support for some of the services (mostly for AKS VMs and Cosmos DB).
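The tested-region list can be turned into a simple pre-deployment check. The sketch below is only an illustration: the constant and function names are hypothetical, and the list reflects the February 2022 snapshot above, so it would need to be kept up to date.

```python
# Regions tested with the reference implementation as of February 2022.
TESTED_REGIONS = {
    "Europe/Africa": [
        "northeurope", "westeurope", "germanywestcentral", "francecentral",
        "uksouth", "norwayeast", "southafricanorth",
    ],
    "Americas": [
        "westus2", "eastus", "eastus2", "centralus",
        "southcentralus", "brazilsouth", "canadacentral",
    ],
    "Asia Pacific": [
        "australiaeast", "southeastasia", "eastasia", "japaneast", "koreacentral",
    ],
}


def untested_regions(selected):
    """Return the selected regions (lower-cased, sorted) that are not in the tested list."""
    tested = {region for regions in TESTED_REGIONS.values() for region in regions}
    return sorted({region.lower() for region in selected} - tested)
```

An empty result means every selected region appears in the tested list; it does not guarantee that quota is available there.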
It's worth calling out that where an Azure service is not available, an equivalent service may be deployed in its place. Availability Zones are the main limiting factor as far as the reference implementation of Azure Mission-Critical is concerned.
As regional availability of AZs and of the services used in the reference implementation ramps up, we expect this list to grow and the reference implementation to become deployable to additional Azure regions.
Note: If the target availability SLA for your application workload can be achieved without AZs and/or your workload is not bound by compliance requirements related to data sovereignty, an alternate region where all services/AZs are available can be considered.
- Front Door is used as the only entry point for user traffic. All backend systems are locked down to only allow traffic that comes through the AFD instance.
- Each stamp comes with a pre-provisioned Public IP address resource, whose DNS name is used as a backend for Front Door.
- Diagnostic settings are configured to store all log and metric data for 30 days (retention policy) in Log Analytics.
- SQL-API (Cosmos DB API) is being used.
- Multi-master write is enabled.
- The account is replicated to every region in which there is a stamp deployed.
- `zone_redundancy` is enabled for each replicated region.
- Request Unit `autoscaling` is enabled on container-level.
- Each stamp deploys an Azure Private Endpoint to the Cosmos DB.
- Network restrictions are enabled to allow only access from Private Endpoints.
- `sku` is set to Premium to allow geo-replication.
- `georeplication_locations` is automatically set to reflect all regions that a regional stamp was deployed to.
- `zone_redundancy_enabled` provides resiliency and high availability within a specific region.
- `admin_enabled` is set to false. The admin user access will not be used. Access to images stored in ACR, for example for AKS, is only possible using AzureAD role assignments.
- Diagnostic settings are configured to store all log and metric data in Log Analytics.
- Used to collect diagnostic logs of the global resources.
- `daily_quota_gb` is set to prevent overspend, especially on environments that are used for load testing.
- `retention_in_days` is set to avoid the cost of storing data longer than needed in Log Analytics. Long-term log and metric retention is supposed to happen in Azure Storage.
A stamp is a regional deployment and can also be considered a scale unit. Currently, only one stamp is deployed per Azure region, but this can be extended to allow multiple stamps per region if required.
The current networking setup uses a single Azure Virtual Network per stamp, with one subnet dedicated to Azure Kubernetes Service (AKS) and an additional subnet for the Private Endpoints of different services.
- Each stamp infrastructure includes a pre-provisioned static Public IP address resource with a DNS name ([prefix]-cluster.[region].cloudapp.azure.com). This Public IP address is used for the Kubernetes Ingress controller Load Balancer and as a backend address for Azure Front Door.
- Diagnostic settings are configured to store all log and metric data in Log Analytics.
- Key Vault is used as the sole configuration store by the application for both secrets and non-sensitive values.
- `sku_name` is set to standard.
- Diagnostic settings are configured to store all log and metric data in Log Analytics.
Azure Kubernetes Service (AKS) is used as the compute platform as it is most versatile and as Kubernetes is the de-facto compute platform standard for modern applications, both inside and outside of Azure.
Azure Mission-Critical uses Linux-only clusters as there is no requirement for any Windows-based containers and Linux is the more mature platform in terms of Kubernetes.
- `role_based_access_control` (RBAC) is enabled.
- `sku_tier` is set to Paid (Uptime SLA) to achieve the 99.95% SLA within a single region (with `availability_zones` enabled).
- `http_application_routing` is disabled as it is not recommended for production environments; a separate Ingress controller solution is used instead.
- Managed Identities (SystemAssigned) are used instead of Service Principals.
- `addon_profile` configuration
  - `azure_policy` is set to `true` to enable the use of Azure Policies in Azure Kubernetes Service. The policy configured in the reference implementation is in "audit-only" mode. It is mostly integrated to demonstrate how to set this up through Terraform.
  - `oms_agent` is configured to enable the Container Insights addon and ship AKS monitoring data to Azure Log Analytics via an in-cluster OMS Agent (DaemonSet).
- Diagnostic settings are configured to store all log and metric data in Log Analytics.
- `default_node_pool` settings
  - `availability_zones` is set to `3` to leverage all three AZs in a given region.
  - `enable_auto_scaling` is configured to let the default node pool automatically scale out if needed.
  - `os_disk_type` is set to `Ephemeral` to leverage Ephemeral OS disks for performance reasons.
  - `upgrade_settings` `max_surge` is set to `33%`, which is the recommended value for production workloads.
Individual stamps are considered ephemeral and stateless. Updates to the infrastructure and application follow a Zero-downtime Update Strategy and do not touch existing stamps. Updates to Kubernetes are therefore primarily rolled out by releasing new versions and replacing existing stamps. To update node images between two releases, `automatic_channel_upgrade` is used in combination with `maintenance_window`:
- `automatic_channel_upgrade` is set to `node-image` to automatically upgrade node pools with the most recent AKS node image.
- `maintenance_window` contains the allowed window to run `automatic_channel_upgrade` upgrades. It is currently set to `allowed` on `Sunday` between 0 and 2 am.
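To make the window concrete, the following Python sketch checks whether a given point in time falls inside it. The function and constant names are hypothetical, and the sketch assumes the window is expressed in UTC, which the text above does not state explicitly.

```python
from datetime import datetime, timezone

MAINTENANCE_WEEKDAY = 6          # datetime.weekday(): Monday is 0, Sunday is 6
MAINTENANCE_HOURS = range(0, 2)  # 00:00 up to, but not including, 02:00


def in_maintenance_window(moment):
    """Return True if a timezone-aware datetime falls in the Sunday 0-2 am window.

    Assumption: the window is interpreted in UTC.
    """
    if moment.tzinfo is None:
        raise ValueError("moment must be timezone-aware")
    utc = moment.astimezone(timezone.utc)
    return utc.weekday() == MAINTENANCE_WEEKDAY and utc.hour in MAINTENANCE_HOURS
```

A check like this could be used, for instance, to annotate node restarts observed during the window as expected rather than as incidents.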
Each region has an individual Log Analytics workspace configured to store all log and metric data. As each stamp deployment is considered ephemeral, these workspaces are deployed as part of the global resources and do not share the lifecycle of a stamp. This ensures that when a stamp is deleted (which happens regularly), logs are still available. Log Analytics workspaces reside in a separate resource group `<prefix>-monitoring-rg`.
- `sku` is set to PerGB2018.
- `daily_quota_gb` is set to `30` GB to prevent overspend, especially on environments that are used for load testing.
- `retention_in_days` is set to `30` days to avoid the cost of storing data longer than needed in Log Analytics. Long-term log and metric retention is supposed to happen in Azure Storage.
- For the Health Model, a set of Kusto Functions needs to be added to Log Analytics. There is a sub-resource type called `SavedSearch`. Because these queries can get quite bulky, they are loaded from files instead of specified inline in Terraform. They are stored in the subdirectory monitoring/queries in the `/src/infra` directory.
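The idea of keeping bulky queries in files rather than inline can be sketched as follows. This is a Python illustration of the pattern only, not the Terraform code used by the reference implementation; the function name and the `*.kql` file extension are assumptions.

```python
from pathlib import Path


def load_saved_searches(query_dir, pattern="*.kql"):
    """Map each query file's base name to its contents.

    Assumption: one Kusto function per file, named after the file.
    """
    queries = {}
    for path in sorted(Path(query_dir).glob(pattern)):
        queries[path.stem] = path.read_text(encoding="utf-8").strip()
    return queries
```

Each entry in the returned dictionary would then correspond to one saved search to register in the workspace.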
As with Log Analytics, Application Insights is also deployed per region and does not share the lifecycle of a stamp. All Application Insights resources reside in a separate resource group `<prefix>-monitoring-rg` and are deployed as part of the global resources deployment.
- Log Analytics Workspace-attached mode is being used.
- `daily_data_cap_in_gb` is set to `30` GB to prevent overspend, especially on environments that are used for load testing.
Azure Policy is used to monitor and enforce certain baselines. All policies are assigned on a per-stamp, per-resource group level. Azure Kubernetes Service is configured to use the `azure_policy` addon to leverage Policies configured outside of Kubernetes.
- Each stamp has one `standard` tier, `zone_redundant` Event Hub Namespace.
- Auto-inflate (auto-scaleup) can be optionally enabled via a Terraform variable.
- The namespace holds one Event Hub `backendqueue-eh` with dedicated consumer groups for each consumer (currently only one).
- A Private Endpoint is deployed which is used to securely access the Event Hub from within the stamp's VNet.
- Network restrictions are enabled to allow only access through Private Endpoints.
- Diagnostic settings are configured to store all log and metric data in Log Analytics.
- Two storage accounts are deployed per stamp:
  - A "public" storage account with "static web site" enabled. This is used to host the UI single-page application.
  - A "private" storage account which is used for internals such as the health service and the Event Hub checkpointing.
- Both accounts are deployed in zone-redundant mode (`ZRS`).
This repository also contains a couple of supporting services for the Azure Mission-Critical project:
Whether these supporting services are required or optional depends on how you choose to use Azure Mission-Critical.
All resources used for Azure Mission-Critical follow a pre-defined and consistent naming structure to make it easier to identify them and to avoid confusion. Resource abbreviations are based on the Cloud Adoption Framework. These abbreviations are typically attached as a suffix to each resource in Azure.
A prefix is used to uniquely identify "deployments" as some names in Azure must be globally unique. Examples of these include Storage Accounts, Container Registries and Cosmos DB accounts.
Resource groups
Resource group names begin with the prefix and then indicate whether they contain per-stamp or global resources. In case of per-stamp resource groups, the name also contains the Azure region they are deployed to.
`<prefix><suffix>-<global | stamp>-<region>-rg`
This will, for example, result in `aoprod-global-rg` for global services in prod or `aoprod7745-stamp-eastus2-rg` for a stamp deployment in `eastus2`.
Resources
`<prefix><suffix>-<region>-<resource>` for resources that support `-` in their names and `<prefix><region><resource>` for resources such as Storage Accounts, Container Registries and others that do not support `-` in their names.
This will result in, for example, `aoprod7745-eastus2-aks` for an AKS cluster in `eastus2`.
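The naming patterns above can be expressed as two small helper functions. This Python sketch is illustrative only: the function names are hypothetical, and `deployment` stands for whatever prefix (with or without suffix) identifies the deployment, passed in by the caller as a single string.

```python
def resource_group_name(deployment, scope, region=None):
    """Build a resource group name for global or per-stamp resources."""
    if scope == "global":
        return f"{deployment}-global-rg"
    if scope == "stamp":
        if not region:
            raise ValueError("per-stamp resource groups require a region")
        return f"{deployment}-stamp-{region}-rg"
    raise ValueError("scope must be 'global' or 'stamp'")


def resource_name(deployment, region, resource, allow_hyphens=True):
    """Build a resource name; drop the separators for types that do not allow '-'."""
    separator = "-" if allow_hyphens else ""
    return separator.join([deployment, region, resource])
```

With these helpers, `resource_group_name("aoprod7745", "stamp", "eastus2")` reproduces `aoprod7745-stamp-eastus2-rg` and `resource_name("aoprod7745", "eastus2", "aks")` reproduces `aoprod7745-eastus2-aks`. Passing `allow_hyphens=False` concatenates the parts without separators, as needed for Storage Accounts and Container Registries.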