Commit db76f97

Merge pull request #413 from sharad4u/web-app-ref-arch
Update reference architecture docs for scalable and multi-region web apps
2 parents f47542b + 861d8fb commit db76f97

4 files changed

+41
-59
lines changed


docs/reference-architectures/app-service-web-app/multi-region.md

+20-53
@@ -3,7 +3,7 @@ title: Highly available multi-region web application
33
titleSuffix: Azure Reference Architectures
44
description: Recommended architecture for a highly available web application running in multiple regions in Azure.
55
author: MikeWasson
6-
ms.date: 10/25/2018
6+
ms.date: 08/01/2019
77
ms.topic: reference-architecture
88
ms.service: architecture-center
99
ms.subservice: reference-architecture
@@ -24,18 +24,18 @@ This architecture builds on the one shown in [Improve scalability in a web appli
2424

2525
- **Primary and secondary regions**. This architecture uses two regions to achieve higher availability. The application is deployed to each region. During normal operations, network traffic is routed to the primary region. If the primary region becomes unavailable, traffic is routed to the secondary region.
2626
- **Azure DNS**. [Azure DNS][azure-dns] is a hosting service for DNS domains, providing name resolution using Microsoft Azure infrastructure. By hosting your domains in Azure, you can manage your DNS records using the same credentials, APIs, tools, and billing as your other Azure services.
27-
- **Azure Traffic Manager**. [Traffic Manager][traffic-manager] routes incoming requests to the primary region. If the application running that region becomes unavailable, Traffic Manager fails over to the secondary region.
27+
- **Front Door**. [Front Door](/azure/frontdoor/) routes incoming requests to the primary region. If the application running in that region becomes unavailable, Front Door fails over to the secondary region.
2828
- **Geo-replication** of SQL Database and Cosmos DB.
2929

30-
A multi-region architecture can provide higher availability than deploying to a single region. If a regional outage affects the primary region, you can use [Traffic Manager][traffic-manager] to fail over to the secondary region. This architecture can also help if an individual subsystem of the application fails.
30+
A multi-region architecture can provide higher availability than deploying to a single region. If a regional outage affects the primary region, you can use [Front Door](/azure/frontdoor/) to fail over to the secondary region. This architecture can also help if an individual subsystem of the application fails.
3131

3232
There are several general approaches to achieving high availability across regions:
3333

3434
- Active/passive with hot standby. Traffic goes to one region, while the other waits on hot standby. Hot standby means the VMs in the secondary region are allocated and running at all times.
3535
- Active/passive with cold standby. Traffic goes to one region, while the other waits on cold standby. Cold standby means the VMs in the secondary region are not allocated until needed for failover. This approach costs less to run, but will generally take longer to come online during a failure.
3636
- Active/active. Both regions are active, and requests are load balanced between them. If one region becomes unavailable, it is taken out of rotation.
3737

38-
This reference architecture focuses on active/passive with hot standby, using Traffic Manager for failover.
38+
This reference architecture focuses on active/passive with hot standby, using Front Door for failover.
3939

4040
## Recommendations
4141

@@ -47,27 +47,27 @@ Each Azure region is paired with another region within the same geography. In ge
4747

4848
- If there is a broad outage, recovery of at least one region out of every pair is prioritized.
4949
- Planned Azure system updates are rolled out to paired regions sequentially to minimize possible downtime.
50-
- In most cases, regional pairs reside within the same geography to meet data residency requirements.
50+
- In every case except Brazil South, regional pairs reside within the same geography to meet data residency requirements.
5151

5252
However, make sure that both regions support all of the Azure services needed for your application. See [Services by region][services-by-region]. For more information about regional pairs, see [Business continuity and disaster recovery (BCDR): Azure Paired Regions][regional-pairs].
5353

5454
### Resource groups
5555

56-
Consider placing the primary region, secondary region, and Traffic Manager into separate [resource groups][resource groups]. This lets you manage the resources deployed to each region as a single collection.
56+
Consider placing the primary region, secondary region, and Traffic Manager into separate [resource groups][resource groups]. This placement allows you to manage the resources deployed to each region as a single collection.
5757

5858
### Traffic Manager configuration
5959

60-
**Routing**. Traffic Manager supports several [routing algorithms][tm-routing]. For the scenario described in this article, use *priority* routing (formerly called *failover* routing). With this setting, Traffic Manager sends all requests to the primary region unless the endpoint for that region becomes unreachable. At that point, it automatically fails over to the secondary region. See [Configure Failover routing method][tm-configure-failover].
60+
**Routing**. Front Door supports several [routing mechanisms](/azure/frontdoor/front-door-routing-methods#priority-based-traffic-routing). For the scenario described in this article, use *priority* routing. With this setting, Front Door sends all requests to the primary region unless the endpoint for that region becomes unreachable. At that point, it automatically fails over to the secondary region. To configure this, assign different priority values to the back ends in the backend pool for your Front Door: 1 for the active region, and 2 or higher (that is, a lower priority) for the standby or passive region.
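
The priority behavior described above can be sketched in a few lines. This is only an illustration of the selection rule, not Front Door's implementation; the `Backend` class, the `pick_backend` function, and the host names are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Backend:
    """Hypothetical stand-in for one entry in a backend pool."""
    host: str
    priority: int      # lower value = preferred
    healthy: bool = True


def pick_backend(pool):
    """Return the healthy backend with the lowest priority value, or None."""
    candidates = [b for b in pool if b.healthy]
    if not candidates:
        return None
    return min(candidates, key=lambda b: b.priority)


pool = [
    Backend("primary.example.net", priority=1),
    Backend("secondary.example.net", priority=2),
]
normal = pick_backend(pool)

# Simulate the primary region failing its health probes.
degraded = [Backend("primary.example.net", priority=1, healthy=False), pool[1]]
after_failover = pick_backend(degraded)
```

While the primary is healthy, `normal` is the priority 1 backend. Once it is marked unhealthy, `after_failover` is the priority 2 backend.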
6161

62-
**Health probe**. Traffic Manager uses an HTTP (or HTTPS) probe to monitor the availability of each endpoint. The probe gives Traffic Manager a pass/fail test for failing over to the secondary region. It works by sending a request to a specified URL path. If it gets a non-200 response within a timeout period, the probe fails. After four failed requests, Traffic Manager marks the endpoint as degraded and fails over to the other endpoint. For details, see [Traffic Manager endpoint monitoring and failover][tm-monitoring].
62+
**Health probe**. Front Door uses an HTTP (or HTTPS) probe to monitor the availability of each backend. The probe gives Front Door a pass/fail test for failing over to the secondary region. It works by sending a request to a specified URL path. If it does not get a 200 response within the timeout period, the probe fails. You can configure the health probe frequency, the number of samples required for evaluation, and the number of successful samples required to consider the backend healthy. Based on the health probe configuration, Front Door marks the backend as degraded and fails over to the other backend. For details, see [Health Probes](/azure/frontdoor/front-door-health-probes).
6363

64-
As a best practice, create a health probe endpoint that reports the overall health of the application and use this endpoint for the health probe. The endpoint should check critical dependencies such as the App Service apps, storage queue, and SQL Database. Otherwise, the probe might report a healthy endpoint when critical parts of the application are actually failing.
64+
As a best practice, create a health probe path in your application backend that reports the overall health of the application, and use this path in the health probe configuration. The path should check critical dependencies such as the App Service apps, storage queue, and SQL Database. If you don't follow this pattern, the probe might report a healthy backend when critical parts of the application are actually failing.
6565

66-
On the other hand, don't use the health probe to check lower priority services. For example, if an email service goes down the application can switch to a second provider or just send emails later. This is not a high enough priority to cause the application to fail over. For more information, see the [Health Endpoint Monitoring pattern][health-endpoint-monitoring-pattern].
66+
Don't use the health probe to check lower priority services. For example, if an email service goes down, the application can switch to a second provider or just send emails later. This alone is not a high enough priority to cause the application to fail over.
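
One way to picture the health probe guidance above is a handler that runs every dependency check but lets only the critical ones decide the status code. This is a minimal sketch under stated assumptions: `health_response`, the check names, and the lambda checks are hypothetical placeholders, and a real application would wire this into its web framework.

```python
import json


def health_response(checks, critical):
    """Build (status_code, body) for a health probe path.

    checks:   dict of dependency name -> zero-argument callable returning bool
    critical: set of names whose failure should fail the probe
    """
    results = {}
    for name, check in checks.items():
        try:
            results[name] = bool(check())
        except Exception:
            # A check that raises is treated as a failed dependency.
            results[name] = False
    # Only critical dependencies decide whether the probe passes.
    ok = all(results.get(name, False) for name in critical)
    return (200 if ok else 503), json.dumps(results, sort_keys=True)


# Hypothetical checks; real ones would query the database, queue, and so on.
checks = {
    "sql_database": lambda: True,
    "storage_queue": lambda: True,
    "email_service": lambda: False,   # low priority: reported, not probe-failing
}
critical = {"sql_database", "storage_queue"}
status, body = health_response(checks, critical)
```

Here the failing email check appears in the response body but the probe still passes, so a low priority outage does not trigger a regional failover.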
6767

6868
### SQL Database
6969

70-
Use [Active Geo-Replication][sql-replication] to create a readable secondary replica in a different region. You can have up to four readable secondary replicas. Fail over to a secondary database if your primary database fails or needs to be taken offline. Active Geo-Replication can be configured for any database in any elastic database pool.
70+
Use [Active Geo-Replication][sql-replication] to create a readable secondary replica in a different region. You can have up to four readable secondary replicas. Fail over to a secondary database if your primary database fails or needs to be taken offline. Active Geo-Replication can be configured for any database in any elastic database pool.
7171

7272
### Cosmos DB
7373

@@ -79,20 +79,18 @@ Cosmos DB supports geo-replication across regions with multi-master (multiple wr
7979
8080
### Storage
8181

82-
For Azure Storage, use [read-access geo-redundant storage][ra-grs] (RA-GRS). With RA-GRS storage, the data is replicated to a secondary region. You have read-only access to the data in the secondary region through a separate endpoint. If there is a regional outage or disaster, the Azure Storage team might decide to perform a geo-failover to the secondary region. There is no customer action required for this failover.
82+
For Azure Storage, use [read-access geo-redundant storage][ra-grs] (RA-GRS). With RA-GRS storage, the data is replicated to a secondary region. You have read-only access to the data in the secondary region through a separate endpoint. If there is a regional outage or disaster, the Azure Storage team might decide to perform a geo failover to the secondary region. There is no customer action required for this failover.
8383

8484
For Queue storage, create a backup queue in the secondary region. During failover, the app can use the backup queue until the primary region becomes available again. That way, the application can still process new requests.
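
The backup queue approach might look like the following sketch. It uses in-memory lists in place of real queues; `enqueue_with_fallback` and the `send_primary` / `send_backup` callables are hypothetical and not part of any Azure SDK.

```python
def enqueue_with_fallback(message, send_primary, send_backup):
    """Try the primary queue first; use the backup queue if that fails.

    send_primary and send_backup are hypothetical callables that raise
    an exception when their queue cannot be reached.
    Returns "primary" or "backup" to show which queue accepted the message.
    """
    try:
        send_primary(message)
        return "primary"
    except Exception:
        send_backup(message)
        return "backup"


# In-memory stand-ins for the two regional queues.
primary_queue, backup_queue = [], []


def unavailable(_message):
    raise ConnectionError("primary region unreachable")


used_normal = enqueue_with_fallback("order-1", primary_queue.append, backup_queue.append)
used_failover = enqueue_with_fallback("order-2", unavailable, backup_queue.append)
```

In a design like this, something also has to process the messages that accumulated in the backup queue, either during the outage or after the primary region recovers.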
8585

86-
## Availability considerations - Traffic Manager
86+
## Availability considerations - Front Door
8787

88-
Traffic Manager automatically fails over if the primary region becomes unavailable. When Traffic Manager fails over, there is a period of time when clients cannot reach the application. The duration is affected by the following factors:
88+
Front Door automatically fails over if the primary region becomes unavailable. When Front Door fails over, there is a period of time (usually about 20-60 seconds) when clients cannot reach the application. The duration is affected by the following factors:
8989

90-
- The health probe must detect that the primary datacenter has become unreachable.
91-
- Domain name service (DNS) servers must update the cached DNS records for the IP address, which depends on the DNS time-to-live (TTL). The default TTL is 300 seconds (5 minutes), but you can configure this value when you create the Traffic Manager profile.
90+
- The frequency of health probes: The more frequently health probes are sent, the faster Front Door can detect downtime or the backend coming back healthy.
91+
- The sample size configuration: The number of samples the health probe needs to conclude that the primary data center has become unreachable and that the failure is not an intermittent issue.
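
To see how the two factors above interact, here is a rough back-of-the-envelope estimate. It relies on a simplified model that is an assumption of this sketch, not documented Front Door behavior; the function name, parameter names, and values are illustrative only.

```python
def estimated_detection_seconds(interval_seconds, sample_size, successful_samples_required):
    """Rough upper bound on how long a full outage takes to be noticed.

    Simplified model (an assumption): the backend is evaluated over the last
    `sample_size` probes and counts as healthy while at least
    `successful_samples_required` of them succeeded. After a total outage,
    it turns unhealthy once enough failed probes have entered the window.
    """
    failures_needed = sample_size - successful_samples_required + 1
    return failures_needed * interval_seconds


fast = estimated_detection_seconds(interval_seconds=10, sample_size=4, successful_samples_required=2)
slow = estimated_detection_seconds(interval_seconds=20, sample_size=4, successful_samples_required=2)
```

With these illustrative values, the estimate is 30 seconds for the 10-second interval and 60 seconds for the 20-second interval. Shorter intervals and smaller windows detect outages faster but are more likely to react to intermittent failures.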
9292

93-
For details, see [About Traffic Manager Monitoring][tm-monitoring].
94-
95-
Traffic Manager is a possible failure point in the system. If the service fails, clients cannot access your application during the downtime. Review the [Traffic Manager service level agreement (SLA)][tm-sla] and determine whether using Traffic Manager alone meets your business requirements for high availability. If not, consider adding another traffic management solution as a fallback. If the Azure Traffic Manager service fails, change your canonical name (CNAME) records in DNS to point to the other traffic management service. This step must be performed manually, and your application will be unavailable until the DNS changes are propagated.
93+
Front Door is a possible failure point in the system. If the service fails, clients cannot access your application during the downtime. Review the [Front Door service level agreement (SLA)](https://azure.microsoft.com/support/legal/sla/frontdoor) and determine whether using Front Door alone meets your business requirements for high availability. If not, consider adding another traffic management solution, such as Azure Traffic Manager, as a fallback. If the Front Door service fails, change your canonical name (CNAME) records in DNS to point to the other traffic management service. This step must be performed manually, and your application will be unavailable until the DNS changes are propagated.
9694

9795
## Availability Considerations - SQL Database
9896

@@ -103,8 +101,8 @@ The recovery point objective (RPO) and estimated recovery time (ERT) for SQL Dat
103101
RA-GRS storage provides durable storage, but it's important to understand what can happen during an outage:
104102

105103
- If a storage outage occurs, there will be a period of time when you don't have write-access to the data. You can still read from the secondary endpoint during the outage.
106-
- If a regional outage or disaster affects the primary location and the data there cannot be recovered, the Azure Storage team may decide to perform a geo-failover to the secondary region.
107-
- Data replication to the secondary region is performed asynchronously. Therefore, if a geo-failover is performed, some data loss is possible if the data can't be recovered from the primary region.
104+
- If a regional outage or disaster affects the primary location and the data there cannot be recovered, the Azure Storage team may decide to perform a geo failover to the secondary region.
105+
- Data replication to the secondary region is performed asynchronously. Therefore, if a geo failover is performed, some data loss is possible if the data can't be recovered from the primary region.
108106
- Transient failures, such as a network outage, will not trigger a storage failover. Design your application to be resilient to transient failures. Possible mitigations:
109107

110108
- Read from the secondary region.
@@ -114,34 +112,9 @@ RA-GRS storage provides durable storage, but it's important to understand what c
114112

115113
For more information, see [What to do if an Azure Storage outage occurs][storage-outage].
116114

117-
## Manageability Considerations - Traffic Manager
118-
119-
If Traffic Manager fails over, we recommend performing a manual failback rather than implementing an automatic failback. Otherwise, you can create a situation where the application flips back and forth between regions. Verify that all application subsystems are healthy before failing back.
120-
121-
Note that Traffic Manager automatically fails back by default. To prevent this, manually lower the priority of the primary region after a failover event. For example, suppose the primary region is priority 1 and the secondary is priority 2. After a failover, set the primary region to priority 3, to prevent automatic failback. When you are ready to switch back, update the priority to 1.
122-
123-
The following commands update the priority.
124-
125-
### PowerShell
126-
127-
```powershell
128-
$endpoint = Get-AzureRmTrafficManagerEndpoint -Name <endpoint> -ProfileName <profile> -ResourceGroupName <resource-group> -Type AzureEndpoints
129-
$endpoint.Priority = 3
130-
Set-AzureRmTrafficManagerEndpoint -TrafficManagerEndpoint $endpoint
131-
```
132-
133-
For more information, see [Azure Traffic Manager Cmdlets][tm-ps].
134-
135-
### Azure CLI
136-
137-
```azurecli
138-
az network traffic-manager endpoint update --resource-group <resource-group> --profile-name <profile> \
139-
--name <endpoint-name> --type azureEndpoints --priority 3
140-
```
141-
142115
## Manageability Considerations - SQL Database
143116

144-
If the primary database fails, perform a manual failover to the secondary database. See [Restore an Azure SQL Database or failover to a secondary][sql-failover]. The secondary database remains read-only until you fail over.
117+
If the primary database fails, perform a manual failover to the secondary database. See [Restore an Azure SQL Database or fail over to a secondary][sql-failover]. The secondary database remains read-only until you fail over.
145118

146119
<!-- links -->
147120

@@ -158,10 +131,4 @@ If the primary database fails, perform a manual failover to the secondary databa
158131
[sql-replication]: /azure/sql-database/sql-database-geo-replication-overview
159132
[sql-rpo]: /azure/sql-database/sql-database-business-continuity#sql-database-features-that-you-can-use-to-provide-business-continuity
160133
[storage-outage]: /azure/storage/storage-disaster-recovery-guidance
161-
[tm-configure-failover]: /azure/traffic-manager/traffic-manager-configure-failover-routing-method
162-
[tm-monitoring]: /azure/traffic-manager/traffic-manager-monitoring
163-
[tm-ps]: /powershell/module/azurerm.trafficmanager
164-
[tm-routing]: /azure/traffic-manager/traffic-manager-routing-methods
165-
[tm-sla]: https://azure.microsoft.com/support/legal/sla/traffic-manager
166-
[traffic-manager]: https://azure.microsoft.com/services/traffic-manager
167134
[visio-download]: https://archcenter.blob.core.windows.net/cdn/app-service-reference-architectures.vsdx
