docs/reference-architectures/app-service-web-app/multi-region.md
+20 -53
@@ -3,7 +3,7 @@ title: Highly available multi-region web application
3
3
titleSuffix: Azure Reference Architectures
4
4
description: Recommended architecture for a highly available web application running in multiple regions in Azure.
5
5
author: MikeWasson
6
-
ms.date: 10/25/2018
6
+
ms.date: 08/01/2019
7
7
ms.topic: reference-architecture
8
8
ms.service: architecture-center
9
9
ms.subservice: reference-architecture
@@ -24,18 +24,18 @@ This architecture builds on the one shown in [Improve scalability in a web appli
24
24
25
25
- **Primary and secondary regions**. This architecture uses two regions to achieve higher availability. The application is deployed to each region. During normal operations, network traffic is routed to the primary region. If the primary region becomes unavailable, traffic is routed to the secondary region.
26
26
- **Azure DNS**. [Azure DNS][azure-dns] is a hosting service for DNS domains, providing name resolution using Microsoft Azure infrastructure. By hosting your domains in Azure, you can manage your DNS records using the same credentials, APIs, tools, and billing as your other Azure services.
27
-
- **Azure Traffic Manager**. [Traffic Manager][traffic-manager] routes incoming requests to the primary region. If the application running that region becomes unavailable, Traffic Manager fails over to the secondary region.
27
+
- **Front Door**. [Front Door](/azure/frontdoor/) routes incoming requests to the primary region. If the application running in that region becomes unavailable, Front Door fails over to the secondary region.
28
28
- **Geo-replication** of SQL Database and Cosmos DB.
29
29
30
-
A multi-region architecture can provide higher availability than deploying to a single region. If a regional outage affects the primary region, you can use [Traffic Manager][traffic-manager] to fail over to the secondary region. This architecture can also help if an individual subsystem of the application fails.
30
+
A multi-region architecture can provide higher availability than deploying to a single region. If a regional outage affects the primary region, you can use [Front Door](/azure/frontdoor/) to fail over to the secondary region. This architecture can also help if an individual subsystem of the application fails.
31
31
32
32
There are several general approaches to achieving high availability across regions:
33
33
34
34
- Active/passive with hot standby. Traffic goes to one region, while the other waits on hot standby. Hot standby means the VMs in the secondary region are allocated and running at all times.
35
35
- Active/passive with cold standby. Traffic goes to one region, while the other waits on cold standby. Cold standby means the VMs in the secondary region are not allocated until needed for failover. This approach costs less to run, but will generally take longer to come online during a failure.
36
36
- Active/active. Both regions are active, and requests are load balanced between them. If one region becomes unavailable, it is taken out of rotation.
37
37
38
-
This reference architecture focuses on active/passive with hot standby, using Traffic Manager for failover.
38
+
This reference architecture focuses on active/passive with hot standby, using Front Door for failover.
39
39
40
40
## Recommendations
41
41
@@ -47,27 +47,27 @@ Each Azure region is paired with another region within the same geography. In ge
47
47
48
48
- If there is a broad outage, recovery of at least one region out of every pair is prioritized.
49
49
- Planned Azure system updates are rolled out to paired regions sequentially to minimize possible downtime.
50
-
- In most cases, regional pairs reside within the same geography to meet data residency requirements.
50
+
- In every case except Brazil South, regional pairs reside within the same geography to meet data residency requirements.
51
51
52
52
However, make sure that both regions support all of the Azure services needed for your application. See [Services by region][services-by-region]. For more information about regional pairs, see [Business continuity and disaster recovery (BCDR): Azure Paired Regions][regional-pairs].
53
53
54
54
### Resource groups
55
55
56
-
Consider placing the primary region, secondary region, and Traffic Manager into separate [resource groups][resource groups]. This lets you manage the resources deployed to each region as a single collection.
56
+
Consider placing the primary region, secondary region, and Traffic Manager into separate [resource groups][resource groups]. This placement allows you to manage the resources deployed to each region as a single collection.
57
57
58
58
### Traffic Manager configuration
59
59
60
-
**Routing**. Traffic Manager supports several [routing algorithms][tm-routing]. For the scenario described in this article, use *priority* routing (formerly called *failover* routing). With this setting, Traffic Manager sends all requests to the primary region unless the endpoint for that region becomes unreachable. At that point, it automatically fails over to the secondary region. See [Configure Failover routing method][tm-configure-failover].
60
+
**Routing**. Front Door supports several [routing mechanisms](/azure/frontdoor/front-door-routing-methods#priority-based-traffic-routing). For the scenario described in this article, use *priority* routing. With this setting, Front Door sends all requests to the primary region unless the endpoint for that region becomes unreachable. At that point, it automatically fails over to the secondary region. To configure this, assign different priority values to the backends in your Front Door backend pool: 1 for the active region, and 2 or higher (that is, a lower priority) for the standby or passive region.
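
To make the priority behavior concrete, here is a minimal Python sketch of the selection rule described above. It is a simulation and not the Front Door API: the `Backend` class, the `pick_backend` function, and the region names are hypothetical, and the sketch assumes that a lower priority value means a more preferred backend.

```python
from dataclasses import dataclass


@dataclass
class Backend:
    """Hypothetical stand-in for one entry in a backend pool."""
    name: str
    priority: int          # assumption: lower value = more preferred
    healthy: bool = True


def pick_backend(pool):
    """Return the healthy backend with the lowest priority value, or None."""
    healthy = [b for b in pool if b.healthy]
    if not healthy:
        return None
    return min(healthy, key=lambda b: b.priority)


pool = [
    Backend("primary-region", priority=1),
    Backend("secondary-region", priority=2),
]
print(pick_backend(pool).name)   # primary-region

pool[0].healthy = False          # simulate the primary becoming unreachable
print(pick_backend(pool).name)   # secondary-region
```

In this model, failback is automatic: as soon as the primary is marked healthy again, it wins the selection because it has the lowest priority value.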
61
61
62
-
**Health probe**. Traffic Manager uses an HTTP (or HTTPS) probe to monitor the availability of each endpoint. The probe gives Traffic Manager a pass/fail test for failing over to the secondary region. It works by sending a request to a specified URL path. If it gets a non-200 response within a timeout period, the probe fails. After four failed requests, Traffic Manager marks the endpoint as degraded and fails over to the other endpoint. For details, see [Traffic Manager endpoint monitoring and failover][tm-monitoring].
62
+
**Health probe**. Front Door uses an HTTP (or HTTPS) probe to monitor the availability of each backend. The probe gives Front Door a pass/fail test for failing over to the secondary region. It works by sending a request to a specified URL path. If it gets a non-200 response within a timeout period, the probe fails. You can configure the health probe frequency, the number of samples required for evaluation, and the number of successful samples required for the backend to be considered healthy. Based on this configuration, Front Door marks an unhealthy backend as degraded and fails over to the other backend. For details, see [Health Probes](/azure/frontdoor/front-door-health-probes).
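
The interaction between sample size and successful samples required can be sketched as a sliding window over recent probe results. This is an illustrative model only, not how Front Door is implemented: the `ProbeWindow` class, its parameter names, and the default values are assumptions chosen for the example.

```python
from collections import deque


class ProbeWindow:
    """Hypothetical sliding-window evaluator for health probe results."""

    def __init__(self, sample_size=4, successful_samples_required=2):
        self.sample_size = sample_size
        self.required = successful_samples_required
        # Keeps only the most recent `sample_size` results.
        self.samples = deque(maxlen=sample_size)

    def record(self, status_code):
        """Record one probe result; only HTTP 200 counts as a success."""
        self.samples.append(status_code == 200)

    def is_healthy(self):
        """Healthy if enough of the recent samples succeeded.

        Assumption: until the window is full, treat the backend as healthy.
        """
        if len(self.samples) < self.sample_size:
            return True
        return sum(self.samples) >= self.required


window = ProbeWindow(sample_size=4, successful_samples_required=2)
for code in (200, 200, 503, 503, 503):
    window.record(code)
print(window.is_healthy())   # False: only 1 of the last 4 probes succeeded
```

A larger window smooths over intermittent failures at the cost of slower detection, which is the trade-off the configuration options expose.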
63
63
64
-
As a best practice, create a health probe endpoint that reports the overall health of the application and use this endpoint for the health probe. The endpoint should check critical dependencies such as the App Service apps, storage queue, and SQL Database. Otherwise, the probe might report a healthy endpoint when critical parts of the application are actually failing.
64
+
As a best practice, create a health probe path in your application backend that reports the overall health of the application, and use this path in the health probe configuration. The handler for this path should check critical dependencies such as the App Service apps, storage queue, and SQL Database. Otherwise, the probe might report a healthy backend when critical parts of the application are actually failing.
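
As a rough sketch of such a health path, the following Python function aggregates the results of several dependency checks into a single status code that a probe can evaluate. The function name `overall_health` and the check names are hypothetical, and the lambdas are placeholders for real connectivity tests against your own services.

```python
def overall_health(checks):
    """Run each critical dependency check and return (status_code, details).

    `checks` maps a dependency name to a zero-argument callable that
    returns True when the dependency is reachable. A check that raises
    is treated as a failure.
    """
    details = {}
    for name, check in checks.items():
        try:
            details[name] = bool(check())
        except Exception:
            details[name] = False
    # Report 200 only when every critical dependency is healthy.
    status = 200 if all(details.values()) else 503
    return status, details


# Placeholder checks; replace with real tests against your dependencies.
critical_checks = {
    "app_service": lambda: True,
    "storage_queue": lambda: True,
    "sql_database": lambda: False,   # simulate a database outage
}

status, details = overall_health(critical_checks)
print(status, details)
```

Only dependencies whose failure should trigger a regional failover belong in the dictionary of checks.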
65
65
66
-
On the other hand, don't use the health probe to check lower priority services. For example, if an email service goes down the application can switch to a second provider or just send emails later. This is not a high enough priority to cause the application to fail over. For more information, see the [Health Endpoint Monitoring pattern][health-endpoint-monitoring-pattern].
66
+
Don't use the health probe to check lower priority services. For example, if an email service goes down, the application can switch to a second provider or just send emails later. A failure like this is not a high enough priority on its own to cause the application to fail over.
67
67
68
68
### SQL Database
69
69
70
-
Use [Active Geo-Replication][sql-replication] to create a readable secondary replica in a different region. You can have up to four readable secondary replicas. Fail over to a secondary database if your primary database fails or needs to be taken offline. Active Geo-Replication can be configured for any database in any elastic database pool.
70
+
Use [Active Geo-Replication][sql-replication] to create a readable secondary replica in a different region. You can have up to four readable secondary replicas. Fail over to a secondary database if your primary database fails or needs to be taken offline. Active Geo-Replication can be configured for any database in any elastic database pool.
71
71
72
72
### Cosmos DB
73
73
@@ -79,20 +79,18 @@ Cosmos DB supports geo-replication across regions with multi-master (multiple wr
79
79
80
80
### Storage
81
81
82
-
For Azure Storage, use [read-access geo-redundant storage][ra-grs] (RA-GRS). With RA-GRS storage, the data is replicated to a secondary region. You have read-only access to the data in the secondary region through a separate endpoint. If there is a regional outage or disaster, the Azure Storage team might decide to perform a geo-failover to the secondary region. There is no customer action required for this failover.
82
+
For Azure Storage, use [read-access geo-redundant storage][ra-grs] (RA-GRS). With RA-GRS storage, the data is replicated to a secondary region. You have read-only access to the data in the secondary region through a separate endpoint. If there is a regional outage or disaster, the Azure Storage team might decide to perform a geofailover to the secondary region. There is no customer action required for this failover.
83
83
84
84
For Queue storage, create a backup queue in the secondary region. During failover, the app can use the backup queue until the primary region becomes available again. That way, the application can still process new requests.
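
The backup-queue approach can be sketched as a send operation that falls back to the secondary region's queue when the primary is unavailable. The example below uses in-memory stand-ins rather than the Azure Storage SDK: `InMemoryQueue`, `QueueUnavailable`, `enqueue_with_fallback`, and the queue names are all hypothetical.

```python
class QueueUnavailable(Exception):
    """Raised by the stand-in queue when its region is unreachable."""


class InMemoryQueue:
    """Hypothetical stand-in for a storage queue in one region."""

    def __init__(self, name, available=True):
        self.name = name
        self.available = available
        self.messages = []

    def send(self, message):
        if not self.available:
            raise QueueUnavailable(self.name)
        self.messages.append(message)


def enqueue_with_fallback(message, primary, backup):
    """Send to the primary queue; use the backup queue if the primary fails.

    Returns the name of the queue that accepted the message.
    """
    try:
        primary.send(message)
        return primary.name
    except QueueUnavailable:
        backup.send(message)
        return backup.name


primary = InMemoryQueue("orders-primary", available=False)  # simulated outage
backup = InMemoryQueue("orders-backup")
print(enqueue_with_fallback("order-1001", primary, backup))  # orders-backup
```

When the primary region recovers, any messages that accumulated in the backup queue still need to be processed, so the application's workers should be able to read from both queues.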
85
85
86
-
## Availability considerations - Traffic Manager
86
+
## Availability considerations - Front Door
87
87
88
-
Traffic Manager automatically fails over if the primary region becomes unavailable. When Traffic Manager fails over, there is a period of time when clients cannot reach the application. The duration is affected by the following factors:
88
+
Front Door automatically fails over if the primary region becomes unavailable. When Front Door fails over, there is a period of time (usually about 20-60 seconds) when clients cannot reach the application. The duration is affected by the following factors:
89
89
90
-
- The health probe must detect that the primary datacenter has become unreachable.
91
-
- Domain name service (DNS) servers must update the cached DNS records for the IP address, which depends on the DNS time-to-live (TTL). The default TTL is 300 seconds (5 minutes), but you can configure this value when you create the Traffic Manager profile.
90
+
- The frequency of health probes: The more frequently health probes are sent, the faster Front Door can detect downtime or a backend returning to a healthy state.
91
+
- The sample size configuration: The health probe needs enough samples to correctly detect that the primary datacenter has become unreachable, and that the failure is not an intermittent issue.
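
As a back-of-the-envelope illustration of how these two factors combine, the sketch below estimates detection time under a simplified model: a backend is considered unhealthy once fewer than the required number of the most recent samples succeeded, and probes arrive at a fixed interval. The function name and the example values are assumptions, and actual Front Door behavior may differ.

```python
def worst_case_detection_seconds(interval_s, sample_size, successful_samples_required):
    """Estimate the time to detect an outage under a simplified window model.

    Starting from a window of all-successful samples, the backend is marked
    unhealthy after (sample_size - successful_samples_required + 1)
    consecutive failed probes, each up to one interval apart.
    """
    if not 1 <= successful_samples_required <= sample_size:
        raise ValueError("successful_samples_required must be in 1..sample_size")
    failures_needed = sample_size - successful_samples_required + 1
    return failures_needed * interval_s


print(worst_case_detection_seconds(30, 4, 2))  # 90
print(worst_case_detection_seconds(10, 4, 2))  # 30
```

Under this model, shortening the probe interval or requiring more successful samples both reduce detection time, at the cost of more probe traffic or more sensitivity to intermittent failures.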
92
92
93
-
For details, see [About Traffic Manager Monitoring][tm-monitoring].
94
-
95
-
Traffic Manager is a possible failure point in the system. If the service fails, clients cannot access your application during the downtime. Review the [Traffic Manager service level agreement (SLA)][tm-sla] and determine whether using Traffic Manager alone meets your business requirements for high availability. If not, consider adding another traffic management solution as a fallback. If the Azure Traffic Manager service fails, change your canonical name (CNAME) records in DNS to point to the other traffic management service. This step must be performed manually, and your application will be unavailable until the DNS changes are propagated.
93
+
Front Door is a possible failure point in the system. If the service fails, clients cannot access your application during the downtime. Review the [Front Door service level agreement (SLA)](https://azure.microsoft.com/support/legal/sla/frontdoor) and determine whether using Front Door alone meets your business requirements for high availability. If not, consider adding another traffic management solution, such as Azure Traffic Manager, as a fallback. If the Front Door service fails, change your canonical name (CNAME) records in DNS to point to the other traffic management service. This step must be performed manually, and your application will be unavailable until the DNS changes are propagated.
96
94
97
95
## Availability Considerations - SQL Database
98
96
@@ -103,8 +101,8 @@ The recovery point objective (RPO) and estimated recovery time (ERT) for SQL Dat
103
101
RA-GRS storage provides durable storage, but it's important to understand what can happen during an outage:
104
102
105
103
- If a storage outage occurs, there will be a period of time when you don't have write-access to the data. You can still read from the secondary endpoint during the outage.
106
-
- If a regional outage or disaster affects the primary location and the data there cannot be recovered, the Azure Storage team may decide to perform a geo-failover to the secondary region.
107
-
- Data replication to the secondary region is performed asynchronously. Therefore, if a geo-failover is performed, some data loss is possible if the data can't be recovered from the primary region.
104
+
- If a regional outage or disaster affects the primary location and the data there cannot be recovered, the Azure Storage team may decide to perform a geofailover to the secondary region.
105
+
- Data replication to the secondary region is performed asynchronously. Therefore, if a geofailover is performed, some data loss is possible if the data can't be recovered from the primary region.
108
106
- Transient failures, such as a network outage, will not trigger a storage failover. Design your application to be resilient to transient failures. Possible mitigations:
109
107
110
108
- Read from the secondary region.
@@ -114,34 +112,9 @@ RA-GRS storage provides durable storage, but it's important to understand what c
114
112
115
113
For more information, see [What to do if an Azure Storage outage occurs][storage-outage].
116
114
117
-
## Manageability Considerations - Traffic Manager
118
-
119
-
If Traffic Manager fails over, we recommend performing a manual failback rather than implementing an automatic failback. Otherwise, you can create a situation where the application flips back and forth between regions. Verify that all application subsystems are healthy before failing back.
120
-
121
-
Note that Traffic Manager automatically fails back by default. To prevent this, manually lower the priority of the primary region after a failover event. For example, suppose the primary region is priority 1 and the secondary is priority 2. After a failover, set the primary region to priority 3, to prevent automatic failback. When you are ready to switch back, update the priority to 1.
If the primary database fails, perform a manual failover to the secondary database. See [Restore an Azure SQL Database or failover to a secondary][sql-failover]. The secondary database remains read-only until you fail over.
117
+
If the primary database fails, perform a manual failover to the secondary database. See [Restore an Azure SQL Database or fail over to a secondary][sql-failover]. The secondary database remains read-only until you fail over.
145
118
146
119
<!-- links -->
147
120
@@ -158,10 +131,4 @@ If the primary database fails, perform a manual failover to the secondary databa