diff --git a/docs/pages/deploy-a-cluster/aws-gslb-proxy-peering-ha-deployment.mdx b/docs/pages/deploy-a-cluster/aws-gslb-proxy-peering-ha-deployment.mdx
new file mode 100644
index 0000000000000..b1c9f78b916ab
--- /dev/null
+++ b/docs/pages/deploy-a-cluster/aws-gslb-proxy-peering-ha-deployment.mdx
@@ -0,0 +1,282 @@
+---
+title: "AWS Route 53 GSLB Multi-Region Proxy Peering High Availability Deployment Guide"
+description: "Deploying a Proxy-peered High Availability Teleport Cluster using Route 53 to create Global Server Load Balancing"
+---
+
+When deploying Teleport in production, you should design your deployment to
+ensure that users can continue to access infrastructure should an outage or
+incident affect the availability of your Teleport cluster.
+
+To keep latency low and performance high for end users,
+your Auth Service and Proxy Service must also scale to accommodate
+increasing numbers of users and connected resources.
+
+(!docs/pages/includes/cloud/call-to-action.mdx!)
+
+## Overview
+This deployment architecture makes all connected resources accessible through a single Teleport cluster
+across multiple regions using exclusively AWS ecosystem infrastructure.
+
+This is accomplished by using AWS Route 53 to provide Global Server Load Balancing (GSLB)
+for the Teleport Cluster, and Teleport Proxy Peering to reduce the number of connections
+created through the cluster.
+
+This deployment architecture isn’t recommended for use cases where your users or resources are
+clustered in a single region or for Managed Service Providers needing to provide separate clusters
+to customers. Additionally, this architecture is not a solution for increasing the scalability of
+a single cluster.
+
+We recommend this architecture for globally distributed resources and end users who need a single point of
+entry with minimal latency when accessing connected resources.
+
+### Key deployment components
+- High Availability Teleport Cluster
+- Auth Servers must remain in a single region
+- Proxies are deployed across multiple regions
+- [AWS Route 53 latency-based routing](https://docs.aws.amazon.com/Route53/latest/DeveloperGuide/routing-policy-latency.html)
+  to create [GSLB](https://www.cloudflare.com/learning/cdn/glossary/global-server-load-balancing-gslb/)
+- [Teleport TLS Routing](https://goteleport.com/docs/architecture/tls-routing/) to reduce the number of ports needed to use Teleport
+- [Teleport Proxy Peering](https://goteleport.com/docs/architecture/proxy-peering/) for reducing the number of resource connections
+- [AWS Network Load Balancing](https://docs.aws.amazon.com/elasticloadbalancing/latest/network/introduction.html)
+- [AWS DynamoDB](https://docs.aws.amazon.com/amazondynamodb/latest/developerguide/Introduction.html) for cluster state storage
+- [AWS S3](https://docs.aws.amazon.com/AmazonS3/latest/userguide/Welcome.html) for session recording storage
+
+## Advantages of this deployment architecture
+- Eliminates the complexity of maintaining multiple Teleport clusters across multiple regions
+- Uses the lowest latency path to connect users to resources
+- Provides a highly resilient, redundant HA architecture for Teleport that can quickly
+  scale with an organization’s needs.
+- All required Teleport components can be provisioned within the AWS ecosystem.
+- Using load balancers for the Proxy and Auth services allows for increased availability
+  during Teleport Cluster upgrades. Instances can easily be removed and added while
+  limiting impact to active users.
+
+## Disadvantages of this deployment architecture
+- When Teleport Auth servers are limited to a single region, there is a higher likelihood
+  of decreased availability during an AWS regional outage.
+- Technically complex to deploy
+- Long-term cost may be a prohibitive factor for some organizations and can increase total
+  cost of ownership (TCO) throughout the system’s lifecycle.
+
+
+![Diagram of a high-availability Teleport
+architecture](../../img/deploy-a-cluster/aws-gslb-proxy-peering-ha-deployment.png)
+
+
+## AWS Network Load Balancer (NLB)
+For this deployment architecture we recommend using AWS NLBs if you plan
+to use Teleport TLS routing and Proxy Peering. The NLB forwards traffic
+from users and services to an available Teleport instance. The NLB must not
+terminate TLS, and must transparently forward the TCP traffic it receives.
+In other words, this must be a Layer 4 load balancer, not a Layer 7
+(e.g., HTTP) load balancer.
+
+### Configure the NLBs
+Configure the load balancer to forward traffic from the following ports on the
+load balancer to the corresponding port on an available Teleport instance. The
+configuration depends on whether you route Proxy Peering gRPC traffic over
+the public internet:
+
+
+
+
+| Port | Description |
+| - | - |
+| `443` | ALPN port for TLS Routing, HTTPS connections to authenticate `tsh` users into the cluster, and to serve a Web UI |
+| `3021`| Proxy Peering gRPC Stream |
+
+
+
+
+These ports are required:
+
+| Port | Description |
+| - | - |
+| `443` | ALPN port for TLS Routing, HTTPS connections to authenticate `tsh` users into the cluster, and to serve a Web UI |
+
+
+
+
+We recommend enabling cross-zone load balancing for the Auth and Proxy service NLB configurations to route
+traffic across multiple zones. Doing this improves resiliency against localized AWS zone outages.
+
+## Cluster state backend
+
+The Teleport Auth Service stores cluster state (such as dynamic configuration
+resources) and audit events as key/value pairs. In high-availability
+deployments, you must configure the Auth Service to manage this data in a
+key-value store that runs outside of your cluster of Teleport instances.
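+
+As a rough illustration, a DynamoDB-backed `storage` section of the Auth Service
+configuration might look like the sketch below. The region and table names are
+placeholder assumptions rather than values from this guide, and the field names
+should be checked against the configuration reference for your Teleport version.
+
+```yaml
+teleport:
+  storage:
+    # Keep cluster state in DynamoDB rather than on the Auth Service host.
+    type: dynamodb
+    # Hypothetical region and table name for cluster state.
+    region: us-east-1
+    table_name: teleport-cluster-state
+    # Hypothetical DynamoDB table for audit events.
+    audit_events_uri: ['dynamodb://teleport-audit-events']
+```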
+
+For Amazon DynamoDB, your Teleport configuration (which
+we will describe in more detail in the [Configuration](#configuration) section)
+names a table or collection where Teleport stores cluster state and audit
+events.
+
+The Teleport Auth Service manages the creation of any required DynamoDB tables itself,
+and does not require them to exist in advance.
+
+
+
+The Auth Service instances need permissions to read from and write to DynamoDB, as well as
+to create tables.
+
+
+
+## Session recording backend
+
+High-availability Teleport deployments use an object storage service for
+persisting session recordings.
+
+In your Teleport configuration (described in the [Configuration](#configuration)
+section), you must name an S3 bucket to use for managing session recordings. The Teleport Auth
+Service creates this bucket, so to prevent unexpected behavior, you should not
+create it in advance.
+
+
+
+The Auth Service instances need permissions to get S3 buckets as well as to create, get, list,
+and update objects. Since this setup lets Teleport create buckets for you, you should also assign
+Auth Service instances permissions to create buckets.
+
+
+
+## TLS credential provisioning
+
+High-availability Teleport deployments require a system to fetch TLS
+credentials from a certificate authority like Let's Encrypt, AWS Certificate
+Manager, DigiCert, or a trusted internal authority. The system must then
+provision Teleport Proxy Service instances with these credentials and renew them
+periodically.
+
+If you are running a single instance of the Teleport Auth Service and Proxy
+Service, you can configure this instance to fetch credentials for itself from
+Let's Encrypt using the [ACME ALPN-01
+challenge](https://letsencrypt.org/docs/challenge-types/#tls-alpn-01), where
+Teleport demonstrates that it controls the ALPN server at the HTTPS address of
+your Teleport Proxy Service. Teleport also fetches a separate certificate for
+each application you have registered with Teleport, e.g.,
+`grafana.teleport.example.com`.
+
+For high-availability deployments that use Let's Encrypt to supply TLS
+credentials to Teleport instances running behind a load balancer, you will need
+to use the [ACME
+DNS-01](https://letsencrypt.org/docs/challenge-types/#dns-01-challenge)
+challenge to demonstrate domain name ownership to Let's Encrypt. In this
+challenge, your TLS credential provisioning system creates a DNS TXT record with
+a value expected by Let's Encrypt.
+
+In the configuration we are demonstrating in this guide, each Teleport Proxy
+Service instance expects TLS credentials for HTTPS to be available at the file
+paths `/etc/teleport-tls/tls.key` (private key) and `/etc/teleport-tls/tls.crt`
+(certificate).
+
+## Global Server Load Balancing with Route 53
+
+[Latency-based routing](https://docs.aws.amazon.com/Route53/latest/DeveloperGuide/routing-policy-latency.html)
+in a private hosted zone must be used to ensure that traffic from Teleport
+resources is routed to the lowest-latency Proxy NLB for the region of
+the VPC the resource is connecting from.
+
+To create GSLB routing, create a CNAME record for each region in which you have VPCs containing Teleport-connected resources.
+It is recommended to add a wildcard record for every region if you plan to use Teleport Application Access.
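+
+As one way to picture a single region's records, the sketch below declares them as
+AWS CloudFormation `AWS::Route53::RecordSet` resources. CloudFormation is an
+assumption here (this guide does not prescribe a provisioning tool), and the
+parameter names, logical IDs, `SetIdentifier` values, and TTL are hypothetical.
+
+```yaml
+Parameters:
+  HostedZoneId:          # hypothetical parameter: ID of the private hosted zone
+    Type: String
+  ProxyHealthCheckId:    # hypothetical parameter: health check for the regional NLB
+    Type: String
+Resources:
+  ProxyUsWest1:
+    Type: AWS::Route53::RecordSet
+    Properties:
+      HostedZoneId: !Ref HostedZoneId
+      Name: proxy.teleport.example.com
+      Type: CNAME
+      TTL: "60"
+      # Region plus SetIdentifier is what makes this a latency record.
+      SetIdentifier: proxy-us-west-1
+      Region: us-west-1
+      HealthCheckId: !Ref ProxyHealthCheckId
+      ResourceRecords:
+        - elb.us-west-1.amazonaws.com
+  ProxyWildcardUsWest1:
+    Type: AWS::Route53::RecordSet
+    Properties:
+      HostedZoneId: !Ref HostedZoneId
+      # Quoted because a leading * is YAML alias syntax.
+      Name: '*.proxy.teleport.example.com'
+      Type: CNAME
+      TTL: "60"
+      SetIdentifier: proxy-wildcard-us-west-1
+      Region: us-west-1
+      HealthCheckId: !Ref ProxyHealthCheckId
+      ResourceRecords:
+        - elb.us-west-1.amazonaws.com
+```
+
+Each additional region would repeat this pair with its own `Region`, `SetIdentifier`, and NLB domain name.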
+
+The following CNAME record values need to be set:
+- **Value:** The domain name of the NLB to which traffic from Teleport resources located in a given region (e.g., example-region-1) should be routed
+- **Routing policy:** Latency
+- **Region:** The AWS region from which traffic should be routed to the NLB listed in **Value**
+- **Health Check ID:** It is recommended that you set this so that traffic is always routed to a healthy NLB
+
+Example Hosted Zone using AWS Route 53 Latency Routing to create GSLB:
+
+### Root GSLB record for Teleport:
+
+|Record name|Type|Value|
+|---|---|---|
+|```*teleport.example.com```|CNAME|AWS Route 53 nameservers|
+
+### Teleport Proxy DNS records for GSLB:
+|Record name|Type|Routing Policy|Region|Value|
+|---|---|---|---|---|
+|```proxy.teleport.example.com```|CNAME|Latency|us-west-1| ```elb.us-west-1.amazonaws.com``` |
+|```*.proxy.teleport.example.com```|CNAME|Latency|us-west-1| ```elb.us-west-1.amazonaws.com``` |
+|```proxy.teleport.example.com```|CNAME|Latency|eu-central-1| ```elb.eu-central-1.amazonaws.com```|
+|```*.proxy.teleport.example.com```|CNAME|Latency|eu-central-1| ```elb.eu-central-1.amazonaws.com```|
+
+
+
+If you are using Let's Encrypt to provide TLS credentials to your Teleport
+instances, the TLS credential system we mentioned earlier needs permissions to
+manage Route 53 DNS records in order to satisfy Let's Encrypt's DNS-01 challenge.
+
+
+
+### Teleport resource agent configuration for GSLB
+To facilitate latency routing, resource agents must be configured to point ```proxy_server:``` to
+the GSLB domain configured in Route 53, _not the specific proxy NLB address_.
+
+For example:
+
+```
+teleport:
+  nodename: ssh-node
+  ...
+  proxy_server: teleport.example.com:443
+  ...
+ssh_service:
+  enabled: yes
+  ...
+```
+Review the [configuration reference](https://goteleport.com/docs/reference/config/) page for
+additional settings.
+
+## Configure Proxy Peering
+
+In this deployment architecture, Proxy Peering is used to restrict the number of connections made from
+resources to proxies in the Teleport Cluster. A full explanation of Proxy Peering and its configuration details
+is available in the [Proxy Peering RFD](https://github.com/gravitational/teleport/blob/master/rfd/0069-proxy-peering.md).
+
+This guide covers the necessary Proxy Peering settings for deploying an HA Teleport Cluster routing resource
+traffic with GSLB.
+
+### Auth Service Proxy Peering configuration
+
+The Teleport Auth Service must be configured to use the `proxy_peering` tunnel strategy as shown in the example below:
+
+```
+auth_service:
+  ...
+  tunnel_strategy:
+    type: proxy_peering
+```
+Refer to the [Auth Server configuration](https://goteleport.com/docs/reference/config/#auth-service) reference page
+for additional settings.
+
+### Proxy Service Proxy Peering configuration
+
+Proxies must advertise a peer address, which can be configured using one of the two options listed below:
+
+**Option 1:** You can set `peer_public_addr` to the specific name of that proxy. This is the recommended
+method for the lowest latency and most reliable connection.
+
+```
+proxy_service:
+  ...
+  peer_public_addr: teleport.example.com:3021
+  ...
+```
+
+**Option 2:** Proxies can use `peer_public_addr` to advertise the proxy NLB. When using this method
+you could incur additional latency because peer proxies must continually dial through the NLB until
+they establish a connection to the correct peer target.
+
+When using an NLB for `peer_public_addr`, be sure to set `agent_connection_count` to a value of 2 or greater.
+
+```
+proxy_service:
+  ...
+  peer_public_addr: teleport-example-nlb-us-east-1.amazonaws.com:3021
+  agent_connection_count: 2
+  ...
+```
+Refer to the [Proxy Service configuration](https://goteleport.com/docs/reference/config/#proxy-service) reference page
+for additional settings.
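+
+Putting the pieces of this guide together, a Proxy Service instance using Option 1 might be
+configured roughly as follows. This is a sketch under assumptions: the Auth Service address and
+the per-proxy hostname are hypothetical, and `web_listen_addr`, `peer_listen_addr`, and
+`https_keypairs` are not covered above, so verify them against the Proxy Service configuration reference.
+
+```yaml
+version: v3
+teleport:
+  # Hypothetical address of the Auth Service NLB in the Auth region.
+  auth_server: auth.teleport.example.com:3025
+proxy_service:
+  enabled: yes
+  # Single port for TLS Routing, tsh logins, and the Web UI.
+  web_listen_addr: 0.0.0.0:443
+  # GSLB domain that users and agents connect to.
+  public_addr: teleport.example.com:443
+  # Port on which peer proxies dial this instance for the Proxy Peering gRPC stream.
+  peer_listen_addr: 0.0.0.0:3021
+  # Option 1: hypothetical name that resolves to this specific proxy.
+  peer_public_addr: proxy-1.us-west-1.teleport.example.com:3021
+  # TLS credentials provisioned at the paths used in this guide.
+  https_keypairs:
+    - key_file: /etc/teleport-tls/tls.key
+      cert_file: /etc/teleport-tls/tls.crt
+```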
diff --git a/docs/pages/deploy-a-cluster/aws-gslb-proxy-peering-ha-deployment.png b/docs/pages/deploy-a-cluster/aws-gslb-proxy-peering-ha-deployment.png
new file mode 100644
index 0000000000000..564cfe16298b2
Binary files /dev/null and b/docs/pages/deploy-a-cluster/aws-gslb-proxy-peering-ha-deployment.png differ