pluto: add more retries for requests out to AWS #3214

Merged
etungsten merged 2 commits into bottlerocket-os:develop from retry-retry-retry
Jun 23, 2023

@etungsten commented Jun 20, 2023

Issue number:

N/A

Description of changes:

    pluto: increase retries for requests out to AWS
    
    This makes pluto more resilient when it makes API requests out to EC2
    and EKS. We choose a large number of retries to force the SDK to keep
    retrying until the overall request times out.
    
    The overall request is now less likely to fail due to transient
    connection errors.

    sundog: enforce settings generator execution timeout
    
    Enforces a 6-minute timeout on settings generator execution so that a
    stuck generator cannot hang the boot for too long before erroring out.
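
For readers unfamiliar with the pattern, the "many retries, bounded by an overall timeout" idea in the first commit can be sketched with the standard library alone. This is an illustration, not pluto's code: `retry_with_deadline`, `GiveUp`, and the doubling backoff are names and choices assumed for the sketch, and the AWS SDK applies its own retry logic internally.

```rust
use std::thread::sleep;
use std::time::{Duration, Instant};

/// Why a retried operation gave up (hypothetical type for this sketch).
#[derive(Debug, PartialEq)]
pub enum GiveUp<E> {
    /// The overall deadline would be exceeded; carries the last error seen.
    DeadlineExceeded(E),
    /// All attempts were used before the deadline; carries the last error seen.
    AttemptsExhausted(E),
}

/// Calls `op` (at least once) until it succeeds, `max_attempts` is used up,
/// or `deadline` would be exceeded. `op` receives the 1-based attempt number.
pub fn retry_with_deadline<T, E>(
    max_attempts: u32,
    deadline: Duration,
    initial_backoff: Duration,
    mut op: impl FnMut(u32) -> Result<T, E>,
) -> Result<T, GiveUp<E>> {
    let start = Instant::now();
    let mut backoff = initial_backoff;
    let mut attempt = 1;
    loop {
        let err = match op(attempt) {
            Ok(value) => return Ok(value),
            Err(err) => err,
        };
        if attempt >= max_attempts {
            return Err(GiveUp::AttemptsExhausted(err));
        }
        // Stop if waiting out the next backoff would pass the deadline.
        if start.elapsed() + backoff >= deadline {
            return Err(GiveUp::DeadlineExceeded(err));
        }
        sleep(backoff);
        // Double the wait each time (assumed policy; no jitter in this sketch).
        backoff = backoff.saturating_mul(2);
        attempt += 1;
    }
}
```

With a very large `max_attempts`, the deadline becomes the effective limit, which is the behavior the commit message describes: keep retrying transient failures until the overall time budget is spent.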

Testing done:
On a host where 25% of all outbound packets get dropped, I was able to call pluto private-dns-name. It made 5 attempts (beyond the default of 3 set by the SDK) and succeeded on the fifth.

18:54:36 [DEBUG] (4) hyper::client::connect::dns: resolving host="ec2.us-west-2.amazonaws.com"
18:54:39 [DEBUG] (1) aws_smithy_client::retry: attempt 1 failed with Error(TransientError); retrying after 93.530282ms
18:54:39 [DEBUG] (1) aws_smithy_client::retry: retry; kind=Error(TransientError)
18:54:39 [DEBUG] (1) aws_smithy_http_tower::map_request: map_request; name="resolve_endpoint"
18:54:39 [DEBUG] (1) aws_smithy_http_tower::map_request: map_request; name="resolve_endpoint"
18:54:39 [DEBUG] (1) aws_smithy_http_tower::map_request: map_request; name="generate_user_agent"
18:54:39 [DEBUG] (1) aws_smithy_http_tower::map_request: async_map_request; name="retrieve_credentials"
18:54:39 [DEBUG] (1) aws_credential_types::cache::lazy_caching: loaded credentials from cache
18:54:39 [DEBUG] (1) aws_smithy_http_tower::map_request: map_request; name="sigv4_sign_request"
18:54:39 [DEBUG] (1) aws_smithy_http_tower::map_request: map_request; name="recursion_detection"
18:54:39 [DEBUG] (1) tracing::span: dispatch;
18:54:39 [DEBUG] (5) hyper::client::connect::dns: resolving host="ec2.us-west-2.amazonaws.com"
18:54:42 [DEBUG] (1) aws_smithy_client::retry: attempt 2 failed with Error(TransientError); retrying after 381.207122ms
18:54:42 [DEBUG] (1) aws_smithy_client::retry: retry; kind=Error(TransientError)
18:54:43 [DEBUG] (1) aws_smithy_http_tower::map_request: map_request; name="resolve_endpoint"
18:54:43 [DEBUG] (1) aws_smithy_http_tower::map_request: map_request; name="resolve_endpoint"
18:54:43 [DEBUG] (1) aws_smithy_http_tower::map_request: map_request; name="generate_user_agent"
18:54:43 [DEBUG] (1) aws_smithy_http_tower::map_request: async_map_request; name="retrieve_credentials"
18:54:43 [DEBUG] (1) aws_credential_types::cache::lazy_caching: loaded credentials from cache
18:54:43 [DEBUG] (1) aws_smithy_http_tower::map_request: map_request; name="sigv4_sign_request"
18:54:43 [DEBUG] (1) aws_smithy_http_tower::map_request: map_request; name="recursion_detection"
18:54:43 [DEBUG] (1) tracing::span: dispatch;
18:54:43 [DEBUG] (6) hyper::client::connect::dns: resolving host="ec2.us-west-2.amazonaws.com"
18:54:46 [DEBUG] (1) aws_smithy_client::retry: attempt 3 failed with Error(TransientError); retrying after 1.246607675s
18:54:46 [DEBUG] (1) aws_smithy_client::retry: retry; kind=Error(TransientError)
18:54:47 [DEBUG] (1) aws_smithy_http_tower::map_request: map_request; name="resolve_endpoint"
18:54:47 [DEBUG] (1) aws_smithy_http_tower::map_request: map_request; name="resolve_endpoint"
18:54:47 [DEBUG] (1) aws_smithy_http_tower::map_request: map_request; name="generate_user_agent"
18:54:47 [DEBUG] (1) aws_smithy_http_tower::map_request: async_map_request; name="retrieve_credentials"
18:54:47 [DEBUG] (1) aws_credential_types::cache::lazy_caching: loaded credentials from cache
18:54:47 [DEBUG] (1) aws_smithy_http_tower::map_request: map_request; name="sigv4_sign_request"
18:54:47 [DEBUG] (1) aws_smithy_http_tower::map_request: map_request; name="recursion_detection"
18:54:47 [DEBUG] (1) tracing::span: dispatch;
18:54:47 [DEBUG] (5) hyper::client::connect::dns: resolving host="ec2.us-west-2.amazonaws.com"
18:54:50 [DEBUG] (1) aws_smithy_client::retry: attempt 4 failed with Error(TransientError); retrying after 1.733219105s
18:54:50 [DEBUG] (1) aws_smithy_client::retry: retry; kind=Error(TransientError)
18:54:52 [DEBUG] (1) aws_smithy_http_tower::map_request: map_request; name="resolve_endpoint"
18:54:52 [DEBUG] (1) aws_smithy_http_tower::map_request: map_request; name="resolve_endpoint"
18:54:52 [DEBUG] (1) aws_smithy_http_tower::map_request: map_request; name="generate_user_agent"
18:54:52 [DEBUG] (1) aws_smithy_http_tower::map_request: async_map_request; name="retrieve_credentials"
18:54:52 [DEBUG] (1) aws_credential_types::cache::lazy_caching: loaded credentials from cache
18:54:52 [DEBUG] (1) aws_smithy_http_tower::map_request: map_request; name="sigv4_sign_request"
18:54:52 [DEBUG] (1) aws_smithy_http_tower::map_request: map_request; name="recursion_detection"
18:54:52 [DEBUG] (1) tracing::span: dispatch;
18:54:52 [DEBUG] (4) hyper::client::connect::dns: resolving host="ec2.us-west-2.amazonaws.com"
18:54:52 [DEBUG] (1) hyper::client::connect::http: connecting to 52.94.214.26:443
18:54:52 [DEBUG] (1) hyper::client::connect::http: connected to 52.94.214.26:443
18:54:52 [DEBUG] (1) rustls::client::hs: No cached session for DnsName(DnsName(DnsName("ec2.us-west-2.amazonaws.com")))
18:54:52 [DEBUG] (1) rustls::client::hs: Not resuming any session
18:54:52 [DEBUG] (1) rustls::client::hs: ALPN protocol is Some(b"http/1.1")
18:54:52 [DEBUG] (1) rustls::client::hs: Using ciphersuite TLS_ECDHE_RSA_WITH_AES_128_GCM_SHA256
18:54:52 [DEBUG] (1) rustls::client::tls12: ECDHE curve is ECParameters { curve_type: NamedCurve, named_group: secp256r1 }
18:54:52 [DEBUG] (1) rustls::client::tls12: Server DNS name is DnsName(DnsName(DnsName("ec2.us-west-2.amazonaws.com")))
18:54:52 [DEBUG] (1) rustls::client::tls12: Session saved
18:54:52 [DEBUG] (2) hyper::proto::h1::io: flushed 1994 bytes
18:54:52 [DEBUG] (3) hyper::proto::h1::io: parsed 8 headers
18:54:52 [DEBUG] (3) hyper::proto::h1::conn: incoming body is chunked encoding
18:54:52 [DEBUG] (1) tracing::span: load_response;
18:54:52 [DEBUG] (1) tracing::span: parse_unloaded;
18:54:52 [DEBUG] (1) tracing::span: read_body;
18:54:52 [DEBUG] (3) hyper::proto::h1::decode: incoming chunked header: 0x2000 (8192 bytes)
18:54:52 [DEBUG] (3) hyper::proto::h1::decode: incoming chunked header: 0x10EB (4331 bytes)
18:54:52 [DEBUG] (2) hyper::proto::h1::conn: incoming body completed
18:54:52 [DEBUG] (2) hyper::client::pool: pooling idle connection for ("https", ec2.us-west-2.amazonaws.com)
18:54:52 [DEBUG] (1) tracing::span: parse_loaded;
18:54:52 [DEBUG] (1) aws_smithy_client: send_operation; status="ok"
"i-abcd1234.us-west-2.compute.internal"
18:54:52 [DEBUG] (3) rustls::conn: Sending warning alert CloseNotify

real	0m17.869s
user	0m0.022s
sys	0m0.000s
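
The retry delays in the log above (about 94 ms, 381 ms, 1.25 s, and 1.73 s) look like exponential backoff with jitter. The sketch below shows that calculation under stated assumptions: a 1-second base, a 20-second cap, and "full jitter" scaling are assumed here rather than taken from this PR, and `backoff_ceiling` and `jittered` are hypothetical helper names, not SDK functions.

```rust
use std::time::Duration;

/// Upper bound on the wait after failed attempt `attempt` (1-based):
/// base * 2^(attempt - 1), capped at `max`.
pub fn backoff_ceiling(attempt: u32, base: Duration, max: Duration) -> Duration {
    // Clamp the exponent so the shift cannot overflow a u32.
    let exponent = attempt.saturating_sub(1).min(31);
    let factor = 1u32 << exponent;
    base.saturating_mul(factor).min(max)
}

/// Scales the ceiling by a caller-supplied random fraction in [0, 1].
/// `fraction` must not be NaN.
pub fn jittered(ceiling: Duration, fraction: f64) -> Duration {
    ceiling.mul_f64(fraction.clamp(0.0, 1.0))
}
```

Under these assumptions the ceilings for attempts 1 through 4 are 1 s, 2 s, 4 s, and 8 s, and each logged delay falls below the ceiling for its attempt.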

Terms of contribution:

By submitting this pull request, I agree that this contribution is dual-licensed under the terms of both the Apache License, version 2.0, and the MIT license.

@stmcginnis mentioned this pull request Jun 20, 2023

@webern left a comment

Looks reasonable to me. Is this the same timeout that we were getting by default before?

const IMDS_CONNECT_TIMEOUT: Duration = Duration::from_secs(2);

@etungsten

> Is this the same timeout that we were getting by default before?

The default timeout for the IMDS client is set to 1 second: https://github.com/awslabs/smithy-rs/blob/7ed51b21290aba818fcfd7a5501fe7035cde5c24/aws/rust-runtime/aws-config/src/imds/client.rs#L47

// Max request retry attempts
const MAX_ATTEMPTS: u32 = 10;
const IMDS_CONNECT_TIMEOUT: Duration = Duration::from_secs(2);

Can we make this 3 seconds to align with the AWS SDK?

Comment on lines 23 to 34
pub(crate) async fn sdk_imds_client() -> Result<imds::Client> {
    imds::Client::builder()
        .max_attempts(MAX_ATTEMPTS)
        .connect_timeout(IMDS_CONNECT_TIMEOUT)
        .build()
        .await
        .context(SdkImdsSnafu)
}

pub(crate) fn sdk_retry_config() -> RetryConfig {
    RetryConfigBuilder::new().max_attempts(MAX_ATTEMPTS).build()
}

Do these need to be pub(crate)?

Suggested change
-pub(crate) async fn sdk_imds_client() -> Result<imds::Client> {
-    imds::Client::builder()
-        .max_attempts(MAX_ATTEMPTS)
-        .connect_timeout(IMDS_CONNECT_TIMEOUT)
-        .build()
-        .await
-        .context(SdkImdsSnafu)
-}
-pub(crate) fn sdk_retry_config() -> RetryConfig {
-    RetryConfigBuilder::new().max_attempts(MAX_ATTEMPTS).build()
-}
+async fn sdk_imds_client() -> Result<imds::Client> {
+    imds::Client::builder()
+        .max_attempts(MAX_ATTEMPTS)
+        .connect_timeout(IMDS_CONNECT_TIMEOUT)
+        .build()
+        .await
+        .context(SdkImdsSnafu)
+}
+fn sdk_retry_config() -> RetryConfig {
+    RetryConfigBuilder::new().max_attempts(MAX_ATTEMPTS).build()
+}

sources/api/pluto/src/aws.rs (outdated, resolved)
@etungsten

The pushes above and below address @bcressey's comments.

  • Change the IMDS connection timeout from 2 seconds to 3 seconds.
  • Remove the extraneous crate visibility specifier from functions that don't need it.
  • In pluto, raise the maximum number of retry attempts to a large value and let sundog terminate the process if it takes too long.
  • Enforce a settings generator execution timeout in sundog so that a settings generator cannot hang the boot indefinitely.
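
To illustrate the last bullet, here is a minimal standard-library sketch of running a child process under a time limit and killing it when the limit passes. It is not sundog's implementation: `run_with_timeout`, `Outcome`, and the 50 ms polling interval are assumptions made for the example, and the real code may well use an async runtime instead.

```rust
use std::io;
use std::process::{Command, ExitStatus};
use std::thread::sleep;
use std::time::{Duration, Instant};

/// Result of running a generator under a time limit (hypothetical type).
#[derive(Debug)]
pub enum Outcome {
    /// The process exited on its own before the limit.
    Finished(ExitStatus),
    /// The limit passed and the process was killed.
    TimedOut,
}

/// Spawns `command` and waits at most `timeout` for it to exit.
pub fn run_with_timeout(mut command: Command, timeout: Duration) -> io::Result<Outcome> {
    let mut child = command.spawn()?;
    let deadline = Instant::now() + timeout;
    loop {
        // try_wait returns Some(status) once the child has exited.
        if let Some(status) = child.try_wait()? {
            return Ok(Outcome::Finished(status));
        }
        if Instant::now() >= deadline {
            // Ignore the kill error: the child may have exited in the meantime.
            let _ = child.kill();
            // Reap the child so it does not linger as a zombie.
            child.wait()?;
            return Ok(Outcome::TimedOut);
        }
        // Polling interval chosen arbitrarily for this sketch.
        sleep(Duration::from_millis(50));
    }
}
```

A caller would pass `Duration::from_secs(6 * 60)` to match the 6-minute limit described in this PR and treat `Outcome::TimedOut` as an error.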

@stmcginnis left a comment

Updates look good to me.

@etungsten merged commit bca352b into bottlerocket-os:develop Jun 23, 2023
@etungsten deleted the retry-retry-retry branch June 23, 2023 20:26