AWS SDK seems to have failed to poison S3 connections after today's outage #827
Comments
Eyeballing the output of some unaffected …
Hi @benesch, thank you for reporting this issue and providing an analysis in the description. We'll add this to our backlog. @rcoh may have some insights into this. It's conceivable that the outage use case may have exposed a code execution path that was not originally considered when PR2445 was created. In the meantime, if you happen to discover a simpler reproduction step, kindly share it with us.
Are you running with non-stock connectors of some kind? That could cause connection metadata to fail to work 🤔
Not as far as I know! This is our configuration: https://github.com/MaterializeInc/materialize/blob/bfbebe04a7d181f1bac91f01785309a52cf2de34/src/persist/src/s3.rs#L111-L180
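For readers following the "non-stock connector" question: below is a rough, generic sketch of what a stock Rust SDK setup (no custom connector) looks like, with explicit timeouts and retries layered on the default hyper connector. This is an illustration only, not the Materialize configuration linked above, and builder names can differ between SDK versions.

```rust
use std::time::Duration;

use aws_config::retry::RetryConfig;
use aws_config::timeout::TimeoutConfig;

#[tokio::main]
async fn main() {
    // Stock setup: default (hyper) connector, with explicit timeouts and retries.
    let sdk_config = aws_config::from_env()
        .timeout_config(
            TimeoutConfig::builder()
                .connect_timeout(Duration::from_secs(5))
                // Bound each attempt so a hung connection surfaces as a retryable error.
                .operation_attempt_timeout(Duration::from_secs(30))
                .build(),
        )
        .retry_config(RetryConfig::standard().with_max_attempts(5))
        .load()
        .await;

    let s3 = aws_sdk_s3::Client::new(&sdk_config);

    // Every S3 call then goes through the SDK's retry and connection-handling machinery.
    match s3.list_buckets().send().await {
        Ok(_) => println!("S3 reachable"),
        Err(err) => eprintln!("S3 call failed: {err}"),
    }
}
```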
Will do, but I'm afraid it's pretty unlikely that we will. To be honest, if the only outcome of this issue is the removal of the debugging `println`…
Coming back here as I prepare to close this ticket: it seems like in most of your pods, connection poisoning worked as intended and threw out the bad connections, but in one pod a race condition of some kind may have caused us to fail to get the poisoning to work. Since no loader was set, it means we never actually made it to even trying to send the request with Hyper; maybe the timeout hit while waiting for a retry, or … In any case, the `println` has been replaced with a `tracing` log.
Yeah, could be. In any case, thanks for removing that `println`.
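Since the fix mentioned above swaps the `println` for a `tracing` event, here is a minimal, generic illustration of why that matters for log collection. The function name and messages are invented for illustration and are not the actual smithy-rs code.

```rust
// A `tracing` event flows through whatever subscriber the application installed,
// so it is filterable and lands in the same pipeline as the rest of the logs.
// A bare `println!` writes straight to stdout and bypasses all of that.
fn report_missing_connection_metadata() {
    // Hypothetical message: invisible to most logging pipelines, impossible to
    // filter, and not attributable to a module or target.
    println!("no connection metadata loader was set");

    // Same information as a structured event: captured by the subscriber,
    // filterable via level/target directives, and carries its origin.
    tracing::debug!("no connection metadata loader was set; skipping poisoning");
}

fn main() {
    // The application decides where library events go by installing a subscriber.
    tracing_subscriber::fmt()
        .with_max_level(tracing::Level::DEBUG)
        .init();

    report_missing_connection_metadata();
}
```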
Describe the bug
At @MaterializeInc we run a number of `clusterd` processes that continually read and write from S3. (We make a streaming database that's in the business of reading data from S3, transforming it, and writing it back to S3.) During today's AWS outage, these `clusterd` processes all experienced an outage. Almost all of the `clusterd` processes recovered, except for two that continually produced error messages like the following:

AIUI, `rollup::set` is attempting to write to S3, while `s3 get meta` is attempting to read from S3. The error messages (`request has timed out`, `failed to construct request`) come straight from the AWS SDK, AFAICT.

Expected Behavior
We expected our S3 reads/writes to eventually succeed on retry.
Current Behavior
The S3 reads/writes continued failing forever.
We validated that the two affected containers did in fact have access to S3—logging in to the containers and manually issuing S3 requests was successful at the time shown in the S3 logs.
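For context on what "eventually succeed on retry" means in practice, here is a minimal, generic sketch (not the Materialize code) of an outer retry loop with capped backoff. When the SDK keeps handing out a poisoned connection, every iteration of a loop like this fails the same way, which is the behavior we observed.

```rust
use std::sync::atomic::{AtomicU32, Ordering};
use std::sync::Arc;
use std::time::Duration;

/// Retry a fallible async operation with capped exponential backoff, on the
/// assumption that failures are transient.
async fn retry_forever<T, E, F, Fut>(mut op: F) -> T
where
    F: FnMut() -> Fut,
    Fut: std::future::Future<Output = Result<T, E>>,
    E: std::fmt::Display,
{
    let mut backoff = Duration::from_millis(100);
    loop {
        match op().await {
            Ok(value) => return value,
            Err(err) => {
                eprintln!("operation failed, retrying in {backoff:?}: {err}");
                tokio::time::sleep(backoff).await;
                backoff = (backoff * 2).min(Duration::from_secs(30));
            }
        }
    }
}

#[tokio::main]
async fn main() {
    // Stand-in for an S3 read/write: fails twice, then succeeds. With a healthy
    // connection pool this is the shape we expected during the outage; with a
    // broken connection that never gets evicted, the loop would run forever.
    let attempts = Arc::new(AtomicU32::new(0));
    let op_attempts = Arc::clone(&attempts);
    let result = retry_forever(move || {
        let attempts = Arc::clone(&op_attempts);
        async move {
            let n = attempts.fetch_add(1, Ordering::SeqCst) + 1;
            if n < 3 {
                Err("request has timed out")
            } else {
                Ok("object contents")
            }
        }
    })
    .await;
    assert_eq!(result, "object contents");
}
```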
Reproduction Steps
So sorry, but we have no reproduction of this. We run chaos tests in our CI that interrupt network connections but we've never seen anything like this. I think it's unlikely we'll see this again until another wide AWS outage.
I wonder if it's something specific about interrupting the IAM connections in the way that the AWS outage did. We don't test that kind of chaos extensively in our CI.
Possible Solution
The reason I wanted to file this issue is this `println` that we're seeing in the output: https://github.com/awslabs/smithy-rs/blob/312d190535b1c77625d662d18313b90af64cb448/rust-runtime/aws-smithy-http/src/connection.rs#L85

This looks like a stray debugging `println`. It was added in smithy-lang/smithy-rs#2445. I'm just spitballing, but I'm wondering if this `println` is related to the issue. Perhaps connections in this process weren't getting poisoned properly because the connection metadata wasn't available?

I've never seen this `println` in our logs while debugging before. Unfortunately, I can't say that we've truly never seen this log on the unaffected processes, because as a `println` rather than a `tracing` log it doesn't get picked up by our logging infrastructure. It's possible that this `println` is actually a normal occurrence when there are S3 connectivity issues, and not indicative of a failure to poison broken connections.
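To make that hypothesis concrete, here is a deliberately simplified, hypothetical sketch of the shape of connection poisoning being described. All type and function names are invented for illustration and are not the smithy-rs implementation; the point is the silent no-op path: if no loader ever captured the connection metadata, nothing gets evicted and the broken connection keeps being reused.

```rust
use std::sync::{Arc, Mutex};

/// Hypothetical stand-in for the connection metadata the HTTP layer would capture.
#[derive(Debug)]
struct ConnectionHandle {
    id: u64,
}

/// Hypothetical capture slot: the HTTP connector is expected to install a
/// "loader" that can report which pooled connection served the request.
#[derive(Clone, Default)]
struct CaptureConnection {
    loader: Arc<Mutex<Option<Box<dyn Fn() -> Option<ConnectionHandle> + Send>>>>,
}

impl CaptureConnection {
    fn set_loader(&self, loader: impl Fn() -> Option<ConnectionHandle> + Send + 'static) {
        *self.loader.lock().unwrap() = Some(Box::new(loader));
    }

    fn get(&self) -> Option<ConnectionHandle> {
        match &*self.loader.lock().unwrap() {
            Some(loader) => loader(),
            None => {
                // The interesting path: if the attempt failed before the HTTP
                // layer installed a loader (e.g. a timeout while waiting on a
                // retry or on credentials), there is nothing to poison.
                tracing::debug!("no loader was set; cannot identify the connection");
                None
            }
        }
    }
}

/// On a retryable transport error, evict ("poison") the connection that served
/// the failed attempt. If no metadata was captured, this silently does nothing.
fn poison_on_transient_error(capture: &CaptureConnection, poisoned: &mut Vec<u64>) {
    if let Some(conn) = capture.get() {
        poisoned.push(conn.id);
    }
}

fn main() {
    let mut poisoned = Vec::new();

    // Normal case: the connector installed a loader, so poisoning works.
    let capture = CaptureConnection::default();
    capture.set_loader(|| Some(ConnectionHandle { id: 42 }));
    poison_on_transient_error(&capture, &mut poisoned);

    // Failure mode hypothesized here: no loader was set, so the bad connection
    // is never evicted from the pool.
    let no_metadata = CaptureConnection::default();
    poison_on_transient_error(&no_metadata, &mut poisoned);

    assert_eq!(poisoned, vec![42]);
}
```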
If nothing else, it seems like the debugging `println` ought to be removed!

Additional Information/Context
No response
Version
Environment details (OS name and version, etc.)
Linux 5.10.178-162.673.amzn2.x86_64
Logs
No response