-
Notifications
You must be signed in to change notification settings - Fork 350
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Lambda Extension sometimes fails with "Connection reset by peer" #837
Comments
Is this error produced when you send the messages to KDS? This sounds like a problem with the SDK, KDS itself, unrelated to the extension.
The log processor runs as a sidecar and it should always be running. I don't think logs in the extension are sent to the extension itself, if that's what you're expecting. |
I understand the part about extensions not receiving extension logs. It is clear to me that that is a separate available subscription. Thanks for clarifying that the extension is a sidecar and should always be running. That is also what I understood, but it helps narrow down the issue.
Yes. The issue only occurs when sending to KDS. I'll spend some time looking into some more examples of pushing to KDS via Rust SDK, however, the example repository demonstrates how I am calling out to KDS HERE and it's not much different from the example I haven't been able to gather further information on the underlying cause for the error, despite adding tracing logs and other things. I'll investigate some more KDS examples to see if my sample implementation differs in any meaningful way. |
@calavera Hey there, finally have some time to revisit this. Just wanted to ask to see if there were any new thoughts around this? Have you tried the minimal repro and seen the same behavior? Im considering whether this is a side-effect of lambda runtime shutting down and the extension not handling closing the connections gracefully. If that's the case, does it mean that unfulfilled / unsent requests out to external services need to get cached for future runs somehow? Having trouble diagnosing this issue. |
I have not been able to reproduce this.
This is a strong possibility. What you need to do in this case, is to subscribe your extension to |
@calavera I'll be looking closer at this part at some point, for the meantime, I was able to find some more info, perhaps worth considering in regards to the firehose example? If I decrease the timeout_ms of the When the timeout is set low enough, the log buffer spins up before any messages arrive (from what I can tell) and an empty record is attempted to be sent. This is because the match on the events doesn't cause different behavior on the creation of the future based on the type of log delivered by the buffer, it still creates a It seems like there's a little bit of a race condition here between the lambda running, the extension starting / ending, the extension log buffer delivering the logs to the extension. With the default timeout_ms (1000ms) for extensions, if a lambda runs quick enough, the log buffer won't deliver logs for a long time (in the lambda world), and you won't see the extension ingesting those logs until separate runs of the lambda. (The above potentially being related to what you said about this here:)
This seems to cause either 1) some delays to the logs getting forwarded or 2) the logs not getting forwarded until subsequent lambda invocations. Situation (2) seems to happen semi-often, so there's some data-loss to consider. What if the last lambda run had critical info we couldn't forward because the log buffer didn't spin up the logs processor? On the other hand, with the timeout low enough, the log buffer may spin up the extension with an empty vec, causing incomplete messages to get sent, with the below error getting thrown:
I'll try to update with more info as soon as I have some, but hopefully this all makes sense. Please let me know if I missed anything! |
Just adding a bit more datapoints Cold start -> cold start:
The lambda outputs some logs, timeout_ms set to 25ms Cold start -> warm start: warm start: 1 Cold start -> 20 warm starts: warm start 1 - 19: warm start 20: It seems like there's cold start -> cold start issues if you use a low timeout, and there's separate issues for cold start -> warm start with a large delay in the extension getting kicked off to forward logs. It seems like the default timeout for the Extension log buffer at least doesn't have as many issues with consistency in forwarding the logs, but there's an issue of how long it takes the warm start logs to get sent. It's interesting to note that the delay I'm seeing in warm starts only exists for the last invocation of the lambda, meaning, the last warm start. The warm starts in between don't seem to have issues with latency. Maybe this issue is more workable going from default For the latency on the final warm start: Could this be resolved by subscribing to more event types to kick off the extension more times? |
Woow, thanks a lot for diving deep into this.
I'm not sure, to be honest. The buffering configuration is basically telling Lambda how often it needs to send the logs to the extension. I would expect that if the timeout is low, you'd get the logs consistently, at a faster pace, and that would not have any impact in external connections. An extension is just an HTTP api that receives data from Lambda, it's unclear to me too why reducing the timeout would result in connectivity issues. A cold start is basically a new Lambda sandbox starting up. I don't understand why simultaneous cold starts cause problems connecting to KDS. My understanding is that it also spawns a completely new extension. @greenwoodcm might have more context about how Lambda interacts with extensions. I think I'm as lost as you are here. 😔 |
In regards to this, there's no other event types we could kick the extension off with. Came to the conclusion that the event types are technically missing warm start beginning / ending so the extension only gets kicked off in narrow situations (which mostly works, but seems to have some edge cases), namely: Lambda cold start (extension inits), timeout_ms exceeded, lambda shutdown Note that none of the above events really coincide with a warm start invocation or warm start invocation end. I believe this is why we see a delay in the final warm start delivery. The only event kicking off the extension at that point is the shutdown of the execution environment. I can't fully track why this means that all the other warm starts before the last have no delay though |
You can register for "INVOKE" event to receive notifications when an invoke happens (both cold and warm). For end of invokes, Lambda doesn't send notifications to extensions, but you can add yourself. Check out this example. |
Hey there! just updating, no updates I can provide from my side, but I did find this response interesting and potentially relevant to the discussion: |
Another potentially related ticket: |
hey @dgarcia-collegeboard , I'm going to close this issue since it looks like a problem with the SDK. That way we don't duplicate communications. Let me know if someone finds out that the issue is related to the extension library, and I'll be happy to reopen this, but for what I can read, it sounds like the problem is the SDK's hyper connection. |
This issue is now closed. Comments on closed issues are hard for our team to see. |
@calavera understood! Thanks for your help in trying to get to the bottom of this. Any info you might be able to provide in the other issue would be appreciated! :D |
@calavera Is it true that the log processor will run as a side car? The Lambda execution lifecycle documentation seems to say that the execution environment will be frozen after the extension calls Given this freezing I'm having difficulty understanding how we can safely and reliably process logs using this library without having control of when Appreciate your thoughts. Thank you. |
Hey!
I've derived an example from this repository: HERE
But instead of pushing to firehose, pushing to KDS instead. See minimal example
I've added some extra logic to my version of the above code where I'm providing custom credentials to the KDS client that's instantiated, but mostly, my implementation is the same. Is there a common reason for the
Connection reset by peer
error? It seems like the extension doesn't spin up the logs processor unless I invoke my lambda again, but this could just be because the async processing means any logs made in the Processorscall
method aren't spit out until they're resolved. I've seen some calls to kinesis succeed, but others seem to fail unexpectedly with this error:The above error is logged during a match on the result of the future that is pinned inside of a
Box
in the example, expanded from this value HEREAny guidance would be much appreciated!
The text was updated successfully, but these errors were encountered: