-
Notifications
You must be signed in to change notification settings - Fork 322
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
feat: introduce tx idle timeout in postgres connections #4598
Conversation
Important Auto Review SkippedAuto reviews are disabled on this repository. Please check the settings in the CodeRabbit UI or the You can disable this status message by setting the Thank you for using CodeRabbit. We offer it for free to the OSS community and would appreciate your support in helping us grow. If you find it useful, would you consider giving us a shout-out on your favorite social media? TipsChatThere are 3 ways to chat with CodeRabbit:
Note: Be mindful of the bot's finite context window. It's strongly recommended to break down tasks such as reading entire modules into smaller chunks. For a focused discussion, use review comments to chat about specific files and their changes, instead of using the PR comments. CodeRabbit Commands (invoked as PR comments)
Additionally, you can add CodeRabbit Configration File (
|
Codecov ReportAll modified and coverable lines are covered by tests ✅
Additional details and impacted files@@ Coverage Diff @@
## master #4598 +/- ##
==========================================
+ Coverage 74.56% 74.58% +0.02%
==========================================
Files 406 406
Lines 48075 48079 +4
==========================================
+ Hits 35845 35858 +13
+ Misses 9900 9895 -5
+ Partials 2330 2326 -4 ☔ View full report in Codecov by Sentry. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Minor comment, otherwise LGTM
@@ -19,15 +20,20 @@ func GetConnectionString(c *config.Config, componentName string) string { | |||
port := c.GetInt("DB.port", 5432) | |||
password := c.GetString("DB.password", "ubuntu") // Reading secrets from | |||
sslmode := c.GetString("DB.sslMode", "disable") | |||
idleTxTimeout := c.GetDuration("DB.IdleTxTimeout", 5, time.Minute) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should we keep this timeout limited to 120 Secs as from the graph in the desc max txn time seems to be 60-65 secs only? Keeping high value of timeout may still result in similar issues we are trying to solve(even-though for a smaller time)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This timeout affects connections waiting for a client query within an open transaction for longer than the specified amount of time.
So if the client opens a transaction, runs a query, keeps the transaction open, it does something with the results and that "something" takes longer than 120 seconds, then the connection is going to be terminated. This would mean that the client will quite likely have to restart from the beginning.
I know the max in the graph shows as 60-65 but there might be edge cases where it could go above 120 I guess. Hard to be 100% sure without checking all metrics of all time and even then, code changes so you might experience that behaviour anyway at some point. This is the reason why 5 minutes seems safer to me and still reasonable enough not to cause incidents due to connections left alive forever.
@@ -19,15 +20,20 @@ func GetConnectionString(c *config.Config, componentName string) string { | |||
port := c.GetInt("DB.port", 5432) | |||
password := c.GetString("DB.password", "ubuntu") // Reading secrets from | |||
sslmode := c.GetString("DB.sslMode", "disable") | |||
idleTxTimeout := c.GetDuration("DB.IdleTxTimeout", 5, time.Minute) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This timeout affects connections waiting for a client query within an open transaction for longer than the specified amount of time.
So if the client opens a transaction, runs a query, keeps the transaction open, it does something with the results and that "something" takes longer than 120 seconds, then the connection is going to be terminated. This would mean that the client will quite likely have to restart from the beginning.
I know the max in the graph shows as 60-65 but there might be edge cases where it could go above 120 I guess. Hard to be 100% sure without checking all metrics of all time and even then, code changes so you might experience that behaviour anyway at some point. This is the reason why 5 minutes seems safer to me and still reasonable enough not to cause incidents due to connections left alive forever.
Description
Motivation
In case, rudder-server processor or gateway crashes (panic, OOM killed), on going transactions remain open for a long time. When those transactions acquire a lock it creates issue for new connections to run their quires properly due locks. He have faced more than one CFD.
Solution
Introduce idle_in_transaction_session_timeout connection option when establishing tcp connection.
This setting is generally useful, not only for when their is a crash.
Default timeout is set to 5 minutes. A safe, high enough value not to affect any queries running production:
max(pg_stat_activity_max_tx_duration{container="postgres-exporter", team="pipelines", state="idle in transaction"})
Linear Ticket
https://linear.app/rudderstack/issue/PIPE-1017/configure-idle-in-transaction-session-timeout
Security