Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Handle epipe errors #871

Merged
merged 1 commit into from
Jan 19, 2023
Merged

Handle epipe errors #871

merged 1 commit into from
Jan 19, 2023

Conversation

n-oden
Copy link
Contributor

@n-oden n-oden commented Jan 13, 2023

Presently, if the tcp session to the server closes remotely (e.g. if the clickhouse-server process restarts), we will emit broken pipe errors potentially until we reach ConnMaxLifetime.

Since EPIPE means that the tcp session is dead (ie we received a RST packet from the server), there is no point in attempting to proceed past that point: set the connection as closed.

@n-oden
Copy link
Contributor Author

n-oden commented Jan 13, 2023

n.b. I got a bit lost in the weeds trying to create a reasonable test case for this: the container-management setup in your test suite can't serialize testcontainer.Container, so env.Container is nil by the time you're on the far side of GetTestEnvironment(), and even if you work around that by directly instantiating a new env in the test so that you can call container.Stop(), I couldn't think of a reliable way to assert that we had caught the EPIPE and closed the connection -- but if you have any ideas I'm all ears!

@rf
Copy link

rf commented Jan 13, 2023

#844 is most likely related!

@n-oden
Copy link
Contributor Author

n-oden commented Jan 17, 2023

@jkaflik Could I trouble you for a look at this?

@jkaflik
Copy link
Contributor

jkaflik commented Jan 17, 2023

@n-oden I will have a look if we can have it reproducible in tests

@jkaflik jkaflik self-assigned this Jan 17, 2023
@jkaflik jkaflik added the bug label Jan 17, 2023
@n-oden
Copy link
Contributor Author

n-oden commented Jan 17, 2023

Reproducing it is in conception simple: create a client that continuously writes to a clickhouse db, and then restart the clickhouse server while the client is writing and observe the infinite flood of broken pipe errors in the logs.

In practice with the integration test framework you have here, I'm not sure how we'd implement that while still keeping test times reasonable and results deterministic: you'd want to stop and start the container while writes were happening and I'm not sure how you'd do that with any precision. I'll keep poking at it but am very open to suggestions!

@jkaflik
Copy link
Contributor

jkaflik commented Jan 17, 2023

@n-oden I looked briefly at the issue. Can you please let me know the version you use? Also, I assume you are using the database/sql interface as your proposal implements a driver.ErrBadConn.

@jkaflik
Copy link
Contributor

jkaflik commented Jan 17, 2023

I agree we should mark the connection as closed. I will look at how we can achieve the same result for database/sql driver, but without remapping an error on a native interface level.

@n-oden
Copy link
Contributor Author

n-oden commented Jan 17, 2023

@jkaflik we're using v2.5.0, having recently updated from v1. And correct, we're using the database/sql interface.

@jkaflik
Copy link
Contributor

jkaflik commented Jan 18, 2023

@n-oden while I agree the connection should be marked as closed and the SQL driver on top should receive driver.ErrBadConn, I struggle to reproduce the following:

we will emit broken pipe errors potentially until we reach ConnMaxLifetime.

having CH server restart just before writing to the socket (conn.go:259) causes me to get the broken pipe, but later on, I get the connection established back and everything works fine. What happens is I get a few connection refused during dial. This tells me that the conn is abandoned and the new one is being established until the server is back to accept conn.

It has the same behavior no matter if we introduce your change or not.

My use case is:

conn := clickhouse.OpenDB( ... )
conn.SetMaxOpenConns(1)

for {
  scope, err := conn.Begin()
  stmt, err := scope.Prepare("INSERT....")
  stmt.Exec(...)
  scope.Commit()
}

I think there is some difference in how you do that and how you are able to reproduce it.

@n-oden
Copy link
Contributor Author

n-oden commented Jan 18, 2023

@jkaflik I've been having the exact same issue trying to reproduce the error in the test harness here, which is frustrating, but we have definitely seen the described behavior on our production systems. Happy to do a screenshare or post a recording somewhere if that would help!

@jkaflik
Copy link
Contributor

jkaflik commented Jan 18, 2023

@n-oden anything that brings me closer to this issue is welcome. Can you also let me know what/when library functions you call?

@n-oden
Copy link
Contributor Author

n-oden commented Jan 18, 2023

drop me a line -- [email protected]

Presently, if the tcp session to the server closes, we will emit broken
pipe errors until either the connection comes back to "idle" state or
until we reach ConnMaxLifetime.

Since EPIPE means that the tcp session is dead (ie we received a RST
packet from the server), there is no point in attempting to proceed past
that point: set the connection as closed.
@n-oden
Copy link
Contributor Author

n-oden commented Jan 19, 2023

@jkaflik updated as we discussed!

@jkaflik
Copy link
Contributor

jkaflik commented Jan 19, 2023

@n-oden thanks! as we agreed, merging it and merge in order to mitigate the issue on your side.

Let's continue with #844 since it might have the same root cause.

@jkaflik jkaflik merged commit 936a983 into ClickHouse:main Jan 19, 2023
@n-oden n-oden deleted the handle-epipe branch January 19, 2023 21:06
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants