Question: how to stream data #122
My mistake, it reads only 5000 rows. I don't understand why, though.
It defaults to 5000 rows for queries to prevent blowing up client memory -- you can change the …
You can also add a higher …
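A minimal sketch of lifting that cap, assuming the limit in question is the client's `query_limit` parameter (that name comes from the clickhouse-connect docs, not from this thread):

```python
import clickhouse_connect

# Assumption: the 5000-row cap discussed above is the client's query_limit
# parameter; raising it (or setting it to 0, if your version treats 0 as
# "no limit") lifts the cap at the cost of client memory.
client = clickhouse_connect.get_client(host='localhost', query_limit=1_000_000)

result = client.query('SELECT number FROM system.numbers LIMIT 20000')
print(len(result.result_rows))  # no longer truncated at 5000
```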
I suggest documenting it.
BTW, is it possible to stream a result? I need to query about 50 million wide rows (~15 GB).
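For that 50-million-row case, a streaming read along these lines avoids holding the whole result in memory (a sketch; `query_rows_stream` and the table name are assumptions based on the current clickhouse-connect API, not quoted from this thread):

```python
import clickhouse_connect

client = clickhouse_connect.get_client(host='localhost')

# Stream rows block by block instead of materializing ~15 GB at once.
row_count = 0
with client.query_rows_stream('SELECT * FROM wide_table') as stream:
    for row in stream:
        row_count += 1  # process each row as it arrives

print(row_count)
```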
Hm, I cannot make it work.
This SQL throws "Division by zero" after 30,000,000 rows.
I think this is the result of the default True value for the send_progress client setting; it can be disabled with
client = clickhouse_connect.get_client(send_progress=False)
That default probably should also be changed, at least for streaming queries (another leftover from before streaming was implemented).
Still, no streaming. Curl returns data instantly.
Based on running a profile, the delay looks like it's coming from ClickHouse doing compression. One more setting change makes it almost instant:
client = clickhouse_connect.get_client(send_progress=False, compress=False)
Bubbling the reported error up through the optimized Cython code is a bit more challenging, but I'll take a look.
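Putting the two workarounds from this thread together, a streaming setup might look roughly like this (only `send_progress` and `compress` come from the comments above; the host, query, and streaming method are placeholders/assumptions from the current API):

```python
import clickhouse_connect

# send_progress=False and compress=False are the two settings quoted above.
client = clickhouse_connect.get_client(
    host='localhost',
    send_progress=False,
    compress=False,
)

total = 0
with client.query_row_block_stream('SELECT * FROM big_table') as stream:
    for block in stream:   # each block is a list of rows
        total += len(block)
print(total)
```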
Or the decompression buffer is huge.
Waiting in `recv_into` means the Python driver is waiting for ClickHouse to actually put data on the wire. Decompression on the Python side is actually pretty fast. If compression is used, the Python side decompresses each block individually. I'm not sure how ClickHouse does the lz4 compression over HTTP.
Hmm, I lied, it looks like ClickHouse picks zstd when Accept-Encoding includes both lz4 and zstd. And it appears that ClickHouse is sending the whole thing (or at least very big pieces) as a single zstd frame. I'll dig a little further. |
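One way to see which codec the server actually picks is to hit the HTTP interface directly, offering a single value in Accept-Encoding (a standard-library sketch; the port, the enable_http_compression setting, and the query are the usual defaults and placeholders, not taken from this thread):

```python
import urllib.request

# Offer exactly one codec so there is no ambiguity about what ClickHouse chooses;
# swap 'zstd' for 'lz4' to compare. enable_http_compression=1 asks the server to
# compress the HTTP response at all.
req = urllib.request.Request(
    'http://localhost:8123/?enable_http_compression=1',
    data=b'SELECT number FROM system.numbers LIMIT 1000000',
    headers={'Accept-Encoding': 'zstd'},
)
with urllib.request.urlopen(req) as resp:
    print(resp.headers.get('Content-Encoding'))  # codec the server actually used
    compressed = resp.read()                     # urllib does not decompress this
print(len(compressed), 'compressed bytes on the wire')
```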
22 GB |
Yeah, zstd is broken for streaming for whatever reason; it looks like it's slow on both the ClickHouse side and the Python side.
Current memory usage is 7.531117MB; Peak was 25.668245MB
Thanks. Though tracemalloc is not entirely trustworthy, and it does not show RSS.
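For a cross-check against the OS view rather than Python's allocator, something like this can sit next to the tracemalloc numbers (a sketch using only the standard library; resource is Unix-only and ru_maxrss units differ by platform):

```python
import resource
import tracemalloc

tracemalloc.start()

# ... run the streaming query here ...

current, peak = tracemalloc.get_traced_memory()
max_rss = resource.getrusage(resource.RUSAGE_SELF).ru_maxrss

print(f'tracemalloc: current {current / 2**20:.2f} MiB, peak {peak / 2**20:.2f} MiB')
# ru_maxrss is kilobytes on Linux and bytes on macOS, hence only a rough check
print(f'peak RSS per the OS: {max_rss}')
```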
Thanks for stress testing and prompting me to find the bad …
I suggest verifying it with the ClickHouse team.
It looks like the console client …
Yes, it needs more investigation; it's quite possible I'm using the Python zstd library wrong.
I was using the wrong function in the zstandard library, one which expected a content size. It was reading the zstd frame header sent by ClickHouse incorrectly and presumed a content size of many petabytes, so it would retrieve all of the response before decompressing. zstd and lz4 performance should be pretty much the same following the 0.5.9 release.
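For anyone who hits the same trap: the distinction is roughly between a one-shot decompress, which relies on the content size declared in the frame header, and incremental decompression that never needs it. A sketch with the zstandard package (not the driver's actual code):

```python
import zstandard

def iter_decompressed(chunks):
    """Incrementally decompress an iterable of compressed byte chunks.

    A decompression object works block by block and never consults the frame
    header's declared content size, unlike a one-shot decompress() call, so a
    huge single-frame response can be processed as it streams in.
    """
    dobj = zstandard.ZstdDecompressor().decompressobj()
    for chunk in chunks:
        out = dobj.decompress(chunk)
        if out:
            yield out
```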
Thank you for the fix. It works amazingly fast now. You could probably also improve the streaming exception message by extracting the actual ClickHouse error from the response.
I'll try the regex -- I was working under the assumption that the ClickHouse exception would be in a separate block/chunk at the end of the response, but that's clearly incorrect (or maybe only true on my current Mac clickhouse build) |
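A hypothetical version of that extraction, purely for illustration (the helper name and the exact pattern are made up; ClickHouse embeds its errors as `Code: NNN. DB::Exception: ...` text, as in the Division by zero case above):

```python
import re

# ClickHouse appends its error text to the body when a query fails mid-stream,
# e.g. b"Code: NNN. DB::Exception: Division by zero: ..." (NNN is the error code).
_CH_ERROR = re.compile(rb'Code: \d+\.\s+DB::Exception: [^\n]*')

def extract_clickhouse_error(chunk: bytes):
    """Return the embedded ClickHouse error from a response chunk, or None."""
    match = _CH_ERROR.search(chunk)
    return match.group(0).decode('utf-8', 'replace') if match else None
```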
The same in the CH client.