Client doesn't send enable_http_compression parameter in some circumstances #157

Closed
khmelevskiy opened this issue Mar 30, 2023 · 13 comments · Fixed by #158 or #168
Labels: bug (Something isn't working)

@khmelevskiy
khmelevskiy commented Mar 30, 2023

Describe the bug

When I run client.query with:
  • clickhouse_connect version 0.5.3, I get the result in 0:00:00.812322
  • clickhouse_connect version 0.5.17, I get the result in 0:00:07.987024

Steps to reproduce

  1. Take the example below
  2. Change the table to any table of yours
  3. Run with clickhouse_connect 0.5.3
  4. Run with clickhouse_connect 0.5.17
  5. Compare the speed results

Expected behaviour

Code example

import clickhouse_connect

clickhouse_connect_client = clickhouse_connect.get_client(
    username="",
    password="",
    host="",
    port=,
    secure=False,
    query_limit=1000,
    compress=True,
)

clickhouse_connect_client.query("SELECT * FROM table").result_rows

clickhouse-connect and/or ClickHouse server logs

Configuration

Environment

  • clickhouse-connect version: 0.5.3/ 0.5.17
  • Python version: 3.9
  • Operating system: macOS

ClickHouse server

  • ClickHouse Server version: 23.2.3.17
  • ClickHouse Server non-default settings, if any:
  • CREATE TABLE statements for tables involved:
  • Sample data for these tables, use clickhouse-obfuscator if necessary
@khmelevskiy khmelevskiy added the bug Something isn't working label Mar 30, 2023
@genzgd genzgd changed the title compress=True doesn't work Performance degradation when applying server timezone Mar 30, 2023
@genzgd
Collaborator
genzgd commented Mar 30, 2023

Just to be clear, the query still works, it's just slower?

What are the results with compress=False?

I suspect the issue is that your ClickHouse server timezone is set to something other than UTC and you have one or more columns with DateTime/DateTime64 types. This is the most likely cause of the performance problem, since applying timezones is very expensive. You might try apply_server_timezone=False in get_client to see if that fixes it (this will return timezone-naive datetime.datetime values). Version 0.5.3 does not apply the timezone.

Otherwise I would need a query and sample data to reproduce the problem.

I'm preparing a release that will not apply the server timezone when the client and server timezones match to handle this issue in locations where only one timezone is common.
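As an aside, a minimal standalone illustration of what "applying the server timezone" means for returned values (the datetime and the UTC+3 offset below are made up for illustration; this is not clickhouse-connect internals):

```python
from datetime import datetime, timezone, timedelta

# A value as decoded from the native format (assume it represents UTC).
raw = datetime(2023, 3, 30, 12, 0, 0)

# Assume a non-UTC server timezone of UTC+3 for illustration.
server_tz = timezone(timedelta(hours=3))

# apply_server_timezone=False: the value stays timezone-naive.
naive = raw

# apply_server_timezone=True: every value gets converted to an aware
# datetime in the server zone, which costs time on large result sets.
aware = raw.replace(tzinfo=timezone.utc).astimezone(server_tz)

print(naive.tzinfo)  # None
print(aware.hour)    # 15
```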

@genzgd genzgd self-assigned this Mar 30, 2023
@genzgd genzgd linked a pull request Mar 30, 2023 that will close this issue
@khmelevskiy
Author

khmelevskiy commented Mar 31, 2023

apply_server_timezone=False didn't help; of course we have datetime.datetime type columns.
It works not just slowly, but 10 times slower, and the more data, the slower it gets.
@genzgd

@khmelevskiy
Author

I removed all JSON and time columns; only strings and arrays remain. It runs 26 times slower than in version 0.5.3.

@genzgd
Collaborator
genzgd commented Mar 31, 2023

I reproduced a similar performance problem with mismatched timezones and released a fix for it. Otherwise I don't see this performance issue in our client benchmarks or other tests, and no one else has reported it despite many new downloads of 0.5.17 and 0.5.18, so apparently something else is going on in your environment.

(1) That much of a degradation might be related to not having the C extensions enabled/working for some reason. Most likely that would show up in any log messages when you run your application. What is the log output when you start your application with clickhouse-connect?

(2) If you are running against ClickHouse on the local host, it could be slower because your ClickHouse server is taking time to compress the data. Did you try compress=False? Can you try compress='zstd' and compress='lz4'?

Also can you run the following script on both versions:

#!/usr/bin/env python -u

import time

import clickhouse_connect
from clickhouse_connect.common import version

client = clickhouse_connect.get_client(compress=False, query_limit=0)

start = time.time()
result = client.query(
    """
    SELECT number,  toDateTime(toUInt32(number + toInt32(now()))) as date,
    randomPrintableASCII(256) as string from numbers(5000000)
    """
)

rows = len(result.result_set)
total_time = time.time() - start
print(f'VERSION: {version()}')
print(f'\t\tTime: {total_time:.4f} sec  rows: {rows}  rows/sec {rows // total_time}')

Results on my Mac laptop are:

VERSION: 0.5.17
		Time: 3.3012 sec  rows: 5000000  rows/sec 1514612.0

VERSION: 0.5.18
		Time: 3.2602 sec  rows: 5000000  rows/sec 1533627.0

VERSION: 0.5.3
		Time: 3.4551 sec  rows: 5000000  rows/sec 1447149.0

I'm happy to reopen this if you can provide a reproducible example.

@genzgd genzgd reopened this Mar 31, 2023
@genzgd genzgd added the need info Please provide additional information label Mar 31, 2023
@khmelevskiy
Author
Here is a quick test:

0.648514986038208 - version 0.5.18
0.22119498252868652 - version 0.5.3

import time
import clickhouse_connect

clickhouse_client = clickhouse_connect.get_client(
    host="",
    port=,
    secure=False,
    database=None,
    user='',
    password="",
    query_limit=0,
    connect_timeout=20,
    compress=True,
    session_id=f"test_session_id",
)
clickhouse_client.database = None

query_create_temp_table = """
CREATE TEMPORARY TABLE temp_test_table
        (
            Filed1 String,
            Filed2 String,
            Filed3 Array(String),
            Field4 Nullable(String)
        )"""
clickhouse_client.command(query_create_temp_table)

generate_values = ", ".join([f"('field1_{x}', 'field2_{x}', ['field3_1_{x}', 'field3_2_{x}'], 'field4_{x}' )" for x in range(10000)])
query_insert_temp_table = (
    f""" INSERT INTO temp_test_table (Filed1, Filed2, Filed3, Field4) VALUES {generate_values}"""
)
clickhouse_client.command(query_insert_temp_table)

start = time.time()
query_read_temp_table = """ SELECT * FROM temp_test_table"""
result = clickhouse_client.query(query_read_temp_table).result_rows
total_time = time.time() - start
print(total_time)
clickhouse_client.command("DROP TABLE temp_test_table")

@khmelevskiy
Author

khmelevskiy commented Apr 1, 2023

If I set compress=False, the results are:

0.7565817832946777 - version 0.5.3
0.763045072555542 - version 0.5.18

and ClickHouse is on a remote server.

@genzgd
Collaborator
genzgd commented Apr 4, 2023

Thanks for the deeper investigation, it's really appreciated. Locally with your query I actually see that 0.5.18 is slightly faster in all cases, regardless of compression. (Note that there were improvements to Nullable columns in 0.5.10.) Also, your current query size of 10k records is probably too small to really measure clickhouse-connect performance, as many things can add variance in such a short time frame.

My slightly modified script:

import time
import clickhouse_connect
from clickhouse_connect.common import version

compression = 'zstd'
clickhouse_client = clickhouse_connect.get_client(
    user='default',
    query_limit=0,
    connect_timeout=20,
    compress=compression,
    session_id=f"test_session_id",
)

query_create_temp_table = """
CREATE TABLE IF NOT EXISTS temp_test_table
        (
            Field1 String,
            Field2 String,
            Field3 Array(String),
            Field4 Nullable(String)
        ) ENGINE MergeTree() ORDER BY Field1
"""
clickhouse_client.command(query_create_temp_table)

generate_values = [[f'field1_{x}', f'field2_{x}', [f'field3_1_{x}', f'field3_2_{x}'], f'field4_{x}']
                  for x in range(2000000)]
clickhouse_client.insert('temp_test_table', generate_values)

start = time.time()
result = clickhouse_client.query('SELECT * FROM temp_test_table')
rows = len(result.result_rows)
total_time = time.time() - start
print(f'VERSION: {version()}  COMPRESSION:{compression}')
print(f'\t\tTime: {total_time:.4f} sec  rows: {rows}  rows/sec {rows // total_time}')
clickhouse_client.command("DROP TABLE temp_test_table")

Test results on 2 million rows (locally):

VERSION: 0.5.3  COMPRESSION:zstd
		Time: 3.1889 sec  rows: 2000000  rows/sec 627170.0

VERSION: 0.5.3  COMPRESSION:zstd
		Time: 3.1528 sec  rows: 2000000  rows/sec 634352.0

VERSION: 0.5.3  COMPRESSION:lz4
		Time: 2.9027 sec  rows: 2000000  rows/sec 689015.0

VERSION: 0.5.3  COMPRESSION:lz4
		Time: 3.0087 sec  rows: 2000000  rows/sec 664737.0

VERSION: 0.5.3  COMPRESSION:lz4
		Time: 2.9727 sec  rows: 2000000  rows/sec 672791.0

VERSION: 0.5.3  COMPRESSION:False
		Time: 2.9082 sec  rows: 2000000  rows/sec 687711.0

VERSION: 0.5.3  COMPRESSION:False
		Time: 2.8132 sec  rows: 2000000  rows/sec 710932.0



VERSION: 0.5.18  COMPRESSION:False
		Time: 2.7406 sec  rows: 2000000  rows/sec 729780.0

VERSION: 0.5.18  COMPRESSION:False
		Time: 2.7063 sec  rows: 2000000  rows/sec 739029.0

VERSION: 0.5.18  COMPRESSION:False
		Time: 2.7510 sec  rows: 2000000  rows/sec 727016.0

VERSION: 0.5.18  COMPRESSION:lz4
		Time: 2.8280 sec  rows: 2000000  rows/sec 707215.0

VERSION: 0.5.18  COMPRESSION:lz4
		Time: 2.7650 sec  rows: 2000000  rows/sec 723338.0

VERSION: 0.5.18  COMPRESSION:zstd
		Time: 2.8820 sec  rows: 2000000  rows/sec 693973.0

VERSION: 0.5.18  COMPRESSION:zstd
		Time: 2.8767 sec  rows: 2000000  rows/sec 695248.0

VERSION: 0.5.18  COMPRESSION:zstd
		Time: 2.8741 sec  rows: 2000000  rows/sec 695875.0

So maybe there is something strange in your environment since you are connecting to a remote server? I'm really not finding any evidence that something changed between versions that would negatively affect performance.

@khmelevskiy
Author

Thanks!

Here are my results with your code:

# VERSION: 0.5.18  COMPRESSION:zstd
# 		Time: 29.1147 sec  rows: 2000000  rows/sec 68693.0
    
# VERSION: 0.5.3  COMPRESSION:zstd
# 		Time: 5.1136 sec  rows: 2000000  rows/sec 391111.0

We'll look for what the reason could be.
But so far I don't understand how it can work quickly in one version but not in the other.

@khmelevskiy
Author

Can I provide some settings from my ClickHouse for diagnostics? Please tell me which ones.

@khmelevskiy
Author

khmelevskiy commented Apr 6, 2023

On our ClickHouse:

We found the problem. In the old version, the parameter enable_http_compression=1 was passed with requests; this does not happen in the new version.

And even if we add SETTINGS enable_http_compression=1 to the query (for example, SELECT * FROM temp_test_table SETTINGS enable_http_compression=1), it does not work. It started working only when we set this parameter in the ClickHouse settings on the server.

Also, after turning this setting on in the server, if I use SELECT * FROM temp_test_table SETTINGS enable_http_compression=0, it still doesn't work.
Compression is disabled if I pass compress=False when initializing the client.

On the local ClickHouse:

Everything works fast by default.

@genzgd
Collaborator
genzgd commented Apr 6, 2023

The current version should pass the setting enable_http_compression=1 as a query parameter if it's not already set on the ClickHouse server and the user has permission to change it. (Just sending that setting with the query doesn't work because clickhouse-connect doesn't send the correct Accept-Encoding HTTP header with it.)

I'm trying to figure out why this works in 0.5.3 but doesn't work in 0.5.17. There are some code changes in that area, but as you say, it works fine locally.

What version of ClickHouse are you using on the remote server?
Are there any special permissions on the system.settings table (which clickhouse-connect checks to see if has permission to send that setting)?
What was the original enable_http_compression setting that didn't work?

Ideally if you could step through this code in httpclient.py with a debugger and see why self._send_comp_setting isn't getting enabled that might explain where the bug is:

        comp_setting = self._setting_status('enable_http_compression')
        self._send_comp_setting = not comp_setting.is_set and comp_setting.is_writable
        if comp_setting.is_set or comp_setting.is_writable:
            self.compression = compression
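For reference, the decision above can be sketched standalone (SettingStatus and decide here are minimal stand-ins for illustration, not the real clickhouse-connect classes):

```python
from collections import namedtuple

# Stand-in for clickhouse-connect's setting-status result: whether the
# setting is already set on the server, and whether the user may change it.
SettingStatus = namedtuple('SettingStatus', ('is_set', 'is_writable'))

def decide(comp_setting: SettingStatus, compression):
    # Send enable_http_compression=1 only when the server doesn't already
    # have it set but the client is allowed to change it.
    send_comp_setting = not comp_setting.is_set and comp_setting.is_writable
    # Compression is usable if the setting is set or can be set.
    use_compression = compression if (comp_setting.is_set or comp_setting.is_writable) else False
    return send_comp_setting, use_compression

print(decide(SettingStatus(False, True), 'lz4'))   # (True, 'lz4')
print(decide(SettingStatus(True, False), 'lz4'))   # (False, 'lz4')
print(decide(SettingStatus(False, False), 'lz4'))  # (False, False)
```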

@genzgd genzgd changed the title Performance degradation when applying server timezone Client doesn't send enable_http_compression parameter in some circumstances Apr 6, 2023
@khmelevskiy
Copy link
Author

khmelevskiy commented Apr 6, 2023

The problem is in this line.

I think you need to change:
return SettingStatus(comp_setting != '0', comp_setting.readonly != 1) -> return SettingStatus(comp_setting.value != '0', comp_setting.readonly != 1)
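To illustrate why that one-token change matters (Setting below is a made-up stand-in for the settings row object, not the real class):

```python
from collections import namedtuple

# Stand-in for the row fetched from system.settings.
Setting = namedtuple('Setting', ('value', 'readonly'))
comp_setting = Setting(value='0', readonly=0)

# Buggy comparison: the whole row object never equals the string '0',
# so is_set is always reported True, even when the server value is '0'.
buggy_is_set = comp_setting != '0'

# Fixed comparison: actually compares the setting's value.
fixed_is_set = comp_setting.value != '0'

print(buggy_is_set, fixed_is_set)  # True False
```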

@genzgd
Collaborator
genzgd commented Apr 6, 2023

Oh, nice catch! Thanks so much! I stared at that code many times and didn't see that problem. I'll release a fix.

@genzgd genzgd removed the need info Please provide additional information label Apr 6, 2023
@genzgd genzgd linked a pull request Apr 6, 2023 that will close this issue