Improve reading for File-like table engine with column oriented format #21302

Closed

Conversation

keen-wolf
Contributor

@keen-wolf keen-wolf commented Feb 28, 2021

I hereby agree to the terms of the CLA available at: https://yandex.ru/legal/cla/?lang=en

Changelog category (leave one):

  • Performance Improvement

Changelog entry (a user-readable short description of the changes that goes to CHANGELOG.md):

Make File-like table engines read only the needed columns, which reduces I/O, memory usage, and computation cost. This closes issue #20129.

Detailed description / Documentation draft:
When executeFetchColumns constructs the read pipeline for StorageFile engines, it previously passed all columns of the table (from the storage metadata) to the downstream InputFormat, which performs the actual I/O. Now only the columns needed for the subsequent read are passed along the chain StorageFileSource -> SourceWithProgress -> InputFormat -> (Parquet, Arrow, Native, ...) BlockInputFormat. If the underlying InputFormat is column-oriented, it can read just these required columns and avoid the unnecessary cost; for non-column-oriented InputFormats, the change has no side effects. This update mainly modifies the code in "StorageFile.cpp".
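The column selection described above can be sketched roughly as follows. This is a minimal, standalone illustration and not the code from the patch: `ColumnNames`, `selectColumnsForFormat`, and the `format_is_column_oriented` flag are hypothetical names, and keeping the full header for row-oriented formats is only one assumed way of getting the "no side effects" behavior.

```cpp
#include <algorithm>
#include <cassert>
#include <string>
#include <vector>

// Hypothetical stand-in for a table header: an ordered list of column names.
using ColumnNames = std::vector<std::string>;

// Hypothetical helper: decides which columns are handed to the input format.
//   table_columns             - every column in the storage metadata
//   required_columns          - the columns the query actually reads
//   format_is_column_oriented - assumed true for formats such as Parquet, Arrow, ORC
ColumnNames selectColumnsForFormat(
    const ColumnNames & table_columns,
    const ColumnNames & required_columns,
    bool format_is_column_oriented)
{
    // Assumption: formats that are not column-oriented keep the full header,
    // so their behavior is unchanged.
    if (!format_is_column_oriented)
        return table_columns;

    // Preserve the table's column order, but keep only the requested columns.
    ColumnNames result;
    for (const auto & name : table_columns)
    {
        const bool requested = std::find(
            required_columns.begin(), required_columns.end(), name) != required_columns.end();
        if (requested)
            result.push_back(name);
    }
    return result;
}

// Small usage example with made-up column names.
void exampleUsage()
{
    const ColumnNames table = {"id", "name", "payload"};
    const ColumnNames required = {"payload", "id"};
    const ColumnNames expected = {"id", "payload"};

    assert(selectColumnsForFormat(table, required, true) == expected);
    assert(selectColumnsForFormat(table, required, false) == table);
}
```

With a column-oriented format, a query touching only `id` and `payload` would hand a two-column header to the reader, which can then skip the data of `name` entirely.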

@robot-clickhouse robot-clickhouse added the pr-performance Pull request with some performance improvements label Feb 28, 2021
@keen-wolf
Contributor Author

keen-wolf commented Feb 28, 2021

Sorry for the late update. It was a long New Year holiday~

Three File-like access paths have been analyzed carefully here: the File engine used when creating a table, TableFunction.File, and TableFunction.Input. The first two have been tested and benefit from this change. It has no effect on the last one, because "Input" is handled on the client side and the reading is done by an external piped command.

I used Parquet files to verify the performance improvement, and also ran the existing Parquet and Arrow format test cases to check correctness. Here are the comparison results before and after this optimization, for both the File storage and TableFunction.File paths.

FileFunction-before:

[screenshot: File-function-before]

FileFunction-after:

[screenshot: File-function-after]

FileStorage-after:

[screenshot: File-Storage-after]

However, the other File-like formats have not been tested thoroughly; that would be a big job. An automated performance comparison test can be written and integrated into CI, if needed.

@keen-wolf
Contributor Author

keen-wolf commented Mar 4, 2021

After testing and code analysis, I found that the “Native” format still reads all columns from the raw data file. This patch only fixes the column filtering path in the upper layer of StorageFile, so it does not help the “Native” format. Improving that may require changes to the format's internal code.
Below is another test covering the three formats where the change applies (Parquet, Arrow, ORC). As we can see, the most obvious improvement comes from the Parquet format.

[screenshot: 3formats-test]

By the way, the random string was created with:
dd if=/dev/urandom of=/var/lib/clickhouse/user_files/200KB_rand bs=200K count=1

@nikitamikhaylov nikitamikhaylov mentioned this pull request Mar 29, 2021
nikitamikhaylov added a commit that referenced this pull request Apr 7, 2021
@nikitamikhaylov
Member

This PR is merged in neighboring PR. All commits are saved.

@keen-wolf
Contributor Author

This PR is merged in neighboring PR. All commits are saved.

great job!


Successfully merging this pull request may close these issues.

InputFormat should allow to read subset of columns.
4 participants