Improve reading for File-like table engine with column oriented format #21302

keen-wolf · 2021-02-28T05:06:48Z

I hereby agree to the terms of the CLA available at: https://yandex.ru/legal/cla/?lang=en

Changelog category (leave one):

Performance Improvement

Changelog entry (a user-readable short description of the changes that goes to CHANGELOG.md):

Make File-like table engines read only the needed columns, which reduces IO, memory usage, and computation cost. This closes issue #20129.

Detailed description / Documentation draft:
When executeFetchColumns constructs the read pipeline for StorageFile engines, all columns of the table (from the storage metadata) were previously passed to the subsequent InputFormat, which does the actual IO. Now only the columns needed for the later read are passed through: StorageFileSource -> SourceWithProgress -> InputFormat -> (parquet, arrow, native...)BlockInputFormat. If the underlying InputFormat is column-oriented, it can read only the required columns and avoid unnecessary cost; for InputFormats that are not column-oriented, this has no side effects. This update mainly modifies code in "StorageFile.cpp".
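The column selection described above can be illustrated with a small standalone sketch. It is not the code from StorageFile.cpp: `ColumnDescription`, `selectRequestedColumns`, and `buildReadHeader` are hypothetical names invented for this example, and the `format_is_column_oriented` flag only mirrors the idea of the "Branch with Format isColumnOriented() or not" commit listed in this PR.

```cpp
#include <cassert>
#include <string>
#include <unordered_map>
#include <vector>

// Hypothetical stand-in for a column entry (name plus type) in the storage metadata.
struct ColumnDescription
{
    std::string name;
    std::string type;
};

// Return the full descriptions of only the requested columns, in the requested order.
// Names that are missing from the metadata are skipped in this sketch;
// a real engine would more likely report an error.
std::vector<ColumnDescription> selectRequestedColumns(
    const std::vector<ColumnDescription> & all_columns,
    const std::vector<std::string> & requested_names)
{
    std::unordered_map<std::string, const ColumnDescription *> by_name;
    for (const auto & column : all_columns)
        by_name.emplace(column.name, &column);

    std::vector<ColumnDescription> result;
    result.reserve(requested_names.size());
    for (const auto & name : requested_names)
    {
        auto it = by_name.find(name);
        if (it != by_name.end())
            result.push_back(*it->second);
    }
    return result;
}

// Build the header handed to the input format (assumed behavior):
// a column-oriented format gets only the requested columns,
// any other format keeps the full table header as before.
std::vector<ColumnDescription> buildReadHeader(
    const std::vector<ColumnDescription> & all_columns,
    const std::vector<std::string> & requested_names,
    bool format_is_column_oriented)
{
    if (format_is_column_oriented)
        return selectRequestedColumns(all_columns, requested_names);
    return all_columns;
}
```

With a three-column table and a query that touches two columns, a column-oriented format would receive a two-column header under these assumptions, while any other format would still receive all three.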

keen-wolf · 2021-02-28T05:46:31Z

Sorry for the late update. It was a long New Year holiday~

Three File-like code paths have been carefully analyzed here: the File-Engine for Creating Table, the TableFunction.File, and the TableFunction.Input. The first two have been tested and are improved by this feature. It has no effect on the last one, because "Input" is handled on the client side and the reading is done by an external pipe command.

I used Parquet files to verify the performance improvement, and also ran the existing Parquet and Arrow format test cases to check correctness. Here are the comparison results before and after this optimization, for both the File-Storage and TableFunction.File methods.

FileFunction-before:

FileFunction-after:

FileStorage-after:

However, the other File-like formats have not been tested thoroughly; that would be a big job. An automated performance comparison test case could be written and integrated into the CI, if needed.

keen-wolf · 2021-03-04T10:07:59Z

After testing and code analysis, I found that the "Native" format reads all columns from the raw data file. This patch only fixes the column filter path from the upper layer of StorageFile, so it does not help the "Native" format. Improving that may require some internal code changes.
Below is another test for the 3 valid formats (Parquet, Arrow, ORC). As we can see, the most obvious improvement comes from the Parquet format.

By the way, the random string was created via:
dd if=/dev/urandom of=/var/lib/clickhouse/user_files/200KB_rand bs=200K count=1
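
The format behavior reported in this comment could be captured by a simple check like the one below. This is only a sketch: the function name `isColumnOrientedFormat` and the hard-coded list are assumptions for illustration, taken from the three formats named above, and are not the check used in the actual patch.

```cpp
#include <array>
#include <cassert>
#include <string_view>

// Formats reported in this thread as benefiting from column pruning.
// The list is an assumption for this example, not ClickHouse's own registry.
constexpr std::array<std::string_view, 3> column_oriented_formats = {"Parquet", "Arrow", "ORC"};

// Hypothetical helper: true if the named format can read a subset of columns.
// "Native" is deliberately absent, since the comment above found no benefit for it.
bool isColumnOrientedFormat(std::string_view format_name)
{
    for (const auto & candidate : column_oriented_formats)
        if (candidate == format_name)
            return true;
    return false;
}
```

The comparison is case-sensitive in this sketch, so "parquet" would not match "Parquet".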

nikitamikhaylov · 2021-04-07T13:02:08Z

This PR was merged in a neighboring PR. All commits are saved.

keen-wolf · 2021-04-15T04:43:37Z

This PR is merged in neighboring PR. All commits are saved.

great job!

keen-wolf added 3 commits February 25, 2021 19:07

Only read needed columns for formats as parquet etc

038a404

fix

7130ae7

update comments

fa0196c

robot-clickhouse added the pr-performance Pull request with some performance improvements label Feb 28, 2021

vdimir added the can be tested label Feb 28, 2021

nikitamikhaylov self-assigned this Feb 28, 2021

keen-wolf added 7 commits February 28, 2021 23:35

fix the getColumsForNames() to bring the whole column info from metadata

8895d2f

Branch with Format isColumnOriented() or not

4101699

Merge remote-tracking branch 'clickhouse/master' into storagefile-col…

f67d987

…umn-oriented keep consistent with the upstream

Small fix

66aa4b1

update comments

5ae7662

remove const for value-copy-return

30f0969

the Native format is not supported after test

3bef156

nikitamikhaylov mentioned this pull request Mar 29, 2021

Merging #21302 #22299

Merged

nikitamikhaylov added a commit that referenced this pull request Apr 7, 2021

Merge pull request #22299 from nikitamikhaylov/keen-wolf-storagefile-…

48af7a8

…column-oriented Merging #21302

nikitamikhaylov closed this Apr 7, 2021
