Improve reading for File-like table engine with column oriented format #21302
Sorry for the late update; it was a long New Year holiday. Three File-like code paths have been carefully analyzed here: the File engine used when creating a table, TableFunction.File, and TableFunction.Input. The first two have been tested and are improved by this feature. It has no effect on the last one, because "Input" is handled on the client side and the reading is done by an external pipe command. I used Parquet files to verify the performance improvement, and also ran the existing Parquet and Arrow format test cases to verify correctness. Here is the comparison result before and after this optimization, for both the File storage and TableFunction.File methods. FileFunction-before: FileFunction-after: FileStorage-after: However, the other File-like formats have not been thoroughly tested; that would be a big job. An automated performance comparison test can be written and integrated into the CI, if needed.
…umn-oriented keep consistent with the upstream
After testing and code analysis, I found that the “Native” format reads all columns from the raw data file. This patch only fixes the column filter path in the layer above StorageFile, so it does not help the “Native” format. Improving it may require some internal code changes. By the way, the random String was created via:
…column-oriented Merging #21302
This PR was merged as part of a neighboring PR. All commits are preserved.
great job!
I hereby agree to the terms of the CLA available at: https://yandex.ru/legal/cla/?lang=en
Changelog category (leave one):
Changelog entry (a user-readable short description of the changes that goes to CHANGELOG.md):
Make the File-like table engine read only the needed columns, which reduces IO, memory, and computation cost. This closes issue #20129.
Detailed description / Documentation draft:
When executeFetchColumns is called to construct the read pipeline for StorageFile engines, previously all the columns of the table (from the storage metadata) were passed to the subsequent InputFormat, which does the real IO. Now only the columns needed for the later read are passed through: StorageFileSource -> SourceWithProgress -> InputFormat -> (parquet, arrow, native...)BlockInputFormat. If the underlying InputFormat is column-oriented, it can read only these required columns and avoid unnecessary cost; for non-column-oriented InputFormats, this has no side effects. This update mainly modifies code within "StorageFile.cpp".
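The column selection described above can be illustrated with a minimal, standalone sketch. It is not the code from this PR: `ColumnNames`, `selectColumnsToRead`, and the `format_is_column_oriented` flag are hypothetical names introduced for illustration, and the sketch assumes that non-column-oriented formats keep receiving the full column list as before.

```cpp
#include <algorithm>
#include <string>
#include <vector>

// Hypothetical stand-in for the column list kept in the storage metadata.
using ColumnNames = std::vector<std::string>;

// Hypothetical helper: choose the columns handed to the input format.
// Assumption: row-oriented formats still get every table column, so their
// behavior is unchanged; column-oriented formats get only what the query needs.
ColumnNames selectColumnsToRead(
    const ColumnNames & all_columns,
    const ColumnNames & requested_columns,
    bool format_is_column_oriented)
{
    if (!format_is_column_oriented)
        return all_columns;

    // Keep the table's column order while dropping columns the query does not use.
    ColumnNames result;
    for (const auto & name : all_columns)
    {
        const bool needed = std::find(requested_columns.begin(), requested_columns.end(), name)
            != requested_columns.end();
        if (needed)
            result.push_back(name);
    }
    return result;
}
```

For a table with columns `id`, `name`, `payload`, `value` and a query that touches only `value` and `id`, calling the helper with `format_is_column_oriented = true` yields `id`, `value`, so a Parquet-like reader would skip `name` and `payload`. With the flag set to `false`, all four columns are returned.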