This repository has been archived by the owner on Feb 17, 2023. It is now read-only.
forked from apache/arrow
client: read batches from the returned bufferlist in ScanTask:Execute lazily #101
Comments
JayjeetAtGithub added a commit that referenced this issue on Feb 21, 2021
Fixed by #100
JayjeetAtGithub added a commit that referenced this issue on Feb 21, 2021
JayjeetAtGithub added a commit that referenced this issue on Feb 21, 2021
Fix #101: In the RadosParquetScanTask::Execute() method, after the bufferlist filled with a serialized Table was returned, all the batches were first read into a RecordBatchVector, and that RecordBatchVector was then read again to produce a stream of batches. We don't need to read the batches from the table inside the Execute function; instead, we can keep the serialized bufferlist as is and return an iterator over it, which improves performance by iterating over the batches only once rather than twice.
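The idea in this commit message can be illustrated with a minimal sketch. It does not use Arrow or Ceph: `Batch`, `Serialize`, and `LazyBatchReader` are hypothetical names, and the length-prefixed layout is an assumed toy format, not the Arrow IPC format. It only shows the shape of the change, which is to keep the serialized buffer and decode one batch per call instead of materializing a vector first.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <optional>
#include <string>
#include <utility>
#include <vector>

// Hypothetical stand-in for a record batch: just a payload of bytes.
using Batch = std::string;

// Assumed toy wire format (not Arrow IPC): each batch is a 4-byte length
// in host byte order followed by that many payload bytes.
std::string Serialize(const std::vector<Batch>& batches) {
  std::string out;
  for (const Batch& b : batches) {
    const std::uint32_t len = static_cast<std::uint32_t>(b.size());
    out.append(reinterpret_cast<const char*>(&len), sizeof(len));
    out.append(b);
  }
  return out;
}

// Lazy reader: owns the serialized buffer and decodes one batch per Next(),
// so the batches are walked once by the consumer instead of being copied
// into an intermediate vector and then walked a second time.
class LazyBatchReader {
 public:
  explicit LazyBatchReader(std::string buffer) : buffer_(std::move(buffer)) {}

  // Returns the next batch, or std::nullopt at the end of the buffer.
  // A truncated record also ends the stream.
  std::optional<Batch> Next() {
    std::uint32_t len = 0;
    if (buffer_.size() - pos_ < sizeof(len)) {
      pos_ = buffer_.size();
      return std::nullopt;
    }
    std::memcpy(&len, buffer_.data() + pos_, sizeof(len));
    pos_ += sizeof(len);
    if (buffer_.size() - pos_ < len) {
      pos_ = buffer_.size();
      return std::nullopt;
    }
    Batch batch = buffer_.substr(pos_, len);
    pos_ += len;
    ++decoded_;
    return batch;
  }

  // Number of batches decoded so far; stays 0 until Next() is called.
  std::size_t decoded() const { return decoded_; }

 private:
  std::string buffer_;
  std::size_t pos_ = 0;
  std::size_t decoded_ = 0;
};
```

Constructing the reader does no decoding at all; work happens only as the consumer pulls batches, which is the behavior the commit message describes for the returned iterator.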
JayjeetAtGithub added a commit that referenced this issue on Feb 24, 2021
1) Read batches from the returned bufferlist lazily (#100)
Fix #101: In the RadosParquetScanTask::Execute() method, after the bufferlist filled with a serialized Table was returned, all the batches were first read into a RecordBatchVector, and that RecordBatchVector was then read again to produce a stream of batches. We don't need to read the batches from the table inside the Execute function; instead, we can keep the serialized bufferlist as is and return an iterator over it, which improves performance by iterating over the batches only once rather than twice.
2) Use serial task group in the CLS (#104)
Fixes #103: In the CLS, we scan just a single file, so we can use the SerialTaskGroup. Using ThreadedTaskGroup causes a thread pool to be created in every CLS function execution, which is a large performance overhead.
3) Fix bug causing client side filtering bypass to fail for rados-parquet (#105)
With the `rados-parquet` format, each fragment was scanned in the CLS and then scanned again on the client, wasting CPU cycles. This PR fixes it.
4) Remove costly Table print statement from CLS (#107)
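As a rough illustration of the second item, a serial task group can be sketched as a class that runs each task inline on the calling thread, so no thread pool is ever constructed. This is a sketch under stated assumptions: `SerialGroup` and the string-based `Status` are hypothetical and are not Arrow's task group classes.

```cpp
#include <cassert>
#include <functional>
#include <string>
#include <utility>

// Hypothetical status type: an empty string means success.
using Status = std::string;

// Minimal serial task group: every appended task runs immediately on the
// caller's thread, so creating the group costs nothing beyond the object.
class SerialGroup {
 public:
  // Runs the task inline unless an earlier task has already failed.
  void Append(const std::function<Status()>& task) {
    if (!first_error_.empty()) return;  // skip remaining work after a failure
    Status st = task();
    ++tasks_run_;
    if (!st.empty()) first_error_ = std::move(st);
  }

  // Returns the first error seen, or an empty Status if all tasks succeeded.
  const Status& Finish() const { return first_error_; }

  // Number of tasks that were actually executed.
  int tasks_run() const { return tasks_run_; }

 private:
  Status first_error_;
  int tasks_run_ = 0;
};
```

For a single-file scan there is only one task to run, so executing it inline gives the same result as a threaded group without paying for pool setup on every call.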
No description provided.