This repository has been archived by the owner on Feb 17, 2023. It is now read-only.

client: read batches from the returned bufferlist in ScanTask::Execute lazily #101

Closed
JayjeetAtGithub opened this issue Feb 21, 2021 · 1 comment

Comments

@JayjeetAtGithub
Collaborator

No description provided.

JayjeetAtGithub added a commit that referenced this issue Feb 21, 2021
@JayjeetAtGithub
Collaborator Author

Fixed by #100

JayjeetAtGithub added a commit that referenced this issue Feb 21, 2021
Fix #101: In the RadosParquetScanTask::Execute() method, after the bufferlist filled with a serialized Table was
returned, all the batches were first read into a RecordBatchVector, and then that RecordBatchVector was read again
to produce a stream of batches. We don't need to read the batches from the table inside the Execute function; instead, we can keep the serialized bufferlist
as is and return an iterator over it, improving performance by iterating over the batches only once rather than twice.
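
To illustrate the difference described in this commit message, here is a minimal, self-contained sketch. It does not use the Arrow or Ceph APIs: `Batch`, `Buffer`, `ReadAllBatches`, and `LazyBatchReader` are hypothetical stand-ins, and the length-prefixed layout is an assumption made only for this example. The eager function decodes every batch up front, so the caller walks the data a second time; the lazy reader keeps the serialized buffer as is and decodes one batch per `Next()` call.

```cpp
#include <cassert>
#include <cstddef>
#include <optional>
#include <utility>
#include <vector>

// Hypothetical stand-ins (not Arrow/Ceph types): a "batch" is a vector of
// ints, and the serialized buffer is a flat run of [count, v1, ..., vcount].
using Batch = std::vector<int>;
using Buffer = std::vector<int>;

// Eager approach: decode every batch into a vector first. The caller then
// has to iterate over the result a second time.
std::vector<Batch> ReadAllBatches(const Buffer& buf) {
  std::vector<Batch> out;
  std::size_t pos = 0;
  while (pos < buf.size()) {
    const auto n = static_cast<std::size_t>(buf[pos++]);
    assert(pos + n <= buf.size());  // malformed input is a programming error here
    const auto first = buf.begin() + static_cast<std::ptrdiff_t>(pos);
    out.emplace_back(first, first + static_cast<std::ptrdiff_t>(n));
    pos += n;
  }
  return out;
}

// Lazy approach: keep the serialized buffer untouched and decode one batch
// per call, so the data is walked only once.
class LazyBatchReader {
 public:
  explicit LazyBatchReader(Buffer buf) : buf_(std::move(buf)) {}

  // Decodes and returns the next batch, or std::nullopt at end of buffer.
  std::optional<Batch> Next() {
    if (pos_ >= buf_.size()) return std::nullopt;
    const auto n = static_cast<std::size_t>(buf_[pos_++]);
    assert(pos_ + n <= buf_.size());
    const auto first = buf_.begin() + static_cast<std::ptrdiff_t>(pos_);
    Batch batch(first, first + static_cast<std::ptrdiff_t>(n));
    pos_ += n;
    ++decoded_;
    return batch;
  }

  // Number of batches decoded so far; stays 0 until Next() is called.
  std::size_t decoded() const { return decoded_; }

 private:
  Buffer buf_;
  std::size_t pos_ = 0;
  std::size_t decoded_ = 0;
};
```

A consumer would call `Next()` in a loop until it returns `std::nullopt`; no batch is decoded before it is asked for.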
JayjeetAtGithub added a commit that referenced this issue Feb 24, 2021
1) Read batches from the returned bufferlist lazily (#100)

 Fix #101: In the RadosParquetScanTask::Execute() method, after the bufferlist filled with a serialized Table was
 returned, all the batches were first read into a RecordBatchVector, and then that RecordBatchVector was read again
 to produce a stream of batches. We don't need to read the batches from the table inside the Execute function; instead, we can keep the serialized bufferlist
 as is and return an iterator over it, improving performance by iterating over the batches only once rather than twice.

2) Use serial task group in the CLS (#104)

   Fixes #103: In the CLS, we scan just a single file, so we can just use the SerialTaskGroup. Using ThreadedTaskGroup causes a thread pool to be created on every CLS function execution, which is a significant performance overhead.

3) Fix bug causing client side filtering bypass to fail for rados-parquet (#105)

   When using the `rados-parquet` format, each fragment was scanned in the CLS but was then needlessly scanned again on the client, wasting CPU cycles. This PR fixes it.

4) Remove costly Table print statement from CLS (#107)
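
As a rough illustration of item 2, the sketch below shows why a serial task group is cheap when there is only one file to scan: each task runs immediately on the calling thread, so no thread pool has to be created per call. `InlineTaskGroup` is a hypothetical, simplified stand-in written for this example; it is not Arrow's SerialTaskGroup, and the `bool` return value stands in for a status type.

```cpp
#include <cassert>
#include <functional>

// Hypothetical simplified stand-in for a serial task group (not Arrow's
// class). Tasks return true on success and false on failure.
class InlineTaskGroup {
 public:
  // Runs the task immediately on the calling thread; no worker threads or
  // thread pool are created. After the first failure, later tasks are skipped.
  void Append(const std::function<bool()>& task) {
    assert(static_cast<bool>(task));  // an empty std::function is a caller bug
    if (!ok_) return;
    ok_ = task();
    ++ran_;
  }

  // Nothing to wait for: all work already finished inside Append().
  bool Finish() const { return ok_; }

  // Number of tasks that were actually executed.
  int ran() const { return ran_; }

 private:
  bool ok_ = true;
  int ran_ = 0;
};
```

Because the work is done inside `Append()`, the only per-call cost is the task itself, which matches the single-file case described above.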