client: read batches from the returned bufferlist in ScanTask::Execute lazily #101

JayjeetAtGithub · 2021-02-21T06:50:42Z

No description provided.

JayjeetAtGithub · 2021-02-21T06:53:49Z

Fixed by #100

Fix #101: In the RadosParquetScanTask::Execute() method, after the bufferlist holding a serialized Table was returned, all the batches were first read into a RecordBatchVector, and that RecordBatchVector was then read again to produce a stream of batches. We don't need to read the batches from the table inside Execute(). Instead, we can keep the serialized bufferlist as is and return an iterator over it, which improves performance by iterating over the batches once rather than twice.
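To illustrate the idea, here is a minimal, self-contained sketch of lazy batch reading. It is not the code from #100: `Batch`, `Buffer`, `Serialize`, and `LazyBatchReader` are hypothetical stand-ins for an Arrow record batch, the Ceph bufferlist, the serialized Table, and the iterator returned from Execute(), and the length-prefixed wire format is an assumption made only for this example. The reader holds on to the serialized bytes and decodes one batch per `Next()` call, so nothing is materialized into an intermediate vector first.

```cpp
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <utility>
#include <vector>

// Hypothetical stand-ins: Batch plays the role of a record batch and Buffer
// plays the role of the serialized bufferlist returned from the CLS call.
using Batch = std::vector<int32_t>;
using Buffer = std::vector<uint8_t>;

// Assumed wire format for this sketch: [u32 count][count x int32] per batch.
Buffer Serialize(const std::vector<Batch>& batches) {
  Buffer out;
  for (const Batch& b : batches) {
    const uint32_t n = static_cast<uint32_t>(b.size());
    const uint8_t* len = reinterpret_cast<const uint8_t*>(&n);
    out.insert(out.end(), len, len + sizeof(n));
    const uint8_t* data = reinterpret_cast<const uint8_t*>(b.data());
    out.insert(out.end(), data, data + n * sizeof(int32_t));
  }
  return out;
}

// Keeps the serialized bytes as they are and decodes one batch per call,
// so each batch is touched once, when the consumer asks for it.
class LazyBatchReader {
 public:
  explicit LazyBatchReader(Buffer buf) : buf_(std::move(buf)) {}

  // Decodes the next batch into *out. Returns false at end of stream or if
  // the remaining bytes do not form a complete batch.
  bool Next(Batch* out) {
    uint32_t n = 0;
    if (buf_.size() - pos_ < sizeof(n)) return false;
    std::memcpy(&n, buf_.data() + pos_, sizeof(n));
    const std::size_t bytes = static_cast<std::size_t>(n) * sizeof(int32_t);
    if (buf_.size() - pos_ - sizeof(n) < bytes) return false;  // truncated
    pos_ += sizeof(n);
    out->assign(n, 0);
    if (bytes > 0) std::memcpy(out->data(), buf_.data() + pos_, bytes);
    pos_ += bytes;
    ++decoded_;
    return true;
  }

  // Number of batches decoded so far (useful to confirm laziness).
  std::size_t decoded() const { return decoded_; }

 private:
  Buffer buf_;
  std::size_t pos_ = 0;
  std::size_t decoded_ = 0;
};
```

A caller would construct `LazyBatchReader reader(Serialize(batches))` and loop with `while (reader.Next(&batch)) { ... }`. Before the first `Next()` call, `decoded()` is 0, which is the property the eager version lacked.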

1) Read batches from the returned bufferlist lazily (#100)
   Fix #101: In the RadosParquetScanTask::Execute() method, after the bufferlist holding a serialized Table was returned, all the batches were first read into a RecordBatchVector, and that RecordBatchVector was then read again to produce a stream of batches. We don't need to read the batches from the table inside Execute(). Instead, we can keep the serialized bufferlist as is and return an iterator over it, which improves performance by iterating over the batches once rather than twice.
2) Use serial task group in the CLS (#104)
   Fixes #103: In the CLS, we scan just a single file, so we can use the SerialTaskGroup. Using ThreadedTaskGroup causes a thread pool to be created on every CLS function execution, which is a big performance overhead.
3) Fix bug causing client side filtering bypass to fail for rados-parquet (#105)
   With the `rados-parquet` format, each fragment was scanned in the CLS but was then needlessly scanned again on the client, wasting CPU cycles. This PR fixes it.
4) Remove costly Table print statement from CLS (#107)


JayjeetAtGithub added a commit that referenced this issue Feb 21, 2021

Read the batches lazily (#100)

eb66411

Fix #101

JayjeetAtGithub closed this as completed Feb 21, 2021

JayjeetAtGithub added a commit that referenced this issue Feb 21, 2021

Read batches from the returned bufferlist lazily (#100)

95ffc2c

Fix #101
