[Feature Request]: Reduce number of byte[] copies in TextSource #23193

lukecwik · 2022-09-12T20:45:17Z

What would you like to happen?

The current TextSource implementation spends a lot of time copying byte[]s.

The Hadoop LineReader.java implementation is significantly faster (~2x) on typical files because it reduces how many byte[]s are copied. In a simple benchmark reading 10 million lines (60-120 characters long), it takes ~2.05 seconds to process the file, while the Apache Beam TextSource takes ~4.03 seconds.
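To illustrate the general idea (this is not the actual Hadoop LineReader or Beam TextSource code), here is a minimal sketch that scans the read buffer for newlines in place and only copies bytes when a line spans two buffer reads. The class name `LineScanSketch`, the method `readLines`, and the `bufferSize` parameter are hypothetical, and the sketch assumes `\n`-only delimiters and UTF-8 input.

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical illustration only; not the Beam or Hadoop implementation.
public class LineScanSketch {

  // Reads '\n'-delimited lines, decoding complete lines directly from the
  // read buffer. Bytes are copied into a side array only when a line
  // crosses a buffer boundary.
  static List<String> readLines(InputStream in, int bufferSize) throws IOException {
    byte[] buf = new byte[bufferSize];
    byte[] carry = new byte[0]; // partial line left over from earlier reads
    List<String> lines = new ArrayList<>();
    int n;
    while ((n = in.read(buf)) != -1) {
      int start = 0; // start of the current line within buf
      for (int i = 0; i < n; i++) {
        if (buf[i] != '\n') {
          continue;
        }
        if (carry.length == 0) {
          // Common case: the whole line is inside buf, no intermediate copy.
          lines.add(new String(buf, start, i - start, StandardCharsets.UTF_8));
        } else {
          // Rare case: stitch the carried prefix to the rest of the line.
          byte[] joined = Arrays.copyOf(carry, carry.length + i - start);
          System.arraycopy(buf, start, joined, carry.length, i - start);
          lines.add(new String(joined, StandardCharsets.UTF_8));
          carry = new byte[0];
        }
        start = i + 1;
      }
      if (start < n) {
        // Save the unterminated tail so buf can be reused by the next read.
        byte[] joined = Arrays.copyOf(carry, carry.length + n - start);
        System.arraycopy(buf, start, joined, carry.length, n - start);
        carry = joined;
      }
    }
    if (carry.length > 0) {
      lines.add(new String(carry, StandardCharsets.UTF_8)); // final line without '\n'
    }
    return lines;
  }
}
```

With a buffer much larger than a typical line (60-120 characters here), almost every line takes the no-copy branch, which is the kind of saving this request is after.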

Issue Priority

Priority: 2

Issue Component

Component: io-java-text


…e copied (fixes apache#23193)

This makes TextSource take about 2.3x less CPU resources during decoding.

Before this change:
```
TextSourceBenchmark.benchmarkTextSource thrpt 5 0.248 ± 0.029 ops/s
```

After this change:
```
TextSourceBenchmark.benchmarkHadoopLineReader thrpt 5 0.465 ± 0.064 ops/s
TextSourceBenchmark.benchmarkTextSource thrpt 5 0.575 ± 0.059 ops/s
```

lukecwik · 2022-09-12T21:39:25Z

CC: @bhisevishal


…e copied (fixes #23193) (#23196)

* Improve the performance of TextSource by reducing how many byte[]s are copied (fixes #23193)

This makes TextSource take about 2.3x less CPU resources during decoding.

Before this change:
```
TextSourceBenchmark.benchmarkTextSource thrpt 5 0.248 ± 0.029 ops/s
```

After this change:
```
TextSourceBenchmark.benchmarkHadoopLineReader thrpt 5 0.465 ± 0.064 ops/s
TextSourceBenchmark.benchmarkTextSource thrpt 5 0.575 ± 0.059 ops/s
```

* Write file in pieces instead of pre-allocating entire buffer

* Address PR comments



lukecwik added new feature awaiting triage labels Sep 12, 2022

github-actions bot added io java P2 text labels Sep 12, 2022

lukecwik self-assigned this Sep 12, 2022

github-actions bot removed the awaiting triage label Sep 12, 2022

lukecwik mentioned this issue Sep 15, 2022

Improve the performance of TextSource by reducing how many byte[]s are copied (fixes #23193) #23196

Merged

4 tasks

lukecwik closed this as completed in #23196 Sep 15, 2022

johnjcasey added a commit to johnjcasey/beam that referenced this issue Nov 21, 2023

Revert "Improve the performance of TextSource by reducing how many byte[]s are copied (fixes apache#23193) (apache#23196)"

5f924d8

This reverts commit 30a48f0
