[Feature Request]: Support LZMA compression in python I/O SDKs #25316

wrossmorrow · 2023-02-05T14:49:27Z

What would you like to happen?

LZMA compression is part of the Python standard library, but it is not one of the compression strategies supported by the beam.io.ReadFromText and beam.io.WriteToText PTransforms. openwebtext, for example, uses this compression. I think this may be a pretty simple change. For example, I hacked up a naive "shim" here for use in Dataflow with a custom container by just overwriting apache_beam/io/filesystem.py in site-packages. It's working (a) locally, for both decompression and compression (though the output filenames are malformed: the part schema follows the compression extension), and (b) in a DataflowRunner job reading a GCS dump of all the openwebtext .xz archives. (Without this I've been having a hell of a time getting any horizontal scaling while reading openwebtext.) It may be this simple, but I haven't run any Beam tests on these minor changes. I will probably do a bit more research into that myself.
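To illustrate the kind of chunked decompression a filesystem-level shim would need, here is a minimal sketch using only the standard-library `lzma` module. It is not the shim's code or Beam's implementation; the helper name `read_lines_xz` and the `chunk_size` parameter are hypothetical, and it assumes a single-stream, UTF-8 encoded .xz input.

```python
import io
import lzma


def read_lines_xz(raw, chunk_size=1 << 16):
    """Yield text lines from an .xz byte stream, decompressing chunk by chunk.

    Hypothetical helper for illustration; assumes one XZ stream and UTF-8 text.
    """
    decompressor = lzma.LZMADecompressor()
    pending = b""  # decompressed bytes that do not yet end in a newline
    while True:
        chunk = raw.read(chunk_size)
        if not chunk:
            break
        pending += decompressor.decompress(chunk)
        # Everything before the last newline is a complete line;
        # the remainder is carried over to the next chunk.
        *complete, pending = pending.split(b"\n")
        for line in complete:
            yield line.decode("utf-8")
    if pending:
        # Final line without a trailing newline.
        yield pending.decode("utf-8")


# Round trip: compress a few lines in memory, then stream them back
# with a deliberately small chunk size.
original = ["first line", "second line", "third line"]
compressed = lzma.compress("\n".join(original).encode("utf-8"))
recovered = list(read_lines_xz(io.BytesIO(compressed), chunk_size=16))
```

Reading in fixed-size chunks keeps memory bounded for large archives, which is the property a file-based source needs when it wraps a compressed stream.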

Issue Priority

Priority: 2 (default / most feature requests should be filed as P2)

Issue Components

* (#25316) Added naive first shot at enabling LZMA compression
* (#25316) Added a draft line to CHANGES.md
* (#25316) fix linter issues
* (#25316) update tests (draft)
* (#25316) import order in test file

wrossmorrow · 2023-02-13T14:40:59Z

Merged in #25317

wrossmorrow added awaiting triage new feature labels Feb 5, 2023

github-actions bot added python P2 labels Feb 5, 2023

wrossmorrow added a commit to wrossmorrow/beam-pysdk-io-lzma that referenced this issue Feb 5, 2023

(apache#25316) Added naive first shot at enabling LZMA compression

1b8c72f

wrossmorrow added a commit to wrossmorrow/beam-pysdk-io-lzma that referenced this issue Feb 5, 2023

(apache#25316) Added a draft line to CHANGES.md

294e078

wrossmorrow mentioned this issue Feb 5, 2023

(#25316) Enable LZMA compression in Python SDK I/O #25317

Merged

3 tasks

damccorm removed the awaiting triage label Feb 7, 2023

wrossmorrow added a commit to wrossmorrow/beam-pysdk-io-lzma that referenced this issue Feb 11, 2023

(apache#25316) fix linter issues

4991413

wrossmorrow added a commit to wrossmorrow/beam-pysdk-io-lzma that referenced this issue Feb 11, 2023

(apache#25316) update tests (draft)

641d3f3

wrossmorrow added a commit to wrossmorrow/beam-pysdk-io-lzma that referenced this issue Feb 11, 2023

(apache#25316) import order in test file

1ceac55

wrossmorrow closed this as completed Feb 13, 2023

github-actions bot added this to the 2.46.0 Release milestone Feb 13, 2023

damccorm added the done & done label (Issue has been reviewed after it was closed for verification, followups, etc.) Feb 14, 2023
