[Feature Request]: Support LZMA compression in python I/O SDKs #25316
Labels
done & done
Issue has been reviewed after it was closed for verification, followups, etc.
new feature
P2
python
What would you like to happen?
LZMA compression is part of the Python standard library (the lzma module), but it is not one of the compression strategies supported by the beam.io.ReadFromText and beam.io.WriteToText PTransforms. openwebtext, for example, uses this compression.

I think this may be a pretty simple change. For example, I hacked up a naive "shim" here for use in Dataflow with a custom container by just overwriting apache_beam/io/filesystem.py in the site-packages. It's working:

(a) locally, with both decompression and compression (though the output filenames are malformed: the part scheme comes after the compression extension), and
(b) in a DataflowRunner reading a GCS dump of all the openwebtext .xz archives.

(Without this I've been having a hell of a time getting any horizontal scaling while reading openwebtext.)

It may be this simple, but I haven't run any Beam tests on these minor changes. I will probably do a bit more research into that myself.

Issue Priority
Priority: 2 (default / most feature requests should be filed as P2)
Issue Components
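For anyone weighing the request above, here is a minimal standard-library sketch of why the change looks small. It is not Beam code and not the shim itself. It assumes the file-reading layer drives compression through incremental compressor/decompressor objects fed chunk by chunk, and shows that lzma exposes the same compress/flush/decompress shape as bz2. The helper name roundtrip_in_chunks and the chunk size are made up for illustration.

```python
import bz2
import lzma


def roundtrip_in_chunks(compressor, decompressor, payload, chunk_size=16):
    """Compress payload in one go, then decompress it in small chunks,
    the way a streaming file reader would feed a decompressor."""
    compressed = compressor.compress(payload) + compressor.flush()
    out = bytearray()
    for start in range(0, len(compressed), chunk_size):
        out += decompressor.decompress(compressed[start:start + chunk_size])
    return bytes(out)


payload = b"line one\nline two\n" * 100

# Both modules provide objects with the same incremental interface,
# so the same driver code handles either format.
via_bz2 = roundtrip_in_chunks(bz2.BZ2Compressor(), bz2.BZ2Decompressor(), payload)
via_lzma = roundtrip_in_chunks(lzma.LZMACompressor(), lzma.LZMADecompressor(), payload)
```

If that assumption about the I/O layer holds, supporting .xz would mostly mean registering one more compression type and wiring in the lzma objects. The malformed output filenames mentioned above would still need a separate fix.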