refactor: Make register readers and writers file format agnostic#14090
refactor: Make register readers and writers file format agnostic#14090yingsu00 wants to merge 3 commits intofacebookincubator:mainfrom
Conversation
✅ Deploy Preview for meta-velox canceled.
|
82c2185 to
7a16e68
Compare
4c10ab7 to
58e9af2
Compare
58e9af2 to
ad29306
Compare
7448517 to
2a82039
Compare
2a82039 to
8d1992c
Compare
46060e8 to
a9c8811
Compare
41b8d24 to
be00468
Compare
|
@pedroerp @Yuhta @mbasmanova @kgpai @assignUser Could you please review this PR? It cleans up the DWIO interface and allows for more test coverage on multiple file formats in the future. For example, we will be able to test multiple file formats with the existing TableScanTest and TableWriterTest, if we make file format a test parameter. |
be00468 to
8a8d133
Compare
| #ifdef VELOX_ENABLE_PARQUET | ||
| parquet::registerParquetReaderFactory(); | ||
| #endif | ||
| } |
There was a problem hiding this comment.
Would you mind sharing the specific issue you're experiencing with the separate registrations? I assume Spark SQL typically does not register the DWRF readers and writers by default, and prior to this change, it was possible to register only the necessary components as needed.
There was a problem hiding this comment.
@rui-mo Thanks for reviewing! There are a few motivations:
- We wanted the tests to be able to cover both Dwrf and Parquet in a more elegant way, e.g. TableScanTest, TableWriterTest, HiveDataSinkTest, etc. These tests only work on a specific format, mostly DWRF. These tests, being in exec or core, which are more core engine stuff, should stay file format agnostic. We shall also be able to use these tests to test on Parquet or other file formats. I believe this is a well known design principle and that's why the ReaderFactory and WriterFactory were introduced in the past. But there were many tests not following this paradigm, making reusing these tests for other file formats hard. With the changes in this PR, we can make the tests parameterized with the dwio::common::FileFormat enums like this
class OperatorTestBase : public ::testing::TestWithParam<dwio::common::FileFormat>
public velox::test::VectorTestBase
That way we don't have to introduce new tests specifically for Parquet or new file format, but just turning the tests to TEST_P tests that can automatically iterate through the file format collection.
-
It will also help to add more file readers in a modular way. We will need to add AVRO sometime in the future. If the new AVRO reader can folllow the DWIO interface, then it can just be plugged in and the engine automatically works because it's file format agnostic.
-
This will help us to make the Velox Parquet reader shareable with other communities like the Iceberg community.
I added a paragraph in the design doc and the PR message to explain why this is a good thing: #14514
There was a problem hiding this comment.
@rui-mo Also, the single file format registry is still supported. So there is no backward compatibility issue for Gluten.
There was a problem hiding this comment.
Thanks for explaining the benefits. Regarding the first point, I find TableScanTest is in the velox/exec, and I agree with you to make it format agnostic.
assignUser
left a comment
There was a problem hiding this comment.
This sounds like a great change, untangling dependencies will improve future changes like determining a public API! So in principle I am in favor of this change!
As I have limited time and this is a big PR but the CMake changes are only tangential, I will wait with my review until other maintainers have reviewed and approved the core idea.
assignUser
left a comment
There was a problem hiding this comment.
Please resolve conflicts/rebase so CI can run.
| add_subdirectory(parquet) | ||
| endif() | ||
|
|
||
| add_library(velox_dwio_registered_readers STATIC RegisterReaders.cpp) |
There was a problem hiding this comment.
Why do you force it to be static?
There was a problem hiding this comment.
@assignUser Thanks for reviewing! I recall that it forces CMake to build libvelox_dwio_registered_readers.a and help Linux system to build correctly. It is not an umbrella target (purely transitive deps, no actual objects) where "INTERFACE" is usually for. It actually does have code in RegisterReaders.cpp, so it would be either STATIC, OBJECT or SHARED. Since we want outside users to link with it so it cannot be "OBJECT" here. It is manual registration, and it is never supposed to be loaded dynamically on its own, so it won't be "SHARED" either. I believe we are not doing dynamic loading anyways, so by default it is still "STATIC", I just explicitly specified it out to make it clear.
There was a problem hiding this comment.
Sorry, not sure how I missed this before but this should be part of the main library as it's not a utility for tests or fuzzers, so please use our wrapper functions:
velox_add_librarywithout any library type keywordvelox_link_librariesinstead oftarget_link_libraries.
Sorry, this isn't well documented yet (#14941).
7db67b6 to
6e40bf6
Compare
| add_subdirectory(parquet) | ||
| endif() | ||
|
|
||
| add_library(velox_dwio_registered_readers STATIC RegisterReaders.cpp) |
There was a problem hiding this comment.
Sorry, not sure how I missed this before but this should be part of the main library as it's not a utility for tests or fuzzers, so please use our wrapper functions:
velox_add_librarywithout any library type keywordvelox_link_librariesinstead oftarget_link_libraries.
Sorry, this isn't well documented yet (#14941).
To decouple file format specific readers and writers from the rest parts of Velox, and make more tests to cover both DWRF and Parquet formats, we can package them into velox_dwio_registered_readers and velox_dwio_regis tered_writers targets. The users (tests) can just link with them without having to specify which file format reader it is, and call registerReaderFactories() or registerWriterFactories() to register the readers or writers. In the next step, we will remove all file format-specific headers from the tests and use dwio::common::ReaderFactory or WriterFactor to create the readers and writers.
This commit gives an example how to remove file format-specific readers and writers headers from modules other than dwio, and use the dwio::common::ReaderFactory and dwio::common::WriterFactory to get the file Reader's and Writer's.
6e40bf6 to
1d168ff
Compare
|
This pull request has been automatically marked as stale because it has not had recent activity. If you'd still like this PR merged, please comment on the PR, make sure you've addressed reviewer comments, and rebase on the latest main. Thank you for your contributions! |
This is the first PR to resolve #14514
Why Consolidate DWIO Interfaces?
Previously, DWIO tests and utilities directly included format-specific headers (e.g., RegisterDwrfReader.h, RegisterParquetReader.h) and invoked explicit registration functions. This created tight coupling between DWIO core, tests, and file format implementations. It also meant that every new format increased maintenance overhead: tests had to be updated, build rules had to pull in format targets individually, and Linux static builds risked missing registrations if static initializers weren’t pulled in. By consolidating registration behind velox_dwio_registered_readers and velox_dwio_registered_writers, we achieve a clean separation of concerns:
This consolidation improves maintainability, eliminates brittle static-init/linker behaviors, and makes DWIO more modular and predictable
What's done in this PR
bundles; tests link these and call registerAllReaders() / registerAllWriters().
dwio::common::{Reader,Writer}Factory, keyed by FileFormat, with no
format‑specific includes.
Backward Compatibility
The single file format registry is kept and the users can still choose to register a single file format instead of all. The newly introduced registerAllReaders(), registerAllWriters() functions are synthetic sugar for tests that need to cover more file formats.
Next Steps
We will remove all file format-specific headers from the tests and use dwio::common::ReaderFactory or WriterFactor to create the readers and writers. This work will be done gradually and the last commit in this PR provides an example.
We will introduce a FlushPolicyFactory so tests (and writers) can create
policies with a format-agnostic registry, without referencing Parquet/DWRF classes.
Final Goal
All tests stay file format agnostic, except the tests within the dwio module.