Skip to content

refactor: Make register readers and writers file format agnostic#14090

Closed
yingsu00 wants to merge 3 commits intofacebookincubator:mainfrom
yingsu00:reader_registry_3
Closed

refactor: Make register readers and writers file format agnostic#14090
yingsu00 wants to merge 3 commits intofacebookincubator:mainfrom
yingsu00:reader_registry_3

Conversation

@yingsu00
Copy link
Copy Markdown
Contributor

@yingsu00 yingsu00 commented Jul 12, 2025

This is the first PR to resolve #14514

Why Consolidate DWIO Interfaces?

Previously, DWIO tests and utilities directly included format-specific headers (e.g., RegisterDwrfReader.h, RegisterParquetReader.h) and invoked explicit registration functions. This created tight coupling between DWIO core, tests, and file format implementations. It also meant that every new format increased maintenance overhead: tests had to be updated, build rules had to pull in format targets individually, and Linux static builds risked missing registrations if static initializers weren’t pulled in. By consolidating registration behind velox_dwio_registered_readers and velox_dwio_registered_writers, we achieve a clean separation of concerns:

  • Core stays format-agnostic, without pulling in Parquet/DWRF/ORC/Text headers or generated proto files.
  • Improved test coverage: because tests no longer hard-code a single format, the same test code can exercise all supported formats. This broadens coverage, catches more corner cases, and increases confidence that DWIO core logic works consistently across file formats.
  • Simpler build and linking: downstreams link against one bundle target instead of manually pulling in every format.
  • Reliable initialization: explicit bulk registration (registerAllReaders(), registerAllWriters()) guarantees factories are available, avoiding platform-dependent linker behavior.
  • Extensibility: new formats can be integrated by extending the bundles, with no changes needed in consumers.

This consolidation improves maintainability, eliminates brittle static-init/linker behaviors, and makes DWIO more modular and predictable

What's done in this PR

  • Added velox_dwio_registered_readers and velox_dwio_registered_writers
    bundles; tests link these and call registerAllReaders() / registerAllWriters().
  • Tests and users now construct readers/writers through
    dwio::common::{Reader,Writer}Factory, keyed by FileFormat, with no
    format‑specific includes.

Backward Compatibility

The single file format registry is kept and the users can still choose to register a single file format instead of all. The newly introduced registerAllReaders(), registerAllWriters() functions are synthetic sugar for tests that need to cover more file formats.

Next Steps

  1. We will remove all file format-specific headers from the tests and use dwio::common::ReaderFactory or WriterFactor to create the readers and writers. This work will be done gradually and the last commit in this PR provides an example.

  2. We will introduce a FlushPolicyFactory so tests (and writers) can create
    policies with a format-agnostic registry, without referencing Parquet/DWRF classes.

Final Goal

All tests stay file format agnostic, except the tests within the dwio module.

@netlify
Copy link
Copy Markdown

netlify bot commented Jul 12, 2025

Deploy Preview for meta-velox canceled.

Name Link
🔨 Latest commit 1d168ff
🔍 Latest deploy log https://app.netlify.com/projects/meta-velox/deploys/68e45064c7409100082fc265

@facebook-github-bot facebook-github-bot added the CLA Signed This label is managed by the Facebook bot. Authors need to sign the CLA before a PR can be reviewed. label Jul 12, 2025
@yingsu00 yingsu00 force-pushed the reader_registry_3 branch 7 times, most recently from 82c2185 to 7a16e68 Compare July 21, 2025 15:36
@yingsu00 yingsu00 force-pushed the reader_registry_3 branch 12 times, most recently from 4c10ab7 to 58e9af2 Compare July 25, 2025 15:01
@yingsu00 yingsu00 force-pushed the reader_registry_3 branch from 58e9af2 to ad29306 Compare July 30, 2025 23:52
@yingsu00 yingsu00 changed the title Reader registry 3 refactor: Make register readers and writers file format agnostic Jul 31, 2025
@yingsu00 yingsu00 force-pushed the reader_registry_3 branch 4 times, most recently from 7448517 to 2a82039 Compare August 2, 2025 04:48
@yingsu00 yingsu00 requested a review from assignUser August 19, 2025 06:46
@yingsu00
Copy link
Copy Markdown
Contributor Author

yingsu00 commented Sep 9, 2025

@pedroerp @Yuhta Could you please review this PR? Thanks!

@yingsu00 yingsu00 requested a review from PingLiuPing September 9, 2025 20:47
@yingsu00 yingsu00 force-pushed the reader_registry_3 branch 2 times, most recently from 41b8d24 to be00468 Compare September 9, 2025 22:57
@yingsu00 yingsu00 requested a review from mbasmanova September 10, 2025 02:13
@yingsu00
Copy link
Copy Markdown
Contributor Author

@pedroerp @Yuhta @mbasmanova @kgpai @assignUser Could you please review this PR? It cleans up the DWIO interface and allows for more test coverage on multiple file formats in the future. For example, we will be able to test multiple file formats with the existing TableScanTest and TableWriterTest, if we make file format a test parameter.

Copy link
Copy Markdown
Collaborator

@rui-mo rui-mo left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks!

#ifdef VELOX_ENABLE_PARQUET
parquet::registerParquetReaderFactory();
#endif
}
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Would you mind sharing the specific issue you're experiencing with the separate registrations? I assume Spark SQL typically does not register the DWRF readers and writers by default, and prior to this change, it was possible to register only the necessary components as needed.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@rui-mo Thanks for reviewing! There are a few motivations:

  1. We wanted the tests to be able to cover both Dwrf and Parquet in a more elegant way, e.g. TableScanTest, TableWriterTest, HiveDataSinkTest, etc. These tests only work on a specific format, mostly DWRF. These tests, being in exec or core, which are more core engine stuff, should stay file format agnostic. We shall also be able to use these tests to test on Parquet or other file formats. I believe this is a well known design principle and that's why the ReaderFactory and WriterFactory were introduced in the past. But there were many tests not following this paradigm, making reusing these tests for other file formats hard. With the changes in this PR, we can make the tests parameterized with the dwio::common::FileFormat enums like this
class OperatorTestBase : public ::testing::TestWithParam<dwio::common::FileFormat>
                         public velox::test::VectorTestBase  

That way we don't have to introduce new tests specifically for Parquet or new file format, but just turning the tests to TEST_P tests that can automatically iterate through the file format collection.

  1. It will also help to add more file readers in a modular way. We will need to add AVRO sometime in the future. If the new AVRO reader can folllow the DWIO interface, then it can just be plugged in and the engine automatically works because it's file format agnostic.

  2. This will help us to make the Velox Parquet reader shareable with other communities like the Iceberg community.

I added a paragraph in the design doc and the PR message to explain why this is a good thing: #14514

Copy link
Copy Markdown
Contributor Author

@yingsu00 yingsu00 Sep 11, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@rui-mo Also, the single file format registry is still supported. So there is no backward compatibility issue for Gluten.

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for explaining the benefits. Regarding the first point, I find TableScanTest is in the velox/exec, and I agree with you to make it format agnostic.

Copy link
Copy Markdown
Collaborator

@assignUser assignUser left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This sounds like a great change, untangling dependencies will improve future changes like determining a public API! So in principle I am in favor of this change!

As I have limited time and this is a big PR but the CMake changes are only tangential, I will wait with my review until other maintainers have reviewed and approved the core idea.

@Yuhta Yuhta added the ready-to-merge PR that have been reviewed and are ready for merging. PRs with this tag notify the Velox Meta oncall label Sep 19, 2025
Copy link
Copy Markdown
Collaborator

@assignUser assignUser left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please resolve conflicts/rebase so CI can run.

add_subdirectory(parquet)
endif()

add_library(velox_dwio_registered_readers STATIC RegisterReaders.cpp)
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why do you force it to be static?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@assignUser Thanks for reviewing! I recall that it forces CMake to build libvelox_dwio_registered_readers.a and help Linux system to build correctly. It is not an umbrella target (purely transitive deps, no actual objects) where "INTERFACE" is usually for. It actually does have code in RegisterReaders.cpp, so it would be either STATIC, OBJECT or SHARED. Since we want outside users to link with it so it cannot be "OBJECT" here. It is manual registration, and it is never supposed to be loaded dynamically on its own, so it won't be "SHARED" either. I believe we are not doing dynamic loading anyways, so by default it is still "STATIC", I just explicitly specified it out to make it clear.

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry, not sure how I missed this before but this should be part of the main library as it's not a utility for tests or fuzzers, so please use our wrapper functions:

  • velox_add_library without any library type keyword
  • velox_link_libraries instead of target_link_libraries.

Sorry, this isn't well documented yet (#14941).

@yingsu00 yingsu00 force-pushed the reader_registry_3 branch 3 times, most recently from 7db67b6 to 6e40bf6 Compare October 2, 2025 07:51
add_subdirectory(parquet)
endif()

add_library(velox_dwio_registered_readers STATIC RegisterReaders.cpp)
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry, not sure how I missed this before but this should be part of the main library as it's not a utility for tests or fuzzers, so please use our wrapper functions:

  • velox_add_library without any library type keyword
  • velox_link_libraries instead of target_link_libraries.

Sorry, this isn't well documented yet (#14941).

@assignUser assignUser removed the ready-to-merge PR that have been reviewed and are ready for merging. PRs with this tag notify the Velox Meta oncall label Oct 4, 2025
To decouple file format specific readers and writers from the rest parts
of Velox, and make more tests to cover both DWRF and Parquet formats, we
can package them into velox_dwio_registered_readers and velox_dwio_regis
tered_writers targets. The users (tests) can just link with them without
having to specify which file format reader it is, and call
registerReaderFactories() or registerWriterFactories() to register the
readers or writers.

In the next step, we will remove all file format-specific headers from
the tests and use dwio::common::ReaderFactory or WriterFactor to create
the readers and writers.
This commit gives an example how to remove file format-specific readers
and writers headers from modules other than dwio, and use the
dwio::common::ReaderFactory and dwio::common::WriterFactory to get the
file Reader's and Writer's.
@stale
Copy link
Copy Markdown

stale bot commented Jan 5, 2026

This pull request has been automatically marked as stale because it has not had recent activity. If you'd still like this PR merged, please comment on the PR, make sure you've addressed reviewer comments, and rebase on the latest main. Thank you for your contributions!

@stale stale bot added the stale label Jan 5, 2026
@stale stale bot closed this Jan 19, 2026
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

CLA Signed This label is managed by the Facebook bot. Authors need to sign the CLA before a PR can be reviewed. stale

Projects

None yet

Development

Successfully merging this pull request may close these issues.

DWIO Registry Consolidation

5 participants