refactor: Introduce ConnectorObjectFactory and HiveObjectFactory#13798
refactor: Introduce ConnectorObjectFactory and HiveObjectFactory#13798yingsu00 wants to merge 2 commits intofacebookincubator:mainfrom
Conversation
✅ Deploy Preview for meta-velox canceled.
|
a048207 to
77c5146
Compare
6371e5a to
6edae75
Compare
|
I don't see the benefit doing the dispatch at runtime, it just makes thing a lot fragile. If Hive is linked, corresponding users just create Hive object directly in their code; otherwise they should hide them behind the macro. This way it's much more robust and errors can be caught at compile time. |
Hi @Yuhta, I want to clarify that there isn’t any runtime dispatch in this PR—it’s purely a compile-time refactor to support building when VELOX_ENABLE_HIVE=OFF. With the new ConnectorObjectFactory and HiveObjectFactory, we can migrate tests away from direct Hive references. For example, instead of this We can refactor to: This version removes any direct dependence on Hive headers and relies solely on the common factory. We plan to roll out updates to the various tests incrementally in subsequent PRs. Note that this PR only introduces the new factories—it does not include those test changes. I believed this aligns with our agreement here: #13698. Did I misunderstand any part of that decision? Thanks! |
|
@yingsu00 If we are still going to have |
@Yuhta The core issue, as discussed in #13698, is that HiveConnectorTestBase.cpp and its derived tests currently live in the exec module and directly reference Hive-specific headers (e.g., velox/connectors/hive/HiveConnector.h and HiveConnectorSplitBuilder). When VELOX_ENABLE_HIVE=OFF, these headers—and the Hive connector itself—are excluded from the build, which causes the exec tests to fail to compile. With the introduction of the Now, you might ask: how do we avoid linking Then This fully decouples Hive from the exec tests and allows connector-specific code to evolve independently. I believe we were aligned on this direction in the issue discussion. It makes maintenance easier—changes to Hive won’t impact exec or core tests unnecessarily, and it is important for us. We'd greatly appreciate it if the decoupling can be adopted. This PR, which adds the Hive object factory, is a first step in that direction. The next will be addressing implementing HiveConnectorTestBase with the ConnectorObjectFactor and generalizing it for all connectors. Let me know if that still aligns with your understanding. Thanks! |
velox/core/PlanNode.h
Outdated
There was a problem hiding this comment.
@yingsu00 Ying, I'm still reading the PR, but wondering if you could clarify this TODO. What is this about?
There was a problem hiding this comment.
Sorry this was an old comment. I just removed it.
velox/dwio/common/Options.h
Outdated
There was a problem hiding this comment.
This change seems unrelated? Can it be extracted into a separate PR?
There was a problem hiding this comment.
This change seems unrelated? Can it be extracted into a separate PR?
@mbasmanova Hi Masha, serializing WriterOptions is a pre-requisite to serialize HiveInsertTableHandle, thus commit "Make WriterOptions serializable" is a pre-requisite to the commit "refactor: Introduce ConnectorObjectFactory and HiveObjectFactory". But I can certainly make it a separate PR. I have extract the commit out, added tests and created #14868. Your review is very much appreciated.
There was a problem hiding this comment.
Unrelated change?
If possible, consider extract all unrelated changes into separate PRs. These PR can land independently. Reducing the size of the PR will help get it reviewed faster.
There was a problem hiding this comment.
Unrelated change?
Hi Masha, this is just a cosmetic format fix that derived virtual functions shall be marked as override. I think it is Velox convention, but too small to be a separate commit or PR. I can remove it if you want.
velox/connectors/Connector.h
Outdated
There was a problem hiding this comment.
why not add = default right here?
There was a problem hiding this comment.
why not add = default right here?
Like an inline {} implementation for a virtual destructor, =default() in headers is also an inline definition of the destructor. With an inline virtual destructor, every translation unit that includes the header can emit its own destructor body, vtable, RTTI, etc. This is bad for a widely used common interface, because it may be consumed by multiple connectors or plugins, and each of them would contain one copy of the destructor and vtable. This means any change in the base ConnectorLocationHandle destructor would cause all connectors to be re-compiled. It could also make the build time slower, becasue even if the linker discards duplicates, intermediate .o files are bigger and link times are slower. It may introduce more problems if we enable dynamic linking in the future, like cross .so library cast may fail.
If we have the destructor definition in the cpp file, then changes to its implementation won't cause the header to change, and all downstream TUs don't need to be re-compiled, but just re-linked. It also forces a single vtable and type_info even across library boundaries. Even though we are doing static linking now, it would help improve link time.
So it's better to put all virtual destructor definitions in Connector.cpp. I intended to do it if this PR can get merged. There are some other minor cleanups needed in Connector.h/.cpp, e.g. some structs have toString(), some not. Some structs has connectorId_, some not. Then it'll be desirable to make the interface versioned but that's a future topic.
velox/connectors/ConnectorNames.h
Outdated
There was a problem hiding this comment.
Why do we have these listed here? Do we expect to add all connectors here?
There was a problem hiding this comment.
Why do we have these listed here? Do we expect to add all connectors here?
@mbasmanova We need the connector names in the common layer, because the ConnectorFactory or Connector registries are all keyed by the connector name. All users in outside modules shall be talking with the connector common layer only, instead of refering to the name in specific connectors like connector::hive::HiveConnectorFactory::kHiveConnectorName. If we stick to connector::hive::HiveConnectorFactory::kHiveConnectorName, then the users would have to include HiveConnector.h.
I believe this is the same idea why dwio::common::FileFormat was introduced. With the FileFormat enum in dwio::common, users outside of dwio module can just call getReaderFactory(dwio::common::FileFormat::DWRF)->createReader() to create a DWRF reader without referencing DWRF headers. I have another PR on DWIO refactor #14090 to additionally decouple file formats from users. Your review is also very appreciated there.
If you don't like putting them in another file ConnectorNames.h, we can put them as an enum in Connector.h.
There was a problem hiding this comment.
@mbasmanova Btw I made refactor: Introduce ConnectorNames.h in velox/connectors a separate commit in https://github.com/facebookincubator/velox/pull/14687/commits. It can also be a simple PR by itself. Do you want me to do that?
There was a problem hiding this comment.
these properties are specific to Hive-based connectors and do not apply to other connector (e.g. MySQL connector). Perhaps, we don't want to include these in this generic API.
There was a problem hiding this comment.
Thanks. I have updated the signature:
virtual std::shared_ptr<ConnectorSplit> makeSplit(
const std::string& connectorId,
const folly::dynamic& options = {}) const {
VELOX_UNSUPPORTED("ConnectorSplit not supported by connector", connectorId);
}
There was a problem hiding this comment.
There is no ctor. How will this be set?
There was a problem hiding this comment.
factories -> kFactories
There was a problem hiding this comment.
Are these missing from the .h file?
There was a problem hiding this comment.
@mbasmanova Thanks for the nice catch. I'm so sorry that this commit was indeed incomplete, and I have added the missing signatures in the .h in this PR. The reason why they were missing was that I was thinking to replace this commit with misc: Enhance ConnectorFactory interface , which does not need a separate ConnectorObjectFactory. Could you please take a look to see which way is preferred? If you prefer the other commit, I can pick it over here.
velox/connectors/hive/HiveDataSink.h
Outdated
There was a problem hiding this comment.
@yingsu00 Ying, if the goal is to make tests build with hive disabled, then, why can't we just fix cmake to not build tests that depend on hive when hive is disabled? Are you concerned that this will disable virtually all tests? Why is this a problem? Do you believe that many tests do not need to depend on Hive? If so, can they be re-written to use Values nodes instead of table scan? I know that at least some depends do have hard dependency on Hive because they check filter pushdown behavior. I don't think we can expect all connectors to support that.
Using folly::dynamic to specify connector properties in tests is not ideal. Wondering if that what @Yuhta referred to as "dynamic dispatch". It would be easy to miss a property or mistype its name or value and the error will only appear at runtime. How are you thinking about this? Can we provide an example test?
|
@mbasmanova Thank you very much for reviewing. There are too many exec tests that depend on HiveConnectorTestBase and PlanBuilder. While decoupling a specific connector from the core engine is always a good idea, we would also like to make the tests to cover both Hive and Iceberg. In addition, we hope to promote Iceberg to a standalone connector with its own config files, and this change is a pre-requisite for us to shape the Iceberg code in a cleaner structure. I will reply to your question in more details and address your comments the first thing tomorrow. At the mean time, I wonder if you could briefly take a look at another PR Make PlanBuilder connector agnostic part 1 where the first two commits are equal to the functionality with this PR, but with a slightly different approach. The difference was that instead of adding a new file ConnectorObjectFactory, I merged the new makeXXX functions into the existing ConnectorFactory class, such that the users don't have to register two factories. Could you please let me know which approach you prefer? Thanks! |
@yingsu00 Ying, is this the main goal? If so, it would be helpful to articulate that clearly and avoid suggesting that 'exec' depends on Hive connector. Also, I'm not sure how realistic it is to make tests work for 2 different connectors. Maybe some tests can be shared, but not all. This requires some design discussion.
A quick look suggests that first 2 commits do not include any interesting logic.
|
Hi @mbasmanova Sorry I meant the first 3 commits, especially the 3rd one here: misc: Enhance ConnectorFactory interface |
Hi @mbasmanova, thanks for the feedback and quesitons. The primary goal is to decouple Iceberg from Hive and make it a standalone connector. This reduces maintenance burden and aligns with the industry trend: Hive is a legacy table format, and it is gradually being replaced by more modern and versatile formats like Iceberg, Hudi and Delta Lake. Each of them has some special features, configs, and stats collection requirement, etc. IMHO Velox should evolve with this trend and promote these new standards instead of promoting a single Hive connector. Suppose Hudi and DeltaLake will be added to Velox in the near future, and if the current inheritance structure continues, logic in HiveDataSource and SplitReader (parents of Iceberg equivalents) will become increasingly entangled and harder to maintain. For example:
for me, A cleaner way is to extend the connector interfaces to form a thin, common layer for exec and other modules. Each connector (Hive, Iceberg, etc.) would implement this interface, while external users only depend on the interface. This also moves us toward a stable connector “SPI/ABI”, which would help with versioning and Velox releases in the long run. Such stable, versioned interfaces and releasing will help the community to adopt Velox better and have been desired by many of us for a long time.
Yes thanks for trying to make the description more accurate. But still, I do think more tests can be made table format or connector agnostic, which also improves coverage. Right now, many tests in exec, dwio, common/memory, and even functions depend directly on Hive. While these are test files, it still creates a strong coupling: building with tests enabled requires Hive. Maybe our definition of 'exec' or "dependence" is slightly different. From my humble perspective, tests are part of a broader range of the exec module, and it is an anti-pattern for core modules (including their tests) being tied to one particular connector. This PR and #14687
I agree not all tests can or should be shared between Hive and Iceberg at the current stage. Some candidates we’d like to adjust as the first step are:
All of these rely on HiveConnectorTestBase and PlanBuilder in exec/test/utils/. PR #14687 is about making PlanBuilder support multiple connectors. If you think further discussion is needed, I'll be happy to schedule a review session for it, or utilize the Presto Native Worker Group meeting. What do you think? Your feedback will be highly appreciated! |
This makes sense to me. Let's then focus on that. What is blocking you from making this happen? |
@yingsu00 Yes, I think this would be helpful. Let's discuss 1:1 over VC. |
62dd7eb to
ae6862b
Compare
This commit is the first part of the effort to decouple Hive from exec tests, which aims to make VELOX_ENABLE_HIVE_CONNECTOR=OFF build without errors. The content of this commit include: - Add a new ConnectorObjectFactory interface in velox/connectors, defining abstract methods for creating ConnectoSplits, TableHandles, InsertTableHandles,etc. - Create HiveObjectFactory in velox/connectors/hive that implements the common interface. - Add a new HiveObjectFactoryTest suite to verify that dynamic options yield correct Hive-specific objects without leaking connector internals into core or exec tests.
ae6862b to
f3c7879
Compare
|
This pull request has been automatically marked as stale because it has not had recent activity. If you'd still like this PR merged, please comment on the PR, make sure you've addressed reviewer comments, and rebase on the latest main. Thank you for your contributions! |

This commit is the first part of the effort to decouple Hive from exec tests, which aims to make VELOX_ENABLE_HIVE_CONNECTOR=OFF build without errors. The content of this commit include:
Partially resolves #13698