refactor: Introduce ConnectorObjectFactory and HiveObjectFactory#13798

Closed
yingsu00 wants to merge 2 commits intofacebookincubator:mainfrom
yingsu00:connector_refactor_1

Conversation

@yingsu00
Contributor

@yingsu00 yingsu00 commented Jun 17, 2025

This commit is the first part of the effort to decouple Hive from the exec tests, with the aim of making a VELOX_ENABLE_HIVE_CONNECTOR=OFF build succeed without errors. This commit includes the following changes:

  • Add a new ConnectorObjectFactory interface in velox/connectors, defining abstract methods for creating ConnectorSplits, TableHandles, InsertTableHandles, etc.
  • Create HiveObjectFactory in velox/connectors/hive that implements the common interface.
  • Add a new HiveObjectFactoryTest suite to verify that dynamic options yield correct Hive-specific objects without leaking connector internals into core or exec tests.
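
As a rough illustration of the shape described in the bullets above, a common factory interface with one Hive implementation might look like the following. This is a minimal, self-contained sketch and not the code in this PR: ConnectorSplit, HiveSplit, and the Options alias (a stand-in for folly::dynamic) are simplified placeholders, and only one factory method is shown.

```cpp
#include <cassert>
#include <cstdint>
#include <map>
#include <memory>
#include <string>

// Stand-in for the common ConnectorSplit base; not the real Velox class.
struct ConnectorSplit {
  virtual ~ConnectorSplit() = default;
  virtual std::string toString() const = 0;
};

// Stand-in for folly::dynamic: a flat string-to-integer option bag.
using Options = std::map<std::string, int64_t>;

// Common interface. Callers only need this type and ConnectorSplit.
class ConnectorObjectFactory {
 public:
  virtual ~ConnectorObjectFactory() = default;

  virtual std::shared_ptr<ConnectorSplit> makeConnectorSplit(
      const std::string& filePath,
      uint64_t start,
      uint64_t length,
      const Options& options) const = 0;
};

// Hive-specific split (simplified placeholder).
struct HiveSplit : ConnectorSplit {
  std::string filePath;
  uint64_t start{0};
  uint64_t length{0};
  int64_t splitWeight{0};
  bool cacheable{true};

  std::string toString() const override {
    return "hive:" + filePath + "[" + std::to_string(start) + "," +
        std::to_string(length) + "]";
  }
};

// Hive implementation of the common interface.
class HiveObjectFactory : public ConnectorObjectFactory {
 public:
  std::shared_ptr<ConnectorSplit> makeConnectorSplit(
      const std::string& filePath,
      uint64_t start,
      uint64_t length,
      const Options& options) const override {
    auto split = std::make_shared<HiveSplit>();
    split->filePath = filePath;
    split->start = start;
    split->length = length;
    // Options that are absent keep their default values.
    auto weight = options.find("splitWeight");
    if (weight != options.end()) {
      split->splitWeight = weight->second;
    }
    auto cacheable = options.find("cacheable");
    if (cacheable != options.end()) {
      split->cacheable = cacheable->second != 0;
    }
    return split;
  }
};
```

A caller that holds only a reference to ConnectorObjectFactory gets back a std::shared_ptr<ConnectorSplit> and never has to include a Hive header; only the code that constructs HiveObjectFactory depends on Hive.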

Partially resolves #13698

@netlify

netlify bot commented Jun 17, 2025

Deploy Preview for meta-velox canceled.

Name Link
🔨 Latest commit f3c7879
🔍 Latest deploy log https://app.netlify.com/projects/meta-velox/deploys/68c929573b9c840008ec8e0c

@facebook-github-bot facebook-github-bot added the CLA Signed This label is managed by the Facebook bot. Authors need to sign the CLA before a PR can be reviewed. label Jun 17, 2025
@yingsu00 yingsu00 force-pushed the connector_refactor_1 branch 14 times, most recently from a048207 to 77c5146 Compare June 25, 2025 11:04
@yingsu00 yingsu00 force-pushed the connector_refactor_1 branch 3 times, most recently from 6371e5a to 6edae75 Compare June 26, 2025 07:05
@yingsu00 yingsu00 marked this pull request as ready for review June 27, 2025 08:44
@yingsu00 yingsu00 requested a review from majetideepak as a code owner June 27, 2025 08:44
@yingsu00 yingsu00 requested review from Yuhta and rui-mo and removed request for majetideepak June 27, 2025 08:48
@yingsu00 yingsu00 requested a review from PingLiuPing June 27, 2025 09:15
@Yuhta
Contributor

Yuhta commented Jun 27, 2025

I don't see the benefit of doing the dispatch at runtime; it just makes things a lot more fragile. If Hive is linked, the corresponding users can just create Hive objects directly in their code; otherwise they should hide them behind the macro. This way it's much more robust and errors can be caught at compile time.

@yingsu00
Contributor Author

yingsu00 commented Jun 30, 2025

I don't see the benefit of doing the dispatch at runtime; it just makes things a lot more fragile. If Hive is linked, the corresponding users can just create Hive objects directly in their code; otherwise they should hide them behind the macro. This way it's much more robust and errors can be caught at compile time.

Hi @Yuhta,

I want to clarify that there isn’t any runtime dispatch in this PR. It’s purely a compile-time refactor to support building when VELOX_ENABLE_HIVE_CONNECTOR=OFF.

With the new ConnectorObjectFactory and HiveObjectFactory, we can migrate tests away from direct Hive references. For example, instead of this:

#include "velox/connectors/hive/HiveConnector.h"       // Direct reference to the Hive connector
#include "velox/connectors/hive/HiveConnectorSplit.h"  // Direct reference to the Hive connector
#include "velox/connectors/hive/HiveDataSink.h"        // Direct reference to the Hive connector
...
std::shared_ptr<connector::hive::HiveConnectorSplit>
HiveConnectorTestBase::makeHiveConnectorSplit(
    const std::string& filePath,
    uint64_t start,
    uint64_t length,
    int64_t splitWeight,
    bool cacheable) {
  return HiveConnectorSplitBuilder(filePath)  // Direct reference to the Hive connector
      .start(start)
      .length(length)
      .splitWeight(splitWeight)
      .cacheable(cacheable)
      .build();
}

We can refactor to:

#include "velox/connectors/common/ConnectorObjectFactory.h"  // Only reference to connectors/common

std::shared_ptr<connector::common::ConnectorSplit>
HiveConnectorTestBase::makeHiveConnectorSplit(
    const std::string& filePath,
    uint64_t start,
    uint64_t length,
    int64_t splitWeight,
    bool cacheable) {
  folly::dynamic options = folly::dynamic::object();
  options["splitWeight"] = splitWeight;
  options["cacheable"] = cacheable;
  return objectFactory_->makeConnectorSplit(filePath, start, length, options); // Only reference connectors/common
}

This version removes any direct dependence on Hive headers and relies solely on the common factory. We plan to roll out updates to the various tests incrementally in subsequent PRs. Note that this PR only introduces the new factories—it does not include those test changes.

I believe this aligns with what we agreed on in #13698. Did I misunderstand any part of that decision?

Thanks!

@Yuhta
Contributor

Yuhta commented Jun 30, 2025

@yingsu00 If we are still going to have makeHiveConnectorSplit, why would referencing HiveConnector be a problem?

@yingsu00
Contributor Author

yingsu00 commented Jul 1, 2025

If we are still going to have makeHiveConnectorSplit, why would referencing HiveConnector be a problem?

@Yuhta makeHiveConnectorSplit is just a name—what matters is the underlying dependency. Ideally, it should be renamed to something generic like makeConnectorSplit, and similarly, velox/exec/tests/utils/HiveConnectorTestBase.cpp should become ConnectorTestBase.cpp. That way, it can support constructing connector objects like ConnectorSplits and ColumnHandles for any connector, not just Hive.

The core issue, as discussed in #13698, is that HiveConnectorTestBase.cpp and its derived tests currently live in the exec module and directly reference Hive-specific headers and classes (e.g., velox/connectors/hive/HiveConnector.h and HiveConnectorSplitBuilder). When VELOX_ENABLE_HIVE_CONNECTOR=OFF, these headers, along with the Hive connector itself, are excluded from the build, which causes the exec tests to fail to compile.

With the introduction of the ConnectorObjectFactory, we can decouple the exec tests from Hive entirely. HiveConnectorTestBase (renamed to ConnectorTestBase) can hold a std::shared_ptr<ConnectorObjectFactory> objectFactory_, which can dispatch createXxx functions to the appropriate connector-specific implementations. This allows tests to remain connector-agnostic while still instantiating the right types.

Now, you might ask: how do we avoid linking velox_exec_test_lib directly to every individual connector library? While it's true that, without dynamic plugin loading, the test binary still needs to link against connector libraries it uses (e.g., velox_hive_connector), we can solve this by creating an "umbrella" static target that includes all statically registered connectors. For example, in connectors/CMakeLists.txt:

set(VELOX_STATIC_CONNECTORS)

if (VELOX_ENABLE_HIVE_CONNECTOR)
  add_subdirectory(hive)
  list(APPEND VELOX_STATIC_CONNECTORS velox_hive_connector)
endif()

if (VELOX_ENABLE_TPCH_CONNECTOR)
  add_subdirectory(tpch)
  list(APPEND VELOX_STATIC_CONNECTORS velox_tpch_connector)
endif()
...
add_library(velox_registered_connectors INTERFACE)
target_link_libraries(velox_registered_connectors
  INTERFACE
    ${VELOX_STATIC_CONNECTORS}
)

Then velox_exec_test_lib can simply link against velox_registered_connectors, avoiding direct dependency on individual connector targets.

This fully decouples Hive from the exec tests and allows connector-specific code to evolve independently. I believe we were aligned on this direction in the issue discussion. It makes maintenance easier—changes to Hive won’t impact exec or core tests unnecessarily, and it is important for us. We'd greatly appreciate it if the decoupling can be adopted.

This PR, which adds the Hive object factory, is a first step in that direction. The next step will be reimplementing HiveConnectorTestBase on top of ConnectorObjectFactory and generalizing it for all connectors. Let me know if that still aligns with your understanding. Thanks!

Contributor

@yingsu00 Ying, I'm still reading the PR, but wondering if you could clarify this TODO. What is this about?

Contributor Author

Sorry this was an old comment. I just removed it.

Contributor

This change seems unrelated? Can it be extracted into a separate PR?

Contributor Author

This change seems unrelated? Can it be extracted into a separate PR?

@mbasmanova Hi Masha, serializing WriterOptions is a prerequisite for serializing HiveInsertTableHandle, so the commit "Make WriterOptions serializable" is a prerequisite for the commit "refactor: Introduce ConnectorObjectFactory and HiveObjectFactory". But I can certainly make it a separate PR. I have extracted the commit, added tests, and created #14868. Your review would be very much appreciated.

Contributor

Unrelated change?

If possible, consider extracting all unrelated changes into separate PRs. These PRs can land independently. Reducing the size of the PR will help get it reviewed faster.

Contributor Author

Unrelated change?

Hi Masha, this is just a cosmetic fix: virtual functions overridden in derived classes should be marked override. I think it is a Velox convention, but the change is too small to be a separate commit or PR. I can remove it if you want.

Contributor

why not add = default right here?

Contributor Author

why not add = default right here?

Like an inline {} implementation for a virtual destructor, = default in a header is also an inline definition of the destructor. With an inline virtual destructor, every translation unit that includes the header can emit its own destructor body, vtable, RTTI, etc. This is bad for a widely used common interface, because it may be consumed by multiple connectors or plugins, and each of them would contain one copy of the destructor and vtable. It also means any change to the base ConnectorLocationHandle destructor would cause all connectors to be recompiled. It could also slow down the build, because even if the linker discards duplicates, the intermediate .o files are bigger and link times are longer. It may introduce more problems if we enable dynamic linking in the future; for example, casts across .so boundaries may fail.

If we put the destructor definition in the .cpp file, then changes to its implementation won't change the header, and downstream TUs won't need to be recompiled, only relinked. It also forces a single vtable and type_info even across library boundaries. Even though we are doing static linking now, it would help improve link time.

So it's better to put all virtual destructor definitions in Connector.cpp. I intend to do that if this PR gets merged. There are some other minor cleanups needed in Connector.h/.cpp, e.g. some structs have toString() and some don't; some structs have connectorId_ and some don't. After that it would be desirable to make the interface versioned, but that's a future topic.
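
A minimal sketch of the out-of-line pattern described above, collapsed into one file for illustration. The split between "header" and ".cpp" is marked with comments, HiveLocationHandle is a hypothetical subclass, and the class bodies are simplified stand-ins rather than the real Velox definitions.

```cpp
#include <cassert>
#include <memory>
#include <string>

// ---- What would sit in the header (e.g. Connector.h) ----
class ConnectorLocationHandle {
 public:
  // Declared here, defined out of line. Because this is the first
  // non-inline, non-pure virtual function, toolchains that follow the
  // Itanium C++ ABI emit the vtable only in the translation unit that
  // defines it.
  virtual ~ConnectorLocationHandle();

  virtual std::string toString() const = 0;
};

// ---- What would sit in exactly one .cpp file (e.g. Connector.cpp) ----
// The body is still compiler-generated, but it lives in a single TU.
ConnectorLocationHandle::~ConnectorLocationHandle() = default;

// ---- A connector-side subclass (hypothetical) ----
class HiveLocationHandle : public ConnectorLocationHandle {
 public:
  explicit HiveLocationHandle(bool* destroyed) : destroyed_(destroyed) {}

  // Records that the derived destructor ran.
  ~HiveLocationHandle() override {
    *destroyed_ = true;
  }

  std::string toString() const override {
    return "hive-location";
  }

 private:
  bool* destroyed_;
};
```

Deleting a HiveLocationHandle through a ConnectorLocationHandle pointer still runs the derived destructor; what changes is only where the base destructor and vtable are emitted.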

Contributor

Why do we have these listed here? Do we expect to add all connectors here?

Contributor Author

Why do we have these listed here? Do we expect to add all connectors here?

@mbasmanova We need the connector names in the common layer because the ConnectorFactory and Connector registries are all keyed by connector name. All users in outside modules should talk only to the connector common layer, instead of referring to names defined in specific connectors, like connector::hive::HiveConnectorFactory::kHiveConnectorName. If we stick with connector::hive::HiveConnectorFactory::kHiveConnectorName, users would have to include HiveConnector.h.

I believe this is the same reason dwio::common::FileFormat was introduced. With the FileFormat enum in dwio::common, users outside the dwio module can just call getReaderFactory(dwio::common::FileFormat::DWRF)->createReader() to create a DWRF reader without referencing DWRF headers. I have another PR, #14090, a DWIO refactor that further decouples file formats from their users. Your review would be much appreciated there as well.

If you don't like putting them in another file ConnectorNames.h, we can put them as an enum in Connector.h.
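
To make the idea concrete, here is a small sketch of connector names living in the common layer next to a registry keyed by those names. Everything in it is illustrative: the constant values, the registerFactory and getFactory helpers, and the simplified ConnectorFactory type are assumptions for this example, not the actual Velox declarations.

```cpp
#include <cassert>
#include <map>
#include <memory>
#include <stdexcept>
#include <string>
#include <utility>

// Hypothetical contents of a common ConnectorNames.h. The string values
// are assumptions made for this sketch.
constexpr const char* kHiveConnectorName = "hive";
constexpr const char* kTpchConnectorName = "tpch";

// Simplified stand-in for the common factory base class.
struct ConnectorFactory {
  virtual ~ConnectorFactory() = default;
  virtual std::string describe() const = 0;
};

// Registry keyed by connector name.
inline std::map<std::string, std::shared_ptr<ConnectorFactory>>& factories() {
  static std::map<std::string, std::shared_ptr<ConnectorFactory>> registry;
  return registry;
}

// Returns false if a factory with this name is already registered.
inline bool registerFactory(
    const std::string& name,
    std::shared_ptr<ConnectorFactory> factory) {
  return factories().emplace(name, std::move(factory)).second;
}

// Throws if no factory was registered under the given name.
inline std::shared_ptr<ConnectorFactory> getFactory(const std::string& name) {
  auto it = factories().find(name);
  if (it == factories().end()) {
    throw std::out_of_range("Connector factory not registered: " + name);
  }
  return it->second;
}

// Connector side: only the code that registers the factory sees this type.
struct HiveConnectorFactory : ConnectorFactory {
  std::string describe() const override {
    return "HiveConnectorFactory";
  }
};
```

Code that looks up a factory needs only the common names and getFactory; the Hive type appears solely at the registration site.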

Contributor Author

@mbasmanova Btw I made refactor: Introduce ConnectorNames.h in velox/connectors a separate commit in https://github.com/facebookincubator/velox/pull/14687/commits. It can also be a simple PR by itself. Do you want me to do that?

Comment on lines 35 to 37
Contributor

These properties are specific to Hive-based connectors and do not apply to other connectors (e.g. a MySQL connector). Perhaps we don't want to include them in this generic API.

Contributor Author

Thanks. I have updated the signature:

  virtual std::shared_ptr<ConnectorSplit> makeSplit(
      const std::string& connectorId,
      const folly::dynamic& options = {}) const {
    VELOX_UNSUPPORTED("ConnectorSplit not supported by connector", connectorId);
  }
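
To illustrate the pattern in the updated signature, here is a self-contained sketch of a base-class default that reports the operation as unsupported, with one connector that overrides it and one that does not. std::runtime_error stands in for VELOX_UNSUPPORTED, Options stands in for folly::dynamic, and FileObjectFactory and SinkOnlyObjectFactory are hypothetical names.

```cpp
#include <cassert>
#include <map>
#include <memory>
#include <stdexcept>
#include <string>

// Simplified stand-ins; not the real Velox or folly types.
struct ConnectorSplit {
  virtual ~ConnectorSplit() = default;
};
using Options = std::map<std::string, std::string>;

class ConnectorObjectFactory {
 public:
  virtual ~ConnectorObjectFactory() = default;

  // Default behavior: the connector does not support splits.
  virtual std::shared_ptr<ConnectorSplit> makeSplit(
      const std::string& connectorId,
      const Options& /*options*/ = Options{}) const {
    throw std::runtime_error(
        "ConnectorSplit not supported by connector " + connectorId);
  }
};

struct FileSplit : ConnectorSplit {
  std::string path;
};

// A connector that supports splits overrides the default.
class FileObjectFactory : public ConnectorObjectFactory {
 public:
  std::shared_ptr<ConnectorSplit> makeSplit(
      const std::string& /*connectorId*/,
      const Options& options) const override {
    auto split = std::make_shared<FileSplit>();
    auto it = options.find("filePath");
    if (it != options.end()) {
      split->path = it->second;
    }
    return split;
  }
};

// A connector that has no notion of splits inherits the throwing default.
class SinkOnlyObjectFactory : public ConnectorObjectFactory {};
```

With this shape, connector-specific properties such as the file path travel in the options bag rather than in the generic signature, at the cost of moving validation of those properties to runtime.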

Contributor

There is no ctor. How will this be set?

Contributor

factories -> kFactories

Contributor

Are these missing from the .h file?

Contributor Author

@mbasmanova Thanks for the nice catch. I'm sorry, this commit was indeed incomplete, and I have added the missing signatures to the .h in this PR. They were missing because I was planning to replace this commit with misc: Enhance ConnectorFactory interface, which does not need a separate ConnectorObjectFactory. Could you please take a look and tell me which approach you prefer? If you prefer the other commit, I can pick it into this PR.

Contributor

unrelated change?

Contributor

@yingsu00 Ying, if the goal is to make tests build with Hive disabled, then why can't we just fix CMake to not build tests that depend on Hive when Hive is disabled? Are you concerned that this will disable virtually all tests? Why is this a problem? Do you believe that many tests do not need to depend on Hive? If so, can they be rewritten to use Values nodes instead of table scan? I know that at least some tests do have a hard dependency on Hive because they check filter pushdown behavior. I don't think we can expect all connectors to support that.

Using folly::dynamic to specify connector properties in tests is not ideal. Wondering if that is what @Yuhta referred to as "dynamic dispatch". It would be easy to miss a property or mistype its name or value, and the error would only appear at runtime. How are you thinking about this? Can we provide an example test?
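
The concern about mistyped property names can be shown with a small sketch. It uses a std::map as a stand-in for folly::dynamic, and SplitSettings, fromOptions, and fromOptionsStrict are hypothetical names introduced only for this example.

```cpp
#include <cassert>
#include <cstdint>
#include <map>
#include <stdexcept>
#include <string>

// Stand-in for folly::dynamic: a flat string-to-integer option bag.
using Options = std::map<std::string, int64_t>;

// Typed alternative: a misspelled member name fails to compile.
struct SplitSettings {
  int64_t splitWeight{0};
  bool cacheable{true};
};

// String-keyed parsing: keys it does not know about are silently ignored,
// so a typo leaves the intended field at its default value.
inline SplitSettings fromOptions(const Options& options) {
  SplitSettings settings;
  auto weight = options.find("splitWeight");
  if (weight != options.end()) {
    settings.splitWeight = weight->second;
  }
  auto cacheable = options.find("cacheable");
  if (cacheable != options.end()) {
    settings.cacheable = cacheable->second != 0;
  }
  return settings;
}

// A stricter variant rejects unknown keys, but still only at runtime.
inline SplitSettings fromOptionsStrict(const Options& options) {
  for (const auto& entry : options) {
    if (entry.first != "splitWeight" && entry.first != "cacheable") {
      throw std::invalid_argument("Unknown split option: " + entry.first);
    }
  }
  return fromOptions(options);
}
```

Passing an option named "splitWieght" to fromOptions leaves splitWeight at 0 with no error, whereas writing settings.splitWieght on the typed struct would not compile. Validating keys as in fromOptionsStrict narrows the gap but does not turn the failure into a compile-time one.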

@yingsu00
Contributor Author

@mbasmanova Thank you very much for reviewing. There are too many exec tests that depend on HiveConnectorTestBase and PlanBuilder. While decoupling a specific connector from the core engine is always a good idea, we would also like to make the tests cover both Hive and Iceberg. In addition, we hope to promote Iceberg to a standalone connector with its own config files, and this change is a prerequisite for shaping the Iceberg code into a cleaner structure. I will reply to your questions in more detail and address your comments first thing tomorrow. In the meantime, I wonder if you could briefly take a look at another PR, Make PlanBuilder connector agnostic part 1, whose first two commits are functionally equivalent to this PR but take a slightly different approach. The difference is that instead of adding a new file ConnectorObjectFactory, I merged the new makeXXX functions into the existing ConnectorFactory class, so that users don't have to register two factories. Could you please let me know which approach you prefer? Thanks!

@mbasmanova
Contributor

we would also like to make the tests cover both Hive and Iceberg

@yingsu00 Ying, is this the main goal? If so, it would be helpful to articulate that clearly and avoid suggesting that 'exec' depends on Hive connector. Also, I'm not sure how realistic it is to make tests work for 2 different connectors. Maybe some tests can be shared, but not all. This requires some design discussion.

In the meantime, I wonder if you could briefly take a look at another PR, https://github.com/facebookincubator/velox/pull/14687, whose first two commits are functionally equivalent to this PR but take a slightly different approach.

A quick look suggests that first 2 commits do not include any interesting logic.

[Screenshot taken 2025-09-15 at 4:56:30 PM]

@yingsu00
Contributor Author

yingsu00 commented Sep 15, 2025

first 2 commits do not include any interesting logic.

Hi @mbasmanova Sorry I meant the first 3 commits, especially the 3rd one here: misc: Enhance ConnectorFactory interface

@yingsu00
Contributor Author

we would also like to make the tests cover both Hive and Iceberg

@yingsu00 Ying, is this the main goal?

Hi @mbasmanova, thanks for the feedback and questions. The primary goal is to decouple Iceberg from Hive and make it a standalone connector. This reduces maintenance burden and aligns with the industry trend: Hive is a legacy table format, and it is gradually being replaced by more modern and versatile formats like Iceberg, Hudi and Delta Lake. Each of them has its own special features, configs, stats collection requirements, etc. IMHO Velox should evolve with this trend and promote these new standards instead of promoting a single Hive connector. Suppose Hudi and Delta Lake are added to Velox in the near future: if the current inheritance structure continues, the logic in HiveDataSource and SplitReader (parents of their Iceberg equivalents) will become increasingly entangled and harder to maintain. For example:

  • Iceberg partition columns are part of the file schema, while Hive’s are not. This makes split elimination based on partition keys/values slightly different.
  • Hive’s $row_id and $row_index differ significantly from Iceberg’s synthetic columns, and updating ScanSpec for them doesn’t fit Iceberg’s model.
  • Iceberg schema evolution requires carrying a list of PartitionSpec in IcebergTableHandle, which today would leak into HiveDataSource if kept under the current structure.

To me, a cleaner way is to extend the connector interfaces to form a thin, common layer for exec and other modules. Each connector (Hive, Iceberg, etc.) would implement this interface, while external users would depend only on the interface. This also moves us toward a stable connector “SPI/ABI”, which would help with versioning and Velox releases in the long run. Stable, versioned interfaces and releases would help the community adopt Velox and have been desired by many of us for a long time.

it would be helpful to articulate that clearly and avoid suggesting that 'exec' depends on Hive connector.

Yes, thanks for helping make the description more accurate. Still, I do think more tests can be made table-format or connector agnostic, which also improves coverage. Right now, many tests in exec, dwio, common/memory, and even functions depend directly on Hive. While these are test files, this still creates a strong coupling: building with tests enabled requires Hive. Maybe our definitions of 'exec' or "dependence" are slightly different. From my humble perspective, tests are part of the exec module in a broader sense, and it is an anti-pattern for core modules (including their tests) to be tied to one particular connector. This PR and #14687 are incremental steps toward more modularity that maintain backward compatibility: the old single (Hive) connector registries still work, and gradually adjusting the tests one by one won't break anything.

Also, I'm not sure how realistic it is to make tests work for 2 different connectors. Maybe some tests can be shared, but not all. This requires some design discussion.

I agree not all tests can or should be shared between Hive and Iceberg at the current stage. Some candidates we’d like to adjust as the first step are:

  • TableScanTest
  • TableWriterTest
  • HiveConnectorTest (Adding new IcebergConnectorTest)
  • HiveDataSinkTest (Adding new IcebergDataSinkTest)

All of these rely on HiveConnectorTestBase and PlanBuilder in exec/tests/utils/. PR #14687 is about making PlanBuilder support multiple connectors.

If you think further discussion is needed, I'll be happy to schedule a review session for it, or utilize the Presto Native Worker Group meeting. What do you think? Your feedback will be highly appreciated!

@mbasmanova
Contributor

@yingsu00

The primary goal is to decouple Iceberg from Hive and make it a standalone connector.

This makes sense to me. Let's then focus on that. What is blocking you from making this happen?

@mbasmanova
Contributor

If you think further discussion is needed

@yingsu00 Yes, I think this would be helpful. Let's discuss 1:1 over VC.

@yingsu00 yingsu00 force-pushed the connector_refactor_1 branch 2 times, most recently from 62dd7eb to ae6862b Compare September 16, 2025 08:56
This commit is the first part of the effort to decouple Hive from exec
tests, which aims to make VELOX_ENABLE_HIVE_CONNECTOR=OFF build without
errors. This commit includes the following changes:

- Add a new ConnectorObjectFactory interface in velox/connectors,
  defining abstract methods for creating ConnectorSplits, TableHandles,
  InsertTableHandles, etc.
- Create HiveObjectFactory in velox/connectors/hive that implements the
  common interface.
- Add a new HiveObjectFactoryTest suite to verify that dynamic options
  yield correct Hive-specific objects without leaking connector internals
  into core or exec tests.
@yingsu00 yingsu00 force-pushed the connector_refactor_1 branch from ae6862b to f3c7879 Compare September 16, 2025 09:09
@stale

stale bot commented Dec 16, 2025

This pull request has been automatically marked as stale because it has not had recent activity. If you'd still like this PR merged, please comment on the PR, make sure you've addressed reviewer comments, and rebase on the latest main. Thank you for your contributions!

@stale stale bot added the stale label Dec 16, 2025
@stale stale bot closed this Dec 30, 2025

Development

Successfully merging this pull request may close these issues.

Make Velox connectors as plugins

4 participants