Skip to content

Conversation

@Rijin-N
Copy link
Contributor

@Rijin-N Rijin-N commented Nov 19, 2024

Description

Add Prestissimo support for Arrow Flight connectors.

The native Arrow Flight connector can be used to connect to any Arrow Flight enabled Data Source. The metadata layer is handled by the Presto coordinator and does not need to be re-implemented in C++. Any Java connector that inherits from presto-base-arrow-flight can use this connector as it's counterpart for the Prestissimo layer.

Different Arrow-Flight enabled data sources can differ in authentication styles. A plugin-style interface is provided to handle such cases with custom authentication code.

Motivation and Context

RFC: https://github.com/prestodb/rfcs/blob/main/RFC-0004-arrow-flight-connector.md#prestissimo-implementation

Impact

Arrow Flight based connectors can be used with Prestissimo

Test Plan

Unit Tests set up a mock arrow flight server and exchange data using the new connector.

Contributor checklist

  • Please make sure your submission complies with our development, formatting, commit message, and attribution guidelines.
  • PR description addresses the issue accurately and concisely. If the change is non-trivial, a GitHub Issue an RFC is referenced.
  • Documented new properties (with its default value), SQL syntax, functions, or other functionality.
  • If release notes are required, they follow the release notes guidelines.
  • Adequate tests were added if applicable.
  • CI passed.

Release Notes

Please follow release notes guidelines and fill in the release notes below.

== RELEASE NOTES ==

Prestissimo (Native Execution)
* Add support for Arrow Flight connectors :pr:`24082`

@Rijin-N Rijin-N requested a review from a team as a code owner November 19, 2024 12:06
@linux-foundation-easycla
Copy link

linux-foundation-easycla bot commented Nov 19, 2024

CLA Signed

The committers listed above are authorized under a signed CLA.

@Rijin-N
Copy link
Contributor Author

Rijin-N commented Nov 19, 2024

This PR requires a velox support PR facebookincubator/velox#11588, to install Arrow Flight dependencies.

@prestodb-ci
Copy link
Contributor

Saved that user @Rijin-N is from IBM

@prestodb-ci prestodb-ci requested a review from a team November 20, 2024 23:17
CMAKE_FLAGS += -DENABLE_ALL_WARNINGS=${ENABLE_WALL}
CMAKE_FLAGS += -DCMAKE_PREFIX_PATH=$(CMAKE_PREFIX_PATH)
CMAKE_FLAGS += -DCMAKE_BUILD_TYPE=$(BUILD_TYPE)
CMAKE_FLAGS += -DPRESTO_ENABLE_ARROW_FLIGHT_CONNECTOR=$(PRESTO_ENABLE_ARROW_FLIGHT_CONNECTOR)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We should stop adding these and instead use EXTRA_CMAKE_FLAGS externally when invoking make

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done


CallOptionsAddHeaders callOptsAddHeaders{};
FlightCallOptions& callOpts = callOptsAddHeaders;
AddCallHeaders& headerWriter = callOptsAddHeaders;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You can just add headers directly to the FlightCallOptions, you don't need to use the AddCallHeaders interface

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The idea here was to have an append-only interface for adding headers to the FlightCallOptions.

Without this, the Authenticator interface would need to accept a mutable reference to FlightCallOptions instead, which theoretically allows the authenticator to modify other parts of the FlightCallOptions, which is not desirable.

It's also less confusing when creating Authenticator implementations, since you don't need to worry about which parts of the FlightCallOptions should or should not be changed.

Copy link
Contributor

@aditi-pandit aditi-pandit left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Some common themes
i) Please add a copyright header to all the new files added in this PR.
ii) Please follow the test case naming as per conventions in https://github.com/facebookincubator/velox/blob/main/CODING_STYLE.md#naming-conventions

set_property(TARGET presto_server_lib PROPERTY JOB_POOL_LINK
presto_link_job_pool)

if(PRESTO_ENABLE_ARROW_FLIGHT_CONNECTOR)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

These lines should be in presto_cpp/main/connectors/arrow_flight/CMakeLists.txt

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done

std::make_unique<IcebergPrestoToVeloxConnector>("iceberg"));
registerPrestoToVeloxConnector(
std::make_unique<TpchPrestoToVeloxConnector>("tpch"));
#ifdef PRESTO_ENABLE_ARROW_FLIGHT_CONNECTOR
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The previous entries are for the base Velox connectors that we will mostly add by default.

Like in other parts of this code, we should add this registration in ArrowFlightConnector.cpp/.hpp if possible.

Another option is to add a ConnectorRegistration.h(cpp) file in presto_cpp/main/connectors folder that includes all the registrations of connectors in its sub-directories and only call that function from here.

e.g say for Windowfunctions in prestosql there is https://github.com/facebookincubator/velox/blob/main/velox/functions/prestosql/window/WindowFunctionsRegistration.h that has all individual function registrations.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done

return std::make_unique<protocol::tpch::TpchConnectorProtocol>();
}

#ifdef PRESTO_ENABLE_ARROW_FLIGHT_CONNECTOR
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This logic should be moved to a file in presto_cpp/main/arrow_flight (ArrowPrestoToVeloxConnector.h/.cpp). The registration logic should be part of ConnectorRegistration file proposed.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done

for (const auto& protocolSplit : source.splits) {
auto split = toVeloxSplit(protocolSplit);
if (split.hasConnectorSplit()) {
// Since the extra credential data is coming from TaskUpdateRequest
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There shouldn't be such special case code in TaskManager.cpp.

The more appropriate way to do this is to add the extraCredentials as a parameter in the toVeloxSplit logic and move this specific logic over to the ArrowToPrestoVeloxConnector::toVeloxSplit method.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done


std::unique_ptr<StaticFlightServer> StaticFlightServerTest::server;

TEST_F(StaticFlightServerTest, ServerTest) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Test names start with lower case. Maybe you can call this test "basic" inline with other tests in the code-base.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done

std::unique_ptr<StaticFlightServer> StaticFlightServerTest::server;

TEST_F(StaticFlightServerTest, ServerTest) {
auto sampleTable = makeArrowTable(
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Its important to add other types as well... Would be great to demo all the scalar and complex types that are supported by this API.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The StaticFlightServer is just a thin wrapper around the basic Flight Server provided by the Arrow Flight SDK. Testing this server against specific datatypes is something better reserved for the unit tests in the Arrow C++ SDK. The test here is just to make sure that our wrapper is working and we're able to perform simple operations.

Note that the StaticFlightServer isn't actually used in the Flight connector code - it is only used in other unit tests for validating the Flight connector against a "real" server.

That said, we do test for all the different types supported by the Flight connector in ArrowFlightConnectorDataTypeTest.cpp. These test the full flow of data from the StaticFlightServer to the connector and out. We're testing with ints, reals, string types, date and time types etc.

{FlightConfig::kServerVerify, "true"}})) {}
};

TEST_F(FlightConnectorTlsNoCertTest, TlsNoCertTest) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Change test names to start with lowercase letters.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done

FlightConnectorTlsNoCertTest()
: FlightConnectorTlsTestBase(std::make_shared<velox::config::ConfigBase>(
std::unordered_map<std::string, std::string>{
{FlightConfig::kServerVerify, "true"}})) {}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Seems like this is always a bad config ? Can we error this at the outset itself ?

Copy link
Contributor

@ashkrisk ashkrisk Nov 28, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This isn't necessarily a bad config. We could be dealing with servers which use certificates signed by trusted CAs, making additional certificates unnecessary.

In our case, this is a purely negative test because our test server is set up with a self-signed certificate, and it's impractical to acquire a certificate signed by a trusted CA just for this purpose.

.assertResults(makeRowVector({decimalVecBigInt}));
}

TEST_F(FlightConnectorDataTypeTest, GeneralTest) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Rename "allTypes"

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done

@Rijin-N Rijin-N force-pushed the arrow-flight-connector branch 2 times, most recently from e51be47 to 9f48cec Compare November 27, 2024 08:40
@Rijin-N Rijin-N requested review from a team and steveburnett as code owners November 27, 2024 08:40
@Rijin-N Rijin-N requested a review from presto-oss November 27, 2024 08:40
The native Arrow Flight connector can be used to connect to any Arrow Flight enabled Data Source. The metadata layer is handled by the Presto coordinator and does not need to be re-implemented in C++. Any Java connector that inherits from `presto-base-arrow-flight` can use this connector as it's counterpart for the Prestissimo layer.

Different Arrow-Flight enabled data sources can differ in authentication styles. A plugin-style interface is provided to handle such cases with custom authentication code.

RFC: https://github.com/prestodb/rfcs/blob/main/RFC-0004-arrow-flight-connector.md#prestissimo-implementation

Co-authored-by: Ashwin Kumar <[email protected]>
Co-authored-by: Rijin-N <[email protected]>
Co-authored-by: Nischay Yadav <[email protected]>
@Rijin-N Rijin-N force-pushed the arrow-flight-connector branch from 9f48cec to bd02070 Compare November 27, 2024 11:57
}

RowVectorPtr FlightDataSource::projectOutputColumns(
std::shared_ptr<arrow::RecordBatch> input) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This can be const&.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done

using namespace velox;
using namespace velox::connector;

// wrapper for CallOptions which doesn't add any members variables
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nit: when leaving comments make sure to fllow the comment guidelines

  • must be complete sentence (period at end and usually capitalization)

etc
This occurs multiple times in a few files.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done

uint64_t completedRows_ = 0;
uint64_t completedBytes_ = 0;
std::shared_ptr<auth::Authenticator> authenticator_;
velox::memory::MemoryPool* pool_;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nit: can be velox::memory::MemoryPool* const pool_;

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done

explicit ArrowFlightConnectorFactory(const char* name)
: ConnectorFactory{name} {}

ArrowFlightConnectorFactory(const char* name, const char* authenticatorName)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why not use std::string_view instead of const char* for the authenticatorName?

Why do we need this constructor in the first place? It doesn't seem to be used at all.
The authenticator comes from the FlightConfig. Assuming there is a different Authenticator implementation why would it not be named/configured in the FlightConfig and instead be hard coded at registration time?

Copy link
Contributor

@ashkrisk ashkrisk Dec 3, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why do we need this constructor in the first place?

The idea here is to be able to register different connector names which use the same Flight connector code, but use different Authenticators. This would avoid the need to have two separate config properties to fully qualify the constructor; instead of specifying connector.name and arrow-flight.authenticator.name, you simply specify connector.name and are good to go.

This would tie in a little better with the Java version, where authentication needs to be handled by subclasses of presto-base-arrow-flight, each of which are independent connectors.

Also note that the config arrow-flight.authenticator.name is not present in the base Java Flight connector. A concrete connector extension would need to introduce this config explicitly to be able to switch between between C++ Authenticators within the same connector.
https://github.com/prestodb/presto/pull/23032/files#diff-9b8ea7947728279eacfeb69fc0492cb2390c3dd65897bc9148ecf1d021d37332

target_link_libraries(
presto_flight_connector
velox_connector
arrow
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What is arrow referring to here? It doesn't look like a target but something that cam from PkgConfig? How does this differ to ${ArrowFlight_LIBRARIES}?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done. Removed arrow, Actually ${ArrowFlight_LIBRARIES} are enough here.


#define AFC_REGISTER_AUTH_FACTORY(factory) \
namespace { \
static bool FB_ANONYMOUS_VARIABLE(g_ConnectorFactory) = ::facebook::presto:: \
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not understanding the reason for keeping this static bool around - to indicate a successful registration of NoOpAuthenticatorFactory. Once the factory is registered it is in the list. What's the point of having the static bool of the fact that it was registered stored?
The check that the registration was successful was already done (via VELOX_CHECK). Nothing is checking this bool again.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This variable is not meant to be used anywhere, which is why it's marked as an FB_ANONYMOUS_VARIABLE. This only reason this variable exists is to ensure that it is statically initialized at runtime, and thus ensure that the Authenticator in question is statically registered.

If C++ had static blocks similar to Java, we could have used that instead.

: Connector{id},
flightConfig_{std::make_shared<FlightConfig>(config)},
authenticator_{
// auth::getAuthenticatorFactory(config_->authenticatorName())
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this needed?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Removed

.assertResults(makeRowVector({boolVec}));
}

TEST_F(FlightConnectorDataTypeTest, IntType) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

better name: integerTypes.
Please see the other comment on naming for the tests (lowerCase) and maybe fix some of the names along the way.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done

protected:
explicit FlightConnectorTestBase(
std::shared_ptr<velox::config::ConfigBase> config)
: config_{config} {}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

std::move(config)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done

install_abseil
install_grpc

wget_and_untar https://archive.apache.org/dist/arrow/arrow-${ARROW_VERSION}/apache-arrow-${ARROW_VERSION}.tar.gz arrow
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Lets use the source from github https://github.com/apache/arrow/archive/apache-arrow-15.0.0.tar.gz. apache.org can have issues with downloads.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done

Copy link
Contributor

@steveburnett steveburnett left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I do not find any documentation for me to review.

Should "Arrow Flight connector" be added to the Supported Use Cases topic in the Presto C++ documentation?

@Rijin-N
Copy link
Contributor Author

Rijin-N commented Dec 5, 2024

Hi @majetideepak @BryanCutler @aditi-pandit @czentgr , We have addressed or responded with our reasoning to all the comments you provided. Could you please review and let us know if there is anything else that needs to be done. Thanks.

Note: This format checker is working fine locally for us. We're not sure why it is failing in this PR.

Copy link
Contributor

@steveburnett steveburnett left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you for the doc! Some suggestions for the data/README.md text.

Copy link
Contributor

@czentgr czentgr left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please rebase too.


std::optional<RowVectorPtr> FlightDataSource::next(
uint64_t size,
velox::ContinueFuture& future) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Here and in the declaration comment out the unused variable name with new name unused to only leave the type signature.
/* unused */
Doesn't seem like it is used so lets indicate it.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done

explicit ArrowFlightConnectorFactory(
const char* name,
const char* authenticatorName = nullptr)
: ConnectorFactory{name}, authenticatorName_{authenticatorName} {}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nit: Lets use () instead of {} for the initializer list for consistency.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done

@ethanyzhang
Copy link
Contributor

Hi @Rijin-N @ashkrisk based on the discussion we had with Shweta and Sandhya, we are gonna take over this work, @BryanCutler will help move this PR over the finishing line. We keep everyone who worked on this PR as co-authors. Thank you. cc @tdcmeehan

@Rijin-N
Copy link
Contributor Author

Rijin-N commented Jan 9, 2025

Hi @ethanyzhang, We have addressed all the comments so far. You can take over from here.
cc: @ashkrisk

Copy link
Contributor

@steveburnett steveburnett left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM! (docs)

Used "View file" in GitHub to view the .md files as they will be published. Looks good, thank you!

@BryanCutler
Copy link
Contributor

Superseded by #24504

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

from:IBM PR from IBM

Projects

None yet

Development

Successfully merging this pull request may close these issues.

9 participants