Use the Pinot Grpc Endpoint for Streaming Server Queries #12332

Merged
martint merged 4 commits into trinodb:master from elonazoulay:pinot-grpc
Jun 22, 2022
Member

@elonazoulay elonazoulay commented May 10, 2022

Description

Add support for querying Pinot via the more efficient GRPC endpoint.

Is this change a fix, improvement, new feature, refactoring, or other?

New feature

Is this a change to the core query engine, a connector, client library, or the SPI interfaces? (be specific)

Pinot connector

Related issues, pull requests, and links

Documentation

(x) Documentation issue #12944 is filed, and can be handled later.

Release notes

(x) Release notes entries required with the following suggested text:

# Pinot
* Add support for querying Pinot via the gRPC endpoint. ({issue}`9296`)

@cla-bot cla-bot bot added the cla-signed label May 10, 2022
@elonazoulay elonazoulay requested review from ebyhr and hashhar May 10, 2022 23:04
@ebyhr ebyhr removed their request for review May 10, 2022 23:57
Member

@hashhar hashhar left a comment

skimmed, leaving some comments about commit boundaries as well (seems too fine-grained)

I'll need to look at the new PageSource in more detail

Member

hashhar commented May 11, 2022

With gRPC streaming, is the only retained memory what is consumed by the byte buffers used to retrieve the gRPC responses? Can we account for them in the PageSource? Is there anything else that would need to be accounted for (DataTable?)?

@elonazoulay
Member Author

Reordered the commits but had to keep `Implement PinotDataFetcher` separate as it depends on the `Use maxPageSize in PinotSegmentPageSource` commit. If anything, I can squash those and move the `Make PinotDataTableWithSize public` commit before the `Add Support for GRPC` commit.

re: memory size: the entire payload in bytes from the response is now added to the memory usage.
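
As a rough illustration of the accounting described in this reply, here is a minimal sketch in plain Java. `SizedResponse` and `StreamingMemoryTracker` are hypothetical names invented for this example and are not the connector's classes; the sketch assumes that "added to the memory usage" means a running total of payload sizes.

```java
// Hypothetical stand-in for one streamed server response. The PR pairs a data
// table with its size (PinotDataTableWithSize); its real shape is not shown here.
final class SizedResponse
{
    private final byte[] payload;

    SizedResponse(byte[] payload)
    {
        this.payload = payload;
    }

    long sizeInBytes()
    {
        return payload.length;
    }
}

// Illustrative tracker (an assumption, not the connector's implementation).
final class StreamingMemoryTracker
{
    private long memoryUsageBytes;

    void onResponse(SizedResponse response)
    {
        // Add the entire payload size of each response to the running total.
        memoryUsageBytes += response.sizeInBytes();
    }

    long getMemoryUsage()
    {
        return memoryUsageBytes;
    }
}
```

A page source could report `getMemoryUsage()` from its own memory-usage method; whether the total should ever shrink once a payload is released is a design question this sketch leaves open.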

Member

@hashhar hashhar left a comment

Some comments about commits but looks good overall.

@Praveen2112 Can you please take a look at the last commit to see if the PageSource implementation looks correct (in terms of memory accounting, isFinished and getReadTimeNanos)?

Contributor

@xiangfu0 xiangfu0 left a comment

lgtm, thanks for adding this support @elonazoulay

Member

Should this be enabled by default?

Member Author

Yep, less impact on the pinot servers and allows for larger result sets.

Member Author

@elonazoulay elonazoulay May 28, 2022

Pinot servers will have to enable the endpoint in their configuration though; by default it's not enabled. I can add a note in the documentation.

Contributor

I recently turned on the gRPC flag in Pinot by default. I think we can enable this by default after one or two versions.

Member

The lambda should use k. Also, give it a more meaningful name. At a minimum, key, but maybe hostAndPort would make it more obvious and readable.
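
The code under review is not shown in this thread. Assuming it was a `Map.computeIfAbsent` call keyed by host and port, the suggestion might look roughly like the sketch below. `ClientCache` and its nested `Client` are hypothetical names used only for illustration.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Illustrative cache of per-server clients (an assumption, not the PR's code).
final class ClientCache
{
    // Hypothetical stand-in for a per-server gRPC client.
    static final class Client
    {
        final String hostAndPort;

        Client(String hostAndPort)
        {
            this.hostAndPort = hostAndPort;
        }
    }

    private final Map<String, Client> clients = new ConcurrentHashMap<>();

    Client getClient(String target)
    {
        // The lambda uses its own, descriptively named parameter
        // instead of capturing the outer variable `target`.
        return clients.computeIfAbsent(target, hostAndPort -> new Client(hostAndPort));
    }
}
```

Using the lambda parameter keeps the mapping function free of captured state, and the name `hostAndPort` documents what the key represents.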

Member

Who manages the lifecycle of these clients? Do they need to be torn down? Can they go into an "invalid" state (in which case, they'd need to be refreshed)?

Member Author

@elonazoulay elonazoulay May 28, 2022

Grpc clients are long lived and use internal threads to manage connections.
The default idle time is 30 minutes for the ManagedChannel to close idle connections.
I added life cycle management and a comment in the code.

  • DONE: Manage lifecycle - close grpc clients on shutdown
  • DONE: Separate configs and modules for grpc and legacy query clients
  • DONE: Remove unused configs (separate commit)
  • TODO: extract to separate commits
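
To illustrate the "close grpc clients on shutdown" item, here is a minimal sketch in plain Java. `ManagedClient` and `ClientRegistry` are hypothetical stand-ins; a real implementation would shut down the underlying gRPC channels and hook into the connector's shutdown mechanism, neither of which is shown in this thread.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-in for a long-lived client that owns a connection.
final class ManagedClient
        implements AutoCloseable
{
    private boolean closed;

    @Override
    public void close()
    {
        closed = true;
    }

    boolean isClosed()
    {
        return closed;
    }
}

// Illustrative owner of all clients (an assumption, not the PR's code).
final class ClientRegistry
        implements AutoCloseable
{
    private final List<ManagedClient> clients = new ArrayList<>();

    ManagedClient register(ManagedClient client)
    {
        clients.add(client);
        return client;
    }

    // Intended to be called once when the connector shuts down.
    @Override
    public void close()
    {
        for (ManagedClient client : clients) {
            client.close();
        }
        clients.clear();
    }
}
```

Giving one object ownership of every client answers the "who manages the lifecycle" question directly: whatever closes the registry closes the clients.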

Comment on lines 76 to 77
Member

Given that the grpc setting is global to the server, instead of passing both, it'd be better for these to implement a common interface and pass the appropriate one depending on how the connector was initialized.
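
One way to read this suggestion, sketched in plain Java. All names here (`ServerQueryClient`, `GrpcServerQueryClient`, `LegacyServerQueryClient`, `QueryClientSelector`) are hypothetical, and in the actual connector the choice would more likely be made through dependency injection bindings than a static method.

```java
// Hypothetical common interface for both protocols.
interface ServerQueryClient
{
    String protocol();
}

final class GrpcServerQueryClient
        implements ServerQueryClient
{
    @Override
    public String protocol()
    {
        return "grpc";
    }
}

final class LegacyServerQueryClient
        implements ServerQueryClient
{
    @Override
    public String protocol()
    {
        return "netty";
    }
}

// Picks exactly one implementation when the connector is initialized.
final class QueryClientSelector
{
    private QueryClientSelector() {}

    static ServerQueryClient select(boolean grpcEnabled)
    {
        return grpcEnabled ? new GrpcServerQueryClient() : new LegacyServerQueryClient();
    }
}
```

Callers then depend only on `ServerQueryClient` and never need to know which protocol was configured.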

@elonazoulay elonazoulay force-pushed the pinot-grpc branch 2 times, most recently from 68a14b4 to b068fbb Compare May 31, 2022 06:09
public HostAndPort getServerGrpcHostAndPort(String serverHost, int grpcPort)
{
    ServerInstance serverInstance = getServerInstance(serverHost);
    return HostAndPort.fromParts(serverInstance.getHostname(), grpcPort);

Contributor

Actually ServerInstance has an API to get grpcPort, so you don't need to pass it.
https://github.com/apache/pinot/blob/master/pinot-core/src/main/java/org/apache/pinot/core/transport/ServerInstance.java#L93

This API can just be: public HostAndPort getServerGrpcHostAndPort(String serverHost)

In this case, you don't even need the config for grpcPort; just make this a best-effort attempt. If the gRPC port is -1, return null here and the query will use the Netty query endpoint. Once the server has gRPC, the query will use the gRPC query endpoint.

Member Author

I tried this against a real cluster and the server instance returned -1 for the gRPC port, so it did not work. Is it OK to keep it explicitly specified? lmk what you think, @xiangfu0.

Contributor

You mean gRPC is enabled but ServerInstance gives gRPC port -1? If that is the case, please keep this grpcPort config as the backup, and use the port from ServerInstance when it is > 0.
So the logic is: best effort from ServerInstance if possible, otherwise use the configured grpcPort.
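
The fallback logic described here can be sketched as a small helper. `GrpcPortResolver` and its parameter names are hypothetical; the sketch only assumes, as stated in this thread, that the server instance reports -1 when it has no gRPC port to advertise.

```java
// Illustrative helper (an assumption, not the PR's code).
final class GrpcPortResolver
{
    private GrpcPortResolver() {}

    // advertisedPort: the port reported by the server instance (-1 when not provided)
    // configuredPort: the fallback value from the connector configuration
    static int resolve(int advertisedPort, int configuredPort)
    {
        // Prefer the port advertised by the server; otherwise use the configured one.
        return advertisedPort > 0 ? advertisedPort : configuredPort;
    }
}
```

For example, `GrpcPortResolver.resolve(-1, 8090)` falls back to the configured port, while `GrpcPortResolver.resolve(9000, 8090)` uses the advertised one.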

public class PinotGrpcServerQueryClientConfig
{
    private int maxRowsPerSplitForSegmentQueries = Integer.MAX_VALUE - 1;
    private int grpcPort = 8090;

Contributor

I think we don't need this grpcPort, just get it directly if grpc is enabled.

@xiangfu0 xiangfu0 requested a review from martint June 4, 2022 02:59
@xiangfu0
Contributor

xiangfu0 commented Jun 6, 2022

@martint can you review this again?

{
    Map<String, String> metadata = dataTable.getMetadata();
    List<String> exceptions = new ArrayList<>();
    metadata.forEach((k, v) -> {

Member

Use more descriptive names for k and v. It's not clear what they represent.

Member Author

Fixed, was from legacy code. Also thanks to @xiangfu0 for the commit!

Comment on lines +80 to +87
interface Factory
{
    PinotDataFetcher create(ConnectorSession session, String query, PinotSplit split);
}

interface PinotServerQueryClient
{
    Iterator<PinotDataTableWithSize> queryPinot(ConnectorSession session, String query, String serverHost, List<String> segments);

Member

Why do we need both a Factory and a PinotServerQueryClient interface? I'm not sure I understand the relationship between the two. Why do we need different implementations of the Factory, given that the query client is already abstracted to support both underlying protocols?

Contributor

Makes sense. I think we just need the Factory; there is no need for PinotServerQueryClient.

Member Author

Sounds good! Updating.

Member Author

Updated, apologies, should have removed that before.

@elonazoulay
Member Author

TODO: extract to separate commits, lmk if that will make it easier to review

@xiangfu0 xiangfu0 requested a review from martint June 16, 2022 18:45
@xiangfu0
Contributor

I feel it's good to keep this as a single PR for the complete feature. It's also simpler for future reference.

Contributor

@xiangfu0 xiangfu0 left a comment

Thanks for putting this up!

Member

This is no longer used

@martint martint merged commit ce0bef3 into trinodb:master Jun 22, 2022
@github-actions github-actions bot added this to the 387 milestone Jun 22, 2022
@xiangfu0
Contributor

Thanks @martint @elonazoulay for getting this PR in! 🥂

Development

Successfully merging this pull request may close these issues.

[Pinot connector] Support gRPC connection to Pinot Broker