[native] SystemConnector to query system.runtime.tasks table #21416
aditi-pandit merged 1 commit into master
Conversation
majetideepak left a comment:
@aditi-pandit nice change! some comments.
mbasmanova left a comment:
@aditi-pandit Is there a design document about this connector? I assume SystemConnector requires access to Metastore and we don't have that on the worker. Hence, wondering how this will work?
CC: @spershin
@mbasmanova: This SystemConnector code was to query the tasks table. That seemed to be the only part of the SystemConnector needed at the worker for Prestissimo. Other tables like nodes and queries were populated from in-memory structures in the coordinator itself. Any code accessing the Metastore (like TablePropertiesSystemTable, say) seemed to be required only in the coordinator part of the connector. I just spent a day on this prototype to wire the pieces together. I haven't put together a design doc.
mbasmanova left a comment:
@aditi-pandit Aditi, thank you for clarifying. It is interesting that the tasks table is populated on the workers. I wonder why; all the information is available on the coordinator. CC: @tdcmeehan
I think the reason for this is historical: you could always deploy Presto in a mode where many or all of the workers also functioned as coordinators. In this mode, any single coordinator would only know of the tasks whose queries are local to that coordinator.
@tdcmeehan Tim, thank you for clarifying. I didn't know about this deployment scheme. I'm not sure I understand how this works though. When there are multiple coordinators, wouldn't query results depend on which coordinator is being asked to process the query? Are you saying that in this setup a query can be routed to any coordinator and the results are expected to be the same? I guess in this case it is necessary to ask all the workers to report their tasks since, as you pointed out, a single coordinator knows about a subset of tasks only.
Generally speaking, Java workers are not compatible with native workers: they use different hash functions and different intermediate results for aggregations. Hence, we had to make a change to run the system connector only on the coordinator and to introduce an exchange before the partial aggregation. These changes may get in the way of making this PR work.
In this scheme, queries are sticky to a single coordinator (after you POST a query, each …
@tdcmeehan Tim, thank you for clarifying. One more follow-up question: in a multi-coordinator deployment, do all workers report themselves to all coordinators, or is a given worker assigned to just one coordinator? In other words, do we have N coordinators managing a shared pool of workers, or just N "mini" clusters that are independent of each other?
Workers report themselves to a single discovery service, which is either replicated to other coordinators in an eventually consistent manner, or the discovery service is a single process which is separate from the coordinators. Originally, when this system connector was written, there was no concept of shared resources (e.g. resource groups, global memory management, etc.) and it relied purely on individual backpressure from workers, although there are now tools to help make that work. |
@tdcmeehan Tim, I wonder if it still makes sense to support this deployment model. What do you think? Does it make sense to consider it when thinking about native workers?
Tactically and short term, I think it would be great to support this if there were an easy and not hacky way to get it to work with #21725 and #21285. But given that most people would be deploying Presto for their large to medium size data lakes, I don't think an Impala-style deployment model makes sense for Presto's future, and personally I feel comfortable saying we can deprecate it in the future. That being said, system tables in the coordinator present a challenge for what I feel is one of the end goals of moving to native, which is simplifying our Java code. I'd like to think about a way to move this to C++ so it doesn't need to be present in the Java SPI (thinking way ahead in the future, if the only reason we retain page source providers is for system tables, I think it would be worthwhile to think about how to move system tables to C++). So I'd like to revisit the presumption at some point that system tables must be coordinator-provided tables, since even now that's not necessarily true.
@mbasmanova, @tdcmeehan: Thanks for the discussion; it has been informative. If we want to stay with this approach of getting the tasks table on the worker, we could modify #21725 and #21285 to not perform those rewrites for the system.runtime.tasks table specifically, since it is based on the worker. #21725 could work unmodified as well; it would just mean that we don't allow partial agg over the tasks table, which might not be a big deal unless a massive number of queries are scheduled in the cluster. wdyt?
The other fixable issue we are hitting internally in a large setup when querying system tables is that the native worker does not yet handle chunked HTTP responses. @tdcmeehan, do you know what causes a chunked HTTP response from the coordinator? I tried reproducing with a large system table (many entries) but could not.
@majetideepak Chunked responses used to be produced by the task/async endpoint, which was removed in #21772. You should not see issues if you update past that PR.
@mbasmanova thank you for the pointer!
@mbasmanova, @majetideepak: Thanks for your previous input. This code is ready for a full review now. Looking forward to your comments.
The tasks table gets data from all nodes, so both the coordinator and the workers. Since the coordinator generates data, both of the previous planner rules are also applicable.
arhimondr left a comment:
Mostly style related comments / questions. Otherwise looks good.
class SystemTableHandle : public velox::connector::ConnectorTableHandle {
 public:
  explicit SystemTableHandle(
      std::string connectorId,
Do we usually prefer to pass by value and then move? Or pass by const reference and copy? When do we prefer one over the other?
@arhimondr: Good question. I prefer pass by const ref and copy, to avoid use-after-move at the caller. But I've seen pass by value and move as a common pattern in Velox, especially in PlanNode construction.
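To make the trade-off concrete, here is a minimal sketch contrasting the two constructor styles discussed above (the class names are illustrative, not the actual presto_cpp types): pass-by-value-and-move lets an rvalue argument be moved all the way through, while const-ref-and-copy always copies but never leaves the caller's argument in a moved-from state.

```cpp
#include <cassert>
#include <string>
#include <utility>

// Sink style: pass by value, then move. An rvalue argument is moved all the
// way through; an lvalue costs one copy into the parameter plus one move.
class ByValueHandle {
 public:
  explicit ByValueHandle(std::string connectorId)
      : connectorId_(std::move(connectorId)) {}

  const std::string& connectorId() const {
    return connectorId_;
  }

 private:
  std::string connectorId_;
};

// Const-ref style: always one copy inside the constructor, but the caller's
// argument is never moved from, so no use-after-move is possible at the
// call site.
class ByRefHandle {
 public:
  explicit ByRefHandle(const std::string& connectorId)
      : connectorId_(connectorId) {}

  const std::string& connectorId() const {
    return connectorId_;
  }

 private:
  std::string connectorId_;
};
```

With the by-value signature, `ByValueHandle h(std::move(id));` avoids the copy but leaves `id` in a moved-from state afterwards, which is exactly the caller-side hazard mentioned above.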
majetideepak left a comment:
@aditi-pandit some comments. Thanks!
obj["lastHeartbeatMs"] = lastHeartbeatMs;
obj["lastTaskStatsUpdateMs"] = lastTaskStatsUpdateMs;
obj["lastMemoryReservation"] = lastMemoryReservation;
obj["createTime"] = createTime;
Why are we updating these values in this PR?
@majetideepak: In the other createTime-style fields the values were changed to a timestamp, so there was conversion back and forth. Hence, these new fields were added.
- { name: HiveTableLayoutHandle, key: hive }
- { name: IcebergTableLayoutHandle, key: hive-iceberg }
- { name: TpchTableLayoutHandle, key: tpch }
- { name: SystemTableLayoutHandle, key: $system }
Why don't we need system and $system@system here?
We need the protocol JSON classes to be generated only once by this script, so there isn't a need for all 3 catalog name mappings here. The mapping of the protocol to the key/catalog name now happens in the PrestoToVeloxConnector code, so that's where the 3 catalog name mappings live.
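As a hypothetical sketch of that design (the function and alias table below are invented for illustration, not the actual PrestoToVeloxConnector API): several catalog spellings can resolve to a single connector key at runtime, which is why the codegen script only needs one entry.

```cpp
#include <cassert>
#include <string>
#include <unordered_map>

// Hypothetical alias table: several catalog names resolve to one connector
// key, so the protocol JSON classes need to be generated only once.
std::string resolveConnectorKey(const std::string& catalogName) {
  static const std::unordered_map<std::string, std::string> kAliases = {
      {"system", "$system"},
      {"$system@system", "$system"},
  };
  auto it = kAliases.find(catalogName);
  return it == kAliases.end() ? catalogName : it->second;
}
```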
czentgr left a comment:
This is nice and a great tutorial on how to implement basic connectors!
@majetideepak: I have addressed your review comments. Would appreciate another pass. Thanks!
majetideepak left a comment:
@aditi-pandit few comments. Thanks!
majetideepak left a comment:
Thanks, @aditi-pandit
Description
SystemConnector is a Presto connector for system tables. System tables include runtime schema tables like system.runtime.{nodes|tasks|queries|transactions}, properties tables (table properties, schema properties, column properties, analyze properties), and Hive & Iceberg metadata tables.
SystemConnector tables are unique in that all of them are populated from metadata structures on the coordinator (and optionally from the workers). This metadata can be internal process metadata for the runtime tables, or metadata obtained from HMS or the Iceberg catalog.
The distribution of a SystemTable can be ALL_NODES, ALL_COORDINATORS, or SINGLE_COORDINATOR (from https://github.com/prestodb/presto/blob/master/presto-spi/src/main/java/com/facebook/presto/spi/SystemTable.java#L24).
Only one table, system.runtime.tasks, is populated on ALL_NODES. So this table gets results from the coordinator as well as the workers.
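A rough C++ rendering of the Java SystemTable distribution modes linked above (the helper function is illustrative only, not part of any real API): only ALL_NODES tables need a worker-side connector.

```cpp
#include <cassert>

// Mirrors the three distribution modes from the Java SPI's SystemTable.
enum class SystemTableDistribution {
  kAllNodes,          // e.g. system.runtime.tasks
  kAllCoordinators,
  kSingleCoordinator,
};

// Hypothetical helper: only ALL_NODES tables require a connector on workers.
bool needsWorkerConnector(SystemTableDistribution d) {
  return d == SystemTableDistribution::kAllNodes;
}
```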
In the past, querying this table was broken on Prestissimo since there was no system catalog/connector on the workers. This PR enhances the native workers with a system connector/catalog that is used to populate the tasks table. The SystemConnector uses the Presto TaskManager's task map to populate this table.
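As a hedged sketch of what "populate the tasks table from the TaskManager task map" might look like (TaskInfoLite and makeTaskRows are invented names for this illustration; the real presto_cpp code produces Velox vectors, not strings):

```cpp
#include <cassert>
#include <map>
#include <string>
#include <vector>

// Invented stand-in for the per-task metadata kept by the worker.
struct TaskInfoLite {
  std::string taskId;
  std::string state;
};

// Hypothetical sketch: walk the worker's in-memory task map and emit one
// row per task for system.runtime.tasks.
std::vector<std::vector<std::string>> makeTaskRows(
    const std::map<std::string, TaskInfoLite>& taskMap) {
  std::vector<std::vector<std::string>> rows;
  rows.reserve(taskMap.size());
  for (const auto& [id, info] : taskMap) {
    rows.push_back({info.taskId, info.state});
  }
  return rows;
}
```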
There is one more design point: the Java coordinator is not fully compatible with native workers. They use different hash functions and different intermediate results for aggregations. So some changes were needed for running the system connector on the coordinator.
The changes are the planner rules from #21725 and #21285: running the system connector scan only on the coordinator, and introducing an exchange before the partial aggregation. Both of these planning rules are applicable to the tasks table as well, since it generates data at both the coordinator and the workers.
Motivation and Context
The system.runtime.tasks table is used very frequently in deployment scripts. Querying this table was broken in Prestissimo.
#21413
Test Plan
Added e2e tests