Conversation

@caneGuy (Contributor) commented Dec 17, 2021

Co-authored-by: caneGuy [email protected]
Co-authored-by: Ielihs [email protected]

This is the first PR for #1030, which adds support for Iceberg tables stored on HDFS, using HiveCatalog as the catalog.
Goals:

  • basic query ability for Iceberg tables
  • support the Hive catalog
  • support Iceberg data compressed with Snappy
  • support the Parquet data format
  • support the v1 table format

Non Goals:

  • add Iceberg statistics for the planner
  • support the Hadoop catalog
  • support the v2 table format
  • metadata cache
  • support Gzip
  • write to Iceberg tables

Example:

```sql
CREATE DATABASE external_db;
USE external_db;

CREATE EXTERNAL RESOURCE "iceberg0"
PROPERTIES (
  "type" = "iceberg",
  "starrocks.catalog-type" = "HIVE",
  "iceberg.catalog.hive.metastore.uris" = "thrift://xxx:9083"
);

CREATE EXTERNAL TABLE `iceberg_tbl_snappy` (
  `id` bigint NULL,
  `data` varchar(200) NULL
) ENGINE=ICEBERG
PROPERTIES (
  "resource" = "iceberg0",
  "database" = "iceberg",
  "table" = "iceberg_table_snappy"
);

SELECT * FROM iceberg_tbl_snappy;
```

We have run the TPC-DS queries as a correctness check.

@caneGuy (Contributor, Author) commented Dec 17, 2021

cc @openinx, could you help review the logic related to Iceberg? Thanks.

@openinx commented Dec 17, 2021

Thanks for pinging me, @caneGuy . I'd like to take a look when I have a chance.

@caneGuy (Contributor, Author) commented Dec 29, 2021

run starrocks_clang-format
run starrocks_be_unittest
run starrocks_fe_unittest
run starrocks_tscannode

@dirtysalt (Contributor) commented

run starrocks_fe_unittest

@caneGuy (Contributor, Author) commented Jan 4, 2022

run starrocks_clang-format

@dirtysalt (Contributor) commented

run starrocks_fe_unittest

dirtysalt previously approved these changes Jan 4, 2022
@caneGuy (Contributor, Author) commented Jan 4, 2022

I have resolved the comments from @openinx. Thanks, PTAL.

@openinx (review comment):

Here we use the default new Configuration() to initialize the HiveCatalog; could we access an Iceberg table that is backed by a Hadoop filesystem? I raise this question because, in my view, the HiveCatalog will initialize its filesystem FileIO using this Hadoop configuration. https://github.com/apache/iceberg/blob/master/hive-metastore/src/main/java/org/apache/iceberg/hive/HiveCatalog.java#L88

If we use a default Hadoop configuration here, then how could we access a customized Hadoop FS?

@caneGuy (Contributor, Author) replied:

There is a premise that we must put the Hadoop conf on the classpath, as with Hive tables.
We will refactor this in a common PR covering both Hive external tables and Iceberg tables.

@openinx (review comment):

Here we don't handle nested schemas correctly, right? I see that nested fields won't be indexed in this icebergColumns map. Maybe you can try Iceberg's TypeUtil#indexByName method to generate the name -> fieldId index.

@openinx (review comment):

Looks like StarRocks does not support nested fields, right?

@caneGuy (Contributor, Author) replied:

We do not support nested fields in this version, @openinx.

@caneGuy (Contributor, Author) replied:

I will submit another PR for the nested schema.

Comment on lines +127 to +129
@openinx (review comment):

I see those tasks are FileScanTasks; will StarRocks provide any bin-pack algorithm to balance the splits across the different parallel scan instances?
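For illustration only, here is a minimal sketch of the kind of balancing the question refers to: a greedy longest-processing-time packing of split byte lengths onto a fixed number of scan instances. The class and method names are hypothetical, not the PR's actual code.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

// Hypothetical sketch: balance FileScanTask split lengths across `parallelism`
// buckets by always assigning the next-largest split to the lightest bucket.
public class SplitBinPack {
    public static List<List<Long>> pack(List<Long> splitLengths, int parallelism) {
        List<List<Long>> buckets = new ArrayList<>();
        long[] loads = new long[parallelism];
        for (int i = 0; i < parallelism; i++) {
            buckets.add(new ArrayList<>());
        }
        List<Long> sorted = new ArrayList<>(splitLengths);
        sorted.sort(Comparator.reverseOrder()); // place large splits first
        for (long len : sorted) {
            int min = 0; // index of the currently lightest bucket
            for (int i = 1; i < parallelism; i++) {
                if (loads[i] < loads[min]) {
                    min = i;
                }
            }
            buckets.get(min).add(len);
            loads[min] += len;
        }
        return buckets;
    }
}
```

With splits of 100, 60, and 40 bytes and parallelism 2, this yields one bucket of {100} and one of {60, 40}, so both instances read 100 bytes.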

@openinx (review comment):

Besides, I'm thinking we may need to introduce an extra data structure to handle the read process for Iceberg v2 tables, because their splits will contain both data file splits and delete file splits.

@caneGuy (Contributor, Author) replied:

The statistics here are currently unused; I will delete this code and submit another PR.

@openinx (review comment):

The division of task.length() by file.fileSizeInBytes() will always be 0, because it is a long-long division and task.length() will always be less than file.fileSizeInBytes(). I suggest casting task.length() to a double.
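A minimal sketch of the pitfall being pointed out (the method and parameter names stand in for task.length() and file.fileSizeInBytes(); this is not the PR's code): dividing two longs truncates toward zero, so the ratio is 0 for any partial split, while casting the numerator to double preserves the fraction.

```java
// Illustrates the long/long truncation described in the review comment.
public class SplitRatio {
    // Buggy variant: integer division truncates toward zero,
    // so this returns 0 whenever taskLength < fileSizeInBytes.
    public static long truncatedRatio(long taskLength, long fileSizeInBytes) {
        return taskLength / fileSizeInBytes;
    }

    // Fixed variant: cast the numerator to double before dividing.
    public static double ratio(long taskLength, long fileSizeInBytes) {
        return (double) taskLength / fileSizeInBytes;
    }
}
```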

@caneGuy (Contributor, Author) replied:

The statistics here are currently unused; I will delete this code and submit another PR.

Comment on lines +32 to +39
@openinx (review comment):

It seems we currently only support expressions matching the pattern var op literal; there are more complex expressions such as AND, OR, and NOT. Is there any plan to support them in the next version?

In fact, I don't suggest adding filter pushdown in this PR, because what we are implementing here is incomplete. It's good to focus on one feature per PR.
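To make the limitation concrete, here is a hypothetical sketch (none of these types exist in the PR or in Iceberg): only a flat column-op-literal comparison is treated as pushable, while compound AND/OR/NOT expressions would be skipped and left for the engine to evaluate.

```java
// Hypothetical expression model illustrating the "var op literal" restriction
// described in the review; these classes are illustrative, not StarRocks types.
public class PushDownSketch {
    interface Expr {}

    // A flat comparison such as `id = 1` -- the only shape that would be pushed down.
    record Compare(String column, String op, Object literal) implements Expr {}

    // AND / OR / NOT nodes, which would not be pushed down in this version.
    record Compound(String kind, Expr left, Expr right) implements Expr {}

    public static boolean pushable(Expr e) {
        return e instanceof Compare;
    }
}
```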

@openinx (review comment):

If you plan to add filter pushdown for StarRocks, Iceberg's SparkFilters class is a good example to follow.

@caneGuy (Contributor, Author) replied Jan 5, 2022:

As discussed with @imay, we will submit another PR to optimize this.
I will keep this code in this PR so that some users can try it first. Thanks, @openinx.

imay previously approved these changes Jan 5, 2022

@imay (Contributor) left a comment:

LGTM

openinx previously approved these changes Jan 5, 2022

@openinx left a comment:

Looks good to me overall. Give my +1.


@caneGuy caneGuy dismissed stale reviews from imay and openinx via fcda410 January 5, 2022 08:53
Seaven previously approved these changes Jan 5, 2022
@Seaven (Contributor) left a comment:

Nice work, and I suggest adding some UT cases later to cover querying Iceberg tables, like in PlanFragmentTest.

wyb previously approved these changes Jan 5, 2022
@caneGuy (Contributor, Author) commented Jan 5, 2022

> Nice work, and I suggest adding some UT cases later to cover querying Iceberg tables, like in PlanFragmentTest.

I will add some test cases in PlanFragmentTest. Thanks, @Seaven.

@satanson satanson dismissed stale reviews from wyb and Seaven via dd33098 January 5, 2022 10:06
@caneGuy (Contributor, Author) commented Jan 5, 2022

run starrocks_be_unittest

@Seaven (Contributor) commented Jan 5, 2022

run starrocks_be_unittest

@imay imay merged commit c40f7ca into StarRocks:main Jan 5, 2022
@caneGuy caneGuy deleted the iceberg-reader branch January 6, 2022 02:29
@caneGuy caneGuy mentioned this pull request Feb 17, 2022
imay pushed a commit that referenced this pull request Apr 24, 2022
Add support for custom catalogs, which users can define themselves when creating an Iceberg external table.

The custom catalog should follow the form of IcebergHiveCatalog; in other words, it should extend BaseMetastoreCatalog and implement IcebergCatalog. The catalog JAR should be placed in each fe/lib directory, and the FE has to be restarted before the custom catalog takes effect.

Usage:
```sql
CREATE EXTERNAL RESOURCE "iceberg0"
PROPERTIES (
  "type" = "iceberg",
  "starrocks.catalog-type"="CUSTOM",
  "iceberg.catalog-impl"="{The full class name of custom catalog}"
);
```
Extra user-defined config can be added in the table properties when executing CREATE EXTERNAL TABLE; see #2225.
blackstar-baba pushed a commit to blackstar-baba/starrocks that referenced this pull request Apr 28, 2022
caneGuy pushed a commit to caneGuy/starrocks that referenced this pull request Mar 28, 2023