Conversation

@caneGuy (Contributor) commented Dec 17, 2021

Co-authored-by: caneGuy [email protected]
Co-authored-by: Ielihs [email protected]

This is the first PR for #1030, which adds support for Iceberg tables stored on HDFS, using HiveCatalog as the catalog.
Goals:

  • basic query ability for Iceberg tables
  • support the Hive catalog
  • support Iceberg data compressed with Snappy
  • support the Parquet data format
  • support the v1 table format

Non Goals:

  • add Iceberg statistics for the planner
  • support the Hadoop catalog
  • support the v2 table format
  • metadata cache
  • support Gzip
  • write to Iceberg tables

Example:

```sql
CREATE DATABASE external_db;
USE external_db;

CREATE EXTERNAL RESOURCE "iceberg0"
PROPERTIES (
  "type" = "iceberg",
  "starrocks.catalog-type" = "HIVE",
  "iceberg.catalog.hive.metastore.uris" = "thrift://xxx:9083"
);

CREATE EXTERNAL TABLE `iceberg_tbl_snappy` (
  `id` bigint NULL,
  `data` varchar(200) NULL
) ENGINE=ICEBERG
PROPERTIES (
  "resource" = "iceberg0",
  "database" = "iceberg",
  "table" = "iceberg_table_snappy"
);

SELECT * FROM iceberg_tbl_snappy;
```

We have run the TPC-DS queries as a correctness check.

@caneGuy (Contributor, Author) commented Dec 17, 2021

cc @openinx, could you help review the logic related to Iceberg? Thanks.

@openinx commented Dec 17, 2021

Thanks for pinging me, @caneGuy . I'd like to take a look when I have a chance.

@caneGuy (Contributor, Author) commented Dec 29, 2021

run starrocks_clang-format
run starrocks_be_unittest
run starrocks_fe_unittest
run starrocks_tscannode

@dirtysalt (Contributor) commented

run starrocks_fe_unittest

@caneGuy (Contributor, Author) commented Jan 4, 2022

run starrocks_clang-format

@dirtysalt (Contributor) commented

run starrocks_fe_unittest

dirtysalt previously approved these changes Jan 4, 2022
@caneGuy (Contributor, Author) commented Jan 4, 2022

I have resolved the comments from @openinx. Thanks, PTAL.

@openinx (review comment):

Here we use the default new Configuration() to initialize the HiveCatalog; could we access an Iceberg table that is backed by a Hadoop filesystem? I raise this question because, in my view, the HiveCatalog will initialize its filesystem FileIO using this Hadoop configuration. https://github.com/apache/iceberg/blob/master/hive-metastore/src/main/java/org/apache/iceberg/hive/HiveCatalog.java#L88

If we use a default Hadoop configuration here, then how could we access a customized Hadoop FS?

@caneGuy (Contributor, Author) replied:

There is a premise that we must put the Hadoop conf on the classpath, as with Hive tables.
We will refactor this in a common PR covering both Hive external tables and Iceberg tables.

@openinx (review comment):

Here we don't handle nested schemas correctly, right? I see that nested fields won't be indexed in this icebergColumns map. Maybe you can try Iceberg's TypeUtil#indexByName method to generate the name -> fieldId index.

@openinx (review comment):

Looks like StarRocks does not support nested fields, right?

@caneGuy (Contributor, Author) replied:

We do not support nested fields in this version, @openinx.

@caneGuy (Contributor, Author) replied:

I will submit another PR for the nested schema.

Comment on lines +127 to +129
@openinx (review comment):

I see those tasks are FileScanTasks; will StarRocks provide any bin-pack algorithm to balance the splits across the different parallel scan instances?
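For illustration only, here is a minimal sketch of the kind of balancing the question refers to: a greedy longest-processing-time packing of split byte lengths onto a fixed number of scan instances. The class and method names are hypothetical, not the PR's actual code.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

// Hypothetical sketch: balance FileScanTask split lengths across `parallelism`
// buckets by always assigning the next-largest split to the lightest bucket.
public class SplitBinPack {
    public static List<List<Long>> pack(List<Long> splitLengths, int parallelism) {
        List<List<Long>> buckets = new ArrayList<>();
        long[] loads = new long[parallelism];
        for (int i = 0; i < parallelism; i++) {
            buckets.add(new ArrayList<>());
        }
        List<Long> sorted = new ArrayList<>(splitLengths);
        sorted.sort(Comparator.reverseOrder()); // place large splits first
        for (long len : sorted) {
            int min = 0; // index of the currently lightest bucket
            for (int i = 1; i < parallelism; i++) {
                if (loads[i] < loads[min]) {
                    min = i;
                }
            }
            buckets.get(min).add(len);
            loads[min] += len;
        }
        return buckets;
    }
}
```

With splits of 100, 60, and 40 bytes and parallelism 2, this yields one bucket of {100} and one of {60, 40}, so both instances read 100 bytes.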

@openinx (review comment):

Besides, I'm thinking we may need to introduce an extra data structure to handle the read process for Iceberg v2 tables, because their splits will contain both data file splits and delete file splits.

@caneGuy (Contributor, Author) replied:

The statistics here are currently unused; I will delete this code and submit another PR.

@openinx (review comment):

The division of task.length() by file.fileSizeInBytes() will always be 0, because it is a long-long division and task.length() will always be less than file.fileSizeInBytes(). I suggest casting task.length() to a double.
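A minimal sketch of the pitfall being pointed out (the method and parameter names stand in for task.length() and file.fileSizeInBytes(); this is not the PR's code): dividing two longs truncates toward zero, so the ratio is 0 for any partial split, while casting the numerator to double preserves the fraction.

```java
// Illustrates the long/long truncation described in the review comment.
public class SplitRatio {
    // Buggy variant: integer division truncates toward zero,
    // so this returns 0 whenever taskLength < fileSizeInBytes.
    public static long truncatedRatio(long taskLength, long fileSizeInBytes) {
        return taskLength / fileSizeInBytes;
    }

    // Fixed variant: cast the numerator to double before dividing.
    public static double ratio(long taskLength, long fileSizeInBytes) {
        return (double) taskLength / fileSizeInBytes;
    }
}
```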

@caneGuy (Contributor, Author) replied:

The statistics here are currently unused; I will delete this code and submit another PR.

Comment on lines +32 to +39
@openinx (review comment):

It seems we currently only support expressions matching the pattern var op literal; there are more complex expressions such as AND, OR, and NOT. Is there any plan to support them in the next version?

In fact, I don't suggest adding filter pushdown in this PR, because what we are implementing here is incomplete. It's good to focus on one feature per PR.
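To make the limitation concrete, here is a hypothetical sketch (none of these types exist in the PR or in Iceberg): only a flat column-op-literal comparison is treated as pushable, while compound AND/OR/NOT expressions would be skipped and left for the engine to evaluate.

```java
// Hypothetical expression model illustrating the "var op literal" restriction
// described in the review; these classes are illustrative, not StarRocks types.
public class PushDownSketch {
    interface Expr {}

    // A flat comparison such as `id = 1` -- the only shape that would be pushed down.
    record Compare(String column, String op, Object literal) implements Expr {}

    // AND / OR / NOT nodes, which would not be pushed down in this version.
    record Compound(String kind, Expr left, Expr right) implements Expr {}

    public static boolean pushable(Expr e) {
        return e instanceof Compare;
    }
}
```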

@openinx (review comment):

If you plan to add filter pushdown for StarRocks, Iceberg's SparkFilters class is a good example to follow.

@caneGuy (Contributor, Author) replied Jan 5, 2022:

As discussed with @imay, we will submit another PR to optimize this.
I will keep this code in this PR so that some users can try it first. Thanks, @openinx.

imay previously approved these changes Jan 5, 2022

@imay (Contributor) left a comment:

LGTM

openinx previously approved these changes Jan 5, 2022

@openinx left a comment:

Looks good to me overall. Give my +1.


@caneGuy caneGuy dismissed stale reviews from imay and openinx via fcda410 January 5, 2022 08:53
Seaven previously approved these changes Jan 5, 2022
@Seaven (Contributor) left a comment:

Nice work, and I suggest adding some UT cases later to cover querying Iceberg tables, like in PlanFragmentTest.

wyb previously approved these changes Jan 5, 2022
@caneGuy (Contributor, Author) commented Jan 5, 2022

> Nice work, and I suggest adding some UT cases later to cover querying Iceberg tables, like in PlanFragmentTest.

I will add some test cases in PlanFragmentTest. Thanks, @Seaven.

@satanson satanson dismissed stale reviews from wyb and Seaven via dd33098 January 5, 2022 10:06
@caneGuy (Contributor, Author) commented Jan 5, 2022

run starrocks_be_unittest

@Seaven (Contributor) commented Jan 5, 2022

run starrocks_be_unittest

@imay imay merged commit c40f7ca into StarRocks:main Jan 5, 2022
@caneGuy caneGuy deleted the iceberg-reader branch January 6, 2022 02:29
@caneGuy caneGuy mentioned this pull request Feb 17, 2022
imay pushed a commit that referenced this pull request Apr 24, 2022
Add support for custom catalogs, which users can define themselves when creating an Iceberg external table.

The custom catalog should follow the form of IcebergHiveCatalog; in other words, it should extend BaseMetastoreCatalog and implement IcebergCatalog. The catalog JAR should be placed in each fe/lib directory, and the FE has to be restarted before the custom catalog takes effect.

Usage:
```sql
CREATE EXTERNAL RESOURCE "iceberg0"
PROPERTIES (
  "type" = "iceberg",
  "starrocks.catalog-type"="CUSTOM",
  "iceberg.catalog-impl"="{The full class name of custom catalog}"
);
```
Extra user-defined config can be added in the table properties when executing CREATE EXTERNAL TABLE; see #2225.
blackstar-baba pushed a commit to blackstar-baba/starrocks that referenced this pull request Apr 28, 2022
caneGuy pushed a commit to caneGuy/starrocks that referenced this pull request Mar 28, 2023