Skip to content

Added support for sorted_by while creating iceberg table#12872

Closed
osscm wants to merge 1 commit intotrinodb:masterfrom
osscm:add-sorted-by
Closed

Added support for sorted_by while creating iceberg table#12872
osscm wants to merge 1 commit intotrinodb:masterfrom
osscm:add-sorted-by

Conversation

@osscm
Copy link
Copy Markdown
Contributor

@osscm osscm commented Jun 15, 2022

Description

This PR is to allow to provide sorted_by as the properties like partitioning.

The sorted field definition will follow these rules:

  1. Sorted field needs to be present as the valid column.
  2. Allow quoted identifier similar to Support quoted identifiers in Iceberg partitioning #12227 , borrowed pattern, and 2 tests methods, thanks @findepi and @mdesmet
  3. Allow spaces like
    sorted_by = ARRAY['c1', '  year(    a_date )   ')

Is this change a fix, improvement, new feature, refactoring, or other?

improvement, new feature

Is this a change to the core query engine, a connector, client library, or the SPI interfaces? (be specific)

the Iceberg connector

How would you describe this change to a non-technical end user or system administrator?

Add sorted_by properties while creating an Iceberg table like the Hive table.

CREATE TABLE test_table (
    c1 integer,
    c2 varchar,
    c3 double)
WITH (
    format = 'PARQUET',
    sorted_by = ARRAY['c1', 'c2'],
    location = '/var/my_tables/test_table')

Related issues, pull requests, and links

Fixes #12447

Documentation

( ) No documentation is needed.
( ) Sufficient documentation is included in this PR.
( ) Documentation PR is available with #prnumber.
( ) Documentation issue #issuenumber is filed, and can be handled later.

Release notes

( ) No release notes entries required.
( ) Release notes entries required with the following suggested text:

# Section
* Fix some things. ({issue}`https://github.com/trinodb/trino/issues/12447`)

Will add changes to the doc as well.

@cla-bot cla-bot Bot added the cla-signed label Jun 15, 2022
@osscm
Copy link
Copy Markdown
Contributor Author

osscm commented Jun 15, 2022

cc @findepi @alexjo2144 thanks!

@alexjo2144
Copy link
Copy Markdown
Member

alexjo2144 commented Jun 16, 2022

Rather than copying the code from @mdesmet 's PR, can you cherry-pick the commit from that branch and then add work in commits on top of it?

@osscm
Copy link
Copy Markdown
Contributor Author

osscm commented Jun 17, 2022

Rather than copying the code from @mdesmet 's PR, can you cherry-pick the commit from that branch and then add work in commits on top of it?

Thanks, @alexjo2144 
tried that first, but the problem was @mdesmet code was changing only the Partition-related classes only like PartitionFields, TestPartitionFields, and BaseIcebergConnectorTest (changing partition-related test-code)
So, cherry-picked those commits, will add the code changes for the Partitions related, and then have to change it, as my PR will not push anything partition related.

The borrowed code is in IcebergUtil 

  • common methods toIdentifier and fromIdentifier 
  • and common regex 

In this PR, we are adding new test methods and classes (SortFields, TestSortFields), so we have to create a parallel to the PartitionFields as the SortFields.

Not sure about the process, but thought of adding @mdesmet as the co-author if that make sense.

@mdesmet
Copy link
Copy Markdown
Contributor

mdesmet commented Jun 17, 2022

The borrowed code is in IcebergUtil 

  • common methods toIdentifier and fromIdentifier
  • and common regex

I had also raised this in the PR. This would also apply to orc_bloom_filter defintions etc, i'll move this code to IcebergUtil in a separate commit. That would probably make this easier to cherrypick.

@osscm
Copy link
Copy Markdown
Contributor Author

osscm commented Jun 17, 2022

The borrowed code is in IcebergUtil 

  • common methods toIdentifier and fromIdentifier
  • and common regex

I had also raised this in the PR. This would also apply to orc_bloom_filter defintions etc, i'll move this code to IcebergUtil in a separate commit. That would probably make this easier to cherrypick.

sure @mdesmet, I can pull only that commit which has the change only to the IcebergUtil, thanks!

}

@SuppressWarnings("unchecked")
public static List<String> getSortOrder(Map<String, Object> tableProperties)
Copy link
Copy Markdown
Member

@ebyhr ebyhr Jun 23, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The sort-order is stored to metadata, but it isn't used from anywhere if my understanding is correct (wrong?). What's the benefit adding this property in the current shape? Don't we need to respect the property during writes? It would be nice if you can add tests showing the benefit.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@ebyhr this PR is the first step in supporting `sorted_by’.

So, this PR is only intended to add the support of sorted_by for the CREATE TABLE DDL syntax.

subsequently we can add for ALTER TABLE and then to support while writing.

DDL changes will help when Spark is being used for the ingestion (and that is the case we see it almost all the time). Spark uses icebergs sorting spec to write as well.

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I disagree adding sorted_by property without proper support. It's confusing to users.

@osscm
Copy link
Copy Markdown
Contributor Author

osscm commented Jul 4, 2022

The borrowed code is in IcebergUtil 

  • common methods toIdentifier and fromIdentifier
  • and common regex

I had also raised this in the PR. This would also apply to orc_bloom_filter defintions etc, i'll move this code to IcebergUtil in a separate commit. That would probably make this easier to cherrypick.

sure @mdesmet, I can pull only that commit which has the change only to the IcebergUtil, thanks!

@mdesmet https://github.com/trinodb/trino/pull/12227/commits seems to the commit only toIdentifier and fromIdentifier methods in IcebergUtil is removed, as I was trying to cherry-pick that. Please see, if you have time to add it, otherwise I can go ahead with the borrowed one, and mentioned as I did. thanks

@mdesmet
Copy link
Copy Markdown
Contributor

mdesmet commented Jul 4, 2022

@mdesmet https://github.com/trinodb/trino/pull/12227/commits seems to the commit only toIdentifier and fromIdentifier methods in IcebergUtil is removed, as I was trying to cherry-pick that. Please see, if you have time to add it, otherwise I can go ahead with the borrowed one, and mentioned as I did. thanks

@osscm: Those commits were squashed based on the feedback in the PR. I still suggest you to cherry pick the relevant commit and reuse the provided methods in IcebergUtil. However it's not guaranteed that you won't have to reapply the commit again it if some new feedback is raised.

@colebow
Copy link
Copy Markdown
Member

colebow commented Aug 1, 2022

This PR has been inactive for 4 weeks. What's the status here - do we need it changed, or is it waiting on that other PR for direction?

cc @ebyhr

@osscm
Copy link
Copy Markdown
Contributor Author

osscm commented Aug 2, 2022

This PR has been inactive for 4 weeks. What's the status here - do we need it changed, or is it waiting on that other PR for direction?

cc @ebyhr

@colebow I have to resume on this PR if there are any pending commits.

#12227 PR was using quote identifier in partitions, I have tried to use the commit from that PR, but it was not mergeable.
As that PR is approved, May be can wait for the merge.

@findepi
Copy link
Copy Markdown
Member

findepi commented Aug 8, 2022

i merged #12227 (thanks @mdesmet !)
@osscm can you please rebase?

@osscm
Copy link
Copy Markdown
Contributor Author

osscm commented Aug 22, 2022

i merged #12227 (thanks @mdesmet !) @osscm can you please rebase?

sure @findepi

@findepi
Copy link
Copy Markdown
Member

findepi commented Aug 22, 2022

it seems the PR adds ability to declare sorted_by, but doesn't implement actual sorting.
It's a good step, but not something we would want to release, so not something we would want to merge.
@osscm do you also want to implement sorting writer? or I can probably find someone on my team to work on this, if you prefer this way.

@osscm
Copy link
Copy Markdown
Contributor Author

osscm commented Aug 24, 2022 via email

@osscm
Copy link
Copy Markdown
Contributor Author

osscm commented Sep 7, 2022 via email

@osscm
Copy link
Copy Markdown
Contributor Author

osscm commented Sep 22, 2022

@alexjo2144 is it possible to start the review for the DDL part, meanwhile I'll add the writer part as well.
It will help to get going, and get early feedback, as overall scope is also a bit bigger.

@osscm
Copy link
Copy Markdown
Contributor Author

osscm commented Sep 27, 2022

it seems the PR adds ability to declare sorted_by, but doesn't implement actual sorting.
It's a good step, but not something we would want to release, so not something we would want to merge.
@osscm do you also want to implement sorting writer? or I can probably find someone on my team to work on this, if you prefer this way.

@findepi do you think, if we can keep this PR small so that it's easy to review and merge. and then in the 2nd PR we can support Sorting Writer. If PR is bit bigger it becomes difficult to review and merge.
What do you think.

@findepi
Copy link
Copy Markdown
Member

findepi commented Oct 13, 2022

if we can keep this PR small so that it's easy to review and merge. and then in the 2nd PR we can support Sorting Writer.

Yes, it makes sense to keep changes separate, as in separate commits.

However, we don't like to expose user-facing functionality that does nothing. If we merge this PR, we expose Trino Iceberg connector table property for defining sort, which is ignored by Trino. That makes Trino users confused and sad.
We want to expose this to users when it actually works.

@alexjo2144 could you please help here? would you be able to implement a sorting writer on top of this PR?

@osscm
Copy link
Copy Markdown
Contributor Author

osscm commented Oct 15, 2022

if we can keep this PR small so that it's easy to review and merge. and then in the 2nd PR we can support Sorting Writer.

Yes, it makes sense to keep changes separate, as in separate commits.

However, we don't like to expose user-facing functionality that does nothing. If we merge this PR, we expose Trino Iceberg connector table property for defining sort, which is ignored by Trino. That makes Trino users confused and sad. We want to expose this to users when it actually works.

@alexjo2144 could you please help here? would you be able to implement a sorting writer on top of this PR?

sure @findepi
Please let me know if any changes are required to support sorted_by for DDL statements.

@findepi
Copy link
Copy Markdown
Member

findepi commented Nov 14, 2022

Superseded by #14891

@findepi findepi closed this Nov 14, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Development

Successfully merging this pull request may close these issues.

Iceberg's create table to support sort_order property

6 participants