
Fix CTE Materialization unsupported hive bucket types #21549

Merged

jaystarshot merged 1 commit into prestodb:master from jaystarshot:cte-bucket-fix on Jan 30, 2024

Conversation

@jaystarshot (Member) commented Dec 15, 2023

Fixes #21540

The Hive catalog does not allow bucketing on many Presto types, including user-defined types (link).

Hence, CTE materialization now avoids bucketing on those types. If all columns are of unsupported types, the CTE will not be materialized.

Added test cases for all supported and unsupported types to verify that the queries succeed.
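The column filtering described above can be sketched as follows. This is a hypothetical illustration, not the actual Presto SPI: the type names and the `isTypeBucketable` helper (the name later suggested in review) are stand-ins, and the real supported-type set lives in the Hive connector.

```java
import java.util.List;
import java.util.stream.Collectors;

// Sketch: keep only columns whose types Hive can bucket on; if no
// columns remain, the CTE is not materialized at all. The type set
// below is illustrative, not the connector's authoritative list.
public class CteBucketFilterSketch
{
    // Hive allows bucketing only on a fixed set of primitive types.
    static boolean isTypeBucketable(String typeName)
    {
        switch (typeName) {
            case "boolean":
            case "tinyint":
            case "smallint":
            case "integer":
            case "bigint":
            case "varchar":
            case "date":
                return true;
            default:
                // e.g. struct, array, map, varbinary, user-defined types
                return false;
        }
    }

    static List<String> bucketableColumns(List<String> columnTypes)
    {
        return columnTypes.stream()
                .filter(CteBucketFilterSketch::isTypeBucketable)
                .collect(Collectors.toList());
    }

    public static void main(String[] args)
    {
        List<String> columns = List.of("bigint", "struct", "varchar");
        // An empty result here would mean: skip materialization.
        System.out.println(bucketableColumns(columns)); // prints [bigint, varchar]
    }
}
```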

Motivation and Context

Impact

Test Plan

Contributor checklist

  • Please make sure your submission complies with our development, formatting, commit message, and attribution guidelines.
  • PR description addresses the issue accurately and concisely. If the change is non-trivial, a GitHub Issue is referenced.
  • Documented new properties (with its default value), SQL syntax, functions, or other functionality.
  • If release notes are required, they follow the release notes guidelines.
  • Adequate tests were added if applicable.
  • CI passed.

Release Notes

Please follow release notes guidelines and fill in the release notes below.

== RELEASE NOTES ==

Hive Changes
* Fixed CTE materialization for unsupported Hive bucket types


@jaystarshot jaystarshot force-pushed the cte-bucket-fix branch 4 times, most recently from 1f20316 to 839457c on December 15, 2023 04:38
@jaystarshot jaystarshot marked this pull request as ready for review December 15, 2023 09:53
@jaystarshot jaystarshot requested a review from a team as a code owner December 15, 2023 09:53
@tdcmeehan tdcmeehan self-assigned this Dec 15, 2023
Contributor

Can we think of a way to do this that doesn't require engine changes that presume the underlying connector's capabilities? I'm just thinking how this might work if we use something like Iceberg.

@jaystarshot (Member, Author) Dec 15, 2023

Yes, I thought of the same, but that requires SPI changes (isTypeBucketable etc.), or we can switch on the partitioning_provider_catalog session property.

Member Author

Also, I am not sure the Iceberg connector supports temporary table functionality at the moment.

Contributor

SPI changes could be fine if they're targeted and allow the engine to function properly without baking in underlying presumptions of the storage subsystem.

@feilong-liu (Contributor) left a comment

LGTM

@jaystarshot jaystarshot force-pushed the cte-bucket-fix branch 3 times, most recently from 430f5a2 to a9ed71f on January 2, 2024 21:36
Member Author

Added so that the testing framework (logicalCteOptimizer) can test successfully.

@jaystarshot (Member, Author)

@tdcmeehan I have refactored to check via SPI, please review

@jaystarshot jaystarshot requested a review from tdcmeehan January 2, 2024 22:31
Contributor

Can we use a more concise name, like isTypeBucketable?

@jaystarshot (Member, Author)

Actually, I had hardcoded all uses of partitioning_provider_catalog to "hive" in our testing. After correcting this to use the partitioning_provider_catalog property, I see unrelated failures in a different part of the stack, like https://github.com/prestodb/presto/blob/d121d92484856665157f36a74151be3ce3a3f474/p[…]com/facebook/presto/sql/planner/optimizations/AddExchanges.java, without CTE materialization.
It seems the property is used in some unrelated places like this one. The failures are the same (Hive bucketing not supported), which suggests that we should use a different session property for CTE partitioning.

@jaystarshot (Member, Author)

Should I separate those session properties in #21625 ?

@jaystarshot (Member, Author)

Revisiting and regaining context

@jaystarshot (Member, Author) commented Jan 25, 2024

@tdcmeehan during his review mentioned that the stack trace in the attached ticket (#21540) is a bit suspicious, since it suggests that a bucketFunctionType of HIVE_COMPATIBLE was used.
I dug in a little, and I think this issue has already been solved for exchange materialization (commit). Setting hive.bucket_function_type_for_exchange = PRESTO_NATIVE works well in this case.
This property is used here in the code, which was the main fix.
I think this would be the proper fix in our case too.
We might need an extension to the SPI there, because we don't want to set hive.bucket_function_type_for_exchange = PRESTO_NATIVE for non-materialized exchanges in the case of CTE materialization, like
getPartitioningHandleForCte
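The workaround discussed above can be tried per session; this is a sketch based on the property name quoted in the comment (availability and exact spelling depend on the Hive connector version in use):

```sql
-- Use Presto-native bucket functions for materialized exchanges,
-- avoiding the HIVE_COMPATIBLE bucket function that fails on types
-- Hive cannot bucket on (e.g. STRUCT, VARBINARY).
SET SESSION hive.bucket_function_type_for_exchange = 'PRESTO_NATIVE';
```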

@tdcmeehan (Contributor) left a comment

LGTM. @feilong-liu would you like to have a look at the current PR again?

Contributor

Very nice!

Member Author

@tdcmeehan I just saw that one test case is failing:

@Test
public void testCteWithZeroLengthVarchar()
{
    String testQuery = "WITH temp AS (" +
            "  SELECT * FROM (VALUES " +
            "    (CAST('' AS VARCHAR(0)), 9)" +
            "  ) AS t (text_column, number_column)" +
            ") SELECT * FROM temp";
    QueryRunner queryRunner = getQueryRunner();
    compareResults(queryRunner.execute(getMaterializedSession(), testQuery),
            queryRunner.execute(getSession(), testQuery));
}

Stack Trace

Caused by: java.lang.RuntimeException: Varchar length 0 out of allowed range [1, 65535]
	at org.apache.hadoop.hive.serde2.typeinfo.BaseCharUtils.validateVarcharParameter(BaseCharUtils.java:32)
	at org.apache.hadoop.hive.serde2.typeinfo.VarcharTypeInfo.<init>(VarcharTypeInfo.java:33)
	at org.apache.hadoop.hive.serde2.typeinfo.TypeInfoFactory.createPrimitiveTypeInfo(TypeInfoFactory.java:159)
	at org.apache.hadoop.hive.serde2.typeinfo.TypeInfoFactory.getPrimitiveTypeInfo(TypeInfoFactory.java:117)
	at org.apache.hadoop.hive.serde2.typeinfo.TypeInfoFactory.getVarcharTypeInfo(TypeInfoFactory.java:183)
	at com.facebook.presto.hive.HiveTypeTranslator.translate(HiveTypeTranslator.java:98)
	at com.facebook.presto.hive.HiveType.toHiveType(HiveType.java:218)
	at com.facebook.presto.hive.HiveMetadata.getColumnHandles(HiveMetadata.java:3584)
	at com.facebook.presto.hive.HiveMetadata.createTemporaryTable(HiveMetadata.java:1107)


So we might need a check for the 0-length varchar, because writes with a 0-length varchar fail.

For reference: Presto supports varchar of length 0 (as discussed in trinodb/trino#1136).

@jaystarshot (Member, Author) Jan 25, 2024

I am hence planning to disallow CTE materialization if there is a 0-length varchar. However, the current fix (checking for a varchar of length 0) was not enough anyway, because it didn't check for nested types. We need a proper separate fix for this, since I suspect this issue exists even in current CTAS.
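A recursive check along these lines could catch a varchar(0) hiding inside nested types, which the top-level check misses. The `Type` hierarchy here is a hypothetical stand-in for Presto's actual type system, used only to illustrate the recursion:

```java
import java.util.List;

// Sketch: a VARCHAR(0) may be nested inside arrays, maps, or rows,
// so the check must recurse through the whole type tree rather than
// inspect only the top-level type.
public class ZeroLengthVarcharCheckSketch
{
    sealed interface Type permits VarcharType, ArrayType, RowType {}

    record VarcharType(int length) implements Type {}

    record ArrayType(Type element) implements Type {}

    record RowType(List<Type> fields) implements Type {}

    static boolean containsZeroLengthVarchar(Type type)
    {
        if (type instanceof VarcharType varchar) {
            return varchar.length() == 0;
        }
        if (type instanceof ArrayType array) {
            return containsZeroLengthVarchar(array.element());
        }
        if (type instanceof RowType row) {
            return row.fields().stream()
                    .anyMatch(ZeroLengthVarcharCheckSketch::containsZeroLengthVarchar);
        }
        return false;
    }

    public static void main(String[] args)
    {
        // row(array(varchar(0))) -- the bad varchar is two levels deep
        Type nested = new RowType(List.of(new ArrayType(new VarcharType(0))));
        System.out.println(containsZeroLengthVarchar(nested)); // prints true
    }
}
```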

@jaystarshot (Member, Author) Jan 25, 2024

So I will defer this test case to a new issue.

Member Author

Created #21791 and added TODOs in the test.

@jaystarshot (Member, Author) Jan 25, 2024

Confirmed that

CREATE TABLE tmp.test2512351 AS (
    SELECT * FROM (VALUES 
        (CAST('' AS VARCHAR(0)), 9)
    ) AS t(col1, col2)
);

fails in Presto

@jaystarshot jaystarshot force-pushed the cte-bucket-fix branch 3 times, most recently from c0c1e04 to 3ce1760 on January 25, 2024 22:41
@jaystarshot jaystarshot requested a review from tdcmeehan January 25, 2024 22:42
@jaystarshot
Copy link
Member Author

cc: @tdcmeehan, I need an approval again, thanks! Fixed some tests that were based on the previously unsupported bucket functions.

@jaystarshot jaystarshot merged commit c61f424 into prestodb:master Jan 30, 2024
@wanglinsong wanglinsong mentioned this pull request Feb 12, 2024


Development

Successfully merging this pull request may close these issues.

CTE materialization internal error - Hive bucket hashCode is not supported for Hive category: STRUCT and Binary

3 participants