@CTTY CTTY commented Sep 14, 2023

Change Logs

Support Spark 3.5.0

Impact

No public API changes

Risk level (write none, low, medium or high below)

Medium

Documentation Update

Describe any necessary documentation update if there is any new feature, config, or user-facing change

  • The config description must be updated if new configs are added or the default values of configs are changed
  • Any new feature or user-facing change requires updating the Hudi website. Please create a Jira ticket, attach the
    ticket number here, and follow the instructions to make changes to the website.

Contributor's checklist

  • Read through contributor's guide
  • Change Logs and Impact were stated clearly
  • Adequate tests were added if applicable
  • CI passed

vinothchandar commented Sep 16, 2023

Awesome, @CTTY!

@yihua, can we figure out how to integrate with the native Spark reader on top of this?

@CTTY CTTY force-pushed the ctty/hudi1x-spark35 branch from 533ea14 to 0a89361 Compare September 19, 2023 04:21
yihua commented Sep 19, 2023

> Awesome, @CTTY!
>
> @yihua, can we figure out how to integrate with the native Spark reader on top of this?

Yes. We should be able to use the native Spark parquet reader from the file format.

@CTTY CTTY changed the title from "[DNM] Support Spark 3.5.0" to "[HUDI-6806] Support Spark 3.5.0" Sep 19, 2023
@CTTY CTTY marked this pull request as ready for review September 19, 2023 17:03
CTTY commented Sep 22, 2023

@hudi-bot run azure

@yihua yihua self-assigned this Oct 17, 2023
@CTTY CTTY force-pushed the ctty/hudi1x-spark35 branch 2 times, most recently from 6ada390 to ef17855 Compare October 28, 2023 00:01
- return RowEncoder.apply(schema)
-     .resolveAndBind(JavaConverters.asScalaBufferConverter(attributes).asScala().toSeq(),
-         SimpleAnalyzer$.MODULE$);
+ return SparkAdapterSupport$.MODULE$.sparkAdapter().getEncoder(schema);

CTTY (Contributor Author) commented:

Encoder inference moved elsewhere in Spark 3.5.0 (SPARK-44531).

<artifactId>parquet-hadoop-bundle</artifactId>
<version>${parquet.version}</version>
<scope>provided</scope>
</dependency>

CTTY (Contributor Author) commented:

Added parquet-hadoop-bundle to fix this classpath issue:

java.lang.NoClassDefFoundError: Could not initialize class org.apache.spark.sql.execution.datasources.parquet.ParquetOptions$

    at org.apache.spark.sql.execution.datasources.parquet.ParquetOptions.<init>(ParquetOptions.scala:50)
    at org.apache.spark.sql.execution.datasources.parquet.ParquetOptions.<init>(ParquetOptions.scala:40)
    at org.apache.spark.sql.execution.datasources.parquet.Spark34LegacyHoodieParquetFileFormat.buildReaderWithPartitionValues(Spark34LegacyHoodieParquetFileFormat.scala:150)

case ae: AnalysisException if (ae.getMessage().startsWith("[INCOMPATIBLE_DATA_FOR_TABLE.CANNOT_FIND_DATA] Cannot write incompatible data for the table")
|| ae.getMessage().startsWith("Cannot write incompatible data to table")) =>
planUtils.resolveOutputColumns(catalogTable.catalogTableName, sparkAdapter.toAttributes(expectedSchema), query, byName = false, conf)
}

CTTY (Contributor Author) commented:

The error message changed in Spark 3.5.0 (SPARK-42309).
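
The guard in the snippet above has to accept both wordings of the error, presumably the older one and the Spark 3.5.0 one that carries an error class. A minimal, self-contained Java sketch of that prefix match is shown here. The class and method names are hypothetical and are not Hudi or Spark APIs; the two prefixes are copied from the diff.

```java
import java.util.Arrays;
import java.util.List;

/** Hypothetical helper (not part of Hudi or Spark): recognizes both wordings of the error. */
final class IncompatibleDataMessages {
    // Prefixes copied verbatim from the diff above.
    private static final List<String> PREFIXES = Arrays.asList(
        "[INCOMPATIBLE_DATA_FOR_TABLE.CANNOT_FIND_DATA] Cannot write incompatible data for the table",
        "Cannot write incompatible data to table");

    private IncompatibleDataMessages() {}

    /** Returns true if the message starts with either known prefix. */
    static boolean isIncompatibleDataError(String message) {
        if (message == null) {
            return false;
        }
        for (String prefix : PREFIXES) {
            if (message.startsWith(prefix)) {
                return true;
            }
        }
        return false;
    }
}
```

Keeping the prefixes in one list makes it easier to add another wording if a later Spark release changes the message again.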

case DateDiff(_, OrderPreservingTransformation(attrRef)) => Some(attrRef)
case FromUnixTime(OrderPreservingTransformation(attrRef), _, _) => Some(attrRef)
case FromUTCTimestamp(OrderPreservingTransformation(attrRef), _) => Some(attrRef)
case ParseToDate(OrderPreservingTransformation(attrRef), _, _, _) => Some(attrRef)

CTTY (Contributor Author) commented:

Added an extra wildcard argument because SPARK-43779 changed the ParseToDate API.

import org.apache.spark.sql.catalyst.expressions._
import org.apache.spark.sql.catalyst.expressions.aggregate.{First, Last}
import org.apache.spark.sql.catalyst.parser.ParserUtils.{checkDuplicateClauses, checkDuplicateKeys, entry, escapedIdentifier, operationNotAllowed, source, string, stringWithoutUnescape, validate, withOrigin}
import org.apache.spark.sql.catalyst.parser.{EnhancedLogicalPlan, ParseException, ParserInterface}

CTTY (Contributor Author) commented:

EnhancedLogicalPlan moved to a different package (SPARK-44333).

<include>com.github.ben-manes.caffeine:caffeine</include>
<!-- SPARK-43489 Spark 3.5+ has marked protobuf as provided -->
<include>com.google.protobuf:protobuf-java</include>
<include>com.twitter:bijection-avro_${scala.binary.version}</include>

CTTY (Contributor Author) commented:

This is needed; otherwise, deltastreamer fails with:

Exception in thread "main" java.lang.NoClassDefFoundError: com/google/protobuf/Message
    at java.lang.Class.getDeclaredMethods0(Native Method)
    at java.lang.Class.privateGetDeclaredMethods(Class.java:2729)
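
When chasing a NoClassDefFoundError like this one, it can help to confirm whether the class is visible to the class loader at all. The following is a minimal Java sketch of such a probe. `ClasspathProbe` and `isPresent` are hypothetical names for illustration only; they are not part of Hudi, Spark, or protobuf.

```java
/** Hypothetical diagnostic helper (not part of Hudi or Spark). */
final class ClasspathProbe {
    private ClasspathProbe() {}

    /**
     * Returns true if the named class can be located through the given loader.
     * The class is not initialized, so static initializers do not run.
     */
    static boolean isPresent(String className, ClassLoader loader) {
        try {
            Class.forName(className, false, loader);
            return true;
        } catch (ClassNotFoundException | LinkageError e) {
            // Missing class, or a class whose own dependencies cannot be linked.
            return false;
        }
    }
}
```

For example, calling `ClasspathProbe.isPresent("com.google.protobuf.Message", Thread.currentThread().getContextClassLoader())` from inside the failing job would indicate whether the bundle actually ships the protobuf classes.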

</property>
</activation>
</profile>

CTTY (Contributor Author) commented:

I haven't changed the default Spark3 profile to Spark 3.5

  + "{\"name\": \"timestamp\",\"type\": \"double\"},{\"name\": \"_row_key\", \"type\": \"string\"},"
  + "{\"name\": \"non_pii_col\", \"type\": \"string\"},"
- + "{\"name\": \"pii_col\", \"type\": \"string\"}]},";
+ + "{\"name\": \"pii_col\", \"type\": \"string\"}]}";

Contributor (reviewer) commented:

Interesting. Did the test fail before, with the comma at the end of the schema string?

CTTY (Contributor Author) commented:

No, it didn't fail before. Avro 1.11.2 enforces a stricter schema format.
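
To illustrate the kind of input a stricter parser rejects, here is a minimal Java sketch that detects characters left over after the first complete top-level JSON value, such as the trailing comma removed in the diff above. This is not Avro's parser and makes no claim about how Avro 1.11.2 implements the check; `SchemaTrailingCheck` and `hasTrailingContent` are hypothetical names.

```java
/** Hypothetical helper (not part of Avro or Hudi). */
final class SchemaTrailingCheck {
    private SchemaTrailingCheck() {}

    /**
     * Returns true if non-whitespace characters follow the first complete
     * top-level JSON object or array in the input.
     */
    static boolean hasTrailingContent(String json) {
        int depth = 0;
        boolean inString = false;
        boolean escaped = false;
        for (int i = 0; i < json.length(); i++) {
            char c = json.charAt(i);
            if (inString) {
                // Inside a string literal: only track escapes and the closing quote.
                if (escaped) {
                    escaped = false;
                } else if (c == '\\') {
                    escaped = true;
                } else if (c == '"') {
                    inString = false;
                }
                continue;
            }
            if (c == '"') {
                inString = true;
            } else if (c == '{' || c == '[') {
                depth++;
            } else if (c == '}' || c == ']') {
                depth--;
                if (depth == 0) {
                    // The top-level value just closed; anything else is trailing content.
                    return !json.substring(i + 1).trim().isEmpty();
                }
            }
        }
        return false;
    }
}
```

A test could call this on its schema strings before handing them to the schema parser, so a stray character shows up as a clear assertion failure rather than a parse error.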

Comment on lines +200 to +206
<dependency>
<groupId>org.apache.spark</groupId>
<artifactId>spark-core_${scala.binary.version}</artifactId>
<version>${spark2.version}</version>
<scope>provided</scope>
<optional>true</optional>
</dependency>

Contributor (reviewer) commented:

Any reason for adding this?

CTTY (Contributor Author) commented:

I remember seeing some classpath issues, but I can't find the exact error message. We can try reverting this change.

Comment on lines +160 to +166
<dependency>
<groupId>org.apache.spark</groupId>
<artifactId>spark-core_${scala.binary.version}</artifactId>
<version>${spark30.version}</version>
<scope>provided</scope>
<optional>true</optional>
</dependency>

Contributor (reviewer) commented:

Same question here and in the other POMs.

@yihua yihua added the priority:blocker (Production down; release blocker) and release-1.0.0 labels Nov 8, 2023
@yihua yihua force-pushed the ctty/hudi1x-spark35 branch 2 times, most recently from 43253ea to 6aeea1a Compare November 9, 2023 21:41
@yihua yihua left a comment

LGTM. I addressed all the minor comments.


yihua commented Nov 16, 2023

Azure CI on master also fails on the fourth task. Merging this PR.
[Image: Screenshot 2023-11-15 at 21.56.14]

@yihua yihua merged commit 874b5de into apache:master Nov 16, 2023
jonvex pushed a commit to jonvex/hudi that referenced this pull request Nov 29, 2023
commit 874b5de
Author: Shawn Chang
Date:   Wed Nov 15 21:57:14 2023 -0800

    [HUDI-6806] Support Spark 3.5.0 (apache#9717)

    Co-authored-by: Shawn Chang
    Co-authored-by: Y Ethan Guo

@CTTY CTTY deleted the ctty/hudi1x-spark35 branch January 8, 2024 21:02
pan3793 commented Feb 26, 2024

I didn't find the bundle jar for Spark 3.5 on Maven Central, am I missing something?

yihua added a commit that referenced this pull request Feb 27, 2024
Co-authored-by: Shawn Chang
Co-authored-by: Y Ethan Guo
yihua commented Mar 9, 2024

> I didn't find the bundle jar for Spark 3.5 on Maven Central, am I missing something?

The Spark 3.5 bundle jar will be added in the Hudi 0.15.0 release.


ranwani commented Mar 12, 2024

@yihua: We need to use Hudi with Spark 3.5. Can you let me know when the Hudi 0.15.0 release is planned?

yihua commented Mar 18, 2024

> @yihua: We need to use Hudi with Spark 3.5. Can you let me know when the Hudi 0.15.0 release is planned?

The 0.15.0 release branch is planned to be cut this month once we verify engine integrations.


melin commented Apr 2, 2024

> The 0.15.0 release branch is planned to be cut this month once we verify engine integrations.

When will it be released?


ranwani commented Apr 22, 2024

@yihua Any estimated date for the release?

yihua added a commit that referenced this pull request May 3, 2024
Co-authored-by: Shawn Chang
Co-authored-by: Y Ethan Guo
Gatsby-Lee commented

What is the release date for Hudi 0.15.0?

CTTY commented Jun 11, 2024

Hi @Gatsby-Lee , it was released last week: https://github.com/apache/hudi/tree/release-0.15.0

Gatsby-Lee commented

> Hi @Gatsby-Lee , it was released last week: https://github.com/apache/hudi/tree/release-0.15.0

Oh, so the doc has not been updated yet.

CTTY commented Oct 7, 2024

We upgraded hive-storage-api to 2.8.1 in this PR, and we recently found that this may cause issues for HoodieStreamer with an ORC source. Please see: https://issues.apache.org/jira/browse/HUDI-8081

About why we added hive-storage-api to hudi-common and upgraded hive-storage-api:

  • A bunch of Hive dependency issues:
    [ERROR] /Users/yxchang/code/Aws157Hudi/hudi-common/src/main/java/org/apache/hudi/common/util/OrcUtils.java:[36,45] package org.apache.hadoop.hive.ql.exec.vector does not exist
    [ERROR] /Users/yxchang/code/Aws157Hudi/hudi-common/src/main/java/org/apache/hudi/common/util/OrcUtils.java:[37,45] package org.apache.hadoop.hive.ql.exec.vector does not exist
    [ERROR] /Users/yxchang/code/Aws157Hudi/hudi-common/src/main/java/org/apache/hudi/io/storage/HoodieAvroOrcWriter.java:[37,45] package org.apache.hadoop.hive.ql.exec.vector does not exist
    [ERROR] /Users/yxchang/code/Aws157Hudi/hudi-common/src/main/java/org/apache/hudi/io/storage/HoodieAvroOrcWriter.java:[38,45] package org.apache.hadoop.hive.ql.exec.vector does not exist
    [ERROR] /Users/yxchang/code/Aws157Hudi/hudi-common/src/main/java/org/apache/hudi/io/storage/HoodieAvroOrcWriter.java:[55,17] cannot find symbol
    Fixed by adding hive-storage-api to hudi-common.

  • testOrcIteratorReadData failure:
    [ERROR] testOrcIteratorReadData Time elapsed: 2.171 s <<< ERROR!
    java.lang.NoClassDefFoundError: org/apache/hadoop/hive/ql/exec/vector/DateColumnVector
    at org.apache.orc.TypeDescription.createRowBatch(TypeDescription.java:491)
    at org.apache.orc.TypeDescription.createRowBatch(TypeDescription.java:525)
    at
    This happens because ORC 1.9.1 depends on hive-storage-api 2.8.1.
    Upgrading hive-storage-api in hudi-common to 2.8.1 solves the problem.
