Description
Describe the problem you faced
We have a Spark application that successfully reads an AWS Glue table and deletes records that meet certain search criteria. However, when the application attempts to sync the metadata with Hive sync mode set to HMS, it fails with the error: "Could not sync using the meta sync class org.apache.hudi.hive.HiveSyncTool." The error occurs in the last stage of processing, after the records have been deleted from the table.
App properties:
hoodie.datasource.write.recordkey.field=
hoodie.datasource.write.partitionpath.field=
hoodie.datasource.write.keygenerator.class=org.apache.hudi.keygen.TimestampBasedKeyGenerator
hoodie.deltastreamer.keygen.timebased.timestamp.type=SCALAR
hoodie.deltastreamer.keygen.timebased.timestamp.scalar.time.unit=MICROSECONDS
hoodie.deltastreamer.keygen.timebased.output.dateformat=yyyy/MM/dd
hoodie.deltastreamer.keygen.timebased.timezone=GMT
hoodie.deltastreamer.source.kafka.topic=
hoodie.datasource.hive_sync.database=
hoodie.datasource.hive_sync.table=
hoodie.datasource.write.hive_style_partitioning=false
hoodie.datasource.hive_sync.partition_fields=
hoodie.datasource.meta.sync.enable=true
hoodie.datasource.hive_sync.mode=hms
hoodie.datasource.hive_sync.enable=true
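For context, a minimal sketch of how the delete write is issued through the Spark datasource with the options above; the table name, record key, partition field, and base path below are hypothetical placeholders, not our actual values.

```java
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SaveMode;

public class DeleteExample {
    // Sketch: write the records to delete back to the Hudi table with the DELETE
    // operation and HMS-based Hive sync enabled (names/paths are placeholders).
    static void deleteRecords(Dataset<Row> recordsToDelete, String basePath) {
        recordsToDelete.write()
            .format("hudi")
            .option("hoodie.table.name", "my_table")                      // hypothetical
            .option("hoodie.datasource.write.operation", "delete")
            .option("hoodie.datasource.write.recordkey.field", "id")      // hypothetical
            .option("hoodie.datasource.write.partitionpath.field", "ts")  // hypothetical
            .option("hoodie.datasource.write.keygenerator.class",
                    "org.apache.hudi.keygen.TimestampBasedKeyGenerator")
            .option("hoodie.deltastreamer.keygen.timebased.timestamp.type", "SCALAR")
            .option("hoodie.deltastreamer.keygen.timebased.output.dateformat", "yyyy/MM/dd")
            .option("hoodie.datasource.hive_sync.enable", "true")
            .option("hoodie.datasource.hive_sync.mode", "hms")
            .option("hoodie.datasource.meta.sync.enable", "true")
            .mode(SaveMode.Append)
            .save(basePath);
    }
}
```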
Table properties:
hoodie.datasource.write.recordkey.field=
hoodie.datasource.write.partitionpath.field=
hoodie.datasource.write.keygenerator.class=org.apache.hudi.keygen.TimestampBasedKeyGenerator
hoodie.deltastreamer.keygen.timebased.timestamp.type=EPOCHMILLISECONDS
hoodie.deltastreamer.keygen.timebased.output.dateformat=yyyy/MM/dd
hoodie.deltastreamer.source.kafka.topic=
hoodie.datasource.hive_sync.database=
hoodie.datasource.hive_sync.table=
hoodie.datasource.write.hive_style_partitioning=false
hoodie.datasource.hive_sync.partition_fields=
The --enable-sync option is set in DeltaStreamer.
We are facing this issue on tables that are written by DeltaStreamer.
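For illustration only, a minimal sketch of how the configured output dateformat yyyy/MM/dd shapes the partition path value produced from a timestamp; the sample epoch value is hypothetical and not taken from our data.

```java
import java.text.SimpleDateFormat;
import java.util.Date;
import java.util.TimeZone;

public class PartitionPathShape {
    public static void main(String[] args) {
        // The output dateformat configured for TimestampBasedKeyGenerator.
        SimpleDateFormat partitionFormat = new SimpleDateFormat("yyyy/MM/dd");
        partitionFormat.setTimeZone(TimeZone.getTimeZone("GMT"));

        long sampleEpochMillis = 1650000000000L;  // hypothetical sample timestamp
        String partitionPath = partitionFormat.format(new Date(sampleEpochMillis));

        // Because the format contains slashes, a single configured partition field
        // yields a three-level path such as "2022/04/15".
        System.out.println(partitionPath);
    }
}
```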
To Reproduce
Steps to reproduce the behavior:
Expected behavior
The Spark application should delete the targeted records and then successfully sync the metadata with the Hive metastore using HMS mode, so that the AWS Glue catalog accurately reflects the state of the data and any subsequent queries or analysis against the catalog are based on the most up-to-date information.
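As a hedged example of how the sync result could be checked, assuming a Spark session that uses the Glue catalog as its metastore and a hypothetical database/table name:

```java
import org.apache.spark.sql.SparkSession;

public class VerifySync {
    public static void main(String[] args) {
        SparkSession spark = SparkSession.builder()
                .appName("verify-hive-sync")   // hypothetical app name
                .enableHiveSupport()
                .getOrCreate();

        // After a successful sync, the catalog should list the table's partitions.
        spark.sql("SHOW PARTITIONS my_db.my_table").show(false);  // hypothetical db/table
    }
}
```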
Environment Description
- Hudi version: 0.11.0
- Spark version: 3.2
- Hive version: 3.1.2
- Hadoop version: Amazon 3.2.1
- Storage (HDFS/S3/GCS..): S3
- Running on Docker? (yes/no): no
Additional context
Stacktrace
org.apache.hudi.exception.HoodieException: Could not sync using the meta sync class org.apache.hudi.hive.HiveSyncTool
	at org.apache.hudi.sync.common.util.SyncUtilHelpers.runHoodieMetaSync(SyncUtilHelpers.java:61)
	at org.apache.hudi.HoodieSparkSqlWriter$.$anonfun$metaSync$2(HoodieSparkSqlWriter.scala:623)
	at org.apache.hudi.HoodieSparkSqlWriter$.$anonfun$metaSync$2$adapted(HoodieSparkSqlWriter.scala:622)
	at scala.collection.mutable.HashSet.foreach(HashSet.scala:79)
	at org.apache.hudi.HoodieSparkSqlWriter$.metaSync(HoodieSparkSqlWriter.scala:622)
	at org.apache.hudi.HoodieSparkSqlWriter$.commitAndPerformPostOperations(HoodieSparkSqlWriter.scala:681)
	at org.apache.hudi.HoodieSparkSqlWriter$.write(HoodieSparkSqlWriter.scala:315)
	at org.apache.hudi.DefaultSource.createRelation(DefaultSource.scala:171)
	at org.apache.spark.sql.execution.datasources.SaveIntoDataSourceCommand.run(SaveIntoDataSourceCommand.scala:45)
	at org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult$lzycompute(commands.scala:75)
	at org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult(commands.scala:73)
	at org.apache.spark.sql.execution.command.ExecutedCommandExec.executeCollect(commands.scala:84)
	at org.apache.spark.sql.execution.QueryExecution$anonfun$eagerlyExecuteCommands$1.$anonfun$applyOrElse$1(QueryExecution.scala:115)
	at org.apache.spark.sql.catalyst.QueryPlanningTracker$.withTracker(QueryPlanningTracker.scala:107)
	at org.apache.spark.sql.execution.SQLExecution$.withTracker(SQLExecution.scala:232)
	at org.apache.spark.sql.execution.SQLExecution$.executeQuery$1(SQLExecution.scala:110)
	at org.apache.spark.sql.execution.SQLExecution$.$anonfun$withNewExecutionId$6(SQLExecution.scala:135)
	at org.apache.spark.sql.catalyst.QueryPlanningTracker$.withTracker(QueryPlanningTracker.scala:107)
	at org.apache.spark.sql.execution.SQLExecution$.withTracker(SQLExecution.scala:232)
	at org.apache.spark.sql.execution.SQLExecution$.$anonfun$withNewExecutionId$5(SQLExecution.scala:135)
	at org.apache.spark.sql.execution.SQLExecution$.withSQLConfPropagated(SQLExecution.scala:253)
	at org.apache.spark.sql.execution.SQLExecution$.$anonfun$withNewExecutionId$1(SQLExecution.scala:134)
	at org.apache.spark.sql.SparkSession.withActive(SparkSession.scala:775)
	at org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:68)
	at org.apache.spark.sql.execution.QueryExecution$anonfun$eagerlyExecuteCommands$1.applyOrElse(QueryExecution.scala:112)
	at org.apache.spark.sql.execution.QueryExecution$anonfun$eagerlyExecuteCommands$1.applyOrElse(QueryExecution.scala:108)
	at org.apache.spark.sql.catalyst.trees.TreeNode.$anonfun$transformDownWithPruning$1(TreeNode.scala:519)
	at org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:83)
	at org.apache.spark.sql.catalyst.trees.TreeNode.transformDownWithPruning(TreeNode.scala:519)
	at org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.org$apache$spark$sql$catalyst$plans$logical$AnalysisHelper$super$transformDownWithPruning(LogicalPlan.scala:30)
	at org.apache.spark.sql.catalyst.plans.logical.AnalysisHelper.transformDownWithPruning(AnalysisHelper.scala:267)
	at org.apache.spark.sql.catalyst.plans.logical.AnalysisHelper.transformDownWithPruning$(AnalysisHelper.scala:263)
	at org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.transformDownWithPruning(LogicalPlan.scala:30)
	at org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.transformDownWithPruning(LogicalPlan.scala:30)
	at org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(TreeNode.scala:495)
	at org.apache.spark.sql.execution.QueryExecution.eagerlyExecuteCommands(QueryExecution.scala:108)
	at org.apache.spark.sql.execution.QueryExecution.commandExecuted$lzycompute(QueryExecution.scala:95)
	at org.apache.spark.sql.execution.QueryExecution.commandExecuted(QueryExecution.scala:93)
	at org.apache.spark.sql.execution.QueryExecution.assertCommandExecuted(QueryExecution.scala:136)
	at org.apache.spark.sql.DataFrameWriter.runCommand(DataFrameWriter.scala:848)
	at org.apache.spark.sql.DataFrameWriter.saveToV1Source(DataFrameWriter.scala:382)
	at org.apache.spark.sql.DataFrameWriter.saveInternal(DataFrameWriter.scala:355)
	at org.apache.spark.sql.DataFrameWriter.save(DataFrameWriter.scala:239)
	at com.comparethemarket.spark.io.Writer.write(Writer.java:33)
	at com.comparethemarket.spark.io.Writer.write(Writer.java:45)
	at com.comparethemarket.spark.job.SparkJob.run(SparkJob.java:60)
	at com.comparethemarket.spark.job.SparkJobExecutor.execute(SparkJobExecutor.java:51)
	at com.comparethemarket.spark.ModelAnonymizer.executeSparkJob(ModelAnonymizer.java:228)
	at com.comparethemarket.spark.ModelAnonymizer.lambda$createAndExecuteSparkJob$0(ModelAnonymizer.java:168)
	at com.comparethemarket.spark.ModelAnonymizer.lambda$createAndExecuteSparkJob$5(ModelAnonymizer.java:194)
	at java.util.stream.ForEachOps$ForEachOp$OfRef.accept(ForEachOps.java:183)
	at java.util.stream.ReferencePipeline$2$1.accept(ReferencePipeline.java:175)
	at java.util.Iterator.forEachRemaining(Iterator.java:116)
	at java.util.Spliterators$IteratorSpliterator.forEachRemaining(Spliterators.java:1801)
	at java.util.stream.AbstractPipeline.copyInto(AbstractPipeline.java:482)
	at java.util.stream.AbstractPipeline.wrapAndCopyInto(AbstractPipeline.java:472)
	at java.util.stream.ForEachOps$ForEachOp.evaluateSequential(ForEachOps.java:150)
	at java.util.stream.ForEachOps$ForEachOp$OfRef.evaluateSequential(ForEachOps.java:173)
	at java.util.stream.AbstractPipeline.evaluate(AbstractPipeline.java:234)
	at java.util.stream.ReferencePipeline.forEach(ReferencePipeline.java:485)
	at com.comparethemarket.spark.ModelAnonymizer.createAndExecuteSparkJob(ModelAnonymizer.java:194)
	at com.comparethemarket.spark.ModelAnonymizer.createAndExecuteSparkJobsForModels(ModelAnonymizer.java:144)
	at com.comparethemarket.spark.ModelAnonymizer.main(ModelAnonymizer.java:94)
	at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
	at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
	at java.lang.reflect.Method.invoke(Method.java:498)
	at org.apache.spark.deploy.yarn.ApplicationMaster$anon$2.run(ApplicationMaster.scala:740)
Caused by: java.lang.IllegalArgumentException: Number of table partition keys must match number of partition values
	at com.google.common.base.Preconditions.checkArgument(Preconditions.java:92)
	at com.amazonaws.glue.catalog.metastore.GlueMetastoreClientDelegate.validateInputForBatchCreatePartitions(GlueMetastoreClientDelegate.java:832)
	at com.amazonaws.glue.catalog.metastore.GlueMetastoreClientDelegate.batchCreatePartitions(GlueMetastoreClientDelegate.java:766)
	at com.amazonaws.glue.catalog.metastore.GlueMetastoreClientDelegate.addPartitions(GlueMetastoreClientDelegate.java:748)
	at com.amazonaws.glue.catalog.metastore.AWSCatalogMetastoreClient.add_partitions(AWSCatalogMetastoreClient.java:339)
	at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
	at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
	at java.lang.reflect.Method.invoke(Method.java:498)
	at org.apache.hadoop.hive.metastore.HiveMetaStoreClient$SynchronizedHandler.invoke(HiveMetaStoreClient.java:2350)
	at com.sun.proxy.$Proxy131.add_partitions(Unknown Source)
	at org.apache.hudi.hive.ddl.HMSDDLExecutor.addPartitionsToTable(HMSDDLExecutor.java:199)
	at org.apache.hudi.hive.HoodieHiveClient.addPartitionsToTable(HoodieHiveClient.java:98)
	at org.apache.hudi.hive.HiveSyncTool.syncPartitions(HiveSyncTool.java:397)
	... 71 more
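For what it's worth, a minimal self-contained sketch of the kind of check that appears to be failing here: the request is rejected when the number of partition values derived from a slash-formatted partition path differs from the number of partition columns registered on the catalog table. The names and values below are hypothetical and only illustrate the mismatch; this does not reproduce the Glue client itself.

```java
import java.util.Arrays;
import java.util.List;

public class PartitionMismatchIllustration {
    public static void main(String[] args) {
        // One partition column registered on the catalog table (hypothetical name).
        List<String> partitionKeys = Arrays.asList("ts");

        // A slash-formatted partition path (yyyy/MM/dd) splits into three values.
        String partitionPath = "2022/04/15";  // hypothetical
        List<String> partitionValues = Arrays.asList(partitionPath.split("/"));

        // Analogue of the precondition reported in the stack trace above.
        if (partitionKeys.size() != partitionValues.size()) {
            throw new IllegalArgumentException(
                "Number of table partition keys must match number of partition values");
        }
    }
}
```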