@Zouxxyy
Contributor

@Zouxxyy Zouxxyy commented Aug 28, 2023

…Format

Change Logs

Currently, after performing schema evolution with spark-sql, querying the table with Hive fails:

-- spark-sql
set hoodie.schema.on.read.enable=true;

create table hudi_mor_test_tbl (
  id bigint,
  name string,
  ts int,
  dt string,
  hh string
) using hudi
tblproperties (
  type = 'mor',
  primaryKey = 'id',
  preCombineField = 'ts'
)
partitioned by (dt, hh);

insert into hudi_mor_test_tbl values (1, 'a1', 1001, '2021-12-09', '10');

ALTER TABLE hudi_mor_test_tbl ALTER COLUMN ts TYPE bigint;

-- hive
select * from hudi_mor_test_tbl_rt;

Failed with exception java.io.IOException:java.lang.ClassCastException: org.apache.hadoop.io.LongWritable cannot be cast to org.apache.hadoop.io.IntWritable

The root cause is that FileInputFormat does not implement Hive's SelfDescribingInputFormatInterface, which is defined as:

/**
 * Marker interface to indicate a given input format is self-describing and
 * can perform schema evolution itself.
 */
public interface SelfDescribingInputFormatInterface {

}
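
Because a marker interface declares no methods, it can only change behaviour through a type check made by the caller. The sketch below illustrates that pattern in plain Java. Every name in it (MarkerInterfaceSketch, SelfDescribing, PlainFormat, TaggedFormat, trustsFormatSchema, describe) is a hypothetical stand-in rather than a Hive or Hudi class, and the cast to Integer only mimics the IntWritable cast from the error above. It is an assumption about the general mechanism, not a reproduction of Hive's actual code path.

```java
public class MarkerInterfaceSketch {

    /** Hypothetical stand-in for a marker interface: no methods, only a tag. */
    interface SelfDescribing { }

    /** Hypothetical stand-in for an input format that yields one column value. */
    interface InputFormat {
        Object readValue();
    }

    /** Returns a long value but does not claim to handle schema evolution. */
    static class PlainFormat implements InputFormat {
        public Object readValue() { return 1001L; }
    }

    /** Returns the same value, but is tagged as self-describing. */
    static class TaggedFormat implements InputFormat, SelfDescribing {
        public Object readValue() { return 1001L; }
    }

    /** The caller inspects only the tag; the format implements no extra API. */
    static boolean trustsFormatSchema(InputFormat format) {
        return format instanceof SelfDescribing;
    }

    static String describe(InputFormat format) {
        Object value = format.readValue();
        if (trustsFormatSchema(format)) {
            // Tagged format: accept whatever type the reader produced.
            return value.getClass().getSimpleName() + ":" + value;
        }
        // Untagged format: the caller insists on the older declared type,
        // so a Long value fails here with ClassCastException.
        Integer asInt = (Integer) value;
        return "Integer:" + asInt;
    }

    public static void main(String[] args) {
        System.out.println(describe(new TaggedFormat()));
        try {
            describe(new PlainFormat());
        } catch (ClassCastException e) {
            System.out.println("untagged format failed: " + e.getMessage());
        }
    }
}
```

In this model, adding the tag is the whole change: the reader's output is identical in both classes, and only the caller's decision differs.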

Impact

After schema evolution is performed with spark-sql, queries from Hive succeed.

Risk level (write none, low, medium or high below)

none

Documentation Update

none

Contributor's checklist

  • Read through contributor's guide
  • Change Logs and Impact were stated clearly
  • Adequate tests were added if applicable
  • CI passed

@danny0405 danny0405 added the engine:hive Hive integration label Aug 28, 2023
@danny0405
Contributor

@Zouxxyy Can you elaborate a little more on the purpose of this change? Does it risk breaking compatibility with older Hive versions?

@Zouxxyy
Contributor Author

Zouxxyy commented Aug 28, 2023

@danny0405

Can you elaborate a little more on the purpose of this change?

See the updated Change Logs.

Does it risk breaking compatibility with older Hive versions?

This interface (SelfDescribingInputFormatInterface) has existed since Hive 2.0, so there is no compatibility problem.

@danny0405 danny0405 added schema-evolution area:schema Schema evolution and data types labels Aug 29, 2023
@Zouxxyy
Contributor Author

Zouxxyy commented Aug 30, 2023

@xushiyan @bvaradar Can someone help me understand why hudi-spark-common does not automatically pick up the hive-exec dependency declared in hudi-hadoop-mr?

 mvn dependency:tree -pl hudi-spark-datasource/hudi-spark-common -Dspark2

[INFO] --- maven-dependency-plugin:3.1.1:tree (default-cli) @ hudi-spark-common_2.12 ---
[INFO] org.apache.hudi:hudi-spark-common_2.12:jar:0.15.0-SNAPSHOT
[INFO] +- org.apache.hudi:hudi-hive-sync:jar:0.15.0-SNAPSHOT:compile
[INFO] |  +- org.apache.hudi:hudi-hadoop-mr:jar:0.15.0-SNAPSHOT:compile
[INFO] |  +- org.apache.hudi:hudi-sync-common:jar:0.15.0-SNAPSHOT:compile

@Zouxxyy
Contributor Author

Zouxxyy commented Sep 9, 2023

Here is the error from the integration tests. I don't know much about the integration-testing environment; can anyone help?

2023-09-08T05:11:59.7764700Z Caused by: java.lang.ClassNotFoundException: org.apache.hadoop.hive.ql.io.SelfDescribingInputFormatInterface
2023-09-08T05:11:59.7764906Z 	at java.net.URLClassLoader.findClass(URLClassLoader.java:382)
2023-09-08T05:11:59.7765092Z 	at java.lang.ClassLoader.loadClass(ClassLoader.java:424)
2023-09-08T05:11:59.7765284Z 	at java.lang.ClassLoader.loadClass(ClassLoader.java:357)
2023-09-08T05:11:59.7765373Z 	... 58 more
2023-09-08T05:11:59.7765560Z 23/09/08 05:11:59 INFO util.ShutdownHookManager: Shutdown hook called
2023-09-08T05:11:59.7766126Z 23/09/08 05:11:59 INFO util.ShutdownHookManager: Deleting directory /tmp/spark-b81218a3-32e6-4851-9b25-b15373acd05b
2023-09-08T05:11:59.7766507Z 23/09/08 05:11:59 INFO util.ShutdownHookManager: Deleting directory /tmp/spark-9b58a267-201d-4404-baeb-49e617b23ad1
2023-09-08T05:11:59.7766647Z 
2023-09-08T05:11:59.7766919Z Sep 08, 2023 5:11:59 AM org.glassfish.jersey.internal.Errors logErrors
2023-09-08T05:11:59.7767534Z WARNING: The following warnings have been detected: WARNING: Cannot create new registration for component type class com.fasterxml.jackson.jaxrs.json.JacksonJsonProvider: Existing previous registration found for the type.
2023-09-08T05:11:59.7767548Z 
2023-09-08T05:11:59.7768090Z [ERROR] Tests run: 1, Failures: 1, Errors: 0, Skipped: 0, Time elapsed: 96.75 s <<< FAILURE! - in org.apache.hudi.integ.command.ITTestHoodieSyncCommand

@yihua
Contributor

yihua commented Sep 14, 2023

@Zouxxyy you need to run the exact same commands as shown in the logs in the docker environment to debug the failed integration. It looks like the HiveSyncTool spark job fails due to class not found. Likely the new class, SelfDescribingInputFormatInterface, is not included in the bundle.
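
One quick way to test the class-not-found hypothesis is to ask the JVM whether the class can be located on a given classpath. The following is a minimal sketch; ClasspathCheck and isLoadable are hypothetical names introduced for illustration, and the only library call used is the standard Class.forName(String, boolean, ClassLoader).

```java
public class ClasspathCheck {

    /** Returns true if the named class can be located by this class's loader. */
    static boolean isLoadable(String className) {
        try {
            // initialize=false: only locate the class, do not run static initializers.
            Class.forName(className, false, ClasspathCheck.class.getClassLoader());
            return true;
        } catch (ClassNotFoundException | LinkageError e) {
            return false;
        }
    }

    public static void main(String[] args) {
        // Defaults to the class named in the integration-test stack trace.
        String name = args.length > 0
                ? args[0]
                : "org.apache.hadoop.hive.ql.io.SelfDescribingInputFormatInterface";
        System.out.println(name + " loadable: " + isLoadable(name));
    }
}
```

Running it with the same classpath as the failing job (for example, with the bundle jar on `-cp`) shows whether that classpath contains the interface.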

Also wondering, how does SelfDescribingInputFormatInterface automatically fix the schema evolution (I don't see any API implemented)?

@yihua
Contributor

yihua commented Sep 14, 2023

@xushiyan @bvaradar Can someone help me understand why hudi-spark-common does not automatically pick up the hive-exec dependency declared in hudi-hadoop-mr?

 mvn dependency:tree -pl hudi-spark-datasource/hudi-spark-common -Dspark2

[INFO] --- maven-dependency-plugin:3.1.1:tree (default-cli) @ hudi-spark-common_2.12 ---
[INFO] org.apache.hudi:hudi-spark-common_2.12:jar:0.15.0-SNAPSHOT
[INFO] +- org.apache.hudi:hudi-hive-sync:jar:0.15.0-SNAPSHOT:compile
[INFO] |  +- org.apache.hudi:hudi-hadoop-mr:jar:0.15.0-SNAPSHOT:compile
[INFO] |  +- org.apache.hudi:hudi-sync-common:jar:0.15.0-SNAPSHOT:compile

I think this dependency is already in place?

@Zouxxyy
Contributor Author

Zouxxyy commented Sep 15, 2023

@yihua See #7129; it turns out this question has already been raised there.
I think the reporter had the same problem as me: the Hive dependency is not passed on transitively.

@yihua
Contributor

yihua commented Sep 22, 2023

@Zouxxyy have you figured out why integration tests failed in the GH actions?

@bvaradar
Contributor

@Zouxxyy: Can you rebase and resolve the merge conflict? We can take a look at the test failure after that.

@bvaradar bvaradar self-assigned this Dec 15, 2023
@bvaradar
Contributor

Fixed conflicts and rebased. Made minor changes to align with latest code.

@Zouxxyy
Contributor Author

Zouxxyy commented Dec 21, 2023

@bvaradar Sorry for the delay, and thanks for your help. It seems that the CI is not stable.

@bvaradar
Contributor

Rerunning failed jobs

@github-actions github-actions bot added the size:S PR with lines of changes in (10, 100] label Feb 26, 2024
Labels

area:schema Schema evolution and data types engine:hive Hive integration size:S PR with lines of changes in (10, 100]

Projects

Status: 🏗 Under discussion
