
Conversation

@XuQianJin-Stars
Contributor

@XuQianJin-Stars XuQianJin-Stars commented Jan 30, 2022

Tips

What is the purpose of the pull request

Support querying a table as of a savepoint
link: HUDI-3221
Support Spark Version:

version support
2.4.x No
3.1.2 No
3.2.0 Yes
3.0.x No

Brief change log

  1. HoodieSpark3_2ExtendedSqlAstBuilder is forked from org.apache.spark.sql.catalyst.parser.AstBuilder, with comments marking the copied code, plus an additional withTimeTravel method.
  2. SqlBase.g4 is forked from the Spark parser, with comments marking the copied grammar, adding the Spark SQL syntax TIMESTAMP AS OF and VERSION AS OF.
  3. UT tests: TestTimeTravelParser, TestTimeTravelTable.
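For illustration, the added syntax can be exercised from a Spark 3.2 shell roughly as follows (a minimal sketch; the table name and the timestamp/version values are placeholders, not real commit instants):

```scala
// Sketch only: assumes a Spark 3.2 session with the Hudi Spark bundle on the
// classpath and an existing Hudi table named `hudi_tbl` (placeholder name).

// Query the table as of a point in time (TIMESTAMP AS OF).
spark.sql("select * from hudi_tbl timestamp as of '2022-01-30 12:00:00'").show()

// Query the table as of a specific commit instant (VERSION AS OF).
spark.sql("select * from hudi_tbl version as of '20220130120000000'").show()
```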

Verify this pull request

(Please pick either of the following options)

This pull request is a trivial rework / code cleanup without any test coverage.

(or)

This pull request is already covered by existing tests, such as (please describe tests).

(or)

This change added tests and can be verified as follows:

(example:)

  • Added integration tests for end-to-end.
  • Added HoodieClientWriteTest to verify the change.
  • Manually verified the change by running a job locally.

Committer checklist

  • Has a corresponding JIRA in PR title & commit

  • Commit message is descriptive of the change

  • CI is green

  • Necessary doc changes done or have another open PR

  • For large changes, please consider breaking it into sub-tasks under an umbrella JIRA.

@nsivabalan nsivabalan added the priority:critical Production degraded; pipelines stalled label Jan 31, 2022
@nsivabalan nsivabalan requested a review from xushiyan February 3, 2022 22:58
@nsivabalan nsivabalan added priority:blocker Production down; release blocker and removed priority:critical Production degraded; pipelines stalled labels Feb 8, 2022
@fedsp

fedsp commented Feb 9, 2022

@xushiyan, is there anything I can do from my side to help you with this PR?
I can help test using real datasets

@XuQianJin-Stars XuQianJin-Stars force-pushed the HUDI-3221 branch 2 times, most recently from 15931c0 to c9b323f on February 15, 2022 11:17
@nsivabalan nsivabalan removed their assignment Feb 16, 2022
@xushiyan
Member

> @xushiyan, is there anything I can do from my side to help you with this PR? I can help test using real datasets

hey @fedsp, thank you for offering to help. Feel free to build this branch against Spark 3.2 (using the maven profile spark3), test with your datasets, and post any results or feedback. That'd be of great help!

@fedsp

fedsp commented Feb 23, 2022

@xushiyan great! I will do this by today. I'm planning to use it on AWS Glue, which unfortunately only offers Spark 3.1 today. I know the Hudi documentation explicitly says the supported Spark version is 3.2, but is there any chance it will work on 3.1?

@XuQianJin-Stars
Contributor Author

> @xushiyan great! I will do this by today. I'm planning to use it on AWS Glue, which unfortunately only offers Spark 3.1 today. I know the Hudi documentation explicitly says the supported Spark version is 3.2, but is there any chance it will work on 3.1?

hi @fedsp, here is a multi-version PR you can test: #4885

@xushiyan
Member

@YannByron can you help review this too? thanks

@YannByron
Contributor

> @YannByron can you help review this too? thanks

As discussed with @XuQianJin-Stars, I'd prefer to make a separate PR based on a separate Spark environment using https://issues.apache.org/jira/browse/SPARK-37219, and we can merge that into hudi master once Spark 3.3 is released.

Member

@xushiyan xushiyan left a comment

it'll be helpful to keep notes in the hudi-spark-datasource README.md and make it clear which files are copied over from Spark with modifications, and are ready to be removed after upgrading to Spark 3.3.

Comment on lines 160 to 178
<version>${spark3.version}</version>
<version>${spark3.2.version}</version>
Member

can you clarify why this change? what if a user needs to build the project with the spark 3.1.x profile?

Contributor Author

> can you clarify why this change? what if a user needs to build the project with the spark 3.1.x profile?

The spark3.1.x profile uses <hudi.spark.module>hudi-spark3.1.x</hudi.spark.module> with the Spark 3.1.x version.
The spark3 profile uses <hudi.spark.module>hudi-spark3</hudi.spark.module> with Spark 3.2.0 (and above) versions.
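For illustration, the profile wiring being described might look roughly like this in the pom (a sketch under the assumptions stated above, not the actual hudi pom.xml):

```xml
<!-- Sketch only: illustrative profile wiring, not the real hudi pom.xml. -->
<profiles>
  <profile>
    <id>spark3</id>
    <properties>
      <spark3.version>3.2.0</spark3.version>
      <hudi.spark.module>hudi-spark3</hudi.spark.module>
    </properties>
  </profile>
  <profile>
    <id>spark3.1.x</id>
    <properties>
      <spark3.version>3.1.2</spark3.version>
      <hudi.spark.module>hudi-spark3.1.x</hudi.spark.module>
    </properties>
  </profile>
</profiles>
```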

Member

we should prefer a simpler set of properties to maintain: both spark3.version and hudi.spark.module change based on the spark profile. When spark3.version switches to 3.1.2, the module becomes hudi-spark3.1.x and effectively ignores hudi-spark3 here. So I don't think we need to introduce more properties here.


import org.apache.spark.sql.catalyst.expressions.{Attribute, Expression}

case class TimeTravelRelation(
Member

this looks identical to the one in hudi-spark. can we deduplicate?

pom.xml Outdated
<spark3.1.version>3.1.2</spark3.1.version>
<spark3.2.version>3.2.1</spark3.2.version>
Member

not sure why we need these new properties. spark3.version is always the default and points to the latest supported Spark 3, and we should build the project with the spark3.1.x profile if we want spark3.version to point to 3.1. can you clarify?

@xushiyan
Member

xushiyan commented Mar 6, 2022

> @xushiyan great! I will do this by today. I'm planning to use it on AWS Glue, which unfortunately only offers Spark 3.1 today. I know the Hudi documentation explicitly says the supported Spark version is 3.2, but is there any chance it will work on 3.1?

This PR is ready for testing with the TIMESTAMP AS OF syntax to fetch older commits' values. @fedsp any chance you have tested this branch out? we'd be happy to land this with some real-world verification. :)

@fedsp

fedsp commented Mar 6, 2022

Hi @xushiyan! Sorry for the slow response, I had some trouble building the .jar (I'm not used to doing it)

Anyway, I'm receiving an error from a plain SQL statement run from AWS Athena (no problem creating the table on AWS Glue using Spark 3, though)

This is the error:
[screenshot of the Athena error message]

And this is the query that I used on athena:
select * from hudi_0_11_tst TIMESTAMP AS OF '2022-03-06 15:30:58'

and this is the DDL of the table creation (I created the table by hand, not via the Spark write operation):

CREATE EXTERNAL TABLE hudi_0_11_tst(
    `tpep_pickup_datetime` timestamp,
    `tpep_dropoff_datetime` timestamp,
    `passenger_count` int,
    `trip_distance` double,
    `ratecodeid` int,
    `store_and_fwd_flag` string,
    `pulocationid` int,
    `dolocationid` int,
    `payment_type` int,
    `fare_amount` double,
    `extra` double,
    `mta_tax` double,
    `tip_amount` double,
    `tolls_amount` double,
    `improvement_surcharge` double,
    `total_amount` double,
    `congestion_surcharge` double,
    `pk_col` bigint
)
PARTITIONED BY (
    `vendorid` string)
ROW FORMAT SERDE
    'org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe'
WITH SERDEPROPERTIES ( 
    'path'='s3://mybuckethudi_0_11_tst')
STORED AS INPUTFORMAT
    'org.apache.hudi.hadoop.HoodieParquetInputFormat'
OUTPUTFORMAT
    'org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat'
LOCATION
    's3://mybucket/hudi_0_11_tst'

Did I do something wrong?

@XuQianJin-Stars
Contributor Author

hi @fedsp, is your Spark version 3.2.x?

@fedsp

fedsp commented Mar 7, 2022

No, @XuQianJin-Stars, unfortunately it is 3.1, since I am limited to the AWS Glue environment. But I used your branch #4885

@fedsp

fedsp commented Mar 7, 2022

Also please note that this error is not from a Spark environment, but from AWS Athena, which uses prestodb as an engine.

Is there any additional setup?

@XuQianJin-Stars
Contributor Author

> Also please note that this error is not from a Spark environment, but from AWS Athena, which uses prestodb as an engine.
>
> Is there any additional setup?

prestodb also needs to support this syntax.

@xushiyan
Member

xushiyan commented Mar 7, 2022

@fedsp let me clarify: the time travel query here is supported in Spark SQL with Spark 3.2+, but not yet in other query engines like Presto. So you won't be able to run this in Athena. Are you able to verify the branch by running Spark SQL against your datasets?

@fedsp

fedsp commented Mar 7, 2022

Thank you for the clarification @xushiyan!
If you are available, I can test right now over screen sharing and we can see the results live


LogicalRelation(dataSource.resolveRelation(checkFilesExist = false), table)
} else {
plan
Contributor

if it is not a Hoodie table, it should return the original object: l.

* @since 1.3.0
*/
@Stable
class AnalysisException protected[sql](
Contributor

why do we need to define this instead of using Spark's own AnalysisException?

/**
* The adapter for spark3.2.
*/
class Spark3_2Adapter extends SparkAdapter {
Contributor

can you make Spark3_2Adapter extend Spark3Adapter, and only override isRelationTimeTravel and getRelationTimeTravel?
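A minimal sketch of the suggested inheritance (the trait and method signatures here are assumptions for illustration; the real SparkAdapter API may differ):

```scala
import org.apache.spark.sql.catalyst.plans.logical.LogicalPlan

// Sketch only: hypothetical signatures, not Hudi's actual SparkAdapter trait.
trait SparkAdapter {
  def isRelationTimeTravel(plan: LogicalPlan): Boolean
  def getRelationTimeTravel(plan: LogicalPlan): Option[LogicalPlan]
}

// Shared Spark 3.x behavior: no time travel by default.
class Spark3Adapter extends SparkAdapter {
  def isRelationTimeTravel(plan: LogicalPlan): Boolean = false
  def getRelationTimeTravel(plan: LogicalPlan): Option[LogicalPlan] = None
}

// Spark 3.2 overrides only the time-travel hooks (bodies elided).
class Spark3_2Adapter extends Spark3Adapter {
  override def isRelationTimeTravel(plan: LogicalPlan): Boolean = ???
  override def getRelationTimeTravel(plan: LogicalPlan): Option[LogicalPlan] = ???
}
```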

@fedsp

fedsp commented Mar 7, 2022

Now, from a Spark context (Glue context), I tried the following PySpark command:

df = spark_context.sql("SELECT * FROM SANDBOX.hudi_0_11_tst TIMESTAMP AS OF '2022-03-06 15:30:58'")

and it gave me the following error:

File "/opt/amazon/spark/python/lib/pyspark.zip/pyspark/sql/context.py", line 433, in sql
    return self.sparkSession.sql(sqlQuery)
  File "/opt/amazon/spark/python/lib/pyspark.zip/pyspark/sql/session.py", line 723, in sql
    return DataFrame(self._jsparkSession.sql(sqlQuery), self._wrapped)
  File "/opt/amazon/spark/python/lib/py4j-0.10.9-src.zip/py4j/java_gateway.py", line 1305, in __call__
    answer, self.gateway_client, self.target_id, self.name)
  File "/opt/amazon/spark/python/lib/pyspark.zip/pyspark/sql/utils.py", line 117, in deco
    raise converted from None
pyspark.sql.utils.ParseException: 
mismatched input 'AS' expecting {<EOF>, ';'}(line 1, pos 46)

== SQL ==
SELECT * FROM SANDBOX.hudi_0_11_tst TIMESTAMP AS OF '2022-03-06 15:30:58'
----------------------------------------------^^^

@XuQianJin-Stars
Contributor Author

@hudi-bot run azure

@apache apache deleted a comment from hudi-bot Mar 8, 2022
@hudi-bot
Collaborator

hudi-bot commented Mar 8, 2022

CI report:

@hudi-bot supports the following commands:
  • @hudi-bot run azure: re-run the last Azure build

@xushiyan
Member

xushiyan commented Mar 8, 2022

@fedsp it's a Spark bundle or version mismatch problem where the syntax is not recognized. Maybe a previous version of the branch had some misconfiguration, but now it's resolved. I verified the feature in Spark 3.2.1:

➜ mvn -T 2.5C clean install -DskipTests -Djacoco.skip=true -Dmaven.javadoc.skip=true -Dcheckstyle.skip=true -Dscala-2.12 -Dspark3

// test using packaging/hudi-spark-bundle/target/hudi-spark3.2.1-bundle_2.12-0.11.0-SNAPSHOT.jar

// Spark 3.2.1
// COW
spark.sql("""
create table hudi_cow_pt_tbl (
  id bigint,
  name string,
  ts bigint,
  dt string,
  hh string
) using hudi
tblproperties (
  type = 'cow',
  primaryKey = 'id',
  preCombineField = 'ts'
 )
partitioned by (dt, hh)
location '/tmp/hudi/hudi_cow_pt_tbl';
""")
spark.sql("insert into hudi_cow_pt_tbl select 1, 'a0', 1000, '2021-12-09', '10'")
spark.sql("select * from hudi_cow_pt_tbl").show()
spark.sql("insert into hudi_cow_pt_tbl select 1, 'a1', 1001, '2021-12-09', '10'")
spark.sql("select * from hudi_cow_pt_tbl").show()


// time travel based on first commit time
spark.sql("select * from hudi_cow_pt_tbl timestamp as of '20220308175415995' where id = 1").show()
// time travel based on different timestamp formats
spark.sql("select * from hudi_cow_pt_tbl timestamp as of '2022-03-08 17:54:15.995' where id = 1").show()
spark.sql("select * from hudi_cow_pt_tbl timestamp as of '2022-03-09' where id = 1").show()
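For reference, the readable forms above correspond to Hudi's compact commit-instant format (yyyyMMddHHmmssSSS); the conversion can be sketched as follows (the helper name is hypothetical):

```scala
import java.time.LocalDateTime
import java.time.format.DateTimeFormatter

// Sketch only: maps '2022-03-08 17:54:15.995' to '20220308175415995',
// matching the two equivalent time-travel queries above.
def toCommitInstant(ts: String): String = {
  val in  = DateTimeFormatter.ofPattern("yyyy-MM-dd HH:mm:ss.SSS")
  val out = DateTimeFormatter.ofPattern("yyyyMMddHHmmssSSS")
  LocalDateTime.parse(ts, in).format(out)
}
```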

@xushiyan xushiyan merged commit 08fd80c into apache:master Mar 8, 2022
vingov pushed a commit to vingov/hudi that referenced this pull request Apr 3, 2022
stayrascal pushed a commit to stayrascal/hudi that referenced this pull request Apr 12, 2022

Labels

priority:blocker Production down; release blocker


6 participants