[HUDI-5301] Spark SQL queries support setting parameters through set #7339
Changes from all commits: 063ead2, adef447, d63ed06, 8c2a468, e033aaf
Changes in `DataSourceReadOptions`:

```diff
@@ -62,6 +62,11 @@ object DataSourceReadOptions {
     "(or) Read Optimized mode (obtain latest view, based on base files) (or) Snapshot mode " +
     "(obtain latest view, by merging base and (if any) log files)")
 
+  val QUERY_USE_DATABASE: ConfigProperty[Boolean] = ConfigProperty
+    .key("hoodie.query.use.database")
+    .defaultValue(false)
+    .withDocumentation("Whether to add database name to qualify table name when setting parameters in Spark SQL query")
+
   val INCREMENTAL_FORMAT_LATEST_STATE_VAL = "latest_state"
   val INCREMENTAL_FORMAT_CDC_VAL = "cdc"
   val INCREMENTAL_FORMAT: ConfigProperty[String] = ConfigProperty
```

**Contributor (xiarixiaoyao):** Does this modification have something to do with the PR title?

**Contributor (Author):** @xiarixiaoyao The title does not call this out because the `set`-parameter form of querying was not supported before this PR. Adding this option mainly keeps Spark SQL consistent with the Hive incremental-query option `HoodieHiveUtils.HOODIE_INCREMENTAL_USE_DATABASE`, chiefly for the case where different databases contain tables with the same name. It is not described in detail in the PR because it was uncertain whether the community would approve this form of query; if necessary, I can add a detailed description to the PR. Also, the test cases currently cover only incremental queries, not other query types; if necessary, I can add more detailed test cases.
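As a rough illustration of what the new option controls, here is a minimal, hypothetical sketch (not the PR's actual implementation) of resolving a per-table `set` parameter from the Spark session, with and without database qualification; `lookupTableConfig` and its signature are invented for this example, while the key shapes follow the query-type keys used in the tests below:

```scala
import org.apache.spark.sql.SparkSession

// Hypothetical helper: resolve a per-table query option from the session conf.
// Assumed key shapes: hoodie.<table>.<suffix> when the option is off,
// hoodie.<db>.<table>.<suffix> when hoodie.query.use.database = true.
def lookupTableConfig(spark: SparkSession, db: String, table: String, suffix: String): Option[String] = {
  val useDatabase = spark.conf.get("hoodie.query.use.database", "false").toBoolean
  val key =
    if (useDatabase) s"hoodie.$db.$table.$suffix"
    else s"hoodie.$table.$suffix"
  // Qualified keys keep identically named tables in different databases independent.
  spark.conf.getOption(key)
}

// e.g. lookupTableConfig(spark, "db1", "table1", "datasource.query.type")
```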
New test class `TestQueryTable` (new file, @@ -0,0 +1,204 @@):

```scala
/*
 * Licensed to the Apache Software Foundation (ASF) under one or more
 * contributor license agreements. See the NOTICE file distributed with
 * this work for additional information regarding copyright ownership.
 * The ASF licenses this file to You under the Apache License, Version 2.0
 * (the "License"); you may not use this file except in compliance with
 * the License. You may obtain a copy of the License at
 *
 *    http://www.apache.org/licenses/LICENSE-2.0
 *
 * Unless required by applicable law or agreed to in writing, software
 * distributed under the License is distributed on an "AS IS" BASIS,
 * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
 * See the License for the specific language governing permissions and
 * limitations under the License.
 */

package org.apache.spark.sql.hudi

class TestQueryTable extends HoodieSparkSqlTestBase {

  test("Test incremental query with set parameters") {
    val tableName = generateTableName

    spark.sql(
      s"""
         |create table $tableName (
         |  id int,
         |  name string,
         |  price double,
         |  ts long,
         |  dt string
         |) using hudi
         | partitioned by (dt)
         | options (
         |  primaryKey = 'id',
         |  preCombineField = 'ts',
         |  type = 'cow'
         | )
         |""".stripMargin)
    spark.sql(s"insert into $tableName values (1, 'a1', 10, 1000, '2022-11-25')")
    spark.sql(s"insert into $tableName values (2, 'a2', 20, 2000, '2022-11-25')")
    spark.sql(s"insert into $tableName values (3, 'a3', 30, 3000, '2022-11-26')")
    spark.sql(s"insert into $tableName values (4, 'a4', 40, 4000, '2022-12-26')")
    spark.sql(s"insert into $tableName values (5, 'a5', 50, 5000, '2022-12-27')")

    checkAnswer(s"select id, name, price, ts, dt from $tableName")(
      Seq(1, "a1", 10.0, 1000, "2022-11-25"),
      Seq(2, "a2", 20.0, 2000, "2022-11-25"),
      Seq(3, "a3", 30.0, 3000, "2022-11-26"),
      Seq(4, "a4", 40.0, 4000, "2022-12-26"),
      Seq(5, "a5", 50.0, 5000, "2022-12-27")
    )

    import spark.implicits._
    // show_commits returns commits latest-first; pick the second-earliest instant.
    val commits = spark.sql(s"call show_commits(table => '$tableName')")
      .select("commit_time").map(k => k.getString(0)).take(10)
    val beginTime = commits(commits.length - 2)

    // Unqualified table name: the setting applies to every table named $tableName.
    spark.sql(s"set hoodie.$tableName.datasource.query.type = incremental")
    spark.sql(s"set hoodie.$tableName.datasource.read.begin.instanttime = $beginTime")
    spark.sql(s"refresh table $tableName")
    checkAnswer(s"select id, name, price, ts, dt from $tableName")(
      Seq(3, "a3", 30.0, 3000, "2022-11-26"),
      Seq(4, "a4", 40.0, 4000, "2022-12-26"),
      Seq(5, "a5", 50.0, 5000, "2022-12-27")
    )

    // Once database qualification is required, the unqualified keys above no
    // longer match, so the query falls back to a snapshot read of all 5 rows.
    spark.sql(s"set hoodie.query.use.database = true")
    spark.sql(s"refresh table $tableName")
    val cnt = spark.sql(s"select * from $tableName").count()
    assertResult(5)(cnt)

    // Database-qualified keys (default database) take effect again.
    spark.sql(s"set hoodie.default.$tableName.datasource.query.type = incremental")
    spark.sql(s"set hoodie.default.$tableName.datasource.read.begin.instanttime = $beginTime")
    val endTime = commits(1) // second-latest commit
    spark.sql(s"set hoodie.default.$tableName.datasource.read.end.instanttime = $endTime")
    spark.sql(s"refresh table $tableName")
    checkAnswer(s"select id, name, price, ts, dt from $tableName")(
      Seq(3, "a3", 30.0, 3000, "2022-11-26"),
      Seq(4, "a4", 40.0, 4000, "2022-12-26")
    )

    // Restrict the incremental results to matching partition paths.
    spark.sql(s"set hoodie.default.$tableName.datasource.read.incr.path.glob = /dt=2022-11*/*")
    spark.sql(s"refresh table $tableName")
    checkAnswer(s"select id, name, price, ts, dt from $tableName")(
      Seq(3, "a3", 30.0, 3000, "2022-11-26")
    )

    // Reset session configs so later tests are unaffected.
    spark.conf.unset("hoodie.query.use.database")
    spark.conf.unset(s"hoodie.$tableName.datasource.query.type")
    spark.conf.unset(s"hoodie.default.$tableName.datasource.query.type")
  }

  test("Test snapshot query with set parameters") {
    val tableName = generateTableName

    spark.sql(
      s"""
         |create table $tableName (
         |  id int,
         |  name string,
         |  price double,
         |  ts long,
         |  dt string
         |) using hudi
         | partitioned by (dt)
         | options (
         |  primaryKey = 'id',
         |  preCombineField = 'ts',
         |  type = 'cow'
         | )
         |""".stripMargin)

    spark.sql(s"insert into $tableName values (1, 'a1', 10, 1000, '2022-11-25')")
    spark.sql(s"insert into $tableName values (2, 'a2', 20, 2000, '2022-11-25')")
    spark.sql(s"update $tableName set price = 22 where id = 2")

    checkAnswer(s"select id, name, price, ts, dt from $tableName")(
      Seq(1, "a1", 10.0, 1000, "2022-11-25"),
      Seq(2, "a2", 22.0, 2000, "2022-11-25")
    )

    import spark.implicits._
    val commits = spark.sql(s"call show_commits(table => '$tableName')")
      .select("commit_time").map(k => k.getString(0)).take(10)
    val beginTime = commits(commits.length - 2) // instant of the second insert, before the update

    // Time travel to before the update, via an unqualified key.
    spark.sql(s"set $tableName.as.of.instant = $beginTime")
    spark.sql(s"refresh table $tableName")
    checkAnswer(s"select id, name, price, ts, dt from $tableName where id = 2")(
      Seq(2, "a2", 20.0, 2000, "2022-11-25")
    )

    // Requiring database qualification disables the unqualified key above,
    // so the query returns the latest snapshot again.
    spark.sql(s"set hoodie.query.use.database = true")
    spark.sql(s"refresh table $tableName")
    checkAnswer(s"select id, name, price, ts, dt from $tableName where id = 2")(
      Seq(2, "a2", 22.0, 2000, "2022-11-25")
    )

    // The database-qualified key takes effect again.
    spark.sql(s"set default.$tableName.as.of.instant = $beginTime")
    spark.sql(s"refresh table $tableName")
    checkAnswer(s"select id, name, price, ts, dt from $tableName where id = 2")(
      Seq(2, "a2", 20.0, 2000, "2022-11-25")
    )

    // Reset session configs so later tests are unaffected.
    spark.conf.unset("hoodie.query.use.database")
    spark.conf.unset(s"$tableName.as.of.instant")
    spark.conf.unset(s"default.$tableName.as.of.instant")
  }

  test("Test read_optimized query with set parameters") {
    val tableName = generateTableName

    spark.sql(
      s"""
         |create table $tableName (
         |  id int,
         |  name string,
         |  price double,
         |  ts long,
         |  dt string
         |) using hudi
         | partitioned by (dt)
         | options (
         |  primaryKey = 'id',
         |  preCombineField = 'ts',
         |  type = 'mor'
         | )
         |""".stripMargin)

    spark.sql(s"insert into $tableName values (1, 'a1', 10, 1000, '2022-11-25')")
    spark.sql(s"insert into $tableName values (2, 'a2', 20, 2000, '2022-11-25')")
    spark.sql(s"update $tableName set price = 22 where id = 2")

    checkAnswer(s"select id, name, price, ts, dt from $tableName")(
      Seq(1, "a1", 10.0, 1000, "2022-11-25"),
      Seq(2, "a2", 22.0, 2000, "2022-11-25")
    )

    // Read optimized skips the MOR log files holding the update (unqualified key).
    spark.sql(s"set hoodie.$tableName.datasource.query.type = read_optimized")
    spark.sql(s"refresh table $tableName")
    checkAnswer(s"select id, name, price, ts, dt from $tableName")(
      Seq(1, "a1", 10.0, 1000, "2022-11-25"),
      Seq(2, "a2", 20.0, 2000, "2022-11-25")
    )

    // Requiring database qualification disables the unqualified key: snapshot again.
    spark.sql(s"set hoodie.query.use.database = true")
    spark.sql(s"refresh table $tableName")
    checkAnswer(s"select id, name, price, ts, dt from $tableName")(
      Seq(1, "a1", 10.0, 1000, "2022-11-25"),
      Seq(2, "a2", 22.0, 2000, "2022-11-25")
    )

    // Database-qualified key: read optimized again.
    spark.sql(s"set hoodie.default.$tableName.datasource.query.type = read_optimized")
    spark.sql(s"refresh table $tableName")
    checkAnswer(s"select id, name, price, ts, dt from $tableName")(
      Seq(1, "a1", 10.0, 1000, "2022-11-25"),
      Seq(2, "a2", 20.0, 2000, "2022-11-25")
    )

    // Reset session configs so later tests are unaffected.
    spark.conf.unset("hoodie.query.use.database")
    spark.conf.unset(s"hoodie.$tableName.datasource.query.type")
    spark.conf.unset(s"hoodie.default.$tableName.datasource.query.type")
  }
}
```
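For reference, the user-facing flow these tests exercise boils down to: `set` a per-table parameter (database-qualified when `hoodie.query.use.database` is enabled), `refresh` the table, then query. A condensed sketch, where `db1`/`db2`/`table1` and the instant time are placeholders:

```scala
// Condensed usage sketch; table names and the instant time are placeholders.
spark.sql("set hoodie.query.use.database = true")
spark.sql("set hoodie.db1.table1.datasource.query.type = incremental")
spark.sql("set hoodie.db1.table1.datasource.read.begin.instanttime = 20221125000000")
spark.sql("refresh table db1.table1")          // pick up the new session settings
spark.sql("select * from db1.table1").show()   // incremental read of db1.table1
spark.sql("select * from db2.table1").show()   // db2.table1 keeps its snapshot view
```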
**Contributor (leesf):** Sorry, I am a little confused about this config and its use case.
**Contributor (Author):** @leesf The configuration is exercised in the test cases. The main scenario is two databases holding tables with the same name, say `db1.table1` and `db2.table1`, queried in the same session, where I only want to set incremental-query parameters for `db1.table1`. With an unqualified key, even though I only intend to query `db1.table1` incrementally, a query against `db2.table1` also runs incrementally, which is not the expected effect. Hence this parameter: when it is enabled, the database-qualified key applies the incremental query to `db1.table1` only. It defaults to false, consistent with the Hive incremental-query parameter.
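The original comment's inline code blocks did not survive scraping; reconstructed for illustration (placeholder names `db1`/`table1`, keys following the ones used in the tests above):

```scala
// Unqualified key: matches ANY table named table1, so a query against
// db2.table1 in the same session would unexpectedly run incrementally too.
spark.sql("set hoodie.table1.datasource.query.type = incremental")

// With database qualification enabled, only db1.table1 is affected.
spark.sql("set hoodie.query.use.database = true")
spark.sql("set hoodie.db1.table1.datasource.query.type = incremental")
```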
**Contributor (Author):** The Hive incremental-query PR: #4083
**Contributor:** If it only affects incremental queries, maybe `hoodie.query.incremental.database` is a better name? Or does it also affect other query types? Then we would need to add more test cases.
**Contributor (Author):** It also affects other types of queries. I can add test cases for the other query types.
**Contributor (Author):** @leesf Hello, I have added test cases for the other query types.