[HUDI-3457] Refactored Spark DataSource Relations to avoid code duplication #4877
Merged

Changes from all commits (33 commits)
edeea57  Abstracted & unified `buildScan` functionality for COW/MOR Relations;
b5cf9f0  `BaseFileOnlyViewRelation` > `BaseFileRelation`
787b6a3  Fixing compilation
d5d3a3a  Extracted common converter utils to `HoodieCommonUtils`;
1bf0933  Abstracted common functionality;
bc639ed  Extracted common functionality to lists latest base files into `Hoodi…
c86bba7  Streamlined `MergeOnReadSnapshotRelation` to re-use common functional…
70356e5  Killing dead code;
0b2d604  Further simplified `MergeOnReadSnapshotRelation`
1bc09d3  `lint`
f8aa085  Cleaned up & streamlined `MergeOnReadIncrementalRelation`
b9fa316  Tidying up
804bb96  Extract most of the incremental-specific aspects into a trait that co…
899db46  Fixing compilation
48af420  Cleaning up unnecessary filtering
6027652  After rebase fixes
1d45bf0  Scaffolded `HoodieInMemoryFileIndex` and replicated `HoodieHadoopFSUt…
b7a4f8b  Fixed usages
40b0c05  Moved tests
0ab3b9b  Missing licenses
86e8fe3  Disabling linter
83bd0ea  Fixed compilation for Spark 2.x
fe8c7a8  Added missing scala-docs
dcd693d  Fixed incorrect casting
35eb6df  Fixed partition path handling for MOR Incremental Relation
eee5151  Fixed `HoodieIncrementalRelationTrait` to extend `HoodieBaseRelation`…
d85be0b  Handle the case when there are no commits to handle in Incremental Re…
f966aec  Return empty RDD in case there's no file-splits to handle
71b1435  Cleaned up `listLatestBaseFiles`
e39f963  Added TODO
f37854b  Fixing file handle leak
b0aa03e  Disabled vectorized reader to make sure MOR Incremental Relation work…
40e5a85  Fixed Parquet column-projection tests
hudi-client/hudi-spark-client/src/main/scala/org/apache/hudi/HoodieConversionUtils.scala
29 changes: 29 additions & 0 deletions
```scala
/*
 * Licensed to the Apache Software Foundation (ASF) under one
 * or more contributor license agreements. See the NOTICE file
 * distributed with this work for additional information
 * regarding copyright ownership. The ASF licenses this file
 * to you under the Apache License, Version 2.0 (the
 * "License"); you may not use this file except in compliance
 * with the License. You may obtain a copy of the License at
 *
 *      http://www.apache.org/licenses/LICENSE-2.0
 *
 * Unless required by applicable law or agreed to in writing, software
 * distributed under the License is distributed on an "AS IS" BASIS,
 * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
 * See the License for the specific language governing permissions and
 * limitations under the License.
 */

package org.apache.hudi

object HoodieConversionUtils {

  def toJavaOption[T](opt: Option[T]): org.apache.hudi.common.util.Option[T] =
    if (opt.isDefined) org.apache.hudi.common.util.Option.of(opt.get) else org.apache.hudi.common.util.Option.empty()

  def toScalaOption[T](opt: org.apache.hudi.common.util.Option[T]): Option[T] =
    if (opt.isPresent) Some(opt.get) else None
}
```
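For reference, a minimal usage sketch of the two converters at the Scala/Java boundary (the string value is illustrative):

```scala
import org.apache.hudi.HoodieConversionUtils.{toJavaOption, toScalaOption}

// Scala Option -> Hudi's Java-style Option, e.g. when calling into hudi-common APIs
val javaOpt: org.apache.hudi.common.util.Option[String] = toJavaOption(Some("2022/02/22"))

// Hudi Option -> Scala Option, e.g. when surfacing hudi-common results back to Scala code
val scalaOpt: Option[String] = toScalaOption(javaOpt)

assert(scalaOpt.contains("2022/02/22"))
```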
...rk-datasource/hudi-spark-common/src/main/scala/org/apache/hudi/BaseFileOnlyRelation.scala
94 changes: 94 additions & 0 deletions
```scala
/*
 * Licensed to the Apache Software Foundation (ASF) under one
 * or more contributor license agreements. See the NOTICE file
 * distributed with this work for additional information
 * regarding copyright ownership. The ASF licenses this file
 * to you under the Apache License, Version 2.0 (the
 * "License"); you may not use this file except in compliance
 * with the License. You may obtain a copy of the License at
 *
 *      http://www.apache.org/licenses/LICENSE-2.0
 *
 * Unless required by applicable law or agreed to in writing, software
 * distributed under the License is distributed on an "AS IS" BASIS,
 * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
 * See the License for the specific language governing permissions and
 * limitations under the License.
 */

package org.apache.hudi

import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.Path
import org.apache.hudi.HoodieBaseRelation.createBaseFileReader
import org.apache.hudi.common.table.HoodieTableMetaClient
import org.apache.spark.sql.{HoodieCatalystExpressionUtils, SQLContext}
import org.apache.spark.sql.catalyst.InternalRow
import org.apache.spark.sql.catalyst.expressions.Expression
import org.apache.spark.sql.execution.datasources._
import org.apache.spark.sql.sources.{BaseRelation, Filter}
import org.apache.spark.sql.types.StructType

/**
 * [[BaseRelation]] implementation reading only the base files of Hudi tables, essentially supporting the following
 * querying modes:
 * <ul>
 *   <li>For COW tables: Snapshot</li>
 *   <li>For MOR tables: Read-optimized</li>
 * </ul>
 *
 * NOTE: The reason this Relation is used in lieu of Spark's default [[HadoopFsRelation]] is primarily the fact
 * that it injects the real partition's path as the value of the partition field, which Hudi ultimately persists
 * as part of the record payload. In some cases, however, the partition path might not be equal to the verbatim
 * value of the partition-path field (when a custom [[KeyGenerator]] is used), therefore leading to incorrect
 * partition field values being written.
 */
class BaseFileOnlyRelation(sqlContext: SQLContext,
                           metaClient: HoodieTableMetaClient,
                           optParams: Map[String, String],
                           userSchema: Option[StructType],
                           globPaths: Seq[Path])
  extends HoodieBaseRelation(sqlContext, metaClient, optParams, userSchema) with SparkAdapterSupport {

  override type FileSplit = HoodieBaseFileSplit

  protected override def composeRDD(fileSplits: Seq[HoodieBaseFileSplit],
                                    partitionSchema: StructType,
                                    tableSchema: HoodieTableSchema,
                                    requiredSchema: HoodieTableSchema,
                                    filters: Array[Filter]): HoodieUnsafeRDD = {
    val baseFileReader = createBaseFileReader(
      spark = sparkSession,
      partitionSchema = partitionSchema,
      tableSchema = tableSchema,
      requiredSchema = requiredSchema,
      filters = filters,
      options = optParams,
      // NOTE: We have to fork the Hadoop Config here as Spark will be modifying it
      //       to configure the Parquet reader appropriately
      hadoopConf = new Configuration(conf)
    )

    new HoodieFileScanRDD(sparkSession, baseFileReader, fileSplits)
  }

  protected def collectFileSplits(partitionFilters: Seq[Expression], dataFilters: Seq[Expression]): Seq[HoodieBaseFileSplit] = {
    val partitions = listLatestBaseFiles(globPaths, partitionFilters, dataFilters)
    val fileSplits = partitions.values.toSeq.flatMap { files =>
      files.flatMap { file =>
        // TODO move to adapter
        // TODO fix, currently assuming parquet as underlying format
        HoodieDataSourceHelper.splitFiles(
          sparkSession = sparkSession,
          file = file,
          // TODO clarify why this is required
          partitionValues = InternalRow.empty
        )
      }
    }

    val maxSplitBytes = sparkSession.sessionState.conf.filesMaxPartitionBytes

    sparkAdapter.getFilePartitions(sparkSession, fileSplits, maxSplitBytes).map(HoodieBaseFileSplit.apply)
  }
}
```
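To place this relation in context, below is a sketch of the kind of query it serves: a read-optimized read of a MOR table (or a plain snapshot read of a COW table) through the Hudi DataSource. The table path is illustrative; the option key and value follow Hudi's `hoodie.datasource.query.type` config:

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("hudi-base-file-only-read")
  .getOrCreate()

// A read-optimized query scans only the latest base (Parquet) files and skips log files,
// which is exactly the scan BaseFileOnlyRelation implements; the path is illustrative.
val df = spark.read
  .format("hudi")
  .option("hoodie.datasource.query.type", "read_optimized")
  .load("/tmp/hudi_trips_mor")

df.show()
```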
...atasource/hudi-spark-common/src/main/scala/org/apache/hudi/BaseFileOnlyViewRelation.scala
141 changes: 0 additions & 141 deletions
This file was deleted.