[HUDI-3396] Refactoring MergeOnReadRDD to avoid duplication, fetch only projected columns (#4888)
Merged
Changes from all commits (37 commits):
3660a0b Rebased `HoodieUnsafeRDD` to become a trait;
7aa0b61 Refactored `HoodieMergeOnReadRDD` iterators to re-use iteration logic
390c47c Tidying up
04590d3 Tidying up (cont)
fdbcc70 Cleaned up MOR Relation data-flows
09b1778 Tidying up
271f7f4 Tidying up
c926bb6 Adding missing docs
bb60abb Decoupled Record Merging iterator from assuming particular schema of …
dc6f336 Removed superfluous `Option`, causing incorrect split processing
a90bf12 Added optimization to avoid full-schema base-file read in cases when
3132b10 Fixing compilation
45d8265 Replacing self-type restriction w/ sub-classing
db74831 Missing license
c1e7282 Fixing compilation
0e08367 After rebase fixes
6b036e6 Tidying up (after rebase)
3d80eab Extracted `DeltaLogSupport` trait encapsulating reading from Delta Logs
a6a9831 Fixed Merging Record iterator incorrectly projecting merged record
5a7648f Tidying up
f45e98d Inlined `DeltaLogSupport` trait
98658f8 Fixing NPE
2f63d1d Adjusted column projection tests to reflect the changes;
83e97b3 Tidying up
16e51f8 Allow non-partitioned table to rely on virtual-keys
f8ce13d Fixed tests
65d67ed Pushed down `mandatoryColumns` defs into individual Relations
b4b6d0f Clarified `requiredKeyField` semantic
692d96d Relaxed requirements to read only projected schema to only cases when…
e23b648 Fixed tests for Spark 2.x
762f555 Tidying up
16f6204 Fixed tests for Spark 3.x
f497df2 Revert some inadvertent changes
b146570 Clean up references to MT
73a645b Tidying up
7bfcb40 Inlined `maxCompactionMemoryInBytes` w/in `HoodieMergeOnReadRDD`
54be92b Tidying up
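The headline optimization, fetching only projected columns, boils down to reading the union of the query's requested columns and the relation's mandatory columns instead of the full table schema. Below is a minimal, self-contained sketch of that idea; the schema and column names are made up for illustration, and only the `requestedColumns ++ mandatoryColumns` composition reflects the diff further down:

```scala
import org.apache.spark.sql.types.{StringType, StructField, StructType}

object ProjectionSketch extends App {
  // Illustrative table schema (field names made up, except Hudi's
  // `_hoodie_record_key` meta-field)
  val tableSchema = StructType(Seq(
    StructField("_hoodie_record_key", StringType),
    StructField("ts", StringType),
    StructField("name", StringType),
    StructField("city", StringType)))

  // Columns the query asked for, plus the columns the relation always needs
  // (record key, pre-combine field) to be able to merge records
  val requestedColumns = Seq("name")
  val mandatoryColumns = Seq("_hoodie_record_key", "ts")

  // Read only the union of the two, not the full schema
  val projected = (requestedColumns ++ mandatoryColumns).distinct
  val prunedSchema = StructType(tableSchema.fields.filter(f => projected.contains(f.name)))

  println(prunedSchema.simpleString) // struct<_hoodie_record_key:string,ts:string,name:string>
}
```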
HoodieBaseRelation.scala

```diff
@@ -23,7 +23,7 @@ import org.apache.hadoop.conf.Configuration
 import org.apache.hadoop.fs.{FileStatus, Path}
 import org.apache.hadoop.hbase.io.hfile.CacheConfig
 import org.apache.hadoop.mapred.JobConf
-import org.apache.hudi.HoodieBaseRelation.{getPartitionPath, isMetadataTable}
+import org.apache.hudi.HoodieBaseRelation.getPartitionPath
 import org.apache.hudi.HoodieConversionUtils.toScalaOption
 import org.apache.hudi.common.config.SerializableConfiguration
 import org.apache.hudi.common.fs.FSUtils
```
```diff
@@ -32,8 +32,9 @@ import org.apache.hudi.common.table.timeline.{HoodieInstant, HoodieTimeline}
 import org.apache.hudi.common.table.view.HoodieTableFileSystemView
 import org.apache.hudi.common.table.{HoodieTableConfig, HoodieTableMetaClient, TableSchemaResolver}
 import org.apache.hudi.common.util.StringUtils
+import org.apache.hudi.common.util.ValidationUtils.checkState
 import org.apache.hudi.io.storage.HoodieHFileReader
-import org.apache.hudi.metadata.{HoodieMetadataPayload, HoodieTableMetadata}
+import org.apache.hudi.metadata.HoodieTableMetadata
 import org.apache.spark.execution.datasources.HoodieInMemoryFileIndex
 import org.apache.spark.internal.Logging
 import org.apache.spark.rdd.RDD
```
```diff
@@ -53,8 +54,12 @@ trait HoodieFileSplit {}
 
 case class HoodieTableSchema(structTypeSchema: StructType, avroSchemaStr: String)
 
-case class HoodieTableState(recordKeyField: String,
-                            preCombineFieldOpt: Option[String])
+case class HoodieTableState(tablePath: String,
+                            latestCommitTimestamp: String,
+                            recordKeyField: String,
+                            preCombineFieldOpt: Option[String],
+                            usesVirtualKeys: Boolean,
+                            recordPayloadClassName: String)
 
 /**
  * Hoodie BaseRelation which extends [[PrunedFilteredScan]].
```
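To make the role of the expanded state concrete, here is a small standalone sketch. The case class mirrors the one added above; the `loadPayloadClass` helper is hypothetical, not Hudi's actual API. The point is that the state is a plain serializable value an RDD closure can capture and use on executors:

```scala
object TableStateSketch {
  // Mirrors the expanded case class from the diff above
  case class HoodieTableState(tablePath: String,
                              latestCommitTimestamp: String,
                              recordKeyField: String,
                              preCombineFieldOpt: Option[String],
                              usesVirtualKeys: Boolean,
                              recordPayloadClassName: String)

  // Hypothetical helper (not Hudi's API): an executor-side reader can
  // reflectively load the payload class used to merge log records into base rows
  def loadPayloadClass(state: HoodieTableState): Class[_] =
    Class.forName(state.recordPayloadClassName)
}
```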
```diff
@@ -78,13 +83,30 @@ abstract class HoodieBaseRelation(val sqlContext: SQLContext,
 
   protected lazy val basePath: String = metaClient.getBasePath
 
+  // If meta fields are enabled, always prefer key from the meta field as opposed to user-specified one
+  // NOTE: This is historical behavior which is preserved as is
+  // NOTE: Record key-field is assumed singular here due to either of
+  //       - In case Hudi's meta fields are enabled: record key will be pre-materialized (stored) as part
+  //         of the record's payload (as part of the Hudi's metadata)
+  //       - In case Hudi's meta fields are disabled (virtual keys): in that case record has to bear _single field_
+  //         identified as its (unique) primary key w/in its payload (this is a limitation of [[SimpleKeyGenerator]],
+  //         which is the only [[KeyGenerator]] permitted for virtual-keys payloads)
   protected lazy val recordKeyField: String =
-    if (tableConfig.populateMetaFields()) HoodieRecord.RECORD_KEY_METADATA_FIELD
-    else tableConfig.getRecordKeyFieldProp
+    if (tableConfig.populateMetaFields()) {
+      HoodieRecord.RECORD_KEY_METADATA_FIELD
+    } else {
+      val keyFields = tableConfig.getRecordKeyFields.get()
+      checkState(keyFields.length == 1)
+      keyFields.head
+    }
 
-  protected lazy val preCombineFieldOpt: Option[String] = getPrecombineFieldProperty
+  protected lazy val preCombineFieldOpt: Option[String] =
+    Option(tableConfig.getPreCombineField)
+      .orElse(optParams.get(DataSourceWriteOptions.PRECOMBINE_FIELD.key)) match {
+      // NOTE: This is required to compensate for cases when empty string is used to stub
+      //       property value to avoid it being set with the default value
+      // TODO(HUDI-3456) cleanup
+      case Some(f) if !StringUtils.isNullOrEmpty(f) => Some(f)
+      case _ => None
+    }
 
   protected lazy val specifiedQueryTimestamp: Option[String] =
     optParams.get(DataSourceReadOptions.TIME_TRAVEL_AS_OF_INSTANT.key)
```

Review thread on the meta-fields notes above:

Member: same. can we avoid references to metadata table from this layer

Author (Contributor): This actually refers to metadata fields, not MT. Will amend
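The empty-string compensation in `preCombineFieldOpt` boils down to the following standalone logic. This is a minimal sketch; `resolvePreCombineField` and its parameters are illustrative names, not Hudi's API:

```scala
object PreCombineSketch extends App {
  // Minimal sketch of the resolution above: prefer the table config value,
  // fall back to the write option, and treat null/empty strings as "unset"
  def resolvePreCombineField(fromTableConfig: String, fromOptions: Option[String]): Option[String] =
    Option(fromTableConfig)
      .orElse(fromOptions)
      .filter(f => f != null && f.nonEmpty)

  println(resolvePreCombineField(null, Some("")))   // None: empty string stubs the property
  println(resolvePreCombineField(null, Some("ts"))) // Some(ts)
  println(resolvePreCombineField("ts", None))       // Some(ts)
}
```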
```diff
@@ -118,16 +140,14 @@ abstract class HoodieBaseRelation(val sqlContext: SQLContext,
     FileStatusCache.getOrCreate(sparkSession))
 
   /**
    * Columns that the relation has to read from storage to properly execute its semantic: for example,
    * for Merge-on-Read tables the key fields as well as the pre-combine field comprise the mandatory set
    * of columns, meaning that regardless of whether these columns are requested by the query they will
    * be fetched, so that the relation is able to combine records properly (if necessary)
    *
    * @VisibleInTests
    */
-  lazy val mandatoryColumns: Seq[String] = {
-    if (isMetadataTable(metaClient)) {
-      Seq(HoodieMetadataPayload.KEY_FIELD_NAME, HoodieMetadataPayload.SCHEMA_FIELD_NAME_TYPE)
-    } else {
-      // TODO this is MOR table requirement, not necessary for COW
-      Seq(recordKeyField) ++ preCombineFieldOpt.map(Seq(_)).getOrElse(Seq())
-    }
-  }
+  val mandatoryColumns: Seq[String]
 
   protected def timeline: HoodieTimeline =
     // NOTE: We're including compaction here since it's not considered a "commit" operation
```
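Per the commit "Pushed down `mandatoryColumns` defs into individual Relations", each concrete relation now supplies its own definition. A hedged sketch of what a MOR-style override might look like, based on the logic removed above; the class and member names other than `mandatoryColumns`, `recordKeyField`, and `preCombineFieldOpt` are illustrative stand-ins, not the PR's code:

```scala
// Hypothetical, simplified stand-ins for the relation hierarchy
abstract class BaseRelationSketch {
  protected def recordKeyField: String
  protected def preCombineFieldOpt: Option[String]
  // Abstract, as in the diff above: each concrete relation must define it
  val mandatoryColumns: Seq[String]
}

class MergeOnReadRelationSketch extends BaseRelationSketch {
  protected val recordKeyField: String = "_hoodie_record_key"
  protected val preCombineFieldOpt: Option[String] = Some("ts")
  // Same composition the removed base-class code used:
  // Seq(recordKeyField) ++ preCombineFieldOpt.map(Seq(_)).getOrElse(Seq())
  override lazy val mandatoryColumns: Seq[String] =
    Seq(recordKeyField) ++ preCombineFieldOpt.toSeq
}
```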
```diff
@@ -136,9 +156,8 @@ abstract class HoodieBaseRelation(val sqlContext: SQLContext,
   protected def latestInstant: Option[HoodieInstant] =
     toScalaOption(timeline.lastInstant())
 
-  protected def queryTimestamp: Option[String] = {
-    specifiedQueryTimestamp.orElse(toScalaOption(timeline.lastInstant()).map(i => i.getTimestamp))
-  }
+  protected def queryTimestamp: Option[String] =
+    specifiedQueryTimestamp.orElse(toScalaOption(timeline.lastInstant()).map(_.getTimestamp))
 
   override def schema: StructType = tableStructSchema
```
```diff
@@ -257,15 +276,17 @@ abstract class HoodieBaseRelation(val sqlContext: SQLContext,
     requestedColumns ++ missing
   }
 
-  private def getPrecombineFieldProperty: Option[String] =
-    Option(tableConfig.getPreCombineField)
-      .orElse(optParams.get(DataSourceWriteOptions.PRECOMBINE_FIELD.key)) match {
-      // NOTE: This is required to compensate for cases when empty string is used to stub
-      //       property value to avoid it being set with the default value
-      // TODO(HUDI-3456) cleanup
-      case Some(f) if !StringUtils.isNullOrEmpty(f) => Some(f)
-      case _ => None
-    }
+  protected def getTableState: HoodieTableState = {
+    // Subset of the state of the table's configuration as of the time of the query
+    HoodieTableState(
+      tablePath = basePath,
+      latestCommitTimestamp = queryTimestamp.get,
+      recordKeyField = recordKeyField,
+      preCombineFieldOpt = preCombineFieldOpt,
+      usesVirtualKeys = !tableConfig.populateMetaFields(),
+      recordPayloadClassName = tableConfig.getPayloadClass
+    )
+  }
 
   private def imbueConfigs(sqlContext: SQLContext): Unit = {
     sqlContext.sparkSession.sessionState.conf.setConfString("spark.sql.parquet.filterPushdown", "true")
```
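A hypothetical call-site sketch: the relation captures the state once on the driver and hands it to the scan RDD, which serializes it to executors. Only `getTableState` and `HoodieTableState` come from the diff above; `HoodieMergeOnReadRDD`'s actual constructor signature is not shown in this page, so every parameter below is an assumption:

```scala
// Hypothetical call-site; HoodieMergeOnReadRDD's real constructor may differ
val tableState: HoodieTableState = getTableState
val rdd = new HoodieMergeOnReadRDD(
  sqlContext.sparkContext,
  jobConf,        // assumed: the Hadoop/Spark read configuration
  tableState,     // driver-side snapshot, serialized into the RDD closure
  requiredSchema, // assumed: the pruned (projected) schema to read
  fileSplits)     // assumed: the MOR file splits to scan
```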
```diff
@@ -280,9 +301,6 @@ object HoodieBaseRelation {
   def getPartitionPath(fileStatus: FileStatus): Path =
     fileStatus.getPath.getParent
 
-  def isMetadataTable(metaClient: HoodieTableMetaClient): Boolean =
-    HoodieTableMetadata.isMetadataTable(metaClient.getBasePath)
-
   /**
    * Returns file-reader routine accepting [[PartitionedFile]] and returning an [[Iterator]]
    * over [[InternalRow]]
```