
Conversation

@alexeykudinkin (Contributor) commented on Dec 21, 2022

Change Logs

Currently, MOR tables with the NestedSchemaPruning (NSP) rule successfully applied (i.e., with the nested schema pruned) fail to read whenever any log-file merging occurs.

  • Fixed HoodieClientTestHarness to properly initialize the SparkSession so that HoodieSparkSessionExtensions are injected correctly
  • Converted all custom Hudi relations to case classes (to make them immutable)
  • Implemented projection so that it can project nested structs as well
  • General cleanup

Impact

Addresses an issue with MOR tables where, after the NSP optimization is applied, reads fail whenever delta-log merging needs to happen.

Risk level (write none, low, medium or high below)

Low

Documentation Update

N/A

Contributor's checklist

  • Read through contributor's guide
  • Change Logs and Impact were stated clearly
  • Adequate tests were added if applicable
  • CI passed

case class BaseFileOnlyRelation(override val sqlContext: SQLContext,
                                userSchema: Option[StructType],
                                globPaths: Seq[Path])
  extends HoodieBaseRelation(sqlContext, metaClient, optParams, userSchema) with SparkAdapterSupport {
Contributor Author (@alexeykudinkin):

The primary change here is converting the class to a case class, which in turn means that all of the constructor parameters become field values requiring the corresponding annotation (`override val`, as seen in the snippet above).

Contributor Author (@alexeykudinkin):

The reason this class is converted to a case class is to avoid any in-place mutation and instead have updatePrunedDataSchema produce a new instance (see the sketch below).
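A minimal sketch of the case-class pattern, with illustrative names (ExampleRelation is not the actual Hudi relation):

```scala
import org.apache.spark.sql.types.StructType

// Illustrative sketch only: a case class keeps the relation immutable, so
// "updating" the pruned schema returns a fresh instance via copy() instead
// of mutating a var on the existing relation.
case class ExampleRelation(dataSchema: StructType,
                           prunedDataSchema: Option[StructType] = None) {
  def updatePrunedDataSchema(pruned: StructType): ExampleRelation =
    copy(prunedDataSchema = Some(pruned))
}
```

The original relation is left untouched; callers simply continue with the returned instance.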

/**
 * Get all PartitionDirectories based on globPaths if specified, otherwise use the table path.
 * Will perform pruning if necessary.
 */
private def listPartitionDirectories(globPaths: Seq[Path], partitionFilters: Seq[Expression], dataFilters: Seq[Expression]): Seq[PartitionDirectory] = {
Contributor Author (@alexeykudinkin):

Combined these two methods into one.

}
}

protected def getColName(f: StructField): String = {
Contributor Author (@alexeykudinkin):

Dead code

case class MergeOnReadIncrementalRelation(override val sqlContext: SQLContext,
  extends MergeOnReadSnapshotRelation(sqlContext, optParams, userSchema, Seq(), metaClient) with HoodieIncrementalRelationTrait {

  override type FileSplit = HoodieMergeOnReadFileSplit
Contributor Author (@alexeykudinkin):

Same changes as other relations

case class MergeOnReadSnapshotRelation(override val sqlContext: SQLContext,
                                       globPaths: Seq[Path],
                                       metaClient: HoodieTableMetaClient)
  extends HoodieBaseRelation(sqlContext, metaClient, optParams, userSchema) {
Contributor Author (@alexeykudinkin):

Same changes as other relations

@alexeykudinkin force-pushed the ak/mor-sch-prj-fix branch 5 times, most recently from 5994fda to 9339277 on December 21, 2022 at 07:28
@alexeykudinkin added the priority:blocker (Production down; release blocker), engine:spark (Spark integration), and area:sql (SQL interfaces) labels on Dec 21, 2022
@alexeykudinkin (Contributor Author):

@hudi-bot run azure

@alexeykudinkin changed the title to [HUDI-5443] Fixing exception trying to read MOR table after `NestedSchemaPruning` rule has been applied on Dec 21, 2022
@alexeykudinkin (Contributor Author):

@hudi-bot run azure

@alexeykudinkin alexeykudinkin requested review from yihua and removed request for codope January 10, 2023 01:01
@alexeykudinkin alexeykudinkin assigned yihua and unassigned codope Jan 10, 2023
@yihua (Contributor) left a comment:

@alexeykudinkin I left a few comments. Also, is there any test that failed before and is now fixed? If not, could you add such a test to verify the fix?

      sourceExpr

    case (sourceType: StructType, targetType: StructType) =>
      val fieldValueExprs = targetType.fields.map { tf =>
Contributor:

It looks like a subset of nested fields may be taken during the projection, e.g., if the source has {a.b, a.c, a.d} and the target has a.b, we only keep a.b instead of the whole StructType a. Does this happen, or does the caller of this function always make sure the targetStructType is constructed to preserve the root-level field rather than a subset of nested fields? Is this a problem for projection, i.e., can the Parquet and log readers read files with a schema containing only a subset of nested fields?

Contributor Author (@alexeykudinkin):

> It looks like a subset of nested fields may be taken during the projection, e.g., if the source has {a.b, a.c, a.d} and the target has a.b, we only keep a.b instead of the whole StructType a. Does this happen, or does the caller of this function always make sure the targetStructType is constructed to preserve the root-level field rather than a subset of nested fields?

It does happen. It's actually the reason for this PR -- previously this code only handled nested-field projections, but NestedSchemaPruning could produce schemas with nested fields pruned as well. Therefore, we need to make sure we handle this appropriately when reading log files (by projecting records into the new schema); a sketch of what such a pruned schema looks like follows below.
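To make the a.b example above concrete, here is a minimal sketch of what "the target is a pruned projection of the source" means for nested structs; the helper and its name are illustrative, not the actual Hudi code:

```scala
import org.apache.spark.sql.types._

// Illustrative sketch: checks that a pruned (target) schema is a valid
// projection of the source schema, descending into nested structs. Arrays
// and maps are compared shallowly here for brevity.
def isProjectionOf(source: DataType, target: DataType): Boolean =
  (source, target) match {
    case (s: StructType, t: StructType) =>
      t.fields.forall { tf =>
        s.fields.find(_.name == tf.name)
          .exists(sf => isProjectionOf(sf.dataType, tf.dataType))
      }
    case (s, t) => s == t
  }
```

For a source of {a: {b, c, d}} and a target of {a: {b}} this returns true, which is exactly the shape NestedSchemaPruning can produce and which the log-file readers then have to project records into.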

Contributor:

Got it

Contributor Author (@alexeykudinkin) commented on Jan 21, 2023:

Realized that this is actually not the right approach and that the problem is elsewhere:

  • The problem was that we simply weren't reading projected records from Parquet -- the reason being that when a non-whitelisted RecordPayload is used, we fall back to reading the full record, yet we were still allowing NestedSchemaPruning to be applied nevertheless (a sketch of the gating follows below)
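A minimal sketch of that gating, where the helper name and the whitelist contents are assumptions for illustration rather than the actual Hudi logic:

```scala
// Illustrative sketch: NestedSchemaPruning should only fire when the payload
// class is known to support reading projected (pruned) records; otherwise
// the reader falls back to full records and the pruned schema must not be
// applied.
val projectionCompatiblePayloads: Set[String] = Set(
  "org.apache.hudi.common.model.OverwriteWithLatestAvroPayload" // assumed example entry
)

def shouldApplyNestedSchemaPruning(payloadClassName: String): Boolean =
  projectionCompatiblePayloads.contains(payloadClassName)
```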

  private def getHadoopConf: Configuration = {
    val conf = hadoopConfBroadcast.value.value
    new Configuration(conf)
  }
Contributor:

still need the lock for concurrency?

Contributor Author (@alexeykudinkin):

I don't actually think we need the lock

Contributor:

Given that this might introduce side effects and there's little time before the code freeze of the release to verify the removal of the lock, could you keep this part the same as before? It is not essential to the PR.

Contributor Author (@alexeykudinkin):

I don't think there's any real explanation for why that lock might be needed:

  • The underlying broadcast implementation is thread-safe (it takes its own lock)
  • A Hadoop configuration doesn't need a lock

Contributor Author (@alexeykudinkin):

But even more importantly -- there's no real concurrent access to this method

Contributor:

This lock has been there since the beginning of the HoodieMergeOnReadRDD implementation, and it is likely due to the broadcast configuration issue in Spark 2.4 (#1848 (comment)). @garyli1019 could you confirm if that's the case? I'd rather keep it and keep the code safe from potentially breaking MOR queries.

Contributor Author (@alexeykudinkin):

How would it break MOR queries though?

See, whichever way we turn this around, this lock doesn't really make sense:

  • There's no concurrency in accessing this method; each task gets its own copy of the object
  • The only concurrency is in accessing the Broadcast shared cache, which is guarded by its own lock

Contributor:

Synced up offline. @alexeykudinkin and I are aligned that we need to clean up the legacy code and remove any unnecessary code paths. For this particular case, there could be a problem with modification of the Hadoop conf returned by this function (check: HoodieMergeOnReadRDD#compute -> LogFileIterator.logRecords -> scanLog -> FSUtils.getFs -> prepareHadoopConf -> conf.set). That's likely why the lock was put in from the beginning. So the change is going to be reverted in this PR and we'll revisit it in a separate PR to be merged after 0.13.0.
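For illustration, a minimal sketch of the defensive-copy rationale (the helper name is hypothetical; the conf.set chain is the one traced above):

```scala
import org.apache.hadoop.conf.Configuration

// Illustrative sketch: if every task used the broadcast Configuration object
// directly, a downstream conf.set(...) (e.g. in FSUtils.prepareHadoopConf)
// would mutate state shared across tasks. Copying first keeps any mutation
// task-local.
def taskLocalConf(shared: Configuration): Configuration =
  new Configuration(shared) // defensive copy; safe to mutate afterwards
```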

Member:

@yihua Hi Ethan, the Hadoop configuration issue I can recall was related to serialization during the broadcast: the Hadoop Configuration was not serializable. The initial implementation of HoodieMergeOnReadRDD either created a new Configuration object or used a SerializableConfiguration to solve this issue.
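For reference, a minimal sketch of that broadcast workaround, assuming Spark 3.x where org.apache.spark.util.SerializableConfiguration is public:

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.util.SerializableConfiguration

// Illustrative sketch: Hadoop's Configuration is not Serializable, so it is
// wrapped before broadcasting; executors then unwrap it with
// broadcast.value.value, matching the hadoopConfBroadcast snippet above.
val spark = SparkSession.builder().master("local[*]").getOrCreate()
val hadoopConfBroadcast = spark.sparkContext.broadcast(
  new SerializableConfiguration(spark.sparkContext.hadoopConfiguration))
```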

@alexeykudinkin (Contributor Author):

CI is green:

(screenshot: green Azure CI run, 2023-01-23, 8:07 PM)

https://dev.azure.com/apache-hudi-ci-org/apache-hudi-ci/_build/results?buildId=14575&view=results

@yihua (Contributor) left a comment:

LGTM

@hudi-bot (Collaborator):

CI report:

Bot commands: @hudi-bot supports the following commands:
  • @hudi-bot run azure: re-run the last Azure build

@yihua yihua merged commit 1769ff8 into apache:master Jan 24, 2023
fengjian428 pushed a commit to fengjian428/hudi that referenced this pull request Jan 31, 2023
…hemaPruning` rule has been applied (apache#7528)

Addresses issue w/ MOR tables where after applying `NestedSchemaPruning` optimization, it would fail to read in case delta-log merging will need to happen
fengjian428 pushed a commit to fengjian428/hudi that referenced this pull request Apr 5, 2023
…hemaPruning` rule has been applied (apache#7528)

Addresses issue w/ MOR tables where after applying `NestedSchemaPruning` optimization, it would fail to read in case delta-log merging will need to happen

Labels

  • area:sql SQL interfaces
  • engine:spark Spark integration
  • priority:blocker Production down; release blocker

Projects

Archived in project


5 participants