
Conversation

@Zouxxyy (Contributor) commented Dec 3, 2022

Change Logs

fix https://issues.apache.org/jira/projects/HUDI/issues/HUDI-5289

There is an earlier patch addressing a similar problem, but even with that patch applied, the problem still exists.

Impact

Could improve clustering performance.

Risk level (write none, low, medium or high below)

low

Documentation Update

None

Contributor's checklist

  • Read through contributor's guide
  • Change Logs and Impact were stated clearly
  • Adequate tests were added if applicable
  • CI passed

@nsivabalan (Contributor) left a comment

When I put up the fix, I did verify that clustering was not triggered multiple times. Can you confirm that you are seeing clustering being triggered twice?
If you keep inspecting the data directory, you will find the additional data files only during validation.
Another way to test this: marker directory reconciliation will delete some additional files if the dag was triggered twice. If it wasn't, you should not see any marker file deletions.
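
For illustration, here is a minimal, self-contained sketch (not Hudi code; the class name, app name, and numbers are made up) of the underlying Spark behavior being discussed: an unpersisted RDD's lineage is re-executed whenever a second action consumes it, which is exactly the dag being triggered twice.

import java.util.Arrays;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.util.LongAccumulator;

public class DagRetriggerCheck {
  public static void main(String[] args) {
    JavaSparkContext jsc = new JavaSparkContext("local[2]", "dag-retrigger-check");
    // Counts how many times the map function actually runs on executors.
    LongAccumulator executions = jsc.sc().longAccumulator("map-executions");

    JavaRDD<Integer> mapped = jsc.parallelize(Arrays.asList(1, 2, 3))
        .map(x -> {
          executions.add(1);
          return x * 2;
        });

    mapped.count();   // first action: map runs 3 times
    mapped.collect(); // second action: without persist(), map runs 3 more times
    System.out.println("map executed " + executions.value() + " times"); // prints 6, not 3
    jsc.stop();
  }
}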

@nsivabalan added the priority:blocker (Production down; release blocker) and release-0.12.2 (Patches targetted for 0.12.2) labels on Dec 5, 2022
@Zouxxyy force-pushed the xinyu/fix-rdd-recaluated branch from c762aa3 to a683b28 on December 6, 2022 05:40
@Zouxxyy (Contributor, Author) commented Dec 6, 2022

@nsivabalan yes, I confirm, and @boneanxs found it too, so I think we should persist it.

@hudi-bot (Collaborator) commented Dec 6, 2022

CI report:

Bot commands — @hudi-bot supports the following commands:
  • @hudi-bot run azure — re-run the last Azure build

@boneanxs (Contributor) commented Dec 6, 2022

> @nsivabalan yes, I confirm, and @boneanxs found it too, so I think we should persist it.

I found there was an issue when I first implemented clustering as rows, but didn't investigate it much at that time.

But I'm wondering whether we can do collectAsList and then parallelize the list to build HoodieData[WriteStatus] in performClusteringWithRecordsRDD, just as performClusteringWithRecordsAsRow already does (dereferencing the RDD to a list), to fix this permanently? A hedged sketch of that idea follows the snippet below.

JavaRDD<WriteStatus>[] writeStatuses = convertStreamToArray(writeStatusesStream.map(HoodieJavaRDD::getJavaRDD));
// Union the per-group write-status RDDs into a single RDD.
JavaRDD<WriteStatus> writeStatusRDD = engineContext.union(writeStatuses);
// Persist writeStatus, since it may be reused (a second action would otherwise re-run the dag).
writeStatusRDD.persist(StorageLevel.MEMORY_AND_DISK());
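
A hedged sketch of the collect-then-parallelize idea (the class and method names here are hypothetical, not the actual Hudi API): dereferencing the RDD to a driver-side list pins the results, so later actions rebuild from the in-memory list instead of re-running the clustering dag.

import java.util.List;
import org.apache.hudi.client.WriteStatus;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;

public class WriteStatusPinner {
  // Materialize the write statuses once on the driver, then rebuild an RDD
  // whose lineage is just that list, mirroring what
  // performClusteringWithRecordsAsRow effectively does.
  static JavaRDD<WriteStatus> pinWriteStatuses(JavaSparkContext jsc,
                                               JavaRDD<WriteStatus> writeStatusRDD) {
    List<WriteStatus> collected = writeStatusRDD.collect();
    return jsc.parallelize(collected, Math.max(1, collected.size()));
  }
}

The trade-off is that all write statuses are pulled to the driver; since WriteStatus holds per-file metadata rather than data, that should typically be small, though it grows with the number of file groups written.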
A Contributor left a comment

We can't rely on the cache, since the cache could be invalidated, and we have had experiences where the actual files written did not match what was in the writeStatus. Hence, if something fails, we let the dag get re-triggered here. We added guard rails at BaseCommitActionExecutor by means of cloning.
CC @alexeykudinkin
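
A hedged illustration of the cloning guard rail mentioned above (not the actual BaseCommitActionExecutor code; it only shows the general defensive-copy pattern, using commons-lang3 deep cloning):

import java.util.List;
import java.util.stream.Collectors;
import org.apache.commons.lang3.SerializationUtils;
import org.apache.hudi.client.WriteStatus;

public class WriteStatusGuardRail {
  // Deep-copy each status before handing it to commit logic, so a later
  // recompute or mutation of the originals cannot change what gets committed.
  static List<WriteStatus> defensiveCopy(List<WriteStatus> statuses) {
    return statuses.stream()
        .map(SerializationUtils::clone) // relies on WriteStatus being Serializable
        .collect(Collectors.toList());
  }
}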

@nsivabalan (Contributor) commented

Are you guys saying that even in a successful path the dag is getting triggered twice, or does it happen only during the exception path?

@Zouxxyy (Contributor, Author) commented Dec 12, 2022

@nsivabalan yes, even in a successful path, the dag is getting triggered twice. But you must turn off hoodie.datasource.write.row.writer.enable, because when it is turned on, the recalculation is avoided thanks to the collect at HoodieDatasetBulkInsertHelper. Since it seems we do not recommend using persist at present, I will close these two PRs.
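
For reference, a hedged repro sketch (the dataframe, table name, and path are illustrative; only the row-writer flag comes from the comment above):

// df is an existing Dataset<Row>; basePath is the table path (both assumed).
df.write().format("hudi")
    .option("hoodie.table.name", "test_table")                    // illustrative
    .option("hoodie.datasource.write.row.writer.enable", "false") // force the RDD path
    .mode("append")
    .save(basePath);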

@Zouxxyy Zouxxyy closed this Dec 12, 2022

Labels

  • priority:blocker — Production down; release blocker
  • release-0.12.2 — Patches targetted for 0.12.2


4 participants