Skip to content

Conversation

@nsivabalan
Copy link
Contributor

@nsivabalan nsivabalan commented May 2, 2021

What is the purpose of the pull request

https://issues.apache.org/jira/browse/HUDI-1343 was added so that default and nulls are handled properly. This fix is applicable only to deltastreamer flow. But later, we had a fix in spark ds layer which fixed the defaults and null values. So, the schema post processor is not really required anymore. Hence making the default value to disabled.

Brief change log

(for example:)

  • Modify AnnotationLocation checkstyle rule in checkstyle.xml

Verify this pull request

(Please pick either of the following options)

This pull request is a trivial rework / code cleanup without any test coverage.

(or)

This pull request is already covered by existing tests, such as (please describe tests).

(or)

This change added tests and can be verified as follows:

(example:)

  • Added integration tests for end-to-end.
  • Added HoodieClientWriteTest to verify the change.
  • Manually verified the change by running a job locally.

Committer checklist

  • Has a corresponding JIRA in PR title & commit

  • Commit message is descriptive of the change

  • CI is green

  • Necessary doc changes done or have another open PR

  • For large changes, please consider breaking it into sub-tasks under an umbrella JIRA.

@nsivabalan nsivabalan added the status:in-progress Work in progress label May 2, 2021
@codecov-commenter
Copy link

codecov-commenter commented May 2, 2021

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 52.90%. Comparing base (c5220b9) to head (701640c).
Report is 4136 commits behind head on master.

Additional details and impacted files

Impacted file tree graph

@@            Coverage Diff            @@
##             master    #2911   +/-   ##
=========================================
  Coverage     52.90%   52.90%           
+ Complexity     3748     3746    -2     
=========================================
  Files           488      488           
  Lines         23574    23574           
  Branches       2510     2510           
=========================================
  Hits          12472    12472           
- Misses         9998     9999    +1     
+ Partials       1104     1103    -1     
Flag Coverage Δ
hudicli 39.53% <ø> (ø)
hudiclient ∅ <ø> (∅)
hudicommon 50.38% <ø> (+0.01%) ⬆️
hudiflink 58.98% <ø> (ø)
hudihadoopmr 33.33% <ø> (ø)
hudisparkdatasource 73.33% <ø> (ø)
hudisync 46.73% <ø> (ø)
huditimelineservice 64.36% <ø> (ø)
hudiutilities 69.43% <100.00%> (-0.10%) ⬇️

Flags with carried forward coverage won't be shown. Click here to find out more.

Files with missing lines Coverage Δ
...in/java/org/apache/hudi/utilities/UtilHelpers.java 63.37% <100.00%> (-1.17%) ⬇️

... and 1 file with indirect coverage changes

Copy link
Contributor

@n3nash n3nash left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM. Will wait for @bvaradar to confirm later tonight/tomorrow before merging this.

@nsivabalan nsivabalan changed the title [WIP] Setting default value to false for enabling schema post processor [HUDI-1887] Setting default value to false for enabling schema post processor May 9, 2021
@nsivabalan nsivabalan removed the status:in-progress Work in progress label May 9, 2021
@nsivabalan nsivabalan requested a review from bvaradar May 9, 2021 04:10
@vinothchandar
Copy link
Member

@nsivabalan are n't there any tests affected by this change? Also do we even need this post processor feature anymore? Should can the entire feature instead of just making it not enabled by default?

@vinothchandar vinothchandar self-assigned this May 11, 2021
@nsivabalan nsivabalan added the priority:critical Production degraded; pipelines stalled label May 11, 2021
@nsivabalan nsivabalan changed the title [HUDI-1887] Setting default value to false for enabling schema post processor [WIP][HUDI-1887] Setting default value to false for enabling schema post processor May 14, 2021
@nsivabalan nsivabalan added the status:in-progress Work in progress label May 14, 2021
@nsivabalan
Copy link
Contributor Author

yeah, I am trying to understand the implications of removing the post processor. we may not have namespace in delta streamer's target schema if we remove the post processor. so what we need to understand is, what incase schema has namespace in hudi data files, and then delta streamer produces records w/ schema not having namespace. marking as WIP for now.

@nsivabalan
Copy link
Contributor Author

yes, Only reason we still wanted to have is, just incase some user wishes to migrate from a path of having namespace to a path of not having namespace. but if we can totally remove it, I am all for it.

@nsivabalan nsivabalan changed the title [WIP][HUDI-1887] Setting default value to false for enabling schema post processor [HUDI-1887] Setting default value to false for enabling schema post processor Jun 11, 2021
@nsivabalan
Copy link
Contributor Author

@n3nash : could you think of any reason why we need to have this instead of removing it altogether.

@n3nash
Copy link
Contributor

n3nash commented Jul 2, 2021

@nsivabalan It's the same reason you mentioned. The post processor adds the namespace. So if users have written log files with AVRO name-spacing and we disable post process for them, it will break. But this only happens in a specific path of DeltaSync - can you point out which one ? From my interactions in the community, I can suggest whether that path is a frequently used path or not - if it's not, let's remove this complexity.

@nsivabalan
Copy link
Contributor Author

@n3nash : when there is no target schema set and so hudi falls back to RowBasedSchemaProvider for target shema.

@nsivabalan nsivabalan added priority:blocker Production down; release blocker and removed status:in-progress Work in progress labels Jul 9, 2021
@vinothchandar
Copy link
Member

I am marking this a non-blocker.

@vinothchandar vinothchandar removed the priority:blocker Production down; release blocker label Aug 4, 2021
@nsivabalan
Copy link
Contributor Author

@vinothchandar : If you are good with the patch, I can rebase and land it.

@nsivabalan nsivabalan added priority:high Significant impact; potential bugs and removed priority:critical Production degraded; pipelines stalled labels Aug 5, 2021
@hudi-bot
Copy link
Collaborator

hudi-bot commented Nov 5, 2021

CI report:

Bot commands @hudi-bot supports the following commands:
  • @hudi-bot run azure re-run the last Azure build

@yihua yihua added the area:ingest Ingestion into Hudi label Sep 7, 2022
@xushiyan xushiyan added the status:in-progress Work in progress label Oct 31, 2022
@github-actions github-actions bot added the size:XS PR with lines of changes in <= 10 label Feb 26, 2024
@yihua
Copy link
Contributor

yihua commented Sep 10, 2024

We have added more SchemaPostProcessor implementation since this PR is posted, including ChainedSchemaPostProcessor, KafkaOffsetPostProcessor, DropColumnSchemaPostProcessor, and DeleteSupportSchemaPostProcessor, we cannot disable it by default any more. Closing this PR.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

area:ingest Ingestion into Hudi priority:high Significant impact; potential bugs size:XS PR with lines of changes in <= 10 status:in-progress Work in progress

Projects

Status: No status
Status: ✅ Done

Development

Successfully merging this pull request may close these issues.

7 participants