
@nsivabalan
Contributor

@nsivabalan nsivabalan commented Mar 10, 2022

What is the purpose of the pull request

When Deltastreamer consumes an empty batch but the checkpoint has moved ahead, Hudi injects a NullSchemaProvider into the consumed InputBatch. This provider returns the null schema (Schema.create(Schema.Type.NULL)) when asked for the source or target schema. That null schema is then serialized into the commit metadata and later causes failures when the table schema is fetched.

Fix: when serializing the schema to commit metadata, we skip the entry if it is the null schema (Schema.create(Schema.Type.NULL)).
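A minimal sketch of the guard described above, not the actual Hudi change. It assumes the schema is passed around as its Avro JSON string (the JSON form of the Avro null schema is `"null"`, quotes included) and that the commit's extra metadata is a plain map. The class name `NullSchemaGuard`, the method `buildExtraMetadata`, and the `schema` key are hypothetical names chosen for illustration.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical helper for illustration only; not Hudi's implementation.
final class NullSchemaGuard {
    // JSON form of Avro's Schema.create(Schema.Type.NULL).
    static final String NULL_SCHEMA_JSON = "\"null\"";
    // Assumed metadata key under which the writer schema is stored.
    static final String SCHEMA_KEY = "schema";

    // Builds the extra metadata for a commit, leaving out the schema entry
    // when the schema is missing, blank, or the Avro null schema.
    static Map<String, String> buildExtraMetadata(String schemaJson) {
        Map<String, String> extra = new HashMap<>();
        if (schemaJson == null) {
            return extra;
        }
        String trimmed = schemaJson.trim();
        if (!trimmed.isEmpty() && !trimmed.equals(NULL_SCHEMA_JSON)) {
            extra.put(SCHEMA_KEY, trimmed);
        }
        return extra;
    }
}
```

With this shape, a commit produced from an empty batch simply carries no schema entry, so readers never see a null schema in the metadata.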

Ideally, we would fix the schema provider returned with the empty batch itself. But changing it right away could break something elsewhere, so I created a ticket to do that later.

TableSchemaProvider can walk back through previous commits to find the right schema, so skipping the schema entry for this commit should be fine.
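The walk-back idea can be sketched as follows. This is an illustration under stated assumptions, not Hudi's schema-resolution code: each commit is modeled as a plain map of extra metadata, commits are ordered oldest to newest, and `SchemaWalkBack`, `latestSchema`, and the `schema` key are hypothetical names.

```java
import java.util.List;
import java.util.Map;
import java.util.Optional;

// Hypothetical illustration of walking back through commits; not Hudi's code.
final class SchemaWalkBack {
    // Assumed metadata key under which the writer schema is stored.
    static final String SCHEMA_KEY = "schema";

    // Scans commits from newest to oldest and returns the first schema found.
    // Commits without a schema entry (for example, empty batches) are skipped.
    static Optional<String> latestSchema(List<Map<String, String>> commitsOldestFirst) {
        for (int i = commitsOldestFirst.size() - 1; i >= 0; i--) {
            String schema = commitsOldestFirst.get(i).get(SCHEMA_KEY);
            if (schema != null && !schema.isEmpty()) {
                return Optional.of(schema);
            }
        }
        return Optional.empty();
    }
}
```

If the newest commit came from an empty batch and has no schema entry, the lookup falls through to the most recent earlier commit that does have one.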

Brief change log

Fixed serialization of the Avro schema into commit metadata so that the entry is not added if it is the null schema (Schema.create(Schema.Type.NULL)).

Verify this pull request

This change added tests and can be verified as follows:

  • Added a test to TestHoodieDeltastreamer to test the fix.

Committer checklist

  • Has a corresponding JIRA in PR title & commit

  • Commit message is descriptive of the change

  • CI is green

  • Necessary doc changes done or have another open PR

  • For large changes, please consider breaking it into sub-tasks under an umbrella JIRA.

@nsivabalan nsivabalan added the priority:critical Production degraded; pipelines stalled label Mar 10, 2022
@XuQianJin-Stars
Contributor

Hi @nsivabalan, please rebase this PR. I have added a timeout option on the master branch to fix the CI cancellations of the IT modules.

@nsivabalan nsivabalan force-pushed the fixNullSchemaEmptyBatch branch from 4e621f7 to 8053f01 Compare March 10, 2022 14:27
@nsivabalan nsivabalan force-pushed the fixNullSchemaEmptyBatch branch from 97d4e40 to 96b8587 Compare March 11, 2022 02:06
@hudi-bot
Collaborator

CI report:


@nsivabalan nsivabalan merged commit 9dc6df5 into apache:master Mar 11, 2022
vingov pushed a commit to vingov/hudi that referenced this pull request Apr 3, 2022
stayrascal pushed a commit to stayrascal/hudi that referenced this pull request Apr 12, 2022
4 participants