[HUDI-426][WIP] Initial implementation for Bootstrapping data source #1475

umehrot2 · 2020-04-01T02:48:15Z

What is the purpose of the pull request

This is a first draft of the Spark data source implementation for bootstrapped tables. The work is still in progress, and I have put up this WIP PR to gather feedback early. I will continue to work on it and push to this PR. Currently, the implementation can infer the schema, merge skeleton and data files, and perform column pruning.
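For readers new to bootstrapped tables, the merge and column pruning mentioned above can be pictured with a small sketch. This is not the code in this PR. It assumes the skeleton file holds only Hudi metadata columns and that its rows line up positionally with the rows of the external data file. The helper names (`split_columns`, `merge_rows`) and the short list of metadata columns are hypothetical and chosen for illustration.

```python
from typing import Dict, List, Sequence, Tuple

# Assumption: a subset of Hudi metadata columns, used here only for illustration.
SKELETON_COLUMNS = ("_hoodie_commit_time", "_hoodie_record_key")

Row = Dict[str, object]


def split_columns(requested: Sequence[str]) -> Tuple[List[str], List[str]]:
    """Column pruning: decide which requested columns come from which file."""
    skeleton_cols = [c for c in requested if c in SKELETON_COLUMNS]
    data_cols = [c for c in requested if c not in SKELETON_COLUMNS]
    return skeleton_cols, data_cols


def merge_rows(
    skeleton_rows: Sequence[Row],
    data_rows: Sequence[Row],
    requested: Sequence[str],
) -> List[tuple]:
    """Merge rows positionally and emit values in the requested column order."""
    if len(skeleton_rows) != len(data_rows):
        raise ValueError("skeleton and data files must have the same row count")
    skeleton_cols, _ = split_columns(requested)
    merged = []
    for skeleton_row, data_row in zip(skeleton_rows, data_rows):
        # Pick each value from the file that owns the column, keeping the
        # caller's ordering so values are not mapped to the wrong columns.
        merged.append(
            tuple(
                skeleton_row[c] if c in skeleton_cols else data_row[c]
                for c in requested
            )
        )
    return merged


if __name__ == "__main__":
    skeleton = [
        {"_hoodie_commit_time": "001", "_hoodie_record_key": "k1"},
        {"_hoodie_commit_time": "001", "_hoodie_record_key": "k2"},
    ]
    data = [
        {"id": 1, "name": "a", "price": 9.5},
        {"id": 2, "name": "b", "price": 3.0},
    ]
    print(merge_rows(skeleton, data, ["name", "_hoodie_record_key"]))
```

In this sketch, only the columns returned by `split_columns` would need to be read from each file, and the output tuples follow the requested order rather than the on-disk order.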

Note: This depends on the PR by @bvaradar #1112 for getting the file system view of external data files.

Feedback is welcome.

Verify this pull request

(Please pick either of the following options)

This pull request is a trivial rework / code cleanup without any test coverage.

(or)

This pull request is already covered by existing tests, such as (please describe tests).

(or)

This change added tests and can be verified as follows:

(example:)

Added integration tests for end-to-end.
Added HoodieClientWriteTest to verify the change.
Manually verified the change by running a job locally.

Committer checklist

Has a corresponding JIRA in PR title & commit
Commit message is descriptive of the change
CI is green
Necessary doc changes done or have another open PR
For large changes, please consider breaking it into sub-tasks under an umbrella JIRA.

umehrot2 · 2020-04-01T02:58:07Z

@bvaradar if you glance through my implementation, one thing I might need from you is for HoodieBaseFile to store the FileStatus of the external data file instead of just its string path. I need the list of FileStatus objects for the external data files to work with here.

umehrot2 · 2020-06-04T00:01:59Z

Closing this pull request, in favor of the new pull request #1702 where I have consolidated all the datasource related changes in one PR for review. It includes this read datasource part as well.

Initial implementation for Bootstrapping data source

43b10e0

vinothchandar assigned bvaradar Apr 1, 2020

umehrot2 added 9 commits April 1, 2020 18:33

Return data with columns in the same order as requested. This fixes data from being mapped to incorrect columns in the results, as well as filtering issues.

97d40d1

Convert to Row with schema having columns in the same order as requested

1787059

Remove need for Internal Row to Row conversion

938ef25

Add support for reading from partition paths

c295597

Switch to listing recursively instead of finding partition paths

05184b2

Support both row wise and columnar batch (vectorization) merge.

5f22176

Support reading from both metadata bootstrapped, and full bootstrapped files

83b9f23

Support filter pushdown for fully bootstrapped files

3fcaaed

Add support for path pattern, and use Spark's InMemoryFileIndex for listing. This makes the behaviour of this relation similar to other file relations in spark like parquet.

fbb3db5

umehrot2 closed this Jun 4, 2020

kroushan-nit pushed a commit to kroushan-nit/hudi-oss-fork that referenced this pull request Aug 28, 2025

[AUDIT-1071]: Changelog for release-v1.157.0 (apache#1475)

1fa2589

Co-authored-by: github-actions[bot] <github-actions[bot]@users.noreply.github.com>

hudi-bot mentioned this pull request Nov 30, 2025

Efficiently reading hudi tables through spark-shell #14566

Open
