Conversation

@nsivabalan
Contributor

@nsivabalan nsivabalan commented Nov 5, 2021

  • Adding parquet data block with inline read support.
  • Added a config, hoodie.table.log.block.type, to determine the log block type. If set, we use the config value; if not set, we fall back to determining the log block type from the base file format.
  • Added some minimal tests to ensure adding a parquet data block and reading it using AbstractLogRecordScanner works fine.
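The config-first, fall-back-to-base-format resolution described above can be sketched as follows. This is a hedged, self-contained illustration: the class name, enum values, and the assumption that an unset config arrives as null or empty are all hypothetical stand-ins, not Hudi's actual classes.

```java
// Sketch of the log block type resolution: an explicit
// hoodie.table.log.block.type value wins; otherwise the type is derived from
// the table's base file format. All names here are illustrative.
public class LogBlockTypeResolver {

  enum BaseFileFormat { PARQUET, HFILE, ORC }

  enum LogBlockType { AVRO_DATA_BLOCK, HFILE_DATA_BLOCK, PARQUET_DATA_BLOCK }

  static LogBlockType resolve(String configuredType, BaseFileFormat baseFormat) {
    if (configuredType != null && !configuredType.isEmpty()) {
      // Explicit config value takes precedence.
      return LogBlockType.valueOf(configuredType);
    }
    // Fall back to the base file format of the table.
    switch (baseFormat) {
      case PARQUET:
        return LogBlockType.PARQUET_DATA_BLOCK;
      case HFILE:
        return LogBlockType.HFILE_DATA_BLOCK;
      default:
        return LogBlockType.AVRO_DATA_BLOCK;
    }
  }

  public static void main(String[] args) {
    System.out.println(resolve(null, BaseFileFormat.PARQUET)); // prints PARQUET_DATA_BLOCK
    System.out.println(resolve("AVRO_DATA_BLOCK", BaseFileFormat.PARQUET)); // explicit config wins: AVRO_DATA_BLOCK
  }
}
```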

Pending items:

  • Where to store the configs for the parquet data block (we need bloom filter configs and parquet write configs here). I don't think we can reuse the existing index config, as its purpose is different; I am thinking of HoodieStorageConfig. Once we have consensus, I will fix the configs accordingly. Also, none of the LogDataBlock classes take in any configs for now; I will wire in the right configs once I figure out where to store them.
  • This patch adds only inline read support. I wanted to understand whether we also need to support the non-inline path (read the entire content and then deserialize the records).
  • Add more end-to-end tests for parquet data blocks.
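For context on the inline vs. non-inline distinction above, here is a minimal, self-contained sketch of the access pattern using only java.io: an inline read seeks to the embedded block's byte range inside the log file and reads just that slice, instead of loading the whole content and then deserializing. Hudi wraps this pattern behind its InLineFileSystem; the helper and file layout below are hypothetical illustrations, not Hudi code.

```java
import java.io.IOException;
import java.io.RandomAccessFile;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

// Illustrative only: the seek-and-slice pattern behind an "inline" read.
public class InlineReadSketch {

  // Read only [offset, offset + length) of the file -- the "inline" path.
  static byte[] readInline(Path file, long offset, int length) throws IOException {
    try (RandomAccessFile raf = new RandomAccessFile(file.toFile(), "r")) {
      raf.seek(offset);
      byte[] slice = new byte[length];
      raf.readFully(slice);
      return slice;
    }
  }

  public static void main(String[] args) throws IOException {
    Path log = Files.createTempFile("log", ".bin");
    // Hypothetical layout: 6-byte header, 12-byte embedded block, footer.
    Files.write(log, "headerPARQUETBYTESfooter".getBytes(StandardCharsets.UTF_8));
    byte[] block = readInline(log, 6, 12);
    System.out.println(new String(block, StandardCharsets.UTF_8)); // prints PARQUETBYTES
    Files.delete(log);
  }
}
```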

Verify this pull request

(Please pick either of the following options)

This pull request is a trivial rework / code cleanup without any test coverage.

(or)

This pull request is already covered by existing tests, such as (please describe tests).

(or)

This change added tests and can be verified as follows:

(example:)

  • Added integration tests for end-to-end.
  • Added HoodieClientWriteTest to verify the change.
  • Manually verified the change by running a job locally.

Committer checklist

  • Has a corresponding JIRA in PR title & commit

  • Commit message is descriptive of the change

  • CI is green

  • Necessary doc changes done or have another open PR

  • For large changes, please consider breaking it into sub-tasks under an umbrella JIRA.

@nsivabalan nsivabalan added the priority:blocker Production down; release blocker label Nov 5, 2021
@hudi-bot
Collaborator

hudi-bot commented Nov 5, 2021

CI report:

Bot commands

@hudi-bot supports the following commands:

  • @hudi-bot run azure: re-run the last Azure build

@vinothchandar
Member

@nsivabalan we should push this back a bit, and maybe @codope can review this and take over? This is not directly related to the metadata table, right?

@nsivabalan
Contributor Author

Agreed, will remove the release blocker label.

@nsivabalan nsivabalan removed the priority:blocker Production down; release blocker label Nov 7, 2021
```java
}

public HoodieLogBlockType getLogDataBlockFormat() {
  switch (getBaseFileFormat()) {
```
Member
I think this config is not required.

For a dataset with a PARQUET base file format, an inline log block format of PARQUET makes sense. Mixing and matching the base file format and the inline log format does not seem to make writing data or the query engine side any faster.

So we can have a config, useInlineFiles(): when enabled, the inline log block format will be the same as the base file format. This needs to be fixed for the lifetime of the dataset, so it can come from the table config.
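The alternative proposed in this comment could look roughly like the following. This is a sketch of the reviewer's suggestion, not actual Hudi code: the class, the useInlineFiles flag, and the assumption that Avro remains the default log block format are all hypothetical.

```java
// Sketch of the proposal: a single boolean table config instead of a
// free-form log block type. When inline files are enabled, the log block
// format simply tracks the table's base file format; otherwise the writer
// keeps the (assumed) default Avro log blocks. All names are illustrative.
public class InlineFileTableConfig {

  enum FileFormat { PARQUET, HFILE, AVRO }

  private final boolean useInlineFiles;
  private final FileFormat baseFileFormat;

  InlineFileTableConfig(boolean useInlineFiles, FileFormat baseFileFormat) {
    this.useInlineFiles = useInlineFiles;
    this.baseFileFormat = baseFileFormat;
  }

  boolean useInlineFiles() {
    return useInlineFiles;
  }

  // No mix-and-match: the inline log format always equals the base format.
  FileFormat logBlockFormat() {
    return useInlineFiles ? baseFileFormat : FileFormat.AVRO;
  }

  public static void main(String[] args) {
    InlineFileTableConfig cfg = new InlineFileTableConfig(true, FileFormat.PARQUET);
    System.out.println(cfg.logBlockFormat()); // prints PARQUET
  }
}
```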


```java
@Override
protected byte[] serializeRecords() throws IOException {
  // TODO: Need to decide from where to fetch all config values required below. We can't re-use index config as the purpose is different.
```
Member
I think the configs here should match the configs for the dataset's base file format.

Under what case would the user want separate configs for the base files and the inline log blocks?
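As a content-shape illustration of the contract serializeRecords() must satisfy (records in, a self-describing byte[] out that a scanner can later deserialize), here is a minimal stand-in using a length-prefixed encoding. The real parquet data block would instead write records through a Parquet writer configured with the storage and bloom-filter configs discussed above; the codec class below is hypothetical.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

// Stand-in for the serialize/deserialize round trip of a data block. A real
// parquet block would produce Parquet bytes; this only shows the contract.
public class RecordBlockCodec {

  static byte[] serializeRecords(List<String> records) throws IOException {
    ByteArrayOutputStream bos = new ByteArrayOutputStream();
    try (DataOutputStream out = new DataOutputStream(bos)) {
      out.writeInt(records.size());  // record count header
      for (String r : records) {
        out.writeUTF(r);             // length-prefixed payload
      }
    }
    return bos.toByteArray();
  }

  static List<String> deserializeRecords(byte[] content) throws IOException {
    List<String> records = new ArrayList<>();
    try (DataInputStream in = new DataInputStream(new ByteArrayInputStream(content))) {
      int count = in.readInt();
      for (int i = 0; i < count; i++) {
        records.add(in.readUTF());
      }
    }
    return records;
  }
}
```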

@danny0405
Copy link
Contributor

Hi @nsivabalan, if you want this patch in release 0.10.0, you can tag it with release-blocker if you think it is necessary.
