Skip to content

Conversation

@alamb
Copy link
Contributor

@alamb alamb commented Oct 30, 2021

(Built on #810, so to see the changes proposed in this PR just look at the last commit)

Which issue does this PR close?

Closes #814
Closes #85
Related to #817

Rationale for this change

Follow Arrow spec for UnionArray, make it possible to validate with generic code added in #810, and conform to Rust safety conventions.

See https://arrow.apache.org/docs/format/Columnar.html#union-layout for a description of the Union layout

What changes are included in this PR?

  1. Rename UnionArray::new to unsafe UnionArray::new_unchecked() to follow standard Rust safety conventions
  2. Add UnionMode to DataType::Union so the type carries information on expected format
  3. Update null values offset handling to match https://arrow.apache.org/docs/format/Columnar.html#union-layout
  4. Add validation in ArrayData::validate

Are there any user-facing changes?

  1. UnionArray::new has been renamed to unsafe UnionArray::new_unchecked()
  2. Must specify Dense or Sparse when creating a DataType::Union
  3. Accepted UnionArray format is different

@alamb alamb added the api-change Changes to the arrow API label Oct 30, 2021
@github-actions github-actions bot added arrow Changes to the arrow crate parquet Changes to the parquet crate labels Oct 30, 2021
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

here is the API change

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

here is the core format change -- the length of type_ids must match the length of the values (rather than skipping nulls)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

we can now validate the buffer layout based on DataType!

@codecov-commenter
Copy link

codecov-commenter commented Oct 30, 2021

Codecov Report

Merging #885 (c8737a1) into master (fe106e0) will increase coverage by 0.02%.
The diff coverage is 93.69%.

Impacted file tree graph

@@            Coverage Diff             @@
##           master     #885      +/-   ##
==========================================
+ Coverage   82.33%   82.35%   +0.02%     
==========================================
  Files         169      169              
  Lines       49719    49788      +69     
==========================================
+ Hits        40936    41003      +67     
- Misses       8783     8785       +2     
Impacted Files Coverage Δ
arrow/src/array/equal/mod.rs 93.13% <ø> (ø)
arrow/src/array/equal/utils.rs 74.00% <ø> (ø)
arrow/src/datatypes/datatype.rs 66.38% <0.00%> (ø)
parquet/src/arrow/arrow_writer.rs 98.07% <ø> (ø)
parquet/src/arrow/levels.rs 82.87% <0.00%> (ø)
parquet/src/arrow/schema.rs 85.56% <ø> (ø)
arrow/src/array/builder.rs 86.49% <80.00%> (-0.05%) ⬇️
arrow/src/array/array_union.rs 90.76% <86.66%> (-0.22%) ⬇️
arrow/src/array/data.rs 82.28% <96.96%> (+1.14%) ⬆️
arrow/src/array/array.rs 85.25% <100.00%> (ø)
... and 7 more

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update fe106e0...c8737a1. Read the comment docs.

@github-actions github-actions bot added the arrow-flight Changes to the arrow-flight crate label Nov 8, 2021
@alamb alamb changed the title Update Union Array to match Arrow Spec, rename new -> unsafe new_unchecked() Update Union Array to add UnionMode, match latest Arrow Spec, and rename new -> unsafe new_unchecked() Nov 8, 2021
@alamb alamb added the stale-pr label Dec 20, 2021
@alamb
Copy link
Contributor Author

alamb commented Dec 20, 2021

Marking PRs that are over a month old as stale -- please let us know if there is additional work planned or if we should close them.

@alamb
Copy link
Contributor Author

alamb commented Dec 20, 2021

(I do have some idea in my head I can finish this one up)

@alamb alamb marked this pull request as ready for review December 20, 2021 17:05
@alamb
Copy link
Contributor Author

alamb commented Dec 20, 2021

Ok, I got the yak shaving bug and cleaned up this PR and marked it ready for review. 🙏

@alamb alamb removed the stale-pr label Dec 21, 2021
@alamb
Copy link
Contributor Author

alamb commented Dec 29, 2021

@jimexist @paddyhoran @nevi-me -- might one of you have time to review this PR? It revamps how UnionArray is supported to conform to the modern Arrow spec, and adds additional validation. I would like to get it in for the 7.0.0 release in a few weeks time

@paddyhoran
Copy link
Contributor

Sorry I've been so quiet on Arrow. I'll find some time to review in the coming days.

Copy link
Contributor

@paddyhoran paddyhoran left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM, thanks @alamb

/// are used to index into the `child_arrays`.
///
/// The `value_offsets` `Buffer` is only provided in the case of a dense union, sparse unions
/// should use `None`. If provided the `value_offsets` `Buffer` should contain `i32` values.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think you removed the period at the end of this line by mistake, right?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, I think so. Good catch.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed in c8737a1

@alamb alamb merged commit 89d5273 into apache:master Jan 2, 2022
@alamb alamb deleted the alamb/union_fix branch January 2, 2022 14:37
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

api-change Changes to the arrow API arrow Changes to the arrow crate arrow-flight Changes to the arrow-flight crate parquet Changes to the parquet crate

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Add Dense/Sparse annotation to DataType::Union Implement Union validity bitmap changes from ARROW-9222

3 participants