Skip to content

Conversation

@nevi-me
Copy link
Contributor

@nevi-me nevi-me commented Nov 7, 2020

This changes a list datatype to use Box<Field> instead of Box<DataType>.

This change is needed in order to make both Parquet and IPC roundtrips pass.
The C++ implementation uses Field, as it allows for preservice list field names and nullability.

There are some implementation details which we did not cover in this PR, and will work on as a follow-up, and these are:

  • Documenting the naming conventions of Arrow and Parquet on list names (Arrow C++ uses "item", Parquet has different compatibility options)
  • Fixing Parquet failing list tests that have been ignored (I'd like to do this separately, as this isn't a full fix)

@github-actions
Copy link

github-actions bot commented Nov 7, 2020

Copy link
Contributor

@alamb alamb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I reviewed all the changes in this PR and they look good to me -- they do what the PR says they do and the rationale for the change on https://issues.apache.org/jira/browse/ARROW-10261 makes sense too

One thought I had that might make the code slightly simpler would be to add a DataType::new_list() function, which would allow code like this:

DataType::List(Box::new(Field::new(
    "array",
    DataType::Boolean,
    true,
))),

to be written like this

DataType::new_list(DataType::Boolean, true)

We can always add such a convenience function later, however. It is definitely not required in this PR

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I wonder if there is any reason to leave this commented out? As in why not remove the old version?

I have the same question for the other instances in this file

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Forgot about it(them), I was struggling with compiling due to lifetime issues, so I commented out those lines in case I needed to revert them. I've now removed them

Copy link
Member

@jorgecarleitao jorgecarleitao left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM. Thanks a lot, @nevi-me !

This is part 1 of potentially 3 parts, it only changes the `arrow` crate

`parquet` depends on `arrow`, and `datafusion` depends on both.
`datafusion` will thus only compile after `parquet` is fixed.
@nevi-me nevi-me force-pushed the ARROW-10261-BREAKING branch from 69198da to dd8178c Compare November 7, 2020 11:45
@nevi-me
Copy link
Contributor Author

nevi-me commented Nov 7, 2020

We can always add such a convenience function later, however. It is definitely not required in this PR

Yes @alamb, we can address it as a follow-up. There's more work that I also need to do on top of this, I've hardcoded "item" in a bunch of places, and on the Parquet side, there's some compatibility tests that I need to write.

@jorgecarleitao happy that we merge this if/when tests pass, or would you like another reviewer to look at this?

@jorgecarleitao
Copy link
Member

Ready to merge if the tests pass. I went through all the changes and they seem reasonable. Most issues I had were the same that @alamb suggested, and the rest seems straightforward given the changes on the DataType.

I have a question related to why do we use a Box<> instead of just putting it into the stack, but other parts of the code have that, and thus makes sense to keep it as is.

@nevi-me
Copy link
Contributor Author

nevi-me commented Nov 7, 2020

I have a question related to why do we use a Box<> instead of just putting it into the stack, but other parts of the code have that, and thus makes sense to keep it as is.

We need some indirection to break the potential infinite recursion

error[E0072]: recursive type `datatypes::Field` has infinite size
   --> arrow\src\datatypes.rs:188:1
    |
188 | pub struct Field {
    | ^^^^^^^^^^^^^^^^ recursive type has infinite size
189 |     name: String,
190 |     data_type: DataType,
    |     ------------------- recursive without indirection
    |
    = help: insert indirection (e.g., a `Box`, `Rc`, or `&`) at some point to make `datatypes::Field` representable

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants