BUG: Write compliant Parquet with pyarrow #43690
judahrand wants to merge 6 commits into pandas-dev:master
jreback left a comment
umm, we can discuss updating the pyarrow version, though I think we just did.

pls try to solve this conditionally based on the pyarrow version
Doesn't it feel a bit odd to have a different output format depending on the version of PyArrow that just happens to be installed? (I believe
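For illustration, a minimal sketch of what gating the new behaviour on the installed pyarrow version could look like (a hypothetical snippet, not the PR's actual implementation; the use_compliant_nested_type keyword only exists in pyarrow >= 4.0.0):

    from packaging.version import Version

    import pandas as pd
    import pyarrow

    df = pd.DataFrame({"a": [[1, 2], [3, 4]]})

    extra_kwargs = {}
    if Version(pyarrow.__version__) >= Version("4.0.0"):
        # use_compliant_nested_type was added to pyarrow.parquet.write_table in
        # pyarrow 4.0.0; older versions would raise on the unknown keyword.
        extra_kwargs["use_compliant_nested_type"] = True

    # With the pyarrow engine, extra keyword arguments are forwarded to
    # pyarrow.parquet.write_table.
    df.to_parquet("nested.parquet", engine="pyarrow", **extra_kwargs)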
Hello @judahrand! Thanks for updating this PR. We checked the lines you've touched for PEP 8 issues, and found: There are currently no PEP 8 issues detected in this Pull Request. Cheers! 🍻 Comment last updated at 2021-09-30 21:10:49 UTC
Do you think this is something which can / should be backported to
doc/source/whatsnew/v1.4.0.rst (outdated):

    .. _whatsnew_140.notable_bug_fixes.write_compliant_parquet_nested_type:

    notable_bug_fix3
    Write compliant Parquet nested types if possible
you don't need to add a whole sub-section on this. a single line note is fine.
I just figured that since it could break people's stuff it was worth calling out, but I shall remove it if you'd prefer?
Let me know.
the entire point is that this won't break anyone; yes, just a single entry is fine
Do you want to expand on this? This change will change the data that Pandas outputs. If a user is expecting and handling the current output (not necessarily back into Pandas, as Parquet is a cross-platform format), then this could break their code elsewhere.
Am I misunderstanding your point?
just a single line is fine here
no
    df.to_parquet(path, pa)
    result = pyarrow.parquet.read_table(path)

    assert str(result.schema.field_by_name("a").type) == "list<element: int64>"
use check_round_trip(df, pa) as this will assert the results
That isn't what I'm testing though... I'm testing that the Parquet that is written is correct,
i.e. "list<element: int64>", not "list<item: int64>".
The serializer and deserializer have always worked.
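To make the distinction concrete, here is an illustrative sketch (not the PR's test itself) of why a plain value round-trip can't catch this, assuming a pyarrow version where use_compliant_nested_type defaults to off, as it did at the time:

    import pandas as pd
    import pyarrow.parquet as pq

    df = pd.DataFrame({"a": [[1, 2], [3, 4]]})
    df.to_parquet("nested.parquet", engine="pyarrow")

    # The values come back identical either way; only the schema written to the
    # file reveals whether the nested field uses the spec-compliant naming.
    schema = pq.read_table("nested.parquet").schema
    print(schema.field("a").type)
    # list<item: int64>     <- legacy pyarrow naming (non-compliant)
    # list<element: int64>  <- what the Parquet format specifies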
well, if the serialize & de-serialize works then I am not sure I understand the problem. you can certainly assert the metadata if you want. the most important point is that you need this option to have correct results, yes?
The problem is that to_parquet doesn't write Parquet-compliant data. Parquet is a 'standard', after all, and Pandas doesn't follow the standard.
What if I want to load data from Pandas into a different language?
Here is an example of a place where this is an issue: googleapis/python-bigquery#980, where use_compliant_nested_type has to be passed manually.
Really, use_compliant_nested_type should be the default in PyArrow, but it isn't for compatibility reasons. It's all in the Arrow PR I linked.
In summary: yes, use_compliant_nested_type is needed in order to output the correct, compliant Parquet format.
But then that comes back around to changing the output format, and so my point about calling users' attention to that in What's New. If they're using the data elsewhere (not Pandas), it could cause them problems.
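For reference, a hedged sketch of the manual workaround mentioned above: with the pyarrow engine, extra keyword arguments to to_parquet are forwarded to pyarrow.parquet.write_table, which accepts use_compliant_nested_type from pyarrow 4.0.0 onwards:

    import pandas as pd
    import pyarrow as pa
    import pyarrow.parquet as pq

    df = pd.DataFrame({"a": [[1, 2], [3, 4]]})

    # Through pandas: the keyword is passed straight through to pyarrow.
    df.to_parquet("compliant.parquet", engine="pyarrow",
                  use_compliant_nested_type=True)

    # Or directly with pyarrow.
    table = pa.Table.from_pandas(df)
    pq.write_table(table, "compliant.parquet", use_compliant_nested_type=True)

    print(pq.read_table("compliant.parquet").schema.field("a").type)
    # list<element: int64>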
Test timed out rather than failed.
doc/source/whatsnew/v1.4.0.rst (outdated):
just a single line is fine here
@jreback I've found your feedback on this change quite unhelpful so far. Are you happy with this change in general? I'm not going to put more time into this if the change isn't likely to be merged anyway. I've still not really had any acknowledgement that you understand the problem that this change addresses. Do you? If I adjust the release note to a single entry and get the tests passing, are you happy with this change?
@judahrand yes, the change is acceptable for code & tests. A 1-line whatsnew note is all that is needed here. P.S. A helpful attitude is a good thing; we have many PRs.
Great, I'll deal with this when I get a chance.
@jorisvandenbossche if good here
Thanks for the ping. My first reaction is: if we think this is the better default, we should bring that up to change it in pyarrow, and not in pandas (otherwise you get inconsistencies between using pandas or pyarrow, or e.g. when using dask, which can use pyarrow directly instead of going through the pandas layer). Now, it's certainly harder / a higher barrier to change this in Arrow (and you've already put a lot of effort into this PR, thanks for that!), but I would personally still prefer to at least bring it up first and see what the reaction of the other Arrow devs is.
I opened https://issues.apache.org/jira/browse/ARROW-14196 for this
Haha! I didn't clock that it had only just been opened when I commented on it!
This pull request is stale because it has been open for thirty days with no activity. Please update or respond to this comment if you're still interested in working on this.
closing as this is going to be addressed in pyarrow, as noted above
Linked issue: DataFrame().to_parquet() does not write Parquet compliant data for nested arrays #43689