BUG: Write compliant Parquet with pyarrow #43690
judahrand wants to merge 6 commits into pandas-dev:master
jreback left a comment
umm, we can discuss updating the pyarrow version, though I think we just did.

pls try to solve this conditionally based on the pyarrow version
Doesn't it feel a bit odd to have a different output format depending on the version of PyArrow that just happens to be installed? (I believe
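For illustration, a minimal sketch of what gating the new behaviour on the installed pyarrow version could look like (a hypothetical snippet, not the PR's actual implementation; the use_compliant_nested_type keyword only exists in pyarrow >= 4.0.0):

    from packaging.version import Version

    import pandas as pd
    import pyarrow

    df = pd.DataFrame({"a": [[1, 2], [3, 4]]})

    extra_kwargs = {}
    if Version(pyarrow.__version__) >= Version("4.0.0"):
        # use_compliant_nested_type was added to pyarrow.parquet.write_table in
        # pyarrow 4.0.0; older versions would raise on the unknown keyword.
        extra_kwargs["use_compliant_nested_type"] = True

    # With the pyarrow engine, extra keyword arguments are forwarded to
    # pyarrow.parquet.write_table.
    df.to_parquet("nested.parquet", engine="pyarrow", **extra_kwargs)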
Hello @judahrand! Thanks for updating this PR. We checked the lines you've touched for PEP 8 issues, and found: There are currently no PEP 8 issues detected in this Pull Request. Cheers! 🍻 Comment last updated at 2021-09-30 21:10:49 UTC
Do you think this is something which can / should be backported to
doc/source/whatsnew/v1.4.0.rst (outdated):

    .. _whatsnew_140.notable_bug_fixes.write_compliant_parquet_nested_type:

    notable_bug_fix3
    Write compliant Parquet nested types if possible
you don't need to add a whole sub-section on this. a single line note is fine.
I just figured that since it could break people's stuff it was worth calling out, but I shall remove it if you'd prefer?
Let me know.
the entire point is that this won't break anyone; yes, just a single entry is fine
Do you want to expand on this? This change will change the data that Pandas outputs. If a user is expecting and handling the current output (not necessarily back into Pandas, as Parquet is a cross-platform format), then this could break their code elsewhere.
Am I misunderstanding your point?
just a single line is fine here
no
    df.to_parquet(path, pa)
    result = pyarrow.parquet.read_table(path)

    assert str(result.schema.field_by_name("a").type) == "list<element: int64>"
use check_round_trip(df, pa) as this will assert the results
That isn't what I'm testing though... I'm testing that the Parquet that is written is correct,
i.e. "list<element: int64>", not "list<item: int64>".
The serializer and deserializer have always worked.
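To make the distinction concrete, here is an illustrative sketch (not the PR's test itself) of why a plain value round-trip can't catch this, assuming a pyarrow version where use_compliant_nested_type defaults to off, as it did at the time:

    import pandas as pd
    import pyarrow.parquet as pq

    df = pd.DataFrame({"a": [[1, 2], [3, 4]]})
    df.to_parquet("nested.parquet", engine="pyarrow")

    # The values come back identical either way; only the schema written to the
    # file reveals whether the nested field uses the spec-compliant naming.
    schema = pq.read_table("nested.parquet").schema
    print(schema.field("a").type)
    # list<item: int64>     <- legacy pyarrow naming (non-compliant)
    # list<element: int64>  <- what the Parquet format specifies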
well, if the serialize & de-serialize works then I am not sure I understand the problem. you can certainly assert the metadata if you want. the most important point is that you need this option to have correct results, yes?
The problem is that to_parquet doesn't write Parquet-compliant data. Parquet is a 'standard', after all, and Pandas doesn't follow the standard.
What if I want to load data from Pandas into a different language?
Here is an example of a place where this is an issue: googleapis/python-bigquery#980, where use_compliant_nested_type has to be passed manually.
Really, use_compliant_nested_type should be the default in PyArrow, but it isn't for compatibility reasons. It's all in the Arrow PR I linked.
In summary: yes, use_compliant_nested_type is needed in order to output the correct, compliant Parquet format.
But then that comes back around to changing the output format, and so my point about calling users' attention to that in What's New. If they're using the data elsewhere (not Pandas), it could cause them problems.
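For reference, a hedged sketch of the manual workaround mentioned above: with the pyarrow engine, extra keyword arguments to to_parquet are forwarded to pyarrow.parquet.write_table, which accepts use_compliant_nested_type from pyarrow 4.0.0 onwards:

    import pandas as pd
    import pyarrow as pa
    import pyarrow.parquet as pq

    df = pd.DataFrame({"a": [[1, 2], [3, 4]]})

    # Through pandas: the keyword is passed straight through to pyarrow.
    df.to_parquet("compliant.parquet", engine="pyarrow",
                  use_compliant_nested_type=True)

    # Or directly with pyarrow.
    table = pa.Table.from_pandas(df)
    pq.write_table(table, "compliant.parquet", use_compliant_nested_type=True)

    print(pq.read_table("compliant.parquet").schema.field("a").type)
    # list<element: int64>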
Test timed out rather than failed.
doc/source/whatsnew/v1.4.0.rst (outdated):
just a single line is fine here
@jreback I've found your feedback on this change quite unhelpful so far. Are you happy with this change in general? I'm not going to put more time into this if the change isn't likely to be merged anyway. I've still not really had any acknowledgement that you understand the problem that this change addresses. Do you? If I adjust the release note to a single entry and get the tests passing, are you happy with this change?
@judahrand yes, the change is acceptable for code & tests. A 1-line whatsnew note is all that is needed here. P.S. A helpful attitude is a good thing; we have many PRs.
Great, I'll deal with this when I get a chance.
@jorisvandenbossche if good here
Thanks for the ping. My first reaction is: if we think this is the better default, we should bring that up to change it in pyarrow, and not in pandas (otherwise you get inconsistencies between using pandas or pyarrow, or e.g. when using dask, which can use pyarrow directly instead of going through the pandas layer). Now, it's certainly harder / a higher barrier to change this in Arrow (and you've already put a lot of effort into this PR, thanks for that!), but I would personally still prefer to at least bring it up first and see what the reaction of the other Arrow devs is.
I opened https://issues.apache.org/jira/browse/ARROW-14196 for this
Haha! I didn't clock that it had only just been opened when I commented on it!
This pull request is stale because it has been open for thirty days with no activity. Please update or respond to this comment if you're still interested in working on this.
closing as this is going to be addressed in pyarrow, as noted above
Linked issue: DataFrame().to_parquet() does not write Parquet compliant data for nested arrays #43689