Adds parquet writer #103
Two comments :).
self._writers[filename] = pq.ParquetWriter(
    file_handler, schema=pa.table({name: [val] for name, val in document.items()}).schema
)
The Document's attributes have fixed types, so I wonder if it would make more sense to pass pa.schema({"text": pa.string(), "id": pa.string(), "media": pa.struct({"type": pa.int32(), "url": pa.string(), "alt": pa.string(), "local_path": pa.string()}), "metadata": pa.string()}) for the schema.
Parquet still doesn't support unions (see apache/parquet-format#44), so we would have to work around this limitation by turning the metadata value into a string using json.dumps(metadata). Then, to make the ParquetReader compatible with this format, we would also have to add metadata to the schema (pa.schema(fields, metadata=...)), which the reader would check so that it can deserialize the value (using json.loads) on the other side if needed.
But the current solution is good enough, so this can also be addressed later.
PS: To be extra strict, fields that should never be null ("text", "id", etc.) can be declared non-nullable in the above schema with pa.field(name, pa_type, nullable=False), since fields are nullable by default.
They used to have fixed types, but now we support an adapter so that people can choose their output format (still a dictionary, but they can do whatever they want with the fields).
Regarding unions, does this mean that if we have different value types in metadata (let's say strings and floats), then this doesn't work?
Regarding nullability, the problem would also be the custom user formats.
Maybe we could also have pa.RecordBatch.from_pylist([document]).schema here instead?
> They used to have fixed types but now we support an adapter so that people can choose their output format (still a dictionary, but they can do whatever they want with the fields)

We could use the fixed schema only if adapter is not specified.
> Regarding unions, does this mean if we have different value types in metadata (let's say strings and floats) then this doesn't work?

JSON supports these types, so it will work.
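As a quick illustration of that point, a sketch using only the standard library: a metadata dictionary mixing strings and floats survives a json.dumps / json.loads round trip and is stored as a single string column value. The keys and values are made up.

```python
import json

# Hypothetical metadata mixing value types.
metadata = {"source": "common_crawl", "score": 0.87, "tags": ["a", 1, 2.5]}

encoded = json.dumps(metadata)  # a plain str, storable as pa.string()
decoded = json.loads(encoded)   # back to the original Python values
```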
> maybe we could also have pa.RecordBatch.from_pylist([document]).schema here instead?

Yes, this would be cleaner indeed.
I see. I think for now we will keep the current format, so that even when people upload to the hub directly (and so on) there isn't a big JSON field.
Co-authored-by: Mario Šaško <[email protected]>
No description provided.