Fix #63. Add --input-meta parameter to explicitly specify the jsonl field dtypes #75
Signed-off-by: Miguel Martínez <[email protected]>
I like this idea! I left a few changes I'd like to see made, but nothing major. Also it looks like there are some merge conflicts so you should probably resolve those.
After sleeping on it, I do have one more nit. I think the argument for all the |
Hi @ryantwolf , thank you for all your feedback. I am addressing all the comments you have made. My first thought about the name of the parameter was to name it Even if I think it is very likely that Looking forward to your thoughts. Thanks!!! |
Thanks for the changes! Left a few suggestions, but generally looks good.
In addition to the changes, is it also possible to add a couple of tests that attempt to use DocumentDataset.read_json with the input_meta param set both as a string and as a dict, and verify that the resulting dtypes are what we'd expect?
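A test along these lines could check that the string and dict forms resolve to the same dtypes. The sketch below is library-free and rests on assumptions: `normalize_input_meta` is a hypothetical helper, and parsing the string form with `ast.literal_eval` is one plausible choice, not something confirmed in this thread. A real test would call `DocumentDataset.read_json` and inspect the dtypes of the resulting dataframe.

```python
import ast
from typing import Dict, Union


def normalize_input_meta(input_meta: Union[str, Dict[str, str]]) -> Dict[str, str]:
    """Return the dtype mapping whether it arrives as a dict or as a string literal.

    Hypothetical helper used for illustration only.
    """
    if isinstance(input_meta, str):
        # literal_eval only evaluates Python literals, never arbitrary code.
        parsed = ast.literal_eval(input_meta)
    else:
        parsed = input_meta
    if not isinstance(parsed, dict):
        raise TypeError("input_meta must describe a dict of field name -> dtype")
    return dict(parsed)


def test_string_and_dict_agree() -> None:
    as_dict = {"id": "str", "text": "str"}
    as_str = '{"id": "str", "text": "str"}'
    # Both spellings should produce the same field -> dtype mapping.
    assert normalize_input_meta(as_str) == normalize_input_meta(as_dict) == as_dict


test_string_and_dict_agree()
```

Normalizing first keeps the string-versus-dict handling in one place, so the reading code only ever sees a dict.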
Done! See |
Thanks a lot for addressing all my comments.
When reading jsonl files with Dask, the dataframe datatypes are inferred unless explicitly specified.
Inferring the data types can lead to several issues, such as incorrect type inference, degradation of performance and increased memory usage among others.
I think we could mitigate those issues by adding an --input-meta parameter, which would receive a dictionary of datatypes. That parameter would be optional and similar to the meta parameter available here: https://docs.dask.org/en/latest/generated/dask.dataframe.read_json.html.
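The difference between inferred and declared types can be illustrated without Dask. The sketch below is a plain-Python stand-in, not the PR's implementation: `read_jsonl_with_meta`, `INPUT_META`, and the field names are hypothetical, and Python types stand in for the dtype names a real --input-meta value would carry.

```python
import json
from typing import Any, Callable, Dict, Iterable, List

# Hypothetical dtype map: field name -> callable used to coerce each value.
INPUT_META: Dict[str, Callable[[Any], Any]] = {"id": str, "score": float}


def read_jsonl_with_meta(
    lines: Iterable[str], meta: Dict[str, Callable[[Any], Any]]
) -> List[Dict[str, Any]]:
    """Parse JSON Lines and coerce every field listed in `meta` to its declared type."""
    records = []
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        raw = json.loads(line)
        # Fields absent from `meta` (and null values) keep what the parser produced.
        records.append(
            {k: meta[k](v) if k in meta and v is not None else v for k, v in raw.items()}
        )
    return records


sample = ['{"id": 7, "score": 1}', '{"id": "0042", "score": 2.5}']

# Without a dtype map, the type of "id" depends on each individual record.
inferred = [json.loads(s) for s in sample]
# With a dtype map, every record agrees on the declared types.
explicit = read_jsonl_with_meta(sample, INPUT_META)

print([type(r["id"]).__name__ for r in inferred])
print([type(r["id"]).__name__ for r in explicit])
```

Declaring the types up front removes the per-record guesswork, which is the same motivation as passing explicit dtypes to the dataframe reader.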