[FEA] Add --meta parameter to explicitly specify the jsonl field dtypes #63
Labels
enhancement
New feature or request
Comments
I will work on the feature.
PR #75.
ayushdg pushed a commit that referenced this issue on May 30, 2024
…ield dtypes (#75)
* Add dtype support (optional) when reading jsonl files
* Resolve merge conflict
* Change input_meta type hint
* Resolve merge conflict
* Assign input_meta to the right variable
* Add warning when input_meta is used with non jsonl files.
* Explicitly check for None when validating input_meta
* Add input_meta test
* Add description to function
* Add test_meta_str
---------
Signed-off-by: Miguel Martínez
Co-authored-by: Miguel Martínez
Is your feature request related to a problem? Please describe.
When reading jsonl files with Dask, the dataframe datatypes are inferred unless explicitly specified.
Inferring the data types can lead to several issues, such as incorrect type inference, degraded performance, and increased memory usage.
I think we could mitigate those issues by adding a `--meta` parameter, which would receive a dictionary of datatypes. That parameter would be optional, and would be similar to the `meta` parameter of `dask.dataframe.read_json`, documented here: https://docs.dask.org/en/latest/generated/dask.dataframe.read_json.html.
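To make the proposal concrete, here is a minimal sketch of how an optional `--meta` flag could accept a dictionary of datatypes on the command line. It uses only the standard library and is not the implementation from PR #75: the JSON encoding of the flag value, the `CASTERS` table, and the `read_jsonl` helper are all assumptions made for illustration. In a real pipeline the parsed dictionary would more likely be handed to the dataframe reader, for instance through the `meta` parameter linked above, instead of being applied record by record.

```python
import argparse
import json

# Hypothetical mapping from dtype names to Python casters. A real reader
# (pandas, cuDF, Dask) would interpret dtype strings itself.
CASTERS = {"str": str, "int": int, "float": float}


def build_parser() -> argparse.ArgumentParser:
    parser = argparse.ArgumentParser(
        description="Read a jsonl file with explicit field dtypes"
    )
    # Optional flag: when omitted, field types are left as parsed (inferred).
    # The value is assumed to be a JSON object such as '{"id": "str"}'.
    parser.add_argument(
        "--meta",
        type=json.loads,
        default=None,
        help='JSON dictionary of field dtypes, e.g. \'{"id": "str"}\'',
    )
    return parser


def read_jsonl(lines, meta=None):
    """Parse jsonl lines, casting the fields listed in `meta` if it is given."""
    records = []
    for line in lines:
        if not line.strip():
            continue  # skip blank lines
        record = json.loads(line)
        if meta is not None:  # explicit None check: an empty dict is still valid
            for field, dtype in meta.items():
                if field in record:
                    record[field] = CASTERS[dtype](record[field])
        records.append(record)
    return records


# Simulate: script.py --meta '{"id": "str", "score": "float"}'
args = build_parser().parse_args(["--meta", '{"id": "str", "score": "float"}'])
sample = ['{"id": 7, "score": "0.5", "text": "hello"}']
records = read_jsonl(sample, args.meta)
print(records)
```

Without `--meta`, `id` stays an integer and `score` stays a string, exactly as the JSON parser produced them. With it, both fields are forced to the requested types, which is the behaviour the feature request asks for at the dataframe level.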