Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[FEAT] UTF8 to binary coercion flag #2893

Merged
merged 13 commits into from
Sep 24, 2024
Merged

Conversation

raunakab
Copy link
Contributor

@raunakab raunakab commented Sep 23, 2024

Overview

  • add string-to-binary coercion flag
  • cleaned up some logic and instances of non-idiomatic Rust code

@raunakab raunakab changed the title Feature/utf8 to binary coercion [Feat] UTF8 to binary coercion flag Sep 23, 2024
@raunakab raunakab marked this pull request as ready for review September 23, 2024 21:58
@raunakab raunakab changed the title [Feat] UTF8 to binary coercion flag [FEAT] UTF8 to binary coercion flag Sep 23, 2024
@github-actions github-actions bot added the enhancement New feature or request label Sep 23, 2024
Copy link

codspeed-hq bot commented Sep 23, 2024

CodSpeed Performance Report

Merging #2893 will not alter performance

Comparing feature/utf8-to-binary-coercion (58c0730) with main (b519944)

Summary

✅ 17 untouched benchmarks

Copy link

codecov bot commented Sep 23, 2024

Codecov Report

Attention: Patch coverage is 88.14815% with 16 lines in your changes missing coverage. Please review.

Project coverage is 66.16%. Comparing base (78a92a2) to head (58c0730).
Report is 23 commits behind head on main.

Files with missing lines Patch % Lines
src/daft-parquet/src/read.rs 87.80% 10 Missing ⚠️
src/arrow2/src/io/parquet/read/schema/mod.rs 53.84% 6 Missing ⚠️
Additional details and impacted files

Impacted file tree graph

@@            Coverage Diff             @@
##             main    #2893      +/-   ##
==========================================
+ Coverage   65.93%   66.16%   +0.23%     
==========================================
  Files        1010     1002       -8     
  Lines      113440   114108     +668     
==========================================
+ Hits        74794    75497     +703     
+ Misses      38646    38611      -35     
Flag Coverage Δ
66.16% <88.14%> (?)

Flags with carried forward coverage won't be shown. Click here to find out more.

Files with missing lines Coverage Δ
daft/table/table.py 61.01% <100.00%> (ø)
src/arrow2/src/io/parquet/read/schema/convert.rs 95.88% <100.00%> (+0.04%) ⬆️
src/daft-micropartition/src/micropartition.rs 90.75% <100.00%> (-2.07%) ⬇️
src/daft-parquet/src/file.rs 74.20% <100.00%> (+0.03%) ⬆️
src/daft-parquet/src/lib.rs 50.00% <ø> (ø)
src/daft-parquet/src/python.rs 58.91% <100.00%> (+0.15%) ⬆️
src/daft-parquet/src/stream_reader.rs 90.56% <100.00%> (+0.20%) ⬆️
src/daft-scan/src/glob.rs 87.64% <100.00%> (+0.04%) ⬆️
src/arrow2/src/io/parquet/read/schema/mod.rs 76.47% <53.84%> (-10.49%) ⬇️
src/daft-parquet/src/read.rs 74.66% <87.80%> (+1.19%) ⬆️

... and 175 files with indirect coverage changes

Copy link
Contributor

@jaychia jaychia left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Needs tests!

daft/daft/__init__.pyi Outdated Show resolved Hide resolved
src/arrow2/src/io/parquet/read/schema/mod.rs Show resolved Hide resolved
pub struct ParquetSchemaInferenceOptionsBuilder {
pub coerce_int96_timestamp_unit: Option<PyTimeUnit>,
pub coerce_string_to_binary: Option<bool>,
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why do we need a builder? Could we not just implement a ParquetSchemaInferenceOptions::new()?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We could, but the new function will seem to get busy in the future... I don't mind adding another param to ParquetSchemaInferenceOptions::new, but as we continue to add more features and flags to those options, seems like we'll end up deferring back to a builder anyways.

@raunakab raunakab linked an issue Sep 24, 2024 that may be closed by this pull request
@raunakab raunakab marked this pull request as draft September 24, 2024 19:12
daft/table/table.py Outdated Show resolved Hide resolved
@raunakab raunakab marked this pull request as ready for review September 24, 2024 22:24
Copy link
Contributor

@jaychia jaychia left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Stamped with remaining comments

tests/table/table_io/test_parquet.py Outdated Show resolved Hide resolved
tests/table/table_io/test_parquet.py Show resolved Hide resolved
@raunakab raunakab enabled auto-merge (squash) September 24, 2024 22:48
@raunakab raunakab merged commit 260a2c7 into main Sep 24, 2024
38 checks passed
@raunakab raunakab deleted the feature/utf8-to-binary-coercion branch September 24, 2024 22:52
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement New feature or request
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Better handling of invalid utf-8 from Parquet string columns
3 participants