Complete dictionary support #3
Conversation
Improve variance merging method to address stability issue when merging short chunks with approximate mean value. Improve reference variance accuracy by leveraging Kahan summation. Closes apache#8437 from cyb70289/variance-stability Authored-by: Yibo Cai <[email protected]> Signed-off-by: Antoine Pitrou <[email protected]>
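For context, Kahan (compensated) summation keeps a running error term so that low-order bits lost when adding small values to a large partial sum are folded back in. A minimal sketch of the technique (a generic illustration, not the code from this commit):

```rust
/// Compensated (Kahan) summation: an illustration, not Arrow's implementation.
fn kahan_sum(values: &[f64]) -> f64 {
    let mut sum = 0.0;
    let mut compensation = 0.0; // running record of lost low-order bits
    for &v in values {
        let y = v - compensation; // re-apply previously lost bits
        let t = sum + y;          // low-order bits of y may be lost here
        compensation = (t - sum) - y; // recover exactly what was lost
        sum = t;
    }
    sum
}

fn main() {
    // Plain summation of many tiny values added to a large one drifts;
    // the compensated sum stays much closer to the true result.
    let mut values = vec![1e9_f64];
    values.extend(std::iter::repeat(1e-7).take(1_000_000));
    println!("compensated sum = {}", kahan_sum(&values));
}
```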
The benchmarks were only benchmarking planning, not execution, of the plans. This PR fixes this. Closes apache#8452 from jorgecarleitao/bench Authored-by: Jorge C. Leitao <[email protected]> Signed-off-by: Andy Grove <[email protected]>
…umulators. This PR replaces `Rc<RefCell<>>` by `Box<>`. We do not need interior mutability on the accumulations. Closes apache#8456 from jorgecarleitao/box Authored-by: Jorge C. Leitao <[email protected]> Signed-off-by: Andy Grove <[email protected]>
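As a small illustration of the ownership change (with a made-up `Accumulator` trait, not DataFusion's actual one): a uniquely-owned `Box` already allows mutation, so the reference counting and runtime borrow checks of `Rc<RefCell<_>>` buy nothing here.

```rust
// Illustration only: a hypothetical Accumulator trait, not DataFusion's.
trait Accumulator {
    fn update(&mut self, value: f64);
    fn result(&self) -> f64;
}

struct Sum(f64);

impl Accumulator for Sum {
    fn update(&mut self, value: f64) { self.0 += value; }
    fn result(&self) -> f64 { self.0 }
}

fn main() {
    // A uniquely-owned accumulator can be mutated straight through the Box;
    // Rc<RefCell<_>> is only needed for shared ownership with interior
    // mutability, which these accumulators don't require.
    let mut acc: Box<dyn Accumulator> = Box::new(Sum(0.0));
    acc.update(1.5);
    acc.update(2.5);
    assert_eq!(acc.result(), 4.0);
}
```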
We were reading dictionaries in the file reader, but not in the stream reader. This was a trivial change, as we needed to add the dictionary to the stream when we encounter it, and then read the next message until we reach a record batch. I tested with the 0.14.1 golden file, I'm going to test with later versions (1.0.0-littleendian) when I get to `arrow::ipc::MetadataVersion::V5` support, hopefully soon. Closes apache#8450 from nevi-me/ARROW-10289 Authored-by: Neville Dipale <[email protected]> Signed-off-by: Jorge C. Leitao <[email protected]>
Currently, `mergeExec` uses `tokio::spawn` to parallelize the work, by calling `tokio::spawn` once per logical thread. However, `tokio::spawn` returns a task / future, which `tokio` runtime will then schedule on its thread pool. Therefore, there is no need to limit the number of tasks to the number of logical threads, as tokio's runtime itself is responsible for that work. In particular, since we are using [`rt-threaded`](https://docs.rs/tokio/0.2.22/tokio/runtime/index.html#threaded-scheduler), tokio already declares a thread pool from the number of logical threads available. This PR removes the coupling, in `mergeExec`, between the number of logical threads (`max_concurrency`) and the number of created tasks. I observe no change in performance.

Benchmark results:

```
Switched to branch 'simplify_merge'
Your branch is up to date with 'origin/simplify_merge'.
   Compiling datafusion v2.0.0-SNAPSHOT (/Users/jorgecarleitao/projects/arrow/rust/datafusion)
    Finished bench [optimized] target(s) in 38.02s
     Running /Users/jorgecarleitao/projects/arrow/rust/target/release/deps/aggregate_query_sql-5241a705a1ff29ae
Gnuplot not found, using plotters backend

aggregate_query_no_group_by 15 12
    time:   [715.17 us 722.60 us 730.19 us]
    change: [-8.3167% -5.2253% -2.2675%] (p = 0.00 < 0.05)
    Performance has improved.
Found 3 outliers among 100 measurements (3.00%)
  1 (1.00%) high mild
  2 (2.00%) high severe

aggregate_query_group_by 15 12
    time:   [5.6538 ms 5.6695 ms 5.6892 ms]
    change: [+0.1012% +0.5308% +0.9913%] (p = 0.02 < 0.05)
    Change within noise threshold.
Found 10 outliers among 100 measurements (10.00%)
  4 (4.00%) high mild
  6 (6.00%) high severe

aggregate_query_group_by_with_filter 15 12
    time:   [2.6598 ms 2.6665 ms 2.6751 ms]
    change: [-0.5532% -0.1446% +0.2679%] (p = 0.51 > 0.05)
    No change in performance detected.
Found 7 outliers among 100 measurements (7.00%)
  3 (3.00%) high mild
  4 (4.00%) high severe
```

Closes apache#8453 from jorgecarleitao/simplify_merge
Authored-by: Jorge C. Leitao <[email protected]>
Signed-off-by: Jorge C. Leitao <[email protected]>
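A minimal sketch of the pattern (an assumed shape, not the actual `mergeExec` code): one task is spawned per input partition, and the tokio runtime, which already sizes its thread pool from the available logical threads, decides how they run.

```rust
// Requires the tokio crate with its multi-threaded runtime enabled.
#[tokio::main]
async fn main() {
    // Pretend each inner Vec is one input partition.
    let partitions: Vec<Vec<i64>> = vec![vec![1, 2], vec![3, 4], vec![5, 6, 7]];

    // Spawn one task per partition, regardless of how many threads exist;
    // the runtime schedules the tasks onto its pool.
    let handles: Vec<_> = partitions
        .into_iter()
        .map(|partition| tokio::spawn(async move { partition.iter().sum::<i64>() }))
        .collect();

    let mut total = 0;
    for handle in handles {
        total += handle.await.unwrap();
    }
    assert_eq!(total, 28);
}
```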
Also adds a GHA job that tests on R-devel so we catch issues like this sooner. Closes apache#8447 from nealrichardson/r-timestamp-test Authored-by: Neal Richardson <[email protected]> Signed-off-by: Neal Richardson <[email protected]>
Fix Table.empty for schemas containing compound types (List, FixedSizeList, Map)
Steps for reproduction:
```js
const foo = new arrow.List(new arrow.Field('bar', new arrow.Float64()))
const table = arrow.Table.empty(foo) // ⚡
```
The Data constructor assumes childData is either falsy, a zero-length array (still falsy, but worth distinguishing), or a non-zero-length array of valid instances of Data or objects with a data property. Coercing undefineds to empty arrays a little earlier for compound types (List, FixedSizeList, Map) avoids this.
Closes apache#7771 from H-Plus-Time/ARROW-9479
Authored-by: H-Plus-Time <[email protected]>
Signed-off-by: Brian Hulette <[email protected]>
…alls back to string Closes apache#8462 from bkietz/10145-Integer-like-partition-fi Authored-by: Benjamin Kietzman <[email protected]> Signed-off-by: Joris Van den Bossche <[email protected]>
When translating between the memory FieldType and message FieldType for dictionary encoded vectors the children of the dictionary field were not handled correctly.

* When going from memory format to message format the Field must have the children of the dictionary field.
* When going from message format to memory format the Field must have no children but the dictionary must have the mapped children.

Closes apache#8363 from HedgehogCode/bug/ARROW-10174-dict-structs
Authored-by: Benjamin Wilhelm <[email protected]>
Signed-off-by: liyafan82 <[email protected]>
…ataFusion This is a PR incorporating the feedback from @nevi-me and @jorgecarleitao from apache#8400. It adds:

1. a `can_cast_types` function to the Arrow cast kernel (as suggested by @jorgecarleitao / @nevi-me in apache#8400 (comment)) that encodes the valid type casting
2. a test that ensures `can_cast_types` and `cast` remain in sync
3. bug fixes that the test above uncovered (I'll comment inline)
4. a change to DataFusion to use `can_cast_types` so that it plans casting consistently with what arrow allows

Previously the notions of coercion and casting were somewhat conflated in DataFusion. I have tried to clarify them in apache#8399 and this PR. See also apache#8340 (comment) for more discussion. I am adding this functionality so DataFusion gains rudimentary support for `DictionaryArray`. Codewise, I am concerned about the duplication in logic between the match statements in `cast` and `can_cast_types`. I have some thoughts on how to unify them (see apache#8400 (comment)), but I don't have time to implement that as it is a bigger change. I think this approach with some duplication is ok, and the test will ensure they remain in sync.

Closes apache#8460 from alamb/alamb/ARROW-10236-casting-rules-2
Authored-by: alamb <[email protected]>
Signed-off-by: Neville Dipale <[email protected]>
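A small usage sketch of the pairing described above (using the arrow crate's cast kernel; the exact module path may differ slightly between versions): a caller first asks `can_cast_types` whether a cast is considered valid, and only then relies on `cast` succeeding.

```rust
use std::sync::Arc;

use arrow::array::{Array, ArrayRef, Int32Array};
use arrow::compute::kernels::cast::{can_cast_types, cast};
use arrow::datatypes::DataType;

fn main() -> arrow::error::Result<()> {
    let array: ArrayRef = Arc::new(Int32Array::from(vec![Some(1), None, Some(3)]));

    // Planning consults can_cast_types up front...
    if can_cast_types(array.data_type(), &DataType::Utf8) {
        // ...so execution can assume the cast itself will not be rejected.
        let casted = cast(&array, &DataType::Utf8)?;
        assert_eq!(casted.data_type(), &DataType::Utf8);
    }
    Ok(())
}
```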
Unlike other fixed width vectors, DecimalVectors have some APIs that directly manipulate an ArrowBuf (e.g. `void set(int index, int isSet, int start, ArrowBuf buffer)`). After supporting 64-bit ArrowBufs, we need to adjust such APIs so that they work properly. Closes apache#8455 from liyafan82/fly_1012_dec Authored-by: liyafan82 <[email protected]> Signed-off-by: Micah Kornfield <[email protected]>
…tor instead. Issue link: https://issues.apache.org/jira/browse/ARROW-9475. Closes apache#7768 from zhztheplayer/ARROW-9475 Authored-by: Hongze Zhang <[email protected]> Signed-off-by: Micah Kornfield <[email protected]>
This improves CSV string conversion performance by about 30%. Closes apache#8470 from pitrou/ARROW-10313-faster-utf8-validate Authored-by: Antoine Pitrou <[email protected]> Signed-off-by: Antoine Pitrou <[email protected]>
…ions Moving the castint/float functions to gdv_function_stubs outside of precompiled module.

Closes apache#8096 from projjal/castint and squashes the following commits:

85179a5 <Projjal Chanda> moved castInt to gdv_fn_stubs
c09077e <Projjal Chanda> fixed castfloat function
ddc429d <Projjal Chanda> added java test case
f666f54 <Projjal Chanda> fix error handling in castint

Authored-by: Projjal Chanda <[email protected]>
Signed-off-by: Praveen <[email protected]>
- Fix the verification build setups
- Expose `--param` options to crossbow.py submit to override jinja parameters
- Expose the same option to the comment bot, so `crossbow submit -p release=2.0.0 -p rc=2 -g verify-rc` will work next time

Closes apache#8464 from kszucs/release-verification
Authored-by: Krisztián Szűcs <[email protected]>
Signed-off-by: Krisztián Szűcs <[email protected]>
… compiler Also build the SIMD files as ARROW_RUNTIME_SIMD_LEVEL. Signed-off-by: Frank Du <[email protected]> Closes apache#8478 from jianxind/avx512_runtime_level_build Authored-by: Frank Du <[email protected]> Signed-off-by: Neal Richardson <[email protected]>
**Note**: I started making changes to apache#6785, and ended up deviating a lot, so I opted for making a new draft PR in case my approach is not suitable.

___

This is a draft to implement an arrow writer for parquet. It supports the following (no complete test coverage yet):

* writing primitives except for booleans and binary
* nested structs
* null values (via definition levels)

It does not yet support:

- Boolean arrays (have to be handled differently from numeric values)
- Binary arrays
- Dictionary arrays
- Union arrays (are they even possible?)

I have only added a test by creating a nested schema, which I tested on pyarrow.

```
# schema of test_complex.parquet
a: int32 not null
b: int32
c: struct<d: double, e: struct<f: float>> not null
  child 0, d: double
  child 1, e: struct<f: float>
    child 0, f: float
```

This PR potentially addresses:

* https://issues.apache.org/jira/browse/ARROW-8289
* https://issues.apache.org/jira/browse/ARROW-8423
* https://issues.apache.org/jira/browse/ARROW-8424
* https://issues.apache.org/jira/browse/ARROW-8425

And I would like to propose either opening new JIRAs for the above incomplete items, or renaming the last 3 above.

___

**Help Needed**

I'm implementing the definition and repetition levels on first principle from an old Parquet blog post from the Twitter engineering blog. It's likely that I'm not getting some concepts correct, so I would appreciate help with:

* Checking if my logic is correct
* Guidance or suggestions on how to more efficiently extract levels from arrays
* Adding tests - I suspect we might need a lot of tests, so far we only test writing 1 batch, so I don't know how paging would work when writing a large enough file

I also don't know if the various encoding levels (dictionary, RLE, etc.) and compression levels are applied automagically, or if that'd be something we need to explicitly enable.

CC @sunchao @sadikovi @andygrove @paddyhoran
Might be of interest to @mcassels @maxburke

Closes apache#7319 from nevi-me/arrow-parquet-writer
Lead-authored-by: Neville Dipale <[email protected]>
Co-authored-by: Max Burke <[email protected]>
Co-authored-by: Andy Grove <[email protected]>
Co-authored-by: Max Burke <[email protected]>
Signed-off-by: Neville Dipale <[email protected]>
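For readers unfamiliar with the Dremel-style encoding mentioned above: for a flat, nullable column the definition level simply records whether each slot holds a value; nested types add one level per nullable or repeated ancestor. A toy sketch of the flat base case (an illustration, not the PR's implementation):

```rust
/// Definition levels for a flat nullable column: 1 where a value is present,
/// 0 where it is null.
fn definition_levels(values: &[Option<i32>]) -> Vec<i16> {
    values
        .iter()
        .map(|v| if v.is_some() { 1 } else { 0 })
        .collect()
}

fn main() {
    assert_eq!(definition_levels(&[Some(1), None, Some(3)]), vec![1, 0, 1]);
}
```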
This will allow preserving Arrow-specific metadata when writing or reading Parquet files created from C++ or Rust. If the schema can't be deserialised, the normal Parquet > Arrow schema conversion is performed. Closes apache#7917 from nevi-me/ARROW-8243 Authored-by: Neville Dipale <[email protected]> Signed-off-by: Neville Dipale <[email protected]>
…arrow_schema with ipc changes Note that this PR is deliberately filed against the rust-parquet-arrow-writer branch, not master!! Hi! 👋 I'm looking to help out with the rust-parquet-arrow-writer branch, and I just pulled it down and it wasn't compiling because in 75f804e, `schema_to_bytes` was changed to take `IpcWriteOptions` and to return `EncodedData`. This updates `encode_arrow_schema` to use those changes, which should get this branch compiling and passing tests again. I'm kind of guessing which JIRA ticket this should be associated with; honestly I think this commit can just be squashed with apache@8f0ed91 next time this branch gets rebased. Please let me know if I should change anything, I'm happy to! Closes apache#8274 from carols10cents/update-with-ipc-changes Authored-by: Carol (Nichols || Goulding) <[email protected]> Signed-off-by: Neville Dipale <[email protected]>
In this commit, I:

- Extracted a `build_field` function for some code shared between `schema_to_fb` and `schema_to_fb_offset` that needed to change
- Uncommented the dictionary field from the Arrow schema roundtrip test and added a dictionary field to the IPC roundtrip test
- If a field is a dictionary field, call `add_dictionary` with the dictionary field information on the flatbuffer field, building the dictionary as [the C++ code does][cpp-dictionary] and describe with the same comment
- When getting the field type for a dictionary field, use the `value_type` as [the C++ code does][cpp-value-type] and describe with the same comment

The tests pass because the Parquet -> Arrow conversion for dictionaries is [already supported][parquet-to-arrow].

[cpp-dictionary]: https://github.com/apache/arrow/blob/477c1021ac013f22389baf9154fb9ad0cf814bec/cpp/src/arrow/ipc/metadata_internal.cc#L426-L440
[cpp-value-type]: https://github.com/apache/arrow/blob/477c1021ac013f22389baf9154fb9ad0cf814bec/cpp/src/arrow/ipc/metadata_internal.cc#L662-L667
[parquet-to-arrow]: https://github.com/apache/arrow/blob/477c1021ac013f22389baf9154fb9ad0cf814bec/rust/arrow/src/ipc/convert.rs#L120-L127

Closes apache#8291 from carols10cents/rust-parquet-arrow-writer
Authored-by: Carol (Nichols || Goulding) <[email protected]>
Signed-off-by: Neville Dipale <[email protected]>
…r all supported Arrow DataTypes Note that this PR goes to the rust-parquet-arrow-writer branch, not master. Inspired by tests in cpp/src/parquet/arrow/arrow_reader_writer_test.cc. These perform round-trip Arrow -> Parquet -> Arrow of a single RecordBatch with a single column of values of each of the supported data types and some of the unsupported ones. Tests that currently fail are either marked with `#[should_panic]` (if the reason they fail is because of a panic) or `#[ignore]` (if the reason they fail is because the values don't match). I am comparing the RecordBatch's column's data before and after the round trip directly; I'm not sure whether this is appropriate, because for some data types the `null_bitmap` isn't matching and I'm not sure if it's supposed to or not. So I would love advice on that front, and I would love to know if these tests are useful or not! Closes apache#8330 from carols10cents/roundtrip-tests Lead-authored-by: Carol (Nichols || Goulding) <[email protected]> Co-authored-by: Neville Dipale <[email protected]> Signed-off-by: Andy Grove <[email protected]>
…m Parquet metadata when available @nevi-me This is one commit on top of apache#8330 that I'm opening to get some feedback from you on about whether this will help with ARROW-10168. I *think* this will bring the Rust implementation more in line with C++, but I'm not certain. I tried removing the `#[ignore]` attributes from the `LargeArray` and `LargeUtf8` tests, but they're still failing because the schemas don't match yet-- it looks like [this code](https://github.com/apache/arrow/blob/b2842ab2eb0d7a7a633049a5591e1eaa254d4446/rust/parquet/src/arrow/array_reader.rs#L595-L638) will need to be changed as well. That `build_array_reader` function's code looks very similar to the code I've changed here, is there a possibility for the code to be shared or is there a reason they're separate? Closes apache#8354 from carols10cents/schema-roundtrip Lead-authored-by: Carol (Nichols || Goulding) <[email protected]> Co-authored-by: Neville Dipale <[email protected]> Signed-off-by: Neville Dipale <[email protected]>
Closes apache#8388 from nevi-me/ARROW-10225 Authored-by: Neville Dipale <[email protected]> Signed-off-by: Neville Dipale <[email protected]>
This allows writing an Arrow NullArray to Parquet. Support was added a few years ago in Parquet, and the C++ implementation supports writing null arrays. The array is stored as an int32 which has all values set as null. In order to implement this, we introduce a `null -> int32` cast, which creates a null int32 of same length. Semantically, the write is the same as writing an int32 that's all null, but we create a null writer to preserve the data type. Closes apache#8484 from nevi-me/ARROW-10334 Authored-by: Neville Dipale <[email protected]> Signed-off-by: Neville Dipale <[email protected]>
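A sketch of the behaviour this adds, assuming the new `null -> int32` cast described above is available in the branch:

```rust
use std::sync::Arc;

use arrow::array::{Array, ArrayRef, NullArray};
use arrow::compute::cast;
use arrow::datatypes::DataType;

fn main() -> arrow::error::Result<()> {
    // A NullArray carries only a length; it has no value buffers.
    let nulls: ArrayRef = Arc::new(NullArray::new(3));

    // The cast produces an Int32 array of the same length with every slot
    // null, which the writer can emit while preserving the original type.
    let as_int32 = cast(&nulls, &DataType::Int32)?;
    assert_eq!(as_int32.len(), 3);
    assert_eq!(as_int32.null_count(), 3);
    Ok(())
}
```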
This is a port of apache#6770 to the parquet-writer branch. We'll have more of a chance to test this reader, and ensure that we can roundtrip on list types. Closes apache#8449 from nevi-me/ARROW-7842-cherry Authored-by: Neville Dipale <[email protected]> Signed-off-by: Neville Dipale <[email protected]>
This adds more support for:

- When converting Arrow -> Parquet containing an Arrow Dictionary, materialize the Dictionary values and send them to Parquet to be encoded with a dictionary or not according to the Parquet settings (not supported: converting an Arrow Dictionary directly to Parquet DictEncoding; also only supports Int32 index types in this commit; also removes NULLs)
- When converting Parquet -> Arrow, noticing that the Arrow schema metadata in a Parquet file has a Dictionary type and converting the data to an Arrow dictionary (right now this only supports String dictionaries)
This allows us to compare padded buffers with unpadded ones. When reading buffers from IPC, they are padded.
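A minimal sketch of the comparison rule (an illustration, not arrow's `Buffer` equality code): only the logical bytes matter, and trailing alignment padding added by IPC reads is ignored.

```rust
// Hypothetical helper: two byte buffers are "equal" if they agree on the
// first `logical_len` bytes, regardless of any trailing padding.
fn buffers_equal(a: &[u8], b: &[u8], logical_len: usize) -> bool {
    a.len() >= logical_len && b.len() >= logical_len && a[..logical_len] == b[..logical_len]
}

fn main() {
    let unpadded = [1u8, 2, 3];
    let padded = [1u8, 2, 3, 0, 0, 0, 0, 0]; // padded out to 8 bytes, as in IPC
    assert!(buffers_equal(&unpadded, &padded, 3));
}
```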
This leaves a door open to also support dictionaries with non-string values, but that's not currently implemented.
As the C++ implementation was updated to do in b1a7a73, and as supported by the unsigned integer types that implement ArrowDictionaryKeyType.
Dictionaries can be indexed by either signed or unsigned integers.
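A sketch of the write-side materialization described above, assuming the dictionary-to-values support in the Arrow cast kernel: the dictionary is flattened to plain values before being handed to the Parquet column writer, which then applies (or not) its own dictionary encoding according to the writer settings.

```rust
use std::sync::Arc;

use arrow::array::{Array, ArrayRef, DictionaryArray};
use arrow::compute::kernels::cast::{can_cast_types, cast};
use arrow::datatypes::{DataType, Int32Type};

fn main() -> arrow::error::Result<()> {
    // A string dictionary keyed by i32 -- the case the writer currently handles.
    let dict: DictionaryArray<Int32Type> = vec!["a", "b", "a"].into_iter().collect();
    let dict: ArrayRef = Arc::new(dict);

    // Materialize keys + values into a plain Utf8 array before writing.
    if can_cast_types(dict.data_type(), &DataType::Utf8) {
        let values = cast(&dict, &DataType::Utf8)?;
        assert_eq!(values.data_type(), &DataType::Utf8);
        assert_eq!(values.len(), 3);
    }
    Ok(())
}
```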
```rust
// My assumption is that we can't get to an illegal cast as we can only
// generate types that are supported, because we'd have gotten them from
// the metadata which was written to the Parquet sink
let array = arrow::compute::cast(&array, self.get_data_type())?;
```
This is the main addition, everything else is to accommodate being able to do this.
The Arrow type was getting erased by the cast converters, so I removed them, and instead preserve the Option<DataType> from the schema all the way to this point.
This was also going to become important when dealing with nested lists, as there's a breaking change that we need to make to Arrow in order to correctly roundtrip lists.
Cherry-picked this to apache#8402 :)
From a deadlocked run...

```
#0  0x00007f8a5d48dccd in __lll_lock_wait () from /lib64/libpthread.so.0
#1  0x00007f8a5d486f05 in pthread_mutex_lock () from /lib64/libpthread.so.0
#2  0x00007f8a566e7e89 in arrow::internal::FnOnce<void ()>::FnImpl<arrow::Future<Aws::Utils::Outcome<Aws::S3::Model::ListObjectsV2Result, Aws::S3::S3Error> >::Callback<arrow::fs::(anonymous namespace)::TreeWalker::ListObjectsV2Handler> >::invoke() () from /arrow/r/check/arrow.Rcheck/arrow/libs/arrow.so
#3  0x00007f8a5650efa0 in arrow::FutureImpl::AddCallback(arrow::internal::FnOnce<void ()>) () from /arrow/r/check/arrow.Rcheck/arrow/libs/arrow.so
#4  0x00007f8a566e67a9 in arrow::fs::(anonymous namespace)::TreeWalker::ListObjectsV2Handler::SpawnListObjectsV2() () from /arrow/r/check/arrow.Rcheck/arrow/libs/arrow.so
#5  0x00007f8a566e723f in arrow::fs::(anonymous namespace)::TreeWalker::WalkChild(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, int) () from /arrow/r/check/arrow.Rcheck/arrow/libs/arrow.so
#6  0x00007f8a566e827d in arrow::internal::FnOnce<void ()>::FnImpl<arrow::Future<Aws::Utils::Outcome<Aws::S3::Model::ListObjectsV2Result, Aws::S3::S3Error> >::Callback<arrow::fs::(anonymous namespace)::TreeWalker::ListObjectsV2Handler> >::invoke() () from /arrow/r/check/arrow.Rcheck/arrow/libs/arrow.so
#7  0x00007f8a5650efa0 in arrow::FutureImpl::AddCallback(arrow::internal::FnOnce<void ()>) () from /arrow/r/check/arrow.Rcheck/arrow/libs/arrow.so
#8  0x00007f8a566e67a9 in arrow::fs::(anonymous namespace)::TreeWalker::ListObjectsV2Handler::SpawnListObjectsV2() () from /arrow/r/check/arrow.Rcheck/arrow/libs/arrow.so
#9  0x00007f8a566e723f in arrow::fs::(anonymous namespace)::TreeWalker::WalkChild(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, int) () from /arrow/r/check/arrow.Rcheck/arrow/libs/arrow.so
#10 0x00007f8a566e74b1 in arrow::fs::(anonymous namespace)::TreeWalker::DoWalk() () from /arrow/r/check/arrow.Rcheck/arrow/libs/arrow.so
```

The callback `ListObjectsV2Handler` is being called recursively and the mutex is non-reentrant, thus the deadlock. To fix it I got rid of the mutex on `TreeWalker` by using `arrow::util::internal::TaskGroup` instead of manually tracking the number/status of in-flight requests.

Closes apache#9842 from westonpace/bugfix/arrow-12040
Lead-authored-by: Weston Pace <[email protected]>
Co-authored-by: Antoine Pitrou <[email protected]>
Signed-off-by: Antoine Pitrou <[email protected]>
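The underlying failure mode is the usual one for a non-reentrant lock: a callback re-enters code that tries to acquire a mutex the same thread already holds. A tiny Rust illustration of the hazard (the actual bug is in the C++ S3 filesystem code):

```rust
use std::sync::Mutex;

fn main() {
    let in_flight = Mutex::new(0u32);

    let mut guard = in_flight.lock().unwrap();
    *guard += 1;

    // If a completion callback ran synchronously on this same thread and tried
    // to lock again, it would block forever: std::sync::Mutex is not reentrant.
    // let _deadlock = in_flight.lock().unwrap(); // uncommenting this deadlocks

    drop(guard);
}
```

Tracking in-flight work with a task group that carries its own synchronization sidesteps the need to hold a lock across callbacks.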
This simplifies dictionary support by moving the primitive casts to one place. The idea is to not cast Parquet primitive types, as these can be mapped to 4 Arrow types (i32, i64, f32, f64).
Once these 4 primitive Arrow arrays are created, we can leverage the machinery in `arrow::compute::cast` to cast to many Arrow types.

I've left some TODOs for technical debt which I'd love for us to address in this PR, lest we never get to it.
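A sketch of what that looks like on the read side (an illustration under the assumption that the relevant casts are registered in the kernel, not this branch's actual code): the Parquet physical values land in one of the four primitive arrays, and a single cast recovers the logical type recorded in the schema.

```rust
use std::sync::Arc;

use arrow::array::{Array, ArrayRef, Int64Array};
use arrow::compute::kernels::cast::{can_cast_types, cast};
use arrow::datatypes::{DataType, TimeUnit};

fn main() -> arrow::error::Result<()> {
    // Pretend these i64s were read straight from a Parquet INT64 column.
    let physical: ArrayRef = Arc::new(Int64Array::from(vec![
        1_600_000_000_000,
        1_600_000_060_000,
    ]));

    // The schema says the column is really a millisecond timestamp (an example
    // target type); one cast recovers it without a per-type converter.
    let target = DataType::Timestamp(TimeUnit::Millisecond, None);
    if can_cast_types(physical.data_type(), &target) {
        let logical = cast(&physical, &target)?;
        assert_eq!(logical.data_type(), &target);
    }
    Ok(())
}
```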