Conversation
@nevi-me commented Oct 25, 2020

This simplifies dictionary support by moving the primitive casts to one place. The idea is not to cast each Parquet primitive type separately, as Parquet primitives map to four Arrow types (i32, i64, f32, f64).
Once these four primitive Arrow arrays are created, we can leverage the machinery in `arrow::compute::cast` to cast to many other Arrow types.

I've left some TODOs for technical debt that I'd love for us to address in this PR, lest we never get to them.
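
A sketch of the resulting shape (assuming a recent `arrow` crate; details may differ on this branch): decode Parquet INT32 into an Arrow `Int32Array`, then let the cast kernel handle the fan-out.

```rust
use std::sync::Arc;
use arrow::array::{Array, ArrayRef, Int32Array};
use arrow::compute::cast;
use arrow::datatypes::DataType;

fn main() -> arrow::error::Result<()> {
    // Pretend this array was decoded directly from a Parquet INT32 column.
    let primitive: ArrayRef = Arc::new(Int32Array::from(vec![Some(1), None, Some(3)]));

    // One cast kernel handles conversion to the target Arrow type recorded
    // in the schema (Int64 here, purely as an example).
    let casted = cast(&primitive, &DataType::Int64)?;
    assert_eq!(casted.data_type(), &DataType::Int64);
    Ok(())
}
```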

cyb70289 and others added 30 commits October 13, 2020 11:11
Improve the variance merging method to address a stability issue when merging
short chunks with an approximate mean value.

Improve reference variance accuracy by leveraging Kahan summation.
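
For reference, compensated (Kahan) summation in a Rust-flavoured sketch (the commit itself is C++):

```rust
/// Kahan summation: `c` accumulates the low-order bits that plain
/// floating-point addition would otherwise discard.
fn kahan_sum(values: &[f64]) -> f64 {
    let mut sum = 0.0;
    let mut c = 0.0; // running compensation for lost low-order bits
    for &v in values {
        let y = v - c;
        let t = sum + y;
        c = (t - sum) - y; // algebraically zero; captures the rounding error
        sum = t;
    }
    sum
}
```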

Closes apache#8437 from cyb70289/variance-stability

Authored-by: Yibo Cai <[email protected]>
Signed-off-by: Antoine Pitrou <[email protected]>
The benchmarks were only measuring planning, not execution, of the plans. This PR fixes that.

Closes apache#8452 from jorgecarleitao/bench

Authored-by: Jorge C. Leitao <[email protected]>
Signed-off-by: Andy Grove <[email protected]>
…umulators.

This PR replaces `Rc<RefCell<>>` with `Box<>`. We do not need interior mutability on the accumulators.
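
Illustratively (hypothetical aliases, not DataFusion's actual signatures):

```rust
use std::cell::RefCell;
use std::rc::Rc;

trait Accumulator {
    fn update(&mut self, value: f64);
}

// Before: shared ownership plus interior mutability, with runtime
// borrow-checking on every access.
type Before = Rc<RefCell<dyn Accumulator>>;

// After: unique ownership; `&mut self` methods can be called directly.
type After = Box<dyn Accumulator>;
```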

Closes apache#8456 from jorgecarleitao/box

Authored-by: Jorge C. Leitao <[email protected]>
Signed-off-by: Andy Grove <[email protected]>
We were reading dictionaries in the file reader, but not in the stream reader.
This was a trivial change: we add a dictionary batch to the reader's dictionaries when we encounter it, then keep reading messages until we reach a record batch.

I tested with the 0.14.1 golden file; I'm going to test with later versions (1.0.0-littleendian) when I get to `arrow::ipc::MetadataVersion::V5` support, hopefully soon.
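
A minimal sketch of that loop, with a hypothetical `Message` enum standing in for the real IPC flatbuffer messages:

```rust
// Hypothetical stand-ins for IPC stream messages; the real reader decodes
// flatbuffer messages into dictionary batches and record batches.
enum Message {
    Dictionary { id: i64 },
    Batch(Vec<i32>),
    Eos,
}

/// Absorb dictionary batches as they appear, returning only once a record
/// batch (or the end of the stream) is reached.
fn next_batch(stream: &mut impl Iterator<Item = Message>) -> Option<Vec<i32>> {
    for message in stream {
        match message {
            Message::Dictionary { id } => {
                // Register the dictionary so later fields can reference it,
                // then read the next message.
                println!("registered dictionary {}", id);
            }
            Message::Batch(batch) => return Some(batch),
            Message::Eos => return None,
        }
    }
    None
}
```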

Closes apache#8450 from nevi-me/ARROW-10289

Authored-by: Neville Dipale <[email protected]>
Signed-off-by: Jorge C. Leitao <[email protected]>
Currently, `mergeExec` parallelizes the work by calling `tokio::spawn` once per logical thread. However, `tokio::spawn` returns a task/future, which the `tokio` runtime then schedules on its thread pool.

Therefore, there is no need to limit the number of tasks to the number of logical threads, as tokio's runtime is itself responsible for that work. In particular, since we are using [`rt-threaded`](https://docs.rs/tokio/0.2.22/tokio/runtime/index.html#threaded-scheduler), tokio already sizes its thread pool from the number of logical threads available.

This PR removes the coupling, in `mergeExec`, between the number of logical threads (`max_concurrency`) and the number of created tasks. I observe no change in performance:
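
A sketch of the pattern (the `Vec<u64>` partitions are stand-ins for real execution partitions; tokio's threaded scheduler maps the tasks onto its worker threads):

```rust
use tokio::task::JoinHandle;

async fn merge(partitions: Vec<Vec<u64>>) -> Vec<u64> {
    // Spawn one task per input partition; no need to cap the task count
    // at the number of logical CPUs, the runtime schedules them all.
    let handles: Vec<JoinHandle<u64>> = partitions
        .into_iter()
        .map(|p| tokio::spawn(async move { p.iter().sum::<u64>() }))
        .collect();

    let mut results = Vec::with_capacity(handles.len());
    for handle in handles {
        results.push(handle.await.expect("task panicked"));
    }
    results
}

#[tokio::main]
async fn main() {
    assert_eq!(merge(vec![vec![1, 2], vec![3, 4, 5]]).await, vec![3, 12]);
}
```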

<details>
 <summary>Benchmark results</summary>

```
Switched to branch 'simplify_merge'
Your branch is up to date with 'origin/simplify_merge'.
   Compiling datafusion v2.0.0-SNAPSHOT (/Users/jorgecarleitao/projects/arrow/rust/datafusion)
    Finished bench [optimized] target(s) in 38.02s
     Running /Users/jorgecarleitao/projects/arrow/rust/target/release/deps/aggregate_query_sql-5241a705a1ff29ae
Gnuplot not found, using plotters backend
aggregate_query_no_group_by 15 12
                        time:   [715.17 us 722.60 us 730.19 us]
                        change: [-8.3167% -5.2253% -2.2675%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 3 outliers among 100 measurements (3.00%)
  1 (1.00%) high mild
  2 (2.00%) high severe

aggregate_query_group_by 15 12
                        time:   [5.6538 ms 5.6695 ms 5.6892 ms]
                        change: [+0.1012% +0.5308% +0.9913%] (p = 0.02 < 0.05)
                        Change within noise threshold.
Found 10 outliers among 100 measurements (10.00%)
  4 (4.00%) high mild
  6 (6.00%) high severe

aggregate_query_group_by_with_filter 15 12
                        time:   [2.6598 ms 2.6665 ms 2.6751 ms]
                        change: [-0.5532% -0.1446% +0.2679%] (p = 0.51 > 0.05)
                        No change in performance detected.
Found 7 outliers among 100 measurements (7.00%)
  3 (3.00%) high mild
  4 (4.00%) high severe
```

</details>

Closes apache#8453 from jorgecarleitao/simplify_merge

Authored-by: Jorge C. Leitao <[email protected]>
Signed-off-by: Jorge C. Leitao <[email protected]>
Also adds a GHA job that tests on R-devel so we catch issues like this sooner.

Closes apache#8447 from nealrichardson/r-timestamp-test

Authored-by: Neal Richardson <[email protected]>
Signed-off-by: Neal Richardson <[email protected]>
…Table.empty for schemas containing compound types (List, FixedSizeList, Map)

Steps for reproduction:
```js
const foo = new arrow.List(new arrow.Field('bar', new arrow.Float64()))
const table = arrow.Table.empty(foo) // ⚡
```

The `Data` constructor assumes `childData` is either falsy, a zero-length array (effectively empty, but worth distinguishing), or a non-zero-length array of valid `Data` instances or objects with a `data` property. Coercing `undefined`s to empty arrays a little earlier for compound types (List, FixedSizeList, Map) avoids this.

Closes apache#7771 from H-Plus-Time/ARROW-9479

Authored-by: H-Plus-Time <[email protected]>
Signed-off-by: Brian Hulette <[email protected]>
…falls back to string

Closes apache#8462 from bkietz/10145-Integer-like-partition-fi

Authored-by: Benjamin Kietzman <[email protected]>
Signed-off-by: Joris Van den Bossche <[email protected]>
When translating between the memory `FieldType` and the message `FieldType` for
dictionary-encoded vectors, the children of the dictionary field were not
handled correctly.
* When going from memory format to message format, the `Field` must have the
  children of the dictionary field.
* When going from message format to memory format, the `Field` must have no
  children, but the dictionary must have the mapped children.

Closes apache#8363 from HedgehogCode/bug/ARROW-10174-dict-structs

Authored-by: Benjamin Wilhelm <[email protected]>
Signed-off-by: liyafan82 <[email protected]>
…DataFusion

This PR incorporates the feedback from @nevi-me and @jorgecarleitao on apache#8400.

It adds
1. a `can_cast_types` function to the Arrow cast kernel (as suggested by @jorgecarleitao  / @nevi-me  in apache#8400 (comment)) that encodes the valid type casting
2. A test that ensures `can_cast_types` and `cast` remain in sync
3. Bug fixes that the test above uncovered (I'll comment inline)
4. Change DataFusion to use `can_cast_types` so that it plans casting consistently with what arrow allows

Previously the notions of coercion and casting were somewhat conflated in DataFusion. I have tried to clarify them in apache#8399 and this PR. See also apache#8340 (comment) for more discussion.

I am adding this functionality so DataFusion gains rudimentary support for `DictionaryArray`.

Codewise, I am concerned about the duplication in logic between the match statements in `cast` and `can_cast_types`. I have some thoughts on how to unify them (see apache#8400 (comment)), but I don't have time to implement that as it is a bigger change. I think this approach with some duplication is ok, and the test will ensure they remain in sync.
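
A minimal sketch of how the plan-time check pairs with the kernel (assuming a recent `arrow` crate; module paths may differ on this branch):

```rust
use std::sync::Arc;
use arrow::array::{Array, ArrayRef, Int32Array};
use arrow::compute::{can_cast_types, cast};
use arrow::datatypes::DataType;

fn main() -> arrow::error::Result<()> {
    let array: ArrayRef = Arc::new(Int32Array::from(vec![1, 2, 3]));
    let target = DataType::Utf8;

    // The plan-time check mirrors what `cast` will accept at execution time,
    // so a plan that passes validation should not fail in the cast kernel.
    if can_cast_types(array.data_type(), &target) {
        let casted = cast(&array, &target)?;
        assert_eq!(casted.data_type(), &target);
    }
    Ok(())
}
```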

Closes apache#8460 from alamb/alamb/ARROW-10236-casting-rules-2

Authored-by: alamb <[email protected]>
Signed-off-by: Neville Dipale <[email protected]>
Unlike other fixed width vectors, DecimalVectors have some APIs that directly manipulate an ArrowBuf (e.g. `void set(int index, int isSet, int start, ArrowBuf buffer)`).

After supporting 64-bit ArrowBufs, we need to adjust such APIs so that they work properly.

Closes apache#8455 from liyafan82/fly_1012_dec

Authored-by: liyafan82 <[email protected]>
Signed-off-by: Micah Kornfield <[email protected]>
…tor instead

Issue link: https://issues.apache.org/jira/browse/ARROW-9475.

Closes apache#7768 from zhztheplayer/ARROW-9475

Authored-by: Hongze Zhang <[email protected]>
Signed-off-by: Micah Kornfield <[email protected]>
This improves CSV string conversion performance by about 30%.

Closes apache#8470 from pitrou/ARROW-10313-faster-utf8-validate

Authored-by: Antoine Pitrou <[email protected]>
Signed-off-by: Antoine Pitrou <[email protected]>
…functions

Moves the cast int/float functions to `gdv_function_stubs`, outside of the precompiled module.

Closes apache#8096 from projjal/castint and squashes the following commits:

85179a5 <Projjal Chanda> moved castInt to gdv_fn_stubs
c09077e <Projjal Chanda> fixed castfloat function
ddc429d <Projjal Chanda> added java test case
f666f54 <Projjal Chanda> fix error handling in castint

Authored-by: Projjal Chanda <[email protected]>
Signed-off-by: Praveen <[email protected]>
- Fix the verification build setups
- Expose `--param` options to crossbow.py submit to override jinja parameters
- Expose the same option to the comment bot, so `crossbow submit -p release=2.0.0 -p rc=2 -g verify-rc` will work next time

Closes apache#8464 from kszucs/release-verification

Authored-by: Krisztián Szűcs <[email protected]>
Signed-off-by: Krisztián Szűcs <[email protected]>
… compiler

Also build the SIMD files according to `ARROW_RUNTIME_SIMD_LEVEL`.

Signed-off-by: Frank Du <[email protected]>

Closes apache#8478 from jianxind/avx512_runtime_level_build

Authored-by: Frank Du <[email protected]>
Signed-off-by: Neal Richardson <[email protected]>
**Note**: I started making changes to apache#6785, and ended up deviating a lot, so I opted for making a new draft PR in case my approach is not suitable.
___

This is a draft to implement an arrow writer for parquet. It supports the following (no complete test coverage yet):

* writing primitives except for booleans and binary
* nested structs
* null values (via definition levels)

It does not yet support:

- Boolean arrays (have to be handled differently from numeric values)
- Binary arrays
- Dictionary arrays
- Union arrays (are they even possible?)

So far I have only added a test that creates a nested schema, which I verified with pyarrow.

```jupyter
# schema of test_complex.parquet

a: int32 not null
b: int32
c: struct<d: double, e: struct<f: float>> not null
  child 0, d: double
  child 1, e: struct<f: float>
      child 0, f: float
```

This PR potentially addresses:

* https://issues.apache.org/jira/browse/ARROW-8289
* https://issues.apache.org/jira/browse/ARROW-8423
* https://issues.apache.org/jira/browse/ARROW-8424
* https://issues.apache.org/jira/browse/ARROW-8425

And I would like to propose either opening new JIRAs for the above incomplete items, or renaming the last 3 above.

___

**Help Needed**

I'm implementing the definition and repetition levels from first principles, based on an old Parquet post on the Twitter engineering blog. It's likely that I'm not getting some concepts right, so I would appreciate help with the following (see the sketch after this list):

* Checking if my logic is correct
* Guidance or suggestions on how to more efficiently extract levels from arrays
* Adding tests: I suspect we'll need a lot of them. So far we only test writing one batch, so I don't know how paging would work when writing a large enough file
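
For a flat nullable column, the definition levels reduce to the null mask, with a maximum definition level of 1. A minimal Rust sketch of my reading of the Dremel model (not this branch's code):

```rust
/// For a non-nested, nullable column the max definition level is 1:
/// a present value gets level 1, a null gets level 0.
fn definition_levels(values: &[Option<i32>]) -> Vec<i16> {
    values.iter().map(|v| if v.is_some() { 1 } else { 0 }).collect()
}

fn main() {
    assert_eq!(definition_levels(&[Some(1), None, Some(3)]), vec![1, 0, 1]);
}
```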

I also don't know if the various encodings (dictionary, RLE, etc.) and compression settings are applied automagically, or if that's something we need to enable explicitly.

CC @sunchao @sadikovi @andygrove @paddyhoran

Might be of interest to @mcassels @maxburke

Closes apache#7319 from nevi-me/arrow-parquet-writer

Lead-authored-by: Neville Dipale <[email protected]>
Co-authored-by: Max Burke <[email protected]>
Co-authored-by: Andy Grove <[email protected]>
Signed-off-by: Neville Dipale <[email protected]>
This will allow preserving Arrow-specific metadata when writing or reading Parquet files created from C++ or Rust.
If the schema can't be deserialised, the normal Parquet-to-Arrow schema conversion is performed.
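
A hedged sketch of the convention (the helper name is hypothetical; the IPC-serialized schema is base64-encoded under the well-known `ARROW:schema` key in the footer's key-value metadata, matching C++):

```rust
use std::collections::HashMap;

// Hypothetical helper; the real logic lives in the parquet crate's arrow
// module. Readers that find this key can recover the exact Arrow types
// instead of re-deriving them from the Parquet schema.
fn embed_arrow_schema(key_value_metadata: &mut HashMap<String, String>, ipc_schema: &[u8]) {
    // `base64::encode` is from the `base64` crate.
    key_value_metadata.insert("ARROW:schema".to_string(), base64::encode(ipc_schema));
}
```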

Closes apache#7917 from nevi-me/ARROW-8243

Authored-by: Neville Dipale <[email protected]>
Signed-off-by: Neville Dipale <[email protected]>
…arrow_schema with ipc changes

Note that this PR is deliberately filed against the rust-parquet-arrow-writer branch, not master!!

Hi! 👋 I'm looking to help out with the rust-parquet-arrow-writer branch, and I just pulled it down and it wasn't compiling because in 75f804e, `schema_to_bytes` was changed to take `IpcWriteOptions` and to return `EncodedData`. This updates `encode_arrow_schema` to use those changes, which should get this branch compiling and passing tests again.

I'm kind of guessing which JIRA ticket this should be associated with; honestly I think this commit can just be squashed with apache@8f0ed91 next time this branch gets rebased.

Please let me know if I should change anything, I'm happy to!

Closes apache#8274 from carols10cents/update-with-ipc-changes

Authored-by: Carol (Nichols || Goulding) <[email protected]>
Signed-off-by: Neville Dipale <[email protected]>
In this commit, I:

- Extracted a `build_field` function for the code shared between
`schema_to_fb` and `schema_to_fb_offset` that needed to change

- Uncommented the dictionary field in the Arrow schema roundtrip test
and added a dictionary field to the IPC roundtrip test

- For dictionary fields, called `add_dictionary` with the dictionary
field information on the flatbuffer field, building the dictionary as
[the C++ code does][cpp-dictionary], and described it with the same
comment

- When getting the field type for a dictionary field, used the `value_type`
as [the C++ code does][cpp-value-type], and described it with the same
comment

The tests pass because the Parquet -> Arrow conversion for dictionaries
is [already supported][parquet-to-arrow].

[cpp-dictionary]: https://github.com/apache/arrow/blob/477c1021ac013f22389baf9154fb9ad0cf814bec/cpp/src/arrow/ipc/metadata_internal.cc#L426-L440
[cpp-value-type]: https://github.com/apache/arrow/blob/477c1021ac013f22389baf9154fb9ad0cf814bec/cpp/src/arrow/ipc/metadata_internal.cc#L662-L667
[parquet-to-arrow]: https://github.com/apache/arrow/blob/477c1021ac013f22389baf9154fb9ad0cf814bec/rust/arrow/src/ipc/convert.rs#L120-L127

Closes apache#8291 from carols10cents/rust-parquet-arrow-writer

Authored-by: Carol (Nichols || Goulding) <[email protected]>
Signed-off-by: Neville Dipale <[email protected]>
…for all supported Arrow DataTypes

Note that this PR goes to the rust-parquet-arrow-writer branch, not master.

Inspired by tests in cpp/src/parquet/arrow/arrow_reader_writer_test.cc

These perform a round trip Arrow -> Parquet -> Arrow of a single RecordBatch with a single column of values for each of the supported data types, and some of the unsupported ones.

Tests that currently fail are either marked with `#[should_panic]` (if the
reason they fail is because of a panic) or `#[ignore]` (if the reason
they fail is because the values don't match).

I am comparing each RecordBatch column's data directly before and after the round trip; I'm not sure whether this is appropriate, because for some data types the `null_bitmap` doesn't match, and I'm not sure whether it's supposed to.

So I would love advice on that front, and I would love to know if these tests are useful or not!
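
As a sketch of the comparison step (the write/read roundtrip itself is elided, and `Int32Array` stands in for each tested type):

```rust
use arrow::array::{Array, Int32Array};
use arrow::record_batch::RecordBatch;

/// Compare one Int32 column value-by-value, including the null bitmap,
/// before and after the Parquet roundtrip.
fn assert_column_roundtrips(before: &RecordBatch, after: &RecordBatch) {
    let b = before.column(0).as_any().downcast_ref::<Int32Array>().unwrap();
    let a = after.column(0).as_any().downcast_ref::<Int32Array>().unwrap();
    assert_eq!(b.len(), a.len(), "row count mismatch");
    for i in 0..b.len() {
        assert_eq!(b.is_null(i), a.is_null(i), "null bitmap mismatch at row {}", i);
        if !b.is_null(i) {
            assert_eq!(b.value(i), a.value(i), "value mismatch at row {}", i);
        }
    }
}
```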

Closes apache#8330 from carols10cents/roundtrip-tests

Lead-authored-by: Carol (Nichols || Goulding) <[email protected]>
Co-authored-by: Neville Dipale <[email protected]>
Signed-off-by: Andy Grove <[email protected]>
…from Parquet metadata when available

@nevi-me This is one commit on top of apache#8330 that I'm opening to get your feedback on whether this will help with ARROW-10168. I *think* this will bring the Rust implementation more in line with C++, but I'm not certain.

I tried removing the `#[ignore]` attributes from the `LargeArray` and `LargeUtf8` tests, but they're still failing because the schemas don't match yet; it looks like [this code](https://github.com/apache/arrow/blob/b2842ab2eb0d7a7a633049a5591e1eaa254d4446/rust/parquet/src/arrow/array_reader.rs#L595-L638) will need to be changed as well.

That `build_array_reader` function's code looks very similar to the code I've changed here; could the code be shared, or is there a reason they're separate?

Closes apache#8354 from carols10cents/schema-roundtrip

Lead-authored-by: Carol (Nichols || Goulding) <[email protected]>
Co-authored-by: Neville Dipale <[email protected]>
Signed-off-by: Neville Dipale <[email protected]>
Closes apache#8388 from nevi-me/ARROW-10225

Authored-by: Neville Dipale <[email protected]>
Signed-off-by: Neville Dipale <[email protected]>
This allows writing an Arrow NullArray to Parquet.
Support was added a few years ago in Parquet, and the C++ implementation supports writing null arrays.
The array is stored as an int32 column with all values set to null.
To implement this, we introduce a `null -> int32` cast, which creates a null int32 array of the same length.
Semantically, the write is the same as writing an all-null int32 array, but we create a null writer to preserve the data type.
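
A sketch of what the new cast enables (assuming a recent `arrow` crate):

```rust
use std::sync::Arc;
use arrow::array::{Array, ArrayRef, NullArray};
use arrow::compute::cast;
use arrow::datatypes::DataType;

fn main() -> arrow::error::Result<()> {
    // A NullArray has a length but no values; casting it to Int32 yields
    // an Int32 array of the same length with every slot null.
    let nulls: ArrayRef = Arc::new(NullArray::new(3));
    let ints = cast(&nulls, &DataType::Int32)?;
    assert_eq!(ints.len(), 3);
    assert_eq!(ints.null_count(), 3);
    Ok(())
}
```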

Closes apache#8484 from nevi-me/ARROW-10334

Authored-by: Neville Dipale <[email protected]>
Signed-off-by: Neville Dipale <[email protected]>
This is a port of apache#6770 to the parquet-writer branch.

We'll have more of a chance to test this reader, and ensure that we can roundtrip on list types.

Closes apache#8449 from nevi-me/ARROW-7842-cherry

Authored-by: Neville Dipale <[email protected]>
Signed-off-by: Neville Dipale <[email protected]>
This adds more support for:

- When converting Arrow -> Parquet containing an Arrow Dictionary,
materialize the Dictionary values and send them to Parquet to be
dictionary-encoded or not according to the Parquet settings (not
supported: converting an Arrow Dictionary directly to Parquet
DictEncoding; this commit also only supports Int32 index types, and
removes NULLs)
- When converting Parquet -> Arrow, notice that the Arrow schema
metadata in a Parquet file has a Dictionary type and convert the data
to an Arrow dictionary (right now this only supports String dictionaries)
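
A sketch of the materialization step (assuming a recent `arrow` crate; the writer wiring is elided):

```rust
use std::sync::Arc;
use arrow::array::{Array, ArrayRef, DictionaryArray, StringArray};
use arrow::compute::cast;
use arrow::datatypes::{DataType, Int32Type};

fn main() -> arrow::error::Result<()> {
    // Materialize a string dictionary to plain Utf8 before handing the
    // values to the Parquet column writer, which applies its own
    // dictionary encoding according to the writer settings.
    let dict: DictionaryArray<Int32Type> = vec!["a", "b", "a"].into_iter().collect();
    let array: ArrayRef = Arc::new(dict);
    let materialized = cast(&array, &DataType::Utf8)?;
    let strings = materialized.as_any().downcast_ref::<StringArray>().unwrap();
    assert_eq!(strings.value(2), "a");
    Ok(())
}
```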
@nevi-me requested a review from carols10cents October 25, 2020 08:15
```rust
// My assumption is that we can't get to an illegal cast, as we can only
// generate types that are supported, because we'd have gotten them from
// the metadata which was written to the Parquet sink.
let array = arrow::compute::cast(&array, self.get_data_type())?;
```
@nevi-me (Collaborator, Author) commented:
This is the main addition; everything else accommodates being able to do this.
The Arrow type was getting erased by the cast converters, so I removed them and instead preserve an `Option<DataType>` from the schema down to this point.
This will also become important when dealing with nested lists, as there's a breaking change we need to make to Arrow in order to correctly roundtrip lists.

@carols10cents (Member) commented:

Cherry-picked this to apache#8402 :)

@nevi-me deleted the integer32llc_dict branch October 26, 2020 21:23
carols10cents pushed a commit that referenced this pull request Apr 15, 2021
From a deadlocked run...

```
#0  0x00007f8a5d48dccd in __lll_lock_wait () from /lib64/libpthread.so.0
#1  0x00007f8a5d486f05 in pthread_mutex_lock () from /lib64/libpthread.so.0
#2  0x00007f8a566e7e89 in arrow::internal::FnOnce<void ()>::FnImpl<arrow::Future<Aws::Utils::Outcome<Aws::S3::Model::ListObjectsV2Result, Aws::S3::S3Error> >::Callback<arrow::fs::(anonymous namespace)::TreeWalker::ListObjectsV2Handler> >::invoke() () from /arrow/r/check/arrow.Rcheck/arrow/libs/arrow.so
#3  0x00007f8a5650efa0 in arrow::FutureImpl::AddCallback(arrow::internal::FnOnce<void ()>) () from /arrow/r/check/arrow.Rcheck/arrow/libs/arrow.so
#4  0x00007f8a566e67a9 in arrow::fs::(anonymous namespace)::TreeWalker::ListObjectsV2Handler::SpawnListObjectsV2() () from /arrow/r/check/arrow.Rcheck/arrow/libs/arrow.so
#5  0x00007f8a566e723f in arrow::fs::(anonymous namespace)::TreeWalker::WalkChild(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, int) () from /arrow/r/check/arrow.Rcheck/arrow/libs/arrow.so
#6  0x00007f8a566e827d in arrow::internal::FnOnce<void ()>::FnImpl<arrow::Future<Aws::Utils::Outcome<Aws::S3::Model::ListObjectsV2Result, Aws::S3::S3Error> >::Callback<arrow::fs::(anonymous namespace)::TreeWalker::ListObjectsV2Handler> >::invoke() () from /arrow/r/check/arrow.Rcheck/arrow/libs/arrow.so
#7  0x00007f8a5650efa0 in arrow::FutureImpl::AddCallback(arrow::internal::FnOnce<void ()>) () from /arrow/r/check/arrow.Rcheck/arrow/libs/arrow.so
#8  0x00007f8a566e67a9 in arrow::fs::(anonymous namespace)::TreeWalker::ListObjectsV2Handler::SpawnListObjectsV2() () from /arrow/r/check/arrow.Rcheck/arrow/libs/arrow.so
#9  0x00007f8a566e723f in arrow::fs::(anonymous namespace)::TreeWalker::WalkChild(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, int) () from /arrow/r/check/arrow.Rcheck/arrow/libs/arrow.so
#10 0x00007f8a566e74b1 in arrow::fs::(anonymous namespace)::TreeWalker::DoWalk() () from /arrow/r/check/arrow.Rcheck/arrow/libs/arrow.so
```

The callback `ListObjectsV2Handler` is being called recursively, and the mutex is non-reentrant, hence the deadlock.

To fix it, I got rid of the mutex on `TreeWalker` by using `arrow::util::internal::TaskGroup` instead of manually tracking the number/status of in-flight requests.

Closes apache#9842 from westonpace/bugfix/arrow-12040

Lead-authored-by: Weston Pace <[email protected]>
Co-authored-by: Antoine Pitrou <[email protected]>
Signed-off-by: Antoine Pitrou <[email protected]>