Skip to content

Conversation

@westonpace
Copy link
Owner

No description provided.

bkietz and others added 30 commits February 24, 2021 17:12
Closes apache#9562 from bkietz/11767-Scalarhash-may-segfault-f

Authored-by: Benjamin Kietzman <[email protected]>
Signed-off-by: Benjamin Kietzman <[email protected]>
I recently added additional tests to:

https://github.com/apache/arrow/blob/master/dev/archery/tests/test_benchmarks.py

But just noticed that they aren't actually being executed by CI (because they aren't in the expected location):

```
      - name: Archery Unittests
        working-directory: dev/archery
        run: pytest -v archery
```

Here you can see them running now.

<img width="866" alt="archery_tests" src="https://user-images.githubusercontent.com/878798/109078651-ed2bd700-76ba-11eb-910e-1fa9fb6308ec.png">

Closes apache#9564 from dianaclarke/ARROW-11771

Authored-by: Diana Clarke <[email protected]>
Signed-off-by: Sutou Kouhei <[email protected]>
The background and rational for this is described [here](https://github.com/jorgecarleitao/arrow2/tree/proposal); the idea is that this is groundwork to make our buffers typed, so that we can start introducing strong typing in the crate.

This change is backward incompatible:

1. Our allocator is now a generic over type `T: NativeType`, which implies that we can now allocate certain types.
2. The allocator moved from `memory` to a new module `alloc` (inspired after `std::alloc`).

Necessary steps to migrate existing code:

1. `use arrow::memory` -> `use arrow::alloc`
2. `memory::allocate_aligned(...)` -> `alloc::allocate_aligned::<u8>(...)`

Note how `NativeType` contains `to_le_bytes`; we will use this method for IPC, where we need to serialize buffers with a specific endianess. This is ground work to enable multiple endianesses support

Closes apache#9495 from jorgecarleitao/alloc_t

Authored-by: Jorge C. Leitao <[email protected]>
Signed-off-by: Neville Dipale <[email protected]>
Closes apache#9566 from pachamaltese/master

Authored-by: Mauricio Vargas <[email protected]>
Signed-off-by: Neal Richardson <[email protected]>
Polars uses the `arrow::memory` module. With the backwards incompatible change of apache#9495, the API is refactored to `arrow::alloc`.

By making `alloc` public, users can shift to the new changes.

Closes apache#9572 from ritchie46/make_alloc_public

Authored-by: Ritchie Vink <[email protected]>
Signed-off-by: Andrew Lamb <[email protected]>
This patch adds impl AsRef<[u8]> to the append_value/append of StringBuilder and BinaryBuilder.

```rust
pub fn append_value(&mut self, value: impl AsRef<[u8]>) -> Result<()>
```

Non-breaking change that will enable data to be passed as value as well.

Closes apache#9570 from Max-Meldrum/as_ref_patch

Authored-by: Max Meldrum <[email protected]>
Signed-off-by: Andrew Lamb <[email protected]>
First steps:

* Rework `selected_columns` to hold field_refs instead of string column names; add code to back out the string field names where needed (e.g. dataset `Project()`)
* Create an `array_ref` pseudo-function to do the same as `field_ref` for `array_expressions`
* Add a `data` argument to `eval_array_expression` in order to bind `array_ref`s to the actual Arrays before evaluating
* Refactor `filter()` NSE code for reuse in `mutate()`
* Split up dplyr tests because we're going to be adding lots more

Then:

* Basic `mutate()` and `transmute()` (done in apache@578d492)
* Go through the examples in the dplyr::mutate() docs and add tests for all cases. Where possible they're implemented in arrow fully; where we don't support the functions, it falls back to the current behavior of pulling the data into R first.

Followup JIRAs:

* ARROW-11704: Wire up dplyr::mutate() for datasets
* ARROW-16999: Implement dplyr::across() and autosplicing
* ARROW-11700: Internationalize error handling in tidy eval
* ARROW-11701: Implement dplyr::relocate()
* ARROW-11702: Enable ungrouped aggregations in non-Dataset expressions
* ARROW-11658: Handle mutate/rename inside group_by
* ARROW-11705: Support scalar value recycling in RecordBatch/Table$create()
* ARROW-11754: Support dplyr::compute()
* ARROW-11752: Replace usage of testthat::expect_is()
* ARROW-11755: Add tests from dplyr/test-mutate.r
* ARROW-11785: Fallback when filtering Table with if_any() expression fails

Closes apache#9521 from nealrichardson/mutate

Authored-by: Neal Richardson <[email protected]>
Signed-off-by: Neal Richardson <[email protected]>
gRPC 1.34 and 1.36 both change up the API, so we have to detect both of those versions. The CMake config and C++ code was also refactored a bit so that there's less to copy-paste for each gRPC version change.

Closes apache#9569 from lidavidm/arrow-11695-detection

Authored-by: David Li <[email protected]>
Signed-off-by: Uwe L. Korn <[email protected]>
This offers possibly performance naive CSV writer with
limited options to keep the initial PR down.

Obvious potential improvements to this approach
are:

- Smarter casts for dictionaries
- Arena allocation for intermediate cast results

The implementation also means that for all primitive type
support we might have to fill in gaps in our cast function.

Closes apache#9504 from emkornfield/csv

Lead-authored-by: Micah Kornfield <[email protected]>
Co-authored-by: emkornfield <[email protected]>
Co-authored-by: Antoine Pitrou <[email protected]>
Co-authored-by: Micah Kornfield <[email protected]>
Signed-off-by: Antoine Pitrou <[email protected]>
The `io::IOContext` class allows passing various settings such as the MemoryPool used for allocation and the Executor for async methods.

Closes apache#9474 from pitrou/ARROW-10420-io-context

Authored-by: Antoine Pitrou <[email protected]>
Signed-off-by: Benjamin Kietzman <[email protected]>
Closes apache#9578 from lidavidm/arrow-11786

Authored-by: David Li <[email protected]>
Signed-off-by: Sutou Kouhei <[email protected]>
…tream interface

Unit tests now cover the bug to avoid regressions.

Closes apache#9574 from edrevo/fix-coalescebatchesstream

Authored-by: Ximo Guanter <[email protected]>
Signed-off-by: Andrew Lamb <[email protected]>
Arrow IPC files are safe to load concurrently. The implementation of
`ipc.FileReader.Record(i)` is not safe due to stashing the current
record internally. This adds a backward-compatible function `RecordAt`
that behaves like ReadAt.

Closes apache#9584 from fsaintjacques/go-concurrent-file-reader

Authored-by: François Saint-Jacques <[email protected]>
Signed-off-by: Neal Richardson <[email protected]>
Also enable nth_to_indices on decimal and fixed size binary data.

Closes apache#9577 from pitrou/ARROW-11662-sort-decimal

Authored-by: Antoine Pitrou <[email protected]>
Signed-off-by: Benjamin Kietzman <[email protected]>
This PR fixes a bug on which `GenericListArray` is not validating the datatype passed on to `from(ArrayData)`, causing all types of bugs, such as undefined behavior in interpreting the offset buffer.

This PR adds this validation, panicking if the DataType does not match.

This PR also fixes casting from and to Lists, which was creating an `ArrayData` out of spec.

Closes apache#9425 from jorgecarleitao/check_list

Authored-by: Jorge C. Leitao <[email protected]>
Signed-off-by: Andrew Lamb <[email protected]>
# Rationale:
This is part of a larger effort, described on [ARROW-11689](https://issues.apache.org/jira/browse/ARROW-11689). for making improvements to the DataFusion query optimizer easier to write and making it more efficient,.

The idea is that by splitting out the expr traversal code from the code that does the actual rewriting, we will:
1. Reduce the amount repetitions in optimizer rules to make them easier to implement
2. Make the actual logic clearer as it is not intertwined in code needed to do recursion
2. Set the stage for a general `PlanRewriter` that doesn't have  to clone its input, and  can modify take their input by value and consume them.

# Changes
This PR introduce a `ExpressionRewriter`, the mutable counterpart to `ExpressionVisitor` and demonstrates its usefulness by using it in the constant folding algorithm.

Note this also reduces a bunch of copies in the constant folding algorithm.

Closes apache#9545 from alamb/alamb/expr_rewriter

Authored-by: Andrew Lamb <[email protected]>
Signed-off-by: Andrew Lamb <[email protected]>
After apache#9523 RepartitionExec is blocking to pull all data into memory before starting the stream which crashes on large sets.

Closes apache#9580 from seddonm1/yield-now

Authored-by: Mike Seddon <[email protected]>
Signed-off-by: Andrew Lamb <[email protected]>
This PR implements a first version of hash repartition.
This can used to (further) parallelize hash joins / aggregates or to implement distributed algorithms like `ShuffleHashJoin`(https://github.com/ballista-compute/ballista/issues/595 )

I didn't yet optimize for speed, as I think it makes sense to implement it and look for improvements later.

FYI @andygrove

Closes apache#9548 from Dandandan/hash_repartition

Authored-by: Heres, Daniel <[email protected]>
Signed-off-by: Andrew Lamb <[email protected]>
- ScanOptions is more explicit about scan time schemas: dataset (aka reader) and projected schemas are independent
- RecordBatchProjector is replaced with a projection expression
- Field metadata and nullability is preserved by projection exprs *if* the expression is a plain field reference; it's clear that `field_ref("alpha")` should inherit the attributes of "alpha" in the original dataset whereas `equal(field_ref("alpha"), field_ref("beta"))` doesn't necessarily inherit attributes from either field

No significant changes have been made to the Python/R bindings as the C++ API remains backward compatible with subset-of-columns projection. Adding these featutes will be handled in follow up.
- R: https://issues.apache.org/jira/browse/ARROW-11704
- Python: https://issues.apache.org/jira/browse/ARROW-11750

Deferred to https://issues.apache.org/jira/browse/ARROW-11749 :
- Ensure that stacked projection exprs are simplifiable to a single project expr
- Refactor UnionDataset to support reconciling its children with projections

Closes apache#9532 from bkietz/11174-Make-Expressions-availabl

Authored-by: Benjamin Kietzman <[email protected]>
Signed-off-by: Neal Richardson <[email protected]>
…LogicalPlan ratherthan helpers in util

# Rationale
Goal is to consolidate the expression walking in one place as part of my longer term plan to make plan rewriting avoid so much copying in the DataFusion optimizer (https://issues.apache.org/jira/browse/ARROW-11689). This is setting the stage for `LogicalPlans` to be able to rewrite themselves rather than being rewritten by the optimizer utils

# Changes
move `expressions` and `inputs` into LogicalPlan rather than helpers in util. There are no logic changes.

Closes apache#9568 from alamb/alamb/move_to_expr

Authored-by: Andrew Lamb <[email protected]>
Signed-off-by: Andrew Lamb <[email protected]>
…s newline delimited json streams

## Rationale
Currently the Arrow json writer makes JSON that looks like this (one record per line):
```json
{"foo":1}
{"bar":1}
```

Which is not technically valid JSON, which would look something like this:

```
[
  {"foo":1},
  {"bar":1}
]
```

## New Features
This PR parameterizes the JSON writer so it can write in either format. Note I needed this feature for in IOx, in influxdata/influxdb_iox#870, and I want to propose contributing it back here).

## Other Changes:
1. Added the function `into_inner()` to retrieve the inner writer from the JSON writer, following the model of the Rust standard library (e.g. [BufReader::into_inner](https://doc.rust-lang.org/std/io/struct.BufReader.html#method.into_inner)

2. Per Rust standard pattern, I change the JSON writer so that it doesn't add any Buffering (via `BufReader`) itself, and instead allows the caller the choice of what type of buffering, if any, is needed.

3. Added / cleaned up a bunch of documentation and comments.

## Questions
I went with parameterizing the `Writer` output as a trait rather than runtime dispatch, for performance. This shouldn't have backwards compatible issues Given the writer has not yet been released yet (introduced by @houqp apache#9256)

However would people prefer a single `Writer` that took an `Options` struct or something to determine how it wrote out data?

Closes apache#9575 from alamb/alamb/json_arrays

Authored-by: Andrew Lamb <[email protected]>
Signed-off-by: Andrew Lamb <[email protected]>
Sorry that the PR's are not more clustered, but they occur to me in the wild.

This PR allows casting from LargeUtf8 to numerical and temporal types. It also modifies the already existing string to temporal casts such that it uses the new faster `from_trusted_length` iterator API.

Closes apache#9571 from ritchie46/large_utf8_to_numerical

Lead-authored-by: Ritchie Vink <[email protected]>
Co-authored-by: Neville Dipale <[email protected]>
Signed-off-by: Neville Dipale <[email protected]>
…es with rust

# Background
Thanks to the 🦅 👁️  of @lidavidm apache#9583 (comment) it appears that the Rust flight integration test prints `"Server listening"` before the server is *actually* read to accept connections. Thus creating a race condition: if the tests try and start faster than the rust server can actually be setup then we see integration test failures as shown in https://issues.apache.org/jira/browse/ARROW-11717:

```
Traceback (most recent call last):
  File "/arrow/dev/archery/archery/integration/runner.py", line 308, in _run_flight_test_case
    consumer.flight_request(port, **client_args)
  File "/arrow/dev/archery/archery/integration/tester_cpp.py", line 116, in flight_request
    run_cmd(cmd)
  File "/arrow/dev/archery/archery/integration/util.py", line 148, in run_cmd
    raise RuntimeError(sio.getvalue())
RuntimeError: Command failed: /build/cpp/debug/flight-test-integration-client -host localhost -port=33569 -scenario auth:basic_proto
With output:
--------------
-- Arrow Fatal Error --
Invalid: Expected UNAUTHENTICATED but got Unavailable
```

# Changes
This PR changes the Rust flight scenario harnesses to print the ready message after the server is *actually* ready.

Closes apache#9593 from alamb/alamb/arrow-11717

Authored-by: Andrew Lamb <[email protected]>
Signed-off-by: David Li <[email protected]>
This patch set makes it possible for downstream application to do schema inference with IOs other than local filesystem.

Closes apache#9534 from houqp/qp_csv

Authored-by: Qingping Hou <[email protected]>
Signed-off-by: Neville Dipale <[email protected]>
Edits and fixes for some missing words, punctuation, and wording.

Closes apache#9576 from pierwill/docs-edit

Authored-by: pierwill <[email protected]>
Signed-off-by: Andrew Lamb <[email protected]>
…rged schema

* Added `contains` method for `arrow::datatypes::Schema` and
`arrow::datatypes::Field`
* Relax batch schema validation using `contains` check when creating a
MemTable in datafusion

Closes apache#9537 from houqp/qp_schema

Authored-by: Qingping Hou <[email protected]>
Signed-off-by: Andrew Lamb <[email protected]>
This PR fixes a bug where appending an empty list vector fails with a `NullPointerException`. Instead of attempting to process the delta vector, we just return early with the unmodified target vector, since there is nothing to append to the target vector from the delta vector. In addition, it fixes cases where appending an empty delta vector would've caused the same NPE, or where it'd be an optimization to not attempt processing an empty delta vector. Unit tests are added for all supported `VectorAppender` vector types.

Closes apache#9581 from nbruno/ARROW-11788

Authored-by: Nick Bruno <[email protected]>
Signed-off-by: liyafan82 <[email protected]>
This is a test PR with a minor fix, that has no JIRA issue, that I am using to validate my new fancy script (apache#9598). However, I think it is a reasonable change on its own

Closes apache#9594 from alamb/alamb/test_change

Authored-by: Andrew Lamb <[email protected]>
Signed-off-by: Andrew Lamb <[email protected]>
Allows to compare both allocators in the benchmarks.
Looks like there is no clear faster overall allocator, both are faster on some TCP-H queries.

Closes apache#9601 from Dandandan/mimalloc

Authored-by: Heres, Daniel <[email protected]>
Signed-off-by: Andrew Lamb <[email protected]>
Updates the submodule for changes in ARROW-11666

Closes apache#9587 from nevi-me/ARROW-11798

Authored-by: Neville Dipale <[email protected]>
Signed-off-by: Antoine Pitrou <[email protected]>
ritchie46 and others added 5 commits March 31, 2021 13:46
The `append` functions in the `Builder` structs are often used in "hot" code. This PR tags them with `#[inline]`, making it possible to inline the function calls across crate boundaries.

Closes apache#9860 from ritchie46/inline_builder_appends

Authored-by: Ritchie Vink <[email protected]>
Signed-off-by: Andrew Lamb <[email protected]>
…n was a form of nested parallelism and causing nested deadlocks. This commit brings over some of the work in ARROW-7001 and allows the CSV scan task to be called in an async fashion. In addition, an async path is put in the scanner and dataset write so that all internal uses of ScanTask()->Execute happen in an async-friendly way. External uses of ScanTask()->Execute should already be outside the CPU thread pool and should not cause deadlock
@github-actions
Copy link

github-actions bot commented Apr 1, 2021

Thanks for opening a pull request!

Could you open an issue for this pull request on JIRA?
https://issues.apache.org/jira/browse/ARROW

Then could you also rename pull request title in the following format?

ARROW-${JIRA_ID}: [${COMPONENT}] ${SUMMARY}

See also:

@codecov-io
Copy link

Codecov Report

Merging #7 (4ddc0ac) into master (4de992c) will increase coverage by 0.20%.
The diff coverage is n/a.

Impacted file tree graph

@@            Coverage Diff             @@
##           master       #7      +/-   ##
==========================================
+ Coverage   82.40%   82.60%   +0.20%     
==========================================
  Files         244      255      +11     
  Lines       56209    59674    +3465     
==========================================
+ Hits        46318    49295    +2977     
- Misses       9891    10379     +488     
Impacted Files Coverage Δ
rust/arrow/src/array/transform/fixed_binary.rs 0.00% <0.00%> (-84.22%) ⬇️
rust/arrow/src/array/transform/null.rs 0.00% <0.00%> (-66.67%) ⬇️
rust/arrow/src/array/transform/structure.rs 44.44% <0.00%> (-38.89%) ⬇️
rust/arrow/src/array/transform/primitive.rs 71.42% <0.00%> (-28.58%) ⬇️
rust/arrow/src/array/transform/mod.rs 70.23% <0.00%> (-22.96%) ⬇️
rust/arrow/src/datatypes/native.rs 76.59% <0.00%> (-12.88%) ⬇️
rust/arrow/src/array/transform/variable_size.rs 87.87% <0.00%> (-12.13%) ⬇️
rust/arrow/src/array/array_binary.rs 86.05% <0.00%> (-5.70%) ⬇️
rust/parquet/src/schema/parser.rs 86.27% <0.00%> (-3.93%) ⬇️
rust/datafusion/src/physical_plan/group_scalar.rs 61.84% <0.00%> (-3.44%) ⬇️
... and 114 more

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 4de992c...4ddc0ac. Read the comment docs.

@westonpace westonpace closed this Apr 6, 2021
westonpace added a commit that referenced this pull request Apr 7, 2021
From a deadlocked run...

```
#0  0x00007f8a5d48dccd in __lll_lock_wait () from /lib64/libpthread.so.0
#1  0x00007f8a5d486f05 in pthread_mutex_lock () from /lib64/libpthread.so.0
#2  0x00007f8a566e7e89 in arrow::internal::FnOnce<void ()>::FnImpl<arrow::Future<Aws::Utils::Outcome<Aws::S3::Model::ListObjectsV2Result, Aws::S3::S3Error> >::Callback<arrow::fs::(anonymous namespace)::TreeWalker::ListObjectsV2Handler> >::invoke() () from /arrow/r/check/arrow.Rcheck/arrow/libs/arrow.so
#3  0x00007f8a5650efa0 in arrow::FutureImpl::AddCallback(arrow::internal::FnOnce<void ()>) () from /arrow/r/check/arrow.Rcheck/arrow/libs/arrow.so
#4  0x00007f8a566e67a9 in arrow::fs::(anonymous namespace)::TreeWalker::ListObjectsV2Handler::SpawnListObjectsV2() () from /arrow/r/check/arrow.Rcheck/arrow/libs/arrow.so
#5  0x00007f8a566e723f in arrow::fs::(anonymous namespace)::TreeWalker::WalkChild(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, int) () from /arrow/r/check/arrow.Rcheck/arrow/libs/arrow.so
#6  0x00007f8a566e827d in arrow::internal::FnOnce<void ()>::FnImpl<arrow::Future<Aws::Utils::Outcome<Aws::S3::Model::ListObjectsV2Result, Aws::S3::S3Error> >::Callback<arrow::fs::(anonymous namespace)::TreeWalker::ListObjectsV2Handler> >::invoke() () from /arrow/r/check/arrow.Rcheck/arrow/libs/arrow.so
#7  0x00007f8a5650efa0 in arrow::FutureImpl::AddCallback(arrow::internal::FnOnce<void ()>) () from /arrow/r/check/arrow.Rcheck/arrow/libs/arrow.so
#8  0x00007f8a566e67a9 in arrow::fs::(anonymous namespace)::TreeWalker::ListObjectsV2Handler::SpawnListObjectsV2() () from /arrow/r/check/arrow.Rcheck/arrow/libs/arrow.so
#9  0x00007f8a566e723f in arrow::fs::(anonymous namespace)::TreeWalker::WalkChild(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, int) () from /arrow/r/check/arrow.Rcheck/arrow/libs/arrow.so
#10 0x00007f8a566e74b1 in arrow::fs::(anonymous namespace)::TreeWalker::DoWalk() () from /arrow/r/check/arrow.Rcheck/arrow/libs/arrow.so
```

The callback `ListObjectsV2Handler` is being called recursively and the mutex is non-reentrant thus deadlock.

To fix it I got rid of the mutex on `TreeWalker` by using `arrow::util::internal::TaskGroup` instead of manually tracking the #/status of in-flight requests.

Closes apache#9842 from westonpace/bugfix/arrow-12040

Lead-authored-by: Weston Pace <[email protected]>
Co-authored-by: Antoine Pitrou <[email protected]>
Signed-off-by: Antoine Pitrou <[email protected]>
@westonpace westonpace deleted the experiment/find-r-output-files-on-win branch April 14, 2021 20:18
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.