Experiment/find r output files on win #7
Closes apache#9562 from bkietz/11767-Scalarhash-may-segfault-f Authored-by: Benjamin Kietzman <[email protected]> Signed-off-by: Benjamin Kietzman <[email protected]>
I recently added additional tests to https://github.com/apache/arrow/blob/master/dev/archery/tests/test_benchmarks.py, but I just noticed that they aren't actually being executed by CI (because they aren't in the expected location): ``` - name: Archery Unittests working-directory: dev/archery run: pytest -v archery ``` Here you can see them running now. <img width="866" alt="archery_tests" src="https://user-images.githubusercontent.com/878798/109078651-ed2bd700-76ba-11eb-910e-1fa9fb6308ec.png"> Closes apache#9564 from dianaclarke/ARROW-11771 Authored-by: Diana Clarke <[email protected]> Signed-off-by: Sutou Kouhei <[email protected]>
The background and rationale for this is described [here](https://github.com/jorgecarleitao/arrow2/tree/proposal); the idea is that this is groundwork to make our buffers typed, so that we can start introducing strong typing in the crate. This change is backward incompatible:
1. Our allocator is now generic over a type `T: NativeType`, which means we can now allocate specific types.
2. The allocator moved from `memory` to a new module `alloc` (inspired by `std::alloc`).

Necessary steps to migrate existing code:
1. `use arrow::memory` -> `use arrow::alloc`
2. `memory::allocate_aligned(...)` -> `alloc::allocate_aligned::<u8>(...)`

Note how `NativeType` contains `to_le_bytes`; we will use this method for IPC, where we need to serialize buffers with a specific endianness. This is groundwork to enable support for multiple endiannesses. Closes apache#9495 from jorgecarleitao/alloc_t Authored-by: Jorge C. Leitao <[email protected]> Signed-off-by: Neville Dipale <[email protected]>
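As a rough illustration of why `to_le_bytes` matters for IPC, here is a minimal, self-contained sketch (the trait and function names are invented stand-ins, not the arrow crate's API): serializing a typed buffer through little-endian byte conversion produces the same bytes on any host, regardless of native endianness.

```rust
// Hypothetical stand-in for NativeType (not the arrow crate's trait):
// the to_le_bytes capability lets a typed buffer be serialized with a
// fixed endianness independent of the host machine.
trait NativeType: Copy {
    fn to_le_bytes_vec(self) -> Vec<u8>;
}

impl NativeType for u32 {
    fn to_le_bytes_vec(self) -> Vec<u8> {
        self.to_le_bytes().to_vec()
    }
}

// Serialize a typed buffer into little-endian bytes, as IPC would require.
fn serialize_le<T: NativeType>(values: &[T]) -> Vec<u8> {
    values.iter().flat_map(|v| v.to_le_bytes_vec()).collect()
}

fn main() {
    // 1u32 is [1, 0, 0, 0] in little-endian on any host.
    assert_eq!(serialize_le(&[1u32, 256]), vec![1, 0, 0, 0, 0, 1, 0, 0]);
    println!("ok");
}
```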
Closes apache#9566 from pachamaltese/master Authored-by: Mauricio Vargas <[email protected]> Signed-off-by: Neal Richardson <[email protected]>
Polars uses the `arrow::memory` module. With the backwards incompatible change of apache#9495, the API is refactored to `arrow::alloc`. By making `alloc` public, users can shift to the new changes. Closes apache#9572 from ritchie46/make_alloc_public Authored-by: Ritchie Vink <[email protected]> Signed-off-by: Andrew Lamb <[email protected]>
This patch adds `impl AsRef<[u8]>` to the append_value/append methods of StringBuilder and BinaryBuilder. ```rust pub fn append_value(&mut self, value: impl AsRef<[u8]>) -> Result<()> ``` This is a non-breaking change that allows anything convertible to a byte slice to be passed as the value. Closes apache#9570 from Max-Meldrum/as_ref_patch Authored-by: Max Meldrum <[email protected]> Signed-off-by: Andrew Lamb <[email protected]>
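To show the ergonomics this buys, here is a self-contained sketch with a simplified builder (not the actual arrow `BinaryBuilder`): the same `append_value` accepts `&str`, byte arrays, and `Vec<u8>` because they all implement `AsRef<[u8]>`.

```rust
// Simplified illustration, not arrow's real builder: a flat data buffer
// plus end offsets, with a generic append over AsRef<[u8]>.
struct BinaryBuilder {
    data: Vec<u8>,
    offsets: Vec<usize>,
}

impl BinaryBuilder {
    fn new() -> Self {
        BinaryBuilder { data: Vec::new(), offsets: vec![0] }
    }

    // Accepts &str, String, &[u8], Vec<u8>, [u8; N], ...
    fn append_value(&mut self, value: impl AsRef<[u8]>) {
        self.data.extend_from_slice(value.as_ref());
        self.offsets.push(self.data.len());
    }
}

fn main() {
    let mut b = BinaryBuilder::new();
    b.append_value("hello");      // &str
    b.append_value(b"world");     // byte array
    b.append_value(vec![1u8, 2]); // Vec<u8>
    assert_eq!(b.offsets, vec![0, 5, 10, 12]);
    println!("ok");
}
```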
First steps:
* Rework `selected_columns` to hold field_refs instead of string column names; add code to back out the string field names where needed (e.g. dataset `Project()`)
* Create an `array_ref` pseudo-function to do the same as `field_ref` for `array_expressions`
* Add a `data` argument to `eval_array_expression` in order to bind `array_ref`s to the actual Arrays before evaluating
* Refactor `filter()` NSE code for reuse in `mutate()`
* Split up dplyr tests because we're going to be adding lots more

Then:
* Basic `mutate()` and `transmute()` (done in apache@578d492)
* Go through the examples in the dplyr::mutate() docs and add tests for all cases. Where possible they're implemented in arrow fully; where we don't support the functions, it falls back to the current behavior of pulling the data into R first.

Followup JIRAs:
* ARROW-11704: Wire up dplyr::mutate() for datasets
* ARROW-16999: Implement dplyr::across() and autosplicing
* ARROW-11700: Internationalize error handling in tidy eval
* ARROW-11701: Implement dplyr::relocate()
* ARROW-11702: Enable ungrouped aggregations in non-Dataset expressions
* ARROW-11658: Handle mutate/rename inside group_by
* ARROW-11705: Support scalar value recycling in RecordBatch/Table$create()
* ARROW-11754: Support dplyr::compute()
* ARROW-11752: Replace usage of testthat::expect_is()
* ARROW-11755: Add tests from dplyr/test-mutate.r
* ARROW-11785: Fallback when filtering Table with if_any() expression fails

Closes apache#9521 from nealrichardson/mutate Authored-by: Neal Richardson <[email protected]> Signed-off-by: Neal Richardson <[email protected]>
gRPC 1.34 and 1.36 both change up the API, so we have to detect both of those versions. The CMake config and C++ code was also refactored a bit so that there's less to copy-paste for each gRPC version change. Closes apache#9569 from lidavidm/arrow-11695-detection Authored-by: David Li <[email protected]> Signed-off-by: Uwe L. Korn <[email protected]>
This offers a possibly performance-naive CSV writer with limited options, to keep the initial PR small. Obvious potential improvements to this approach are:
- Smarter casts for dictionaries
- Arena allocation for intermediate cast results

The implementation also means that for full primitive type support we may have to fill in gaps in our cast function. Closes apache#9504 from emkornfield/csv Lead-authored-by: Micah Kornfield <[email protected]> Co-authored-by: emkornfield <[email protected]> Co-authored-by: Antoine Pitrou <[email protected]> Co-authored-by: Micah Kornfield <[email protected]> Signed-off-by: Antoine Pitrou <[email protected]>
The `io::IOContext` class allows passing various settings such as the MemoryPool used for allocation and the Executor for async methods. Closes apache#9474 from pitrou/ARROW-10420-io-context Authored-by: Antoine Pitrou <[email protected]> Signed-off-by: Benjamin Kietzman <[email protected]>
Closes apache#9578 from lidavidm/arrow-11786 Authored-by: David Li <[email protected]> Signed-off-by: Sutou Kouhei <[email protected]>
…tream interface Unit tests now cover the bug to avoid regressions. Closes apache#9574 from edrevo/fix-coalescebatchesstream Authored-by: Ximo Guanter <[email protected]> Signed-off-by: Andrew Lamb <[email protected]>
Arrow IPC files are safe to read concurrently, but the implementation of `ipc.FileReader.Record(i)` is not, because it stashes the current record internally. This adds a backward-compatible function `RecordAt` that behaves like `ReadAt`. Closes apache#9584 from fsaintjacques/go-concurrent-file-reader Authored-by: François Saint-Jacques <[email protected]> Signed-off-by: Neal Richardson <[email protected]>
Also enable nth_to_indices on decimal and fixed size binary data. Closes apache#9577 from pitrou/ARROW-11662-sort-decimal Authored-by: Antoine Pitrou <[email protected]> Signed-off-by: Benjamin Kietzman <[email protected]>
This PR fixes a bug in which `GenericListArray` did not validate the datatype passed to `from(ArrayData)`, causing all sorts of bugs, such as undefined behavior when interpreting the offset buffer. This PR adds that validation, panicking if the DataType does not match. This PR also fixes casting from and to Lists, which was creating `ArrayData` out of spec. Closes apache#9425 from jorgecarleitao/check_list Authored-by: Jorge C. Leitao <[email protected]> Signed-off-by: Andrew Lamb <[email protected]>
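The shape of the fix can be sketched with toy types (these are hypothetical stand-ins, not arrow's actual `ArrayData`/`GenericListArray`): the conversion panics up front when the declared DataType is not a List, instead of silently misinterpreting the offset buffer later.

```rust
// Toy types for illustration only — not the arrow crate's definitions.
#[derive(Debug, Clone, PartialEq)]
enum DataType {
    Int32,
    List(Box<DataType>),
}

struct ArrayData {
    data_type: DataType,
}

struct GenericListArray {
    data: ArrayData,
}

impl From<ArrayData> for GenericListArray {
    fn from(data: ArrayData) -> Self {
        // Validate early: a non-List DataType here would otherwise lead to
        // undefined behavior when the offset buffer is interpreted.
        match &data.data_type {
            DataType::List(_) => GenericListArray { data },
            other => panic!("GenericListArray expects DataType::List, got {:?}", other),
        }
    }
}

fn main() {
    let arr = GenericListArray::from(ArrayData {
        data_type: DataType::List(Box::new(DataType::Int32)),
    });
    assert!(matches!(arr.data.data_type, DataType::List(_)));
    println!("ok");
}
```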
# Rationale
This is part of a larger effort, described in [ARROW-11689](https://issues.apache.org/jira/browse/ARROW-11689), to make improvements to the DataFusion query optimizer easier to write and to make the optimizer more efficient. The idea is that by splitting the expr traversal code out from the code that does the actual rewriting, we will:
1. Reduce the amount of repetition in optimizer rules, making them easier to implement
2. Make the actual logic clearer, as it is not intertwined with the code needed to do recursion
3. Set the stage for a general `PlanRewriter` that doesn't have to clone its input, and can instead take its input by value and consume it

# Changes
This PR introduces an `ExpressionRewriter`, the mutable counterpart to `ExpressionVisitor`, and demonstrates its usefulness by using it in the constant folding algorithm. Note this also removes a bunch of copies in the constant folding algorithm. Closes apache#9545 from alamb/alamb/expr_rewriter Authored-by: Andrew Lamb <[email protected]> Signed-off-by: Andrew Lamb <[email protected]>
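A hedged sketch of the rewriter idea, using a toy expression type rather than DataFusion's actual `Expr`/`ExprRewriter`: the generic traversal owns its input and rebuilds it bottom-up, so a rule like constant folding needs no cloning and contains no recursion code of its own.

```rust
// Toy expression type for illustration — not DataFusion's Expr.
#[derive(Debug, PartialEq)]
enum Expr {
    Literal(i64),
    Add(Box<Expr>, Box<Expr>),
}

// Mutable counterpart to a visitor: each node is handed over by value.
trait ExprRewriter {
    fn mutate(&mut self, expr: Expr) -> Expr;
}

// Generic recursion: rewrite children first, then let the rule see the node.
fn rewrite(expr: Expr, rewriter: &mut impl ExprRewriter) -> Expr {
    let expr = match expr {
        Expr::Add(l, r) => Expr::Add(
            Box::new(rewrite(*l, rewriter)),
            Box::new(rewrite(*r, rewriter)),
        ),
        leaf => leaf,
    };
    rewriter.mutate(expr)
}

// A constant-folding rule: pure logic, no traversal code, no clones.
struct ConstantFolder;

impl ExprRewriter for ConstantFolder {
    fn mutate(&mut self, expr: Expr) -> Expr {
        match expr {
            Expr::Add(l, r) => match (*l, *r) {
                (Expr::Literal(a), Expr::Literal(b)) => Expr::Literal(a + b),
                (l, r) => Expr::Add(Box::new(l), Box::new(r)),
            },
            other => other,
        }
    }
}

fn main() {
    // (1 + 2) + 3 folds to 6 in one bottom-up pass.
    let expr = Expr::Add(
        Box::new(Expr::Add(
            Box::new(Expr::Literal(1)),
            Box::new(Expr::Literal(2)),
        )),
        Box::new(Expr::Literal(3)),
    );
    assert_eq!(rewrite(expr, &mut ConstantFolder), Expr::Literal(6));
    println!("ok");
}
```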
After apache#9523, RepartitionExec blocks, pulling all data into memory before starting the stream, which crashes on large datasets. Closes apache#9580 from seddonm1/yield-now Authored-by: Mike Seddon <[email protected]> Signed-off-by: Andrew Lamb <[email protected]>
This PR implements a first version of hash repartitioning. This can be used to (further) parallelize hash joins / aggregates, or to implement distributed algorithms like `ShuffleHashJoin` (https://github.com/ballista-compute/ballista/issues/595). I didn't optimize for speed yet, as I think it makes sense to implement it first and look for improvements later. FYI @andygrove Closes apache#9548 from Dandandan/hash_repartition Authored-by: Heres, Daniel <[email protected]> Signed-off-by: Andrew Lamb <[email protected]>
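A simplified sketch of the core idea (not the DataFusion implementation): rows are routed to output partitions by key hash modulo the partition count, so equal keys always land in the same partition — the property hash joins and aggregates rely on.

```rust
use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};

// Route each key to one of `num_partitions` outputs by hash.
// Equal keys hash identically, so they always end up together.
fn hash_partition(keys: &[u64], num_partitions: usize) -> Vec<Vec<u64>> {
    let mut partitions = vec![Vec::new(); num_partitions];
    for &key in keys {
        let mut hasher = DefaultHasher::new();
        key.hash(&mut hasher);
        let p = (hasher.finish() as usize) % num_partitions;
        partitions[p].push(key);
    }
    partitions
}

fn main() {
    let parts = hash_partition(&[1, 2, 3, 1, 2, 3], 4);

    // No rows are lost...
    let total: usize = parts.iter().map(|p| p.len()).sum();
    assert_eq!(total, 6);

    // ...and every occurrence of a key lands in exactly one partition.
    for key in [1u64, 2, 3] {
        let holders = parts.iter().filter(|p| p.contains(&key)).count();
        assert_eq!(holders, 1);
    }
    println!("ok");
}
```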
- ScanOptions is more explicit about scan time schemas: dataset (aka reader) and projected schemas are independent
- RecordBatchProjector is replaced with a projection expression
- Field metadata and nullability is preserved by projection exprs *if* the expression is a plain field reference; it's clear that `field_ref("alpha")` should inherit the attributes of "alpha" in the original dataset whereas `equal(field_ref("alpha"), field_ref("beta"))` doesn't necessarily inherit attributes from either field
No significant changes have been made to the Python/R bindings, as the C++ API remains backward compatible with subset-of-columns projection. Adding these features will be handled in follow-ups:
- R: https://issues.apache.org/jira/browse/ARROW-11704
- Python: https://issues.apache.org/jira/browse/ARROW-11750
Deferred to https://issues.apache.org/jira/browse/ARROW-11749 :
- Ensure that stacked projection exprs are simplifiable to a single project expr
- Refactor UnionDataset to support reconciling its children with projections
Closes apache#9532 from bkietz/11174-Make-Expressions-availabl
Authored-by: Benjamin Kietzman <[email protected]>
Signed-off-by: Neal Richardson <[email protected]>
…LogicalPlan rather than helpers in util

# Rationale
The goal is to consolidate the expression walking in one place, as part of my longer-term plan to make plan rewriting avoid so much copying in the DataFusion optimizer (https://issues.apache.org/jira/browse/ARROW-11689). This sets the stage for `LogicalPlan`s to be able to rewrite themselves rather than being rewritten by the optimizer utils.

# Changes
Move `expressions` and `inputs` into LogicalPlan rather than keeping them as helpers in util. There are no logic changes.

Closes apache#9568 from alamb/alamb/move_to_expr Authored-by: Andrew Lamb <[email protected]> Signed-off-by: Andrew Lamb <[email protected]>
…s newline delimited json streams
## Rationale
Currently the Arrow json writer makes JSON that looks like this (one record per line):
```json
{"foo":1}
{"bar":1}
```
This is newline-delimited JSON rather than a technically valid JSON document; valid JSON would look something like this:
```json
[
{"foo":1},
{"bar":1}
]
```
## New Features
This PR parameterizes the JSON writer so it can write in either format. Note I needed this feature in IOx (influxdata/influxdb_iox#870), and I want to propose contributing it back here.
## Other Changes:
1. Added the function `into_inner()` to retrieve the inner writer from the JSON writer, following the model of the Rust standard library (e.g. [BufReader::into_inner](https://doc.rust-lang.org/std/io/struct.BufReader.html#method.into_inner)).
2. Per the Rust standard pattern, I changed the JSON writer so that it doesn't add any buffering (via `BufWriter`) itself, and instead lets the caller choose what type of buffering, if any, is needed.
3. Added / cleaned up a bunch of documentation and comments.
## Questions
I went with parameterizing the `Writer` output as a trait, rather than using runtime dispatch, for performance. This shouldn't cause backward-compatibility issues, given the writer has not been released yet (it was introduced by @houqp in apache#9256).
However, would people prefer a single `Writer` that took an `Options` struct or something similar to determine how it wrote out the data?
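The trait-based parameterization could look roughly like the following self-contained sketch (simplified names and a `String` sink instead of `io::Write` — not the exact arrow API): the format is a zero-sized type chosen at compile time, so there is no runtime dispatch on the hot write path.

```rust
use std::marker::PhantomData;

// Hypothetical format trait: hooks for document start/end and the
// separator written before each record.
trait JsonFormat {
    fn start(_out: &mut String) {}
    fn separator(_out: &mut String, _first: bool) {}
    fn end(_out: &mut String) {}
}

// One record per line, no enclosing array.
struct LineDelimited;
impl JsonFormat for LineDelimited {
    fn separator(out: &mut String, first: bool) {
        if !first {
            out.push('\n');
        }
    }
}

// A technically valid JSON array: [rec,rec,...]
struct JsonArray;
impl JsonFormat for JsonArray {
    fn start(out: &mut String) {
        out.push('[');
    }
    fn separator(out: &mut String, first: bool) {
        if !first {
            out.push(',');
        }
    }
    fn end(out: &mut String) {
        out.push(']');
    }
}

struct Writer<F: JsonFormat> {
    out: String,
    count: usize,
    _format: PhantomData<F>,
}

impl<F: JsonFormat> Writer<F> {
    fn new() -> Self {
        let mut out = String::new();
        F::start(&mut out);
        Writer { out, count: 0, _format: PhantomData }
    }

    fn write(&mut self, record: &str) {
        F::separator(&mut self.out, self.count == 0);
        self.out.push_str(record);
        self.count += 1;
    }

    // Mirrors the into_inner() pattern: finish and hand back the sink.
    fn into_inner(mut self) -> String {
        F::end(&mut self.out);
        self.out
    }
}

fn main() {
    let mut w = Writer::<LineDelimited>::new();
    w.write(r#"{"foo":1}"#);
    w.write(r#"{"bar":1}"#);
    assert_eq!(w.into_inner(), "{\"foo\":1}\n{\"bar\":1}");

    let mut w = Writer::<JsonArray>::new();
    w.write(r#"{"foo":1}"#);
    w.write(r#"{"bar":1}"#);
    assert_eq!(w.into_inner(), "[{\"foo\":1},{\"bar\":1}]");
    println!("ok");
}
```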
Closes apache#9575 from alamb/alamb/json_arrays
Authored-by: Andrew Lamb <[email protected]>
Signed-off-by: Andrew Lamb <[email protected]>
Sorry that the PRs are not more clustered, but they occur to me in the wild. This PR allows casting from LargeUtf8 to numerical and temporal types. It also modifies the already existing string-to-temporal casts so that they use the new, faster `from_trusted_length` iterator API. Closes apache#9571 from ritchie46/large_utf8_to_numerical Lead-authored-by: Ritchie Vink <[email protected]> Co-authored-by: Neville Dipale <[email protected]> Signed-off-by: Neville Dipale <[email protected]>
…es with rust

# Background
Thanks to the 🦅 👁️ of @lidavidm apache#9583 (comment), it appears that the Rust flight integration test prints `"Server listening"` before the server is *actually* ready to accept connections, creating a race condition: if the tests start faster than the rust server can actually be set up, we see integration test failures as shown in https://issues.apache.org/jira/browse/ARROW-11717:
```
Traceback (most recent call last):
  File "/arrow/dev/archery/archery/integration/runner.py", line 308, in _run_flight_test_case
    consumer.flight_request(port, **client_args)
  File "/arrow/dev/archery/archery/integration/tester_cpp.py", line 116, in flight_request
    run_cmd(cmd)
  File "/arrow/dev/archery/archery/integration/util.py", line 148, in run_cmd
    raise RuntimeError(sio.getvalue())
RuntimeError: Command failed: /build/cpp/debug/flight-test-integration-client -host localhost -port=33569 -scenario auth:basic_proto
With output:
--------------
-- Arrow Fatal Error --
Invalid: Expected UNAUTHENTICATED but got Unavailable
```

# Changes
This PR changes the Rust flight scenario harnesses to print the ready message after the server is *actually* ready. Closes apache#9593 from alamb/alamb/arrow-11717 Authored-by: Andrew Lamb <[email protected]> Signed-off-by: David Li <[email protected]>
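The essence of the fix can be shown with plain `std::net` (the real server uses gRPC, so this is only an analogy): bind the listener *before* printing the ready message, so that a client reacting to the message can always connect.

```rust
use std::io::Write;
use std::net::{TcpListener, TcpStream};

fn main() -> std::io::Result<()> {
    // Binding completes here; the OS is already queueing connections
    // even before accept() is called.
    let listener = TcpListener::bind("127.0.0.1:0")?;
    let addr = listener.local_addr()?;

    // Only now is it safe to announce readiness — a client that starts
    // after this line cannot race the server setup.
    println!("Server listening on {}", addr);

    let _client = TcpStream::connect(addr)?;
    let (mut conn, _) = listener.accept()?;
    conn.write_all(b"ok")?;
    Ok(())
}
```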
This patch set makes it possible for downstream applications to do schema inference with IOs other than the local filesystem. Closes apache#9534 from houqp/qp_csv Authored-by: Qingping Hou <[email protected]> Signed-off-by: Neville Dipale <[email protected]>
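The general pattern looks like this hypothetical sketch (the function and its logic are invented for illustration, not the actual arrow signatures): by accepting any `R: Read` instead of a file path, inference can run against in-memory buffers, object stores, or sockets.

```rust
use std::io::{BufRead, BufReader, Read};

// Toy "inference": count the columns in the CSV header line.
// The point is the signature — any Read source works, not just files.
fn infer_column_count<R: Read>(reader: R) -> std::io::Result<usize> {
    let mut first_line = String::new();
    BufReader::new(reader).read_line(&mut first_line)?;
    Ok(first_line.trim_end().split(',').count())
}

fn main() -> std::io::Result<()> {
    // Works on an in-memory CSV without touching the local filesystem.
    let csv = "a,b,c\n1,2,3\n";
    assert_eq!(infer_column_count(csv.as_bytes())?, 3);
    println!("ok");
    Ok(())
}
```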
Edits and fixes for some missing words, punctuation, and wording. Closes apache#9576 from pierwill/docs-edit Authored-by: pierwill <[email protected]> Signed-off-by: Andrew Lamb <[email protected]>
…rged schema
* Added a `contains` method for `arrow::datatypes::Schema` and `arrow::datatypes::Field`
* Relaxed batch schema validation using the `contains` check when creating a MemTable in datafusion

Closes apache#9537 from houqp/qp_schema Authored-by: Qingping Hou <[email protected]> Signed-off-by: Andrew Lamb <[email protected]>
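A toy sketch of such a `contains` check (simplified types, not arrow's actual `Schema`/`Field`): a schema contains another if every field of the other appears in it, which lets a batch with a subset of the merged schema's fields pass validation.

```rust
// Simplified stand-ins for illustration — not the arrow datatypes.
#[derive(Clone, PartialEq, Debug)]
struct Field {
    name: String,
    data_type: String,
}

struct Schema {
    fields: Vec<Field>,
}

impl Schema {
    // `self` contains `other` if every field of `other` appears in `self`.
    fn contains(&self, other: &Schema) -> bool {
        other.fields.iter().all(|f| self.fields.contains(f))
    }
}

fn main() {
    let field = |n: &str, t: &str| Field { name: n.into(), data_type: t.into() };
    let merged = Schema { fields: vec![field("a", "Int64"), field("b", "Utf8")] };
    let batch = Schema { fields: vec![field("a", "Int64")] };

    assert!(merged.contains(&batch));  // batch validates against merged schema
    assert!(!batch.contains(&merged)); // but not the other way around
    println!("ok");
}
```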
This PR fixes a bug where appending an empty list vector fails with a `NullPointerException`. Instead of attempting to process the delta vector, we just return early with the unmodified target vector, since there is nothing to append to the target vector from the delta vector. In addition, it fixes cases where appending an empty delta vector would've caused the same NPE, or where it'd be an optimization to not attempt processing an empty delta vector. Unit tests are added for all supported `VectorAppender` vector types. Closes apache#9581 from nbruno/ARROW-11788 Authored-by: Nick Bruno <[email protected]> Signed-off-by: liyafan82 <[email protected]>
This is a test PR with a minor fix that has no JIRA issue, which I am using to validate my fancy new script (apache#9598). However, I think it is a reasonable change on its own. Closes apache#9594 from alamb/alamb/test_change Authored-by: Andrew Lamb <[email protected]> Signed-off-by: Andrew Lamb <[email protected]>
This allows comparing both allocators in the benchmarks. It looks like neither allocator is clearly faster overall; each is faster on some TPC-H queries. Closes apache#9601 from Dandandan/mimalloc Authored-by: Heres, Daniel <[email protected]> Signed-off-by: Andrew Lamb <[email protected]>
Updates the submodule for changes in ARROW-11666 Closes apache#9587 from nevi-me/ARROW-11798 Authored-by: Neville Dipale <[email protected]> Signed-off-by: Antoine Pitrou <[email protected]>
The `append` functions in the `Builder` structs are often used in "hot" code. This PR tags them with `#[inline]`, making it possible to inline the function calls across crate boundaries. Closes apache#9860 from ritchie46/inline_builder_appends Authored-by: Ritchie Vink <[email protected]> Signed-off-by: Andrew Lamb <[email protected]>
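To illustrate the shape of the change (with a hypothetical builder, not arrow's): the `#[inline]` attribute makes the function body available to downstream crates, so hot-loop calls can be inlined across the crate boundary even without LTO.

```rust
// Hypothetical builder for illustration — not arrow's Int64Builder.
pub struct Int64Builder {
    values: Vec<i64>,
}

impl Int64Builder {
    pub fn new() -> Self {
        Int64Builder { values: Vec::new() }
    }

    // #[inline] lets callers in *other* crates inline this call,
    // which matters when append sits inside a hot loop.
    #[inline]
    pub fn append(&mut self, v: i64) {
        self.values.push(v);
    }
}

fn main() {
    let mut b = Int64Builder::new();
    for i in 0..1000 {
        b.append(i); // hot loop: inlining avoids a function call per value
    }
    assert_eq!(b.values.len(), 1000);
    println!("ok");
}
```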
…n was a form of nested parallelism and was causing nested deadlocks. This commit brings over some of the work in ARROW-7001 and allows the CSV scan task to be called in an async fashion. In addition, an async path is added to the scanner and dataset write so that all internal uses of ScanTask()->Execute happen in an async-friendly way. External uses of ScanTask()->Execute should already be outside the CPU thread pool and should not cause deadlock.
Codecov Report

```
@@            Coverage Diff            @@
##           master       #7      +/-  ##
=========================================
+ Coverage   82.40%   82.60%   +0.20%
=========================================
  Files         244      255      +11
  Lines       56209    59674    +3465
=========================================
+ Hits        46318    49295    +2977
- Misses       9891    10379     +488
```
From a deadlocked run...
```
#0  0x00007f8a5d48dccd in __lll_lock_wait () from /lib64/libpthread.so.0
#1  0x00007f8a5d486f05 in pthread_mutex_lock () from /lib64/libpthread.so.0
#2  0x00007f8a566e7e89 in arrow::internal::FnOnce<void ()>::FnImpl<arrow::Future<Aws::Utils::Outcome<Aws::S3::Model::ListObjectsV2Result, Aws::S3::S3Error> >::Callback<arrow::fs::(anonymous namespace)::TreeWalker::ListObjectsV2Handler> >::invoke() () from /arrow/r/check/arrow.Rcheck/arrow/libs/arrow.so
#3  0x00007f8a5650efa0 in arrow::FutureImpl::AddCallback(arrow::internal::FnOnce<void ()>) () from /arrow/r/check/arrow.Rcheck/arrow/libs/arrow.so
#4  0x00007f8a566e67a9 in arrow::fs::(anonymous namespace)::TreeWalker::ListObjectsV2Handler::SpawnListObjectsV2() () from /arrow/r/check/arrow.Rcheck/arrow/libs/arrow.so
#5  0x00007f8a566e723f in arrow::fs::(anonymous namespace)::TreeWalker::WalkChild(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, int) () from /arrow/r/check/arrow.Rcheck/arrow/libs/arrow.so
#6  0x00007f8a566e827d in arrow::internal::FnOnce<void ()>::FnImpl<arrow::Future<Aws::Utils::Outcome<Aws::S3::Model::ListObjectsV2Result, Aws::S3::S3Error> >::Callback<arrow::fs::(anonymous namespace)::TreeWalker::ListObjectsV2Handler> >::invoke() () from /arrow/r/check/arrow.Rcheck/arrow/libs/arrow.so
#7  0x00007f8a5650efa0 in arrow::FutureImpl::AddCallback(arrow::internal::FnOnce<void ()>) () from /arrow/r/check/arrow.Rcheck/arrow/libs/arrow.so
#8  0x00007f8a566e67a9 in arrow::fs::(anonymous namespace)::TreeWalker::ListObjectsV2Handler::SpawnListObjectsV2() () from /arrow/r/check/arrow.Rcheck/arrow/libs/arrow.so
#9  0x00007f8a566e723f in arrow::fs::(anonymous namespace)::TreeWalker::WalkChild(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, int) () from /arrow/r/check/arrow.Rcheck/arrow/libs/arrow.so
#10 0x00007f8a566e74b1 in arrow::fs::(anonymous namespace)::TreeWalker::DoWalk() () from /arrow/r/check/arrow.Rcheck/arrow/libs/arrow.so
```
The callback `ListObjectsV2Handler` is being called recursively, and the mutex is non-reentrant, hence the deadlock. To fix it I got rid of the mutex on `TreeWalker` by using `arrow::util::internal::TaskGroup` instead of manually tracking the number/status of in-flight requests. Closes apache#9842 from westonpace/bugfix/arrow-12040 Lead-authored-by: Weston Pace <[email protected]> Co-authored-by: Antoine Pitrou <[email protected]> Signed-off-by: Antoine Pitrou <[email protected]>