Skip to content

Conversation

@westonpace
Copy link
Owner

No description provided.

pitrou and others added 30 commits October 6, 2020 22:17
…g array to Pandas

Fix a crash on conversion of e.g. struct of dictionaries.

Closes #8361 from pitrou/ARROW-10192-struct-of-dict

Lead-authored-by: Antoine Pitrou <[email protected]>
Co-authored-by: Krisztián Szűcs <[email protected]>
Signed-off-by: Krisztián Szűcs <[email protected]>
Closes #8369 from bkietz/10176-Nightly-valgrind-job-fail

Authored-by: Benjamin Kietzman <[email protected]>
Signed-off-by: Antoine Pitrou <[email protected]>
This is a follow-up of #7938. #7938 forgot downloading a required dll `libprotobuf.so.18` at the build time.

Closes #8368 from kiszk/ARROW-10200

Authored-by: Kazuaki Ishizaki <[email protected]>
Signed-off-by: Sutou Kouhei <[email protected]>
Because it often causes "No output has been received in the last
10m0s, this potentially indicates a stalled build or something wrong
with the build itself." on Travis CI.

Closes #8372 from kou/cpp-ci-disable-s3-on-arm64

Authored-by: Sutou Kouhei <[email protected]>
Signed-off-by: Sutou Kouhei <[email protected]>
Use a lookup table to emulate PEXT 5 bits at a time.
Remove the slow scalar path.

Closes #8320 from pitrou/ARROW-10058-faster-sw-pext

Authored-by: Antoine Pitrou <[email protected]>
Signed-off-by: Micah Kornfield <[email protected]>
Because repo.msys2.org is still down.

Closes #8373 from kou/ci-msys2-use-mirror

Authored-by: Sutou Kouhei <[email protected]>
Signed-off-by: Sutou Kouhei <[email protected]>
Spark master branch has been updated for Arrow 1.0.1, the patch to fix Java compilation can be removed.

Closes #8352 from BryanCutler/ci-spark-integration-ARROW-10178

Authored-by: Bryan Cutler <[email protected]>
Signed-off-by: Joris Van den Bossche <[email protected]>
Closes #8343 from bkietz/9147-Support-null-other-type-p

Authored-by: Benjamin Kietzman <[email protected]>
Signed-off-by: Antoine Pitrou <[email protected]>
This change adds conversion for a `pyarrow.MapArray` to Pandas as a column of lists of tuples, where each tuple is a key/item pair. Unit tests were added for python to verify conversion for Pandas round-trip, chunked arrays and `MapArray` with NULL map and NULL items.

Closes #8337 from BryanCutler/map-type-to-pandas

Lead-authored-by: Bryan Cutler <[email protected]>
Co-authored-by: Antoine Pitrou <[email protected]>
Signed-off-by: Antoine Pitrou <[email protected]>
…Parquet

This allows roundtripping complex types such as `list<dictionary<utf8>>`, `list<extension>`, etc.

Closes #8366 from pitrou/ARROW-9943-parquet-nested-metadata

Lead-authored-by: Antoine Pitrou <[email protected]>
Co-authored-by: Benjamin Kietzman <[email protected]>
Signed-off-by: Antoine Pitrou <[email protected]>
- Add option to Java FlightClient.Builder to turn off server verification
(verification is on by default).

Closes #8377 from jduo/ARROW-10105-Java

Authored-by: James Duong <[email protected]>
Signed-off-by: David Li <[email protected]>
New types supported:
- Fixed Size list (will throw an incorrect error if nulls).   Null slot count is best implemented after we recursively apply metadata.
- LargeList
- Maps

Still missing: LargeString and LargeBytes

Unimplemented functionality:
  - Removing duplicate maps on read.
  - Single column maps are converted to List of struct.

Other:
 Fixed two bugs for FixedSizeLists:
1.  Def levels genereated were incorrect (they only made sense for
    two level lists).
2.  Slices where not handled appropriately

Closes #8376 from emkornfield/read_most_types

Lead-authored-by: Micah Kornfield <[email protected]>
Co-authored-by: Antoine Pitrou <[email protected]>
Signed-off-by: Antoine Pitrou <[email protected]>
Closes #8341 from romainfrancois/ARROW-10093/disable_int64_autoconv

Lead-authored-by: Romain Francois <[email protected]>
Co-authored-by: Romain François <[email protected]>
Signed-off-by: Neal Richardson <[email protected]>
Closes #8379 from pitrou/ARROW-10214-repr-undecodable-metadata

Authored-by: Antoine Pitrou <[email protected]>
Signed-off-by: Antoine Pitrou <[email protected]>
* Only run the pull-request-labeler if the relevant paths are modified and only on opening/reopening a PR
* Cut 2 R builds from the matrix

Closes #8380 from nealrichardson/fewer-ci-jobs

Authored-by: Neal Richardson <[email protected]>
Signed-off-by: Neal Richardson <[email protected]>
…mitive types

This is the associated draft implementation of a draft proposal on ARROW-10030, with associated document here: https://docs.google.com/document/d/1d6rV1WmvIH6uW-bcHKrYBSyPddrpXH8Q4CtVfFHtI04/edit

<details><summary>Benchmarks where the builder was replaced by this from for the `cast`: +5% to -25% improvement over the builder (with the same memory utilization)</summary>
<p>

```
cast int32 to int32 512 time:   [25.493 ns 25.498 ns 25.505 ns]
                        change: [+1.2840% +1.7037% +2.1989%] (p = 0.00 < 0.05)
                        Performance has regressed.
Found 15 outliers among 100 measurements (15.00%)
  4 (4.00%) high mild
  11 (11.00%) high severe

cast int32 to uint32 512
                        time:   [6.8259 us 6.8323 us 6.8390 us]
                        change: [+3.6216% +5.6472% +7.0263%] (p = 0.00 < 0.05)
                        Performance has regressed.
Found 8 outliers among 100 measurements (8.00%)
  1 (1.00%) high mild
  7 (7.00%) high severe

cast int32 to float32 512
                        time:   [5.7733 us 5.7763 us 5.7795 us]
                        change: [-14.264% -13.671% -13.106%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 10 outliers among 100 measurements (10.00%)
  1 (1.00%) low mild
  2 (2.00%) high mild
  7 (7.00%) high severe

cast int32 to float64 512
                        time:   [5.2627 us 5.2667 us 5.2711 us]
                        change: [-23.365% -22.746% -22.147%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 10 outliers among 100 measurements (10.00%)
  4 (4.00%) high mild
  6 (6.00%) high severe

cast int32 to int64 512 time:   [5.2307 us 5.2381 us 5.2438 us]
                        change: [-21.490% -20.984% -20.509%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 3 outliers among 100 measurements (3.00%)
  3 (3.00%) high severe

cast float32 to int32 512
                        time:   [6.2755 us 6.2794 us 6.2833 us]
                        change: [-10.606% -9.9724% -9.3770%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 9 outliers among 100 measurements (9.00%)
  2 (2.00%) high mild
  7 (7.00%) high severe

cast float64 to float32 512
                        time:   [6.1900 us 6.2032 us 6.2206 us]
                        change: [-14.505% -13.963% -13.416%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 7 outliers among 100 measurements (7.00%)
  4 (4.00%) high mild
  3 (3.00%) high severe

cast float64 to uint64 512
                        time:   [6.4047 us 6.4189 us 6.4347 us]
                        change: [-12.287% -11.651% -11.050%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 2 outliers among 100 measurements (2.00%)
  1 (1.00%) high mild
  1 (1.00%) high severe

cast int64 to int32 512 time:   [6.3697 us 6.3913 us 6.4168 us]
                        change: [+3.5998% +4.2837% +4.9649%] (p = 0.00 < 0.05)
                        Performance has regressed.
Found 9 outliers among 100 measurements (9.00%)
  4 (4.00%) high mild
  5 (5.00%) high severe

cast date64 to date32 512
                        time:   [8.4563 us 8.4631 us 8.4702 us]
                        change: [+0.2066% +0.9093% +1.6066%] (p = 0.01 < 0.05)
                        Change within noise threshold.
Found 10 outliers among 100 measurements (10.00%)
  3 (3.00%) high mild
  7 (7.00%) high severe

cast date32 to date64 512
                        time:   [8.0482 us 8.0610 us 8.0823 us]
                        change: [+0.2428% +0.8807% +1.5657%] (p = 0.00 < 0.05)
                        Change within noise threshold.
Found 7 outliers among 100 measurements (7.00%)
  1 (1.00%) high mild
  6 (6.00%) high severe

cast time32s to time32ms 512
                        time:   [2.2259 us 2.2294 us 2.2333 us]
                        change: [-2.3114% -1.6027% -0.7039%] (p = 0.00 < 0.05)
                        Change within noise threshold.
Found 5 outliers among 100 measurements (5.00%)
  2 (2.00%) high mild
  3 (3.00%) high severe

cast time32s to time64us 512
                        time:   [7.8948 us 7.9060 us 7.9179 us]
                        change: [-27.022% -25.727% -24.361%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 5 outliers among 100 measurements (5.00%)
  2 (2.00%) high mild
  3 (3.00%) high severe

cast time64ns to time32s 512
                        time:   [11.546 us 11.558 us 11.571 us]
                        change: [-0.9659% -0.3235% +0.2847%] (p = 0.33 > 0.05)
                        No change in performance detected.
Found 19 outliers among 100 measurements (19.00%)
  2 (2.00%) low severe
  8 (8.00%) low mild
  1 (1.00%) high mild
  8 (8.00%) high severe

cast timestamp_ns to timestamp_s 512
                        time:   [27.676 ns 27.690 ns 27.714 ns]
                        change: [+0.9641% +1.2917% +1.6195%] (p = 0.00 < 0.05)
                        Change within noise threshold.
Found 14 outliers among 100 measurements (14.00%)
  4 (4.00%) high mild
  10 (10.00%) high severe

cast timestamp_ms to timestamp_ns 512
                        time:   [2.8195 us 2.8238 us 2.8300 us]
                        change: [-0.4123% +0.7148% +1.6494%] (p = 0.19 > 0.05)
                        No change in performance detected.
Found 7 outliers among 100 measurements (7.00%)
  1 (1.00%) high mild
  6 (6.00%) high severe

cast timestamp_ms to i64 512
                        time:   [398.49 ns 399.60 ns 400.91 ns]
                        change: [-6.2549% -5.5279% -4.7295%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 17 outliers among 100 measurements (17.00%)
  5 (5.00%) high mild
  12 (12.00%) high severe
```

</p>
</details>

Closes #8211 from jorgecarleitao/iterator

Lead-authored-by: Jorge C. Leitao <[email protected]>
Co-authored-by: Jorge Leitao <[email protected]>
Signed-off-by: Jorge C. Leitao <[email protected]>
This PR allows the tests to be compiled on 32-bit ARM (e.g. Raspberry Pi).

Closes #8353 from andygrove/arm32bit

Authored-by: Andy Grove <[email protected]>
Signed-off-by: Jorge C. Leitao <[email protected]>
Closes #8362 from bkietz/10196-Add-FutureDeferNotOk

Authored-by: Benjamin Kietzman <[email protected]>
Signed-off-by: Antoine Pitrou <[email protected]>
Closes #8149 from jorisvandenbossche/ARROW-9645-deprecate-legacy-filesystems

Authored-by: Joris Van den Bossche <[email protected]>
Signed-off-by: Antoine Pitrou <[email protected]>
Closes #8381 from pitrou/ARROW-9964-csv-dates

Authored-by: Antoine Pitrou <[email protected]>
Signed-off-by: Antoine Pitrou <[email protected]>
Fix for https://issues.apache.org/jira/browse/ARROW-6972

Add support for `StructArray`. It'd be nice to have an easier way to construct a `StructArray`, but that can come in a follow up PR.

cc @eerhardt @nealrichardson
Not sure who else works on the C# parts of the repo.

Closes #8348 from pgovind/structtype

Lead-authored-by: Prashanth Govindarajan <[email protected]>
Co-authored-by: Eric Erhardt <[email protected]>
Signed-off-by: Eric Erhardt <[email protected]>
This caused a test failure in `aggregate_grouped_empty` when the `simd` feature is enabled. One could argue that it's an error in the compare kernel to set bits outside of the array len, but only taking the valid range into account solves that problem neatly.

Closes #8378 from jhorstmann/ARROW-10204-test-failure-group-empty

Authored-by: Jörn Horstmann <[email protected]>
Signed-off-by: Neville Dipale <[email protected]>
Depends on #8304

Closes #8315 from kszucs/macos3

Authored-by: Krisztián Szűcs <[email protected]>
Signed-off-by: Krisztián Szűcs <[email protected]>
…umns

Removed cardinality considerations for inferring partition field types. There is now a boolean flag (`inspect_dictionary`, default false) which if set causes fields to be inferred as a dictionary encoded type instead of the corresponding value type.

Closes #8367 from bkietz/10099-Also-allow-integer-partit

Authored-by: Benjamin Kietzman <[email protected]>
Signed-off-by: Benjamin Kietzman <[email protected]>
…lding tests

Closes #8356 from kou/cpp-arrow-testing

Authored-by: Sutou Kouhei <[email protected]>
Signed-off-by: Sutou Kouhei <[email protected]>
Python:
- ParquetFileFormat.write_options has been removed
- Added classes {,Parquet,Ipc}FileWriteOptions
- FileWriteOptions are constructed using FileFormat.make_write_options(...)
- FileWriteOptions are passed as a parameter to _filesystemdataset_write()

R:
- FileWriteOptions$create(...) to make write options; no subclasses exposed in R
- A filter() on the dataset is applied to restrict written rows.

C++:
- FileSystemDataset::Write's parameters have been consolidated into
  - A Scanner, from which the batches to be written are pulled
  - A FileSystemDatasetWriteOptions, which is an options struct specifying
    - destination filesystem
    - base directory
    - partitioning
    - basenames (via a string template, ex "dat_{i}.feather")
    - format specific write options
- Format specific write options are represented using the FileWriteOptions hierarchy. An instance of these can be constructed from a format using FileFormat::DefaultWriteOptions(), after which the instance can be modified.
  - ParquetFileFormat::{writer_properties, arrow_writer_properties} have been moved to ParquetFileWriteOptions, an implementation of FileWriteOptions.

Internal C++:
- Individual files can now be incrementally written using a FileWriter, constructible from a format using FileFormat::MakeWriter
- FileSystemDataset::Write now parallelizes across scan tasks rather than fragments, so there will be no difference in performance for different arrangements of tables/batches/lists of tables and batches when writing from memory
- FileSystemDataset::Write::WriteQueue provides a threadsafe channel for batches awaiting write, allowing threads to produce batches as another thread flushes the queue to disk.

Closes #8305 from bkietz/9782-more-configurable-writing

Lead-authored-by: Benjamin Kietzman <[email protected]>
Co-authored-by: Neal Richardson <[email protected]>
Signed-off-by: Benjamin Kietzman <[email protected]>
Closes #8317 from bkietz/10134-Add-ParquetFileFragmentnu

Authored-by: Benjamin Kietzman <[email protected]>
Signed-off-by: Benjamin Kietzman <[email protected]>
…_size

A chunk_size that is too small will cause metadata bloat in the parquet file, leading to poor read performance. Set the chunk_size to be the same value as the table size so that one file becomes one row_group.

Closes #8391 from kanga333/ruby-use-table-size-for-chunk-size

Authored-by: kanga333 <[email protected]>
Signed-off-by: Sutou Kouhei <[email protected]>
noticed this on rereviewing the merged code.

Closes #8392 from emkornfield/remove_log

Authored-by: Micah Kornfield <[email protected]>
Signed-off-by: Antoine Pitrou <[email protected]>
nealrichardson and others added 21 commits October 20, 2020 14:27
Closes #8495 from nealrichardson/r-post-2.0.0

Authored-by: Neal Richardson <[email protected]>
Signed-off-by: Neal Richardson <[email protected]>
This PR:

* Renamed `ExecutionError` to `DataFusionError`
* Renamed `DataFusionError::ParserError` to `DataFusionError::SQL`
* Renamed `DataFusionError::InternalError` to `DataFusionError::Internal`
* Renamed `DataFusionError::ExecutionError` to `DataFusionError::Execution`
* Adds a new error variant `DataFusionError::Plan` that is used during planning
* Removes `DataFusionError::InvalidColumn` that was not being used.
* Removes `DataFusionError::General`, replacing them by the appropriate errors
* Improves the message of `DataFusionError::Internal` to incentivize users to file a bug report when it happens
* Extended the documentation of every variant

The design behind this PR is that the error variants should correspond to what happened:

* `Internal`: a Datafusion's internal invariant was violated (e.g. downcast failed) => file a bug report
* `Plan`: planning was incorrect
* `NotImplemented`: something is not implemented and we know about it. Ideally, we should have an associated JIRA issue
* `Execution`: an error during execution. We should avoid raising these, but sometimes it is impossible.
* `IoError`: stuff related with reading and writing

I went through every error that we return in `DataFusion` and verified that it is assigned correctly to one of these variants.

I am a bit uncertain about the `ParquetError` and `ArrowError`. IMO `ArrowError` should be mapped to `DataFusionError::Execution`, as it only happens during execution, and `ParquetError` should be mapped to an `IoError`.

I also think that we should split `NotImplemented` in two: `NotImplemented` and `NotSupported` as e.g. `float16` is something that we will likely never support, while "modulus" is just not implemented yet.

Closes #8481 from jorgecarleitao/error

Authored-by: Jorge C. Leitao <[email protected]>
Signed-off-by: Jorge C. Leitao <[email protected]>
…sion python->pyarrow

Closes #8489 from jorisvandenbossche/ARROW-9963

Authored-by: Joris Van den Bossche <[email protected]>
Signed-off-by: Antoine Pitrou <[email protected]>
This library is 2x to 3x faster for parsing strings to binary floating-point numbers.

Closes #8494 from pitrou/ARROW-10328-fast-float

Authored-by: Antoine Pitrou <[email protected]>
Signed-off-by: Antoine Pitrou <[email protected]>
The bug was fixed in 3.18.0:
https://gitlab.kitware.com/cmake/cmake/-/issues/20425

Closes #8500 from pitrou/ARROW-10363-remove-cmake-bug-workaround

Authored-by: Antoine Pitrou <[email protected]>
Signed-off-by: Sutou Kouhei <[email protected]>
Improve variance kernel performance for integers by leveraging
textbook one pass algorithm and integer arithmetic.

Closes #8466 from cyb70289/variance-integer

Authored-by: Yibo Cai <[email protected]>
Signed-off-by: Antoine Pitrou <[email protected]>
New releases of python-semver don't support comparing `VersionInfo` objects with strings.

Closes #8506 from kszucs/ARROW-10369

Authored-by: Krisztián Szűcs <[email protected]>
Signed-off-by: Krisztián Szűcs <[email protected]>
…tBuilder::AppendNull

Closes #7887 from tianchen92/ARROW-9304

Lead-authored-by: tianchen <[email protected]>
Co-authored-by: Antoine Pitrou <[email protected]>
Signed-off-by: Antoine Pitrou <[email protected]>
Nulls were propagated incorrectly.  We can simply let the kernel machinery do this for us.

Closes #8496 from pitrou/ARROW-10208-split-string-sliced

Authored-by: Antoine Pitrou <[email protected]>
Signed-off-by: Antoine Pitrou <[email protected]>
…ernels

Closes #8498 from pitrou/ARROW-10207-kernel-preallocate-offsets

Authored-by: Antoine Pitrou <[email protected]>
Signed-off-by: Benjamin Kietzman <[email protected]>
Closes #8502 from nealrichardson/dup-s3-flag

Authored-by: Neal Richardson <[email protected]>
Signed-off-by: Neal Richardson <[email protected]>
Support nested dictionaries inside list arrays for arrow JSON reader.
This pr makes reading nested JSON fields a little bit easier by making a single reader method for the list arrays.

Closes #8430 from vertexclique/ARROW-10249-support-nested-dictionaries-in-list-arrays

Authored-by: Mahmut Bulut <[email protected]>
Signed-off-by: Jorge C. Leitao <[email protected]>
The package `org.apache.arrow.util` was present in the artifacts arrow-vector and arrow-vector-memory-core. Split packages are a problem for OSGI and the Java 9+ modules (JPMS).

This PR moves the classes `AutoClosables` and `Collections2` to arrow-memory-core because they are generally useful and have no coupling with arrow-vector.
The class `DataSizeRoundingUtil` is tighter coupled with arrow-vector and is therefore moved to the package `org.apache.arrow.vector.util`.

Closes #8483 from HedgehogCode/fix-split-packages

Authored-by: Benjamin Wilhelm <[email protected]>
Signed-off-by: liyafan82 <[email protected]>
This provides sufficient coverage to support round trip between C++ and Java.  There are still some gaps in python.  Based on review, I will open JIRAs to track missing functionality (i.e. parquet support in C++).  Marking as draft until i can triage CI failures but early feedback is welcome.

Open questions I have:

[C++]
* Should we retain logic in decimal() factory function to adjust type on scale/precision or take an explicit argument or keep it as an alias for decimal128?

[Java]
* Naming:  Would Decimal256 be better then BigDecimal?

Closes #8475 from emkornfield/decimal256

Lead-authored-by: Mingyu Zhong <[email protected]>
Co-authored-by: Micah Kornfield <[email protected]>
Co-authored-by: Micah Kornfield <[email protected]>
Co-authored-by: emkornfield <[email protected]>
Co-authored-by: Ezra <[email protected]>
Signed-off-by: Micah Kornfield <[email protected]>
Closes #8458 from xhochy/ARROW-10302

Authored-by: Uwe L. Korn <[email protected]>
Signed-off-by: Uwe L. Korn <[email protected]>
A simplification to the Arrow crate, since it is no longer needed, now that we no longer have specialization.

Closes #8512 from jorgecarleitao/clean_ops

Authored-by: Jorge C. Leitao <[email protected]>
Signed-off-by: Jorge C. Leitao <[email protected]>
This PR fixes typos in files under `docs` directory.

Closes #8519 from kiszk/ARROW-10383

Authored-by: Kazuaki Ishizaki <[email protected]>
Signed-off-by: Sutou Kouhei <[email protected]>
This PR fixes typos in files under `cpp` directory.

Closes #8520 from kiszk/ARROW-10384

Authored-by: Kazuaki Ishizaki <[email protected]>
Signed-off-by: Sutou Kouhei <[email protected]>
This PR fixes typos in files under `rust` directory.

Closes #8518 from kiszk/ARROW-10382

Authored-by: Kazuaki Ishizaki <[email protected]>
Signed-off-by: Sutou Kouhei <[email protected]>
This PR proposes the following changes:

1. Make the CSV reader accept an optional argument to bound its iteration
2. Simplify the `next` code via iterators
3. Add a new struct to perform buffered iterations (useful to any reader)

Closes #8482 from jorgecarleitao/csv_many

Authored-by: Jorge C. Leitao <[email protected]>
Signed-off-by: Neville Dipale <[email protected]>
@westonpace westonpace merged this pull request into westonpace:master Oct 26, 2020
westonpace added a commit that referenced this pull request Apr 7, 2021
From a deadlocked run...

```
#0  0x00007f8a5d48dccd in __lll_lock_wait () from /lib64/libpthread.so.0
#1  0x00007f8a5d486f05 in pthread_mutex_lock () from /lib64/libpthread.so.0
#2  0x00007f8a566e7e89 in arrow::internal::FnOnce<void ()>::FnImpl<arrow::Future<Aws::Utils::Outcome<Aws::S3::Model::ListObjectsV2Result, Aws::S3::S3Error> >::Callback<arrow::fs::(anonymous namespace)::TreeWalker::ListObjectsV2Handler> >::invoke() () from /arrow/r/check/arrow.Rcheck/arrow/libs/arrow.so
#3  0x00007f8a5650efa0 in arrow::FutureImpl::AddCallback(arrow::internal::FnOnce<void ()>) () from /arrow/r/check/arrow.Rcheck/arrow/libs/arrow.so
#4  0x00007f8a566e67a9 in arrow::fs::(anonymous namespace)::TreeWalker::ListObjectsV2Handler::SpawnListObjectsV2() () from /arrow/r/check/arrow.Rcheck/arrow/libs/arrow.so
#5  0x00007f8a566e723f in arrow::fs::(anonymous namespace)::TreeWalker::WalkChild(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, int) () from /arrow/r/check/arrow.Rcheck/arrow/libs/arrow.so
#6  0x00007f8a566e827d in arrow::internal::FnOnce<void ()>::FnImpl<arrow::Future<Aws::Utils::Outcome<Aws::S3::Model::ListObjectsV2Result, Aws::S3::S3Error> >::Callback<arrow::fs::(anonymous namespace)::TreeWalker::ListObjectsV2Handler> >::invoke() () from /arrow/r/check/arrow.Rcheck/arrow/libs/arrow.so
#7  0x00007f8a5650efa0 in arrow::FutureImpl::AddCallback(arrow::internal::FnOnce<void ()>) () from /arrow/r/check/arrow.Rcheck/arrow/libs/arrow.so
#8  0x00007f8a566e67a9 in arrow::fs::(anonymous namespace)::TreeWalker::ListObjectsV2Handler::SpawnListObjectsV2() () from /arrow/r/check/arrow.Rcheck/arrow/libs/arrow.so
#9  0x00007f8a566e723f in arrow::fs::(anonymous namespace)::TreeWalker::WalkChild(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, int) () from /arrow/r/check/arrow.Rcheck/arrow/libs/arrow.so
#10 0x00007f8a566e74b1 in arrow::fs::(anonymous namespace)::TreeWalker::DoWalk() () from /arrow/r/check/arrow.Rcheck/arrow/libs/arrow.so
```

The callback `ListObjectsV2Handler` is being called recursively and the mutex is non-reentrant thus deadlock.

To fix it I got rid of the mutex on `TreeWalker` by using `arrow::util::internal::TaskGroup` instead of manually tracking the #/status of in-flight requests.

Closes apache#9842 from westonpace/bugfix/arrow-12040

Lead-authored-by: Weston Pace <[email protected]>
Co-authored-by: Antoine Pitrou <[email protected]>
Signed-off-by: Antoine Pitrou <[email protected]>
westonpace added a commit that referenced this pull request Jun 16, 2021
Before change:

```
Direct leak of 65536 byte(s) in 1 object(s) allocated from:
    #0 0x522f09 in
    #1 0x7f28ae5826f4 in
    #2 0x7f28ae57fa5d in
    #3 0x7f28ae58cb0f in
    #4 0x7f28ae58bda0 in
    ...
```

After change:
```
Direct leak of 65536 byte(s) in 1 object(s) allocated from:
    #0 0x522f09 in posix_memalign (/build/cpp/debug/arrow-dataset-file-csv-test+0x522f09)
    #1 0x7f28ae5826f4 in arrow::(anonymous namespace)::SystemAllocator::AllocateAligned(long, unsigned char**) /arrow/cpp/src/arrow/memory_pool.cc:213:24
    #2 0x7f28ae57fa5d in arrow::BaseMemoryPoolImpl<arrow::(anonymous namespace)::SystemAllocator>::Allocate(long, unsigned char**) /arrow/cpp/src/arrow/memory_pool.cc:405:5
    #3 0x7f28ae58cb0f in arrow::PoolBuffer::Reserve(long) /arrow/cpp/src/arrow/memory_pool.cc:717:9
    #4 0x7f28ae58bda0 in arrow::PoolBuffer::Resize(long, bool) /arrow/cpp/src/arrow/memory_pool.cc:741:7
    ...
```

Closes apache#10498 from westonpace/feature/ARROW-13027--c-fix-asan-stack-traces-in-ci

Authored-by: Weston Pace <[email protected]>
Signed-off-by: Antoine Pitrou <[email protected]>
westonpace pushed a commit that referenced this pull request Nov 29, 2021
Error log of Valgrind failure:
```
[----------] 3 tests from TestArrowReadDeltaEncoding
[ RUN      ] TestArrowReadDeltaEncoding.DeltaBinaryPacked
[       OK ] TestArrowReadDeltaEncoding.DeltaBinaryPacked (812 ms)
[ RUN      ] TestArrowReadDeltaEncoding.DeltaByteArray
==12587== Conditional jump or move depends on uninitialised value(s)
==12587==    at 0x4F12C57: Advance (bit_stream_utils.h:426)
==12587==    by 0x4F12C57: parquet::(anonymous namespace)::DeltaBitPackDecoder<parquet::PhysicalType<(parquet::Type::type)1> >::GetInternal(int*, int) (encoding.cc:2216)
==12587==    by 0x4F13823: Decode (encoding.cc:2091)
==12587==    by 0x4F13823: parquet::(anonymous namespace)::DeltaByteArrayDecoder::SetData(int, unsigned char const*, int) (encoding.cc:2360)
==12587==    by 0x4E89EF5: parquet::(anonymous namespace)::ColumnReaderImplBase<parquet::PhysicalType<(parquet::Type::type)6> >::InitializeDataDecoder(parquet::DataPage const&, long) (column_reader.cc:797)
==12587==    by 0x4E9AE63: ReadNewPage (column_reader.cc:614)
==12587==    by 0x4E9AE63: HasNextInternal (column_reader.cc:576)
==12587==    by 0x4E9AE63: parquet::internal::(anonymous namespace)::TypedRecordReader<parquet::PhysicalType<(parquet::Type::type)6> >::ReadRecords(long) (column_reader.cc:1228)
==12587==    by 0x4DFB19F: parquet::arrow::(anonymous namespace)::LeafReader::LoadBatch(long) (reader.cc:467)
==12587==    by 0x4DF513C: parquet::arrow::ColumnReaderImpl::NextBatch(long, std::shared_ptr<arrow::ChunkedArray>*) (reader.cc:108)
==12587==    by 0x4DFB74D: parquet::arrow::(anonymous namespace)::FileReaderImpl::ReadColumn(int, std::vector<int, std::allocator<int> > const&, parquet::arrow::ColumnReader*, std::shared_ptr<arrow::ChunkedArray>*) (reader.cc:273)
==12587==    by 0x4E11FDA: operator() (reader.cc:1180)
==12587==    by 0x4E11FDA: arrow::Future<std::vector<std::shared_ptr<arrow::ChunkedArray>, std::allocator<arrow::Future> > > arrow::internal::OptionalParallelForAsync<parquet::arrow::(anonymous namespace)::FileReaderImpl::DecodeRowGroups(std::shared_ptr<parquet::arrow::(anonymous namespace)::FileReaderImpl>, std::vector<int, std::allocator<int> > const&, std::vector<int, std::allocator<int> > const&, arrow::internal::Executor*)::{lambda(unsigned long, std::shared_ptr<parquet::arrow::ColumnReaderImpl>)#1}&, std::shared_ptr<parquet::arrow::ColumnReaderImpl>, std::shared_ptr<arrow::ChunkedArray> >(bool, std::vector<std::shared_ptr<parquet::arrow::ColumnReaderImpl>, std::allocator<arrow::Future<std::vector<std::shared_ptr<arrow::ChunkedArray>, std::allocator<arrow::Future> > > > >, parquet::arrow::(anonymous namespace)::FileReaderImpl::DecodeRowGroups(std::shared_ptr<parquet::arrow::(anonymous namespace)::FileReaderImpl>, std::vector<int, std::allocator<int> > const&, std::vector<int, std::allocator<int> > const&, arrow::internal::Executor*)::{lambda(unsigned long, std::shared_ptr<parquet::arrow::ColumnReaderImpl>)#1}&, arrow::internal::Executor*) (parallel.h:95)
==12587==    by 0x4E126A9: parquet::arrow::(anonymous namespace)::FileReaderImpl::DecodeRowGroups(std::shared_ptr<parquet::arrow::(anonymous namespace)::FileReaderImpl>, std::vector<int, std::allocator<int> > const&, std::vector<int, std::allocator<int> > const&, arrow::internal::Executor*) (reader.cc:1198)
==12587==    by 0x4E12F50: parquet::arrow::(anonymous namespace)::FileReaderImpl::ReadRowGroups(std::vector<int, std::allocator<int> > const&, std::vector<int, std::allocator<int> > const&, std::shared_ptr<arrow::Table>*) (reader.cc:1160)
==12587==    by 0x4DFA2BC: parquet::arrow::(anonymous namespace)::FileReaderImpl::ReadTable(std::vector<int, std::allocator<int> > const&, std::shared_ptr<arrow::Table>*) (reader.cc:198)
==12587==    by 0x4DFA392: parquet::arrow::(anonymous namespace)::FileReaderImpl::ReadTable(std::shared_ptr<arrow::Table>*) (reader.cc:289)
==12587==    by 0x1DCE62: parquet::arrow::TestArrowReadDeltaEncoding::ReadTableFromParquetFile(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, std::shared_ptr<arrow::Table>*) (arrow_reader_writer_test.cc:4174)
==12587==    by 0x2266D2: parquet::arrow::TestArrowReadDeltaEncoding_DeltaByteArray_Test::TestBody() (arrow_reader_writer_test.cc:4209)
==12587==    by 0x4AD2C9B: void testing::internal::HandleSehExceptionsInMethodIfSupported<testing::Test, void>(testing::Test*, void (testing::Test::*)(), char const*) (gtest.cc:2607)
==12587==    by 0x4AC9DD1: void testing::internal::HandleExceptionsInMethodIfSupported<testing::Test, void>(testing::Test*, void (testing::Test::*)(), char const*) (gtest.cc:2643)
==12587==    by 0x4AA4C02: testing::Test::Run() (gtest.cc:2682)
==12587==    by 0x4AA563A: testing::TestInfo::Run() (gtest.cc:2861)
==12587==    by 0x4AA600F: testing::TestSuite::Run() (gtest.cc:3015)
==12587==    by 0x4AB631B: testing::internal::UnitTestImpl::RunAllTests() (gtest.cc:5855)
==12587==    by 0x4AD3CE7: bool testing::internal::HandleSehExceptionsInMethodIfSupported<testing::internal::UnitTestImpl, bool>(testing::internal::UnitTestImpl*, bool (testing::internal::UnitTestImpl::*)(), char const*) (gtest.cc:2607)
==12587==    by 0x4ACB063: bool testing::internal::HandleExceptionsInMethodIfSupported<testing::internal::UnitTestImpl, bool>(testing::internal::UnitTestImpl*, bool (testing::internal::UnitTestImpl::*)(), char const*) (gtest.cc:2643)
==12587==    by 0x4AB47B6: testing::UnitTest::Run() (gtest.cc:5438)
==12587==    by 0x4218918: RUN_ALL_TESTS() (gtest.h:2490)
==12587==    by 0x421895B: main (gtest_main.cc:52)
```

Closes apache#11725 from pitrou/ARROW-14704-parquet-valgrind

Authored-by: Antoine Pitrou <[email protected]>
Signed-off-by: Antoine Pitrou <[email protected]>
westonpace pushed a commit that referenced this pull request Feb 11, 2022
Documented SimpleExtensionType and simplified ExtensionSet
westonpace pushed a commit that referenced this pull request Mar 3, 2022
TODOs:
Convert cheat sheet to PDF and hide slide #1.

Closes apache#12445 from pachadotdev/patch-4

Lead-authored-by: Stephanie Hazlitt <[email protected]>
Co-authored-by: Pachá <[email protected]>
Co-authored-by: Mauricio Vargas <[email protected]>
Co-authored-by: Pachá <[email protected]>
Signed-off-by: Jonathan Keane <[email protected]>
westonpace pushed a commit that referenced this pull request Apr 1, 2025
…n timezone (apache#45051)

### Rationale for this change

If the timezone database is present on the system, but does not contain a timezone referenced in a ORC file, the ORC reader will crash with an uncaught C++ exception.

This can happen for example on Ubuntu 24.04 where some timezone aliases have been removed from the main `tzdata` package to a `tzdata-legacy` package. If `tzdata-legacy` is not installed, trying to read a ORC file that references e.g. the "US/Pacific" timezone would crash.

Here is a backtrace excerpt:
```
#12 0x00007f1a3ce23a55 in std::terminate() () from /lib/x86_64-linux-gnu/libstdc++.so.6
#13 0x00007f1a3ce39391 in __cxa_throw () from /lib/x86_64-linux-gnu/libstdc++.so.6
#14 0x00007f1a3f4accc4 in orc::loadTZDB(std::basic_string<char, std::char_traits<char>, std::allocator<char> > const&) ()
   from /tmp/arrow-HEAD.ArqTs/venv-wheel-3.12-manylinux_2_17_x86_64.manylinux2014_x86_64/lib/python3.12/site-packages/pyarrow/libarrow.so.1900
#15 0x00007f1a3f4ad392 in std::call_once<orc::LazyTimezone::getImpl() const::{lambda()#1}>(std::once_flag&, orc::LazyTimezone::getImpl() const::{lambda()#1}&&)::{lambda()#2}::_FUN() () from /tmp/arrow-HEAD.ArqTs/venv-wheel-3.12-manylinux_2_17_x86_64.manylinux2014_x86_64/lib/python3.12/site-packages/pyarrow/libarrow.so.1900
#16 0x00007f1a4298bec3 in __pthread_once_slow (once_control=0xa5ca7c8, init_routine=0x7f1a3ce69420 <__once_proxy>) at ./nptl/pthread_once.c:116
#17 0x00007f1a3f4a9ad0 in orc::LazyTimezone::getEpoch() const ()
   from /tmp/arrow-HEAD.ArqTs/venv-wheel-3.12-manylinux_2_17_x86_64.manylinux2014_x86_64/lib/python3.12/site-packages/pyarrow/libarrow.so.1900
#18 0x00007f1a3f4e76b1 in orc::TimestampColumnReader::TimestampColumnReader(orc::Type const&, orc::StripeStreams&, bool) ()
   from /tmp/arrow-HEAD.ArqTs/venv-wheel-3.12-manylinux_2_17_x86_64.manylinux2014_x86_64/lib/python3.12/site-packages/pyarrow/libarrow.so.1900
#19 0x00007f1a3f4e84ad in orc::buildReader(orc::Type const&, orc::StripeStreams&, bool, bool, bool) ()
   from /tmp/arrow-HEAD.ArqTs/venv-wheel-3.12-manylinux_2_17_x86_64.manylinux2014_x86_64/lib/python3.12/site-packages/pyarrow/libarrow.so.1900
#20 0x00007f1a3f4e8dd7 in orc::StructColumnReader::StructColumnReader(orc::Type const&, orc::StripeStreams&, bool, bool) ()
   from /tmp/arrow-HEAD.ArqTs/venv-wheel-3.12-manylinux_2_17_x86_64.manylinux2014_x86_64/lib/python3.12/site-packages/pyarrow/libarrow.so.1900
#21 0x00007f1a3f4e8532 in orc::buildReader(orc::Type const&, orc::StripeStreams&, bool, bool, bool) ()
   from /tmp/arrow-HEAD.ArqTs/venv-wheel-3.12-manylinux_2_17_x86_64.manylinux2014_x86_64/lib/python3.12/site-packages/pyarrow/libarrow.so.1900
#22 0x00007f1a3f4925e9 in orc::RowReaderImpl::startNextStripe() ()
   from /tmp/arrow-HEAD.ArqTs/venv-wheel-3.12-manylinux_2_17_x86_64.manylinux2014_x86_64/lib/python3.12/site-packages/pyarrow/libarrow.so.1900
#23 0x00007f1a3f492c9d in orc::RowReaderImpl::next(orc::ColumnVectorBatch&) ()
   from /tmp/arrow-HEAD.ArqTs/venv-wheel-3.12-manylinux_2_17_x86_64.manylinux2014_x86_64/lib/python3.12/site-packages/pyarrow/libarrow.so.1900
#24 0x00007f1a3e6b251f in arrow::adapters::orc::ORCFileReader::Impl::ReadBatch(orc::RowReaderOptions const&, std::shared_ptr<arrow::Schema> const&, long) ()
   from /tmp/arrow-HEAD.ArqTs/venv-wheel-3.12-manylinux_2_17_x86_64.manylinux2014_x86_64/lib/python3.12/site-packages/pyarrow/libarrow.so.1900
```

### What changes are included in this PR?

Catch C++ exceptions when iterating ORC batches instead of letting them slip through.

### Are these changes tested?

Yes.

### Are there any user-facing changes?

No.
* GitHub Issue: apache#40633

Authored-by: Antoine Pitrou <[email protected]>
Signed-off-by: Sutou Kouhei <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.