@alamb alamb commented Nov 10, 2020

Add a note about being patient between completing the steps and being able to merge PRs

emkornfield and others added 30 commits October 2, 2020 21:05
Will start a thread on the mailing list for a Vote.

Closes apache#8293 from emkornfield/decimal256_format

Authored-by: Micah Kornfield <[email protected]>
Signed-off-by: Micah Kornfield <[email protected]>
This PR is built on top of apache#8225 and replaces `Arc<Mutex<dyn ...>>` with `Box<dyn ...>`.

In the TopK example, I had to move some functions away from the `impl`. This is because `self` cannot be borrowed as mutable and immutable at the same time, and, during iteration, it was being borrowed as mutable (to update the BTree) and as immutable (to access the `input`). There is probably a better way of achieving this e.g. via interior mutability.
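The borrow conflict described above can be sketched in plain Rust (names invented for illustration; this is not the actual TopK code): moving the mutation into a free function lets the compiler see that the two borrows touch disjoint fields.

```rust
use std::collections::BTreeMap;

struct TopK {
    input: Vec<i64>,
    top: BTreeMap<i64, ()>,
}

// A free function borrowing only the BTree; calling a `&mut self` method here
// instead would borrow all of `self` and conflict with the loop below.
fn insert_value(top: &mut BTreeMap<i64, ()>, value: i64) {
    top.insert(value, ());
}

impl TopK {
    fn run(&mut self) {
        for v in &self.input {               // immutable borrow of `self.input`
            insert_value(&mut self.top, *v); // mutable borrow of `self.top` only
        }
    }
}

fn main() {
    let mut t = TopK { input: vec![3, 1, 2], top: BTreeMap::new() };
    t.run();
    assert_eq!(t.top.len(), 3);
}
```

An alternative, as noted above, would be interior mutability (e.g. `RefCell`), at the cost of moving the borrow check to runtime.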

Closes apache#8307 from jorgecarleitao/box_iterator

Authored-by: Jorge C. Leitao <[email protected]>
Signed-off-by: Andy Grove <[email protected]>
First attempt at contributing to Arrow/DataFusion, just fixing this small issue around aliased aggregate columns. Please let me know if this isn't an ideal implementation!

Closes apache#8322 from returnString/allow_aliased_aggregates

Authored-by: Ruan Pearce-Authers <[email protected]>
Signed-off-by: Andy Grove <[email protected]>
Just a small cleanup by moving common code to a macro and using it in the tests.

Closes apache#8328 from jorgecarleitao/clean_tests

Authored-by: Jorge C. Leitao <[email protected]>
Signed-off-by: Andy Grove <[email protected]>
…onality

Updates the top level README with more details about the crates, their current functionality, and what they enable.

Closes apache#8313 from jorgecarleitao/doc-base

Lead-authored-by: Jorge C. Leitao <[email protected]>
Co-authored-by: Jorge Leitao <[email protected]>
Signed-off-by: Jorge C. Leitao <[email protected]>
Null values should be printed as "" when pretty printing. Prior to this PR, null values in primitive arrays were rendered as the type's default value, as pointed out by @jhorstmann on apache#8331 (comment)
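The rule this PR enforces can be sketched as follows (illustrative only, not the arrow pretty-printer itself):

```rust
// A null cell must render as "", not as the type's default value.
fn render_cell(v: Option<i64>) -> String {
    match v {
        Some(x) => x.to_string(),
        None => String::new(), // before the fix this effectively showed "0"
    }
}

fn main() {
    assert_eq!(render_cell(Some(1)), "1");
    assert_eq!(render_cell(None), "");
}
```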

Closes apache#8332 from alamb/alamb/ARROW-10169-fix-null-in-pretty-print

Authored-by: alamb <[email protected]>
Signed-off-by: Jorge C. Leitao <[email protected]>
Extended lib documentation, migrating some of it from the README.

Closes apache#8323 from jorgecarleitao/readme

Authored-by: Jorge C. Leitao <[email protected]>
Signed-off-by: Jorge C. Leitao <[email protected]>
This adds some logic for handling dictionary arrays when pretty printing batches. Part of a larger set of work I am working on for better DictionaryArray handling: https://issues.apache.org/jira/browse/ARROW-10159

Closes apache#8331 from alamb/alamb/ARROW-10162-pretty-dictionary

Authored-by: alamb <[email protected]>
Signed-off-by: Jorge C. Leitao <[email protected]>
…onContextState>

This makes this initialization public, which is natural as `ExecutionContext`'s only attribute is `ExecutionContextState`.

Closes apache#8335 from jorgecarleitao/pub_from_state

Authored-by: Jorge C. Leitao <[email protected]>
Signed-off-by: Jorge C. Leitao <[email protected]>
This PR adds a new test job for a big-endian Java environment to TravisCI. The s390x job is still kept as
```
   allow_failures:
     - arch: s390x
```

Closes apache#7938 from kiszk/ARROW-9701

Lead-authored-by: Kazuaki Ishizaki <[email protected]>
Co-authored-by: Sutou Kouhei <[email protected]>
Signed-off-by: Sutou Kouhei <[email protected]>
@kszucs @nevi-me @andygrove @paddyhoran what do you think about this? (please tag others if relevant)

The idea being:

* PRs that change code in `rust/` are labeled `rust-lang`
* PRs that change code in `rust/datafusion` are labeled `datafusion`

The problems it solves:

* it is currently difficult to filter PRs by component, as we cannot bookmark a view for that filter
* It is currently more difficult to find the component in the title than on the labels, due to the coloring
* PRs may change code in `rust/` even though it is not labeled as such, and folks in Rust would probably like to know about it

Since @andygrove has been adding labels, I thought that we could automate it as an action.

I did not expand to other labels as I am not sure other parts use github labels at all.
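A configuration for `actions/labeler` along these lines would map path globs to labels (the globs here are illustrative; the actual file in the PR may differ):

```yaml
# .github/labeler.yml (sketch)
rust-lang:
  - rust/**/*
datafusion:
  - rust/datafusion/**/*
```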

Closes apache#8324 from jorgecarleitao/labeler

Authored-by: Jorge C. Leitao <[email protected]>
Signed-off-by: Sutou Kouhei <[email protected]>
Closes apache#8339 from romainfrancois/ARROW-9786/unvendorcpp11

Authored-by: Romain Francois <[email protected]>
Signed-off-by: Neal Richardson <[email protected]>
See: https://issues.apache.org/jira/browse/ARROW-9941

Before:

```
>>> ty
IntegerType(extension<arrow.py_extension_type>)
>>> str(ty)
'extension<arrow.py_extension_type>'
>>> repr(ty)
'IntegerType(extension<arrow.py_extension_type>)'
>>> pa.DataType.__str__(ty)
'extension<arrow.py_extension_type>'
```

After:

```
>>> ty
IntegerType(DataType(int64))
>>> str(ty)
'extension<arrow.py_extension_type<IntegerType>>'
>>> repr(ty)
'IntegerType(DataType(int64))'
>>> pa.DataType.__str__(ty)
'extension<arrow.py_extension_type<IntegerType>>'
```

Closes apache#8312 from dianaclarke/ARROW-9941

Authored-by: Diana Clarke <[email protected]>
Signed-off-by: Antoine Pitrou <[email protected]>
* Upgrades to TypeScript 4.0.2
* Fixes typings for TypeScript versions 3.9+

Closes https://issues.apache.org/jira/browse/ARROW-8394

Closes apache#8216 from trxcllnt/fix/typescript-3.9-errors

Authored-by: ptaylor <[email protected]>
Signed-off-by: Neal Richardson <[email protected]>
…tion

Minimal change to allow this to be used on arm 32 targets.

A better solution may be to utilise the crc32 instruction à la the optimised Intel implementation, but this is better than nothing. There should probably be a crate we could import the hash implementation from, rather than having to repeat optimisations everywhere.

Closes apache#8338 from andrew-stevenson-frequenz/issue-8735

Authored-by: Andrew Stevenson <[email protected]>
Signed-off-by: Andy Grove <[email protected]>
…t columns

Parquet row group statistics did not respect dict encoding. Also added a workaround to support filtering a dictionary encoded column.

Closes apache#8311 from bkietz/10008-dict-row-group-stats

Authored-by: Benjamin Kietzman <[email protected]>
Signed-off-by: Benjamin Kietzman <[email protected]>
When a dictionary changes from the previous batch, it is emitted again in the IPC stream.
If this happens when writing the IPC file format, an error is returned.

Closes apache#8302 from pitrou/ARROW-10121-ipc-emit-dicts

Authored-by: Antoine Pitrou <[email protected]>
Signed-off-by: Wes McKinney <[email protected]>
Closes apache#8304 from kszucs/awsforneal

Lead-authored-by: Neal Richardson <[email protected]>
Co-authored-by: Sutou Kouhei <[email protected]>
Co-authored-by: Krisztián Szűcs <[email protected]>
Signed-off-by: Neal Richardson <[email protected]>
…rializable

See: https://issues.apache.org/jira/browse/ARROW-10147

Before:

```
>>> import numpy as np
>>> import pyarrow as pa
>>> import pandas as pd
>>> idx = pd.RangeIndex(0, 4, name=np.int64(6))
>>> df = pd.DataFrame(index=idx)
>>> pa.table(df)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "pyarrow/table.pxi", line 2042, in pyarrow.lib.table
    return Table.from_pandas(data, schema=schema, nthreads=nthreads)
  File "pyarrow/table.pxi", line 1394, in pyarrow.lib.Table.from_pandas
    arrays, schema = dataframe_to_arrays(
  File "/Users/diana/workspace/arrow/python/pyarrow/pandas_compat.py", line 604, in dataframe_to_arrays
    pandas_metadata = construct_metadata(df, column_names, index_columns,
  File "/Users/diana/workspace/arrow/python/pyarrow/pandas_compat.py", line 237, in construct_metadata
    b'pandas': json.dumps({
  File "/Library/Frameworks/Python.framework/Versions/3.8/lib/python3.8/json/__init__.py", line 231, in dumps
    return _default_encoder.encode(obj)
  File "/Library/Frameworks/Python.framework/Versions/3.8/lib/python3.8/json/encoder.py", line 199, in encode
    chunks = self.iterencode(o, _one_shot=True)
  File "/Library/Frameworks/Python.framework/Versions/3.8/lib/python3.8/json/encoder.py", line 257, in iterencode
    return _iterencode(o, 0)
  File "/Library/Frameworks/Python.framework/Versions/3.8/lib/python3.8/json/encoder.py", line 179, in default
    raise TypeError(f'Object of type {o.__class__.__name__} '
TypeError: Object of type int64 is not JSON serializable
```

After:

```
>>> import numpy as np
>>> import pyarrow as pa
>>> import pandas as pd
>>> idx = pd.RangeIndex(0, 4, name=np.int64(6))
>>> df = pd.DataFrame(index=idx)
>>> table = pa.table(df)
>>> metadata = table.schema.pandas_metadata
>>> import pprint
>>> pprint.pprint(metadata)
{'column_indexes': [{'field_name': None,
                     'metadata': None,
                     'name': None,
                     'numpy_type': 'object',
                     'pandas_type': 'empty'}],
 'columns': [],
 'creator': {'library': 'pyarrow',
             'version': '2.0.0.dev381+g06830c954.d20201001'},
 'index_columns': [{'kind': 'range',
                    'name': '6',
                    'start': 0,
                    'step': 1,
                    'stop': 4}],
 'pandas_version': '1.1.2'}
```

Closes apache#8314 from dianaclarke/ARROW-10147

Authored-by: Diana Clarke <[email protected]>
Signed-off-by: Wes McKinney <[email protected]>
…ames.

This adds a check, during logical planning, to the uniqueness of column names after a projection and aggregation.

This eliminates the risk that a user creates a wrong logical query that silently picks the first column named X when there are two columns named X.

This does not eliminate this risk when the column name is duplicated on a scan.
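The planning-time check can be sketched as a simple uniqueness validation over the output schema's column names (a minimal sketch, not the actual DataFusion code):

```rust
use std::collections::HashSet;

// Reject a projection/aggregation whose output column names collide.
fn validate_unique_names(names: &[&str]) -> Result<(), String> {
    let mut seen = HashSet::new();
    for name in names {
        if !seen.insert(*name) {
            return Err(format!("duplicate column name: {}", name));
        }
    }
    Ok(())
}

fn main() {
    assert!(validate_unique_names(&["a", "b"]).is_ok());
    assert!(validate_unique_names(&["x", "x"]).is_err());
}
```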

Closes apache#8334 from jorgecarleitao/check_schema

Authored-by: Jorge C. Leitao <[email protected]>
Signed-off-by: Jorge C. Leitao <[email protected]>
This PR fixes a bug on which GenericStringArray::from(ArrayData) was able to be created with the wrong DataType.

It also simplifies generics that use `GenericStringArray`, as they no longer need to pass the corresponding `DataType` and can instead rely on a `const` to do it for them.

Closes apache#8296 from jorgecarleitao/check_datatypes

Authored-by: Jorge C. Leitao <[email protected]>
Signed-off-by: Jorge C. Leitao <[email protected]>
It seems that I used a notation in the labeler that [is only supported in the master version of it](actions/labeler#73 (comment)), which is causing all PRs to fail. This PR fixes that. I verified it on my own fork: https://github.com/jorgecarleitao/arrow/pull/12

Closes apache#8350 from jorgecarleitao/labeler

Authored-by: Jorge C. Leitao <[email protected]>
Signed-off-by: Sutou Kouhei <[email protected]>
This fixes a small typo in the example provided for the C data interface: the [specification](https://arrow.apache.org/docs/format/CDataInterface.html#data-type-description-format-strings) uses `"i"` (india) to represent i32, but the [example](https://arrow.apache.org/docs/format/CDataInterface.html#exporting-a-simple-int32-array) uses `"l"` (lima) to represent i32.

Closes apache#8357 from jorgecarleitao/fix_example

Authored-by: Jorge C. Leitao <[email protected]>
Signed-off-by: Antoine Pitrou <[email protected]>
Also renamed an example that referred to the old name "table_api".

Closes apache#8355 from jorgecarleitao/fix

Authored-by: Jorge C. Leitao <[email protected]>
Signed-off-by: Andy Grove <[email protected]>
…sts, by using standard pretty printer

This PR removes (most of) the special case pretty printing code in DataFusion's sql integration test, `sql.rs` in favor of the standard pretty printer in `arrow::utils::pretty` and in the process adds support for DictionaryArray printing as well as standardizing how tests are run and output is compared.

I am working on adding support for `DictionaryArray` in DataFusion (and thus want to add tests to sql.rs). This specific PR's changes are larger than strictly necessary, but I felt it made it easier to write new tests in sql.rs. However, if people prefer, I could instead add a special case for `DictionaryArray` in `array_str` and accomplish my goal with a much smaller PR

Note: I found that using `Vec<Vec<String>>` to encode expected results rather than a `String` retains the nice property that differences to expected output are shown reasonably in the test output. For example:

```
---- csv_query_cast_literal stdout ----
thread 'csv_query_cast_literal' panicked at 'assertion failed: `(left == right)`
  left: `[["0.9294097332465232", "1.0"], ["0.3114712539863804", "1.0"]]`,
 right: `[["0.9294097332465232", "1"], ["0.3114712539863804", "1"]]`', datafusion/tests/sql.rs:502:5
```

Closes apache#8333 from alamb/alamb/ARROW-10167-cleanup-sql-display

Authored-by: alamb <[email protected]>
Signed-off-by: Andy Grove <[email protected]>
Closes apache#8342 from pitrou/ARROW-10120-nested-pq-benchmarks

Authored-by: Antoine Pitrou <[email protected]>
Signed-off-by: Antoine Pitrou <[email protected]>
…rray

```
pyarrow/tests/test_convert_builtin.py::test_fixed_size_binary_length_check ../src/arrow/array/builder_binary.cc:53:  Check failed: (size) == (byte_width_) Appending wrong size to FixedSizeBinaryBuilder
0   libarrow.200.0.0.dylib              0x000000010e7f9704 _ZN5arrow4util7CerrLog14PrintBackTraceEv + 52
1   libarrow.200.0.0.dylib              0x000000010e7f9622 _ZN5arrow4util7CerrLogD2Ev + 98
2   libarrow.200.0.0.dylib              0x000000010e7f9585 _ZN5arrow4util7CerrLogD1Ev + 21
3   libarrow.200.0.0.dylib              0x000000010e7f95ac _ZN5arrow4util7CerrLogD0Ev + 28
4   libarrow.200.0.0.dylib              0x000000010e7f9492 _ZN5arrow4util8ArrowLogD2Ev + 82
5   libarrow.200.0.0.dylib              0x000000010e7f94c5 _ZN5arrow4util8ArrowLogD1Ev + 21
6   libarrow.200.0.0.dylib              0x000000010e303ec1 _ZN5arrow22FixedSizeBinaryBuilder14CheckValueSizeEx + 209
7   libarrow.200.0.0.dylib              0x000000010e30c361 _ZN5arrow22FixedSizeBinaryBuilder12UnsafeAppendEN6nonstd7sv_lite17basic_string_viewIcNSt3__111char_traitsIcEEEE + 49
8   libarrow_python.200.0.0.dylib       0x000000010b4efa7d _ZN5arrow2py20PyPrimitiveConverterINS_19FixedSizeBinaryTypeEvE6AppendEP7_object + 813
```

The input `const char*` value gets implicitly casted to `string_view` which makes the length check fail in debug builds.

Closes apache#8360 from kszucs/ARROW-10193

Authored-by: Krisztián Szűcs <[email protected]>
Signed-off-by: Krisztián Szűcs <[email protected]>
…g array to Pandas

Fix a crash on conversion of e.g. struct of dictionaries.

Closes apache#8361 from pitrou/ARROW-10192-struct-of-dict

Lead-authored-by: Antoine Pitrou <[email protected]>
Co-authored-by: Krisztián Szűcs <[email protected]>
Signed-off-by: Krisztián Szűcs <[email protected]>
Closes apache#8369 from bkietz/10176-Nightly-valgrind-job-fail

Authored-by: Benjamin Kietzman <[email protected]>
Signed-off-by: Antoine Pitrou <[email protected]>
This is a follow-up to apache#7938, which forgot to download a required dll, `libprotobuf.so.18`, at build time.

Closes apache#8368 from kiszk/ARROW-10200

Authored-by: Kazuaki Ishizaki <[email protected]>
Signed-off-by: Sutou Kouhei <[email protected]>
jensenrichardson and others added 25 commits November 6, 2020 05:32
Adds CMake cases for the Intel compiler, which allows compilation using it. Also fixes the unknown compiler message, making it "Unknown compiler: <compiler id> <compiler version>" instead of "Unknown compiler: <compiler version> <compiler version>".

Closes apache#8601 from jensenrichardson/master

Authored-by: jensenrichardson <[email protected]>
Signed-off-by: Sutou Kouhei <[email protected]>
…builds

Closes apache#8599 from xhochy/ARROW-10502

Authored-by: Uwe L. Korn <[email protected]>
Signed-off-by: Sutou Kouhei <[email protected]>
…nt headers.

PR for https://issues.apache.org/jira/browse/ARROW-10467

Add a CallOption for specifying arbitrary properties as headers for RPC calls.
Add a ServerMiddleware that will intercept and pass the properties to a handler
when registered via the FlightServer builder.

Closes apache#8572 from kylepbit/ARROW-10467

Authored-by: Kyle Porter <[email protected]>
Signed-off-by: David Li <[email protected]>
Closes apache#8580 from jorisvandenbossche/ARROW-10482

Authored-by: Joris Van den Bossche <[email protected]>
Signed-off-by: Uwe L. Korn <[email protected]>
Add a ClientMiddleware that can read HTTP Set-Cookie and Set-Cookie2 headers
from server responses and transmit corresponding Cookie headers

Closes apache#8554 from jduo/cookie-handling

Lead-authored-by: James Duong <[email protected]>
Co-authored-by: Keerat Singh <[email protected]>
Co-authored-by: Keerat Singh <[email protected]>
Signed-off-by: David Li <[email protected]>
This change adds support for `LargeList` in `take()`.

There is an additional update to the underlying implementation of `take()` such that the indices may be any `PrimitiveArray` of `ArrowNumericType`, rather than only `UInt32Array`. This change is motivated by the recursive call to `take()` in `take_list()` ([here](https://github.com/apache/arrow/blob/b109195b77d85e513aab80650bd4b193e26a5471/rust/arrow/src/compute/kernels/take.rs#L324)), since in order to support `LargeListArray`, which use `i64` offsets, the recursive call must support indices arrays that are `Int64Array`.
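The generalization of the index type can be sketched outside the arrow crate as follows (signatures are illustrative, not the actual `take()` kernel): any integer index type convertible to `usize` works, so `i64` indices, as `LargeListArray` offsets require, are accepted alongside the original `u32`.

```rust
use std::convert::TryInto;

// Illustrative `take`: gathers `values` at `indices` of any integer type.
fn take<T: Clone, I: TryInto<usize> + Copy>(values: &[T], indices: &[I]) -> Vec<T>
where
    <I as TryInto<usize>>::Error: std::fmt::Debug,
{
    indices
        .iter()
        .map(|&i| values[i.try_into().unwrap()].clone())
        .collect()
}

fn main() {
    let values = vec!["a", "b", "c"];
    assert_eq!(take(&values, &[2u32, 0]), vec!["c", "a"]); // u32 indices (old behaviour)
    assert_eq!(take(&values, &[1i64, 1]), vec!["b", "b"]); // i64 indices (LargeList case)
}
```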

Closes apache#8556 from drusso/ARROW-10378

Authored-by: Daniel Russo <[email protected]>
Signed-off-by: Neville Dipale <[email protected]>
This allows binary arrays to be iterated and be constructed from an iterator. This is analogous to what we already support for primitive and string arrays.

Closes apache#8576 from jorgecarleitao/iter_binary

Authored-by: Jorge C. Leitao <[email protected]>
Signed-off-by: Neville Dipale <[email protected]>
… or Option<String>

Currently, our code-base supports creating a `StringArray` from an iterator of `String`.

However, from arrow's perspective, we should not care whether it is a `String` or a `&str`, as long as it can be represented via `AsRef<str>`. A user is sometimes able to create an iterator of `&str` instead of `String`, and should not have to convert one to the other before passing it to Arrow.

This PR makes this change.
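The pattern can be sketched generically (a minimal sketch, not the actual `StringArray` constructor): a single `AsRef<str>` bound accepts both `&str` and `String` items.

```rust
// Accepts iterators over &str, String, or anything string-like.
fn total_len<I, S>(iter: I) -> usize
where
    I: IntoIterator<Item = S>,
    S: AsRef<str>,
{
    iter.into_iter().map(|s| s.as_ref().len()).sum()
}

fn main() {
    assert_eq!(total_len(vec!["a", "bb"]), 3);           // &str items
    assert_eq!(total_len(vec![String::from("ccc")]), 3); // String items
}
```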

Closes apache#8575 from jorgecarleitao/array_from_str

Authored-by: Jorge C. Leitao <[email protected]>
Signed-off-by: Neville Dipale <[email protected]>
This changes a list datatype to use `Box<Field>` instead of `Box<DataType>`.

This change is needed in order to make both Parquet and IPC roundtrips pass.
The C++ implementation uses `Field`, as it allows preserving list field names and nullability.

There are some implementation details which we did not cover in this PR, and will work on as a follow-up, and these are:
* Documenting the naming conventions of Arrow and Parquet on list names (Arrow C++ uses "item", Parquet has different compatibility options)
* Fixing Parquet failing list tests that have been ignored (I'd like to do this separately, as this isn't a full fix)
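Why the `Field` matters can be sketched with simplified types (these are not the arrow crate's actual definitions): the field carries the child's name and nullability, which a bare `DataType` cannot.

```rust
#[derive(Debug, PartialEq)]
enum DataType {
    Int64,
    // The list's child is a full Field, not just a DataType.
    List(Box<Field>),
}

#[derive(Debug, PartialEq)]
struct Field {
    name: String,
    data_type: DataType,
    nullable: bool,
}

fn main() {
    let list = DataType::List(Box::new(Field {
        name: "item".to_string(), // e.g. Arrow C++'s conventional child name
        data_type: DataType::Int64,
        nullable: true,
    }));
    if let DataType::List(field) = &list {
        // Name and nullability survive, so IPC/Parquet roundtrips can preserve them.
        assert_eq!(field.name, "item");
        assert!(field.nullable);
    }
}
```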

Closes apache#8608 from nevi-me/ARROW-10261-BREAKING

Authored-by: Neville Dipale <[email protected]>
Signed-off-by: Neville Dipale <[email protected]>
This PR:

1. Makes `DictionaryArray::keys` be an `PrimitiveArray`.
2. Removes `NullIter` and many of the `unsafe` code that it contained
3. Simplifies the parquet writer implementation around dictionaries
4. Indirectly removes a bug on NullIter's `DoubleEndedIterator` on which the implementation was not following the spec with respect to `end == current` (compare with implementation of `DoubleEndedIterator` for std's `Vec`'s `IntoIter`).
5. Fixes error in computing the size of a dictionary, which was double-counting `data` and `values` (`values` is a reference to `data`)
6. Adds check that the dictionary's ArrayData's datatype matches `T::DATA_TYPE`.

Since `NullIter` was not being directly tested, no tests were removed.

This is backward incompatible and the migration requires replacing `dict.keys()`  by `dict.keys().into_iter()` whenever the user's intention is to use `keys()` as an iterator.

Closes apache#8561 from jorgecarleitao/dictionary_clean

Authored-by: Jorge C. Leitao <[email protected]>
Signed-off-by: Neville Dipale <[email protected]>
This PR removes `PrimitiveArray::new`. `PrimitiveArray::new` is `pub`, but it is dangerous because:
* when the buffer's content is not aligned with T, many of its methods cause UB
* when used with `null_count > 0`, many calls panic because the null bitmap is `None` while `null_count != 0`
* when used with `null_count > 0`, it creates an array that is out of spec (the buffer for the null bitmap is `None` but the null count is not zero)

Since:
* a change in this method's signature (to either add the bitmap or remove `null_count`) requires a backward incompatible change
* it is only used in tests
* we have good alternatives for creating primitive arrays:
    * from an ArrayData,
    * from a vector or vector of optionals
    * from an iterator

This PR removes it.
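One of the remaining safe routes, building from an iterator of optionals, can be sketched with simplified types (not the actual arrow implementation): the null bitmap is derived from the data itself, so it can never disagree with the null count.

```rust
struct SimpleArray {
    values: Vec<i64>,
    validity: Vec<bool>, // stands in for the null bitmap
}

impl FromIterator<Option<i64>> for SimpleArray {
    fn from_iter<I: IntoIterator<Item = Option<i64>>>(iter: I) -> Self {
        let mut values = Vec::new();
        let mut validity = Vec::new();
        for v in iter {
            validity.push(v.is_some());
            values.push(v.unwrap_or(0)); // slot value for nulls is arbitrary
        }
        SimpleArray { values, validity }
    }
}

fn main() {
    let a: SimpleArray = vec![Some(1), None, Some(3)].into_iter().collect();
    assert_eq!(a.validity, vec![true, false, true]);
    assert_eq!(a.values, vec![1, 0, 3]);
}
```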

Closes apache#8560 from jorgecarleitao/remove_new

Authored-by: Jorge C. Leitao <[email protected]>
Signed-off-by: Neville Dipale <[email protected]>
This puts it in line with `array::NullIter`. This PR is complementary to apache#8561 (which removes `NullIter`), so that the PrimitiveArrayIterator can be reversed in the same way `NullIter` can. Together with apache#8561, it re-allows Dictionary keys to be iterated backwards.

Closes apache#8562 from jorgecarleitao/double_ended

Authored-by: Jorge C. Leitao <[email protected]>
Signed-off-by: Neville Dipale <[email protected]>
[ARROW-10510](https://issues.apache.org/jira/browse/ARROW-10510)

This change adds benchmarks for `COUNT(DISTINCT)` queries. This is a small follow-up to [ARROW-10043](https://issues.apache.org/jira/browse/ARROW-10043) / apache#8222. In that PR, a number of implementation ideas were discussed for follow-ups, and having benchmarks will help evaluate them.

---

There are two benchmarks added:

* wide: all of the values are distinct; this is looking at worst-case performance
* narrow: only a handful of distinct values; this is closer to best-case performance

The wide benchmark runs ~ 7x slower than the narrow benchmark.

Closes apache#8606 from drusso/ARROW-10510

Authored-by: Daniel Russo <[email protected]>
Signed-off-by: Neville Dipale <[email protected]>
This is a major refactor of the `equal.rs` module.

The rationale for this change is manifold:

* currently, array comparison requires downcasting the array ref to its concrete type. This is painful and not very ergonomic, as the user must "guess" what to downcast to for comparison. We can see this in the hacks around the `sort`, `take` and `concatenate` kernels' tests, and some of the tests of the builders.
* the code in array comparison is difficult to follow given the amount of offset juggling it performs.
* the implementation currently indirectly uses many of the `unsafe` APIs that we have (via pointer arithmetic), which makes it risky to operate on and mutate.
* some code is being repeated.

This PR:

1. adds `impl PartialEq for dyn Array`, to allow `Array` comparison based on `Array::data` (main change)
2. Makes array equality to only depend on `ArrayData`, i.e. it no longer depends on concrete array types (such as `PrimitiveArray` and related API) to perform comparisons.
3. Significantly reduces the risk of panics and UB when composite arrays are of different types, by checking the types on `range` comparison
4. Makes array equality be statically dispatched, via `match datatype`.
5. DRY the code around array equality
6. Fixes an error in equality of dictionary with equal values
7. Added tests to equalities that were not tested (fixed binary, some edge cases of dictionaries)
8. splits `equal.rs` in smaller, more manageable files.
9. Removes `ArrayListOps`, since it is no longer needed
10. Moves Json equality to its own module, for clarity.
11. removes the need to have two functions per type to compare arrays.
12. Adds the number of buffers and their respective width to datatypes from the specification. This was backported from apache#8401
13. adds a benchmark for array equality

Note that this does not implement `PartialEq` for `ArrayData`, only for `dyn Array`, as different data does not imply different arrays (due to nullability). That implementation is being worked on in apache#8200.

IMO this PR significantly simplifies the code around array comparison, to the point where many implementations are 5 lines long.

This also improves performance by 10-40%.

<details>
 <summary>Benchmark results</summary>

```
Previous HEAD position was 3dd3c69d7 Added bench for equality.
Switched to branch 'equal'
Your branch is up to date with 'origin/equal'.
   Compiling arrow v3.0.0-SNAPSHOT (/Users/jorgecarleitao/projects/arrow/rust/arrow)
    Finished bench [optimized] target(s) in 51.28s
     Running /Users/jorgecarleitao/projects/arrow/rust/target/release/deps/equal-176c3cb11360bd12
Gnuplot not found, using plotters backend
equal_512               time:   [36.861 ns 36.894 ns 36.934 ns]
                        change: [-43.752% -43.400% -43.005%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 12 outliers among 100 measurements (12.00%)
  7 (7.00%) high mild
  5 (5.00%) high severe

equal_nulls_512         time:   [2.3271 us 2.3299 us 2.3331 us]
                        change: [-10.846% -9.0877% -7.7336%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 11 outliers among 100 measurements (11.00%)
  4 (4.00%) high mild
  7 (7.00%) high severe

equal_string_512        time:   [49.219 ns 49.347 ns 49.517 ns]
                        change: [-30.789% -30.538% -30.235%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 9 outliers among 100 measurements (9.00%)
  3 (3.00%) high mild
  6 (6.00%) high severe

equal_string_nulls_512  time:   [3.7873 us 3.7939 us 3.8013 us]
                        change: [-8.2944% -7.0636% -5.4266%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 12 outliers among 100 measurements (12.00%)
  4 (4.00%) high mild
  8 (8.00%) high severe
```

</details>

All tests are there, plus new tests for some of the edge cases and untested arrays.

This change is backward incompatible: `array1.equals(&array2)` no longer works; use `array1 == array2` instead, which is the idiomatic way of comparing structs and trait objects in Rust.
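The core idea, `PartialEq` implemented on the trait object and delegating to the underlying data, can be sketched with simplified types (not the arrow crate's actual definitions):

```rust
#[derive(Debug, PartialEq)]
struct ArrayData {
    values: Vec<i64>,
}

trait Array {
    fn data(&self) -> &ArrayData;
}

// Equality on the trait object delegates to the underlying data,
// so callers never need to downcast to a concrete array type.
impl PartialEq for dyn Array {
    fn eq(&self, other: &Self) -> bool {
        self.data() == other.data()
    }
}

struct Int64Array(ArrayData);

impl Array for Int64Array {
    fn data(&self) -> &ArrayData {
        &self.0
    }
}

fn main() {
    let a = Int64Array(ArrayData { values: vec![1, 2] });
    let b = Int64Array(ArrayData { values: vec![1, 2] });
    let x: &dyn Array = &a;
    let y: &dyn Array = &b;
    assert!(x == y); // no downcasting needed
}
```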

Closes apache#8541 from jorgecarleitao/equal

Authored-by: Jorge C. Leitao <[email protected]>
Signed-off-by: Neville Dipale <[email protected]>
Basically a reopening of apache#8393. After further discussion, this provides the cleanest contribution in terms of IP clearances and git history, and it should be fairly straightforward to manage the Julia-side of package release registering.

Closes apache#8547 from quinnj/jq/ARROW-10228

Authored-by: Jacob Quinn <[email protected]>
Signed-off-by: Neal Richardson <[email protected]>
…ath issue on Windows

Closes apache#8539 from jorisvandenbossche/ci-windows-fsspec

Authored-by: Joris Van den Bossche <[email protected]>
Signed-off-by: Joris Van den Bossche <[email protected]>
…e on CI

Closes apache#8573 from jorisvandenbossche/ARROW-10471-s3fs

Authored-by: Joris Van den Bossche <[email protected]>
Signed-off-by: Joris Van den Bossche <[email protected]>
At some point, Intel changed `__force_inline` to `__forceinline`, and Arrow does not compile without accounting for this. This changes the header file to define the macro as `__forceinline` instead of `__force_inline`. Source: [here](https://software.intel.com/content/www/us/en/develop/documentation/cpp-compiler-developer-guide-and-reference/top/compiler-reference/compiler-options/compiler-option-details/inlining-options/inline-forceinline-qinline-forceinline.html)

Closes apache#8600 from jensenrichardson/patch-1

Authored-by: jensenrichardson <[email protected]>
Signed-off-by: Antoine Pitrou <[email protected]>
…lang+Windows

Closes apache#8604 from xhochy/ARROW-10509

Authored-by: Uwe L. Korn <[email protected]>
Signed-off-by: Antoine Pitrou <[email protected]>
…WS config

Closes apache#8568 from xhochy/ARROW-10346

Authored-by: Uwe L. Korn <[email protected]>
Signed-off-by: Antoine Pitrou <[email protected]>
Separate Mode and Variance/Stddev kernels registration from basic
aggregation kernels.

Closes apache#8523 from cyb70289/agg-refine

Authored-by: Yibo Cai <[email protected]>
Signed-off-by: Antoine Pitrou <[email protected]>
Remove inclusion of heavyweights "arrow/api.h" and "arrow/builder.h", among other changes.

This seems to reduce CPU build time by about 3%.

Closes apache#8511 from pitrou/ARROW-7531-header-reduction

Authored-by: Antoine Pitrou <[email protected]>
Signed-off-by: Antoine Pitrou <[email protected]>
…nd string types

Closes apache#8529 from david1437/ARROW-5394

Lead-authored-by: david <[email protected]>
Co-authored-by: Antoine Pitrou <[email protected]>
Signed-off-by: Antoine Pitrou <[email protected]>
This PR splits the `array.rs` into multiple modules, so that it is a bit easier to navigate in the project, particularly `array.rs`. Semantically, there should be no difference, and this change is also backward compatible, public items are only exposed in `mod.rs`.

This change was initially proposed by @paddyhoran on ARROW-9361, and this is just a proposed implementation.

Some notes:

* Each commit is a different array
* I named the files `array_[type]` to distinguish from other modules that do not contain arrays.
* `null` and `Union` were not renamed
* Some functions were moved around / incorporated in `impl` since they were used in multiple places.

Closes apache#8543 from jorgecarleitao/clean_slice

Authored-by: Jorge C. Leitao <[email protected]>
Signed-off-by: Neville Dipale <[email protected]>
@alamb alamb closed this Nov 10, 2020
@alamb alamb deleted the alamb/update_docs branch December 7, 2020 19:30
alamb pushed a commit that referenced this pull request Apr 7, 2021
From a deadlocked run...

```
#0  0x00007f8a5d48dccd in __lll_lock_wait () from /lib64/libpthread.so.0
#1  0x00007f8a5d486f05 in pthread_mutex_lock () from /lib64/libpthread.so.0
#2  0x00007f8a566e7e89 in arrow::internal::FnOnce<void ()>::FnImpl<arrow::Future<Aws::Utils::Outcome<Aws::S3::Model::ListObjectsV2Result, Aws::S3::S3Error> >::Callback<arrow::fs::(anonymous namespace)::TreeWalker::ListObjectsV2Handler> >::invoke() () from /arrow/r/check/arrow.Rcheck/arrow/libs/arrow.so
#3  0x00007f8a5650efa0 in arrow::FutureImpl::AddCallback(arrow::internal::FnOnce<void ()>) () from /arrow/r/check/arrow.Rcheck/arrow/libs/arrow.so
#4  0x00007f8a566e67a9 in arrow::fs::(anonymous namespace)::TreeWalker::ListObjectsV2Handler::SpawnListObjectsV2() () from /arrow/r/check/arrow.Rcheck/arrow/libs/arrow.so
#5  0x00007f8a566e723f in arrow::fs::(anonymous namespace)::TreeWalker::WalkChild(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, int) () from /arrow/r/check/arrow.Rcheck/arrow/libs/arrow.so
#6  0x00007f8a566e827d in arrow::internal::FnOnce<void ()>::FnImpl<arrow::Future<Aws::Utils::Outcome<Aws::S3::Model::ListObjectsV2Result, Aws::S3::S3Error> >::Callback<arrow::fs::(anonymous namespace)::TreeWalker::ListObjectsV2Handler> >::invoke() () from /arrow/r/check/arrow.Rcheck/arrow/libs/arrow.so
#7  0x00007f8a5650efa0 in arrow::FutureImpl::AddCallback(arrow::internal::FnOnce<void ()>) () from /arrow/r/check/arrow.Rcheck/arrow/libs/arrow.so
#8  0x00007f8a566e67a9 in arrow::fs::(anonymous namespace)::TreeWalker::ListObjectsV2Handler::SpawnListObjectsV2() () from /arrow/r/check/arrow.Rcheck/arrow/libs/arrow.so
#9  0x00007f8a566e723f in arrow::fs::(anonymous namespace)::TreeWalker::WalkChild(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, int) () from /arrow/r/check/arrow.Rcheck/arrow/libs/arrow.so
#10 0x00007f8a566e74b1 in arrow::fs::(anonymous namespace)::TreeWalker::DoWalk() () from /arrow/r/check/arrow.Rcheck/arrow/libs/arrow.so
```

The callback `ListObjectsV2Handler` is invoked recursively, and because the mutex is non-reentrant, the second acquisition deadlocks.

To fix it, I got rid of the mutex on `TreeWalker` by using `arrow::util::internal::TaskGroup` instead of manually tracking the number and status of in-flight requests.
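The TaskGroup-style fix can be sketched as follows. This is a minimal illustration, not Arrow's actual `TaskGroup` API: instead of guarding walker state with a mutex that a synchronously invoked callback would try to re-acquire, an atomic in-flight counter is decremented as each task completes, and the completion hook fires when it reaches zero. All names here (`MiniTaskGroup`, `Walk`) are hypothetical.

```cpp
#include <atomic>
#include <functional>
#include <utility>

// Minimal sketch of a TaskGroup-like tracker: no lock is held while a
// task's callback runs, so a callback that re-enters the walker
// synchronously (as ListObjectsV2Handler did) cannot deadlock.
class MiniTaskGroup {
 public:
  explicit MiniTaskGroup(std::function<void()> on_finished)
      : on_finished_(std::move(on_finished)) {}

  // Register one unit of work before spawning it.
  void AddTask() { in_flight_.fetch_add(1, std::memory_order_relaxed); }

  // Called when a task completes, possibly synchronously inside the
  // code that spawned it. Fires the hook once the count drops to zero.
  void FinishTask() {
    if (in_flight_.fetch_sub(1, std::memory_order_acq_rel) == 1) {
      on_finished_();
    }
  }

 private:
  std::atomic<int> in_flight_{0};
  std::function<void()> on_finished_;
};

// Simulate a recursive directory walk whose "callbacks" run
// synchronously: the shape that deadlocked the mutex-based version.
static void Walk(MiniTaskGroup& tg, int depth, int* visited) {
  tg.AddTask();
  ++*visited;
  if (depth > 0) {
    Walk(tg, depth - 1, visited);  // re-entry while a task is in flight
  }
  tg.FinishTask();
}
```

Because completion is detected with a single atomic decrement rather than a critical section, re-entrant completion is safe by construction.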

Closes apache#9842 from westonpace/bugfix/arrow-12040

Lead-authored-by: Weston Pace <[email protected]>
Co-authored-by: Antoine Pitrou <[email protected]>
Signed-off-by: Antoine Pitrou <[email protected]>
alamb pushed a commit that referenced this pull request Dec 22, 2021
Before change:

```
Direct leak of 65536 byte(s) in 1 object(s) allocated from:
    #0 0x522f09 in
    #1 0x7f28ae5826f4 in
    #2 0x7f28ae57fa5d in
    #3 0x7f28ae58cb0f in
    #4 0x7f28ae58bda0 in
    ...
```

After change:
```
Direct leak of 65536 byte(s) in 1 object(s) allocated from:
    #0 0x522f09 in posix_memalign (/build/cpp/debug/arrow-dataset-file-csv-test+0x522f09)
    #1 0x7f28ae5826f4 in arrow::(anonymous namespace)::SystemAllocator::AllocateAligned(long, unsigned char**) /arrow/cpp/src/arrow/memory_pool.cc:213:24
    #2 0x7f28ae57fa5d in arrow::BaseMemoryPoolImpl<arrow::(anonymous namespace)::SystemAllocator>::Allocate(long, unsigned char**) /arrow/cpp/src/arrow/memory_pool.cc:405:5
    #3 0x7f28ae58cb0f in arrow::PoolBuffer::Reserve(long) /arrow/cpp/src/arrow/memory_pool.cc:717:9
    #4 0x7f28ae58bda0 in arrow::PoolBuffer::Resize(long, bool) /arrow/cpp/src/arrow/memory_pool.cc:741:7
    ...
```
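For context, symbolized sanitizer reports like the "after" output above generally require ASAN to find a symbolizer at runtime. The sketch below shows one common way to configure this in CI; it is an assumption for illustration, and the actual change in the PR may differ (the symbolizer path and test binary name are placeholders).

```shell
# Hedged sketch: point ASAN at llvm-symbolizer so leak/error reports
# show function names and source locations instead of bare addresses.
# Both paths below are assumptions, not taken from the PR.
export ASAN_SYMBOLIZER_PATH=/usr/bin/llvm-symbolizer
export ASAN_OPTIONS=symbolize=1
./debug/arrow-dataset-file-csv-test
```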

Closes apache#10498 from westonpace/feature/ARROW-13027--c-fix-asan-stack-traces-in-ci

Authored-by: Weston Pace <[email protected]>
Signed-off-by: Antoine Pitrou <[email protected]>
alamb pushed a commit that referenced this pull request Dec 22, 2021
Error log of Valgrind failure:
```
[----------] 3 tests from TestArrowReadDeltaEncoding
[ RUN      ] TestArrowReadDeltaEncoding.DeltaBinaryPacked
[       OK ] TestArrowReadDeltaEncoding.DeltaBinaryPacked (812 ms)
[ RUN      ] TestArrowReadDeltaEncoding.DeltaByteArray
==12587== Conditional jump or move depends on uninitialised value(s)
==12587==    at 0x4F12C57: Advance (bit_stream_utils.h:426)
==12587==    by 0x4F12C57: parquet::(anonymous namespace)::DeltaBitPackDecoder<parquet::PhysicalType<(parquet::Type::type)1> >::GetInternal(int*, int) (encoding.cc:2216)
==12587==    by 0x4F13823: Decode (encoding.cc:2091)
==12587==    by 0x4F13823: parquet::(anonymous namespace)::DeltaByteArrayDecoder::SetData(int, unsigned char const*, int) (encoding.cc:2360)
==12587==    by 0x4E89EF5: parquet::(anonymous namespace)::ColumnReaderImplBase<parquet::PhysicalType<(parquet::Type::type)6> >::InitializeDataDecoder(parquet::DataPage const&, long) (column_reader.cc:797)
==12587==    by 0x4E9AE63: ReadNewPage (column_reader.cc:614)
==12587==    by 0x4E9AE63: HasNextInternal (column_reader.cc:576)
==12587==    by 0x4E9AE63: parquet::internal::(anonymous namespace)::TypedRecordReader<parquet::PhysicalType<(parquet::Type::type)6> >::ReadRecords(long) (column_reader.cc:1228)
==12587==    by 0x4DFB19F: parquet::arrow::(anonymous namespace)::LeafReader::LoadBatch(long) (reader.cc:467)
==12587==    by 0x4DF513C: parquet::arrow::ColumnReaderImpl::NextBatch(long, std::shared_ptr<arrow::ChunkedArray>*) (reader.cc:108)
==12587==    by 0x4DFB74D: parquet::arrow::(anonymous namespace)::FileReaderImpl::ReadColumn(int, std::vector<int, std::allocator<int> > const&, parquet::arrow::ColumnReader*, std::shared_ptr<arrow::ChunkedArray>*) (reader.cc:273)
==12587==    by 0x4E11FDA: operator() (reader.cc:1180)
==12587==    by 0x4E11FDA: arrow::Future<std::vector<std::shared_ptr<arrow::ChunkedArray>, std::allocator<arrow::Future> > > arrow::internal::OptionalParallelForAsync<parquet::arrow::(anonymous namespace)::FileReaderImpl::DecodeRowGroups(std::shared_ptr<parquet::arrow::(anonymous namespace)::FileReaderImpl>, std::vector<int, std::allocator<int> > const&, std::vector<int, std::allocator<int> > const&, arrow::internal::Executor*)::{lambda(unsigned long, std::shared_ptr<parquet::arrow::ColumnReaderImpl>)#1}&, std::shared_ptr<parquet::arrow::ColumnReaderImpl>, std::shared_ptr<arrow::ChunkedArray> >(bool, std::vector<std::shared_ptr<parquet::arrow::ColumnReaderImpl>, std::allocator<arrow::Future<std::vector<std::shared_ptr<arrow::ChunkedArray>, std::allocator<arrow::Future> > > > >, parquet::arrow::(anonymous namespace)::FileReaderImpl::DecodeRowGroups(std::shared_ptr<parquet::arrow::(anonymous namespace)::FileReaderImpl>, std::vector<int, std::allocator<int> > const&, std::vector<int, std::allocator<int> > const&, arrow::internal::Executor*)::{lambda(unsigned long, std::shared_ptr<parquet::arrow::ColumnReaderImpl>)#1}&, arrow::internal::Executor*) (parallel.h:95)
==12587==    by 0x4E126A9: parquet::arrow::(anonymous namespace)::FileReaderImpl::DecodeRowGroups(std::shared_ptr<parquet::arrow::(anonymous namespace)::FileReaderImpl>, std::vector<int, std::allocator<int> > const&, std::vector<int, std::allocator<int> > const&, arrow::internal::Executor*) (reader.cc:1198)
==12587==    by 0x4E12F50: parquet::arrow::(anonymous namespace)::FileReaderImpl::ReadRowGroups(std::vector<int, std::allocator<int> > const&, std::vector<int, std::allocator<int> > const&, std::shared_ptr<arrow::Table>*) (reader.cc:1160)
==12587==    by 0x4DFA2BC: parquet::arrow::(anonymous namespace)::FileReaderImpl::ReadTable(std::vector<int, std::allocator<int> > const&, std::shared_ptr<arrow::Table>*) (reader.cc:198)
==12587==    by 0x4DFA392: parquet::arrow::(anonymous namespace)::FileReaderImpl::ReadTable(std::shared_ptr<arrow::Table>*) (reader.cc:289)
==12587==    by 0x1DCE62: parquet::arrow::TestArrowReadDeltaEncoding::ReadTableFromParquetFile(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, std::shared_ptr<arrow::Table>*) (arrow_reader_writer_test.cc:4174)
==12587==    by 0x2266D2: parquet::arrow::TestArrowReadDeltaEncoding_DeltaByteArray_Test::TestBody() (arrow_reader_writer_test.cc:4209)
==12587==    by 0x4AD2C9B: void testing::internal::HandleSehExceptionsInMethodIfSupported<testing::Test, void>(testing::Test*, void (testing::Test::*)(), char const*) (gtest.cc:2607)
==12587==    by 0x4AC9DD1: void testing::internal::HandleExceptionsInMethodIfSupported<testing::Test, void>(testing::Test*, void (testing::Test::*)(), char const*) (gtest.cc:2643)
==12587==    by 0x4AA4C02: testing::Test::Run() (gtest.cc:2682)
==12587==    by 0x4AA563A: testing::TestInfo::Run() (gtest.cc:2861)
==12587==    by 0x4AA600F: testing::TestSuite::Run() (gtest.cc:3015)
==12587==    by 0x4AB631B: testing::internal::UnitTestImpl::RunAllTests() (gtest.cc:5855)
==12587==    by 0x4AD3CE7: bool testing::internal::HandleSehExceptionsInMethodIfSupported<testing::internal::UnitTestImpl, bool>(testing::internal::UnitTestImpl*, bool (testing::internal::UnitTestImpl::*)(), char const*) (gtest.cc:2607)
==12587==    by 0x4ACB063: bool testing::internal::HandleExceptionsInMethodIfSupported<testing::internal::UnitTestImpl, bool>(testing::internal::UnitTestImpl*, bool (testing::internal::UnitTestImpl::*)(), char const*) (gtest.cc:2643)
==12587==    by 0x4AB47B6: testing::UnitTest::Run() (gtest.cc:5438)
==12587==    by 0x4218918: RUN_ALL_TESTS() (gtest.h:2490)
==12587==    by 0x421895B: main (gtest_main.cc:52)
```

Closes apache#11725 from pitrou/ARROW-14704-parquet-valgrind

Authored-by: Antoine Pitrou <[email protected]>
Signed-off-by: Antoine Pitrou <[email protected]>