@alamb alamb commented Nov 10, 2020

Add a note about being patient between completing the steps and being able to merge PRs

emkornfield and others added 30 commits October 2, 2020 21:05
Will start a thread on the mailing list for a Vote.

Closes apache#8293 from emkornfield/decimal256_format

Authored-by: Micah Kornfield <[email protected]>
Signed-off-by: Micah Kornfield <[email protected]>
This PR is built on top of apache#8225 and replaces `Arc<Mutex<dyn ...>>` with `Box<dyn ...>`.

In the TopK example, I had to move some functions away from the `impl`. This is because `self` cannot be borrowed as mutable and immutable at the same time, and, during iteration, it was being borrowed as mutable (to update the BTree) and as immutable (to access the `input`). There is probably a better way of achieving this e.g. via interior mutability.
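The borrow conflict described above can be sketched in plain Rust (names invented for illustration; this is not the actual TopK code): moving the mutation into a free function lets the compiler see that the two borrows touch disjoint fields.

```rust
use std::collections::BTreeMap;

struct TopK {
    input: Vec<i64>,
    top: BTreeMap<i64, ()>,
}

// A free function borrowing only the BTree; calling a `&mut self` method here
// instead would borrow all of `self` and conflict with the loop below.
fn insert_value(top: &mut BTreeMap<i64, ()>, value: i64) {
    top.insert(value, ());
}

impl TopK {
    fn run(&mut self) {
        for v in &self.input {               // immutable borrow of `self.input`
            insert_value(&mut self.top, *v); // mutable borrow of `self.top` only
        }
    }
}

fn main() {
    let mut t = TopK { input: vec![3, 1, 2], top: BTreeMap::new() };
    t.run();
    assert_eq!(t.top.len(), 3);
}
```

An alternative, as noted above, would be interior mutability (e.g. `RefCell`), at the cost of moving the borrow check to runtime.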

Closes apache#8307 from jorgecarleitao/box_iterator

Authored-by: Jorge C. Leitao <[email protected]>
Signed-off-by: Andy Grove <[email protected]>
First attempt at contributing to Arrow/DataFusion, just fixing this small issue around aliased aggregate columns. Please let me know if this isn't an ideal implementation!

Closes apache#8322 from returnString/allow_aliased_aggregates

Authored-by: Ruan Pearce-Authers <[email protected]>
Signed-off-by: Andy Grove <[email protected]>
Just a small cleanup by moving common code to a macro and using it in the tests.

Closes apache#8328 from jorgecarleitao/clean_tests

Authored-by: Jorge C. Leitao <[email protected]>
Signed-off-by: Andy Grove <[email protected]>
…onality

Updates the top level README with more details about the crates, their current functionality, and what they enable.

Closes apache#8313 from jorgecarleitao/doc-base

Lead-authored-by: Jorge C. Leitao <[email protected]>
Co-authored-by: Jorge Leitao <[email protected]>
Signed-off-by: Jorge C. Leitao <[email protected]>
Null values should be printed as "" when pretty printing. Prior to this PR, null values in primitive arrays were rendered as the type's default value, as pointed out by @jhorstmann on apache#8331 (comment)
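The rule this PR enforces can be sketched as follows (illustrative only, not the arrow pretty-printer itself):

```rust
// A null cell must render as "", not as the type's default value.
fn render_cell(v: Option<i64>) -> String {
    match v {
        Some(x) => x.to_string(),
        None => String::new(), // before the fix this effectively showed "0"
    }
}

fn main() {
    assert_eq!(render_cell(Some(1)), "1");
    assert_eq!(render_cell(None), "");
}
```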

Closes apache#8332 from alamb/alamb/ARROW-10169-fix-null-in-pretty-print

Authored-by: alamb <[email protected]>
Signed-off-by: Jorge C. Leitao <[email protected]>
Extended lib documentation, migrating some of it from the README.

Closes apache#8323 from jorgecarleitao/readme

Authored-by: Jorge C. Leitao <[email protected]>
Signed-off-by: Jorge C. Leitao <[email protected]>
This adds some logic for handling dictionary arrays when pretty printing batches. Part of a larger set of work I am working on for better DictionaryArray handling: https://issues.apache.org/jira/browse/ARROW-10159

Closes apache#8331 from alamb/alamb/ARROW-10162-pretty-dictionary

Authored-by: alamb <[email protected]>
Signed-off-by: Jorge C. Leitao <[email protected]>
…onContextState>

This makes this initialization public, which is natural as `ExecutionContext`'s only attribute is `ExecutionContextState`.

Closes apache#8335 from jorgecarleitao/pub_from_state

Authored-by: Jorge C. Leitao <[email protected]>
Signed-off-by: Jorge C. Leitao <[email protected]>
This PR adds a new test job for a big-endian Java environment to TravisCI. The s390x job is still kept as
```
   allow_failures:
     - arch: s390x
```

Closes apache#7938 from kiszk/ARROW-9701

Lead-authored-by: Kazuaki Ishizaki <[email protected]>
Co-authored-by: Sutou Kouhei <[email protected]>
Signed-off-by: Sutou Kouhei <[email protected]>
@kszucs @nevi-me @andygrove @paddyhoran what do you think about this? (please tag others if relevant)

The idea being:

* PRs that change code in `rust/` are labeled `rust-lang`
* PRs that change code in `rust/datafusion` are labeled `datafusion`

The problems it solves:

* it is currently difficult to filter PRs by component, as we cannot bookmark a view for that filter
* It is currently more difficult to find the component in the title than on the labels, due to the coloring
* PRs may change code in `rust/` even though it is not labeled as such, and folks in Rust would probably like to know about it

Since @andygrove has been adding labels, I thought that we could automate it as an action.

I did not expand to other labels as I am not sure other parts use github labels at all.
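A configuration for `actions/labeler` along these lines would map path globs to labels (the globs here are illustrative; the actual file in the PR may differ):

```yaml
# .github/labeler.yml (sketch)
rust-lang:
  - rust/**/*
datafusion:
  - rust/datafusion/**/*
```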

Closes apache#8324 from jorgecarleitao/labeler

Authored-by: Jorge C. Leitao <[email protected]>
Signed-off-by: Sutou Kouhei <[email protected]>
Closes apache#8339 from romainfrancois/ARROW-9786/unvendorcpp11

Authored-by: Romain Francois <[email protected]>
Signed-off-by: Neal Richardson <[email protected]>
See: https://issues.apache.org/jira/browse/ARROW-9941

Before:

```
>>> ty
IntegerType(extension<arrow.py_extension_type>)
>>> str(ty)
'extension<arrow.py_extension_type>'
>>> repr(ty)
'IntegerType(extension<arrow.py_extension_type>)'
>>> pa.DataType.__str__(ty)
'extension<arrow.py_extension_type>'
```

After:

```
>>> ty
IntegerType(DataType(int64))
>>> str(ty)
'extension<arrow.py_extension_type<IntegerType>>'
>>> repr(ty)
'IntegerType(DataType(int64))'
>>> pa.DataType.__str__(ty)
'extension<arrow.py_extension_type<IntegerType>>'
```

Closes apache#8312 from dianaclarke/ARROW-9941

Authored-by: Diana Clarke <[email protected]>
Signed-off-by: Antoine Pitrou <[email protected]>
* Upgrades to TypeScript 4.0.2
* Fixes typings for TypeScript versions 3.9+

Closes https://issues.apache.org/jira/browse/ARROW-8394

Closes apache#8216 from trxcllnt/fix/typescript-3.9-errors

Authored-by: ptaylor <[email protected]>
Signed-off-by: Neal Richardson <[email protected]>
…tion

Minimal change to allow this to be used on arm 32 targets.

A better solution may be to utilise the crc32 instruction à la the optimised Intel implementation, but this is better than nothing. There should probably be a crate we could import the hash implementation from, rather than having to repeat optimisations everywhere.

Closes apache#8338 from andrew-stevenson-frequenz/issue-8735

Authored-by: Andrew Stevenson <[email protected]>
Signed-off-by: Andy Grove <[email protected]>
…t columns

Parquet row group statistics did not respect dict encoding. Also added a workaround to support filtering a dictionary encoded column.

Closes apache#8311 from bkietz/10008-dict-row-group-stats

Authored-by: Benjamin Kietzman <[email protected]>
Signed-off-by: Benjamin Kietzman <[email protected]>
When a dictionary changes from the previous batch, it is emitted again in the IPC stream.
If this happens when writing the IPC file format, an error is returned.

Closes apache#8302 from pitrou/ARROW-10121-ipc-emit-dicts

Authored-by: Antoine Pitrou <[email protected]>
Signed-off-by: Wes McKinney <[email protected]>
Closes apache#8304 from kszucs/awsforneal

Lead-authored-by: Neal Richardson <[email protected]>
Co-authored-by: Sutou Kouhei <[email protected]>
Co-authored-by: Krisztián Szűcs <[email protected]>
Signed-off-by: Neal Richardson <[email protected]>
…rializable

See: https://issues.apache.org/jira/browse/ARROW-10147

Before:

```
>>> import numpy as np
>>> import pyarrow as pa
>>> import pandas as pd
>>> idx = pd.RangeIndex(0, 4, name=np.int64(6))
>>> df = pd.DataFrame(index=idx)
>>> pa.table(df)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "pyarrow/table.pxi", line 2042, in pyarrow.lib.table
    return Table.from_pandas(data, schema=schema, nthreads=nthreads)
  File "pyarrow/table.pxi", line 1394, in pyarrow.lib.Table.from_pandas
    arrays, schema = dataframe_to_arrays(
  File "/Users/diana/workspace/arrow/python/pyarrow/pandas_compat.py", line 604, in dataframe_to_arrays
    pandas_metadata = construct_metadata(df, column_names, index_columns,
  File "/Users/diana/workspace/arrow/python/pyarrow/pandas_compat.py", line 237, in construct_metadata
    b'pandas': json.dumps({
  File "/Library/Frameworks/Python.framework/Versions/3.8/lib/python3.8/json/__init__.py", line 231, in dumps
    return _default_encoder.encode(obj)
  File "/Library/Frameworks/Python.framework/Versions/3.8/lib/python3.8/json/encoder.py", line 199, in encode
    chunks = self.iterencode(o, _one_shot=True)
  File "/Library/Frameworks/Python.framework/Versions/3.8/lib/python3.8/json/encoder.py", line 257, in iterencode
    return _iterencode(o, 0)
  File "/Library/Frameworks/Python.framework/Versions/3.8/lib/python3.8/json/encoder.py", line 179, in default
    raise TypeError(f'Object of type {o.__class__.__name__} '
TypeError: Object of type int64 is not JSON serializable
```

After:

```
>>> import numpy as np
>>> import pyarrow as pa
>>> import pandas as pd
>>> idx = pd.RangeIndex(0, 4, name=np.int64(6))
>>> df = pd.DataFrame(index=idx)
>>> table = pa.table(df)
>>> metadata = table.schema.pandas_metadata
>>> import pprint
>>> pprint.pprint(metadata)
{'column_indexes': [{'field_name': None,
                     'metadata': None,
                     'name': None,
                     'numpy_type': 'object',
                     'pandas_type': 'empty'}],
 'columns': [],
 'creator': {'library': 'pyarrow',
             'version': '2.0.0.dev381+g06830c954.d20201001'},
 'index_columns': [{'kind': 'range',
                    'name': '6',
                    'start': 0,
                    'step': 1,
                    'stop': 4}],
 'pandas_version': '1.1.2'}
```

Closes apache#8314 from dianaclarke/ARROW-10147

Authored-by: Diana Clarke <[email protected]>
Signed-off-by: Wes McKinney <[email protected]>
…ames.

This adds a check, during logical planning, to the uniqueness of column names after a projection and aggregation.

This eliminates the risk that a user creates a wrong logical query that silently picks the first column named X when there are two columns named X.

This does not eliminate this risk when the column name is duplicated on a scan.
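The planning-time check can be sketched as a simple uniqueness validation over the output schema's column names (a minimal sketch, not the actual DataFusion code):

```rust
use std::collections::HashSet;

// Reject a projection/aggregation whose output column names collide.
fn validate_unique_names(names: &[&str]) -> Result<(), String> {
    let mut seen = HashSet::new();
    for name in names {
        if !seen.insert(*name) {
            return Err(format!("duplicate column name: {}", name));
        }
    }
    Ok(())
}

fn main() {
    assert!(validate_unique_names(&["a", "b"]).is_ok());
    assert!(validate_unique_names(&["x", "x"]).is_err());
}
```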

Closes apache#8334 from jorgecarleitao/check_schema

Authored-by: Jorge C. Leitao <[email protected]>
Signed-off-by: Jorge C. Leitao <[email protected]>
This PR fixes a bug on which GenericStringArray::from(ArrayData) was able to be created with the wrong DataType.

It also simplifies generics that use `GenericStringArray`, as they no longer need to pass the corresponding `DataType` and can instead rely on a `const` to do it for them.

Closes apache#8296 from jorgecarleitao/check_datatypes

Authored-by: Jorge C. Leitao <[email protected]>
Signed-off-by: Jorge C. Leitao <[email protected]>
It seems that I used a notation in the labeler that [is only supported in the master version of it](actions/labeler#73 (comment)), which is causing all PRs to fail. This PR fixes that. I verified it on my own fork: https://github.com/jorgecarleitao/arrow/pull/12

Closes apache#8350 from jorgecarleitao/labeler

Authored-by: Jorge C. Leitao <[email protected]>
Signed-off-by: Sutou Kouhei <[email protected]>
This fixes a small typo in the example provided for the C data interface: the [specification](https://arrow.apache.org/docs/format/CDataInterface.html#data-type-description-format-strings) uses `"i"` (india) to represent i32, but the [example](https://arrow.apache.org/docs/format/CDataInterface.html#exporting-a-simple-int32-array) uses `"l"` (lima) to represent i32.

Closes apache#8357 from jorgecarleitao/fix_example

Authored-by: Jorge C. Leitao <[email protected]>
Signed-off-by: Antoine Pitrou <[email protected]>
Also renamed an example that referred to the old name "table_api".

Closes apache#8355 from jorgecarleitao/fix

Authored-by: Jorge C. Leitao <[email protected]>
Signed-off-by: Andy Grove <[email protected]>
…sts, by using standard pretty printer

This PR removes (most of) the special case pretty printing code in DataFusion's sql integration test, `sql.rs` in favor of the standard pretty printer in `arrow::utils::pretty` and in the process adds support for DictionaryArray printing as well as standardizing how tests are run and output is compared.

I am working on adding support for `DictionaryArray` in DataFusion (and thus want to add tests to sql.rs). This specific PR's changes are larger than strictly necessary, but I felt it made it easier to write new tests in sql.rs. However, if people prefer, I could instead add a special case for `DictionaryArray` in `array_str` and accomplish my goal with a much smaller PR

Note: I found that using `Vec<Vec<String>>` to encode expected results rather than a `String` retains the nice property that differences to expected output are shown reasonably in the test output. For example:

```
---- csv_query_cast_literal stdout ----
thread 'csv_query_cast_literal' panicked at 'assertion failed: `(left == right)`
  left: `[["0.9294097332465232", "1.0"], ["0.3114712539863804", "1.0"]]`,
 right: `[["0.9294097332465232", "1"], ["0.3114712539863804", "1"]]`', datafusion/tests/sql.rs:502:5
```

Closes apache#8333 from alamb/alamb/ARROW-10167-cleanup-sql-display

Authored-by: alamb <[email protected]>
Signed-off-by: Andy Grove <[email protected]>
Closes apache#8342 from pitrou/ARROW-10120-nested-pq-benchmarks

Authored-by: Antoine Pitrou <[email protected]>
Signed-off-by: Antoine Pitrou <[email protected]>
…rray

```
pyarrow/tests/test_convert_builtin.py::test_fixed_size_binary_length_check ../src/arrow/array/builder_binary.cc:53:  Check failed: (size) == (byte_width_) Appending wrong size to FixedSizeBinaryBuilder
0   libarrow.200.0.0.dylib              0x000000010e7f9704 _ZN5arrow4util7CerrLog14PrintBackTraceEv + 52
1   libarrow.200.0.0.dylib              0x000000010e7f9622 _ZN5arrow4util7CerrLogD2Ev + 98
2   libarrow.200.0.0.dylib              0x000000010e7f9585 _ZN5arrow4util7CerrLogD1Ev + 21
3   libarrow.200.0.0.dylib              0x000000010e7f95ac _ZN5arrow4util7CerrLogD0Ev + 28
4   libarrow.200.0.0.dylib              0x000000010e7f9492 _ZN5arrow4util8ArrowLogD2Ev + 82
5   libarrow.200.0.0.dylib              0x000000010e7f94c5 _ZN5arrow4util8ArrowLogD1Ev + 21
6   libarrow.200.0.0.dylib              0x000000010e303ec1 _ZN5arrow22FixedSizeBinaryBuilder14CheckValueSizeEx + 209
7   libarrow.200.0.0.dylib              0x000000010e30c361 _ZN5arrow22FixedSizeBinaryBuilder12UnsafeAppendEN6nonstd7sv_lite17basic_string_viewIcNSt3__111char_traitsIcEEEE + 49
8   libarrow_python.200.0.0.dylib       0x000000010b4efa7d _ZN5arrow2py20PyPrimitiveConverterINS_19FixedSizeBinaryTypeEvE6AppendEP7_object + 813
```

The input `const char*` value gets implicitly casted to `string_view` which makes the length check fail in debug builds.

Closes apache#8360 from kszucs/ARROW-10193

Authored-by: Krisztián Szűcs <[email protected]>
Signed-off-by: Krisztián Szűcs <[email protected]>
…g array to Pandas

Fix a crash on conversion of e.g. struct of dictionaries.

Closes apache#8361 from pitrou/ARROW-10192-struct-of-dict

Lead-authored-by: Antoine Pitrou <[email protected]>
Co-authored-by: Krisztián Szűcs <[email protected]>
Signed-off-by: Krisztián Szűcs <[email protected]>
Closes apache#8369 from bkietz/10176-Nightly-valgrind-job-fail

Authored-by: Benjamin Kietzman <[email protected]>
Signed-off-by: Antoine Pitrou <[email protected]>
This is a follow-up to apache#7938, which forgot to download a required dll, `libprotobuf.so.18`, at build time.

Closes apache#8368 from kiszk/ARROW-10200

Authored-by: Kazuaki Ishizaki <[email protected]>
Signed-off-by: Sutou Kouhei <[email protected]>
jensenrichardson and others added 25 commits November 6, 2020 05:32
Adds CMake cases for the Intel compiler, which allows compilation using it. Also fixes the unknown compiler message, making it "Unknown compiler: <compiler id> <compiler version>" instead of "Unknown compiler: <compiler version> <compiler version>".

Closes apache#8601 from jensenrichardson/master

Authored-by: jensenrichardson <[email protected]>
Signed-off-by: Sutou Kouhei <[email protected]>
…builds

Closes apache#8599 from xhochy/ARROW-10502

Authored-by: Uwe L. Korn <[email protected]>
Signed-off-by: Sutou Kouhei <[email protected]>
…nt headers.

PR for https://issues.apache.org/jira/browse/ARROW-10467

Add a CallOption for specifying arbitrary properties as headers for RPC calls.
Add a ServerMiddleware that will intercept and pass the properties to a handler
when registered via the FlightServer builder.

Closes apache#8572 from kylepbit/ARROW-10467

Authored-by: Kyle Porter <[email protected]>
Signed-off-by: David Li <[email protected]>
Closes apache#8580 from jorisvandenbossche/ARROW-10482

Authored-by: Joris Van den Bossche <[email protected]>
Signed-off-by: Uwe L. Korn <[email protected]>
Add a ClientMiddleware that can read HTTP Set-Cookie and Set-Cookie2 headers
from server responses and transmit corresponding Cookie headers

Closes apache#8554 from jduo/cookie-handling

Lead-authored-by: James Duong <[email protected]>
Co-authored-by: Keerat Singh <[email protected]>
Co-authored-by: Keerat Singh <[email protected]>
Signed-off-by: David Li <[email protected]>
This change adds support for `LargeList` in `take()`.

There is an additional update to the underlying implementation of `take()` such that the indices may be any `PrimitiveArray` of `ArrowNumericType`, rather than only `UInt32Array`. This change is motivated by the recursive call to `take()` in `take_list()` ([here](https://github.com/apache/arrow/blob/b109195b77d85e513aab80650bd4b193e26a5471/rust/arrow/src/compute/kernels/take.rs#L324)), since in order to support `LargeListArray`, which use `i64` offsets, the recursive call must support indices arrays that are `Int64Array`.
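The generalization of the index type can be sketched outside the arrow crate as follows (signatures are illustrative, not the actual `take()` kernel): any integer index type convertible to `usize` works, so `i64` indices, as `LargeListArray` offsets require, are accepted alongside the original `u32`.

```rust
use std::convert::TryInto;

// Illustrative `take`: gathers `values` at `indices` of any integer type.
fn take<T: Clone, I: TryInto<usize> + Copy>(values: &[T], indices: &[I]) -> Vec<T>
where
    <I as TryInto<usize>>::Error: std::fmt::Debug,
{
    indices
        .iter()
        .map(|&i| values[i.try_into().unwrap()].clone())
        .collect()
}

fn main() {
    let values = vec!["a", "b", "c"];
    assert_eq!(take(&values, &[2u32, 0]), vec!["c", "a"]); // u32 indices (old behaviour)
    assert_eq!(take(&values, &[1i64, 1]), vec!["b", "b"]); // i64 indices (LargeList case)
}
```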

Closes apache#8556 from drusso/ARROW-10378

Authored-by: Daniel Russo <[email protected]>
Signed-off-by: Neville Dipale <[email protected]>
This allows binary arrays to be iterated and be constructed from an iterator. This is analogous to what we already support for primitive and string arrays.

Closes apache#8576 from jorgecarleitao/iter_binary

Authored-by: Jorge C. Leitao <[email protected]>
Signed-off-by: Neville Dipale <[email protected]>
… or Option<String>

Currently, our code-base supports creating a `StringArray` from an iterator of `String`.

However, from arrow's perspective, we should not care whether it is a `String` or a `&str`, as long as it can be represented via `AsRef<str>`. A user is sometimes able to create an iterator of `&str` instead of `String`, and should not have to convert one to the other before passing it to Arrow.

This PR makes this change.
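The pattern can be sketched generically (a minimal sketch, not the actual `StringArray` constructor): a single `AsRef<str>` bound accepts both `&str` and `String` items.

```rust
// Accepts iterators over &str, String, or anything string-like.
fn total_len<I, S>(iter: I) -> usize
where
    I: IntoIterator<Item = S>,
    S: AsRef<str>,
{
    iter.into_iter().map(|s| s.as_ref().len()).sum()
}

fn main() {
    assert_eq!(total_len(vec!["a", "bb"]), 3);           // &str items
    assert_eq!(total_len(vec![String::from("ccc")]), 3); // String items
}
```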

Closes apache#8575 from jorgecarleitao/array_from_str

Authored-by: Jorge C. Leitao <[email protected]>
Signed-off-by: Neville Dipale <[email protected]>
This changes a list datatype to use `Box<Field>` instead of `Box<DataType>`.

This change is needed in order to make both Parquet and IPC roundtrips pass.
The C++ implementation uses `Field`, as it allows preserving list field names and nullability.

There are some implementation details which we did not cover in this PR, and will work on as a follow-up, and these are:
* Documenting the naming conventions of Arrow and Parquet on list names (Arrow C++ uses "item", Parquet has different compatibility options)
* Fixing Parquet failing list tests that have been ignored (I'd like to do this separately, as this isn't a full fix)
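Why the `Field` matters can be sketched with simplified types (these are not the arrow crate's actual definitions): the field carries the child's name and nullability, which a bare `DataType` cannot.

```rust
#[derive(Debug, PartialEq)]
enum DataType {
    Int64,
    // The list's child is a full Field, not just a DataType.
    List(Box<Field>),
}

#[derive(Debug, PartialEq)]
struct Field {
    name: String,
    data_type: DataType,
    nullable: bool,
}

fn main() {
    let list = DataType::List(Box::new(Field {
        name: "item".to_string(), // e.g. Arrow C++'s conventional child name
        data_type: DataType::Int64,
        nullable: true,
    }));
    if let DataType::List(field) = &list {
        // Name and nullability survive, so IPC/Parquet roundtrips can preserve them.
        assert_eq!(field.name, "item");
        assert!(field.nullable);
    }
}
```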

Closes apache#8608 from nevi-me/ARROW-10261-BREAKING

Authored-by: Neville Dipale <[email protected]>
Signed-off-by: Neville Dipale <[email protected]>
This PR:

1. Makes `DictionaryArray::keys` be an `PrimitiveArray`.
2. Removes `NullIter` and many of the `unsafe` code that it contained
3. Simplifies the parquet writer implementation around dictionaries
4. Indirectly removes a bug on NullIter's `DoubleEndedIterator` on which the implementation was not following the spec with respect to `end == current` (compare with implementation of `DoubleEndedIterator` for std's `Vec`'s `IntoIter`).
5. Fixes error in computing the size of a dictionary, which was double-counting `data` and `values` (`values` is a reference to `data`)
6. Adds check that the dictionary's ArrayData's datatype matches `T::DATA_TYPE`.

Since `NullIter` was not being directly tested, no tests were removed.

This is backward incompatible and the migration requires replacing `dict.keys()`  by `dict.keys().into_iter()` whenever the user's intention is to use `keys()` as an iterator.

Closes apache#8561 from jorgecarleitao/dictionary_clean

Authored-by: Jorge C. Leitao <[email protected]>
Signed-off-by: Neville Dipale <[email protected]>
This PR removes `PrimitiveArray::new`. `PrimitiveArray::new` is `pub`, but it is dangerous because:
* when the buffer's content is not aligned with T, many of its methods cause UB
* when used with `null_count > 0`, many calls panic because the null bitmap is `None` while `null_count != 0`
* when used with `null_count > 0`, it creates an array that is out of spec (the buffer for the null bitmap is `None` but the null count is not zero)

Since:
* a change in this method's signature (to either add the bitmap or remove `null_count`) requires a backward incompatible change
* it is only used in tests
* we have good alternatives for creating primitive arrays:
    * from an ArrayData,
    * from a vector or vector of optionals
    * from an iterator

This PR removes it.
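One of the remaining safe routes, building from an iterator of optionals, can be sketched with simplified types (not the actual arrow implementation): the null bitmap is derived from the data itself, so it can never disagree with the null count.

```rust
struct SimpleArray {
    values: Vec<i64>,
    validity: Vec<bool>, // stands in for the null bitmap
}

impl FromIterator<Option<i64>> for SimpleArray {
    fn from_iter<I: IntoIterator<Item = Option<i64>>>(iter: I) -> Self {
        let mut values = Vec::new();
        let mut validity = Vec::new();
        for v in iter {
            validity.push(v.is_some());
            values.push(v.unwrap_or(0)); // slot value for nulls is arbitrary
        }
        SimpleArray { values, validity }
    }
}

fn main() {
    let a: SimpleArray = vec![Some(1), None, Some(3)].into_iter().collect();
    assert_eq!(a.validity, vec![true, false, true]);
    assert_eq!(a.values, vec![1, 0, 3]);
}
```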

Closes apache#8560 from jorgecarleitao/remove_new

Authored-by: Jorge C. Leitao <[email protected]>
Signed-off-by: Neville Dipale <[email protected]>
This puts it in line with `array::NullIter`. This PR is complementary to apache#8561 (which removes `NullIter`), so that the PrimitiveArrayIterator can be reversed in the same way `NullIter` can. Together with apache#8561, it re-allows Dictionary keys to be iterated backwards.

Closes apache#8562 from jorgecarleitao/double_ended

Authored-by: Jorge C. Leitao <[email protected]>
Signed-off-by: Neville Dipale <[email protected]>
[ARROW-10510](https://issues.apache.org/jira/browse/ARROW-10510)

This change adds benchmarks for `COUNT(DISTINCT)` queries. This is a small follow-up to [ARROW-10043](https://issues.apache.org/jira/browse/ARROW-10043) / apache#8222. In that PR, a number of implementation ideas were discussed for follow-ups, and having benchmarks will help evaluate them.

---

There are two benchmarks added:

* wide: all of the values are distinct; this is looking at worst-case performance
* narrow: only a handful of distinct values; this is closer to best-case performance

The wide benchmark runs ~ 7x slower than the narrow benchmark.

Closes apache#8606 from drusso/ARROW-10510

Authored-by: Daniel Russo <[email protected]>
Signed-off-by: Neville Dipale <[email protected]>
This is a major refactor of the `equal.rs` module.

The rationale for this change is manifold:

* currently, array comparison requires downcasting the array ref to its concrete type. This is painful and not very ergonomic, as the user must "guess" what to downcast to for comparison. We can see this in the hacks around the `sort`, `take` and `concatenate` kernels' tests, and some of the tests of the builders.
* the code in array comparison is difficult to follow given the amount of offset juggling it performs.
* the implementation currently indirectly uses many of the `unsafe` APIs that we have (via pointer arithmetic), which makes it risky to operate on and mutate.
* some code is being repeated.

This PR:

1. adds `impl PartialEq for dyn Array`, to allow `Array` comparison based on `Array::data` (main change)
2. Makes array equality to only depend on `ArrayData`, i.e. it no longer depends on concrete array types (such as `PrimitiveArray` and related API) to perform comparisons.
3. Significantly reduces the risk of panics and UB when composite arrays are of different types, by checking the types on `range` comparison
4. Makes array equality be statically dispatched, via `match datatype`.
5. DRY the code around array equality
6. Fixes an error in equality of dictionary with equal values
7. Added tests to equalities that were not tested (fixed binary, some edge cases of dictionaries)
8. splits `equal.rs` in smaller, more manageable files.
9. Removes `ArrayListOps`, since it is no longer needed
10. Moves Json equality to its own module, for clarity.
11. removes the need to have two functions per type to compare arrays.
12. Adds the number of buffers and their respective width to datatypes from the specification. This was backported from apache#8401
13. adds a benchmark for array equality

Note that this does not implement `PartialEq` for `ArrayData`, only for `dyn Array`, as different data does not imply different arrays (due to nullability). That implementation is being worked on in apache#8200.

IMO this PR significantly simplifies the code around array comparison, to the point where many implementations are 5 lines long.

This also improves performance by 10-40%.

<details>
 <summary>Benchmark results</summary>

```
Previous HEAD position was 3dd3c69d7 Added bench for equality.
Switched to branch 'equal'
Your branch is up to date with 'origin/equal'.
   Compiling arrow v3.0.0-SNAPSHOT (/Users/jorgecarleitao/projects/arrow/rust/arrow)
    Finished bench [optimized] target(s) in 51.28s
     Running /Users/jorgecarleitao/projects/arrow/rust/target/release/deps/equal-176c3cb11360bd12
Gnuplot not found, using plotters backend
equal_512               time:   [36.861 ns 36.894 ns 36.934 ns]
                        change: [-43.752% -43.400% -43.005%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 12 outliers among 100 measurements (12.00%)
  7 (7.00%) high mild
  5 (5.00%) high severe

equal_nulls_512         time:   [2.3271 us 2.3299 us 2.3331 us]
                        change: [-10.846% -9.0877% -7.7336%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 11 outliers among 100 measurements (11.00%)
  4 (4.00%) high mild
  7 (7.00%) high severe

equal_string_512        time:   [49.219 ns 49.347 ns 49.517 ns]
                        change: [-30.789% -30.538% -30.235%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 9 outliers among 100 measurements (9.00%)
  3 (3.00%) high mild
  6 (6.00%) high severe

equal_string_nulls_512  time:   [3.7873 us 3.7939 us 3.8013 us]
                        change: [-8.2944% -7.0636% -5.4266%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 12 outliers among 100 measurements (12.00%)
  4 (4.00%) high mild
  8 (8.00%) high severe
```

</details>

All tests are there, plus new tests for some of the edge cases and untested arrays.

This change is backward incompatible: `array1.equals(&array2)` no longer works; use `array1 == array2` instead, which is the idiomatic way of comparing structs and trait objects in Rust.
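The core idea, `PartialEq` implemented on the trait object and delegating to the underlying data, can be sketched with simplified types (not the arrow crate's actual definitions):

```rust
#[derive(Debug, PartialEq)]
struct ArrayData {
    values: Vec<i64>,
}

trait Array {
    fn data(&self) -> &ArrayData;
}

// Equality on the trait object delegates to the underlying data,
// so callers never need to downcast to a concrete array type.
impl PartialEq for dyn Array {
    fn eq(&self, other: &Self) -> bool {
        self.data() == other.data()
    }
}

struct Int64Array(ArrayData);

impl Array for Int64Array {
    fn data(&self) -> &ArrayData {
        &self.0
    }
}

fn main() {
    let a = Int64Array(ArrayData { values: vec![1, 2] });
    let b = Int64Array(ArrayData { values: vec![1, 2] });
    let x: &dyn Array = &a;
    let y: &dyn Array = &b;
    assert!(x == y); // no downcasting needed
}
```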

Closes apache#8541 from jorgecarleitao/equal

Authored-by: Jorge C. Leitao <[email protected]>
Signed-off-by: Neville Dipale <[email protected]>
Basically a reopening of apache#8393. After further discussion, this provides the cleanest contribution in terms of IP clearances and git history, and it should be fairly straightforward to manage the Julia-side of package release registering.

Closes apache#8547 from quinnj/jq/ARROW-10228

Authored-by: Jacob Quinn <[email protected]>
Signed-off-by: Neal Richardson <[email protected]>
…ath issue on Windows

Closes apache#8539 from jorisvandenbossche/ci-windows-fsspec

Authored-by: Joris Van den Bossche <[email protected]>
Signed-off-by: Joris Van den Bossche <[email protected]>
…e on CI

Closes apache#8573 from jorisvandenbossche/ARROW-10471-s3fs

Authored-by: Joris Van den Bossche <[email protected]>
Signed-off-by: Joris Van den Bossche <[email protected]>
At some point, Intel changed `__force_inline` to `__forceinline`, and Arrow does not compile without accounting for this. This changes the header file to define the macro as `__forceinline` instead of `__force_inline`. Source: [here](https://software.intel.com/content/www/us/en/develop/documentation/cpp-compiler-developer-guide-and-reference/top/compiler-reference/compiler-options/compiler-option-details/inlining-options/inline-forceinline-qinline-forceinline.html)

Closes apache#8600 from jensenrichardson/patch-1

Authored-by: jensenrichardson <[email protected]>
Signed-off-by: Antoine Pitrou <[email protected]>
…lang+Windows

Closes apache#8604 from xhochy/ARROW-10509

Authored-by: Uwe L. Korn <[email protected]>
Signed-off-by: Antoine Pitrou <[email protected]>
…WS config

Closes apache#8568 from xhochy/ARROW-10346

Authored-by: Uwe L. Korn <[email protected]>
Signed-off-by: Antoine Pitrou <[email protected]>
Separate Mode and Variance/Stddev kernels registration from basic
aggregation kernels.

Closes apache#8523 from cyb70289/agg-refine

Authored-by: Yibo Cai <[email protected]>
Signed-off-by: Antoine Pitrou <[email protected]>
Remove inclusion of heavyweights "arrow/api.h" and "arrow/builder.h", among other changes.

This seems to reduce CPU build time by about 3%.

Closes apache#8511 from pitrou/ARROW-7531-header-reduction

Authored-by: Antoine Pitrou <[email protected]>
Signed-off-by: Antoine Pitrou <[email protected]>
…nd string types

Closes apache#8529 from david1437/ARROW-5394

Lead-authored-by: david <[email protected]>
Co-authored-by: Antoine Pitrou <[email protected]>
Signed-off-by: Antoine Pitrou <[email protected]>
This PR splits the `array.rs` into multiple modules, so that it is a bit easier to navigate in the project, particularly `array.rs`. Semantically, there should be no difference, and this change is also backward compatible, public items are only exposed in `mod.rs`.

This change was initially proposed by @paddyhoran on ARROW-9361, and this is just a proposed implementation.

Some notes:

* Each commit is a different array
* I named the files `array_[type]` to distinguish from other modules that do not contain arrays.
* `null` and `Union` were not renamed
* Some functions were moved around / incorporated in `impl` since they were used in multiple places.

Closes apache#8543 from jorgecarleitao/clean_slice

Authored-by: Jorge C. Leitao <[email protected]>
Signed-off-by: Neville Dipale <[email protected]>
@alamb alamb closed this Nov 10, 2020
@alamb alamb deleted the alamb/update_docs branch December 7, 2020 19:30
alamb pushed a commit that referenced this pull request Apr 7, 2021
From a deadlocked run...

```
#0  0x00007f8a5d48dccd in __lll_lock_wait () from /lib64/libpthread.so.0
#1  0x00007f8a5d486f05 in pthread_mutex_lock () from /lib64/libpthread.so.0
#2  0x00007f8a566e7e89 in arrow::internal::FnOnce<void ()>::FnImpl<arrow::Future<Aws::Utils::Outcome<Aws::S3::Model::ListObjectsV2Result, Aws::S3::S3Error> >::Callback<arrow::fs::(anonymous namespace)::TreeWalker::ListObjectsV2Handler> >::invoke() () from /arrow/r/check/arrow.Rcheck/arrow/libs/arrow.so
#3  0x00007f8a5650efa0 in arrow::FutureImpl::AddCallback(arrow::internal::FnOnce<void ()>) () from /arrow/r/check/arrow.Rcheck/arrow/libs/arrow.so
#4  0x00007f8a566e67a9 in arrow::fs::(anonymous namespace)::TreeWalker::ListObjectsV2Handler::SpawnListObjectsV2() () from /arrow/r/check/arrow.Rcheck/arrow/libs/arrow.so
#5  0x00007f8a566e723f in arrow::fs::(anonymous namespace)::TreeWalker::WalkChild(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, int) () from /arrow/r/check/arrow.Rcheck/arrow/libs/arrow.so
#6  0x00007f8a566e827d in arrow::internal::FnOnce<void ()>::FnImpl<arrow::Future<Aws::Utils::Outcome<Aws::S3::Model::ListObjectsV2Result, Aws::S3::S3Error> >::Callback<arrow::fs::(anonymous namespace)::TreeWalker::ListObjectsV2Handler> >::invoke() () from /arrow/r/check/arrow.Rcheck/arrow/libs/arrow.so
#7  0x00007f8a5650efa0 in arrow::FutureImpl::AddCallback(arrow::internal::FnOnce<void ()>) () from /arrow/r/check/arrow.Rcheck/arrow/libs/arrow.so
#8  0x00007f8a566e67a9 in arrow::fs::(anonymous namespace)::TreeWalker::ListObjectsV2Handler::SpawnListObjectsV2() () from /arrow/r/check/arrow.Rcheck/arrow/libs/arrow.so
#9  0x00007f8a566e723f in arrow::fs::(anonymous namespace)::TreeWalker::WalkChild(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, int) () from /arrow/r/check/arrow.Rcheck/arrow/libs/arrow.so
#10 0x00007f8a566e74b1 in arrow::fs::(anonymous namespace)::TreeWalker::DoWalk() () from /arrow/r/check/arrow.Rcheck/arrow/libs/arrow.so
```

The callback `ListObjectsV2Handler` is invoked recursively, and because the mutex is non-reentrant, the second acquisition deadlocks.

To fix it, I got rid of the mutex on `TreeWalker` by using `arrow::util::internal::TaskGroup` instead of manually tracking the number and status of in-flight requests.
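The TaskGroup-style fix can be sketched as follows. This is a minimal illustration, not Arrow's actual `TaskGroup` API: instead of guarding walker state with a mutex that a synchronously invoked callback would try to re-acquire, an atomic in-flight counter is decremented as each task completes, and the completion hook fires when it reaches zero. All names here (`MiniTaskGroup`, `Walk`) are hypothetical.

```cpp
#include <atomic>
#include <functional>
#include <utility>

// Minimal sketch of a TaskGroup-like tracker: no lock is held while a
// task's callback runs, so a callback that re-enters the walker
// synchronously (as ListObjectsV2Handler did) cannot deadlock.
class MiniTaskGroup {
 public:
  explicit MiniTaskGroup(std::function<void()> on_finished)
      : on_finished_(std::move(on_finished)) {}

  // Register one unit of work before spawning it.
  void AddTask() { in_flight_.fetch_add(1, std::memory_order_relaxed); }

  // Called when a task completes, possibly synchronously inside the
  // code that spawned it. Fires the hook once the count drops to zero.
  void FinishTask() {
    if (in_flight_.fetch_sub(1, std::memory_order_acq_rel) == 1) {
      on_finished_();
    }
  }

 private:
  std::atomic<int> in_flight_{0};
  std::function<void()> on_finished_;
};

// Simulate a recursive directory walk whose "callbacks" run
// synchronously: the shape that deadlocked the mutex-based version.
static void Walk(MiniTaskGroup& tg, int depth, int* visited) {
  tg.AddTask();
  ++*visited;
  if (depth > 0) {
    Walk(tg, depth - 1, visited);  // re-entry while a task is in flight
  }
  tg.FinishTask();
}
```

Because completion is detected with a single atomic decrement rather than a critical section, re-entrant completion is safe by construction.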

Closes apache#9842 from westonpace/bugfix/arrow-12040

Lead-authored-by: Weston Pace <[email protected]>
Co-authored-by: Antoine Pitrou <[email protected]>
Signed-off-by: Antoine Pitrou <[email protected]>
alamb pushed a commit that referenced this pull request Dec 22, 2021
Before change:

```
Direct leak of 65536 byte(s) in 1 object(s) allocated from:
    #0 0x522f09 in
    #1 0x7f28ae5826f4 in
    #2 0x7f28ae57fa5d in
    #3 0x7f28ae58cb0f in
    #4 0x7f28ae58bda0 in
    ...
```

After change:
```
Direct leak of 65536 byte(s) in 1 object(s) allocated from:
    #0 0x522f09 in posix_memalign (/build/cpp/debug/arrow-dataset-file-csv-test+0x522f09)
    #1 0x7f28ae5826f4 in arrow::(anonymous namespace)::SystemAllocator::AllocateAligned(long, unsigned char**) /arrow/cpp/src/arrow/memory_pool.cc:213:24
    #2 0x7f28ae57fa5d in arrow::BaseMemoryPoolImpl<arrow::(anonymous namespace)::SystemAllocator>::Allocate(long, unsigned char**) /arrow/cpp/src/arrow/memory_pool.cc:405:5
    #3 0x7f28ae58cb0f in arrow::PoolBuffer::Reserve(long) /arrow/cpp/src/arrow/memory_pool.cc:717:9
    #4 0x7f28ae58bda0 in arrow::PoolBuffer::Resize(long, bool) /arrow/cpp/src/arrow/memory_pool.cc:741:7
    ...
```
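For context, symbolized sanitizer reports like the "after" output above generally require ASAN to find a symbolizer at runtime. The sketch below shows one common way to configure this in CI; it is an assumption for illustration, and the actual change in the PR may differ (the symbolizer path and test binary name are placeholders).

```shell
# Hedged sketch: point ASAN at llvm-symbolizer so leak/error reports
# show function names and source locations instead of bare addresses.
# Both paths below are assumptions, not taken from the PR.
export ASAN_SYMBOLIZER_PATH=/usr/bin/llvm-symbolizer
export ASAN_OPTIONS=symbolize=1
./debug/arrow-dataset-file-csv-test
```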

Closes apache#10498 from westonpace/feature/ARROW-13027--c-fix-asan-stack-traces-in-ci

Authored-by: Weston Pace <[email protected]>
Signed-off-by: Antoine Pitrou <[email protected]>
alamb pushed a commit that referenced this pull request Dec 22, 2021
Error log of Valgrind failure:
```
[----------] 3 tests from TestArrowReadDeltaEncoding
[ RUN      ] TestArrowReadDeltaEncoding.DeltaBinaryPacked
[       OK ] TestArrowReadDeltaEncoding.DeltaBinaryPacked (812 ms)
[ RUN      ] TestArrowReadDeltaEncoding.DeltaByteArray
==12587== Conditional jump or move depends on uninitialised value(s)
==12587==    at 0x4F12C57: Advance (bit_stream_utils.h:426)
==12587==    by 0x4F12C57: parquet::(anonymous namespace)::DeltaBitPackDecoder<parquet::PhysicalType<(parquet::Type::type)1> >::GetInternal(int*, int) (encoding.cc:2216)
==12587==    by 0x4F13823: Decode (encoding.cc:2091)
==12587==    by 0x4F13823: parquet::(anonymous namespace)::DeltaByteArrayDecoder::SetData(int, unsigned char const*, int) (encoding.cc:2360)
==12587==    by 0x4E89EF5: parquet::(anonymous namespace)::ColumnReaderImplBase<parquet::PhysicalType<(parquet::Type::type)6> >::InitializeDataDecoder(parquet::DataPage const&, long) (column_reader.cc:797)
==12587==    by 0x4E9AE63: ReadNewPage (column_reader.cc:614)
==12587==    by 0x4E9AE63: HasNextInternal (column_reader.cc:576)
==12587==    by 0x4E9AE63: parquet::internal::(anonymous namespace)::TypedRecordReader<parquet::PhysicalType<(parquet::Type::type)6> >::ReadRecords(long) (column_reader.cc:1228)
==12587==    by 0x4DFB19F: parquet::arrow::(anonymous namespace)::LeafReader::LoadBatch(long) (reader.cc:467)
==12587==    by 0x4DF513C: parquet::arrow::ColumnReaderImpl::NextBatch(long, std::shared_ptr<arrow::ChunkedArray>*) (reader.cc:108)
==12587==    by 0x4DFB74D: parquet::arrow::(anonymous namespace)::FileReaderImpl::ReadColumn(int, std::vector<int, std::allocator<int> > const&, parquet::arrow::ColumnReader*, std::shared_ptr<arrow::ChunkedArray>*) (reader.cc:273)
==12587==    by 0x4E11FDA: operator() (reader.cc:1180)
==12587==    by 0x4E11FDA: arrow::Future<std::vector<std::shared_ptr<arrow::ChunkedArray>, std::allocator<arrow::Future> > > arrow::internal::OptionalParallelForAsync<parquet::arrow::(anonymous namespace)::FileReaderImpl::DecodeRowGroups(std::shared_ptr<parquet::arrow::(anonymous namespace)::FileReaderImpl>, std::vector<int, std::allocator<int> > const&, std::vector<int, std::allocator<int> > const&, arrow::internal::Executor*)::{lambda(unsigned long, std::shared_ptr<parquet::arrow::ColumnReaderImpl>)#1}&, std::shared_ptr<parquet::arrow::ColumnReaderImpl>, std::shared_ptr<arrow::ChunkedArray> >(bool, std::vector<std::shared_ptr<parquet::arrow::ColumnReaderImpl>, std::allocator<arrow::Future<std::vector<std::shared_ptr<arrow::ChunkedArray>, std::allocator<arrow::Future> > > > >, parquet::arrow::(anonymous namespace)::FileReaderImpl::DecodeRowGroups(std::shared_ptr<parquet::arrow::(anonymous namespace)::FileReaderImpl>, std::vector<int, std::allocator<int> > const&, std::vector<int, std::allocator<int> > const&, arrow::internal::Executor*)::{lambda(unsigned long, std::shared_ptr<parquet::arrow::ColumnReaderImpl>)#1}&, arrow::internal::Executor*) (parallel.h:95)
==12587==    by 0x4E126A9: parquet::arrow::(anonymous namespace)::FileReaderImpl::DecodeRowGroups(std::shared_ptr<parquet::arrow::(anonymous namespace)::FileReaderImpl>, std::vector<int, std::allocator<int> > const&, std::vector<int, std::allocator<int> > const&, arrow::internal::Executor*) (reader.cc:1198)
==12587==    by 0x4E12F50: parquet::arrow::(anonymous namespace)::FileReaderImpl::ReadRowGroups(std::vector<int, std::allocator<int> > const&, std::vector<int, std::allocator<int> > const&, std::shared_ptr<arrow::Table>*) (reader.cc:1160)
==12587==    by 0x4DFA2BC: parquet::arrow::(anonymous namespace)::FileReaderImpl::ReadTable(std::vector<int, std::allocator<int> > const&, std::shared_ptr<arrow::Table>*) (reader.cc:198)
==12587==    by 0x4DFA392: parquet::arrow::(anonymous namespace)::FileReaderImpl::ReadTable(std::shared_ptr<arrow::Table>*) (reader.cc:289)
==12587==    by 0x1DCE62: parquet::arrow::TestArrowReadDeltaEncoding::ReadTableFromParquetFile(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, std::shared_ptr<arrow::Table>*) (arrow_reader_writer_test.cc:4174)
==12587==    by 0x2266D2: parquet::arrow::TestArrowReadDeltaEncoding_DeltaByteArray_Test::TestBody() (arrow_reader_writer_test.cc:4209)
==12587==    by 0x4AD2C9B: void testing::internal::HandleSehExceptionsInMethodIfSupported<testing::Test, void>(testing::Test*, void (testing::Test::*)(), char const*) (gtest.cc:2607)
==12587==    by 0x4AC9DD1: void testing::internal::HandleExceptionsInMethodIfSupported<testing::Test, void>(testing::Test*, void (testing::Test::*)(), char const*) (gtest.cc:2643)
==12587==    by 0x4AA4C02: testing::Test::Run() (gtest.cc:2682)
==12587==    by 0x4AA563A: testing::TestInfo::Run() (gtest.cc:2861)
==12587==    by 0x4AA600F: testing::TestSuite::Run() (gtest.cc:3015)
==12587==    by 0x4AB631B: testing::internal::UnitTestImpl::RunAllTests() (gtest.cc:5855)
==12587==    by 0x4AD3CE7: bool testing::internal::HandleSehExceptionsInMethodIfSupported<testing::internal::UnitTestImpl, bool>(testing::internal::UnitTestImpl*, bool (testing::internal::UnitTestImpl::*)(), char const*) (gtest.cc:2607)
==12587==    by 0x4ACB063: bool testing::internal::HandleExceptionsInMethodIfSupported<testing::internal::UnitTestImpl, bool>(testing::internal::UnitTestImpl*, bool (testing::internal::UnitTestImpl::*)(), char const*) (gtest.cc:2643)
==12587==    by 0x4AB47B6: testing::UnitTest::Run() (gtest.cc:5438)
==12587==    by 0x4218918: RUN_ALL_TESTS() (gtest.h:2490)
==12587==    by 0x421895B: main (gtest_main.cc:52)
```

Closes apache#11725 from pitrou/ARROW-14704-parquet-valgrind

Authored-by: Antoine Pitrou <[email protected]>
Signed-off-by: Antoine Pitrou <[email protected]>