Skip to content

Conversation

@nevi-me
Copy link
Contributor

@nevi-me nevi-me commented Feb 12, 2019

An initial JSON reader for Rust, which reads JSON line-delimited files into record batches.

Supports:

  • schema inference and user-supplied schema
  • projection by column names
  • reading lists
  • reading lists and scalars, converting scalars to single-value lists

Excluded in this PR are:

  • Struct arrays
  • Nesting in lists. The list or struct is replaced with null for now
  • Extremely mixed types, e.g [int, bool, float, list(_)]

@nevi-me
Copy link
Contributor Author

nevi-me commented Feb 12, 2019

@andygrove @paddyhoran @sunchao PTAL. I've followed a similar structure to our CSV reader. I'm happy with this as a start, though there might be some code which I can share between JSON and CSV readers, and perhaps some cleaning up of the code.

I've left out nested struct support for now.

@nevi-me nevi-me changed the title ARROW-5450: [Rust] Basic JSON reader ARROW-4540: [Rust] Basic JSON reader Feb 12, 2019
nevi-me and others added 16 commits February 12, 2019 16:07
Enclose optional shell variable names by double quotations to suppress shell warnings that is like below:

```
+'[' == 1 ']'
/home/travis/build/apache/arrow/ci/travis_before_script_cpp.sh: line 81: [: ==: unary operator expected
```

Author: Kenta Murata <[email protected]>

Closes apache#3623 from mrkn/suppress_shell_warning_on_ci and squashes the following commits:

4db2e53 <Kenta Murata> Suppress shell warning on travis-ci
…fter docs migration

We no longer need to render the format docs as part of the Jekyll site, as they were ported to the master Sphinx project. As a result of this, the instructions to build the website were broken

Author: Wes McKinney <[email protected]>

Closes apache#3620 from wesm/ARROW-4481 and squashes the following commits:

013d13b <Wes McKinney> Remove generated specification docs from site after docs migration
MakeBuilder doesn't preserve the field name in the given ListType.
I think this is a bug.

Author: Kenta Murata <[email protected]>

Closes apache#3619 from mrkn/fix_make_builder_to_preserve_field_name_of_list_type and squashes the following commits:

e4d8597 <Kenta Murata> Fix c_glib's test
8fc747a <Kenta Murata> Fix MakeBuilder to preserve ListType's field name
…devices

This PR introduces new C++ methods
  `CudaContext::CopyDeviceToAnotherDevice`
  `CudaBuffer::CopyFromAnotherDevice`
that are used to implement the Python support for copying CUDA buffer data between different devices.

Author: Pearu Peterson <[email protected]>

Closes apache#3617 from pearu/arrow-3653 and squashes the following commits:

4de6c97 <Pearu Peterson> Skip tests on copying data between buffers of different devices on a single GPU system.
b10afc3 <Pearu Peterson> Support copying data in buffers of different devices.
This fixes two issues:
- object fields inside numpy structs should be allowed
- buggy ndarray indexing if stride % itemsize != 0

Author: Antoine Pitrou <[email protected]>

Closes apache#3614 from pitrou/ARROW-4181-py-struct-conversion-fixes and squashes the following commits:

9886e29 <Antoine Pitrou> ARROW-4181:  Fixes for Numpy struct array conversion
The varchar/varbinary vectors have a setSafe variant that takes an arrow buf and a start/end offsets. The method needs to expand the buffer to make space for 'end - start' bytes. but, due to a bug, it was expanding it to accommodate 'end' bytes, thereby making the value buffer much larger than required.

Author: Pindikura Ravindra <[email protected]>

Closes apache#3613 from pravindra/varchar and squashes the following commits:

1b4f224 <Pindikura Ravindra> ARROW-4532: fix review comments
8f88e3a <Pindikura Ravindra> ARROW-4532:  fix a size compute bug
This change seems to indicate an alternative would be to force the build to be Release type (and avoid the naming magic).  It doesn't appear that the conda feedstock (https://github.com/conda-forge/gtest-feedstock/blob/master/recipe/build.sh) builds debug for linux at all.  it does for windows though:
(https://github.com/conda-forge/gtest-feedstock/blob/master/recipe/bld.bat) I'm not sure if we care a lot about this?

Author: Micah Kornfield <[email protected]>
Author: emkornfield <[email protected]>

Closes apache#3589 from emkornfield/upgrade_gmock and squashes the following commits:

2df8f58 <emkornfield> fix case of "set"
200dec7 <Micah Kornfield> try naming for debug
e910d47 <Micah Kornfield> remove hack for different vendored vs conda gmock
af7740b <Micah Kornfield> Upgrade vendored gmock to 1.8.1
…ache#3571)

* Added (SIMD) bitwise AND/OR for `Buffer`

* Added benchmark

* Removed boolean kernels in favor of traits.

* Updated typo.

* Cleaned up code.

* Updated benchmark to include bitwise OR op
Enabled on Ubuntu 16.04 and macOS.

Author: Antoine Pitrou <[email protected]>

Closes apache#3626 from pitrou/ARROW-3292-flight-travis-ci and squashes the following commits:

afd1fd3 <Antoine Pitrou> ARROW-3292:  Test Flight RPC in Travis-CI
…ve performance

It seems that using `Object.create` in the `bind` method slows down proxy generation substantially. I separated the `Row` and `RowProxyGenerator` concepts - now the generator creates a prototype based on `Row` and the parent vector when it is created, and constructs instances based on that prototype repeatedly in `bind`.

This change improved the included benchmark from 1,114.78 ms to 154.96 ms on my laptop (Intel i5-7200U).

Author: Brian Hulette <[email protected]>

Closes apache#3601 from TheNeuralBit/proxy-bench and squashes the following commits:

2442545 <Brian Hulette> Remove inner class, don't re-define columns
7bb6e4f <Brian Hulette> Update comparator, switch to Symbols for internal variables in Row
4079439 <Brian Hulette> linter fixes
da4d97c <Brian Hulette> Switch back to Object.create with no extra property descriptor
8a5d162 <Brian Hulette> Improve Row proxy performance
fb7a0f0 <Brian Hulette> add row proxy benchmark
Untested but proposed fix to apache#3629 using the standard UMD pattern as generated by Rollup.

Author: Mike Bostock <[email protected]>

Closes apache#3630 from mbostock/fix-amd and squashes the following commits:

0930385 <Mike Bostock> Single quotes.
a6ed870 <Mike Bostock> Fix AMD pattern.
Author: Paddy Horan <[email protected]>

Closes apache#3632 from paddyhoran/bitmap-and-or and squashes the following commits:

b32b6d0 <Paddy Horan> Implemented `BitAnd` and `BitOr` for `Bitmap`
This is the start of a Scalar object model suitable for static and dynamic dispatch to correspond with the existing array and array builder types.

I modified the first aggregation kernel (sum) to use these types for outputs.

Author: Wes McKinney <[email protected]>

Closes apache#3604 from wesm/ARROW-47 and squashes the following commits:

0d01bb3 <Wes McKinney> Fix unit test on MSVC for small integer types
d57f7aa <Wes McKinney> Remove ARROW_GTEST_VENDORED
03ca01c <Wes McKinney> Changes because MSVC tries to instantiate NumericScalar for Time/Timestamp types
271d602 <Wes McKinney> Add notes that Scalar API is experimental
e4a13b4 <Wes McKinney> flake
6626035 <Wes McKinney> Fix up date/time scalars, add tests
704daee <Wes McKinney> Code review comments
d922bd2 <Wes McKinney> Use new Scalar objects in aggregation code
fa89bd0 <Wes McKinney> Drafting, untested
94d5e62 <Wes McKinney> start
* ARROW-4539: [Java] Fix child vector count for lists.

- Child vector count was not set correctly for lists. Fixed to use the right count.

* ARROW-4539: [Java] Add license header.
I admit this feels gross, but it's less gross than what was there before. I can do some more clean up but wanted to get feedback before spending any more time on it

So, the problem partially lies with the gRPC C++ library. The obvious thing, and first thing I tried, was to specialize `SerializationTraits<protocol::FlightData>` and do casts between `FlightData` and `protocol::FlightData` (the proto) at the last possible moment. Unfortunately, this seems to not be possible because of this:

https://github.com/grpc/grpc/blob/master/include/grpcpp/impl/codegen/proto_utils.h#L100

So I had to override that Googly hack and go to some shenanigans (see protocol.h/protocol.cc) to make sure the same templates are always visible both in `Flight.grpc.pb.cc` as well as our client.cc/server.cc

Author: Wes McKinney <[email protected]>

Closes apache#3633 from wesm/flight-cpp-avoid-ub and squashes the following commits:

ed6eb80 <Wes McKinney> Further refinements, make protocol.h an internal header. Comments per feedback
b3609d4 <Wes McKinney> Add comments about the purpose of protocol.cc
ac405b3 <Wes McKinney> Ensure .proto file is compiled before anything else
23fe416 <Wes McKinney> Implement gRPC customizations another way without calling reinterpret_cast on the client and server C++ types
Copy link
Member

@andygrove andygrove left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hi @nevi-me this is looking good! I left some initial comments and will come back to this with a more thorough review this weekend.

let scalar_type = match dt.len() {
1 => dt[0].clone(),
2 => {
if dt.contains(&&DataType::Float64) && dt.contains(&&DataType::Int64) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should there be other rules in here other than Float64/Int64?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not necessarily (at least until we have date time support). When we infer schemas, we only use f64 and i64 for numbers, so bool and string automatically get coerced to string.

I wanted to reactor this for lists and nesting, but I've done it now instead.

}
}
Value::Object(_) => {
panic!("Reading nested JSON structes currently not supported")
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we make this return Result instead of panicking

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done

if projection.is_empty() {
return true;
}
projection.contains(field.name())
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This seems to picking the fields that are in the projection correctly but not necessarily in the correct order.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I need to enable preserving field order on serde_json to keep field order. I opened a separate JIRA for this yesterday. I'd like to address it separately because it also impacts data type JSON conversions.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That's fine. I think for now we can just ensure that we pass the projection in the correct order (I'm fairly sure that is already the case).

nevi-me and others added 10 commits February 14, 2019 09:45
Following the instructions in docker-compose.yml gives me a working IWYU build now

```
export PYTHON_VERSION=3.6
docker-compose build cpp
docker-compose build python
docker-compose build lint
docker-compose run iwyu
```

Author: François Saint-Jacques <[email protected]>
Author: Wes McKinney <[email protected]>

Closes apache#3643 from wesm/ARROW-4340 and squashes the following commits:

56733fc <François Saint-Jacques> Refactor iwyu build into docker install script. (#8)
e1c46c7 <Wes McKinney> Build IWYU for LLVM 7 in iwyu docker-compose job
Author: Pindikura Ravindra <[email protected]>

Closes apache#3636 from pravindra/dsub and squashes the following commits:

ee60c00 <Pindikura Ravindra> ARROW-4204:  add support for decimal subtract
- [ ] Add docs
- [ ] Format code
- [ ] Include Python in integration tests (requires binding the JSON reader/writer from C++)
- [ ] Validate performance?
- [ ] Complete server bindings if approach makes sense

Author: David Li <[email protected]>
Author: Wes McKinney <[email protected]>

Closes apache#3566 from lihalite/flight-python and squashes the following commits:

ac29ab8 <David Li> Clean up to-be-implemented parts of Flight Python bindings
9d5442a <David Li> Clarify various RecordBatchStream{Reader,Writer} wrappers
e1c298a <David Li> Lint CMake files
7764444 <Wes McKinney> Reformat cmake
c6b02aa <David Li> Add basic Python bindings for Flight
Also add a debug check on the C++ side.

Author: Antoine Pitrou <[email protected]>

Closes apache#3647 from pitrou/ARROW-4563-py-validate-decimal128-inputs and squashes the following commits:

5a4cd6a <Antoine Pitrou> ARROW-4563:  Validate decimal128() precision input
…template instantiation to not generate dead identity cast code

Also resolves ARROW-4110, which has been on my list for some time.

This ended up being a huge pain.

* `detail::PrimitiveAllocatingUnaryKernel` can now allocate memory for any kind of fixed width type.
* I factored out simple bitmap propagation into `detail::PropagateNulls`
* I moved the null count resolution code one level down into `ArrayData`, since there are cases where it may be set to `kUnknownNullCount` (e.g. after a slice) and you need to know what it is. This isn't tested but I suggest addressing this in a follow up patch

I also moved hand-maintained macro spaghetti for instantiating CastFunctors into a Python code-generation script. This might be the most controversial change in this patch, but the problem here is that we needed to exclude 1 macro case for each numeric type -- currently they were relying on `NUMERIC_CASES`. This means the list of generated types is slightly different for each type, lending to poor code reuse. Rather than maintaining this code by hand, it is _so much simpler_ to generate it with a script.

Speaking of code generation, I think we should continue to invest in code generation scripts to make generating mundane C++ code for pre-compiled kernels simpler. I checked the file in but I'm not opposed to auto-generating the files as part of the CMake build -- we could do that in a follow up PR.

Author: Wes McKinney <[email protected]>

Closes apache#3642 from wesm/ARROW-1896 and squashes the following commits:

57d1084 <Wes McKinney> Fix another clang warning
0d3a7b3 <Wes McKinney> Fix clang warning on macOS
8aeaf96 <Wes McKinney> Code review
ab534d1 <Wes McKinney> Fix dictionary->dense conversion for Decimal128
7a178a4 <Wes McKinney> Refactoring around kernel memory allocation, do not allocate memory inside CastKernel. Use code generation to avoid instantiating CastFunctors for identity casts that are never used
Author: Antoine Pitrou <[email protected]>

Closes apache#3650 from pitrou/ARROW-4576-py-benchmarks-fix and squashes the following commits:

67e8069 <Antoine Pitrou> ARROW-4576:  Fix error during benchmarks
Previously we would return an incorrect result.

Author: Antoine Pitrou <[email protected]>

Closes apache#3648 from pitrou/ARROW-3669-numpy-byteswapped-arrays and squashes the following commits:

1e0e10d <Antoine Pitrou> ARROW-3669:  Raise error on Numpy byte-swapped array
kou and others added 9 commits February 16, 2019 17:48
Author: Kouhei Sutou <[email protected]>

Closes apache#3665 from kou/glib-exit-on-error and squashes the following commits:

d8bc073 <Kouhei Sutou>  Stop configure immediately when GLib isn't available
…d of Arrow::Array

This is a compatibility breaking change but we warn about this change
in 0.12.0.

Author: Kouhei Sutou <[email protected]>

Closes apache#3667 from kou/ruby-struct-array-ref and squashes the following commits:

5b30449 <Kouhei Sutou>  Arrow::StructArray# returns Arrow::Struct instead of Arrow::Array
Author: Kouhei Sutou <[email protected]>

Closes apache#3666 from kou/ruby-array-ref-out-of-range and squashes the following commits:

f64df5b <Kouhei Sutou>  Array# returns nil
Author: Korn, Uwe <[email protected]>

Closes apache#3657 from xhochy/ARROW-4584 and squashes the following commits:

00a53a8 <Korn, Uwe> ARROW-4584:  Add built wheel to manylinux1 dockerignore
Author: Nicolas Trinquier <[email protected]>

Closes apache#3663 from ntrinquier/arrow-4377 and squashes the following commits:

0dbf90c <Nicolas Trinquier> Propagate Err
bceddd0 <Nicolas Trinquier> Display arrays vertically
c0e7d55 <Nicolas Trinquier> Handle null case
802501a <Nicolas Trinquier> :xImplement Debug for PrimitiveArrays
We can detect LLVM installed by Homebrew automatically in CMake.

Author: Kouhei Sutou <[email protected]>

Closes apache#3674 from kou/travis-remove-needless-llvm-dir and squashes the following commits:

70d4bf6 <Kouhei Sutou>  Remove needless LLVM_DIR for macOS
Error message:

    In file included from C:/msys64/mingw64/include/thrift/TApplicationException.h:23,
                     from C:/Users/kou/work/arrow/arrow.kou/cpp/src/parquet/thrift.h:35,
                     from C:/Users/kou/work/arrow/arrow.kou/cpp/src/parquet/column_reader.cc:36:
    C:/msys64/mingw64/include/thrift/Thrift.h:32:10: fatal error: netinet/in.h: No such file or directory
     #include <netinet/in.h>
              ^~~~~~~~~~~~~~

Author: Kouhei Sutou <[email protected]>

Closes apache#3676 from kou/cpp-parquet-fix-build-error-with-mingw and squashes the following commits:

248063c <Kouhei Sutou>  Fix build error with MinGW
This PR adds the first query optimizer rule, which rewrites a logical plan to push the projection down to the TableScan.

Once this is merged, I will create a follow up PR to integrate this into the query engine so that only the necessary columns are loaded from disk.

Author: Andy Grove <[email protected]>

Closes apache#3664 from andygrove/ARROW-4589-wip and squashes the following commits:

b876f28 <Andy Grove> revert formatting change that broke the tests
2051deb <Andy Grove> formatting comments and test strings to be < 90 columns wide
8effde3 <Andy Grove> Address PR feedback, fix bug, add extra unit test
ecdd32a <Andy Grove> refactor code to reduce duplication
6229b32 <Andy Grove> refactor code to reduce duplication
f959500 <Andy Grove> implement projection push down for rest of logical plan variants
5fd5382 <Andy Grove> implement collect_expr and rewrite_expr for all expression types
bd49f17 <Andy Grove> improve error handling
92918dd <Andy Grove> Implement projection push-down for selection and make projection deterministic
a80cfdf <Andy Grove> Implement mapping and expression rewrite logic
26fd3b4 <Andy Grove> revert change
d7c4822 <Andy Grove> formatting and add assertion to test
e81af14 <Andy Grove> Roughing out projection push down rule
Author: Uwe L. Korn <[email protected]>

Closes apache#3679 from xhochy/ARROW-4601 and squashes the following commits:

7b5544f <Uwe L. Korn> ARROW-4601:  Add license header to dockerignore
@andygrove
Copy link
Member

@nevi-me I tried merging this but there was a conflict:

CONFLICT (rename/rename): Rename "rust/arrow/src/mod.rs"->"rust/arrow/src/compute/mod.rs" in branch "HEAD" rename "rust/arrow/src/mod.rs"->"rust/arrow/src/json/mod.rs" in "PR_TOOL_MERGE_PR_3624"

Please rebase against master and I will try again

Nicolas Trinquier and others added 9 commits February 17, 2019 16:14
Author: Nicolas Trinquier <[email protected]>

Closes apache#3669 from ntrinquier/arrow-4464 and squashes the following commits:

facc5c2 <Nicolas Trinquier> Add Limit case to ProjectionPushDown
2ed488c <Nicolas Trinquier> Merge remote-tracking branch 'upstream/master' into arrow-4464
c78ae2c <Nicolas Trinquier> Use the previous batch's schema for Limit
e93df93 <Nicolas Trinquier> Remove redundant variable
dbc639f <Nicolas Trinquier> Make limit an usize and avoid evaluting the limit expression
eac5a24 <Nicolas Trinquier> Add support for Limit
@nevi-me
Copy link
Contributor Author

nevi-me commented Feb 18, 2019

I seem to have messed this PR up while rebasing. I'm closing it in favour of a new one

@wesm
Copy link
Member

wesm commented Feb 18, 2019

@nevi-me It's preferable to force push the branch instead of opening a new PR

@nevi-me
Copy link
Contributor Author

nevi-me commented Feb 18, 2019

Apologies Wes, rebases are still hit/miss with me. Looks like I got it right this time around, but I can't reopen this PR

andygrove pushed a commit that referenced this pull request Feb 18, 2019
Replaces #3624

___

An initial JSON reader for Rust, which reads JSON line-delimited files into record batches.

Supports:
* schema inference and user-supplied schema
* projection by column names
* reading lists
* reading lists and scalars, converting scalars to single-value lists

Excluded in this PR are:
* Struct arrays
* Nesting in lists. The list or struct is replaced with null for now
* Extremely mixed types, e.g [int, bool, float, list(_)]

Author: Neville Dipale <[email protected]>

Closes #3685 from nevi-me/rust/json-reader2 and squashes the following commits:

594d056 <Neville Dipale> ARROW-4540:  Basic JSON reader
@wesm
Copy link
Member

wesm commented Feb 18, 2019

OK, if you get into trouble with a PR rebase please ping a maintainer so we can help you. The important thing is to make sure you don't run git merge at any point during the process

@andygrove
Copy link
Member

andygrove commented Feb 18, 2019 via email

@wesm
Copy link
Member

wesm commented Feb 18, 2019

Maybe you can try to evangelize the benefits of a linear commit history in your respective organizations =)

git merge has IMHO zero benefits in a project with many contributors. Things that become extremely hard:

  • Patch backporting
  • git bisect

Because we have a linear history, we can make bugfix releases with cherry-picked patches with almost no pain at all.

NB: many large organizations, such as Google, work in the way that Arrow does. It's so prevalent that it's built into the code review tools they have produced (e.g. https://www.gerritcodereview.com/)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.