ARROW-4540: [Rust] Basic JSON reader #3624
Conversation
An initial JSON reader
@andygrove @paddyhoran @sunchao PTAL. I've followed a similar structure to our CSV reader. I'm happy with this as a start, though there may be some code that could be shared between the JSON and CSV readers, and some further cleanup to do. I've left out nested struct support for now.
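For orientation, here is a minimal sketch of how such a reader might be driven, mirroring the CSV reader's shape. The module path, the `Reader::new` signature, and the file name are assumptions for illustration, not the exact API in this patch.

```rust
// Sketch only: names and signatures below are assumed, not taken verbatim
// from this PR. It mirrors how the CSV reader is typically driven.
use std::fs::File;
use std::io::BufReader;
use std::sync::Arc;

use arrow::datatypes::{DataType, Field, Schema};
use arrow::json::Reader; // assumed module path

fn main() {
    // A user-supplied schema; alternatively the reader can infer one from a
    // sample of the line-delimited input.
    let schema = Arc::new(Schema::new(vec![
        Field::new("a", DataType::Int64, true),
        Field::new("b", DataType::Float64, true),
    ]));

    let file = File::open("data.jsonl").expect("file should exist");

    // Assumed constructor: (buffered input, schema, batch size, optional projection).
    let mut reader = Reader::new(BufReader::new(file), schema, 1024, None);

    // Each call reads up to `batch_size` JSON lines into an Arrow RecordBatch.
    while let Ok(Some(batch)) = reader.next() {
        println!("{} rows x {} columns", batch.num_rows(), batch.num_columns());
    }
}
```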
Enclose optional shell variable names by double quotations to suppress shell warnings that is like below: ``` +'[' == 1 ']' /home/travis/build/apache/arrow/ci/travis_before_script_cpp.sh: line 81: [: ==: unary operator expected ``` Author: Kenta Murata <[email protected]> Closes apache#3623 from mrkn/suppress_shell_warning_on_ci and squashes the following commits: 4db2e53 <Kenta Murata> Suppress shell warning on travis-ci
…fter docs migration We no longer need to render the format docs as part of the Jekyll site, as they were ported to the master Sphinx project. As a result of this, the instructions to build the website were broken Author: Wes McKinney <[email protected]> Closes apache#3620 from wesm/ARROW-4481 and squashes the following commits: 013d13b <Wes McKinney> Remove generated specification docs from site after docs migration
MakeBuilder doesn't preserve the field name in the given ListType. I think this is a bug. Author: Kenta Murata <[email protected]> Closes apache#3619 from mrkn/fix_make_builder_to_preserve_field_name_of_list_type and squashes the following commits: e4d8597 <Kenta Murata> Fix c_glib's test 8fc747a <Kenta Murata> Fix MakeBuilder to preserve ListType's field name
…devices This PR introduces new C++ methods `CudaContext::CopyDeviceToAnotherDevice` `CudaBuffer::CopyFromAnotherDevice` that are used to implement the Python support for copying CUDA buffer data between different devices. Author: Pearu Peterson <[email protected]> Closes apache#3617 from pearu/arrow-3653 and squashes the following commits: 4de6c97 <Pearu Peterson> Skip tests on copying data between buffers of different devices on a single GPU system. b10afc3 <Pearu Peterson> Support copying data in buffers of different devices.
This fixes two issues: - object fields inside numpy structs should be allowed - buggy ndarray indexing if stride % itemsize != 0 Author: Antoine Pitrou <[email protected]> Closes apache#3614 from pitrou/ARROW-4181-py-struct-conversion-fixes and squashes the following commits: 9886e29 <Antoine Pitrou> ARROW-4181: Fixes for Numpy struct array conversion
The varchar/varbinary vectors have a setSafe variant that takes an arrow buf and a start/end offsets. The method needs to expand the buffer to make space for 'end - start' bytes. but, due to a bug, it was expanding it to accommodate 'end' bytes, thereby making the value buffer much larger than required. Author: Pindikura Ravindra <[email protected]> Closes apache#3613 from pravindra/varchar and squashes the following commits: 1b4f224 <Pindikura Ravindra> ARROW-4532: fix review comments 8f88e3a <Pindikura Ravindra> ARROW-4532: fix a size compute bug
This change seems to indicate an alternative would be to force the build to be Release type (and avoid the naming magic). It doesn't appear that the conda feedstock (https://github.com/conda-forge/gtest-feedstock/blob/master/recipe/build.sh) builds debug for linux at all. it does for windows though: (https://github.com/conda-forge/gtest-feedstock/blob/master/recipe/bld.bat) I'm not sure if we care a lot about this? Author: Micah Kornfield <[email protected]> Author: emkornfield <[email protected]> Closes apache#3589 from emkornfield/upgrade_gmock and squashes the following commits: 2df8f58 <emkornfield> fix case of "set" 200dec7 <Micah Kornfield> try naming for debug e910d47 <Micah Kornfield> remove hack for different vendored vs conda gmock af7740b <Micah Kornfield> Upgrade vendored gmock to 1.8.1
…ache#3571) * Added (SIMD) bitwise AND/OR for `Buffer` * Added benchmark * Removed boolean kernels in favor of traits. * Updated typo. * Cleaned up code. * Updated benchmark to include bitwise OR op
Enabled on Ubuntu 16.04 and macOS. Author: Antoine Pitrou <[email protected]> Closes apache#3626 from pitrou/ARROW-3292-flight-travis-ci and squashes the following commits: afd1fd3 <Antoine Pitrou> ARROW-3292: Test Flight RPC in Travis-CI
…ve performance It seems that using `Object.create` in the `bind` method slows down proxy generation substantially. I separated the `Row` and `RowProxyGenerator` concepts - now the generator creates a prototype based on `Row` and the parent vector when it is created, and constructs instances based on that prototype repeatedly in `bind`. This change improved the included benchmark from 1,114.78 ms to 154.96 ms on my laptop (Intel i5-7200U). Author: Brian Hulette <[email protected]> Closes apache#3601 from TheNeuralBit/proxy-bench and squashes the following commits: 2442545 <Brian Hulette> Remove inner class, don't re-define columns 7bb6e4f <Brian Hulette> Update comparator, switch to Symbols for internal variables in Row 4079439 <Brian Hulette> linter fixes da4d97c <Brian Hulette> Switch back to Object.create with no extra property descriptor 8a5d162 <Brian Hulette> Improve Row proxy performance fb7a0f0 <Brian Hulette> add row proxy benchmark
Untested but proposed fix to apache#3629 using the standard UMD pattern as generated by Rollup. Author: Mike Bostock <[email protected]> Closes apache#3630 from mbostock/fix-amd and squashes the following commits: 0930385 <Mike Bostock> Single quotes. a6ed870 <Mike Bostock> Fix AMD pattern.
Author: Paddy Horan <[email protected]> Closes apache#3632 from paddyhoran/bitmap-and-or and squashes the following commits: b32b6d0 <Paddy Horan> Implemented `BitAnd` and `BitOr` for `Bitmap`
This is the start of a Scalar object model suitable for static and dynamic dispatch to correspond with the existing array and array builder types. I modified the first aggregation kernel (sum) to use these types for outputs. Author: Wes McKinney <[email protected]> Closes apache#3604 from wesm/ARROW-47 and squashes the following commits: 0d01bb3 <Wes McKinney> Fix unit test on MSVC for small integer types d57f7aa <Wes McKinney> Remove ARROW_GTEST_VENDORED 03ca01c <Wes McKinney> Changes because MSVC tries to instantiate NumericScalar for Time/Timestamp types 271d602 <Wes McKinney> Add notes that Scalar API is experimental e4a13b4 <Wes McKinney> flake 6626035 <Wes McKinney> Fix up date/time scalars, add tests 704daee <Wes McKinney> Code review comments d922bd2 <Wes McKinney> Use new Scalar objects in aggregation code fa89bd0 <Wes McKinney> Drafting, untested 94d5e62 <Wes McKinney> start
* ARROW-4539: [Java] Fix child vector count for lists. - Child vector count was not set correctly for lists. Fixed to use the right count. * ARROW-4539: [Java] Add license header.
I admit this feels gross, but it's less gross than what was there before. I can do some more clean up but wanted to get feedback before spending any more time on it So, the problem partially lies with the gRPC C++ library. The obvious thing, and first thing I tried, was to specialize `SerializationTraits<protocol::FlightData>` and do casts between `FlightData` and `protocol::FlightData` (the proto) at the last possible moment. Unfortunately, this seems to not be possible because of this: https://github.com/grpc/grpc/blob/master/include/grpcpp/impl/codegen/proto_utils.h#L100 So I had to override that Googly hack and go to some shenanigans (see protocol.h/protocol.cc) to make sure the same templates are always visible both in `Flight.grpc.pb.cc` as well as our client.cc/server.cc Author: Wes McKinney <[email protected]> Closes apache#3633 from wesm/flight-cpp-avoid-ub and squashes the following commits: ed6eb80 <Wes McKinney> Further refinements, make protocol.h an internal header. Comments per feedback b3609d4 <Wes McKinney> Add comments about the purpose of protocol.cc ac405b3 <Wes McKinney> Ensure .proto file is compiled before anything else 23fe416 <Wes McKinney> Implement gRPC customizations another way without calling reinterpret_cast on the client and server C++ types
andygrove left a comment
Hi @nevi-me, this is looking good! I left some initial comments and will come back to this with a more thorough review this weekend.
rust/arrow/src/json/reader.rs
Outdated
```rust
let scalar_type = match dt.len() {
    1 => dt[0].clone(),
    2 => {
        if dt.contains(&&DataType::Float64) && dt.contains(&&DataType::Int64) {
```
Should there be rules in here other than Float64/Int64?
Not necessarily (at least until we have date/time support). When we infer schemas, we only use f64 and i64 for numbers, so bool and string automatically get coerced to string.
I had wanted to leave refactoring this until we add lists and nesting, but I've done it now instead.
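For readers following along, here is a sketch of that coercion rule with illustrative names (not the PR's exact helper): inference only ever yields Boolean, Int64, Float64 or Utf8, a mixed Int64/Float64 field widens to Float64, and any other mixture falls back to Utf8.

```rust
use arrow::datatypes::DataType;

// Illustrative only: choose a single scalar type for a field after sampling
// its values during schema inference.
fn coerce_scalar_type(dt: &[DataType]) -> DataType {
    match dt.len() {
        // Only one type was seen: use it as-is.
        1 => dt[0].clone(),
        // Mixed integers and floats widen to Float64.
        2 if dt.contains(&DataType::Float64) && dt.contains(&DataType::Int64) => {
            DataType::Float64
        }
        // Any other mixture (e.g. bool + string, number + string) becomes Utf8.
        _ => DataType::Utf8,
    }
}
```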
rust/arrow/src/json/reader.rs
Outdated
```rust
            }
        }
        Value::Object(_) => {
            panic!("Reading nested JSON structes currently not supported")
```
Can we make this return Result instead of panicking?
Done
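A sketch of the non-panicking form discussed above (the `ArrowError::JsonError` variant name is an assumption; the error type actually used in the PR may differ):

```rust
use arrow::error::ArrowError;
use serde_json::Value;

// Illustrative only: report unsupported nesting as an Err so callers can
// recover, instead of panicking. The JsonError variant name is assumed.
fn check_value_supported(value: &Value) -> Result<(), ArrowError> {
    match value {
        Value::Object(_) => Err(ArrowError::JsonError(
            "Reading nested JSON structs is currently not supported".to_string(),
        )),
        _ => Ok(()),
    }
}
```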
```rust
if projection.is_empty() {
    return true;
}
projection.contains(field.name())
```
This seems to be picking the fields that are in the projection correctly, but not necessarily in the correct order.
I need to enable field-order preservation in serde_json to keep the fields ordered. I opened a separate JIRA for this yesterday. I'd like to address it separately because it also impacts data type JSON conversions.
That's fine. I think for now we can just ensure that we pass the projection in the correct order (I'm fairly sure that is already the case).
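One way to make the ordering explicit is to build the projected schema by iterating the projection rather than the schema, so the output column order follows the caller's request. A sketch with illustrative names (not the PR's code):

```rust
use arrow::datatypes::{Field, Schema};

// Illustrative only: select fields in the order given by `projection`
// (rather than schema order), so the resulting batch columns follow the
// caller's requested order. An empty projection keeps the whole schema.
fn project_schema(schema: &Schema, projection: &[String]) -> Schema {
    if projection.is_empty() {
        return schema.clone();
    }
    let fields: Vec<Field> = projection
        .iter()
        .filter_map(|name| schema.fields().iter().find(|f| f.name() == name))
        .cloned()
        .collect();
    Schema::new(fields)
}
```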
Following the instructions in docker-compose.yml gives me a working IWYU build now ``` export PYTHON_VERSION=3.6 docker-compose build cpp docker-compose build python docker-compose build lint docker-compose run iwyu ``` Author: François Saint-Jacques <[email protected]> Author: Wes McKinney <[email protected]> Closes apache#3643 from wesm/ARROW-4340 and squashes the following commits: 56733fc <François Saint-Jacques> Refactor iwyu build into docker install script. (#8) e1c46c7 <Wes McKinney> Build IWYU for LLVM 7 in iwyu docker-compose job
Author: Pindikura Ravindra <[email protected]> Closes apache#3636 from pravindra/dsub and squashes the following commits: ee60c00 <Pindikura Ravindra> ARROW-4204: add support for decimal subtract
- [ ] Add docs - [ ] Format code - [ ] Include Python in integration tests (requires binding the JSON reader/writer from C++) - [ ] Validate performance? - [ ] Complete server bindings if approach makes sense Author: David Li <[email protected]> Author: Wes McKinney <[email protected]> Closes apache#3566 from lihalite/flight-python and squashes the following commits: ac29ab8 <David Li> Clean up to-be-implemented parts of Flight Python bindings 9d5442a <David Li> Clarify various RecordBatchStream{Reader,Writer} wrappers e1c298a <David Li> Lint CMake files 7764444 <Wes McKinney> Reformat cmake c6b02aa <David Li> Add basic Python bindings for Flight
Also add a debug check on the C++ side. Author: Antoine Pitrou <[email protected]> Closes apache#3647 from pitrou/ARROW-4563-py-validate-decimal128-inputs and squashes the following commits: 5a4cd6a <Antoine Pitrou> ARROW-4563: Validate decimal128() precision input
…template instantiation to not generate dead identity cast code Also resolves ARROW-4110, which has been on my list for some time. This ended up being a huge pain. * `detail::PrimitiveAllocatingUnaryKernel` can now allocate memory for any kind of fixed width type. * I factored out simple bitmap propagation into `detail::PropagateNulls` * I moved the null count resolution code one level down into `ArrayData`, since there are cases where it may be set to `kUnknownNullCount` (e.g. after a slice) and you need to know what it is. This isn't tested but I suggest addressing this in a follow up patch I also moved hand-maintained macro spaghetti for instantiating CastFunctors into a Python code-generation script. This might be the most controversial change in this patch, but the problem here is that we needed to exclude 1 macro case for each numeric type -- currently they were relying on `NUMERIC_CASES`. This means the list of generated types is slightly different for each type, lending to poor code reuse. Rather than maintaining this code by hand, it is _so much simpler_ to generate it with a script. Speaking of code generation, I think we should continue to invest in code generation scripts to make generating mundane C++ code for pre-compiled kernels simpler. I checked the file in but I'm not opposed to auto-generating the files as part of the CMake build -- we could do that in a follow up PR. Author: Wes McKinney <[email protected]> Closes apache#3642 from wesm/ARROW-1896 and squashes the following commits: 57d1084 <Wes McKinney> Fix another clang warning 0d3a7b3 <Wes McKinney> Fix clang warning on macOS 8aeaf96 <Wes McKinney> Code review ab534d1 <Wes McKinney> Fix dictionary->dense conversion for Decimal128 7a178a4 <Wes McKinney> Refactoring around kernel memory allocation, do not allocate memory inside CastKernel. Use code generation to avoid instantiating CastFunctors for identity casts that are never used
Author: Antoine Pitrou <[email protected]> Closes apache#3650 from pitrou/ARROW-4576-py-benchmarks-fix and squashes the following commits: 67e8069 <Antoine Pitrou> ARROW-4576: Fix error during benchmarks
Previously we would return an incorrect result. Author: Antoine Pitrou <[email protected]> Closes apache#3648 from pitrou/ARROW-3669-numpy-byteswapped-arrays and squashes the following commits: 1e0e10d <Antoine Pitrou> ARROW-3669: Raise error on Numpy byte-swapped array
Author: Kouhei Sutou <[email protected]> Closes apache#3665 from kou/glib-exit-on-error and squashes the following commits: d8bc073 <Kouhei Sutou> Stop configure immediately when GLib isn't available
…d of Arrow::Array This is a compatibility breaking change but we warn about this change in 0.12.0. Author: Kouhei Sutou <[email protected]> Closes apache#3667 from kou/ruby-struct-array-ref and squashes the following commits: 5b30449 <Kouhei Sutou> Arrow::StructArray# returns Arrow::Struct instead of Arrow::Array
Author: Kouhei Sutou <[email protected]> Closes apache#3666 from kou/ruby-array-ref-out-of-range and squashes the following commits: f64df5b <Kouhei Sutou> Array# returns nil
Author: Korn, Uwe <[email protected]> Closes apache#3657 from xhochy/ARROW-4584 and squashes the following commits: 00a53a8 <Korn, Uwe> ARROW-4584: Add built wheel to manylinux1 dockerignore
Author: Nicolas Trinquier <[email protected]> Closes apache#3663 from ntrinquier/arrow-4377 and squashes the following commits: 0dbf90c <Nicolas Trinquier> Propagate Err bceddd0 <Nicolas Trinquier> Display arrays vertically c0e7d55 <Nicolas Trinquier> Handle null case 802501a <Nicolas Trinquier> :xImplement Debug for PrimitiveArrays
We can detect LLVM installed by Homebrew automatically in CMake. Author: Kouhei Sutou <[email protected]> Closes apache#3674 from kou/travis-remove-needless-llvm-dir and squashes the following commits: 70d4bf6 <Kouhei Sutou> Remove needless LLVM_DIR for macOS
Error message:
```
In file included from C:/msys64/mingw64/include/thrift/TApplicationException.h:23,
                 from C:/Users/kou/work/arrow/arrow.kou/cpp/src/parquet/thrift.h:35,
                 from C:/Users/kou/work/arrow/arrow.kou/cpp/src/parquet/column_reader.cc:36:
C:/msys64/mingw64/include/thrift/Thrift.h:32:10: fatal error: netinet/in.h: No such file or directory
 #include <netinet/in.h>
          ^~~~~~~~~~~~~~
```
Author: Kouhei Sutou <[email protected]>
Closes apache#3676 from kou/cpp-parquet-fix-build-error-with-mingw and squashes the following commits:
248063c <Kouhei Sutou> Fix build error with MinGW
This PR adds the first query optimizer rule, which rewrites a logical plan to push the projection down to the TableScan. Once this is merged, I will create a follow up PR to integrate this into the query engine so that only the necessary columns are loaded from disk. Author: Andy Grove <[email protected]> Closes apache#3664 from andygrove/ARROW-4589-wip and squashes the following commits: b876f28 <Andy Grove> revert formatting change that broke the tests 2051deb <Andy Grove> formatting comments and test strings to be < 90 columns wide 8effde3 <Andy Grove> Address PR feedback, fix bug, add extra unit test ecdd32a <Andy Grove> refactor code to reduce duplication 6229b32 <Andy Grove> refactor code to reduce duplication f959500 <Andy Grove> implement projection push down for rest of logical plan variants 5fd5382 <Andy Grove> implement collect_expr and rewrite_expr for all expression types bd49f17 <Andy Grove> improve error handling 92918dd <Andy Grove> Implement projection push-down for selection and make projection deterministic a80cfdf <Andy Grove> Implement mapping and expression rewrite logic 26fd3b4 <Andy Grove> revert change d7c4822 <Andy Grove> formatting and add assertion to test e81af14 <Andy Grove> Roughing out projection push down rule
Author: Uwe L. Korn <[email protected]> Closes apache#3679 from xhochy/ARROW-4601 and squashes the following commits: 7b5544f <Uwe L. Korn> ARROW-4601: Add license header to dockerignore
@nevi-me I tried merging this but there was a conflict. Please rebase against master and I will try again.
Author: Nicolas Trinquier <[email protected]> Closes apache#3669 from ntrinquier/arrow-4464 and squashes the following commits: facc5c2 <Nicolas Trinquier> Add Limit case to ProjectionPushDown 2ed488c <Nicolas Trinquier> Merge remote-tracking branch 'upstream/master' into arrow-4464 c78ae2c <Nicolas Trinquier> Use the previous batch's schema for Limit e93df93 <Nicolas Trinquier> Remove redundant variable dbc639f <Nicolas Trinquier> Make limit an usize and avoid evaluting the limit expression eac5a24 <Nicolas Trinquier> Add support for Limit
An initial JSON reader
…nto rust/json-reader
I seem to have messed this PR up while rebasing. I'm closing it in favour of a new one.
@nevi-me It's preferable to force-push the branch instead of opening a new PR.
Apologies Wes, rebases are still hit-and-miss with me. Looks like I got it right this time around, but I can't reopen this PR.
Replaces #3624

An initial JSON reader for Rust, which reads JSON line-delimited files into record batches.

Supports:
* schema inference and user-supplied schema
* projection by column names
* reading lists
* reading lists and scalars, converting scalars to single-value lists

Excluded in this PR are:
* Struct arrays
* Nesting in lists. The list or struct is replaced with null for now
* Extremely mixed types, e.g. [int, bool, float, list(_)]

Author: Neville Dipale <[email protected]>

Closes #3685 from nevi-me/rust/json-reader2 and squashes the following commits:

594d056 <Neville Dipale> ARROW-4540: Basic JSON reader
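To make the "scalars become single-value lists" behaviour concrete, here is a small illustrative example; the field names, values, and inferred schema below are assumptions for illustration, not the PR's test data.

```rust
// Illustrative only. Given line-delimited input such as:
//   {"a": 1,   "b": [2.0, 1.3]}
//   {"a": 0.5, "b": 3.1}
//   {"a": 4,   "b": null}
// inference widens "a" (Int64 + Float64) to Float64, and because "b" mixes
// scalars and lists, its scalar values are read as single-value lists.
use arrow::datatypes::{DataType, Field, Schema};

fn expected_schema() -> Schema {
    Schema::new(vec![
        Field::new("a", DataType::Float64, true),
        // List carried a boxed value type in the arrow crate of this era.
        Field::new("b", DataType::List(Box::new(DataType::Float64)), true),
    ])
}
```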
OK, if you get into trouble with a PR rebase please ping a maintainer so we can help you. The important thing is to make sure you don't run `git merge` at any point during the process.
I've been struggling with this too, so took the time this morning to write up some instructions (mostly for my own benefit): https://andygrove.io/apache_arrow_git_tips/
Maybe you can try to evangelize the benefits of a linear commit history in your respective organizations =)
Because we have a linear history, we can make bugfix releases with cherry-picked patches with almost no pain at all. NB: many large organizations, such as Google, work in the way that Arrow does. It's so prevalent that it's built into the code review tools they have produced (e.g. https://www.gerritcodereview.com/).