-
Notifications
You must be signed in to change notification settings - Fork 0
ARROW-17980: [C++] As-of-Join Substrait extension #14385 #12
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
…rs (apache#14440) * `configure` checks that `R CMD config CXX17` is defined and exits early if not * README and install.Rmd discuss system dependencies; install.Rmd includes a bash script to run once to configure your CentOS 7 machine to install arrow. Confirmed that this works manually via docker. * Updated `r_docker_configure.sh` to use that script logic, so our CI should confirm too. * Various cleanups in nixlibs.R: remove checks for gcc 4.8/4.9 all over, updated system requirements check to handle case of devtoolset-8 but no openssl/curl * Unrelated purge of windows build script handling for rtools35 Lead-authored-by: Neal Richardson <[email protected]> Co-authored-by: Jacob Wujciak-Jens <[email protected]> Signed-off-by: Neal Richardson <[email protected]>
|
Thanks for opening a pull request! If this is not a minor PR. Could you open an issue for this pull request on JIRA? https://issues.apache.org/jira/browse/ARROW Opening JIRAs ahead of time contributes to the Openness of the Apache Arrow project. Then could you also rename pull request title in the following format? or See also: |
westonpace
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This looks pretty straightforward. I'll review the substrait changes more closely when we can figure out how to get rid of the ::substrait change (I think Anja was making progress on this)?
It looks like the major changes are:
- Change as-of join node so that it can have unique key indices per table
- Pull the MakeOutputSchema method out into its own header file so it can be referenced by the substrait mapping
- Add substrait mapping for asof join node
Is that correct?
| message("SOURCE DIR2 IS ${SOURCE_DIR} AND ${CMAKE_SOURCE_DIR} AND ${ARROW_SUBSTRAIT_PROTOS_DIR}" | ||
| ) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Let's remove this cleanup if its ready
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Will do.
| /// The keys must be consistent across the input tables: | ||
| /// Each "on" key must refer to a field of the same type and units across the tables. | ||
| /// Each "by" key must refer to a list of fields of the same types across the tables. | ||
| struct Keys { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
So there is one Keys element per input? Do they all need to have the exact same number of by_key keys?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes and yes.
| } | ||
|
|
||
| private: | ||
| Result<DeclarationInfo> MakeAsOfJoinRel( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It's a bit weird to have this in options.cc. I think, when we looking at merging into arrow, we will want to find a better home for this. Perhaps a custom_relations.cc or something like that.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Will do.
| *compute::default_exec_context(), buf, {}, conversion_options); | ||
| } | ||
|
|
||
| TEST(Substrait, PlanWithExtension) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can we change the name of this test to reflect that it has something to do with asof join?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Will do.
Some usage examples:
```
$ archery docker info debian-go-cgo --show command
Service debian-go-cgo docker compose config:
- command: /bin/bash -c "
/arrow/ci/scripts/go_build.sh /arrow &&
/arrow/ci/scripts/go_test.sh /arrow"
```
or
```
$ archery docker info debian-go-cgo
Service debian-go-cgo docker compose config:
- image: ${REPO}:${ARCH}-debian-${DEBIAN}-go-${GO}-cgo
- build
- context: .
- dockerfile: ci/docker/debian-go-cgo.dockerfile
- cache_from: ['${REPO}:${ARCH}-debian-${DEBIAN}-go-${GO}-cgo']
- args
- base: ${REPO}:${ARCH}-debian-${DEBIAN}-go-${GO}
- shm_size: 2G
- volumes: ['.:/arrow:delegated', '${DOCKER_VOLUME_PREFIX}debian-ccache:/ccache:delegated']
- environment
- ARROW_GO_TESTCGO: 1
- command: /bin/bash -c "
/arrow/ci/scripts/go_build.sh /arrow &&
/arrow/ci/scripts/go_test.sh /arrow"
```
or
```
$ archery docker info ubuntu-cuda-cpp --show environment
Service ubuntu-cuda-cpp docker compose config:
- environment
- ARROW_BUILD_STATIC: OFF
- ARROW_CUDA: ON
- ARROW_GANDIVA: OFF
- ARROW_GCS: OFF
- ARROW_ORC: OFF
- ARROW_S3: OFF
- ARROW_SUBSTRAIT: OFF
- ARROW_WITH_OPENTELEMETRY: OFF
- CCACHE_COMPILERCHECK: content
- CCACHE_COMPRESS: 1
- CCACHE_COMPRESSLEVEL: 6
- CCACHE_MAXSIZE: 1G
- CCACHE_DIR: /ccache
```
_edited only the commands to reflect the change of the command --show_
Lead-authored-by: Raúl Cumplido <[email protected]>
Co-authored-by: Antoine Pitrou <[email protected]>
Signed-off-by: Antoine Pitrou <[email protected]>
…formance regressions (apache#14447) Authored-by: Nic Crane <[email protected]> Signed-off-by: Nic Crane <[email protected]>
…e as the existing one (apache#14436) This PR makes it so that you can do the following without a warning: ``` r library(arrow, warn.conflicts = FALSE) register_scalar_function( "times_32", function(context, x) x * 32L, in_type = list(int32(), int64(), float64()), out_type = function(in_types) in_types[[1]], auto_convert = TRUE ) register_scalar_function( "times_32", function(context, x) x * 32L, in_type = list(int32(), int64(), float64()), out_type = function(in_types) in_types[[1]], auto_convert = TRUE ) ``` In fixing that, I also ran across an important discovery, which is that `cpp11::function` does *not* protect the underlying `SEXP` from garbage collection!!!! It the two functions we used this for were being protected by proxy because the execution environment of `register_scalar_function()` was being saved when the binding was registered. Authored-by: Dewey Dunnington <[email protected]> Signed-off-by: Dewey Dunnington <[email protected]>
…pache#14366) Add a test for `TypedColumnReader::Skip` with repeated values to make it clear that we are skipping values and not records. Also, add some comments to the existing test for Skip of non-repeated values. Lead-authored-by: Fatemah Panahi <[email protected]> Co-authored-by: Antoine Pitrou <[email protected]> Co-authored-by: Fatemah Panahi <[email protected]> Signed-off-by: Antoine Pitrou <[email protected]>
… table (apache#13962) I think there is a bug in the ORC reader : when we specify the fields indexes that we want to keep, it does not work correctly. Looking at the code, it seems to be because we do "includeTypes" in lieue of "include" when setting the ORC options. It can be problematic when we want to import an ORC table containing Union types as it will do an error at the import, even if we try not to import these specific fields. The definitions of the corresponding ORC methods are here : https://github.com/apache/orc/blob/72220851cbde164a22706f8d47741fd1ad3db190/c%2B%2B/src/Options.hh#L185-L191 and https://github.com/apache/orc/blob/72220851cbde164a22706f8d47741fd1ad3db190/c%2B%2B/src/Options.hh#L201-L207 Lead-authored-by: LouisClt <[email protected]> Co-authored-by: Antoine Pitrou <[email protected]> Co-authored-by: Antoine Pitrou <[email protected]> Signed-off-by: Antoine Pitrou <[email protected]>
|
@westonpace, I see this is a PR on your personal GitHub space. Is this intended? |
I think so. My understanding is that we are going to do some rapid prototyping on this branch. |
This new version includes a security fix. Authored-by: Antoine Pitrou <[email protected]> Signed-off-by: Sutou Kouhei <[email protected]>
This is most of them I could find. There's another one in `python/pyarrow/src/arrow/python/datetime.cc` to remove but it's more involved. Authored-by: Neal Richardson <[email protected]> Signed-off-by: Sutou Kouhei <[email protected]>
…apache#14453) Error message: https://github.com/ursacomputing/crossbow/actions/runs/3271941831/jobs/5382341820#step:5:566 WARN: Remotes registry file missing, creating default one in /root/.conan/remotes.json WARN: grpc/1.48.0: requirement re2/20220601 overridden by arrow/10.0.0 to re2/20220201 WARN: grpc/1.48.0: requirement protobuf/3.21.4 overridden by arrow/10.0.0 to protobuf/3.21.1 WARN: googleapis/cci.20220711: requirement protobuf/3.21.4 overridden by grpc/1.48.0 to protobuf/3.21.1 WARN: grpc-proto/cci.20220627: requirement protobuf/3.21.4 overridden by grpc/1.48.0 to protobuf/3.21.1 ERROR: Missing binary: grpc/1.48.0:ddc600b3316e16c4e38f2c1ca1214d7241b4dd80 grpc/1.48.0: WARN: Can't find a 'grpc/1.48.0' package for the specified settings, options and dependencies: - Settings: arch=x86_64, build_type=Release, compiler=gcc, compiler.libcxx=libstdc++, compiler.version=10, os=Linux - Options: codegen=True, cpp_plugin=True, csharp_ext=False, csharp_plugin=True, fPIC=True, node_plugin=True, objective_c_plugin=True, php_plugin=True, python_plugin=True, ruby_plugin=True, secure=False, shared=False, abseil:fPIC=True, abseil:shared=False, c-ares:fPIC=True, c-ares:shared=False, c-ares:tools=True, googleapis:fPIC=True, googleapis:shared=False, grpc-proto:fPIC=True, grpc-proto:shared=False, openssl:386=False, openssl:enable_weak_ssl_ciphers=False, openssl:fPIC=True, openssl:no_aria=False, openssl:no_asm=False, openssl:no_async=False, openssl:no_bf=False, openssl:no_blake2=False, openssl:no_camellia=False, openssl:no_cast=False, openssl:no_chacha=False, openssl:no_cms=False, openssl:no_comp=False, openssl:no_ct=False, openssl:no_deprecated=False, openssl:no_des=False, openssl:no_dgram=False, openssl:no_dh=False, openssl:no_dsa=False, openssl:no_dso=False, openssl:no_ec=False, openssl:no_ecdh=False, openssl:no_ecdsa=False, openssl:no_engine=False, openssl:no_filenames=False, openssl:no_gost=False, openssl:no_hmac=False, openssl:no_idea=False, openssl:no_md4=False, openssl:no_md5=False, openssl:no_mdc2=False, openssl:no_ocsp=False, openssl:no_pinshared=False, openssl:no_rc2=False, openssl:no_rfc3779=False, openssl:no_rmd160=False, openssl:no_rsa=False, openssl:no_seed=False, openssl:no_sha=False, openssl:no_sm2=False, openssl:no_sm3=False, openssl:no_sm4=False, openssl:no_sock=False, openssl:no_srp=False, openssl:no_srtp=False, openssl:no_sse2=False, openssl:no_ssl=False, openssl:no_ssl3=False, openssl:no_stdio=False, openssl:no_tests=False, openssl:no_threads=False, openssl:no_tls1=False, openssl:no_ts=False, openssl:no_whirlpool=False, openssl:openssldir=None, openssl:shared=False, protobuf:debug_suffix=True, protobuf:fPIC=True, protobuf:lite=False, protobuf:shared=False, protobuf:with_rtti=True, protobuf:with_zlib=True, re2:fPIC=True, re2:shared=False, zlib:fPIC=True, zlib:shared=False - Dependencies: abseil/20220623.0, c-ares/1.18.1, openssl/1.1.1q, re2/20220201, zlib/1.2.12, protobuf/3.21.1, googleapis/cci.20220711, grpc-proto/cci.20220627 - Requirements: abseil/20220623.Y.Z, c-ares/1.Y.Z, googleapis/cci.20220711, grpc-proto/cci.20220627, openssl/1.Y.Z, protobuf/3.21.1:37dd8aae630726607d9d4108fefd2f59c8f7e9db, re2/20220201.Y.Z, zlib/1.Y.Z - Package ID: ddc600b3316e16c4e38f2c1ca1214d7241b4dd80 ERROR: Missing prebuilt package for 'grpc/1.48.0' Authored-by: Sutou Kouhei <[email protected]> Signed-off-by: Sutou Kouhei <[email protected]>
We generate version in Crossbow not CI job. We can use file name validation feature by using version generated by Crossbow. We use "X.Y.Z" version for RC version such as "10.0.0" for "10.0.0-rc1". Authored-by: Sutou Kouhei <[email protected]> Signed-off-by: Sutou Kouhei <[email protected]>
…pers/python page (apache#14350) This PR updates Windows section of the Python Development page. Main changes: - use Python 3.10 (also in instructions for Linux/MacOs) - definition of `PATH` not needed as Python doesn't search in `PATH` for dlls anymore ([3.8 +](https://bugs.python.org/issue43173)) - use `CONDA_PREFIX` to define `ARROW_HOME` as in other parts of the docs - remove **Running C++ unit tests for Python integration** section (C++ unit tests are part of `pytest`-based test module as of apache#14117) cc @wjones127 @jorisvandenbossche Authored-by: Alenka Frim <[email protected]> Signed-off-by: Joris Van den Bossche <[email protected]>
…ssage (apache#14458) Insert a space between the ``@`` and the username, to avoid it is an actual github reference Authored-by: Joris Van den Bossche <[email protected]> Signed-off-by: Rok Mihevc <[email protected]>
Authored-by: David Li <[email protected]> Signed-off-by: David Li <[email protected]>
…che#14462) The current patterns may match multiple files that have the same base name. For example: * `arrow/dev/tasks/linux-packages/apache-arrow/apt/repositories/debian/pool/bookworm/main/a/apache-arrow/libarrow-glib-dev_10.0.0.dev480-1_arm64.deb` * `arrow/dev/tasks/linux-packages/apache-arrow/apt/build/debian-bookworm-arm64/libgandiva-glib-dev_10.0.0.dev480-1_arm64.deb` The latter (`**/build/**`) is an artifact in a build directory. We should use only the former (`**/repositories/**`) that is an artifact for upload. Authored-by: Sutou Kouhei <[email protected]> Signed-off-by: Sutou Kouhei <[email protected]>
) Only implements Addition and Subtraction for integral types and float32/float64. Temporal types and others will come afterwards. Authored-by: Matt Topol <[email protected]> Signed-off-by: Matt Topol <[email protected]>
Lead-authored-by: Will Jones <[email protected]> Co-authored-by: Neal Richardson <[email protected]> Co-authored-by: Dewey Dunnington <[email protected]> Signed-off-by: Neal Richardson <[email protected]>
I discussed this issue with @icexelloss. Instead of this |
Authored-by: Matt Topol <[email protected]> Signed-off-by: Matt Topol <[email protected]>
Which prevents `libarrow_bundled_dependencies` to contain all required symbols. Authored-by: Y <[email protected]> Signed-off-by: Sutou Kouhei <[email protected]>
https://anaconda.org/conda-forge/orc doesn't provide binaries for Windows. Error message: https://dev.azure.com/ursacomputing/crossbow/_build/results?buildId=37759&view=logs&j=4c86bc1b-1091-5192-4404-c74dfaad23e7&t=41795ef0-6501-5db4-3ad4-33c0cf085626&l=497 CMake Error at cmake_modules/FindORC.cmake:56 (message): ORC library was required in toolchain and unable to locate Call Stack (most recent call first): cmake_modules/ThirdpartyToolchain.cmake:280 (find_package) cmake_modules/ThirdpartyToolchain.cmake:4362 (resolve_dependency) CMakeLists.txt:496 (include) Authored-by: Sutou Kouhei <[email protected]> Signed-off-by: Sutou Kouhei <[email protected]>
…ated to timedelta (apache#14460) Pandas 2.0+ will now preserve the unit of a timedelta64/datetime64 numpy array that is passed to a DataFrame constructor, instead of always coercing to nanoseconds. But pyarrow will still use nanoseconds in the conversion to pandas, giving data type mismatches in the tests. For now, ensuring we always create pandas.DataFrames with datetime64[ns] across pandas versions. Longer term solution is ARROW-18124 Authored-by: Joris Van den Bossche <[email protected]> Signed-off-by: Joris Van den Bossche <[email protected]>
Authored-by: Matt Topol <[email protected]> Signed-off-by: Matt Topol <[email protected]>
Authored-by: Neal Richardson <[email protected]> Signed-off-by: Neal Richardson <[email protected]>
Authored-by: Neal Richardson <[email protected]> Signed-off-by: Neal Richardson <[email protected]>
) Needed to avoid 'Failed to close file descriptor for child process'. The other Docker invocations already have it, but not this one. Authored-by: David Li <[email protected]> Signed-off-by: Sutou Kouhei <[email protected]>
) Authored-by: David Li <[email protected]> Signed-off-by: Sutou Kouhei <[email protected]>
…n timezone (apache#45051) ### Rationale for this change If the timezone database is present on the system, but does not contain a timezone referenced in a ORC file, the ORC reader will crash with an uncaught C++ exception. This can happen for example on Ubuntu 24.04 where some timezone aliases have been removed from the main `tzdata` package to a `tzdata-legacy` package. If `tzdata-legacy` is not installed, trying to read a ORC file that references e.g. the "US/Pacific" timezone would crash. Here is a backtrace excerpt: ``` #12 0x00007f1a3ce23a55 in std::terminate() () from /lib/x86_64-linux-gnu/libstdc++.so.6 #13 0x00007f1a3ce39391 in __cxa_throw () from /lib/x86_64-linux-gnu/libstdc++.so.6 #14 0x00007f1a3f4accc4 in orc::loadTZDB(std::basic_string<char, std::char_traits<char>, std::allocator<char> > const&) () from /tmp/arrow-HEAD.ArqTs/venv-wheel-3.12-manylinux_2_17_x86_64.manylinux2014_x86_64/lib/python3.12/site-packages/pyarrow/libarrow.so.1900 #15 0x00007f1a3f4ad392 in std::call_once<orc::LazyTimezone::getImpl() const::{lambda()#1}>(std::once_flag&, orc::LazyTimezone::getImpl() const::{lambda()#1}&&)::{lambda()#2}::_FUN() () from /tmp/arrow-HEAD.ArqTs/venv-wheel-3.12-manylinux_2_17_x86_64.manylinux2014_x86_64/lib/python3.12/site-packages/pyarrow/libarrow.so.1900 #16 0x00007f1a4298bec3 in __pthread_once_slow (once_control=0xa5ca7c8, init_routine=0x7f1a3ce69420 <__once_proxy>) at ./nptl/pthread_once.c:116 #17 0x00007f1a3f4a9ad0 in orc::LazyTimezone::getEpoch() const () from /tmp/arrow-HEAD.ArqTs/venv-wheel-3.12-manylinux_2_17_x86_64.manylinux2014_x86_64/lib/python3.12/site-packages/pyarrow/libarrow.so.1900 #18 0x00007f1a3f4e76b1 in orc::TimestampColumnReader::TimestampColumnReader(orc::Type const&, orc::StripeStreams&, bool) () from /tmp/arrow-HEAD.ArqTs/venv-wheel-3.12-manylinux_2_17_x86_64.manylinux2014_x86_64/lib/python3.12/site-packages/pyarrow/libarrow.so.1900 #19 0x00007f1a3f4e84ad in orc::buildReader(orc::Type const&, orc::StripeStreams&, bool, bool, bool) () from /tmp/arrow-HEAD.ArqTs/venv-wheel-3.12-manylinux_2_17_x86_64.manylinux2014_x86_64/lib/python3.12/site-packages/pyarrow/libarrow.so.1900 #20 0x00007f1a3f4e8dd7 in orc::StructColumnReader::StructColumnReader(orc::Type const&, orc::StripeStreams&, bool, bool) () from /tmp/arrow-HEAD.ArqTs/venv-wheel-3.12-manylinux_2_17_x86_64.manylinux2014_x86_64/lib/python3.12/site-packages/pyarrow/libarrow.so.1900 #21 0x00007f1a3f4e8532 in orc::buildReader(orc::Type const&, orc::StripeStreams&, bool, bool, bool) () from /tmp/arrow-HEAD.ArqTs/venv-wheel-3.12-manylinux_2_17_x86_64.manylinux2014_x86_64/lib/python3.12/site-packages/pyarrow/libarrow.so.1900 #22 0x00007f1a3f4925e9 in orc::RowReaderImpl::startNextStripe() () from /tmp/arrow-HEAD.ArqTs/venv-wheel-3.12-manylinux_2_17_x86_64.manylinux2014_x86_64/lib/python3.12/site-packages/pyarrow/libarrow.so.1900 #23 0x00007f1a3f492c9d in orc::RowReaderImpl::next(orc::ColumnVectorBatch&) () from /tmp/arrow-HEAD.ArqTs/venv-wheel-3.12-manylinux_2_17_x86_64.manylinux2014_x86_64/lib/python3.12/site-packages/pyarrow/libarrow.so.1900 #24 0x00007f1a3e6b251f in arrow::adapters::orc::ORCFileReader::Impl::ReadBatch(orc::RowReaderOptions const&, std::shared_ptr<arrow::Schema> const&, long) () from /tmp/arrow-HEAD.ArqTs/venv-wheel-3.12-manylinux_2_17_x86_64.manylinux2014_x86_64/lib/python3.12/site-packages/pyarrow/libarrow.so.1900 ``` ### What changes are included in this PR? Catch C++ exceptions when iterating ORC batches instead of letting them slip through. ### Are these changes tested? Yes. ### Are there any user-facing changes? No. * GitHub Issue: apache#40633 Authored-by: Antoine Pitrou <[email protected]> Signed-off-by: Sutou Kouhei <[email protected]>
No description provided.