diff --git a/.gitignore b/.gitignore index dd69b6cec9c..e6dfe19bb98 100644 --- a/.gitignore +++ b/.gitignore @@ -27,4 +27,5 @@ MANIFEST cpp/.idea/ python/.eggs/ -.vscode \ No newline at end of file +.vscode +.idea/ diff --git a/.travis.yml b/.travis.yml index cdf787c831b..b93f1c2519b 100644 --- a/.travis.yml +++ b/.travis.yml @@ -42,7 +42,6 @@ cache: ccache: true directories: - $HOME/.conda_packages - - $HOME/.ccache matrix: fast_finish: true @@ -56,6 +55,9 @@ matrix: before_script: - export CC="gcc-4.9" - export CXX="g++-4.9" + - export ARROW_TRAVIS_USE_TOOLCHAIN=1 + - export ARROW_TRAVIS_VALGRIND=1 + - export ARROW_TRAVIS_PLASMA=1 - $TRAVIS_BUILD_DIR/ci/travis_before_script_cpp.sh script: - $TRAVIS_BUILD_DIR/ci/travis_script_cpp.sh @@ -66,6 +68,8 @@ matrix: cache: addons: before_script: + - export ARROW_TRAVIS_USE_TOOLCHAIN=1 + - export ARROW_TRAVIS_PLASMA=1 - $TRAVIS_BUILD_DIR/ci/travis_before_script_cpp.sh script: - $TRAVIS_BUILD_DIR/ci/travis_script_cpp.sh diff --git a/CHANGELOG.md b/CHANGELOG.md index 55b02e0f9a1..6cedf32df62 100644 --- a/CHANGELOG.md +++ b/CHANGELOG.md @@ -1,24 +1,167 @@ - http://www.apache.org/licenses/LICENSE-2.0 +# Apache Arrow 0.5.0 (23 July 2017) - Unless required by applicable law or agreed to in writing, software - distributed under the License is distributed on an "AS IS" BASIS, - WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. - See the License for the specific language governing permissions and - limitations under the License. See accompanying LICENSE file. ---> +## Bug + +* ARROW-1074 - from_pandas doesnt convert ndarray to list +* ARROW-1079 - [Python] Empty "private" directories should be ignored by Parquet interface +* ARROW-1081 - C++: arrow::test::TestBase::MakePrimitive doesn't fill null_bitmap +* ARROW-1096 - [C++] Memory mapping file over 4GB fails on Windows +* ARROW-1097 - Reading tensor needs file to be opened in writeable mode +* ARROW-1098 - Document Error? 
+* ARROW-1101 - UnionListWriter is not implementing all methods on interface ScalarWriter +* ARROW-1103 - [Python] Utilize pandas metadata from common `_metadata` Parquet file if it exists +* ARROW-1107 - [JAVA] NullableMapVector getField() should return nullable type +* ARROW-1108 - Check if ArrowBuf is empty buffer in getActualConsumedMemory() and getPossibleConsumedMemory() +* ARROW-1109 - [JAVA] transferOwnership fails when readerIndex is not 0 +* ARROW-1110 - [JAVA] make union vector naming consistent +* ARROW-1111 - [JAVA] Make aligning buffers optional, and allow -1 for unknown null count +* ARROW-1112 - [JAVA] Set lastSet for VarLength and List vectors when loading +* ARROW-1113 - [C++] gflags EP build gets triggered (as a no-op) on subsequent calls to make or ninja build +* ARROW-1115 - [C++] Use absolute path for ccache +* ARROW-1117 - [Docs] Minor issues in GLib README +* ARROW-1124 - [Python] pyarrow needs to depend on numpy>=1.10 (not 1.9) +* ARROW-1125 - Python: `Table.from_pandas` doesn't work anymore on partial schemas +* ARROW-1128 - [Docs] command to build a wheel is not properly rendered +* ARROW-1129 - [C++] Fix Linux toolchain build regression from ARROW-742 +* ARROW-1131 - Python: Parquet unit tests are always skipped +* ARROW-1132 - [Python] Unable to write pandas DataFrame w/MultiIndex containing duplicate values to parquet +* ARROW-1136 - [C++/Python] Segfault on empty stream +* ARROW-1138 - Travis: Use OpenJDK7 instead of OracleJDK7 +* ARROW-1139 - [C++] dlmalloc doesn't allow arrow to be built with clang 4 or gcc 7.1.1 +* ARROW-1141 - on import get libjemalloc.so.2: cannot allocate memory in static TLS block +* ARROW-1143 - C++: Fix comparison of NullArray +* ARROW-1144 - [C++] Remove unused variable +* ARROW-1150 - [C++] AdaptiveIntBuilder compiler warning on MSVC +* ARROW-1152 - [Cython] `read_tensor` should work with a readable file +* ARROW-1155 - segmentation fault when run pa.Int16Value() +* ARROW-1157 - C++/Python: Decimal templates are not correctly exported on OSX +* ARROW-1159 - [C++] Static data members cannot be accessed from inline functions in Arrow headers by thirdparty users +* ARROW-1162 - Transfer Between Empty Lists Should Not Invoke Callback +* ARROW-1166 - Errors in Struct type's example and missing reference in Layout.md +* ARROW-1167 - [Python] Create chunked BinaryArray in `Table.from_pandas` when a column's data exceeds 2GB +* ARROW-1168 - [Python] pandas metadata may contain "mixed" data types +* ARROW-1169 - C++: jemalloc externalproject doesn't build with CMake's ninja generator +* ARROW-1170 - C++: `ARROW_JEMALLOC=OFF` breaks linking on unittest +* ARROW-1174 - [GLib] Investigate root cause of ListArray glib test failure +* ARROW-1177 - [C++] Detect int32 overflow in ListBuilder::Append +* ARROW-1179 - C++: Add missing virtual destructors +* ARROW-1180 - [GLib] `garrow_tensor_get_dimension_name()` returns invalid address +* ARROW-1181 - [Python] Parquet test fail if not enabled +* ARROW-1182 - C++: Specify `BUILD_BYPRODUCTS` for zlib and zstd +* ARROW-1186 - [C++] Enable option to build arrow with minimal dependencies needed to build Parquet library +* ARROW-1188 - Segfault when trying to serialize a DataFrame with Null-only Categorical Column +* ARROW-1190 - VectorLoader corrupts vectors with duplicate names +* ARROW-1191 - [JAVA] Implement getField() method for the complex readers +* ARROW-1194 - Getting record batch size with `pa.get_record_batch_size` returns a size that is too small for pandas DataFrame. 
+* ARROW-1197 - [GLib] `record_batch.hpp` Inclusion is missing +* ARROW-1200 - [C++] DictionaryBuilder should use signed integers for indices +* ARROW-1201 - [Python] Incomplete Python types cause a core dump when repr-ing +* ARROW-1203 - [C++] Disallow BinaryBuilder to append byte strings larger than the maximum value of `int32_t` +* ARROW-1205 - C++: Reference to type objects in ArrayLoader may cause segmentation faults. +* ARROW-1206 - [C++] Enable MSVC builds to work with some compression library support disabled +* ARROW-1208 - [C++] Toolchain build with ZSTD library from conda-forge failure +* ARROW-1215 - [Python] Class methods in API reference +* ARROW-1216 - Numpy arrays cannot be created from Arrow Buffers on Python 2 +* ARROW-1218 - Arrow doesn't compile if all compression libraries are deactivated +* ARROW-1222 - [Python] pyarrow.array returns NullArray for array of unsupported Python objects +* ARROW-1223 - [GLib] Fix function name that returns wrapped object +* ARROW-1235 - [C++] macOS linker failure with operator<< and std::ostream +* ARROW-1236 - Library paths in exported pkg-config file are incorrect +* ARROW-601 - Some logical types not supported when loading Parquet +* ARROW-784 - Cleaning up thirdparty toolchain support in Arrow on Windows +* ARROW-992 - [Python] In place development builds do not have a `__version__` + +## Improvement + +* ARROW-1041 - [Python] Support `read_pandas` on a directory of Parquet files +* ARROW-1100 - [Python] Add "mode" property to NativeFile instances +* ARROW-1102 - Make MessageSerializer.serializeMessage() public +* ARROW-1120 - [Python] Write support for int96 +* ARROW-1137 - Python: Ensure Pandas roundtrip of all-None column +* ARROW-1148 - [C++] Raise minimum CMake version to 3.2 +* ARROW-1151 - [C++] Add gcc branch prediction to status check macro +* ARROW-1160 - C++: Implement DictionaryBuilder +* ARROW-1165 - [C++] Refactor PythonDecimalToArrowDecimal to not use templates +* ARROW-1185 - [C++] Clean up arrow::Status implementation, add `warn_unused_result` attribute for clang +* ARROW-1187 - Serialize a DataFrame with None column +* ARROW-1193 - [C++] Support pkg-config for `arrow_python.so` +* ARROW-1196 - [C++] Appveyor separate jobs for Debug/Release builds from sources; Build with conda toolchain; Build with NMake Makefiles Generator +* ARROW-1199 - [C++] Introduce mutable POD struct for generic array data +* ARROW-1202 - Remove semicolons from status macros +* ARROW-1217 - [GLib] Add GInputStream based arrow::io::RandomAccessFile +* ARROW-1220 - [C++] Standartize usage of `*_HOME` cmake script variables for 3rd party libs +* ARROW-1221 - [C++] Pin clang-format version +* ARROW-1229 - [GLib] Follow Reader API change (get -> read) +* ARROW-742 - Handling exceptions during execution of `std::wstring_convert` +* ARROW-834 - [Python] Support creating Arrow arrays from Python iterables +* ARROW-915 - Struct Array reads limited support +* ARROW-935 - [Java] Build Javadoc in Travis CI +* ARROW-960 - [Python] Add source build guide for macOS + Homebrew +* ARROW-962 - [Python] Add schema attribute to FileReader +* ARROW-966 - [Python] `pyarrow.list_` should also accept Field instance +* ARROW-978 - [Python] Use sphinx-bootstrap-theme for Sphinx documentation + +## New Feature + +* ARROW-1048 - Allow user `LD_LIBRARY_PATH` to be used with source release script +* ARROW-1073 - C++: Adapative integer builder +* ARROW-1095 - [Website] Add Arrow icon asset +* ARROW-111 - [C++] Add static analyzer to tool chain to verify checking of Status 
returns +* ARROW-1122 - [Website] Guest blog post on Arrow + ODBC from turbodbc +* ARROW-1123 - C++: Make jemalloc the default allocator +* ARROW-1135 - Upgrade Travis CI clang builds to use LLVM 4.0 +* ARROW-1142 - [C++] Move over compression library toolchain from parquet-cpp +* ARROW-1145 - [GLib] Add `get_values()` +* ARROW-1154 - [C++] Migrate more computational utility code from parquet-cpp +* ARROW-1183 - [Python] Implement time type conversions in `to_pandas` +* ARROW-1198 - Python: Add public C++ API to unwrap PyArrow object +* ARROW-1212 - [GLib] Add `garrow_binary_array_get_offsets_buffer()` +* ARROW-1214 - [Python] Add classes / functions to enable stream message components to be handled outside of the stream reader class +* ARROW-1227 - [GLib] Support GOutputStream +* ARROW-460 - [C++] Implement JSON round trip for DictionaryArray +* ARROW-462 - [C++] Implement in-memory conversions between non-nested primitive types and DictionaryArray equivalent +* ARROW-575 - Python: Auto-detect nested lists and nested numpy arrays in Pandas +* ARROW-597 - [Python] Add convenience function to yield DataFrame from any object that a StreamReader or FileReader can read from +* ARROW-599 - [C++] Add LZ4 codec to 3rd-party toolchain +* ARROW-600 - [C++] Add ZSTD codec to 3rd-party toolchain +* ARROW-692 - Java<->C++ Integration tests for dictionary-encoded vectors +* ARROW-693 - [Java] Add JSON support for dictionary vectors + +## Task + +* ARROW-1052 - Arrow 0.5.0 release + +## Test + +* ARROW-1228 - [GLib] Test file name should be the same name as target class +* ARROW-1233 - [C++] Validate cmake script resolving of 3rd party linked libs from correct location in toolchain build # Apache Arrow 0.4.1 (9 June 2017) ## Bug -* ARROW-1039 - Python: pyarrow.Filesystem.read_parquet causing error if nthreads>1 +* ARROW-1039 - Python: `pyarrow.Filesystem.read_parquet` causing error if nthreads>1 * ARROW-1050 - [C++] Export arrow::ValidateArray -* ARROW-1051 - [Python] If pyarrow.parquet fails to import due to a shared library ABI conflict, the test_parquet.py tests silently do not run +* ARROW-1051 - [Python] If pyarrow.parquet fails to import due to a shared library ABI conflict, the `test_parquet.py` tests silently do not run * ARROW-1056 - [Python] Parquet+HDFS test failure due to writing pandas index * ARROW-1057 - Fix cmake warning and msvc debug asserts * ARROW-1062 - [GLib] Examples use old API @@ -27,8 +170,8 @@ * ARROW-1075 - [GLib] Build error on macOS * ARROW-1085 - [java] Follow up on template cleanup. 
Missing method for IntervalYear * ARROW-1086 - [Python] pyarrow 0.4.0 on pypi is missing pxd files -* ARROW-1088 - [Python] test_unicode_filename test fails when unicode filenames aren't supported by system -* ARROW-1090 - [Python] build_ext usability +* ARROW-1088 - [Python] `test_unicode_filename` test fails when unicode filenames aren't supported by system +* ARROW-1090 - [Python] `build_ext` usability * ARROW-1091 - Decimal scale and precision are flipped * ARROW-1092 - More Decimal and scale flipped follow-up * ARROW-1094 - [C++] Incomplete buffer reads in arrow::io::ReadableFile should exactly truncate returned buffer @@ -63,9 +206,9 @@ * ARROW-1003 - [C++] Hdfs and java dlls fail to load when built for Windows with MSVC * ARROW-1004 - ArrowInvalid: Invalid: Python object of type float is not None and is not a string, bool, or date object -* ARROW-1017 - Python: Table.to_pandas leaks memory +* ARROW-1017 - Python: `Table.to_pandas` leaks memory * ARROW-1023 - Python: Fix bundling of arrow-cpp for macOS -* ARROW-1033 - [Python] pytest discovers scripts/test_leak.py +* ARROW-1033 - [Python] pytest discovers `scripts/test_leak.py` * ARROW-1046 - [Python] Conform DataFrame metadata to pandas spec * ARROW-1053 - [Python] Memory leak with RecordBatchFileReader * ARROW-1054 - [Python] Test suite fails on pandas 0.19.2 @@ -74,16 +217,16 @@ * ARROW-813 - [Python] setup.py sdist must also bundle dependent cmake modules * ARROW-824 - Date and Time Vectors should reflect timezone-less semantics * ARROW-856 - CmakeError by Unknown compiler. -* ARROW-881 - [Python] Reconstruct Pandas DataFrame indexes using custom_metadata +* ARROW-881 - [Python] Reconstruct Pandas DataFrame indexes using `custom_metadata` * ARROW-909 - libjemalloc.so.2: cannot open shared object file: * ARROW-939 - Fix division by zero for zero-dimensional Tensors * ARROW-940 - [JS] Generate multiple sets of artifacts * ARROW-944 - Python: Compat broken for pandas==0.18.1 * ARROW-948 - [GLib] Update C++ header file list * ARROW-952 - Compilation error on macOS with clang-802.0.42 -* ARROW-958 - [Python] Conda build guide still needs ARROW_HOME, PARQUET_HOME -* ARROW-979 - [Python] Fix setuptools_scm version when release tag is not in the master timeline -* ARROW-991 - [Python] PyArray_SimpleNew should not be used with NPY_DATETIME +* ARROW-958 - [Python] Conda build guide still needs `ARROW_HOME`, `PARQUET_HOME` +* ARROW-979 - [Python] Fix `setuptools_scm` version when release tag is not in the master timeline +* ARROW-991 - [Python] `PyArray_SimpleNew` should not be used with `NPY_DATETIME` * ARROW-995 - [Website] 0.3 release announce has a typo in reference * ARROW-998 - [Doc] File format documents incorrect schema location @@ -138,9 +281,9 @@ * ARROW-1044 - [GLib] Support Feather * ARROW-29 - C++: Add re2 as optional 3rd-party toolchain dependency * ARROW-446 - [Python] Document NativeFile interfaces, HDFS client in Sphinx -* ARROW-482 - [Java] Provide API access to "custom_metadata" Field attribute in IPC setting +* ARROW-482 - [Java] Provide API access to `custom_metadata` Field attribute in IPC setting * ARROW-596 - [Python] Add convenience function to convert pandas.DataFrame to pyarrow.Buffer containing a file or stream representation -* ARROW-714 - [C++] Add import_pyarrow C API in the style of NumPy for thirdparty C++ users +* ARROW-714 - [C++] Add `import_pyarrow` C API in the style of NumPy for thirdparty C++ users * ARROW-819 - [Python] Define public Cython API * ARROW-872 - [JS] Read streaming format * ARROW-873 - 
[JS] Implement fixed width list type @@ -165,8 +308,8 @@ * ARROW-208 - Add checkstyle policy to java project * ARROW-347 - Add method to pass CallBack when creating a transfer pair * ARROW-413 - DATE type is not specified clearly -* ARROW-431 - [Python] Review GIL release and acquisition in to_pandas conversion -* ARROW-443 - [Python] Support for converting from strided pandas data in Table.from_pandas +* ARROW-431 - [Python] Review GIL release and acquisition in `to_pandas` conversion +* ARROW-443 - [Python] Support for converting from strided pandas data in `Table.from_pandas` * ARROW-451 - [C++] Override DataType::Equals for other types with additional metadata * ARROW-454 - pojo.Field doesn't implement hashCode() * ARROW-526 - [Format] Update IPC.md to account for File format changes and Streaming format @@ -178,8 +321,8 @@ * ARROW-604 - Python: boxed Field instances are missing the reference to DataType * ARROW-613 - [JS] Implement random-access file format * ARROW-617 - Time type is not specified clearly -* ARROW-619 - Python: Fix typos in setup.py args and LD_LIBRARY_PATH -* ARROW-623 - segfault with __repr__ of empty Field +* ARROW-619 - Python: Fix typos in setup.py args and `LD_LIBRARY_PATH` +* ARROW-623 - segfault with `__repr__` of empty Field * ARROW-624 - [C++] Restore MakePrimitiveArray function * ARROW-627 - [C++] Compatibility macros for exported extern template class declarations * ARROW-628 - [Python] Install nomkl metapackage when building parquet-cpp for faster Travis builds @@ -201,7 +344,7 @@ * ARROW-686 - [C++] Account for time metadata changes, add time32 and time64 types * ARROW-689 - [GLib] Install header files and documents to wrong directories * ARROW-691 - [Java] Encode dictionary Int type in message format -* ARROW-697 - [Java] Raise appropriate exceptions when encountering large (> INT32_MAX) record batches +* ARROW-697 - [Java] Raise appropriate exceptions when encountering large (> `INT32_MAX`) record batches * ARROW-699 - [C++] Arrow dynamic libraries are missed on run of unit tests on Windows * ARROW-702 - Fix BitVector.copyFromSafe to reAllocate instead of returning false * ARROW-703 - Fix issue where setValueCount(0) doesn’t work in the case that we’ve shipped vectors across the wire @@ -211,14 +354,14 @@ * ARROW-715 - Python: Explicit pandas import makes it a hard requirement * ARROW-716 - error building arrow/python * ARROW-720 - [java] arrow should not have a dependency on slf4j bridges in compile -* ARROW-723 - Arrow freezes on write if chunk_size=0 +* ARROW-723 - Arrow freezes on write if `chunk_size=0` * ARROW-726 - [C++] PyBuffer dtor may segfault if constructor passed an object not exporting buffer protocol * ARROW-732 - Schema comparison bugs in struct and union types * ARROW-736 - [Python] Mixed-type object DataFrame columns should not silently coerce to an Arrow type by default * ARROW-738 - [Python] Fix manylinux1 packaging * ARROW-739 - Parallel build fails non-deterministically. 
* ARROW-740 - FileReader fails for large objects -* ARROW-747 - [C++] Fix spurious warning caused by passing dl to add_dependencies +* ARROW-747 - [C++] Fix spurious warning caused by passing dl to `add_dependencies` * ARROW-749 - [Python] Delete incomplete binary files when writing fails * ARROW-753 - [Python] Unit tests in arrow/python fail to link on some OS X platforms * ARROW-756 - [C++] Do not pass -fPIC when compiling with MSVC @@ -238,13 +381,13 @@ * ARROW-809 - C++: Writing sliced record batch to IPC writes the entire array * ARROW-812 - Pip install pyarrow on mac failed. * ARROW-817 - [C++] Fix incorrect code comment from ARROW-722 -* ARROW-821 - [Python] Extra file _table_api.h generated during Python build process +* ARROW-821 - [Python] Extra file `_table_api.h` generated during Python build process * ARROW-822 - [Python] StreamWriter fails to open with socket as sink -* ARROW-826 - Compilation error on Mac with -DARROW_PYTHON=on +* ARROW-826 - Compilation error on Mac with `-DARROW_PYTHON=on` * ARROW-829 - Python: Parquet: Dictionary encoding is deactivated if column-wise compression was selected * ARROW-830 - Python: jemalloc is not anymore publicly exposed -* ARROW-839 - [C++] Portable alternative to PyDate_to_ms function -* ARROW-847 - C++: BUILD_BYPRODUCTS not specified anymore for gtest +* ARROW-839 - [C++] Portable alternative to `PyDate_to_ms` function +* ARROW-847 - C++: `BUILD_BYPRODUCTS` not specified anymore for gtest * ARROW-852 - Python: Also set Arrow Library PATHS when detection was done through pkg-config * ARROW-853 - [Python] It is no longer necessary to modify the RPATH of the Cython extensions on many environments * ARROW-858 - Remove dependency on boost regex @@ -262,7 +405,7 @@ * ARROW-914 - [C++/Python] Fix Decimal ToBytes * ARROW-922 - Allow Flatbuffers and RapidJSON to be used locally on Windows * ARROW-928 - Update CMAKE script to detect unsupported msvc compilers versions -* ARROW-933 - [Python] arrow_python bindings have debug print statement +* ARROW-933 - [Python] `arrow_python` bindings have debug print statement * ARROW-934 - [GLib] Glib sources missing from result of 02-source.sh * ARROW-936 - Fix release README * ARROW-938 - Fix Apache Rat errors from source release build @@ -275,7 +418,7 @@ * ARROW-566 - Python: Deterministic position of libarrow in manylinux1 wheels * ARROW-569 - [C++] Set version for .pc * ARROW-577 - [C++] Refactor StreamWriter and FileWriter to have private implementations -* ARROW-580 - C++: Also provide jemalloc_X targets if only a static or shared version is found +* ARROW-580 - C++: Also provide `jemalloc_X` targets if only a static or shared version is found * ARROW-582 - [Java] Add Date/Time Support to JSON File * ARROW-589 - C++: Use system provided shared jemalloc if static is unavailable * ARROW-593 - [C++] Rename ReadableFileInterface to RandomAccessFile @@ -296,7 +439,7 @@ * ARROW-679 - [Format] Change RecordBatch and Field length members from int to long * ARROW-681 - [C++] Build Arrow on Windows with dynamically linked boost * ARROW-684 - Python: More informative message when parquet-cpp but not parquet-arrow is available -* ARROW-688 - [C++] Use CMAKE_INSTALL_INCLUDEDIR for consistency +* ARROW-688 - [C++] Use `CMAKE_INSTALL_INCLUDEDIR` for consistency * ARROW-690 - Only send JIRA updates to issues@arrow.apache.org * ARROW-700 - Add headroom interface for allocator. 
* ARROW-706 - [GLib] Add package install document @@ -311,13 +454,13 @@ * ARROW-731 - [C++] Add shared library related versions to .pc * ARROW-741 - [Python] Add Python 3.6 to Travis CI * ARROW-743 - [C++] Consolidate unit tests for code in array.h -* ARROW-744 - [GLib] Re-add an assertion to garrow_table_new() test +* ARROW-744 - [GLib] Re-add an assertion to `garrow_table_new()` test * ARROW-745 - [C++] Allow use of system cpplint -* ARROW-746 - [GLib] Add garrow_array_get_data_type() +* ARROW-746 - [GLib] Add `garrow_array_get_data_type()` * ARROW-751 - [Python] Rename all Cython extensions to "private" status with leading underscore * ARROW-752 - [Python] Construct pyarrow.DictionaryArray from boxed pyarrow array objects -* ARROW-754 - [GLib] Add garrow_array_is_null() -* ARROW-755 - [GLib] Add garrow_array_get_value_type() +* ARROW-754 - [GLib] Add `garrow_array_is_null()` +* ARROW-755 - [GLib] Add `garrow_array_get_value_type()` * ARROW-758 - [C++] Fix compiler warnings on MSVC x64 * ARROW-761 - [Python] Add function to compute the total size of tensor payloads, including metadata and padding * ARROW-763 - C++: Use `python-config` to find libpythonX.X.dylib @@ -329,7 +472,7 @@ * ARROW-779 - [C++/Python] Raise exception if old metadata encountered * ARROW-782 - [C++] Change struct to class for objects that meet the criteria in the Google style guide * ARROW-788 - Possible nondeterminism in Tensor serialization code -* ARROW-795 - [C++] Combine libarrow/libarrow_io/libarrow_ipc +* ARROW-795 - [C++] Combine `libarrow/libarrow_io/libarrow_ipc` * ARROW-802 - [GLib] Add read examples * ARROW-803 - [GLib] Update package repository URL * ARROW-804 - [GLib] Update build document @@ -342,7 +485,7 @@ * ARROW-816 - [C++] Use conda packages for RapidJSON, Flatbuffers to speed up builds * ARROW-818 - [Python] Review public pyarrow. 
API completeness and update docs * ARROW-820 - [C++] Build dependencies for Parquet library without arrow support -* ARROW-825 - [Python] Generalize pyarrow.from_pylist to accept any object implementing the PySequence protocol +* ARROW-825 - [Python] Generalize `pyarrow.from_pylist` to accept any object implementing the PySequence protocol * ARROW-827 - [Python] Variety of Parquet improvements to support Dask integration * ARROW-828 - [CPP] Document new requirement (libboost-regex-dev) in README.md * ARROW-832 - [C++] Upgrade thirdparty gtest to 1.8.0 @@ -352,7 +495,7 @@ * ARROW-845 - [Python] Sync FindArrow.cmake changes from parquet-cpp * ARROW-846 - [GLib] Add GArrowTensor, GArrowInt8Tensor and GArrowUInt8Tensor * ARROW-848 - [Python] Improvements / fixes to conda quickstart guide -* ARROW-849 - [C++] Add optional $ARROW_BUILD_TOOLCHAIN environment variable option for configuring build environment +* ARROW-849 - [C++] Add optional `$ARROW_BUILD_TOOLCHAIN` environment variable option for configuring build environment * ARROW-857 - [Python] Automate publishing Python documentation to arrow-site * ARROW-860 - [C++] Decide if typed Tensor subclasses are worthwhile * ARROW-861 - [Python] Move DEVELOPMENT.md to Sphinx docs @@ -362,8 +505,8 @@ * ARROW-868 - [GLib] Use GBytes to reduce copy * ARROW-871 - [GLib] Unify DataType files * ARROW-876 - [GLib] Unify ArrayBuffer files -* ARROW-877 - [GLib] Add garrow_array_get_null_bitmap() -* ARROW-878 - [GLib] Add garrow_binary_array_get_buffer() +* ARROW-877 - [GLib] Add `garrow_array_get_null_bitmap()` +* ARROW-878 - [GLib] Add `garrow_binary_array_get_buffer()` * ARROW-892 - [GLib] Fix GArrowTensor document * ARROW-893 - Add GLib document to Web site * ARROW-894 - [GLib] Add GArrowPoolBuffer @@ -389,13 +532,13 @@ * ARROW-341 - [Python] Making libpyarrow available to third parties * ARROW-452 - [C++/Python] Merge "Feather" file format implementation * ARROW-459 - [C++] Implement IPC round trip for DictionaryArray, dictionaries shared across record batches -* ARROW-483 - [C++/Python] Provide access to "custom_metadata" Field attribute in IPC setting +* ARROW-483 - [C++/Python] Provide access to `custom_metadata` Field attribute in IPC setting * ARROW-491 - [C++] Add FixedWidthBinary type * ARROW-493 - [C++] Allow in-memory array over 2^31 -1 elements but require splitting at IPC / RPC boundaries * ARROW-502 - [C++/Python] Add MemoryPool implementation that logs allocation activity to std::cout * ARROW-510 - Add integration tests for date and time types * ARROW-520 - [C++] Add STL-compliant allocator that hooks into an arrow::MemoryPool -* ARROW-528 - [Python] Support _metadata or _common_metadata files when reading Parquet directories +* ARROW-528 - [Python] Support `_metadata` or `_common_metadata` files when reading Parquet directories * ARROW-534 - [C++] Add IPC tests for date/time types * ARROW-539 - [Python] Support reading Parquet datasets with standard partition directory schemes * ARROW-550 - [Format] Add a TensorMessage type @@ -444,7 +587,7 @@ * ARROW-771 - [Python] Add APIs for reading individual Parquet row groups * ARROW-773 - [C++] Add function to create arrow::Table with column appended to existing table * ARROW-865 - [Python] Verify Parquet roundtrips for new date/time types -* ARROW-880 - [GLib] Add garrow_primitive_array_get_buffer() +* ARROW-880 - [GLib] Add `garrow_primitive_array_get_buffer()` * ARROW-890 - [GLib] Add GArrowMutableBuffer * ARROW-926 - Update KEYS to include wesm @@ -481,7 +624,7 @@ * ARROW-323 - [Python] Opt-in 
to PyArrow parquet build rather than skipping silently on failure * ARROW-334 - [Python] OS X rpath issues on some configurations * ARROW-337 - UnionListWriter.list() is doing more than it should, this can cause data corruption -* ARROW-339 - Make merge_arrow_pr script work with Python 3 +* ARROW-339 - Make `merge_arrow_pr` script work with Python 3 * ARROW-340 - [C++] Opening a writeable file on disk that already exists does not truncate to zero * ARROW-342 - Set Python version on release * ARROW-345 - libhdfs integration doesn't work for Mac @@ -490,15 +633,15 @@ * ARROW-349 - Six is missing as a requirement in the python setup.py * ARROW-351 - Time type has no unit * ARROW-354 - Connot compare an array of empty strings to another -* ARROW-357 - Default Parquet chunk_size of 64k is too small +* ARROW-357 - Default Parquet `chunk_size` of 64k is too small * ARROW-358 - [C++] libhdfs can be in non-standard locations in some Hadoop distributions -* ARROW-362 - Python: Calling to_pandas on a table read from Parquet leaks memory +* ARROW-362 - Python: Calling `to_pandas` on a table read from Parquet leaks memory * ARROW-371 - Python: Table with null timestamp becomes float in pandas -* ARROW-375 - columns parameter in parquet.read_table() raises KeyError for valid column +* ARROW-375 - columns parameter in `parquet.read_table()` raises KeyError for valid column * ARROW-384 - Align Java and C++ RecordBatch data and metadata layout * ARROW-386 - [Java] Respect case of struct / map field names * ARROW-387 - [C++] arrow::io::BufferReader does not permit shared memory ownership in zero-copy reads -* ARROW-390 - C++: CMake fails on json-integration-test with ARROW_BUILD_TESTS=OFF +* ARROW-390 - C++: CMake fails on json-integration-test with `ARROW_BUILD_TESTS=OFF` * ARROW-392 - Fix string/binary integration tests * ARROW-393 - [JAVA] JSON file reader fails to set the buffer size on String data vector * ARROW-395 - Arrow file format writes record batches in reverse order. @@ -509,19 +652,19 @@ * ARROW-402 - [Java] "refCnt gone negative" error in integration tests * ARROW-403 - [JAVA] UnionVector: Creating a transfer pair doesn't transfer the schema to destination vector * ARROW-404 - [Python] Closing an HdfsClient while there are still open file handles results in a crash -* ARROW-405 - [C++] Be less stringent about finding include/hdfs.h in HADOOP_HOME +* ARROW-405 - [C++] Be less stringent about finding include/hdfs.h in `HADOOP_HOME` * ARROW-406 - [C++] Large HDFS reads must utilize the set file buffer size when making RPCs * ARROW-408 - [C++/Python] Remove defunct conda recipes * ARROW-414 - [Java] "Buffer too large to resize to ..." 
error * ARROW-420 - Align Date implementation between Java and C++ * ARROW-421 - [Python] Zero-copy buffers read by pyarrow::PyBytesReader must retain a reference to the parent PyBytes to avoid premature garbage collection issues -* ARROW-422 - C++: IPC should depend on rapidjson_ep if RapidJSON is vendored +* ARROW-422 - C++: IPC should depend on `rapidjson_ep` if RapidJSON is vendored * ARROW-429 - git-archive SHA-256 checksums are changing * ARROW-433 - [Python] Date conversion is locale-dependent * ARROW-434 - Segfaults and encoding issues in Python Parquet reads -* ARROW-435 - C++: Spelling mistake in if(RAPIDJSON_VENDORED) +* ARROW-435 - C++: Spelling mistake in `if(RAPIDJSON_VENDORED)` * ARROW-437 - [C++] clang compiler warnings from overridden virtual functions -* ARROW-445 - C++: arrow_ipc is built before arrow/ipc/Message_generated.h was generated +* ARROW-445 - C++: `arrow_ipc` is built before `arrow/ipc/Message_generated.h` was generated * ARROW-447 - Python: Align scalar/pylist string encoding with pandas' one. * ARROW-455 - [C++] BufferOutputStream dtor does not call Close() * ARROW-469 - C++: Add option so that resize doesn't decrease the capacity @@ -536,13 +679,13 @@ * ARROW-519 - [C++] Missing vtable in libarrow.dylib on Xcode 6.4 * ARROW-523 - Python: Account for changes in PARQUET-834 * ARROW-533 - [C++] arrow::TimestampArray / TimeArray has a broken constructor -* ARROW-535 - [Python] Add type mapping for NPY_LONGLONG +* ARROW-535 - [Python] Add type mapping for `NPY_LONGLONG` * ARROW-537 - [C++] StringArray/BinaryArray comparisons may be incorrect when values with non-zero length are null * ARROW-540 - [C++] Fix build in aftermath of ARROW-33 -* ARROW-543 - C++: Lazily computed null_counts counts number of non-null entries +* ARROW-543 - C++: Lazily computed `null_counts` counts number of non-null entries * ARROW-544 - [C++] ArrayLoader::LoadBinary fails for length-0 arrays * ARROW-545 - [Python] Ignore files without .parq or .parquet prefix when reading directory of files -* ARROW-548 - [Python] Add nthreads option to pyarrow.Filesystem.read_parquet +* ARROW-548 - [Python] Add nthreads option to `pyarrow.Filesystem.read_parquet` * ARROW-551 - C++: Construction of Column with nullptr Array segfaults * ARROW-556 - [Integration] Can not run Integration tests if different cpp build path * ARROW-561 - Update java & python dependencies to improve downstream packaging experience @@ -551,7 +694,7 @@ * ARROW-189 - C++: Use ExternalProject to build thirdparty dependencies * ARROW-191 - Python: Provide infrastructure for manylinux1 wheels -* ARROW-328 - [C++] Return shared_ptr by value instead of const-ref? +* ARROW-328 - [C++] Return `shared_ptr` by value instead of const-ref? * ARROW-330 - [C++] CMake functions to simplify shared / static library configuration * ARROW-333 - Make writers update their internal schema even when no data is written. * ARROW-335 - Improve Type apis and toString() by encapsulating flatbuffers better @@ -562,20 +705,20 @@ * ARROW-356 - Add documentation about reading Parquet * ARROW-360 - C++: Add method to shrink PoolBuffer using realloc * ARROW-361 - Python: Support reading a column-selection from Parquet files -* ARROW-365 - Python: Provide Array.to_pandas() +* ARROW-365 - Python: Provide `Array.to_pandas()` * ARROW-366 - [java] implement Dictionary vector * ARROW-374 - Python: clarify unicode vs. 
binary in API -* ARROW-379 - Python: Use setuptools_scm/setuptools_scm_git_archive to provide the version number +* ARROW-379 - Python: Use `setuptools_scm`/`setuptools_scm_git_archive` to provide the version number * ARROW-380 - [Java] optimize null count when serializing vectors. * ARROW-382 - Python: Extend API documentation * ARROW-396 - Python: Add pyarrow.schema.Schema.equals -* ARROW-409 - Python: Change pyarrow.Table.dataframe_from_batches API to create Table instead +* ARROW-409 - Python: Change `pyarrow.Table.dataframe_from_batches` API to create Table instead * ARROW-411 - [Java] Move Intergration.compare and Intergration.compareSchemas to a public utils class -* ARROW-423 - C++: Define BUILD_BYPRODUCTS in external project to support non-make CMake generators +* ARROW-423 - C++: Define `BUILD_BYPRODUCTS` in external project to support non-make CMake generators * ARROW-425 - Python: Expose a C function to convert arrow::Table to pyarrow.Table * ARROW-426 - Python: Conversion from pyarrow.Array to a Python list * ARROW-430 - Python: Better version handling -* ARROW-432 - [Python] Avoid unnecessary memory copy in to_pandas conversion by using low-level pandas internals APIs +* ARROW-432 - [Python] Avoid unnecessary memory copy in `to_pandas` conversion by using low-level pandas internals APIs * ARROW-450 - Python: Fixes for PARQUET-818 * ARROW-457 - Python: Better control over memory pool * ARROW-458 - Python: Expose jemalloc MemoryPool @@ -596,7 +739,7 @@ * ARROW-108 - [C++] Add IPC round trip for union types * ARROW-221 - Add switch for writing Parquet 1.0 compatible logical types -* ARROW-227 - [C++/Python] Hook arrow_io generic reader / writer interface into arrow_parquet +* ARROW-227 - [C++/Python] Hook `arrow_io` generic reader / writer interface into `arrow_parquet` * ARROW-228 - [Python] Create an Arrow-cpp-compatible interface for reading bytes from Python file-like objects * ARROW-243 - [C++] Add "driver" option to HdfsClient to choose between libhdfs and libhdfs3 at runtime * ARROW-303 - [C++] Also build static libraries for leaf libraries @@ -624,7 +767,7 @@ * ARROW-440 - [C++] Support pkg-config * ARROW-441 - [Python] Expose Arrow's file and memory map classes as NativeFile subclasses * ARROW-442 - [Python] Add public Python API to inspect Parquet file metadata -* ARROW-444 - [Python] Avoid unnecessary memory copies from use of PyBytes_* C APIs +* ARROW-444 - [Python] Avoid unnecessary memory copies from use of `PyBytes_*` C APIs * ARROW-449 - Python: Conversion from pyarrow.{Table,RecordBatch} to a Python dict * ARROW-456 - C++: Add jemalloc based MemoryPool * ARROW-461 - [Python] Implement conversion between arrow::DictionaryArray and pandas.Categorical @@ -657,9 +800,9 @@ * ARROW-268 - [C++] Flesh out union implementation to have all required methods for IPC * ARROW-327 - [Python] Remove conda builds from Travis CI processes * ARROW-353 - Arrow release 0.2 -* ARROW-359 - Need to document ARROW_LIBHDFS_DIR +* ARROW-359 - Need to document `ARROW_LIBHDFS_DIR` * ARROW-367 - [java] converter csv/json <=> Arrow file format for Integration tests -* ARROW-368 - Document use of LD_LIBRARY_PATH when using Python +* ARROW-368 - Document use of `LD_LIBRARY_PATH` when using Python * ARROW-372 - Create JSON arrow file format for integration tests * ARROW-506 - Implement Arrow Echo server for integration testing * ARROW-527 - clean drill-module.conf file @@ -687,7 +830,7 @@ * ARROW-210 - [C++] Tidy up the type system a little bit * ARROW-211 - Several typos/errors in Layout.md 
examples * ARROW-217 - Fix Travis w.r.t conda 4.1.0 changes -* ARROW-219 - [C++] Passed CMAKE_CXX_FLAGS are being dropped, fix compiler warnings +* ARROW-219 - [C++] Passed `CMAKE_CXX_FLAGS` are being dropped, fix compiler warnings * ARROW-223 - Do not link against libpython * ARROW-225 - [C++/Python] master Travis CI build is broken * ARROW-244 - [C++] Some global APIs of IPC module should be visible to the outside @@ -699,7 +842,7 @@ * ARROW-266 - [C++] Fix the broken build * ARROW-274 - Make the MapVector nullable * ARROW-278 - [Format] Struct type name consistency in implementations and metadata -* ARROW-283 - [C++] Update arrow_parquet to account for API changes in PARQUET-573 +* ARROW-283 - [C++] Update `arrow_parquet` to account for API changes in PARQUET-573 * ARROW-284 - [C++] Triage builds by disabling Arrow-Parquet module * ARROW-287 - [java] Make nullable vectors use a BitVecor instead of UInt1Vector for bits * ARROW-297 - Fix Arrow pom for release @@ -737,7 +880,7 @@ * ARROW-212 - [C++] Clarify the fact that PrimitiveArray is now abstract class * ARROW-213 - Exposing static arrow build * ARROW-218 - Add option to use GitHub API token via environment variable when merging PRs -* ARROW-234 - [C++] Build with libhdfs support in arrow_io in conda builds +* ARROW-234 - [C++] Build with libhdfs support in `arrow_io` in conda builds * ARROW-238 - C++: InternalMemoryPool::Free() should throw an error when there is insufficient allocated memory * ARROW-245 - [Format] Clarify Arrow's relationship with big endian platforms * ARROW-252 - Add implementation guidelines to the documentation @@ -757,7 +900,7 @@ * ARROW-290 - Specialize alloc() in ArrowBuf * ARROW-292 - [Java] Upgrade Netty to 4.041 * ARROW-299 - Use absolute namespace in macros -* ARROW-305 - Add compression and use_dictionary options to Parquet interface +* ARROW-305 - Add compression and `use_dictionary` options to Parquet interface * ARROW-306 - Add option to pass cmake arguments via environment variable * ARROW-315 - Finalize timestamp type * ARROW-319 - Add canonical Arrow Schema json representation @@ -767,7 +910,7 @@ * ARROW-54 - Python: rename package to "pyarrow" * ARROW-64 - Add zsh support to C++ build scripts * ARROW-66 - Maybe some missing steps in installation guide -* ARROW-68 - Update setup_build_env and third-party script to be more userfriendly +* ARROW-68 - Update `setup_build_env` and third-party script to be more userfriendly * ARROW-71 - C++: Add script to run clang-tidy on codebase * ARROW-73 - Support CMake 2.8 * ARROW-78 - C++: Add constructor for DecimalType @@ -809,7 +952,7 @@ * ARROW-267 - [C++] C++ implementation of file-like layout for RPC / IPC * ARROW-28 - C++: Add google/benchmark to the 3rd-party build toolchain * ARROW-293 - [C++] Implementations of IO interfaces for operating system files -* ARROW-296 - [C++] Remove arrow_parquet C++ module and related parts of build system +* ARROW-296 - [C++] Remove `arrow_parquet` C++ module and related parts of build system * ARROW-3 - Post Initial Arrow Format Spec * ARROW-30 - Python: pandas/NumPy to/from Arrow conversion routines * ARROW-301 - [Format] Add some form of user field metadata to IPC schemas @@ -819,8 +962,8 @@ * ARROW-37 - C++: Represent boolean array data in bit-packed form * ARROW-4 - Initial Arrow CPP Implementation * ARROW-42 - Python: Add to Travis CI build -* ARROW-43 - Python: Add rudimentary console __repr__ for array types -* ARROW-44 - Python: Implement basic object model for scalar values (i.e. 
results of arrow_arr[i]) +* ARROW-43 - Python: Add rudimentary console `__repr__` for array types +* ARROW-44 - Python: Implement basic object model for scalar values (i.e. results of `arrow_arr[i]`) * ARROW-48 - Python: Add Schema object wrapper * ARROW-49 - Python: Add Column and Table wrapper interface * ARROW-53 - Python: Fix RPATH and add source installation instructions diff --git a/README.md b/README.md index 27908958785..9dda25de911 100644 --- a/README.md +++ b/README.md @@ -1,15 +1,20 @@ ## Apache Arrow diff --git a/appveyor.yml b/appveyor.yml index 91e9ee26490..55c58d0bf66 100644 --- a/appveyor.yml +++ b/appveyor.yml @@ -60,7 +60,7 @@ environment: init: - set MINICONDA=C:\Miniconda35-x64 - set PATH=%MINICONDA%;%MINICONDA%/Scripts;%MINICONDA%/Library/bin;%PATH% - - if "%GENERATOR%"=="NMake Makefiles" call "C:\Program Files (x86)\Microsoft Visual Studio 14.0\VC\vcvarsall.bat" x86_amd64 + - if "%GENERATOR%"=="NMake Makefiles" call "C:\Program Files (x86)\Microsoft Visual Studio 14.0\VC\vcvarsall.bat" amd64 build_script: - git config core.symlinks true diff --git a/c_glib/README.md b/c_glib/README.md index 622938550d8..fec877e236f 100644 --- a/c_glib/README.md +++ b/c_glib/README.md @@ -1,15 +1,20 @@ # Arrow GLib diff --git a/c_glib/doc/reference/arrow-glib-docs.sgml b/c_glib/doc/reference/arrow-glib-docs.sgml index 4fa1b7c42de..26fd2f6262b 100644 --- a/c_glib/doc/reference/arrow-glib-docs.sgml +++ b/c_glib/doc/reference/arrow-glib-docs.sgml @@ -1,16 +1,21 @@ # Arrow GLib example diff --git a/c_glib/example/go/Makefile b/c_glib/example/go/Makefile index d8831122d4d..fa2163ca81b 100644 --- a/c_glib/example/go/Makefile +++ b/c_glib/example/go/Makefile @@ -1,14 +1,19 @@ -# Licensed under the Apache License, Version 2.0 (the "License"); -# you may not use this file except in compliance with the License. -# You may obtain a copy of the License at +# Licensed to the Apache Software Foundation (ASF) under one +# or more contributor license agreements. See the NOTICE file +# distributed with this work for additional information +# regarding copyright ownership. The ASF licenses this file +# to you under the Apache License, Version 2.0 (the +# "License"); you may not use this file except in compliance +# with the License. You may obtain a copy of the License at # -# http://www.apache.org/licenses/LICENSE-2.0 +# http://www.apache.org/licenses/LICENSE-2.0 # -# Unless required by applicable law or agreed to in writing, software -# distributed under the License is distributed on an "AS IS" BASIS, -# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. -# See the License for the specific language governing permissions and -# limitations under the License. See accompanying LICENSE file. +# Unless required by applicable law or agreed to in writing, +# software distributed under the License is distributed on an +# "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY +# KIND, either express or implied. See the License for the +# specific language governing permissions and limitations +# under the License. 
PROGRAMS = \ read-batch \ diff --git a/c_glib/example/go/README.md b/c_glib/example/go/README.md index 2054055e655..76eeed78c71 100644 --- a/c_glib/example/go/README.md +++ b/c_glib/example/go/README.md @@ -1,15 +1,20 @@ # Arrow Go example diff --git a/c_glib/example/lua/README.md b/c_glib/example/lua/README.md index 6145bc74ddd..e7e3351fef1 100644 --- a/c_glib/example/lua/README.md +++ b/c_glib/example/lua/README.md @@ -1,15 +1,20 @@ # Arrow Lua example diff --git a/ci/msvc-build.bat b/ci/msvc-build.bat index 22108abdd3b..04fe2ab62cb 100644 --- a/ci/msvc-build.bat +++ b/ci/msvc-build.bat @@ -51,9 +51,16 @@ conda create -n arrow -q -y python=%PYTHON% ^ six pytest setuptools numpy pandas cython ^ thrift-cpp +@rem ARROW-1294 CMake 3.9.0 in conda-forge breaks the build +set ARROW_CMAKE_VERSION=3.8.0 + if "%JOB%" == "Toolchain" ( + conda install -n arrow -q -y -c conda-forge ^ - flatbuffers rapidjson cmake git boost-cpp ^ + flatbuffers rapidjson ^ + cmake=%ARROW_CMAKE_VERSION% ^ + git ^ + boost-cpp ^ snappy zlib brotli gflags lz4-c zstd ) @@ -107,6 +114,9 @@ popd @rem see PARQUET-1018 pushd python + +set PYARROW_CXXFLAGS=/WX python setup.py build_ext --inplace --with-parquet --bundle-arrow-cpp bdist_wheel || exit /B py.test pyarrow -v -s --parquet || exit /B + popd diff --git a/ci/travis_before_script_c_glib.sh b/ci/travis_before_script_c_glib.sh index 6547ea4e537..bf2d385d79d 100755 --- a/ci/travis_before_script_c_glib.sh +++ b/ci/travis_before_script_c_glib.sh @@ -1,17 +1,21 @@ #!/usr/bin/env bash -# Licensed under the Apache License, Version 2.0 (the "License"); -# you may not use this file except in compliance with the License. -# You may obtain a copy of the License at +# Licensed to the Apache Software Foundation (ASF) under one +# or more contributor license agreements. See the NOTICE file +# distributed with this work for additional information +# regarding copyright ownership. The ASF licenses this file +# to you under the Apache License, Version 2.0 (the +# "License"); you may not use this file except in compliance +# with the License. You may obtain a copy of the License at # # http://www.apache.org/licenses/LICENSE-2.0 # -# Unless required by applicable law or agreed to in writing, software -# distributed under the License is distributed on an "AS IS" BASIS, -# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. -# See the License for the specific language governing permissions and -# limitations under the License. See accompanying LICENSE file. - +# Unless required by applicable law or agreed to in writing, +# software distributed under the License is distributed on an +# "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY +# KIND, either express or implied. See the License for the +# specific language governing permissions and limitations +# under the License. set -ex diff --git a/ci/travis_before_script_cpp.sh b/ci/travis_before_script_cpp.sh index e250e705f1f..d456d308c53 100755 --- a/ci/travis_before_script_cpp.sh +++ b/ci/travis_before_script_cpp.sh @@ -1,36 +1,52 @@ #!/usr/bin/env bash -# Licensed under the Apache License, Version 2.0 (the "License"); -# you may not use this file except in compliance with the License. -# You may obtain a copy of the License at +# Licensed to the Apache Software Foundation (ASF) under one +# or more contributor license agreements. See the NOTICE file +# distributed with this work for additional information +# regarding copyright ownership. 
The ASF licenses this file +# to you under the Apache License, Version 2.0 (the +# "License"); you may not use this file except in compliance +# with the License. You may obtain a copy of the License at # # http://www.apache.org/licenses/LICENSE-2.0 # -# Unless required by applicable law or agreed to in writing, software -# distributed under the License is distributed on an "AS IS" BASIS, -# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. -# See the License for the specific language governing permissions and -# limitations under the License. See accompanying LICENSE file. +# Unless required by applicable law or agreed to in writing, +# software distributed under the License is distributed on an +# "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY +# KIND, either express or implied. See the License for the +# specific language governing permissions and limitations +# under the License. set -ex +source $TRAVIS_BUILD_DIR/ci/travis_env_common.sh + if [ "$1" == "--only-library" ]; then only_library_mode=yes else only_library_mode=no + source $TRAVIS_BUILD_DIR/ci/travis_install_conda.sh fi -source $TRAVIS_BUILD_DIR/ci/travis_env_common.sh - -if [ $only_library_mode == "no" ]; then - # C++ toolchain - export CPP_TOOLCHAIN=$TRAVIS_BUILD_DIR/cpp-toolchain - export RAPIDJSON_HOME=$CPP_TOOLCHAIN - +if [ "$ARROW_TRAVIS_USE_TOOLCHAIN" == "1" ]; then # Set up C++ toolchain from conda-forge packages for faster builds - source $TRAVIS_BUILD_DIR/ci/travis_install_conda.sh - conda create -y -q -p $CPP_TOOLCHAIN python=2.7 rapidjson + conda create -y -q -p $CPP_TOOLCHAIN python=2.7 \ + jemalloc=4.4.0 \ + nomkl \ + boost-cpp \ + rapidjson \ + flatbuffers \ + gflags \ + lz4-c \ + snappy \ + zstd \ + brotli \ + zlib \ + cmake \ + curl \ + thrift-cpp \ + ninja fi if [ $TRAVIS_OS_NAME == "osx" ]; then @@ -45,7 +61,6 @@ pushd $ARROW_CPP_BUILD_DIR CMAKE_COMMON_FLAGS="\ -DARROW_BUILD_BENCHMARKS=ON \ -DCMAKE_INSTALL_PREFIX=$ARROW_CPP_INSTALL \ --DARROW_PLASMA=ON \ -DARROW_NO_DEPRECATED_API=ON" CMAKE_LINUX_FLAGS="" CMAKE_OSX_FLAGS="" @@ -60,8 +75,20 @@ else # also in the manylinux1 image. CMAKE_LINUX_FLAGS="\ $CMAKE_LINUX_FLAGS \ --DARROW_JEMALLOC=ON \ --DARROW_TEST_MEMCHECK=ON" +-DARROW_JEMALLOC=ON" +fi + +# Use Ninja for faster builds when using toolchain +if [ $ARROW_TRAVIS_USE_TOOLCHAIN == "1" ]; then + CMAKE_COMMON_FLAGS="$CMAKE_COMMON_FLAGS -GNinja" +fi + +if [ $ARROW_TRAVIS_PLASMA == "1" ]; then + CMAKE_COMMON_FLAGS="$CMAKE_COMMON_FLAGS -DARROW_PLASMA=ON" +fi + +if [ $ARROW_TRAVIS_VALGRIND == "1" ]; then + CMAKE_COMMON_FLAGS="$CMAKE_COMMON_FLAGS -DARROW_TEST_MEMCHECK=ON" fi if [ $TRAVIS_OS_NAME == "linux" ]; then @@ -76,7 +103,7 @@ else $ARROW_CPP_DIR fi -make VERBOSE=1 -j4 -make install +$TRAVIS_MAKE -j4 +$TRAVIS_MAKE install popd diff --git a/ci/travis_before_script_js.sh b/ci/travis_before_script_js.sh index 304c48137aa..b72accc2193 100755 --- a/ci/travis_before_script_js.sh +++ b/ci/travis_before_script_js.sh @@ -1,16 +1,21 @@ #!/usr/bin/env bash -# Licensed under the Apache License, Version 2.0 (the "License"); -# you may not use this file except in compliance with the License. -# You may obtain a copy of the License at +# Licensed to the Apache Software Foundation (ASF) under one +# or more contributor license agreements. See the NOTICE file +# distributed with this work for additional information +# regarding copyright ownership. 
The ASF licenses this file +# to you under the Apache License, Version 2.0 (the +# "License"); you may not use this file except in compliance +# with the License. You may obtain a copy of the License at # # http://www.apache.org/licenses/LICENSE-2.0 # -# Unless required by applicable law or agreed to in writing, software -# distributed under the License is distributed on an "AS IS" BASIS, -# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. -# See the License for the specific language governing permissions and -# limitations under the License. See accompanying LICENSE file. +# Unless required by applicable law or agreed to in writing, +# software distributed under the License is distributed on an +# "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY +# KIND, either express or implied. See the License for the +# specific language governing permissions and limitations +# under the License. set -ex diff --git a/ci/travis_env_common.sh b/ci/travis_env_common.sh index a2e591014cf..d84753125d5 100755 --- a/ci/travis_env_common.sh +++ b/ci/travis_env_common.sh @@ -1,16 +1,21 @@ #!/usr/bin/env bash -# Licensed under the Apache License, Version 2.0 (the "License"); -# you may not use this file except in compliance with the License. -# You may obtain a copy of the License at +# Licensed to the Apache Software Foundation (ASF) under one +# or more contributor license agreements. See the NOTICE file +# distributed with this work for additional information +# regarding copyright ownership. The ASF licenses this file +# to you under the Apache License, Version 2.0 (the +# "License"); you may not use this file except in compliance +# with the License. You may obtain a copy of the License at # # http://www.apache.org/licenses/LICENSE-2.0 # -# Unless required by applicable law or agreed to in writing, software -# distributed under the License is distributed on an "AS IS" BASIS, -# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. -# See the License for the specific language governing permissions and -# limitations under the License. See accompanying LICENSE file. +# Unless required by applicable law or agreed to in writing, +# software distributed under the License is distributed on an +# "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY +# KIND, either express or implied. See the License for the +# specific language governing permissions and limitations +# under the License. export MINICONDA=$HOME/miniconda export PATH="$MINICONDA/bin:$PATH" @@ -29,6 +34,19 @@ export ARROW_CPP_INSTALL=$TRAVIS_BUILD_DIR/cpp-install export ARROW_CPP_BUILD_DIR=$TRAVIS_BUILD_DIR/cpp-build export ARROW_C_GLIB_INSTALL=$TRAVIS_BUILD_DIR/c-glib-install +if [ "$ARROW_TRAVIS_USE_TOOLCHAIN" == "1" ]; then + # C++ toolchain + export CPP_TOOLCHAIN=$TRAVIS_BUILD_DIR/cpp-toolchain + export ARROW_BUILD_TOOLCHAIN=$CPP_TOOLCHAIN + export BOOST_ROOT=$CPP_TOOLCHAIN + + export PATH=$CPP_TOOLCHAIN/bin:$PATH + export LD_LIBRARY_PATH=$CPP_TOOLCHAIN/lib:$LD_LIBRARY_PATH + export TRAVIS_MAKE=ninja +else + export TRAVIS_MAKE=make +fi + if [ $TRAVIS_OS_NAME == "osx" ]; then export GOPATH=$TRAVIS_BUILD_DIR/gopath fi diff --git a/ci/travis_install_clang_tools.sh b/ci/travis_install_clang_tools.sh index a4fd0e24619..bad1e73d24a 100644 --- a/ci/travis_install_clang_tools.sh +++ b/ci/travis_install_clang_tools.sh @@ -1,16 +1,21 @@ #!/usr/bin/env bash -# Licensed under the Apache License, Version 2.0 (the "License"); -# you may not use this file except in compliance with the License. 
-# You may obtain a copy of the License at +# Licensed to the Apache Software Foundation (ASF) under one +# or more contributor license agreements. See the NOTICE file +# distributed with this work for additional information +# regarding copyright ownership. The ASF licenses this file +# to you under the Apache License, Version 2.0 (the +# "License"); you may not use this file except in compliance +# with the License. You may obtain a copy of the License at # # http://www.apache.org/licenses/LICENSE-2.0 # -# Unless required by applicable law or agreed to in writing, software -# distributed under the License is distributed on an "AS IS" BASIS, -# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. -# See the License for the specific language governing permissions and -# limitations under the License. See accompanying LICENSE file. +# Unless required by applicable law or agreed to in writing, +# software distributed under the License is distributed on an +# "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY +# KIND, either express or implied. See the License for the +# specific language governing permissions and limitations +# under the License. wget -O - http://llvm.org/apt/llvm-snapshot.gpg.key|sudo apt-key add - sudo apt-add-repository -y \ diff --git a/ci/travis_install_conda.sh b/ci/travis_install_conda.sh index 369820b37f5..c2502a3744c 100644 --- a/ci/travis_install_conda.sh +++ b/ci/travis_install_conda.sh @@ -1,16 +1,21 @@ #!/usr/bin/env bash -# Licensed under the Apache License, Version 2.0 (the "License"); -# you may not use this file except in compliance with the License. -# You may obtain a copy of the License at +# Licensed to the Apache Software Foundation (ASF) under one +# or more contributor license agreements. See the NOTICE file +# distributed with this work for additional information +# regarding copyright ownership. The ASF licenses this file +# to you under the Apache License, Version 2.0 (the +# "License"); you may not use this file except in compliance +# with the License. You may obtain a copy of the License at # # http://www.apache.org/licenses/LICENSE-2.0 # -# Unless required by applicable law or agreed to in writing, software -# distributed under the License is distributed on an "AS IS" BASIS, -# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. -# See the License for the specific language governing permissions and -# limitations under the License. See accompanying LICENSE file. +# Unless required by applicable law or agreed to in writing, +# software distributed under the License is distributed on an +# "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY +# KIND, either express or implied. See the License for the +# specific language governing permissions and limitations +# under the License. set -e diff --git a/ci/travis_script_c_glib.sh b/ci/travis_script_c_glib.sh index 4bfa0c0af49..d392abdfbbc 100755 --- a/ci/travis_script_c_glib.sh +++ b/ci/travis_script_c_glib.sh @@ -1,16 +1,21 @@ #!/usr/bin/env bash -# Licensed under the Apache License, Version 2.0 (the "License"); -# you may not use this file except in compliance with the License. -# You may obtain a copy of the License at +# Licensed to the Apache Software Foundation (ASF) under one +# or more contributor license agreements. See the NOTICE file +# distributed with this work for additional information +# regarding copyright ownership. 
The ASF licenses this file +# to you under the Apache License, Version 2.0 (the +# "License"); you may not use this file except in compliance +# with the License. You may obtain a copy of the License at # # http://www.apache.org/licenses/LICENSE-2.0 # -# Unless required by applicable law or agreed to in writing, software -# distributed under the License is distributed on an "AS IS" BASIS, -# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. -# See the License for the specific language governing permissions and -# limitations under the License. See accompanying LICENSE file. +# Unless required by applicable law or agreed to in writing, +# software distributed under the License is distributed on an +# "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY +# KIND, either express or implied. See the License for the +# specific language governing permissions and limitations +# under the License. set -e diff --git a/ci/travis_script_cpp.sh b/ci/travis_script_cpp.sh index c368a1daedd..4e3e7bbea1c 100755 --- a/ci/travis_script_cpp.sh +++ b/ci/travis_script_cpp.sh @@ -1,20 +1,25 @@ #!/usr/bin/env bash -# Licensed under the Apache License, Version 2.0 (the "License"); -# you may not use this file except in compliance with the License. -# You may obtain a copy of the License at +# Licensed to the Apache Software Foundation (ASF) under one +# or more contributor license agreements. See the NOTICE file +# distributed with this work for additional information +# regarding copyright ownership. The ASF licenses this file +# to you under the Apache License, Version 2.0 (the +# "License"); you may not use this file except in compliance +# with the License. You may obtain a copy of the License at # # http://www.apache.org/licenses/LICENSE-2.0 # -# Unless required by applicable law or agreed to in writing, software -# distributed under the License is distributed on an "AS IS" BASIS, -# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. -# See the License for the specific language governing permissions and -# limitations under the License. See accompanying LICENSE file. +# Unless required by applicable law or agreed to in writing, +# software distributed under the License is distributed on an +# "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY +# KIND, either express or implied. See the License for the +# specific language governing permissions and limitations +# under the License. set -e -: ${CPP_BUILD_DIR=$TRAVIS_BUILD_DIR/cpp-build} +source $TRAVIS_BUILD_DIR/ci/travis_env_common.sh # Check licenses according to Apache policy git archive HEAD --prefix=apache-arrow/ --output=arrow-src.tar.gz @@ -22,7 +27,7 @@ git archive HEAD --prefix=apache-arrow/ --output=arrow-src.tar.gz pushd $CPP_BUILD_DIR -make lint +$TRAVIS_MAKE lint # ARROW-209: checks depending on the LLVM toolchain are disabled temporarily # until we are able to install the full LLVM toolchain in Travis CI again diff --git a/ci/travis_script_integration.sh b/ci/travis_script_integration.sh index 6e93ed79a22..be025512f0b 100755 --- a/ci/travis_script_integration.sh +++ b/ci/travis_script_integration.sh @@ -1,16 +1,21 @@ #!/usr/bin/env bash -# Licensed under the Apache License, Version 2.0 (the "License"); -# you may not use this file except in compliance with the License. -# You may obtain a copy of the License at +# Licensed to the Apache Software Foundation (ASF) under one +# or more contributor license agreements. 
See the NOTICE file +# distributed with this work for additional information +# regarding copyright ownership. The ASF licenses this file +# to you under the Apache License, Version 2.0 (the +# "License"); you may not use this file except in compliance +# with the License. You may obtain a copy of the License at # # http://www.apache.org/licenses/LICENSE-2.0 # -# Unless required by applicable law or agreed to in writing, software -# distributed under the License is distributed on an "AS IS" BASIS, -# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. -# See the License for the specific language governing permissions and -# limitations under the License. See accompanying LICENSE file. +# Unless required by applicable law or agreed to in writing, +# software distributed under the License is distributed on an +# "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY +# KIND, either express or implied. See the License for the +# specific language governing permissions and limitations +# under the License. set -e diff --git a/ci/travis_script_java.sh b/ci/travis_script_java.sh index 259b73ec24e..2f6b685253b 100755 --- a/ci/travis_script_java.sh +++ b/ci/travis_script_java.sh @@ -1,16 +1,21 @@ #!/usr/bin/env bash -# Licensed under the Apache License, Version 2.0 (the "License"); -# you may not use this file except in compliance with the License. -# You may obtain a copy of the License at +# Licensed to the Apache Software Foundation (ASF) under one +# or more contributor license agreements. See the NOTICE file +# distributed with this work for additional information +# regarding copyright ownership. The ASF licenses this file +# to you under the Apache License, Version 2.0 (the +# "License"); you may not use this file except in compliance +# with the License. You may obtain a copy of the License at # # http://www.apache.org/licenses/LICENSE-2.0 # -# Unless required by applicable law or agreed to in writing, software -# distributed under the License is distributed on an "AS IS" BASIS, -# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. -# See the License for the specific language governing permissions and -# limitations under the License. See accompanying LICENSE file. +# Unless required by applicable law or agreed to in writing, +# software distributed under the License is distributed on an +# "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY +# KIND, either express or implied. See the License for the +# specific language governing permissions and limitations +# under the License. set -e diff --git a/ci/travis_script_js.sh b/ci/travis_script_js.sh index 52ac3b9bdf8..cb1e9e19440 100755 --- a/ci/travis_script_js.sh +++ b/ci/travis_script_js.sh @@ -1,16 +1,21 @@ #!/usr/bin/env bash -# Licensed under the Apache License, Version 2.0 (the "License"); -# you may not use this file except in compliance with the License. -# You may obtain a copy of the License at +# Licensed to the Apache Software Foundation (ASF) under one +# or more contributor license agreements. See the NOTICE file +# distributed with this work for additional information +# regarding copyright ownership. The ASF licenses this file +# to you under the Apache License, Version 2.0 (the +# "License"); you may not use this file except in compliance +# with the License. 
You may obtain a copy of the License at # # http://www.apache.org/licenses/LICENSE-2.0 # -# Unless required by applicable law or agreed to in writing, software -# distributed under the License is distributed on an "AS IS" BASIS, -# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. -# See the License for the specific language governing permissions and -# limitations under the License. See accompanying LICENSE file. +# Unless required by applicable law or agreed to in writing, +# software distributed under the License is distributed on an +# "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY +# KIND, either express or implied. See the License for the +# specific language governing permissions and limitations +# under the License. set -e diff --git a/ci/travis_script_manylinux.sh b/ci/travis_script_manylinux.sh index 4e6be62bd3e..14e6404d3de 100755 --- a/ci/travis_script_manylinux.sh +++ b/ci/travis_script_manylinux.sh @@ -1,16 +1,21 @@ #!/usr/bin/env bash -# Licensed under the Apache License, Version 2.0 (the "License"); -# you may not use this file except in compliance with the License. -# You may obtain a copy of the License at +# Licensed to the Apache Software Foundation (ASF) under one +# or more contributor license agreements. See the NOTICE file +# distributed with this work for additional information +# regarding copyright ownership. The ASF licenses this file +# to you under the Apache License, Version 2.0 (the +# "License"); you may not use this file except in compliance +# with the License. You may obtain a copy of the License at # # http://www.apache.org/licenses/LICENSE-2.0 # -# Unless required by applicable law or agreed to in writing, software -# distributed under the License is distributed on an "AS IS" BASIS, -# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. -# See the License for the specific language governing permissions and -# limitations under the License. See accompanying LICENSE file. +# Unless required by applicable law or agreed to in writing, +# software distributed under the License is distributed on an +# "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY +# KIND, either express or implied. See the License for the +# specific language governing permissions and limitations +# under the License. set -ex @@ -18,4 +23,4 @@ set -ex pushd python/manylinux1 git clone ../../ arrow docker build -t arrow-base-x86_64 -f Dockerfile-x86_64 . -docker run --rm -e PYARROW_PARALLEL=3 -v $PWD:/io arrow-base-x86_64 /io/build_arrow.sh +docker run --shm-size=2g --rm -e PYARROW_PARALLEL=3 -v $PWD:/io arrow-base-x86_64 /io/build_arrow.sh diff --git a/ci/travis_script_python.sh b/ci/travis_script_python.sh index ac64c548d82..9135aaf38e4 100755 --- a/ci/travis_script_python.sh +++ b/ci/travis_script_python.sh @@ -1,41 +1,35 @@ #!/usr/bin/env bash - -# Licensed under the Apache License, Version 2.0 (the "License"); -# you may not use this file except in compliance with the License. -# You may obtain a copy of the License at +# Licensed to the Apache Software Foundation (ASF) under one +# or more contributor license agreements. See the NOTICE file +# distributed with this work for additional information +# regarding copyright ownership. The ASF licenses this file +# to you under the Apache License, Version 2.0 (the +# "License"); you may not use this file except in compliance +# with the License. 
You may obtain a copy of the License at # # http://www.apache.org/licenses/LICENSE-2.0 # -# Unless required by applicable law or agreed to in writing, software -# distributed under the License is distributed on an "AS IS" BASIS, -# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. -# See the License for the specific language governing permissions and -# limitations under the License. See accompanying LICENSE file. +# Unless required by applicable law or agreed to in writing, +# software distributed under the License is distributed on an +# "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY +# KIND, either express or implied. See the License for the +# specific language governing permissions and limitations +# under the License. set -e source $TRAVIS_BUILD_DIR/ci/travis_env_common.sh export ARROW_HOME=$ARROW_CPP_INSTALL - -pushd $ARROW_PYTHON_DIR export PARQUET_HOME=$TRAVIS_BUILD_DIR/parquet-env +export LD_LIBRARY_PATH=$ARROW_HOME/lib:$PARQUET_HOME/lib:$LD_LIBRARY_PATH +export PYARROW_CXXFLAGS="-Werror" build_parquet_cpp() { export PARQUET_ARROW_VERSION=$(git rev-parse HEAD) - conda create -y -q -p $PARQUET_HOME python=3.6 cmake curl - source activate $PARQUET_HOME - - # In case some package wants to download the MKL - conda install -y -q nomkl - conda install -y -q thrift-cpp snappy zlib brotli boost - - export BOOST_ROOT=$PARQUET_HOME - export SNAPPY_HOME=$PARQUET_HOME - export THRIFT_HOME=$PARQUET_HOME - export ZLIB_HOME=$PARQUET_HOME - export BROTLI_HOME=$PARQUET_HOME + # $CPP_TOOLCHAIN set up in before_script_cpp + export PARQUET_BUILD_TOOLCHAIN=$CPP_TOOLCHAIN PARQUET_DIR=$TRAVIS_BUILD_DIR/parquet mkdir -p $PARQUET_DIR @@ -47,38 +41,39 @@ build_parquet_cpp() { cd build-dir cmake \ + -GNinja \ -DCMAKE_BUILD_TYPE=debug \ -DCMAKE_INSTALL_PREFIX=$PARQUET_HOME \ + -DPARQUET_BOOST_USE_SHARED=off \ -DPARQUET_BUILD_BENCHMARKS=off \ -DPARQUET_BUILD_EXECUTABLES=off \ - -DPARQUET_ZLIB_VENDORED=off \ - -DPARQUET_BUILD_TESTS=on \ + -DPARQUET_BUILD_TESTS=off \ .. 
- make -j${CPU_COUNT} - make install + ninja + ninja install popd } build_parquet_cpp -function build_arrow_libraries() { - CPP_BUILD_DIR=$1 - CPP_DIR=$TRAVIS_BUILD_DIR/cpp +function rebuild_arrow_libraries() { + pushd $ARROW_CPP_BUILD_DIR - mkdir $CPP_BUILD_DIR - pushd $CPP_BUILD_DIR + # Clear out prior build files + rm -rf * - cmake -DARROW_BUILD_TESTS=off \ - -DARROW_PYTHON=on \ - -DPLASMA_PYTHON=on \ + cmake -GNinja \ + -DARROW_BUILD_TESTS=off \ + -DARROW_BUILD_UTILITIES=off \ -DARROW_PLASMA=on \ - -DCMAKE_INSTALL_PREFIX=$2 \ - $CPP_DIR + -DARROW_PYTHON=on \ + -DCMAKE_INSTALL_PREFIX=$ARROW_HOME \ + $ARROW_CPP_DIR - make -j4 - make install + ninja + ninja install popd } @@ -87,9 +82,6 @@ python_version_tests() { PYTHON_VERSION=$1 CONDA_ENV_DIR=$TRAVIS_BUILD_DIR/pyarrow-test-$PYTHON_VERSION - export ARROW_HOME=$TRAVIS_BUILD_DIR/arrow-install-$PYTHON_VERSION - export LD_LIBRARY_PATH=$ARROW_HOME/lib:$PARQUET_HOME/lib - conda create -y -q -p $CONDA_ENV_DIR python=$PYTHON_VERSION cmake curl source activate $CONDA_ENV_DIR @@ -103,27 +95,35 @@ python_version_tests() { conda install -y -q pip numpy pandas cython # Build C++ libraries - build_arrow_libraries arrow-build-$PYTHON_VERSION $ARROW_HOME + rebuild_arrow_libraries # Other stuff pip install + pushd $ARROW_PYTHON_DIR pip install -r requirements.txt - - python setup.py build_ext --inplace --with-parquet + python setup.py build_ext --with-parquet --with-plasma \ + install --single-version-externally-managed --record=record.text + popd python -c "import pyarrow.parquet" + python -c "import pyarrow.plasma" - python -m pytest -vv -r sxX pyarrow --parquet + if [ $TRAVIS_OS_NAME == "linux" ]; then + export PLASMA_VALGRIND=1 + fi + PYARROW_PATH=$CONDA_PREFIX/lib/python$PYTHON_VERSION/site-packages/pyarrow + python -m pytest -vv -r sxX -s $PYARROW_PATH --parquet + + pushd $ARROW_PYTHON_DIR # Build documentation once if [[ "$PYTHON_VERSION" == "3.6" ]] then conda install -y -q --file=doc/requirements.txt python setup.py build_sphinx -s doc/source fi + popd } # run tests for python 2.7 and 3.6 python_version_tests 2.7 python_version_tests 3.6 - -popd diff --git a/cpp/.clang-format b/cpp/.clang-format index 33f282a20de..06453dfbb25 100644 --- a/cpp/.clang-format +++ b/cpp/.clang-format @@ -15,67 +15,6 @@ # specific language governing permissions and limitations # under the License. 
--- -Language: Cpp -# BasedOnStyle: Google -AccessModifierOffset: -1 -AlignAfterOpenBracket: false -AlignConsecutiveAssignments: false -AlignEscapedNewlinesLeft: true -AlignOperands: true -AlignTrailingComments: true -AllowAllParametersOfDeclarationOnNextLine: true -AllowShortBlocksOnASingleLine: true -AllowShortCaseLabelsOnASingleLine: false -AllowShortFunctionsOnASingleLine: Inline -AllowShortIfStatementsOnASingleLine: true -AllowShortLoopsOnASingleLine: false -AlwaysBreakAfterDefinitionReturnType: None -AlwaysBreakBeforeMultilineStrings: true -AlwaysBreakTemplateDeclarations: true -BinPackArguments: true -BinPackParameters: true -BreakBeforeBinaryOperators: None -BreakBeforeBraces: Attach -BreakBeforeTernaryOperators: true -BreakConstructorInitializersBeforeComma: false -ColumnLimit: 90 -CommentPragmas: '^ IWYU pragma:' -ConstructorInitializerAllOnOneLineOrOnePerLine: true -ConstructorInitializerIndentWidth: 4 -ContinuationIndentWidth: 4 -Cpp11BracedListStyle: true +BasedOnStyle: Google DerivePointerAlignment: false -DisableFormat: false -ExperimentalAutoDetectBinPacking: false -ForEachMacros: [ foreach, Q_FOREACH, BOOST_FOREACH ] -IndentCaseLabels: true -IndentWidth: 2 -IndentWrappedFunctionNames: false -KeepEmptyLinesAtTheStartOfBlocks: false -MacroBlockBegin: '' -MacroBlockEnd: '' -MaxEmptyLinesToKeep: 1 -NamespaceIndentation: None -ObjCBlockIndentWidth: 2 -ObjCSpaceAfterProperty: false -ObjCSpaceBeforeProtocolList: false -PenaltyBreakBeforeFirstCallParameter: 1000 -PenaltyBreakComment: 300 -PenaltyBreakFirstLessLess: 120 -PenaltyBreakString: 1000 -PenaltyExcessCharacter: 1000000 -PenaltyReturnTypeOnItsOwnLine: 200 -PointerAlignment: Left -SpaceAfterCStyleCast: false -SpaceBeforeAssignmentOperators: true -SpaceBeforeParens: ControlStatements -SpaceInEmptyParentheses: false -SpacesBeforeTrailingComments: 2 -SpacesInAngles: false -SpacesInContainerLiterals: true -SpacesInCStyleCastParentheses: false -SpacesInParentheses: false -SpacesInSquareBrackets: false -Standard: Cpp11 -TabWidth: 8 -UseTab: Never +ColumnLimit: 90 diff --git a/cpp/.gitignore b/cpp/.gitignore index 4910544ec87..ec846b35ba6 100644 --- a/cpp/.gitignore +++ b/cpp/.gitignore @@ -1,3 +1,20 @@ +# Licensed to the Apache Software Foundation (ASF) under one +# or more contributor license agreements. See the NOTICE file +# distributed with this work for additional information +# regarding copyright ownership. The ASF licenses this file +# to you under the Apache License, Version 2.0 (the +# "License"); you may not use this file except in compliance +# with the License. You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, +# software distributed under the License is distributed on an +# "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY +# KIND, either express or implied. See the License for the +# specific language governing permissions and limitations +# under the License. 
+ thirdparty/ CMakeFiles/ CMakeCache.txt diff --git a/cpp/CMakeLists.txt b/cpp/CMakeLists.txt index 2891a5d7618..07b8e15b504 100644 --- a/cpp/CMakeLists.txt +++ b/cpp/CMakeLists.txt @@ -162,10 +162,14 @@ if("${CMAKE_SOURCE_DIR}" STREQUAL "${CMAKE_CURRENT_SOURCE_DIR}") "Build with zstd compression" ON) + option(ARROW_VERBOSE_THIRDPARTY_BUILD + "If off, output from ExternalProjects will be logged to files rather than shown" + ON) + if (MSVC) set(BROTLI_MSVC_STATIC_LIB_SUFFIX "_static" CACHE STRING "Brotli static lib suffix used on Windows with MSVC (default _static)") - set(SNAPPY_MSVC_STATIC_LIB_SUFFIX "" CACHE STRING + set(SNAPPY_MSVC_STATIC_LIB_SUFFIX "_static" CACHE STRING "Snappy static lib suffix used on Windows with MSVC (default is empty string)") set(ZLIB_MSVC_STATIC_LIB_SUFFIX "libstatic" CACHE STRING "Zlib static lib suffix used on Windows with MSVC (default libstatic)") @@ -303,8 +307,28 @@ include_directories(src) # For generate_export_header() and add_compiler_export_flags(). include(GenerateExportHeader) -# Sets -fvisibility=hidden for gcc -add_compiler_export_flags() +# Adapted from Apache Kudu: https://github.com/apache/kudu/commit/bd549e13743a51013585 +# Honor visibility properties for all target types. See +# "cmake --help-policy CMP0063" for details. +# +# This policy was only added to cmake in version 3.3, so until the cmake in +# thirdparty is updated, we must check if the policy exists before setting it. +if(POLICY CMP0063) + cmake_policy(SET CMP0063 NEW) +endif() + +if (PARQUET_BUILD_SHARED) + if (POLICY CMP0063) + set_target_properties(arrow_shared + PROPERTIES + C_VISIBILITY_PRESET hidden + CXX_VISIBILITY_PRESET hidden + VISIBILITY_INLINES_HIDDEN 1) + else() + # Sets -fvisibility=hidden for gcc + add_compiler_export_flags() + endif() +endif() ############################################################ # Benchmarking @@ -582,20 +606,6 @@ if (ARROW_STATIC_LINK_LIBS) add_dependencies(arrow_dependencies ${ARROW_STATIC_LINK_LIBS}) endif() -set(ARROW_MIN_TEST_LIBS - arrow_static - ${ARROW_STATIC_LINK_LIBS} - gtest - gtest_main) - -if(NOT MSVC) - set(ARROW_MIN_TEST_LIBS - ${ARROW_MIN_TEST_LIBS} - ${CMAKE_DL_LIBS}) -endif() - -set(ARROW_TEST_LINK_LIBS ${ARROW_MIN_TEST_LIBS}) - set(ARROW_BENCHMARK_LINK_LIBS arrow_static arrow_benchmark_main @@ -618,6 +628,20 @@ if (NOT MSVC) ${CMAKE_DL_LIBS}) endif() +set(ARROW_MIN_TEST_LIBS + arrow_static + ${ARROW_STATIC_LINK_LIBS} + gtest + gtest_main) + +if(NOT MSVC) + set(ARROW_MIN_TEST_LIBS + ${ARROW_MIN_TEST_LIBS} + ${CMAKE_DL_LIBS}) +endif() + +set(ARROW_TEST_LINK_LIBS ${ARROW_MIN_TEST_LIBS}) + if (ARROW_JEMALLOC) add_definitions(-DARROW_JEMALLOC) # In the case that jemalloc is only available as a shared library also use it to diff --git a/cpp/README.md b/cpp/README.md index 5bb516fc99b..2f98b085115 100644 --- a/cpp/README.md +++ b/cpp/README.md @@ -1,15 +1,20 @@ # Arrow C++ @@ -77,11 +82,24 @@ Benchmark logs will be placed in the build directory under `build/benchmark-logs To set up your own specific build toolchain, here are the relevant environment variables +* Boost: `BOOST_ROOT` * Googletest: `GTEST_HOME` (only required to build the unit tests) +* gflags: `GFLAGS_HOME` (only required to build the unit tests) * Google Benchmark: `GBENCHMARK_HOME` (only required if building benchmarks) * Flatbuffers: `FLATBUFFERS_HOME` (only required for the IPC extensions) * Hadoop: `HADOOP_HOME` (only required for the HDFS I/O extensions) -* jemalloc: `JEMALLOC_HOME` (only required for the jemalloc-based memory pool) +* jemalloc: 
`JEMALLOC_HOME` +* brotli: `BROTLI_HOME`, can be disabled with `-DARROW_WITH_BROTLI=off` +* lz4: `LZ4_HOME`, can be disabled with `-DARROW_WITH_LZ4=off` +* snappy: `SNAPPY_HOME`, can be disabled with `-DARROW_WITH_SNAPPY=off` +* zlib: `ZLIB_HOME`, can be disabled with `-DARROW_WITH_ZLIB=off` +* zstd: `ZSTD_HOME`, can be disabled with `-DARROW_WITH_ZSTD=off` + +If you have all of your toolchain libraries installed at the same prefix, you +can use the environment variable `$ARROW_BUILD_TOOLCHAIN` to automatically set +all of these variables. Note that `ARROW_BUILD_TOOLCHAIN` will not set +`BOOST_ROOT`, so if you have a custom Boost installation, you must set this +environment variable separately. ### Building Python integration library @@ -102,6 +120,35 @@ directoy: This requires [Doxygen](http://www.doxygen.org) to be installed. +## Development + +This project follows [Google's C++ Style Guide][3] with minor exceptions. We do +not encourage anonymous namespaces and we relax the line length restriction to +90 characters. + +### Error Handling and Exceptions + +For error handling, we use `arrow::Status` values instead of throwing C++ +exceptions. Since the Arrow C++ libraries are intended to be useful as a +component in larger C++ projects, using `Status` objects can help with good +code hygiene by making explicit when a function is expected to be able to fail. + +For expressing invariants and "cannot fail" errors, we use DCHECK macros +defined in `arrow/util/logging.h`. These checks are disabled in release builds +and are intended to catch internal development errors, particularly when +refactoring. These macros are not to be included in any public header files. + +Since we do not use exceptions, we avoid doing expensive work in object +constructors. Objects that are expensive to construct may often have private +constructors, with public static factory methods that return `Status`. + +There are a number of object constructors, like `arrow::Schema` and +`arrow::RecordBatch` where larger STL container objects like `std::vector` may +be created. While it is possible for `std::bad_alloc` to be thrown in these +constructors, the circumstances where this would happen are somewhat esoteric, and it +is likely that an application would have encountered other more serious +problems prior to having `std::bad_alloc` thrown in a constructor. + ## Continuous Integration Pull requests are run through travis-ci for continuous integration. You can avoid @@ -109,9 +156,8 @@ build failures by running the following checks before submitting your pull reque make unittest make lint - # The next two commands may change your code. It is recommended you commit - # before running them. - make clang-tidy # requires clang-tidy is installed + # The next command may change your code. It is recommended you commit + # before running it. make format # requires clang-format is installed Note that the clang-tidy target may take a while to run. You might consider @@ -127,3 +173,4 @@ both of these options would be used rarely.
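The error-handling guidance added to `cpp/README.md` above (Status return values, DCHECK invariants, and Status-returning factory methods in place of throwing constructors) can be illustrated with a short sketch. This is an editorial illustration, not code from the patch: the `LookupTable` class and its members are hypothetical, and it assumes only that `arrow::Status` (with `Status::OK()` / `Status::Invalid()`) and the DCHECK comparison macros from `arrow/util/logging.h` behave as the README text describes.

```cpp
// Minimal sketch of the pattern described in the README: a private
// constructor, a static factory method returning arrow::Status, and DCHECKs
// for internal invariants. The LookupTable class is made up for illustration.
#include <cstdint>
#include <memory>
#include <vector>

#include "arrow/status.h"
#include "arrow/util/logging.h"

class LookupTable {
 public:
  // Fallible construction goes through a factory method, so the caller must
  // check the returned Status instead of handling an exception.
  static arrow::Status Make(int64_t size, std::shared_ptr<LookupTable>* out) {
    if (size <= 0) {
      return arrow::Status::Invalid("size must be positive");
    }
    *out = std::shared_ptr<LookupTable>(new LookupTable(size));
    return arrow::Status::OK();
  }

  int64_t Get(int64_t index) const {
    // Internal invariant: callers are expected to pass a valid index.
    // Per the README note, these checks are compiled out of release builds.
    DCHECK_GE(index, 0);
    DCHECK_LT(index, static_cast<int64_t>(values_.size()));
    return values_[index];
  }

 private:
  explicit LookupTable(int64_t size) : values_(size, 0) {}

  std::vector<int64_t> values_;
};
```

A caller would then check the returned `Status` (for example with a convenience macro such as Arrow's `RETURN_NOT_OK`) rather than wrapping construction in a try/catch block.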
Current known uses-cases whent hey [1]: https://brew.sh/ [2]: https://github.com/apache/arrow/blob/master/cpp/apidoc/Windows.md +[3]: https://google.github.io/styleguide/cppguide.html \ No newline at end of file diff --git a/cpp/apidoc/Doxyfile b/cpp/apidoc/Doxyfile index 31276624133..f32ad5425da 100644 --- a/cpp/apidoc/Doxyfile +++ b/cpp/apidoc/Doxyfile @@ -833,50 +833,17 @@ INPUT_ENCODING = UTF-8 # *.m, *.markdown, *.md, *.mm, *.dox, *.py, *.pyw, *.f90, *.f95, *.f03, *.f08, # *.f, *.for, *.tcl, *.vhd, *.vhdl, *.ucf and *.qsf. -FILE_PATTERNS = *.c \ - *.cc \ - *.cxx \ - *.cpp \ - *.c++ \ - *.java \ - *.ii \ - *.ixx \ - *.ipp \ - *.i++ \ - *.inl \ - *.idl \ - *.ddl \ - *.odl \ - *.h \ +FILE_PATTERNS = *.h \ *.hh \ *.hxx \ *.hpp \ - *.h++ \ - *.cs \ - *.d \ - *.php \ - *.php4 \ - *.php5 \ - *.phtml \ *.inc \ *.m \ *.markdown \ *.md \ *.mm \ *.dox \ - *.py \ - *.pyw \ - *.f90 \ - *.f95 \ - *.f03 \ - *.f08 \ - *.f \ - *.for \ - *.tcl \ - *.vhd \ - *.vhdl \ - *.ucf \ - *.qsf + *.py # The RECURSIVE tag can be used to specify whether or not subdirectories should # be searched for input files as well. @@ -908,6 +875,7 @@ EXCLUDE_SYMLINKS = NO # exclude all test directories for example use the pattern */test/* EXCLUDE_PATTERNS = *-test.cc \ + *test* \ *_generated.h \ *-benchmark.cc @@ -920,7 +888,11 @@ EXCLUDE_PATTERNS = *-test.cc \ # Note that the wildcards are matched against the file with absolute path, so to # exclude all test directories use the pattern */test/* -EXCLUDE_SYMBOLS = +EXCLUDE_SYMBOLS = detail +EXCLUDE_SYMBOLS += internal +EXCLUDE_SYMBOLS += _* +EXCLUDE_SYMBOLS += BitUtil +EXCLUDE_SYMBOLS += SSEUtil # The EXAMPLE_PATH tag can be used to specify one or more files or directories # that contain example code fragments that are included (see the \include @@ -2060,7 +2032,7 @@ ENABLE_PREPROCESSING = YES # The default value is: NO. # This tag requires that the tag ENABLE_PREPROCESSING is set to YES. -MACRO_EXPANSION = NO +MACRO_EXPANSION = YES # If the EXPAND_ONLY_PREDEF and MACRO_EXPANSION tags are both set to YES then # the macro expansion is limited to the macros specified with the PREDEFINED and @@ -2068,7 +2040,7 @@ MACRO_EXPANSION = NO # The default value is: NO. # This tag requires that the tag ENABLE_PREPROCESSING is set to YES. -EXPAND_ONLY_PREDEF = NO +EXPAND_ONLY_PREDEF = YES # If the SEARCH_INCLUDES tag is set to YES, the include files in the # INCLUDE_PATH will be searched if a #include is found. @@ -2100,7 +2072,10 @@ INCLUDE_FILE_PATTERNS = # recursively expanded use the := operator instead of the = operator. # This tag requires that the tag ENABLE_PREPROCESSING is set to YES. -PREDEFINED = +PREDEFINED = __attribute__(x)= \ + __declspec(x)= \ + ARROW_EXPORT= \ + ARROW_EXTERN_TEMPLATE= # If the MACRO_EXPANSION and EXPAND_ONLY_PREDEF tags are set to YES then this # tag can be used to specify a list of macro names that should be expanded. 
The diff --git a/cpp/apidoc/HDFS.md b/cpp/apidoc/HDFS.md index 180d31e54d5..d54ad270c05 100644 --- a/cpp/apidoc/HDFS.md +++ b/cpp/apidoc/HDFS.md @@ -1,15 +1,20 @@ ## Using Arrow's HDFS (Apache Hadoop Distributed File System) interface @@ -72,4 +77,3 @@ If you get an error about needing to install Java 6, then add *BundledApp* and https://oliverdowling.com.au/2015/10/09/oracles-jre-8-on-mac-os-x-el-capitan/ https://derflounder.wordpress.com/2015/08/08/modifying-oracles-java-sdk-to-run-java-applications-on-os-x/ - diff --git a/cpp/apidoc/Windows.md b/cpp/apidoc/Windows.md index 6bfb951548a..30b7b8f3ce2 100644 --- a/cpp/apidoc/Windows.md +++ b/cpp/apidoc/Windows.md @@ -1,15 +1,20 @@ # Developing Arrow C++ on Windows @@ -26,7 +31,7 @@ other development instructions for Windows here. [Miniconda][1] is a minimal Python distribution including the conda package manager. To get started, download and install a 64-bit distribution. -We recommend using packages from [conda-forge][2]. +We recommend using packages from [conda-forge][2]. Launch cmd.exe and run following commands: ```shell @@ -46,7 +51,7 @@ previous step: activate arrow-dev ``` -We are using [cmake][4] tool to support Windows builds. +We are using [cmake][4] tool to support Windows builds. To allow cmake to pick up 3rd party dependencies, you should set `ARROW_BUILD_TOOLCHAIN` environment variable to contain `Library` folder path of new created on previous step `arrow-dev` conda environment. @@ -71,16 +76,16 @@ As alternative to `ARROW_BUILD_TOOLCHAIN`, it's possible to configure path to each 3rd party dependency separately by setting appropriate environment variable: -`FLATBUFFERS_HOME` variable with path to `flatbuffers` installation -`RAPIDJSON_HOME` variable with path to `rapidjson` installation -`GFLAGS_HOME` variable with path to `gflags` installation -`SNAPPY_HOME` variable with path to `snappy` installation -`ZLIB_HOME` variable with path to `zlib` installation -`BROTLI_HOME` variable with path to `brotli` installation -`LZ4_HOME` variable with path to `lz4` installation +`FLATBUFFERS_HOME` variable with path to `flatbuffers` installation +`RAPIDJSON_HOME` variable with path to `rapidjson` installation +`GFLAGS_HOME` variable with path to `gflags` installation +`SNAPPY_HOME` variable with path to `snappy` installation +`ZLIB_HOME` variable with path to `zlib` installation +`BROTLI_HOME` variable with path to `brotli` installation +`LZ4_HOME` variable with path to `lz4` installation `ZSTD_HOME` variable with path to `zstd` installation -### Customize static libraries names lookup of 3rd party dependencies +### Customize static libraries names lookup of 3rd party dependencies If you decided to use pre-built 3rd party dependencies libs, it's possible to configure Arrow's cmake build script to search for customized names of 3rd diff --git a/cpp/apidoc/index.md b/cpp/apidoc/index.md index 4004e1ef42e..8389d16b4aa 100644 --- a/cpp/apidoc/index.md +++ b/cpp/apidoc/index.md @@ -2,17 +2,22 @@ Apache Arrow C++ API documentation {#index} ================================== Apache Arrow is a columnar in-memory analytics layer designed to accelerate diff --git a/cpp/apidoc/tutorials/row_wise_conversion.md b/cpp/apidoc/tutorials/row_wise_conversion.md index 1486fc2a4e0..e91c26e9da1 100644 --- a/cpp/apidoc/tutorials/row_wise_conversion.md +++ b/cpp/apidoc/tutorials/row_wise_conversion.md @@ -1,15 +1,20 @@ Convert a vector of row-wise data into an Arrow table @@ -118,7 +123,7 @@ To convert an Arrow table back into the same row-wise representation 
as in the above section, we first will check that the table conforms to our expected schema and then will build up the vector of rows incrementally. -For the check if the table is as expected, we can utilise solely its schema. +For the check if the table is as expected, we can utilise solely its schema. ``` // This is our input that was passed in from the outside. diff --git a/cpp/build-support/build-lz4-lib.sh b/cpp/build-support/build-lz4-lib.sh index 62805bae286..8cb5c18782a 100755 --- a/cpp/build-support/build-lz4-lib.sh +++ b/cpp/build-support/build-lz4-lib.sh @@ -1,16 +1,21 @@ #!/bin/sh # -# Licensed under the Apache License, Version 2.0 (the "License"); -# you may not use this file except in compliance with the License. -# You may obtain a copy of the License at +# Licensed to the Apache Software Foundation (ASF) under one +# or more contributor license agreements. See the NOTICE file +# distributed with this work for additional information +# regarding copyright ownership. The ASF licenses this file +# to you under the Apache License, Version 2.0 (the +# "License"); you may not use this file except in compliance +# with the License. You may obtain a copy of the License at # -# http://www.apache.org/licenses/LICENSE-2.0 +# http://www.apache.org/licenses/LICENSE-2.0 # -# Unless required by applicable law or agreed to in writing, software -# distributed under the License is distributed on an "AS IS" BASIS, -# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. -# See the License for the specific language governing permissions and -# limitations under the License. +# Unless required by applicable law or agreed to in writing, +# software distributed under the License is distributed on an +# "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY +# KIND, either express or implied. See the License for the +# specific language governing permissions and limitations +# under the License. # export CFLAGS="${CFLAGS} -O3 -fPIC" -make -j4 \ No newline at end of file +make -j4 diff --git a/cpp/build-support/build-zstd-lib.sh b/cpp/build-support/build-zstd-lib.sh index 62805bae286..8cb5c18782a 100755 --- a/cpp/build-support/build-zstd-lib.sh +++ b/cpp/build-support/build-zstd-lib.sh @@ -1,16 +1,21 @@ #!/bin/sh # -# Licensed under the Apache License, Version 2.0 (the "License"); -# you may not use this file except in compliance with the License. -# You may obtain a copy of the License at +# Licensed to the Apache Software Foundation (ASF) under one +# or more contributor license agreements. See the NOTICE file +# distributed with this work for additional information +# regarding copyright ownership. The ASF licenses this file +# to you under the Apache License, Version 2.0 (the +# "License"); you may not use this file except in compliance +# with the License. You may obtain a copy of the License at # -# http://www.apache.org/licenses/LICENSE-2.0 +# http://www.apache.org/licenses/LICENSE-2.0 # -# Unless required by applicable law or agreed to in writing, software -# distributed under the License is distributed on an "AS IS" BASIS, -# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. -# See the License for the specific language governing permissions and -# limitations under the License. +# Unless required by applicable law or agreed to in writing, +# software distributed under the License is distributed on an +# "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY +# KIND, either express or implied. 
See the License for the +# specific language governing permissions and limitations +# under the License. # export CFLAGS="${CFLAGS} -O3 -fPIC" -make -j4 \ No newline at end of file +make -j4 diff --git a/cpp/build-support/lz4_msbuild_wholeprogramoptimization_param.patch b/cpp/build-support/lz4_msbuild_wholeprogramoptimization_param.patch new file mode 100644 index 00000000000..ee0f8a12054 --- /dev/null +++ b/cpp/build-support/lz4_msbuild_wholeprogramoptimization_param.patch @@ -0,0 +1,225 @@ +# Licensed to the Apache Software Foundation (ASF) under one +# or more contributor license agreements. See the NOTICE file +# distributed with this work for additional information +# regarding copyright ownership. The ASF licenses this file +# to you under the Apache License, Version 2.0 (the +# "License"); you may not use this file except in compliance +# with the License. You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, +# software distributed under the License is distributed on an +# "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY +# KIND, either express or implied. See the License for the +# specific language governing permissions and limitations +# under the License. + +diff --git a/visual/VS2010/datagen/datagen.vcxproj b/visual/VS2010/datagen/datagen.vcxproj +index aaf81ad..67b716f 100644 +--- a/visual/VS2010/datagen/datagen.vcxproj ++++ b/visual/VS2010/datagen/datagen.vcxproj +@@ -39,15 +39,19 @@ + + Application + false +- true + Unicode + + + Application + false +- true + Unicode + ++ ++ true ++ ++ ++ true ++ + + + +diff --git a/visual/VS2010/frametest/frametest.vcxproj b/visual/VS2010/frametest/frametest.vcxproj +index 76d12c9..723571d 100644 +--- a/visual/VS2010/frametest/frametest.vcxproj ++++ b/visual/VS2010/frametest/frametest.vcxproj +@@ -39,15 +39,19 @@ + + Application + false +- true + Unicode + + + Application + false +- true + Unicode + ++ ++ true ++ ++ ++ true ++ + + + +diff --git a/visual/VS2010/fullbench-dll/fullbench-dll.vcxproj b/visual/VS2010/fullbench-dll/fullbench-dll.vcxproj +index c10552a..0c8f293 100644 +--- a/visual/VS2010/fullbench-dll/fullbench-dll.vcxproj ++++ b/visual/VS2010/fullbench-dll/fullbench-dll.vcxproj +@@ -39,15 +39,19 @@ + + Application + false +- true + Unicode + + + Application + false +- true + Unicode + ++ ++ true ++ ++ ++ true ++ + + + +diff --git a/visual/VS2010/fullbench/fullbench.vcxproj b/visual/VS2010/fullbench/fullbench.vcxproj +index e2d95c9..4cd88d0 100644 +--- a/visual/VS2010/fullbench/fullbench.vcxproj ++++ b/visual/VS2010/fullbench/fullbench.vcxproj +@@ -39,15 +39,19 @@ + + Application + false +- true + Unicode + + + Application + false +- true + Unicode + ++ ++ true ++ ++ ++ true ++ + + + +diff --git a/visual/VS2010/fuzzer/fuzzer.vcxproj b/visual/VS2010/fuzzer/fuzzer.vcxproj +index 85d6c9b..3ddc77d 100644 +--- a/visual/VS2010/fuzzer/fuzzer.vcxproj ++++ b/visual/VS2010/fuzzer/fuzzer.vcxproj +@@ -39,15 +39,19 @@ + + Application + false +- true + Unicode + + + Application + false +- true + Unicode + ++ ++ true ++ ++ ++ true ++ + + + +diff --git a/visual/VS2010/liblz4-dll/liblz4-dll.vcxproj b/visual/VS2010/liblz4-dll/liblz4-dll.vcxproj +index 389f13c..038a4d2 100644 +--- a/visual/VS2010/liblz4-dll/liblz4-dll.vcxproj ++++ b/visual/VS2010/liblz4-dll/liblz4-dll.vcxproj +@@ -40,15 +40,19 @@ + + DynamicLibrary + false +- true + Unicode + + + DynamicLibrary + false +- true + Unicode + ++ ++ true ++ ++ ++ true ++ + + + +diff --git 
a/visual/VS2010/liblz4/liblz4.vcxproj b/visual/VS2010/liblz4/liblz4.vcxproj +index a0b8000..9aad8c2 100644 +--- a/visual/VS2010/liblz4/liblz4.vcxproj ++++ b/visual/VS2010/liblz4/liblz4.vcxproj +@@ -39,15 +39,19 @@ + + StaticLibrary + false +- true + Unicode + + + StaticLibrary + false +- true + Unicode + ++ ++ true ++ ++ ++ true ++ + + + +diff --git a/visual/VS2010/lz4/lz4.vcxproj b/visual/VS2010/lz4/lz4.vcxproj +index 693e121..7e63f1e 100644 +--- a/visual/VS2010/lz4/lz4.vcxproj ++++ b/visual/VS2010/lz4/lz4.vcxproj +@@ -39,15 +39,19 @@ + + Application + false +- true + Unicode + + + Application + false +- true + Unicode + ++ ++ true ++ ++ ++ true ++ + + + diff --git a/cpp/build-support/run-clang-tidy.sh b/cpp/build-support/run-clang-tidy.sh index 4ba8ab8cd76..75e9458e257 100755 --- a/cpp/build-support/run-clang-tidy.sh +++ b/cpp/build-support/run-clang-tidy.sh @@ -1,16 +1,21 @@ #!/bin/bash # -# Licensed under the Apache License, Version 2.0 (the "License"); -# you may not use this file except in compliance with the License. -# You may obtain a copy of the License at +# Licensed to the Apache Software Foundation (ASF) under one +# or more contributor license agreements. See the NOTICE file +# distributed with this work for additional information +# regarding copyright ownership. The ASF licenses this file +# to you under the Apache License, Version 2.0 (the +# "License"); you may not use this file except in compliance +# with the License. You may obtain a copy of the License at # -# http://www.apache.org/licenses/LICENSE-2.0 +# http://www.apache.org/licenses/LICENSE-2.0 # -# Unless required by applicable law or agreed to in writing, software -# distributed under the License is distributed on an "AS IS" BASIS, -# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. -# See the License for the specific language governing permissions and -# limitations under the License. +# Unless required by applicable law or agreed to in writing, +# software distributed under the License is distributed on an +# "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY +# KIND, either express or implied. See the License for the +# specific language governing permissions and limitations +# under the License. # # # Runs clang format in the given directory @@ -27,7 +32,7 @@ shift APPLY_FIXES=$1 shift -# clang format will only find its configuration if we are in +# clang format will only find its configuration if we are in # the source tree or in a path relative to the source tree if [ "$APPLY_FIXES" == "1" ]; then $CLANG_TIDY -p $COMPILE_COMMANDS -fix $@ @@ -37,4 +42,4 @@ else echo "clang-tidy had suggested fixes. Please fix these!!!" 
exit 1 fi -fi +fi diff --git a/cpp/build-support/run_clang_format.py b/cpp/build-support/run_clang_format.py index ab800e641b5..ac4954ca570 100755 --- a/cpp/build-support/run_clang_format.py +++ b/cpp/build-support/run_clang_format.py @@ -57,5 +57,9 @@ # exit 1 # fi -subprocess.check_output([CLANG_FORMAT, '-i'] + files_to_format, - stderr=subprocess.STDOUT) +try: + subprocess.check_output([CLANG_FORMAT, '-i'] + files_to_format, + stderr=subprocess.STDOUT) +except Exception as e: + print(e) + raise diff --git a/cpp/build-support/zstd_msbuild_wholeprogramoptimization_param.patch b/cpp/build-support/zstd_msbuild_wholeprogramoptimization_param.patch new file mode 100644 index 00000000000..8bfb928947e --- /dev/null +++ b/cpp/build-support/zstd_msbuild_wholeprogramoptimization_param.patch @@ -0,0 +1,199 @@ +# Licensed to the Apache Software Foundation (ASF) under one +# or more contributor license agreements. See the NOTICE file +# distributed with this work for additional information +# regarding copyright ownership. The ASF licenses this file +# to you under the Apache License, Version 2.0 (the +# "License"); you may not use this file except in compliance +# with the License. You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, +# software distributed under the License is distributed on an +# "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY +# KIND, either express or implied. See the License for the +# specific language governing permissions and limitations +# under the License. + +diff --git a/build/VS2010/datagen/datagen.vcxproj b/build/VS2010/datagen/datagen.vcxproj +index bd8a213..8e4dc89 100644 +--- a/build/VS2010/datagen/datagen.vcxproj ++++ b/build/VS2010/datagen/datagen.vcxproj +@@ -39,15 +39,19 @@ + + Application + false +- true + MultiByte + + + Application + false +- true + MultiByte + ++ ++ true ++ ++ ++ true ++ + + + +diff --git a/build/VS2010/fullbench-dll/fullbench-dll.vcxproj b/build/VS2010/fullbench-dll/fullbench-dll.vcxproj +index e697318..82cd4ab 100644 +--- a/build/VS2010/fullbench-dll/fullbench-dll.vcxproj ++++ b/build/VS2010/fullbench-dll/fullbench-dll.vcxproj +@@ -39,15 +39,19 @@ + + Application + false +- true + MultiByte + + + Application + false +- true + MultiByte + ++ ++ true ++ ++ ++ true ++ + + + +diff --git a/build/VS2010/fullbench/fullbench.vcxproj b/build/VS2010/fullbench/fullbench.vcxproj +index 2bff4ca..ced4047 100644 +--- a/build/VS2010/fullbench/fullbench.vcxproj ++++ b/build/VS2010/fullbench/fullbench.vcxproj +@@ -39,15 +39,19 @@ + + Application + false +- true + MultiByte + + + Application + false +- true + MultiByte + ++ ++ true ++ ++ ++ true ++ + + + +diff --git a/build/VS2010/fuzzer/fuzzer.vcxproj b/build/VS2010/fuzzer/fuzzer.vcxproj +index 12a4b93..227efd1 100644 +--- a/build/VS2010/fuzzer/fuzzer.vcxproj ++++ b/build/VS2010/fuzzer/fuzzer.vcxproj +@@ -39,15 +39,19 @@ + + Application + false +- true + MultiByte + + + Application + false +- true + MultiByte + ++ ++ true ++ ++ ++ true ++ + + + +diff --git a/build/VS2010/libzstd-dll/libzstd-dll.vcxproj b/build/VS2010/libzstd-dll/libzstd-dll.vcxproj +index 364b3be..b227320 100644 +--- a/build/VS2010/libzstd-dll/libzstd-dll.vcxproj ++++ b/build/VS2010/libzstd-dll/libzstd-dll.vcxproj +@@ -94,15 +94,19 @@ + + DynamicLibrary + false +- true + MultiByte + + + DynamicLibrary + false +- true + MultiByte + ++ ++ true ++ ++ ++ true ++ + + + +diff --git a/build/VS2010/libzstd/libzstd.vcxproj 
b/build/VS2010/libzstd/libzstd.vcxproj +index 6087d73..51a0572 100644 +--- a/build/VS2010/libzstd/libzstd.vcxproj ++++ b/build/VS2010/libzstd/libzstd.vcxproj +@@ -91,15 +91,19 @@ + + StaticLibrary + false +- true + MultiByte + + + StaticLibrary + false +- true + MultiByte + ++ ++ true ++ ++ ++ true ++ + + + +diff --git a/build/VS2010/zstd/zstd.vcxproj b/build/VS2010/zstd/zstd.vcxproj +index 438dc61..834ae01 100644 +--- a/build/VS2010/zstd/zstd.vcxproj ++++ b/build/VS2010/zstd/zstd.vcxproj +@@ -100,15 +100,19 @@ + + Application + false +- true + MultiByte + + + Application + false +- true + MultiByte + ++ ++ true ++ ++ ++ true ++ + + + diff --git a/cpp/cmake_modules/SnappyCMakeLists.txt b/cpp/cmake_modules/SnappyCMakeLists.txt index 9d0a166064e..50083ce405e 100644 --- a/cpp/cmake_modules/SnappyCMakeLists.txt +++ b/cpp/cmake_modules/SnappyCMakeLists.txt @@ -68,10 +68,10 @@ set(SNAPPY_SRCS snappy.cc snappy-stubs-public.h) add_library(snappy SHARED ${SNAPPY_SRCS}) -add_library(snappystatic STATIC ${SNAPPY_SRCS}) +add_library(snappy_static STATIC ${SNAPPY_SRCS}) TARGET_COMPILE_DEFINITIONS(snappy PRIVATE -DHAVE_CONFIG_H) -TARGET_COMPILE_DEFINITIONS(snappystatic PRIVATE -DHAVE_CONFIG_H) +TARGET_COMPILE_DEFINITIONS(snappy_static PRIVATE -DHAVE_CONFIG_H) install(FILES snappy.h snappy-c.h @@ -79,7 +79,7 @@ install(FILES snappy.h ${snappy_BINARY_DIR}/snappy-stubs-public.h DESTINATION include) -install(TARGETS snappy snappystatic +install(TARGETS snappy snappy_static RUNTIME DESTINATION bin LIBRARY DESTINATION lib ARCHIVE DESTINATION lib) diff --git a/cpp/cmake_modules/ThirdpartyToolchain.cmake b/cpp/cmake_modules/ThirdpartyToolchain.cmake index b9d9823e80c..1271b8a4ab3 100644 --- a/cpp/cmake_modules/ThirdpartyToolchain.cmake +++ b/cpp/cmake_modules/ThirdpartyToolchain.cmake @@ -35,6 +35,16 @@ string(TOUPPER ${CMAKE_BUILD_TYPE} UPPERCASE_BUILD_TYPE) set(EP_CXX_FLAGS "${CMAKE_CXX_FLAGS} ${CMAKE_CXX_FLAGS_${UPPERCASE_BUILD_TYPE}}") set(EP_C_FLAGS "${CMAKE_C_FLAGS} ${CMAKE_C_FLAGS_${UPPERCASE_BUILD_TYPE}}") +if (NOT ARROW_VERBOSE_THIRDPARTY_BUILD) + set(EP_LOG_OPTIONS + LOG_CONFIGURE 1 + LOG_BUILD 1 + LOG_INSTALL 1 + LOG_DOWNLOAD 1) +else() + set(EP_LOG_OPTIONS) +endif() + if (NOT MSVC) # Set -fPIC on all external projects set(EP_CXX_FLAGS "${EP_CXX_FLAGS} -fPIC") @@ -205,7 +215,8 @@ if(ARROW_BUILD_TESTS OR ARROW_BUILD_BENCHMARKS) ExternalProject_Add(googletest_ep URL "https://github.com/google/googletest/archive/release-${GTEST_VERSION}.tar.gz" BUILD_BYPRODUCTS ${GTEST_STATIC_LIB} ${GTEST_MAIN_STATIC_LIB} - CMAKE_ARGS ${GTEST_CMAKE_ARGS}) + CMAKE_ARGS ${GTEST_CMAKE_ARGS} + ${EP_LOG_OPTIONS}) else() find_package(GTest REQUIRED) set(GTEST_VENDORED 0) @@ -250,6 +261,7 @@ if(ARROW_BUILD_TESTS OR ARROW_BUILD_BENCHMARKS) ExternalProject_Add(gflags_ep URL ${GFLAGS_URL} + ${EP_LOG_OPTIONS} BUILD_IN_SOURCE 1 BUILD_BYPRODUCTS "${GFLAGS_STATIC_LIB}" CMAKE_ARGS ${GFLAGS_CMAKE_ARGS}) @@ -300,7 +312,8 @@ if(ARROW_BUILD_BENCHMARKS) ExternalProject_Add(gbenchmark_ep URL "https://github.com/google/benchmark/archive/v${GBENCHMARK_VERSION}.tar.gz" BUILD_BYPRODUCTS "${GBENCHMARK_STATIC_LIB}" - CMAKE_ARGS ${GBENCHMARK_CMAKE_ARGS}) + CMAKE_ARGS ${GBENCHMARK_CMAKE_ARGS} + ${EP_LOG_OPTIONS}) else() find_package(GBenchmark REQUIRED) set(GBENCHMARK_VENDORED 0) @@ -327,6 +340,7 @@ if (ARROW_IPC) CONFIGURE_COMMAND "" BUILD_COMMAND "" BUILD_IN_SOURCE 1 + ${EP_LOG_OPTIONS} INSTALL_COMMAND "") ExternalProject_Get_Property(rapidjson_ep SOURCE_DIR) @@ -356,7 +370,8 @@ if (ARROW_IPC) CMAKE_ARGS 
"-DCMAKE_CXX_FLAGS=${FLATBUFFERS_CMAKE_CXX_FLAGS}" "-DCMAKE_INSTALL_PREFIX:PATH=${FLATBUFFERS_PREFIX}" - "-DFLATBUFFERS_BUILD_TESTS=OFF") + "-DFLATBUFFERS_BUILD_TESTS=OFF" + ${EP_LOG_OPTIONS}) set(FLATBUFFERS_INCLUDE_DIR "${FLATBUFFERS_PREFIX}/include") set(FLATBUFFERS_COMPILER "${FLATBUFFERS_PREFIX}/bin/flatc") @@ -395,6 +410,7 @@ if (ARROW_JEMALLOC) ExternalProject_Add(jemalloc_ep URL https://github.com/jemalloc/jemalloc/releases/download/${JEMALLOC_VERSION}/jemalloc-${JEMALLOC_VERSION}.tar.bz2 CONFIGURE_COMMAND ./configure "--prefix=${JEMALLOC_PREFIX}" "--with-jemalloc-prefix=" + ${EP_LOG_OPTIONS} BUILD_IN_SOURCE 1 BUILD_COMMAND ${MAKE} BUILD_BYPRODUCTS "${JEMALLOC_STATIC_LIB}" "${JEMALLOC_SHARED_LIB}" @@ -475,6 +491,7 @@ if (ARROW_WITH_ZLIB) ExternalProject_Add(zlib_ep URL "http://zlib.net/fossils/zlib-1.2.8.tar.gz" + ${EP_LOG_OPTIONS} BUILD_BYPRODUCTS "${ZLIB_STATIC_LIB}" CMAKE_ARGS ${ZLIB_CMAKE_ARGS}) set(ZLIB_VENDORED 1) @@ -501,7 +518,7 @@ if (ARROW_WITH_SNAPPY) set(SNAPPY_HOME "${SNAPPY_PREFIX}") set(SNAPPY_INCLUDE_DIR "${SNAPPY_PREFIX}/include") if (MSVC) - set(SNAPPY_STATIC_LIB_NAME snappystatic) + set(SNAPPY_STATIC_LIB_NAME snappy_static) else() set(SNAPPY_STATIC_LIB_NAME snappy) endif() @@ -529,6 +546,7 @@ if (ARROW_WITH_SNAPPY) ./config.h) ExternalProject_Add(snappy_ep UPDATE_COMMAND ${SNAPPY_UPDATE_COMMAND} + ${EP_LOG_OPTIONS} BUILD_IN_SOURCE 1 BUILD_COMMAND ${MAKE} INSTALL_DIR ${SNAPPY_PREFIX} @@ -538,6 +556,7 @@ if (ARROW_WITH_SNAPPY) else() ExternalProject_Add(snappy_ep CONFIGURE_COMMAND ./configure --with-pic "--prefix=${SNAPPY_PREFIX}" ${SNAPPY_CXXFLAGS} + ${EP_LOG_OPTIONS} BUILD_IN_SOURCE 1 BUILD_COMMAND ${MAKE} INSTALL_DIR ${SNAPPY_PREFIX} @@ -586,6 +605,7 @@ if (ARROW_WITH_BROTLI) URL "https://github.com/google/brotli/archive/${BROTLI_VERSION}.tar.gz" BUILD_BYPRODUCTS "${BROTLI_STATIC_LIBRARY_ENC}" "${BROTLI_STATIC_LIBRARY_DEC}" "${BROTLI_STATIC_LIBRARY_COMMON}" ${BROTLI_BUILD_BYPRODUCTS} + ${EP_LOG_OPTIONS} CMAKE_ARGS ${BROTLI_CMAKE_ARGS} STEP_TARGETS headers_copy) if (MSVC) @@ -624,41 +644,43 @@ if (ARROW_WITH_LZ4) if("${LZ4_HOME}" STREQUAL "") set(LZ4_BUILD_DIR "${CMAKE_CURRENT_BINARY_DIR}/lz4_ep-prefix/src/lz4_ep") set(LZ4_INCLUDE_DIR "${LZ4_BUILD_DIR}/lib") - + if (MSVC) set(LZ4_STATIC_LIB "${LZ4_BUILD_DIR}/visual/VS2010/bin/x64_${CMAKE_BUILD_TYPE}/liblz4_static.lib") set(LZ4_BUILD_COMMAND BUILD_COMMAND msbuild.exe /m /p:Configuration=${CMAKE_BUILD_TYPE} /p:Platform=x64 /p:PlatformToolset=v140 /t:Build ${LZ4_BUILD_DIR}/visual/VS2010/lz4.sln) + set(LZ4_PATCH_COMMAND PATCH_COMMAND git --git-dir=. 
apply --verbose ${CMAKE_SOURCE_DIR}/build-support/lz4_msbuild_wholeprogramoptimization_param.patch) else() set(LZ4_STATIC_LIB "${LZ4_BUILD_DIR}/lib/liblz4.a") set(LZ4_BUILD_COMMAND BUILD_COMMAND ${CMAKE_SOURCE_DIR}/build-support/build-lz4-lib.sh) endif() - + ExternalProject_Add(lz4_ep URL "https://github.com/lz4/lz4/archive/v${LZ4_VERSION}.tar.gz" + ${EP_LOG_OPTIONS} UPDATE_COMMAND "" - PATCH_COMMAND "" + ${LZ4_PATCH_COMMAND} CONFIGURE_COMMAND "" INSTALL_COMMAND "" BINARY_DIR ${LZ4_BUILD_DIR} BUILD_BYPRODUCTS ${LZ4_STATIC_LIB} ${LZ4_BUILD_COMMAND} ) - + set(LZ4_VENDORED 1) else() find_package(Lz4 REQUIRED) set(LZ4_VENDORED 0) endif() - + include_directories(SYSTEM ${LZ4_INCLUDE_DIR}) ADD_THIRDPARTY_LIB(lz4_static STATIC_LIB ${LZ4_STATIC_LIB}) - + if (LZ4_VENDORED) add_dependencies(lz4_static lz4_ep) endif() endif() - + if (ARROW_WITH_ZSTD) # ---------------------------------------------------------------------- # ZSTD @@ -670,6 +692,7 @@ if (ARROW_WITH_ZSTD) if (MSVC) set(ZSTD_STATIC_LIB "${ZSTD_BUILD_DIR}/build/VS2010/bin/x64_${CMAKE_BUILD_TYPE}/libzstd_static.lib") set(ZSTD_BUILD_COMMAND BUILD_COMMAND msbuild ${ZSTD_BUILD_DIR}/build/VS2010/zstd.sln /t:Build /v:minimal /p:Configuration=${CMAKE_BUILD_TYPE} /p:Platform=x64 /p:PlatformToolset=v140 /p:OutDir=${ZSTD_BUILD_DIR}/build/VS2010/bin/x64_${CMAKE_BUILD_TYPE}/ /p:SolutionDir=${ZSTD_BUILD_DIR}/build/VS2010/ ) + set(ZSTD_PATCH_COMMAND PATCH_COMMAND git --git-dir=. apply --verbose ${CMAKE_SOURCE_DIR}/build-support/zstd_msbuild_wholeprogramoptimization_param.patch) else() set(ZSTD_STATIC_LIB "${ZSTD_BUILD_DIR}/lib/libzstd.a") set(ZSTD_BUILD_COMMAND BUILD_COMMAND ${CMAKE_SOURCE_DIR}/build-support/build-zstd-lib.sh) @@ -677,8 +700,9 @@ if (ARROW_WITH_ZSTD) ExternalProject_Add(zstd_ep URL "https://github.com/facebook/zstd/archive/v${ZSTD_VERSION}.tar.gz" + ${EP_LOG_OPTIONS} UPDATE_COMMAND "" - PATCH_COMMAND "" + ${ZSTD_PATCH_COMMAND} CONFIGURE_COMMAND "" INSTALL_COMMAND "" BINARY_DIR ${ZSTD_BUILD_DIR} diff --git a/cpp/doc/Parquet.md b/cpp/doc/Parquet.md index ce2961ab26a..0ed100731ca 100644 --- a/cpp/doc/Parquet.md +++ b/cpp/doc/Parquet.md @@ -1,15 +1,20 @@ ## Building Arrow-Parquet integration diff --git a/cpp/src/arrow/allocator-test.cc b/cpp/src/arrow/allocator-test.cc index 5a4e98d7660..f3a80cdae81 100644 --- a/cpp/src/arrow/allocator-test.cc +++ b/cpp/src/arrow/allocator-test.cc @@ -48,7 +48,7 @@ TEST(stl_allocator, FreeLargeMemory) { #ifndef NDEBUG EXPECT_EXIT(alloc.deallocate(data, 120), ::testing::ExitedWithCode(1), - ".*Check failed: \\(bytes_allocated_\\) >= \\(size\\)"); + ".*Check failed: \\(bytes_allocated_\\) >= \\(size\\)"); #endif alloc.deallocate(data, 100); diff --git a/cpp/src/arrow/api.h b/cpp/src/arrow/api.h index 731f23918e4..4d731bd32bf 100644 --- a/cpp/src/arrow/api.h +++ b/cpp/src/arrow/api.h @@ -32,4 +32,7 @@ #include "arrow/type.h" #include "arrow/visitor.h" +/// \brief Top-level namespace for Apache Arrow C++ API +namespace arrow {} + #endif // ARROW_API_H diff --git a/cpp/src/arrow/array-decimal-test.cc b/cpp/src/arrow/array-decimal-test.cc index 0959d686498..436ce9cf7c3 100644 --- a/cpp/src/arrow/array-decimal-test.cc +++ b/cpp/src/arrow/array-decimal-test.cc @@ -28,12 +28,12 @@ namespace decimal { template class DecimalTestBase { public: - virtual std::vector data( - const std::vector& input, size_t byte_width) const = 0; + virtual std::vector data(const std::vector& input, + size_t byte_width) const = 0; void test(int precision, const std::vector& draw, - const std::vector& valid_bytes, - const std::vector& 
sign_bitmap = {}, int64_t offset = 0) const { + const std::vector& valid_bytes, + const std::vector& sign_bitmap = {}, int64_t offset = 0) const { auto type = std::make_shared(precision, 4); int byte_width = type->byte_width(); auto pool = default_memory_pool(); @@ -63,8 +63,9 @@ class DecimalTestBase { ASSERT_OK(BitUtil::BytesToBits(valid_bytes, &expected_null_bitmap)); int64_t expected_null_count = test::null_count(valid_bytes); - auto expected = std::make_shared(type, size, expected_data, - expected_null_bitmap, expected_null_count, offset, expected_sign_bitmap); + auto expected = + std::make_shared(type, size, expected_data, expected_null_bitmap, + expected_null_count, offset, expected_sign_bitmap); std::shared_ptr out; ASSERT_OK(builder->Finish(&out)); @@ -75,8 +76,8 @@ class DecimalTestBase { template class DecimalTest : public DecimalTestBase { public: - std::vector data( - const std::vector& input, size_t byte_width) const override { + std::vector data(const std::vector& input, + size_t byte_width) const override { std::vector result(input.size() * byte_width); // TODO(phillipc): There's probably a better way to do this constexpr static const size_t bytes_per_element = sizeof(T); @@ -90,8 +91,8 @@ class DecimalTest : public DecimalTestBase { template <> class DecimalTest : public DecimalTestBase { public: - std::vector data( - const std::vector& input, size_t byte_width) const override { + std::vector data(const std::vector& input, + size_t byte_width) const override { std::vector result; result.reserve(input.size() * byte_width); constexpr static const size_t bytes_per_element = 16; @@ -120,24 +121,24 @@ class Decimal128BuilderTest : public ::testing::TestWithParam, TEST_P(Decimal32BuilderTest, NoNulls) { int precision = GetParam(); - std::vector draw = { - Decimal32(1), Decimal32(2), Decimal32(2389), Decimal32(4), Decimal32(-12348)}; + std::vector draw = {Decimal32(1), Decimal32(2), Decimal32(2389), + Decimal32(4), Decimal32(-12348)}; std::vector valid_bytes = {true, true, true, true, true}; this->test(precision, draw, valid_bytes); } TEST_P(Decimal64BuilderTest, NoNulls) { int precision = GetParam(); - std::vector draw = { - Decimal64(1), Decimal64(2), Decimal64(2389), Decimal64(4), Decimal64(-12348)}; + std::vector draw = {Decimal64(1), Decimal64(2), Decimal64(2389), + Decimal64(4), Decimal64(-12348)}; std::vector valid_bytes = {true, true, true, true, true}; this->test(precision, draw, valid_bytes); } TEST_P(Decimal128BuilderTest, NoNulls) { int precision = GetParam(); - std::vector draw = { - Decimal128(1), Decimal128(-2), Decimal128(2389), Decimal128(4), Decimal128(-12348)}; + std::vector draw = {Decimal128(1), Decimal128(-2), Decimal128(2389), + Decimal128(4), Decimal128(-12348)}; std::vector valid_bytes = {true, true, true, true, true}; std::vector sign_bitmap = {false, true, false, false, true}; this->test(precision, draw, valid_bytes, sign_bitmap); @@ -145,41 +146,47 @@ TEST_P(Decimal128BuilderTest, NoNulls) { TEST_P(Decimal32BuilderTest, WithNulls) { int precision = GetParam(); - std::vector draw = { - Decimal32(1), Decimal32(2), Decimal32(-1), Decimal32(4), Decimal32(-1)}; + std::vector draw = {Decimal32(1), Decimal32(2), Decimal32(-1), Decimal32(4), + Decimal32(-1)}; std::vector valid_bytes = {true, true, false, true, false}; this->test(precision, draw, valid_bytes); } TEST_P(Decimal64BuilderTest, WithNulls) { int precision = GetParam(); - std::vector draw = { - Decimal64(-1), Decimal64(2), Decimal64(-1), Decimal64(4), Decimal64(-1)}; + std::vector draw = 
{Decimal64(-1), Decimal64(2), Decimal64(-1), Decimal64(4), + Decimal64(-1)}; std::vector valid_bytes = {true, true, false, true, false}; this->test(precision, draw, valid_bytes); } TEST_P(Decimal128BuilderTest, WithNulls) { int precision = GetParam(); - std::vector draw = {Decimal128(1), Decimal128(2), Decimal128(-1), - Decimal128(4), Decimal128(-1), Decimal128(1), Decimal128(2), - Decimal128("230342903942.234234"), Decimal128("-23049302932.235234")}; - std::vector valid_bytes = { - true, true, false, true, false, true, true, true, true}; - std::vector sign_bitmap = { - false, false, false, false, false, false, false, false, true}; + std::vector draw = {Decimal128(1), + Decimal128(2), + Decimal128(-1), + Decimal128(4), + Decimal128(-1), + Decimal128(1), + Decimal128(2), + Decimal128("230342903942.234234"), + Decimal128("-23049302932.235234")}; + std::vector valid_bytes = {true, true, false, true, false, + true, true, true, true}; + std::vector sign_bitmap = {false, false, false, false, false, + false, false, false, true}; this->test(precision, draw, valid_bytes, sign_bitmap); } INSTANTIATE_TEST_CASE_P(Decimal32BuilderTest, Decimal32BuilderTest, - ::testing::Range( - DecimalPrecision::minimum, DecimalPrecision::maximum)); + ::testing::Range(DecimalPrecision::minimum, + DecimalPrecision::maximum)); INSTANTIATE_TEST_CASE_P(Decimal64BuilderTest, Decimal64BuilderTest, - ::testing::Range( - DecimalPrecision::minimum, DecimalPrecision::maximum)); + ::testing::Range(DecimalPrecision::minimum, + DecimalPrecision::maximum)); INSTANTIATE_TEST_CASE_P(Decimal128BuilderTest, Decimal128BuilderTest, - ::testing::Range( - DecimalPrecision::minimum, DecimalPrecision::maximum)); + ::testing::Range(DecimalPrecision::minimum, + DecimalPrecision::maximum)); } // namespace decimal } // namespace arrow diff --git a/cpp/src/arrow/array-test.cc b/cpp/src/arrow/array-test.cc index acb4819dd09..0efb51ccece 100644 --- a/cpp/src/arrow/array-test.cc +++ b/cpp/src/arrow/array-test.cc @@ -64,8 +64,8 @@ TEST_F(TestArray, TestLength) { ASSERT_EQ(arr->length(), 100); } -Status MakeArrayFromValidBytes( - const vector& v, MemoryPool* pool, std::shared_ptr* out) { +Status MakeArrayFromValidBytes(const vector& v, MemoryPool* pool, + std::shared_ptr* out) { int64_t null_count = v.size() - std::accumulate(v.begin(), v.end(), 0); std::shared_ptr null_buf; @@ -147,7 +147,9 @@ TEST_F(TestArray, TestIsNull) { // clang-format on int64_t null_count = 0; for (uint8_t x : null_bitmap) { - if (x == 0) { ++null_count; } + if (x == 0) { + ++null_count; + } } std::shared_ptr null_buf; @@ -223,8 +225,8 @@ class TestPrimitiveBuilder : public TestBuilder { void Check(const std::unique_ptr& builder, bool nullable) { int64_t size = builder->length(); - auto ex_data = std::make_shared( - reinterpret_cast(draws_.data()), size * sizeof(T)); + auto ex_data = std::make_shared(reinterpret_cast(draws_.data()), + size * sizeof(T)); std::shared_ptr ex_null_bitmap; int64_t ex_null_count = 0; @@ -316,8 +318,8 @@ void TestPrimitiveBuilder::RandomData(int64_t N, double pct_null) { } template <> -void TestPrimitiveBuilder::Check( - const std::unique_ptr& builder, bool nullable) { +void TestPrimitiveBuilder::Check(const std::unique_ptr& builder, + bool nullable) { int64_t size = builder->length(); std::shared_ptr ex_data; @@ -351,7 +353,9 @@ void TestPrimitiveBuilder::Check( ASSERT_EQ(expected->length(), result->length()); for (int64_t i = 0; i < result->length(); ++i) { - if (nullable) { ASSERT_EQ(valid_bytes_[i] == 0, result->IsNull(i)) << i; } + if (nullable) 
{ + ASSERT_EQ(valid_bytes_[i] == 0, result->IsNull(i)) << i; + } bool actual = BitUtil::GetBit(result->values()->data(), i); ASSERT_EQ(draws_[i] != 0, actual) << i; } @@ -359,7 +363,7 @@ void TestPrimitiveBuilder::Check( } typedef ::testing::Types + PInt32, PInt64, PFloat, PDouble> Primitives; TYPED_TEST_CASE(TestPrimitiveBuilder, Primitives); @@ -377,7 +381,7 @@ TYPED_TEST(TestPrimitiveBuilder, TestInit) { ASSERT_OK(this->builder_->Reserve(n)); ASSERT_EQ(BitUtil::NextPower2(n), this->builder_->capacity()); ASSERT_EQ(BitUtil::NextPower2(TypeTraits::bytes_required(n)), - this->builder_->data()->size()); + this->builder_->data()->size()); // unsure if this should go in all builder classes ASSERT_EQ(0, this->builder_->num_children()); @@ -440,8 +444,8 @@ TYPED_TEST(TestPrimitiveBuilder, Equality) { ASSERT_OK(MakeArray(valid_bytes, draws, size, builder, &equal_array)); // Make the not equal array by negating the first valid element with itself. - const auto first_valid = std::find_if( - valid_bytes.begin(), valid_bytes.end(), [](uint8_t valid) { return valid > 0; }); + const auto first_valid = std::find_if(valid_bytes.begin(), valid_bytes.end(), + [](uint8_t valid) { return valid > 0; }); const int64_t first_valid_idx = std::distance(valid_bytes.begin(), first_valid); // This should be true with a very high probability, but might introduce flakiness ASSERT_LT(first_valid_idx, size - 1); @@ -679,8 +683,8 @@ class TestStringArray : public ::testing::Test { ASSERT_OK(BitUtil::BytesToBits(valid_bytes_, &null_bitmap_)); null_count_ = test::null_count(valid_bytes_); - strings_ = std::make_shared( - length_, offsets_buf_, value_buf_, null_bitmap_, null_count_); + strings_ = std::make_shared(length_, offsets_buf_, value_buf_, + null_bitmap_, null_count_); } protected: @@ -723,8 +727,8 @@ TEST_F(TestStringArray, TestListFunctions) { } TEST_F(TestStringArray, TestDestructor) { - auto arr = std::make_shared( - length_, offsets_buf_, value_buf_, null_bitmap_, null_count_); + auto arr = std::make_shared(length_, offsets_buf_, value_buf_, + null_bitmap_, null_count_); } TEST_F(TestStringArray, TestGetString) { @@ -742,10 +746,10 @@ TEST_F(TestStringArray, TestEmptyStringComparison) { offsets_buf_ = test::GetBufferFromVector(offsets_); length_ = static_cast(offsets_.size() - 1); - auto strings_a = std::make_shared( - length_, offsets_buf_, nullptr, null_bitmap_, null_count_); - auto strings_b = std::make_shared( - length_, offsets_buf_, nullptr, null_bitmap_, null_count_); + auto strings_a = std::make_shared(length_, offsets_buf_, nullptr, + null_bitmap_, null_count_); + auto strings_b = std::make_shared(length_, offsets_buf_, nullptr, + null_bitmap_, null_count_); ASSERT_TRUE(strings_a->Equals(strings_b)); } @@ -893,8 +897,8 @@ class TestBinaryArray : public ::testing::Test { ASSERT_OK(BitUtil::BytesToBits(valid_bytes_, &null_bitmap_)); null_count_ = test::null_count(valid_bytes_); - strings_ = std::make_shared( - length_, offsets_buf_, value_buf_, null_bitmap_, null_count_); + strings_ = std::make_shared(length_, offsets_buf_, value_buf_, + null_bitmap_, null_count_); } protected: @@ -937,8 +941,8 @@ TEST_F(TestBinaryArray, TestListFunctions) { } TEST_F(TestBinaryArray, TestDestructor) { - auto arr = std::make_shared( - length_, offsets_buf_, value_buf_, null_bitmap_, null_count_); + auto arr = std::make_shared(length_, offsets_buf_, value_buf_, + null_bitmap_, null_count_); } TEST_F(TestBinaryArray, TestGetValue) { @@ -965,8 +969,9 @@ TEST_F(TestBinaryArray, TestEqualsEmptyStrings) { 
ASSERT_OK(builder.Finish(&left_arr)); const BinaryArray& left = static_cast(*left_arr); - std::shared_ptr right = std::make_shared(left.length(), - left.value_offsets(), nullptr, left.null_bitmap(), left.null_count()); + std::shared_ptr right = + std::make_shared(left.length(), left.value_offsets(), nullptr, + left.null_bitmap(), left.null_count()); ASSERT_TRUE(left.Equals(right)); ASSERT_TRUE(left.RangeEquals(0, left.length(), 0, right)); @@ -1082,17 +1087,11 @@ void CheckSliceEquality() { ASSERT_TRUE(array->RangeEquals(5, 25, 0, slice)); } -TEST_F(TestBinaryArray, TestSliceEquality) { - CheckSliceEquality(); -} +TEST_F(TestBinaryArray, TestSliceEquality) { CheckSliceEquality(); } -TEST_F(TestStringArray, TestSliceEquality) { - CheckSliceEquality(); -} +TEST_F(TestStringArray, TestSliceEquality) { CheckSliceEquality(); } -TEST_F(TestBinaryArray, LengthZeroCtor) { - BinaryArray array(0, nullptr, nullptr); -} +TEST_F(TestBinaryArray, LengthZeroCtor) { BinaryArray array(0, nullptr, nullptr); } // ---------------------------------------------------------------------- // FixedSizeBinary tests @@ -1126,8 +1125,8 @@ TEST_F(TestFWBinaryArray, Builder) { std::shared_ptr result; - auto CheckResult = [this, &length, &is_valid, &raw_data, &byte_width]( - const Array& result) { + auto CheckResult = [this, &length, &is_valid, &raw_data, + &byte_width](const Array& result) { // Verify output const auto& fw_result = static_cast(result); @@ -1135,8 +1134,8 @@ TEST_F(TestFWBinaryArray, Builder) { for (int64_t i = 0; i < result.length(); ++i) { if (is_valid[i]) { - ASSERT_EQ( - 0, memcmp(raw_data + byte_width * i, fw_result.GetValue(i), byte_width)); + ASSERT_EQ(0, + memcmp(raw_data + byte_width * i, fw_result.GetValue(i), byte_width)); } else { ASSERT_TRUE(fw_result.IsNull(i)); } @@ -1323,8 +1322,8 @@ TEST_F(TestAdaptiveIntBuilder, TestInt16) { SetUp(); ASSERT_OK(builder_->Append(std::numeric_limits::max())); ASSERT_OK(builder_->Append(std::numeric_limits::min())); - expected_values = { - std::numeric_limits::max(), std::numeric_limits::min()}; + expected_values = {std::numeric_limits::max(), + std::numeric_limits::min()}; Done(); ArrayFromVector(expected_values, &expected_); @@ -1354,8 +1353,8 @@ TEST_F(TestAdaptiveIntBuilder, TestInt32) { SetUp(); ASSERT_OK(builder_->Append(std::numeric_limits::max())); ASSERT_OK(builder_->Append(std::numeric_limits::min())); - expected_values = { - std::numeric_limits::max(), std::numeric_limits::min()}; + expected_values = {std::numeric_limits::max(), + std::numeric_limits::min()}; Done(); ArrayFromVector(expected_values, &expected_); @@ -1385,8 +1384,8 @@ TEST_F(TestAdaptiveIntBuilder, TestInt64) { SetUp(); ASSERT_OK(builder_->Append(std::numeric_limits::max())); ASSERT_OK(builder_->Append(std::numeric_limits::min())); - expected_values = { - std::numeric_limits::max(), std::numeric_limits::min()}; + expected_values = {std::numeric_limits::max(), + std::numeric_limits::min()}; Done(); ArrayFromVector(expected_values, &expected_); @@ -1505,7 +1504,7 @@ template class TestDictionaryBuilder : public TestBuilder {}; typedef ::testing::Types + UInt32Type, Int64Type, UInt64Type, FloatType, DoubleType> PrimitiveDictionaries; TYPED_TEST_CASE(TestDictionaryBuilder, PrimitiveDictionaries); @@ -1784,7 +1783,7 @@ TEST_F(TestListBuilder, TestAppendNull) { } void ValidateBasicListArray(const ListArray* result, const vector& values, - const vector& is_valid) { + const vector& is_valid) { ASSERT_OK(ValidateArray(*result)); ASSERT_EQ(1, result->null_count()); ASSERT_EQ(0, 
result->values()->null_count()); @@ -1997,9 +1996,12 @@ TEST(TestDictionary, Validate) { // Struct tests void ValidateBasicStructArray(const StructArray* result, - const vector& struct_is_valid, const vector& list_values, - const vector& list_is_valid, const vector& list_lengths, - const vector& list_offsets, const vector& int_values) { + const vector& struct_is_valid, + const vector& list_values, + const vector& list_is_valid, + const vector& list_lengths, + const vector& list_offsets, + const vector& int_values) { ASSERT_EQ(4, result->length()); ASSERT_OK(ValidateArray(*result)); @@ -2040,9 +2042,9 @@ class TestStructBuilder : public TestBuilder { auto list_type = list(char_type); vector> types = {list_type, int32_type}; - vector fields; - fields.push_back(FieldPtr(new Field("list", list_type))); - fields.push_back(FieldPtr(new Field("int", int32_type))); + vector> fields; + fields.push_back(field("list", list_type)); + fields.push_back(field("int", int32_type)); type_ = struct_(fields); value_fields_ = fields; @@ -2060,7 +2062,7 @@ class TestStructBuilder : public TestBuilder { } protected: - vector value_fields_; + vector> value_fields_; std::shared_ptr type_; std::shared_ptr builder_; @@ -2134,7 +2136,7 @@ TEST_F(TestStructBuilder, TestBasics) { Done(); ValidateBasicStructArray(result_.get(), struct_is_valid, list_values, list_is_valid, - list_lengths, list_offsets, int_values); + list_lengths, list_offsets, int_values); } TEST_F(TestStructBuilder, BulkAppend) { @@ -2166,7 +2168,7 @@ TEST_F(TestStructBuilder, BulkAppend) { Done(); ValidateBasicStructArray(result_.get(), struct_is_valid, list_values, list_is_valid, - list_lengths, list_offsets, int_values); + list_lengths, list_offsets, int_values); } TEST_F(TestStructBuilder, BulkAppendInvalid) { @@ -2280,7 +2282,7 @@ TEST_F(TestStructBuilder, TestEquality) { // setup an unequal one with unequal offsets ASSERT_OK(builder_->Append(struct_is_valid.size(), struct_is_valid.data())); ASSERT_OK(list_vb->Append(unequal_list_offsets.data(), unequal_list_offsets.size(), - unequal_list_is_valid.data())); + unequal_list_is_valid.data())); for (int8_t value : list_values) { char_vb->UnsafeAppend(value); } diff --git a/cpp/src/arrow/array.cc b/cpp/src/arrow/array.cc index 4a405f24342..ab0be7a0964 100644 --- a/cpp/src/arrow/array.cc +++ b/cpp/src/arrow/array.cc @@ -57,45 +57,57 @@ int64_t Array::null_count() const { bool Array::Equals(const Array& arr) const { bool are_equal = false; Status error = ArrayEquals(*this, arr, &are_equal); - if (!error.ok()) { DCHECK(false) << "Arrays not comparable: " << error.ToString(); } + if (!error.ok()) { + DCHECK(false) << "Arrays not comparable: " << error.ToString(); + } return are_equal; } bool Array::Equals(const std::shared_ptr& arr) const { - if (!arr) { return false; } + if (!arr) { + return false; + } return Equals(*arr); } bool Array::ApproxEquals(const Array& arr) const { bool are_equal = false; Status error = ArrayApproxEquals(*this, arr, &are_equal); - if (!error.ok()) { DCHECK(false) << "Arrays not comparable: " << error.ToString(); } + if (!error.ok()) { + DCHECK(false) << "Arrays not comparable: " << error.ToString(); + } return are_equal; } bool Array::ApproxEquals(const std::shared_ptr& arr) const { - if (!arr) { return false; } + if (!arr) { + return false; + } return ApproxEquals(*arr); } bool Array::RangeEquals(int64_t start_idx, int64_t end_idx, int64_t other_start_idx, - const std::shared_ptr& other) const { - if (!other) { return false; } + const std::shared_ptr& other) const { + if 
(!other) { + return false; + } return RangeEquals(*other, start_idx, end_idx, other_start_idx); } bool Array::RangeEquals(const Array& other, int64_t start_idx, int64_t end_idx, - int64_t other_start_idx) const { + int64_t other_start_idx) const { bool are_equal = false; Status error = ArrayRangeEquals(*this, other, start_idx, end_idx, other_start_idx, &are_equal); - if (!error.ok()) { DCHECK(false) << "Arrays not comparable: " << error.ToString(); } + if (!error.ok()) { + DCHECK(false) << "Arrays not comparable: " << error.ToString(); + } return are_equal; } // Last two parameters are in-out parameters -static inline void ConformSliceParams( - int64_t array_offset, int64_t array_length, int64_t* offset, int64_t* length) { +static inline void ConformSliceParams(int64_t array_offset, int64_t array_length, + int64_t* offset, int64_t* length) { DCHECK_LE(*offset, array_length); DCHECK_NE(offset, nullptr); *length = std::min(array_length - *offset, *length); @@ -113,8 +125,8 @@ std::string Array::ToString() const { return ss.str(); } -static inline std::shared_ptr SliceData( - const ArrayData& data, int64_t offset, int64_t length) { +static inline std::shared_ptr SliceData(const ArrayData& data, int64_t offset, + int64_t length) { ConformSliceParams(data.offset, data.length, &offset, &length); auto new_data = data.ShallowCopy(); @@ -139,8 +151,9 @@ std::shared_ptr NullArray::Slice(int64_t offset, int64_t length) const { // Primitive array base PrimitiveArray::PrimitiveArray(const std::shared_ptr& type, int64_t length, - const std::shared_ptr& data, const std::shared_ptr& null_bitmap, - int64_t null_count, int64_t offset) { + const std::shared_ptr& data, + const std::shared_ptr& null_bitmap, + int64_t null_count, int64_t offset) { BufferVector buffers = {null_bitmap, data}; SetData( std::make_shared(type, length, std::move(buffers), null_count, offset)); @@ -166,7 +179,8 @@ BooleanArray::BooleanArray(const std::shared_ptr& data) } BooleanArray::BooleanArray(int64_t length, const std::shared_ptr& data, - const std::shared_ptr& null_bitmap, int64_t null_count, int64_t offset) + const std::shared_ptr& null_bitmap, int64_t null_count, + int64_t offset) : PrimitiveArray(boolean(), length, data, null_bitmap, null_count, offset) {} std::shared_ptr BooleanArray::Slice(int64_t offset, int64_t length) const { @@ -182,8 +196,10 @@ ListArray::ListArray(const std::shared_ptr& data) { } ListArray::ListArray(const std::shared_ptr& type, int64_t length, - const std::shared_ptr& value_offsets, const std::shared_ptr& values, - const std::shared_ptr& null_bitmap, int64_t null_count, int64_t offset) { + const std::shared_ptr& value_offsets, + const std::shared_ptr& values, + const std::shared_ptr& null_bitmap, int64_t null_count, + int64_t offset) { BufferVector buffers = {null_bitmap, value_offsets}; auto internal_data = std::make_shared(type, length, std::move(buffers), null_count, offset); @@ -192,7 +208,7 @@ ListArray::ListArray(const std::shared_ptr& type, int64_t length, } Status ListArray::FromArrays(const Array& offsets, const Array& values, MemoryPool* pool, - std::shared_ptr* out) { + std::shared_ptr* out) { if (ARROW_PREDICT_FALSE(offsets.length() == 0)) { return Status::Invalid("List offsets must have non-zero length"); } @@ -205,12 +221,13 @@ Status ListArray::FromArrays(const Array& offsets, const Array& values, MemoryPo return Status::Invalid("List offsets must be signed int32"); } - BufferVector buffers = { - offsets.null_bitmap(), static_cast(offsets).values()}; + BufferVector buffers = 
{offsets.null_bitmap(), + static_cast(offsets).values()}; auto list_type = list(values.type()); - auto internal_data = std::make_shared(list_type, - offsets.length() - 1, std::move(buffers), offsets.null_count(), offsets.offset()); + auto internal_data = std::make_shared( + list_type, offsets.length() - 1, std::move(buffers), offsets.null_count(), + offsets.offset()); internal_data->child_data.push_back(values.data()); *out = std::make_shared(internal_data); @@ -230,14 +247,12 @@ std::shared_ptr ListArray::value_type() const { return static_cast(*type()).value_type(); } -std::shared_ptr ListArray::values() const { - return values_; -} +std::shared_ptr ListArray::values() const { return values_; } std::shared_ptr ListArray::Slice(int64_t offset, int64_t length) const { ConformSliceParams(data_->offset, data_->length, &offset, &length); return std::make_shared(type(), length, value_offsets(), values(), - null_bitmap(), kUnknownNullCount, offset); + null_bitmap(), kUnknownNullCount, offset); } // ---------------------------------------------------------------------- @@ -262,14 +277,17 @@ void BinaryArray::SetData(const std::shared_ptr& data) { } BinaryArray::BinaryArray(int64_t length, const std::shared_ptr& value_offsets, - const std::shared_ptr& data, const std::shared_ptr& null_bitmap, - int64_t null_count, int64_t offset) + const std::shared_ptr& data, + const std::shared_ptr& null_bitmap, int64_t null_count, + int64_t offset) : BinaryArray(kBinary, length, value_offsets, data, null_bitmap, null_count, offset) { } BinaryArray::BinaryArray(const std::shared_ptr& type, int64_t length, - const std::shared_ptr& value_offsets, const std::shared_ptr& data, - const std::shared_ptr& null_bitmap, int64_t null_count, int64_t offset) { + const std::shared_ptr& value_offsets, + const std::shared_ptr& data, + const std::shared_ptr& null_bitmap, int64_t null_count, + int64_t offset) { BufferVector buffers = {null_bitmap, value_offsets, data}; SetData( std::make_shared(type, length, std::move(buffers), null_count, offset)); @@ -285,8 +303,9 @@ StringArray::StringArray(const std::shared_ptr& data) { } StringArray::StringArray(int64_t length, const std::shared_ptr& value_offsets, - const std::shared_ptr& data, const std::shared_ptr& null_bitmap, - int64_t null_count, int64_t offset) + const std::shared_ptr& data, + const std::shared_ptr& null_bitmap, int64_t null_count, + int64_t offset) : BinaryArray(kString, length, value_offsets, data, null_bitmap, null_count, offset) { } @@ -304,8 +323,10 @@ FixedSizeBinaryArray::FixedSizeBinaryArray( } FixedSizeBinaryArray::FixedSizeBinaryArray(const std::shared_ptr& type, - int64_t length, const std::shared_ptr& data, - const std::shared_ptr& null_bitmap, int64_t null_count, int64_t offset) + int64_t length, + const std::shared_ptr& data, + const std::shared_ptr& null_bitmap, + int64_t null_count, int64_t offset) : PrimitiveArray(type, length, data, null_bitmap, null_count, offset), byte_width_(static_cast(*type).byte_width()) {} @@ -335,8 +356,9 @@ void DecimalArray::SetData(const std::shared_ptr& data) { } DecimalArray::DecimalArray(const std::shared_ptr& type, int64_t length, - const std::shared_ptr& data, const std::shared_ptr& null_bitmap, - int64_t null_count, int64_t offset, const std::shared_ptr& sign_bitmap) { + const std::shared_ptr& data, + const std::shared_ptr& null_bitmap, int64_t null_count, + int64_t offset, const std::shared_ptr& sign_bitmap) { BufferVector buffers = {null_bitmap, data, sign_bitmap}; SetData( std::make_shared(type, length, 
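The slicing helpers touched above (ConformSliceParams, SliceData, ListArray::Slice) implement zero-copy slices: a shallow copy of the ArrayData gets an adjusted offset and a clamped length, and the null count is left as kUnknownNullCount to be recomputed lazily. Below is a self-contained sketch of just the clamping step, using a hypothetical ConformSlice function rather than Arrow's internal helper:

```cpp
#include <algorithm>
#include <cstdint>
#include <iostream>

// Hypothetical standalone version of the offset/length clamping done by
// ConformSliceParams: the requested length is capped so the slice never
// runs past the end of the parent array, and the offset is rebased onto
// the parent's own offset.
void ConformSlice(int64_t parent_offset, int64_t parent_length,
                  int64_t* offset, int64_t* length) {
  *length = std::min(parent_length - *offset, *length);
  *offset = *offset + parent_offset;
}

int main() {
  // Parent view: offset 0, length 30. Request a slice at offset 5, length 100.
  int64_t offset = 5;
  int64_t length = 100;
  ConformSlice(0, 30, &offset, &length);
  // Length is clamped to 25 (= 30 - 5); no buffer data is copied, only the
  // view's offset/length change, which is why the null count can be left
  // unknown instead of being recomputed eagerly.
  std::cout << "offset=" << offset << " length=" << length << std::endl;
  return 0;
}
```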
std::move(buffers), null_count, offset)); @@ -392,8 +414,9 @@ StructArray::StructArray(const std::shared_ptr& data) { } StructArray::StructArray(const std::shared_ptr& type, int64_t length, - const std::vector>& children, - std::shared_ptr null_bitmap, int64_t null_count, int64_t offset) { + const std::vector>& children, + std::shared_ptr null_bitmap, int64_t null_count, + int64_t offset) { BufferVector buffers = {null_bitmap}; SetData( std::make_shared(type, length, std::move(buffers), null_count, offset)); @@ -433,9 +456,11 @@ UnionArray::UnionArray(const std::shared_ptr& data) { } UnionArray::UnionArray(const std::shared_ptr& type, int64_t length, - const std::vector>& children, - const std::shared_ptr& type_ids, const std::shared_ptr& value_offsets, - const std::shared_ptr& null_bitmap, int64_t null_count, int64_t offset) { + const std::vector>& children, + const std::shared_ptr& type_ids, + const std::shared_ptr& value_offsets, + const std::shared_ptr& null_bitmap, int64_t null_count, + int64_t offset) { BufferVector buffers = {null_bitmap, type_ids, value_offsets}; auto internal_data = std::make_shared(type, length, std::move(buffers), null_count, offset); @@ -464,8 +489,8 @@ DictionaryArray::DictionaryArray(const std::shared_ptr& data) SetData(data); } -DictionaryArray::DictionaryArray( - const std::shared_ptr& type, const std::shared_ptr& indices) +DictionaryArray::DictionaryArray(const std::shared_ptr& type, + const std::shared_ptr& indices) : dict_type_(static_cast(type.get())) { DCHECK_EQ(type->id(), Type::DICTIONARY); DCHECK_EQ(indices->type_id(), dict_type_->index_type()->id()); @@ -482,9 +507,7 @@ void DictionaryArray::SetData(const std::shared_ptr& data) { DCHECK(internal::MakeArray(indices_data, &indices_).ok()); } -std::shared_ptr DictionaryArray::indices() const { - return indices_; -} +std::shared_ptr DictionaryArray::indices() const { return indices_; } std::shared_ptr DictionaryArray::dictionary() const { return dict_type_->dictionary(); @@ -504,6 +527,8 @@ Status Array::Accept(ArrayVisitor* visitor) const { // ---------------------------------------------------------------------- // Implement Array::Validate as inline visitor +namespace internal { + struct ValidateVisitor { Status Visit(const NullArray& array) { return Status::OK(); } @@ -517,7 +542,9 @@ struct ValidateVisitor { } Status Visit(const ListArray& array) { - if (array.length() < 0) { return Status::Invalid("Length was negative"); } + if (array.length() < 0) { + return Status::Invalid("Length was negative"); + } auto value_offsets = array.value_offsets(); if (array.length() && !value_offsets) { @@ -550,7 +577,9 @@ struct ValidateVisitor { } int32_t prev_offset = array.value_offset(0); - if (prev_offset != 0) { return Status::Invalid("The first offset wasn't zero"); } + if (prev_offset != 0) { + return Status::Invalid("The first offset wasn't zero"); + } for (int64_t i = 1; i <= array.length(); ++i) { int32_t current_offset = array.value_offset(i); if (array.IsNull(i - 1) && current_offset != prev_offset) { @@ -573,7 +602,9 @@ struct ValidateVisitor { } Status Visit(const StructArray& array) { - if (array.length() < 0) { return Status::Invalid("Length was negative"); } + if (array.length() < 0) { + return Status::Invalid("Length was negative"); + } if (array.null_count() > array.length()) { return Status::Invalid("Null count exceeds the length of this struct"); @@ -610,7 +641,9 @@ struct ValidateVisitor { } Status Visit(const UnionArray& array) { - if (array.length() < 0) { return Status::Invalid("Length 
was negative"); } + if (array.length() < 0) { + return Status::Invalid("Length was negative"); + } if (array.null_count() > array.length()) { return Status::Invalid("Null count exceeds the length of this struct"); @@ -627,8 +660,10 @@ struct ValidateVisitor { } }; +} // namespace internal + Status ValidateArray(const Array& array) { - ValidateVisitor validate_visitor; + internal::ValidateVisitor validate_visitor; return VisitArrayInline(array, &validate_visitor); } @@ -661,8 +696,9 @@ Status MakeArray(const std::shared_ptr& data, std::shared_ptr* } // namespace internal Status MakePrimitiveArray(const std::shared_ptr& type, int64_t length, - const std::shared_ptr& data, const std::shared_ptr& null_bitmap, - int64_t null_count, int64_t offset, std::shared_ptr* out) { + const std::shared_ptr& data, + const std::shared_ptr& null_bitmap, int64_t null_count, + int64_t offset, std::shared_ptr* out) { BufferVector buffers = {null_bitmap, data}; auto internal_data = std::make_shared( type, length, std::move(buffers), null_count, offset); @@ -670,8 +706,9 @@ Status MakePrimitiveArray(const std::shared_ptr& type, int64_t length, } Status MakePrimitiveArray(const std::shared_ptr& type, - const std::vector>& buffers, int64_t length, - int64_t null_count, int64_t offset, std::shared_ptr* out) { + const std::vector>& buffers, + int64_t length, int64_t null_count, int64_t offset, + std::shared_ptr* out) { auto internal_data = std::make_shared(type, length, buffers, null_count, offset); return internal::MakeArray(internal_data, out); diff --git a/cpp/src/arrow/array.h b/cpp/src/arrow/array.h index c32d5e1c93f..a853f2bb5f9 100644 --- a/cpp/src/arrow/array.h +++ b/cpp/src/arrow/array.h @@ -88,8 +88,8 @@ struct ARROW_EXPORT ArrayData { ArrayData() {} ArrayData(const std::shared_ptr& type, int64_t length, - const std::vector>& buffers, - int64_t null_count = kUnknownNullCount, int64_t offset = 0) + const std::vector>& buffers, + int64_t null_count = kUnknownNullCount, int64_t offset = 0) : type(type), length(length), buffers(buffers), @@ -97,8 +97,8 @@ struct ARROW_EXPORT ArrayData { offset(offset) {} ArrayData(const std::shared_ptr& type, int64_t length, - std::vector>&& buffers, - int64_t null_count = kUnknownNullCount, int64_t offset = 0) + std::vector>&& buffers, + int64_t null_count = kUnknownNullCount, int64_t offset = 0) : type(type), length(length), buffers(std::move(buffers)), @@ -145,8 +145,8 @@ struct ARROW_EXPORT ArrayData { std::vector> child_data; }; -Status ARROW_EXPORT MakeArray( - const std::shared_ptr& data, std::shared_ptr* out); +Status ARROW_EXPORT MakeArray(const std::shared_ptr& data, + std::shared_ptr* out); } // namespace internal @@ -211,10 +211,10 @@ class ARROW_EXPORT Array { /// Compare if the range of slots specified are equal for the given array and /// this array. end_idx exclusive. This methods does not bounds check. 
bool RangeEquals(int64_t start_idx, int64_t end_idx, int64_t other_start_idx, - const std::shared_ptr& other) const; + const std::shared_ptr& other) const; bool RangeEquals(const Array& other, int64_t start_idx, int64_t end_idx, - int64_t other_start_idx) const; + int64_t other_start_idx) const; Status Accept(ArrayVisitor* visitor) const; @@ -285,9 +285,9 @@ class ARROW_EXPORT NullArray : public FlatArray { class ARROW_EXPORT PrimitiveArray : public FlatArray { public: PrimitiveArray(const std::shared_ptr& type, int64_t length, - const std::shared_ptr& data, - const std::shared_ptr& null_bitmap = nullptr, int64_t null_count = 0, - int64_t offset = 0); + const std::shared_ptr& data, + const std::shared_ptr& null_bitmap = nullptr, + int64_t null_count = 0, int64_t offset = 0); /// Does not account for any slice offset std::shared_ptr values() const { return data_->buffers[1]; } @@ -328,7 +328,7 @@ class ARROW_EXPORT NumericArray : public PrimitiveArray { const std::shared_ptr& null_bitmap = nullptr, int64_t null_count = 0, int64_t offset = 0) : PrimitiveArray(TypeTraits::type_singleton(), length, data, null_bitmap, - null_count, offset) {} + null_count, offset) {} const value_type* raw_values() const { return reinterpret_cast(raw_values_) + data_->offset; @@ -349,14 +349,14 @@ class ARROW_EXPORT BooleanArray : public PrimitiveArray { explicit BooleanArray(const std::shared_ptr& data); BooleanArray(int64_t length, const std::shared_ptr& data, - const std::shared_ptr& null_bitmap = nullptr, int64_t null_count = 0, - int64_t offset = 0); + const std::shared_ptr& null_bitmap = nullptr, + int64_t null_count = 0, int64_t offset = 0); std::shared_ptr Slice(int64_t offset, int64_t length) const override; bool Value(int64_t i) const { - return BitUtil::GetBit( - reinterpret_cast(raw_values_), i + data_->offset); + return BitUtil::GetBit(reinterpret_cast(raw_values_), + i + data_->offset); } protected: @@ -373,9 +373,10 @@ class ARROW_EXPORT ListArray : public Array { explicit ListArray(const std::shared_ptr& data); ListArray(const std::shared_ptr& type, int64_t length, - const std::shared_ptr& value_offsets, const std::shared_ptr& values, - const std::shared_ptr& null_bitmap = nullptr, int64_t null_count = 0, - int64_t offset = 0); + const std::shared_ptr& value_offsets, + const std::shared_ptr& values, + const std::shared_ptr& null_bitmap = nullptr, int64_t null_count = 0, + int64_t offset = 0); /// \brief Construct ListArray from array of offsets and child value array /// @@ -388,7 +389,7 @@ class ARROW_EXPORT ListArray : public Array { /// allocated because of null values /// \param[out] out Will have length equal to offsets.length() - 1 static Status FromArrays(const Array& offsets, const Array& values, MemoryPool* pool, - std::shared_ptr* out); + std::shared_ptr* out); /// \brief Return array object containing the list's values std::shared_ptr values() const; @@ -428,9 +429,9 @@ class ARROW_EXPORT BinaryArray : public FlatArray { explicit BinaryArray(const std::shared_ptr& data); BinaryArray(int64_t length, const std::shared_ptr& value_offsets, - const std::shared_ptr& data, - const std::shared_ptr& null_bitmap = nullptr, int64_t null_count = 0, - int64_t offset = 0); + const std::shared_ptr& data, + const std::shared_ptr& null_bitmap = nullptr, + int64_t null_count = 0, int64_t offset = 0); // Return the pointer to the given elements bytes // TODO(emkornfield) introduce a StringPiece or something similar to capture zero-copy @@ -471,9 +472,10 @@ class ARROW_EXPORT BinaryArray : public FlatArray 
{ // Constructor that allows sub-classes/builders to propagate there logical type up the // class hierarchy. BinaryArray(const std::shared_ptr& type, int64_t length, - const std::shared_ptr& value_offsets, const std::shared_ptr& data, - const std::shared_ptr& null_bitmap = nullptr, int64_t null_count = 0, - int64_t offset = 0); + const std::shared_ptr& value_offsets, + const std::shared_ptr& data, + const std::shared_ptr& null_bitmap = nullptr, + int64_t null_count = 0, int64_t offset = 0); const int32_t* raw_value_offsets_; const uint8_t* raw_data_; @@ -486,9 +488,9 @@ class ARROW_EXPORT StringArray : public BinaryArray { explicit StringArray(const std::shared_ptr& data); StringArray(int64_t length, const std::shared_ptr& value_offsets, - const std::shared_ptr& data, - const std::shared_ptr& null_bitmap = nullptr, int64_t null_count = 0, - int64_t offset = 0); + const std::shared_ptr& data, + const std::shared_ptr& null_bitmap = nullptr, + int64_t null_count = 0, int64_t offset = 0); // Construct a std::string // TODO: std::bad_alloc possibility @@ -511,9 +513,9 @@ class ARROW_EXPORT FixedSizeBinaryArray : public PrimitiveArray { explicit FixedSizeBinaryArray(const std::shared_ptr& data); FixedSizeBinaryArray(const std::shared_ptr& type, int64_t length, - const std::shared_ptr& data, - const std::shared_ptr& null_bitmap = nullptr, int64_t null_count = 0, - int64_t offset = 0); + const std::shared_ptr& data, + const std::shared_ptr& null_bitmap = nullptr, + int64_t null_count = 0, int64_t offset = 0); const uint8_t* GetValue(int64_t i) const; @@ -542,9 +544,10 @@ class ARROW_EXPORT DecimalArray : public FlatArray { explicit DecimalArray(const std::shared_ptr& data); DecimalArray(const std::shared_ptr& type, int64_t length, - const std::shared_ptr& data, - const std::shared_ptr& null_bitmap = nullptr, int64_t null_count = 0, - int64_t offset = 0, const std::shared_ptr& sign_bitmap = nullptr); + const std::shared_ptr& data, + const std::shared_ptr& null_bitmap = nullptr, + int64_t null_count = 0, int64_t offset = 0, + const std::shared_ptr& sign_bitmap = nullptr); bool IsNegative(int64_t i) const; @@ -582,9 +585,9 @@ class ARROW_EXPORT StructArray : public Array { explicit StructArray(const std::shared_ptr& data); StructArray(const std::shared_ptr& type, int64_t length, - const std::vector>& children, - std::shared_ptr null_bitmap = nullptr, int64_t null_count = 0, - int64_t offset = 0); + const std::vector>& children, + std::shared_ptr null_bitmap = nullptr, int64_t null_count = 0, + int64_t offset = 0); // Return a shared pointer in case the requestor desires to share ownership // with this array. 
@@ -604,11 +607,11 @@ class ARROW_EXPORT UnionArray : public Array { explicit UnionArray(const std::shared_ptr& data); UnionArray(const std::shared_ptr& type, int64_t length, - const std::vector>& children, - const std::shared_ptr& type_ids, - const std::shared_ptr& value_offsets = nullptr, - const std::shared_ptr& null_bitmap = nullptr, int64_t null_count = 0, - int64_t offset = 0); + const std::vector>& children, + const std::shared_ptr& type_ids, + const std::shared_ptr& value_offsets = nullptr, + const std::shared_ptr& null_bitmap = nullptr, int64_t null_count = 0, + int64_t offset = 0); /// Note that this buffer does not account for any slice offset std::shared_ptr type_ids() const { return data_->buffers[1]; } @@ -656,8 +659,8 @@ class ARROW_EXPORT DictionaryArray : public Array { explicit DictionaryArray(const std::shared_ptr& data); - DictionaryArray( - const std::shared_ptr& type, const std::shared_ptr& indices); + DictionaryArray(const std::shared_ptr& type, + const std::shared_ptr& indices); std::shared_ptr indices() const; std::shared_ptr dictionary() const; @@ -705,13 +708,16 @@ Status ARROW_EXPORT ValidateArray(const Array& array); /// Create new arrays for logical types that are backed by primitive arrays. Status ARROW_EXPORT MakePrimitiveArray(const std::shared_ptr& type, - int64_t length, const std::shared_ptr& data, - const std::shared_ptr& null_bitmap, int64_t null_count, int64_t offset, - std::shared_ptr* out); - -Status ARROW_EXPORT MakePrimitiveArray(const std::shared_ptr& type, - const std::vector>& buffers, int64_t length, - int64_t null_count, int64_t offset, std::shared_ptr* out); + int64_t length, + const std::shared_ptr& data, + const std::shared_ptr& null_bitmap, + int64_t null_count, int64_t offset, + std::shared_ptr* out); + +Status ARROW_EXPORT +MakePrimitiveArray(const std::shared_ptr& type, + const std::vector>& buffers, int64_t length, + int64_t null_count, int64_t offset, std::shared_ptr* out); } // namespace arrow diff --git a/cpp/src/arrow/buffer.cc b/cpp/src/arrow/buffer.cc index a1d119ecdca..b9c5897f8a2 100644 --- a/cpp/src/arrow/buffer.cc +++ b/cpp/src/arrow/buffer.cc @@ -27,8 +27,8 @@ namespace arrow { -Status Buffer::Copy( - int64_t start, int64_t nbytes, MemoryPool* pool, std::shared_ptr* out) const { +Status Buffer::Copy(int64_t start, int64_t nbytes, MemoryPool* pool, + std::shared_ptr* out) const { // Sanity checks DCHECK_LT(start, size_); DCHECK_LE(nbytes, size_ - start); @@ -47,25 +47,28 @@ Status Buffer::Copy(int64_t start, int64_t nbytes, std::shared_ptr* out) } bool Buffer::Equals(const Buffer& other, int64_t nbytes) const { - return this == &other || - (size_ >= nbytes && other.size_ >= nbytes && - (data_ == other.data_ || - !memcmp(data_, other.data_, static_cast(nbytes)))); + return this == &other || (size_ >= nbytes && other.size_ >= nbytes && + (data_ == other.data_ || + !memcmp(data_, other.data_, static_cast(nbytes)))); } bool Buffer::Equals(const Buffer& other) const { - return this == &other || (size_ == other.size_ && (data_ == other.data_ || - !memcmp(data_, other.data_, - static_cast(size_)))); + return this == &other || (size_ == other.size_ && + (data_ == other.data_ || + !memcmp(data_, other.data_, static_cast(size_)))); } PoolBuffer::PoolBuffer(MemoryPool* pool) : ResizableBuffer(nullptr, 0) { - if (pool == nullptr) { pool = default_memory_pool(); } + if (pool == nullptr) { + pool = default_memory_pool(); + } pool_ = pool; } PoolBuffer::~PoolBuffer() { - if (mutable_data_ != nullptr) { pool_->Free(mutable_data_, 
capacity_); } + if (mutable_data_ != nullptr) { + pool_->Free(mutable_data_, capacity_); + } } Status PoolBuffer::Reserve(int64_t new_capacity) { @@ -109,28 +112,28 @@ Status PoolBuffer::Resize(int64_t new_size, bool shrink_to_fit) { return Status::OK(); } -std::shared_ptr SliceMutableBuffer( - const std::shared_ptr& buffer, int64_t offset, int64_t length) { +std::shared_ptr SliceMutableBuffer(const std::shared_ptr& buffer, + int64_t offset, int64_t length) { return std::make_shared(buffer, offset, length); } -MutableBuffer::MutableBuffer( - const std::shared_ptr& parent, int64_t offset, int64_t size) +MutableBuffer::MutableBuffer(const std::shared_ptr& parent, int64_t offset, + int64_t size) : MutableBuffer(parent->mutable_data() + offset, size) { DCHECK(parent->is_mutable()) << "Must pass mutable buffer"; parent_ = parent; } -Status AllocateBuffer( - MemoryPool* pool, int64_t size, std::shared_ptr* out) { +Status AllocateBuffer(MemoryPool* pool, int64_t size, + std::shared_ptr* out) { auto buffer = std::make_shared(pool); RETURN_NOT_OK(buffer->Resize(size)); *out = buffer; return Status::OK(); } -Status AllocateResizableBuffer( - MemoryPool* pool, int64_t size, std::shared_ptr* out) { +Status AllocateResizableBuffer(MemoryPool* pool, int64_t size, + std::shared_ptr* out) { auto buffer = std::make_shared(pool); RETURN_NOT_OK(buffer->Resize(size)); *out = buffer; diff --git a/cpp/src/arrow/buffer.h b/cpp/src/arrow/buffer.h index 488a4c05334..5d050b77f77 100644 --- a/cpp/src/arrow/buffer.h +++ b/cpp/src/arrow/buffer.h @@ -25,6 +25,7 @@ #include #include "arrow/status.h" +#include "arrow/util/bit-util.h" #include "arrow/util/macros.h" #include "arrow/util/visibility.h" @@ -72,7 +73,7 @@ class ARROW_EXPORT Buffer { /// Copy a section of the buffer into a new Buffer. Status Copy(int64_t start, int64_t nbytes, MemoryPool* pool, - std::shared_ptr* out) const; + std::shared_ptr* out) const; /// Copy a section of the buffer using the default memory pool into a new Buffer. Status Copy(int64_t start, int64_t nbytes, std::shared_ptr* out) const; @@ -106,21 +107,21 @@ class ARROW_EXPORT Buffer { /// \param str std::string instance /// \return std::shared_ptr static inline std::shared_ptr GetBufferFromString(const std::string& str) { - return std::make_shared( - reinterpret_cast(str.c_str()), static_cast(str.size())); + return std::make_shared(reinterpret_cast(str.c_str()), + static_cast(str.size())); } /// Construct a view on passed buffer at the indicated offset and length. This /// function cannot fail and does not error checking (except in debug builds) -static inline std::shared_ptr SliceBuffer( - const std::shared_ptr& buffer, int64_t offset, int64_t length) { +static inline std::shared_ptr SliceBuffer(const std::shared_ptr& buffer, + int64_t offset, int64_t length) { return std::make_shared(buffer, offset, length); } /// Construct a mutable buffer slice. If the parent buffer is not mutable, this /// will abort in debug builds -std::shared_ptr ARROW_EXPORT SliceMutableBuffer( - const std::shared_ptr& buffer, int64_t offset, int64_t length); +std::shared_ptr ARROW_EXPORT +SliceMutableBuffer(const std::shared_ptr& buffer, int64_t offset, int64_t length); /// A Buffer whose contents can be mutated. May or may not own its data. 
class ARROW_EXPORT MutableBuffer : public Buffer { @@ -186,8 +187,12 @@ class ARROW_EXPORT BufferBuilder { /// Resizes the buffer to the nearest multiple of 64 bytes per Layout.md Status Resize(int64_t elements) { // Resize(0) is a no-op - if (elements == 0) { return Status::OK(); } - if (capacity_ == 0) { buffer_ = std::make_shared(pool_); } + if (elements == 0) { + return Status::OK(); + } + if (capacity_ == 0) { + buffer_ = std::make_shared(pool_); + } int64_t old_capacity = capacity_; RETURN_NOT_OK(buffer_->Resize(elements)); capacity_ = buffer_->capacity(); @@ -199,14 +204,20 @@ class ARROW_EXPORT BufferBuilder { } Status Append(const uint8_t* data, int64_t length) { - if (capacity_ < length + size_) { RETURN_NOT_OK(Resize(length + size_)); } + if (capacity_ < length + size_) { + int64_t new_capacity = BitUtil::NextPower2(length + size_); + RETURN_NOT_OK(Resize(new_capacity)); + } UnsafeAppend(data, length); return Status::OK(); } // Advance pointer and zero out memory Status Advance(int64_t length) { - if (capacity_ < length + size_) { RETURN_NOT_OK(Resize(length + size_)); } + if (capacity_ < length + size_) { + int64_t new_capacity = BitUtil::NextPower2(length + size_); + RETURN_NOT_OK(Resize(new_capacity)); + } memset(data_ + size_, 0, static_cast(length)); size_ += length; return Status::OK(); @@ -220,7 +231,9 @@ class ARROW_EXPORT BufferBuilder { Status Finish(std::shared_ptr* out) { // Do not shrink to fit to avoid unneeded realloc - if (size_ > 0) { RETURN_NOT_OK(buffer_->Resize(size_, false)); } + if (size_ > 0) { + RETURN_NOT_OK(buffer_->Resize(size_, false)); + } *out = buffer_; Reset(); return Status::OK(); @@ -250,29 +263,29 @@ class ARROW_EXPORT TypedBufferBuilder : public BufferBuilder { Status Append(T arithmetic_value) { static_assert(std::is_arithmetic::value, - "Convenience buffer append only supports arithmetic types"); - return BufferBuilder::Append( - reinterpret_cast(&arithmetic_value), sizeof(T)); + "Convenience buffer append only supports arithmetic types"); + return BufferBuilder::Append(reinterpret_cast(&arithmetic_value), + sizeof(T)); } Status Append(const T* arithmetic_values, int64_t num_elements) { static_assert(std::is_arithmetic::value, - "Convenience buffer append only supports arithmetic types"); - return BufferBuilder::Append( - reinterpret_cast(arithmetic_values), num_elements * sizeof(T)); + "Convenience buffer append only supports arithmetic types"); + return BufferBuilder::Append(reinterpret_cast(arithmetic_values), + num_elements * sizeof(T)); } void UnsafeAppend(T arithmetic_value) { static_assert(std::is_arithmetic::value, - "Convenience buffer append only supports arithmetic types"); + "Convenience buffer append only supports arithmetic types"); BufferBuilder::UnsafeAppend(reinterpret_cast(&arithmetic_value), sizeof(T)); } void UnsafeAppend(const T* arithmetic_values, int64_t num_elements) { static_assert(std::is_arithmetic::value, - "Convenience buffer append only supports arithmetic types"); - BufferBuilder::UnsafeAppend( - reinterpret_cast(arithmetic_values), num_elements * sizeof(T)); + "Convenience buffer append only supports arithmetic types"); + BufferBuilder::UnsafeAppend(reinterpret_cast(arithmetic_values), + num_elements * sizeof(T)); } const T* data() const { return reinterpret_cast(data_); } @@ -286,11 +299,11 @@ class ARROW_EXPORT TypedBufferBuilder : public BufferBuilder { /// \param[out] out the allocated buffer with padding /// /// \return Status message -Status ARROW_EXPORT AllocateBuffer( - MemoryPool* pool, int64_t size, 
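The BufferBuilder change above replaces the exact-size resizes in Append and Advance with growth to BitUtil::NextPower2(length + size_), so a long run of small appends reallocates geometrically instead of once per call. A self-contained sketch of the effect of that policy, with a hypothetical NextPower2 stand-in:

```cpp
#include <cstdint>
#include <iostream>

// Hypothetical stand-in for BitUtil::NextPower2: smallest power of two >= n.
int64_t NextPower2(int64_t n) {
  int64_t result = 1;
  while (result < n) result <<= 1;
  return result;
}

int main() {
  // Simulate 1000 appends of 10 bytes each against a builder that grows its
  // capacity to the next power of two whenever the append would overflow it.
  int64_t size = 0, capacity = 0, reallocs = 0;
  for (int i = 0; i < 1000; ++i) {
    const int64_t length = 10;               // bytes appended per call
    if (capacity < length + size) {
      capacity = NextPower2(length + size);  // geometric growth
      ++reallocs;
    }
    size += length;
  }
  // Prints: size=10000 capacity=16384 reallocations=11
  std::cout << "size=" << size << " capacity=" << capacity
            << " reallocations=" << reallocs << std::endl;
  return 0;
}
```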
std::shared_ptr* out); +Status ARROW_EXPORT AllocateBuffer(MemoryPool* pool, int64_t size, + std::shared_ptr* out); -Status ARROW_EXPORT AllocateResizableBuffer( - MemoryPool* pool, int64_t size, std::shared_ptr* out); +Status ARROW_EXPORT AllocateResizableBuffer(MemoryPool* pool, int64_t size, + std::shared_ptr* out); } // namespace arrow diff --git a/cpp/src/arrow/builder-benchmark.cc b/cpp/src/arrow/builder-benchmark.cc index 7ca7bb49998..13d7b20591d 100644 --- a/cpp/src/arrow/builder-benchmark.cc +++ b/cpp/src/arrow/builder-benchmark.cc @@ -38,8 +38,8 @@ static void BM_BuildPrimitiveArrayNoNulls( std::shared_ptr out; ABORT_NOT_OK(builder.Finish(&out)); } - state.SetBytesProcessed( - state.iterations() * data.size() * sizeof(int64_t) * kFinalSize); + state.SetBytesProcessed(state.iterations() * data.size() * sizeof(int64_t) * + kFinalSize); } static void BM_BuildVectorNoNulls( @@ -53,8 +53,8 @@ static void BM_BuildVectorNoNulls( builder.insert(builder.end(), data.cbegin(), data.cend()); } } - state.SetBytesProcessed( - state.iterations() * data.size() * sizeof(int64_t) * kFinalSize); + state.SetBytesProcessed(state.iterations() * data.size() * sizeof(int64_t) * + kFinalSize); } static void BM_BuildAdaptiveIntNoNulls( @@ -127,8 +127,8 @@ static void BM_BuildDictionary(benchmark::State& state) { // NOLINT non-const r std::shared_ptr out; ABORT_NOT_OK(builder.Finish(&out)); } - state.SetBytesProcessed( - state.iterations() * iterations * (iterations + 1) / 2 * sizeof(int64_t)); + state.SetBytesProcessed(state.iterations() * iterations * (iterations + 1) / 2 * + sizeof(int64_t)); } static void BM_BuildStringDictionary( @@ -152,8 +152,24 @@ static void BM_BuildStringDictionary( ABORT_NOT_OK(builder.Finish(&out)); } // Assuming a string here needs on average 2 bytes - state.SetBytesProcessed( - state.iterations() * iterations * (iterations + 1) / 2 * sizeof(int32_t)); + state.SetBytesProcessed(state.iterations() * iterations * (iterations + 1) / 2 * + sizeof(int32_t)); +} + +static void BM_BuildBinaryArray(benchmark::State& state) { // NOLINT non-const reference + const int64_t iterations = 1 << 20; + + std::string value = "1234567890"; + while (state.KeepRunning()) { + BinaryBuilder builder(default_memory_pool()); + for (int64_t i = 0; i < iterations; i++) { + ABORT_NOT_OK(builder.Append(value)); + } + std::shared_ptr out; + ABORT_NOT_OK(builder.Finish(&out)); + } + // Assuming a string here needs on average 2 bytes + state.SetBytesProcessed(state.iterations() * iterations * value.size()); } BENCHMARK(BM_BuildPrimitiveArrayNoNulls)->Repetitions(3)->Unit(benchmark::kMicrosecond); @@ -166,4 +182,6 @@ BENCHMARK(BM_BuildAdaptiveUIntNoNulls)->Repetitions(3)->Unit(benchmark::kMicrose BENCHMARK(BM_BuildDictionary)->Repetitions(3)->Unit(benchmark::kMicrosecond); BENCHMARK(BM_BuildStringDictionary)->Repetitions(3)->Unit(benchmark::kMicrosecond); +BENCHMARK(BM_BuildBinaryArray)->Repetitions(3)->Unit(benchmark::kMicrosecond); + } // namespace arrow diff --git a/cpp/src/arrow/builder.cc b/cpp/src/arrow/builder.cc index ee363b91d8f..391204f5669 100644 --- a/cpp/src/arrow/builder.cc +++ b/cpp/src/arrow/builder.cc @@ -37,6 +37,10 @@ namespace arrow { +using internal::AdaptiveIntBuilderBase; +using internal::ArrayData; +using internal::WrappedBinary; + Status ArrayBuilder::AppendToBitmap(bool is_valid) { if (length_ == capacity_) { // If the capacity was not already a multiple of 2, do so here @@ -69,7 +73,9 @@ Status ArrayBuilder::Init(int64_t capacity) { } Status ArrayBuilder::Resize(int64_t new_bits) { - 
if (!null_bitmap_) { return Init(new_bits); } + if (!null_bitmap_) { + return Init(new_bits); + } int64_t new_bytes = BitUtil::CeilByte(new_bits) / 8; int64_t old_bytes = null_bitmap_->size(); RETURN_NOT_OK(null_bitmap_->Resize(new_bytes)); @@ -78,8 +84,8 @@ Status ArrayBuilder::Resize(int64_t new_bits) { const int64_t byte_capacity = null_bitmap_->capacity(); capacity_ = new_bits; if (old_bytes < new_bytes) { - memset( - null_bitmap_data_ + old_bytes, 0, static_cast(byte_capacity - old_bytes)); + memset(null_bitmap_data_ + old_bytes, 0, + static_cast(byte_capacity - old_bytes)); } return Status::OK(); } @@ -140,7 +146,9 @@ void ArrayBuilder::UnsafeAppendToBitmap(const uint8_t* valid_bytes, int64_t leng bit_offset++; } - if (bit_offset != 0) { null_bitmap_data_[byte_offset] = bitset; } + if (bit_offset != 0) { + null_bitmap_data_[byte_offset] = bitset; + } length_ += length; } @@ -149,7 +157,9 @@ void ArrayBuilder::UnsafeSetNotNull(int64_t length) { // Fill up the bytes until we have a byte alignment int64_t pad_to_byte = std::min(8 - (length_ % 8), length); - if (pad_to_byte == 8) { pad_to_byte = 0; } + if (pad_to_byte == 8) { + pad_to_byte = 0; + } for (int64_t i = length_; i < length_ + pad_to_byte; ++i) { BitUtil::SetBit(null_bitmap_data_, i); } @@ -157,7 +167,7 @@ void ArrayBuilder::UnsafeSetNotNull(int64_t length) { // Fast bitsetting int64_t fast_length = (length - pad_to_byte) / 8; memset(null_bitmap_data_ + ((length_ + pad_to_byte) / 8), 0xFF, - static_cast(fast_length)); + static_cast(fast_length)); // Trailing bytes for (int64_t i = length_ + pad_to_byte + (fast_length * 8); i < new_length; ++i) { @@ -184,7 +194,9 @@ Status PrimitiveBuilder::Init(int64_t capacity) { template Status PrimitiveBuilder::Resize(int64_t capacity) { // XXX: Set floor size for now - if (capacity < kMinBuilderCapacity) { capacity = kMinBuilderCapacity; } + if (capacity < kMinBuilderCapacity) { + capacity = kMinBuilderCapacity; + } if (capacity_ == 0) { RETURN_NOT_OK(Init(capacity)); @@ -195,20 +207,20 @@ Status PrimitiveBuilder::Resize(int64_t capacity) { RETURN_NOT_OK(data_->Resize(new_bytes)); raw_data_ = reinterpret_cast(data_->mutable_data()); // TODO(emkornfield) valgrind complains without this - memset( - data_->mutable_data() + old_bytes, 0, static_cast(new_bytes - old_bytes)); + memset(data_->mutable_data() + old_bytes, 0, + static_cast(new_bytes - old_bytes)); } return Status::OK(); } template -Status PrimitiveBuilder::Append( - const value_type* values, int64_t length, const uint8_t* valid_bytes) { +Status PrimitiveBuilder::Append(const value_type* values, int64_t length, + const uint8_t* valid_bytes) { RETURN_NOT_OK(Reserve(length)); if (length > 0) { std::memcpy(raw_data_ + length_, values, - static_cast(TypeTraits::bytes_required(length))); + static_cast(TypeTraits::bytes_required(length))); } // length_ is update by these @@ -224,8 +236,8 @@ Status PrimitiveBuilder::Finish(std::shared_ptr* out) { // Trim buffers RETURN_NOT_OK(data_->Resize(bytes_required)); } - *out = std::make_shared::ArrayType>( - type_, length_, data_, null_bitmap_, null_count_); + *out = std::make_shared::ArrayType>(type_, length_, data_, + null_bitmap_, null_count_); data_ = null_bitmap_ = nullptr; capacity_ = length_ = null_count_ = 0; @@ -267,7 +279,9 @@ Status AdaptiveIntBuilderBase::Init(int64_t capacity) { Status AdaptiveIntBuilderBase::Resize(int64_t capacity) { // XXX: Set floor size for now - if (capacity < kMinBuilderCapacity) { capacity = kMinBuilderCapacity; } + if (capacity < kMinBuilderCapacity) { + 
capacity = kMinBuilderCapacity; + } if (capacity_ == 0) { RETURN_NOT_OK(Init(capacity)); @@ -278,8 +292,8 @@ Status AdaptiveIntBuilderBase::Resize(int64_t capacity) { RETURN_NOT_OK(data_->Resize(new_bytes)); raw_data_ = data_->mutable_data(); // TODO(emkornfield) valgrind complains without this - memset( - data_->mutable_data() + old_bytes, 0, static_cast(new_bytes - old_bytes)); + memset(data_->mutable_data() + old_bytes, 0, + static_cast(new_bytes - old_bytes)); } return Status::OK(); } @@ -298,16 +312,16 @@ Status AdaptiveIntBuilder::Finish(std::shared_ptr* out) { std::make_shared(int8(), length_, data_, null_bitmap_, null_count_); break; case 2: - *out = std::make_shared( - int16(), length_, data_, null_bitmap_, null_count_); + *out = std::make_shared(int16(), length_, data_, null_bitmap_, + null_count_); break; case 4: - *out = std::make_shared( - int32(), length_, data_, null_bitmap_, null_count_); + *out = std::make_shared(int32(), length_, data_, null_bitmap_, + null_count_); break; case 8: - *out = std::make_shared( - int64(), length_, data_, null_bitmap_, null_count_); + *out = std::make_shared(int64(), length_, data_, null_bitmap_, + null_count_); break; default: DCHECK(false); @@ -319,8 +333,8 @@ Status AdaptiveIntBuilder::Finish(std::shared_ptr* out) { return Status::OK(); } -Status AdaptiveIntBuilder::Append( - const int64_t* values, int64_t length, const uint8_t* valid_bytes) { +Status AdaptiveIntBuilder::Append(const int64_t* values, int64_t length, + const uint8_t* valid_bytes) { RETURN_NOT_OK(Reserve(length)); if (length > 0) { @@ -328,16 +342,18 @@ Status AdaptiveIntBuilder::Append( uint8_t new_int_size = int_size_; for (int64_t i = 0; i < length; i++) { if (valid_bytes == nullptr || valid_bytes[i]) { - new_int_size = expanded_int_size(values[i], new_int_size); + new_int_size = internal::ExpandedIntSize(values[i], new_int_size); } } - if (new_int_size != int_size_) { RETURN_NOT_OK(ExpandIntSize(new_int_size)); } + if (new_int_size != int_size_) { + RETURN_NOT_OK(ExpandIntSize(new_int_size)); + } } } if (int_size_ == 8) { std::memcpy(reinterpret_cast(raw_data_) + length_, values, - sizeof(int64_t) * length); + sizeof(int64_t) * length); } else { #ifdef _MSC_VER #pragma warning(push) @@ -348,17 +364,17 @@ Status AdaptiveIntBuilder::Append( case 1: { int8_t* data_ptr = reinterpret_cast(raw_data_) + length_; std::transform(values, values + length, data_ptr, - [](int64_t x) { return static_cast(x); }); + [](int64_t x) { return static_cast(x); }); } break; case 2: { int16_t* data_ptr = reinterpret_cast(raw_data_) + length_; std::transform(values, values + length, data_ptr, - [](int64_t x) { return static_cast(x); }); + [](int64_t x) { return static_cast(x); }); } break; case 4: { int32_t* data_ptr = reinterpret_cast(raw_data_) + length_; std::transform(values, values + length, data_ptr, - [](int64_t x) { return static_cast(x); }); + [](int64_t x) { return static_cast(x); }); } break; default: DCHECK(false); @@ -449,20 +465,20 @@ Status AdaptiveUIntBuilder::Finish(std::shared_ptr* out) { } switch (int_size_) { case 1: - *out = std::make_shared( - uint8(), length_, data_, null_bitmap_, null_count_); + *out = std::make_shared(uint8(), length_, data_, null_bitmap_, + null_count_); break; case 2: - *out = std::make_shared( - uint16(), length_, data_, null_bitmap_, null_count_); + *out = std::make_shared(uint16(), length_, data_, null_bitmap_, + null_count_); break; case 4: - *out = std::make_shared( - uint32(), length_, data_, null_bitmap_, null_count_); + *out = 
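The AdaptiveIntBuilder::Append path above scans the incoming values with the (now namespaced) internal::ExpandedIntSize helper, widens the storage once if needed, then narrows each value into the chosen width with std::transform; Finish later emits an Int8/Int16/Int32/Int64 array based on the final width. A sketch of the width-selection idea, using a hypothetical ExpandedIntSize equivalent rather than Arrow's internal function:

```cpp
#include <algorithm>
#include <cstdint>
#include <iostream>
#include <limits>
#include <vector>

// Smallest byte width (1, 2, 4 or 8) that can hold `value`, never narrower
// than the width already in use.
uint8_t ExpandedIntSize(int64_t value, uint8_t current) {
  uint8_t needed;
  if (value >= std::numeric_limits<int8_t>::min() &&
      value <= std::numeric_limits<int8_t>::max()) {
    needed = 1;
  } else if (value >= std::numeric_limits<int16_t>::min() &&
             value <= std::numeric_limits<int16_t>::max()) {
    needed = 2;
  } else if (value >= std::numeric_limits<int32_t>::min() &&
             value <= std::numeric_limits<int32_t>::max()) {
    needed = 4;
  } else {
    needed = 8;
  }
  return std::max(current, needed);
}

int main() {
  // 200 forces 2-byte storage, 70000 forces 4-byte storage.
  std::vector<int64_t> values = {1, -3, 200, 70000};
  uint8_t width = 1;
  for (int64_t v : values) width = ExpandedIntSize(v, width);
  std::cout << "storage width: " << int(width) << " bytes" << std::endl;  // 4 bytes
  return 0;
}
```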
std::make_shared(uint32(), length_, data_, null_bitmap_, + null_count_); break; case 8: - *out = std::make_shared( - uint64(), length_, data_, null_bitmap_, null_count_); + *out = std::make_shared(uint64(), length_, data_, null_bitmap_, + null_count_); break; default: DCHECK(false); @@ -474,8 +490,8 @@ Status AdaptiveUIntBuilder::Finish(std::shared_ptr* out) { return Status::OK(); } -Status AdaptiveUIntBuilder::Append( - const uint64_t* values, int64_t length, const uint8_t* valid_bytes) { +Status AdaptiveUIntBuilder::Append(const uint64_t* values, int64_t length, + const uint8_t* valid_bytes) { RETURN_NOT_OK(Reserve(length)); if (length > 0) { @@ -483,16 +499,18 @@ Status AdaptiveUIntBuilder::Append( uint8_t new_int_size = int_size_; for (int64_t i = 0; i < length; i++) { if (valid_bytes == nullptr || valid_bytes[i]) { - new_int_size = expanded_uint_size(values[i], new_int_size); + new_int_size = internal::ExpandedUIntSize(values[i], new_int_size); } } - if (new_int_size != int_size_) { RETURN_NOT_OK(ExpandIntSize(new_int_size)); } + if (new_int_size != int_size_) { + RETURN_NOT_OK(ExpandIntSize(new_int_size)); + } } } if (int_size_ == 8) { std::memcpy(reinterpret_cast(raw_data_) + length_, values, - sizeof(uint64_t) * length); + sizeof(uint64_t) * length); } else { #ifdef _MSC_VER #pragma warning(push) @@ -503,17 +521,17 @@ Status AdaptiveUIntBuilder::Append( case 1: { uint8_t* data_ptr = reinterpret_cast(raw_data_) + length_; std::transform(values, values + length, data_ptr, - [](uint64_t x) { return static_cast(x); }); + [](uint64_t x) { return static_cast(x); }); } break; case 2: { uint16_t* data_ptr = reinterpret_cast(raw_data_) + length_; std::transform(values, values + length, data_ptr, - [](uint64_t x) { return static_cast(x); }); + [](uint64_t x) { return static_cast(x); }); } break; case 4: { uint32_t* data_ptr = reinterpret_cast(raw_data_) + length_; std::transform(values, values + length, data_ptr, - [](uint64_t x) { return static_cast(x); }); + [](uint64_t x) { return static_cast(x); }); } break; default: DCHECK(false); @@ -616,7 +634,9 @@ Status BooleanBuilder::Init(int64_t capacity) { Status BooleanBuilder::Resize(int64_t capacity) { // XXX: Set floor size for now - if (capacity < kMinBuilderCapacity) { capacity = kMinBuilderCapacity; } + if (capacity < kMinBuilderCapacity) { + capacity = kMinBuilderCapacity; + } if (capacity_ == 0) { RETURN_NOT_OK(Init(capacity)); @@ -627,8 +647,8 @@ Status BooleanBuilder::Resize(int64_t capacity) { RETURN_NOT_OK(data_->Resize(new_bytes)); raw_data_ = reinterpret_cast(data_->mutable_data()); - memset( - data_->mutable_data() + old_bytes, 0, static_cast(new_bytes - old_bytes)); + memset(data_->mutable_data() + old_bytes, 0, + static_cast(new_bytes - old_bytes)); } return Status::OK(); } @@ -647,8 +667,8 @@ Status BooleanBuilder::Finish(std::shared_ptr* out) { return Status::OK(); } -Status BooleanBuilder::Append( - const uint8_t* values, int64_t length, const uint8_t* valid_bytes) { +Status BooleanBuilder::Append(const uint8_t* values, int64_t length, + const uint8_t* valid_bytes) { RETURN_NOT_OK(Reserve(length)); for (int64_t i = 0; i < length; ++i) { @@ -673,14 +693,16 @@ Status BooleanBuilder::Append( // DictionaryBuilder template -DictionaryBuilder::DictionaryBuilder( - MemoryPool* pool, const std::shared_ptr& type) +DictionaryBuilder::DictionaryBuilder(MemoryPool* pool, + const std::shared_ptr& type) : ArrayBuilder(pool, type), hash_table_(new PoolBuffer(pool)), hash_slots_(nullptr), dict_builder_(pool, type), values_builder_(pool) { - 
if (!::arrow::CpuInfo::initialized()) { ::arrow::CpuInfo::Init(); } + if (!::arrow::CpuInfo::initialized()) { + ::arrow::CpuInfo::Init(); + } } template @@ -699,7 +721,9 @@ Status DictionaryBuilder::Init(int64_t elements) { template Status DictionaryBuilder::Resize(int64_t capacity) { - if (capacity < kMinBuilderCapacity) { capacity = kMinBuilderCapacity; } + if (capacity < kMinBuilderCapacity) { + capacity = kMinBuilderCapacity; + } if (capacity_ == 0) { return Init(capacity); @@ -732,7 +756,9 @@ Status DictionaryBuilder::Append(const Scalar& value) { while (kHashSlotEmpty != index && SlotDifferent(index, value)) { // Linear probing ++j; - if (j == hash_table_size_) { j = 0; } + if (j == hash_table_size_) { + j = 0; + } index = hash_slots_[j]; } @@ -784,7 +810,9 @@ Status DictionaryBuilder::DoubleTableSize() { for (int i = 0; i < hash_table_size_; ++i) { hash_slot_t index = hash_slots_[i]; - if (index == kHashSlotEmpty) { continue; } + if (index == kHashSlotEmpty) { + continue; + } // Compute the hash value mod the new table size to start looking for an // empty slot @@ -796,7 +824,9 @@ Status DictionaryBuilder::DoubleTableSize() { while (kHashSlotEmpty != slot && SlotDifferent(slot, value)) { ++j; - if (j == new_size) { j = 0; } + if (j == new_size) { + j = 0; + } slot = new_hash_slots[j]; } @@ -835,48 +865,47 @@ Status DictionaryBuilder::AppendDictionary(const Scalar& value) { return dict_builder_.Append(value); } -#define BINARY_DICTIONARY_SPECIALIZATIONS(Type) \ - template <> \ - internal::WrappedBinary DictionaryBuilder::GetDictionaryValue(int64_t index) { \ - int32_t v_len; \ - const uint8_t* v = dict_builder_.GetValue(static_cast(index), &v_len); \ - return internal::WrappedBinary(v, v_len); \ - } \ - \ - template <> \ - Status DictionaryBuilder::AppendDictionary( \ - const internal::WrappedBinary& value) { \ - return dict_builder_.Append(value.ptr_, value.length_); \ - } \ - \ - template <> \ - Status DictionaryBuilder::AppendArray(const Array& array) { \ - const BinaryArray& binary_array = static_cast(array); \ - internal::WrappedBinary value(nullptr, 0); \ - for (int64_t i = 0; i < array.length(); i++) { \ - if (array.IsNull(i)) { \ - RETURN_NOT_OK(AppendNull()); \ - } else { \ - value.ptr_ = binary_array.GetValue(i, &value.length_); \ - RETURN_NOT_OK(Append(value)); \ - } \ - } \ - return Status::OK(); \ - } \ - \ - template <> \ - int DictionaryBuilder::HashValue(const internal::WrappedBinary& value) { \ - return HashUtil::Hash(value.ptr_, value.length_, 0); \ - } \ - \ - template <> \ - bool DictionaryBuilder::SlotDifferent( \ - hash_slot_t index, const internal::WrappedBinary& value) { \ - int32_t other_length; \ - const uint8_t* other_value = \ - dict_builder_.GetValue(static_cast(index), &other_length); \ - return !(other_length == value.length_ && \ - 0 == memcmp(other_value, value.ptr_, value.length_)); \ +#define BINARY_DICTIONARY_SPECIALIZATIONS(Type) \ + template <> \ + WrappedBinary DictionaryBuilder::GetDictionaryValue(int64_t index) { \ + int32_t v_len; \ + const uint8_t* v = dict_builder_.GetValue(static_cast(index), &v_len); \ + return WrappedBinary(v, v_len); \ + } \ + \ + template <> \ + Status DictionaryBuilder::AppendDictionary(const WrappedBinary& value) { \ + return dict_builder_.Append(value.ptr_, value.length_); \ + } \ + \ + template <> \ + Status DictionaryBuilder::AppendArray(const Array& array) { \ + const BinaryArray& binary_array = static_cast(array); \ + WrappedBinary value(nullptr, 0); \ + for (int64_t i = 0; i < array.length(); i++) { \ + if 
(array.IsNull(i)) { \ + RETURN_NOT_OK(AppendNull()); \ + } else { \ + value.ptr_ = binary_array.GetValue(i, &value.length_); \ + RETURN_NOT_OK(Append(value)); \ + } \ + } \ + return Status::OK(); \ + } \ + \ + template <> \ + int DictionaryBuilder::HashValue(const WrappedBinary& value) { \ + return HashUtil::Hash(value.ptr_, value.length_, 0); \ + } \ + \ + template <> \ + bool DictionaryBuilder::SlotDifferent(hash_slot_t index, \ + const WrappedBinary& value) { \ + int32_t other_length; \ + const uint8_t* other_value = \ + dict_builder_.GetValue(static_cast(index), &other_length); \ + return !(other_length == value.length_ && \ + 0 == memcmp(other_value, value.ptr_, value.length_)); \ } BINARY_DICTIONARY_SPECIALIZATIONS(StringType); @@ -951,7 +980,9 @@ Status DecimalBuilder::Init(int64_t capacity) { Status DecimalBuilder::Resize(int64_t capacity) { int64_t old_bytes = null_bitmap_ != nullptr ? null_bitmap_->size() : 0; - if (sign_bitmap_ == nullptr) { return Init(capacity); } + if (sign_bitmap_ == nullptr) { + return Init(capacity); + } RETURN_NOT_OK(FixedSizeBinaryBuilder::Resize(capacity)); if (byte_width_ == 16) { @@ -962,7 +993,7 @@ Status DecimalBuilder::Resize(int64_t capacity) { // The buffer might be overpadded to deal with padding according to the spec if (old_bytes < new_bytes) { memset(sign_bitmap_data_ + old_bytes, 0, - static_cast(sign_bitmap_->capacity() - old_bytes)); + static_cast(sign_bitmap_->capacity() - old_bytes)); } } return Status::OK(); @@ -973,8 +1004,8 @@ Status DecimalBuilder::Finish(std::shared_ptr* out) { RETURN_NOT_OK(byte_builder_.Finish(&data)); /// TODO(phillipc): not sure where to get the offset argument here - *out = std::make_shared( - type_, length_, data, null_bitmap_, null_count_, 0, sign_bitmap_); + *out = std::make_shared(type_, length_, data, null_bitmap_, null_count_, + 0, sign_bitmap_); return Status::OK(); } @@ -982,15 +1013,15 @@ Status DecimalBuilder::Finish(std::shared_ptr* out) { // ListBuilder ListBuilder::ListBuilder(MemoryPool* pool, std::unique_ptr value_builder, - const std::shared_ptr& type) - : ArrayBuilder( - pool, type ? type : std::static_pointer_cast( - std::make_shared(value_builder->type()))), + const std::shared_ptr& type) + : ArrayBuilder(pool, + type ? 
type : std::static_pointer_cast( + std::make_shared(value_builder->type()))), offsets_builder_(pool), value_builder_(std::move(value_builder)) {} -Status ListBuilder::Append( - const int32_t* offsets, int64_t length, const uint8_t* valid_bytes) { +Status ListBuilder::Append(const int32_t* offsets, int64_t length, + const uint8_t* valid_bytes) { RETURN_NOT_OK(Reserve(length)); UnsafeAppendToBitmap(valid_bytes, length); offsets_builder_.UnsafeAppend(offsets, length); @@ -1035,10 +1066,12 @@ Status ListBuilder::Finish(std::shared_ptr* out) { RETURN_NOT_OK(offsets_builder_.Finish(&offsets)); std::shared_ptr items = values_; - if (!items) { RETURN_NOT_OK(value_builder_->Finish(&items)); } + if (!items) { + RETURN_NOT_OK(value_builder_->Finish(&items)); + } - *out = std::make_shared( - type_, length_, offsets, items, null_bitmap_, null_count_); + *out = std::make_shared(type_, length_, offsets, items, null_bitmap_, + null_count_); Reset(); return Status::OK(); @@ -1102,7 +1135,7 @@ Status BinaryBuilder::AppendNull() { return Status::OK(); } -Status BinaryBuilder::FinishInternal(std::shared_ptr* out) { +Status BinaryBuilder::FinishInternal(std::shared_ptr* out) { // Write final offset (values length) RETURN_NOT_OK(AppendNextOffset()); std::shared_ptr offsets, value_data; @@ -1111,13 +1144,12 @@ Status BinaryBuilder::FinishInternal(std::shared_ptr* out) RETURN_NOT_OK(value_data_builder_.Finish(&value_data)); BufferVector buffers = {null_bitmap_, offsets, value_data}; - *out = std::make_shared( - type_, length_, std::move(buffers), null_count_, 0); + *out = std::make_shared(type_, length_, std::move(buffers), null_count_, 0); return Status::OK(); } Status BinaryBuilder::Finish(std::shared_ptr* out) { - std::shared_ptr data; + std::shared_ptr data; RETURN_NOT_OK(FinishInternal(&data)); *out = std::make_shared(data); Reset(); @@ -1144,7 +1176,7 @@ const uint8_t* BinaryBuilder::GetValue(int64_t i, int32_t* out_length) const { StringBuilder::StringBuilder(MemoryPool* pool) : BinaryBuilder(pool, utf8()) {} Status StringBuilder::Finish(std::shared_ptr* out) { - std::shared_ptr data; + std::shared_ptr data; RETURN_NOT_OK(FinishInternal(&data)); *out = std::make_shared(data); Reset(); @@ -1154,8 +1186,8 @@ Status StringBuilder::Finish(std::shared_ptr* out) { // ---------------------------------------------------------------------- // Fixed width binary -FixedSizeBinaryBuilder::FixedSizeBinaryBuilder( - MemoryPool* pool, const std::shared_ptr& type) +FixedSizeBinaryBuilder::FixedSizeBinaryBuilder(MemoryPool* pool, + const std::shared_ptr& type) : ArrayBuilder(pool, type), byte_width_(static_cast(*type).byte_width()), byte_builder_(pool) {} @@ -1166,8 +1198,8 @@ Status FixedSizeBinaryBuilder::Append(const uint8_t* value) { return byte_builder_.Append(value, byte_width_); } -Status FixedSizeBinaryBuilder::Append( - const uint8_t* data, int64_t length, const uint8_t* valid_bytes) { +Status FixedSizeBinaryBuilder::Append(const uint8_t* data, int64_t length, + const uint8_t* valid_bytes) { RETURN_NOT_OK(Reserve(length)); UnsafeAppendToBitmap(valid_bytes, length); return byte_builder_.Append(data, length * byte_width_); @@ -1196,8 +1228,8 @@ Status FixedSizeBinaryBuilder::Resize(int64_t capacity) { Status FixedSizeBinaryBuilder::Finish(std::shared_ptr* out) { std::shared_ptr data; RETURN_NOT_OK(byte_builder_.Finish(&data)); - *out = std::make_shared( - type_, length_, data, null_bitmap_, null_count_); + *out = std::make_shared(type_, length_, data, null_bitmap_, + null_count_); return Status::OK(); } @@ -1205,7 
+1237,7 @@ Status FixedSizeBinaryBuilder::Finish(std::shared_ptr* out) { // Struct StructBuilder::StructBuilder(MemoryPool* pool, const std::shared_ptr& type, - std::vector>&& field_builders) + std::vector>&& field_builders) : ArrayBuilder(pool, type) { field_builders_ = std::move(field_builders); } @@ -1237,7 +1269,7 @@ Status StructBuilder::Finish(std::shared_ptr* out) { // // TODO(wesm): come up with a less monolithic strategy Status MakeBuilder(MemoryPool* pool, const std::shared_ptr& type, - std::unique_ptr* out) { + std::unique_ptr* out) { switch (type->id()) { BUILDER_CASE(UINT8, UInt8Builder); BUILDER_CASE(INT8, Int8Builder); @@ -1269,7 +1301,7 @@ Status MakeBuilder(MemoryPool* pool, const std::shared_ptr& type, } case Type::STRUCT: { - const std::vector& fields = type->children(); + const std::vector>& fields = type->children(); std::vector> values_builder; for (auto it : fields) { @@ -1292,7 +1324,7 @@ Status MakeBuilder(MemoryPool* pool, const std::shared_ptr& type, return Status::OK(); Status MakeDictionaryBuilder(MemoryPool* pool, const std::shared_ptr& type, - std::shared_ptr* out) { + std::shared_ptr* out) { switch (type->id()) { DICTIONARY_BUILDER_CASE(UINT8, DictionaryBuilder); DICTIONARY_BUILDER_CASE(INT8, DictionaryBuilder); diff --git a/cpp/src/arrow/builder.h b/cpp/src/arrow/builder.h index 065e115ac58..009fd7ae47d 100644 --- a/cpp/src/arrow/builder.h +++ b/cpp/src/arrow/builder.h @@ -186,8 +186,8 @@ class ARROW_EXPORT PrimitiveBuilder : public ArrayBuilder { /// /// If passed, valid_bytes is of equal length to values, and any zero byte /// will be considered as a null for that slot - Status Append( - const value_type* values, int64_t length, const uint8_t* valid_bytes = nullptr); + Status Append(const value_type* values, int64_t length, + const uint8_t* valid_bytes = nullptr); Status Finish(std::shared_ptr* out) override; Status Init(int64_t capacity) override; @@ -262,6 +262,8 @@ using HalfFloatBuilder = NumericBuilder; using FloatBuilder = NumericBuilder; using DoubleBuilder = NumericBuilder; +namespace internal { + class ARROW_EXPORT AdaptiveIntBuilderBase : public ArrayBuilder { public: explicit AdaptiveIntBuilderBase(MemoryPool* pool); @@ -295,25 +297,49 @@ class ARROW_EXPORT AdaptiveIntBuilderBase : public ArrayBuilder { }; // Check if we would need to expand the underlying storage type -inline uint8_t expanded_uint_size(uint64_t val, uint8_t current_int_size) { +inline uint8_t ExpandedIntSize(int64_t val, uint8_t current_int_size) { + if (current_int_size == 8 || + (current_int_size < 8 && + (val > static_cast(std::numeric_limits::max()) || + val < static_cast(std::numeric_limits::min())))) { + return 8; + } else if (current_int_size == 4 || + (current_int_size < 4 && + (val > static_cast(std::numeric_limits::max()) || + val < static_cast(std::numeric_limits::min())))) { + return 4; + } else if (current_int_size == 2 || + (current_int_size == 1 && + (val > static_cast(std::numeric_limits::max()) || + val < static_cast(std::numeric_limits::min())))) { + return 2; + } else { + return 1; + } +} + +// Check if we would need to expand the underlying storage type +inline uint8_t ExpandedUIntSize(uint64_t val, uint8_t current_int_size) { if (current_int_size == 8 || (current_int_size < 8 && - (val > static_cast(std::numeric_limits::max())))) { + (val > static_cast(std::numeric_limits::max())))) { return 8; } else if (current_int_size == 4 || (current_int_size < 4 && - (val > static_cast(std::numeric_limits::max())))) { + (val > 
static_cast(std::numeric_limits::max())))) { return 4; } else if (current_int_size == 2 || (current_int_size == 1 && - (val > static_cast(std::numeric_limits::max())))) { + (val > static_cast(std::numeric_limits::max())))) { return 2; } else { return 1; } } -class ARROW_EXPORT AdaptiveUIntBuilder : public AdaptiveIntBuilderBase { +} // namespace internal + +class ARROW_EXPORT AdaptiveUIntBuilder : public internal::AdaptiveIntBuilderBase { public: explicit AdaptiveUIntBuilder(MemoryPool* pool); @@ -324,8 +350,10 @@ class ARROW_EXPORT AdaptiveUIntBuilder : public AdaptiveIntBuilderBase { RETURN_NOT_OK(Reserve(1)); BitUtil::SetBit(null_bitmap_data_, length_); - uint8_t new_int_size = expanded_uint_size(val, int_size_); - if (new_int_size != int_size_) { RETURN_NOT_OK(ExpandIntSize(new_int_size)); } + uint8_t new_int_size = internal::ExpandedUIntSize(val, int_size_); + if (new_int_size != int_size_) { + RETURN_NOT_OK(ExpandIntSize(new_int_size)); + } switch (int_size_) { case 1: @@ -350,8 +378,8 @@ class ARROW_EXPORT AdaptiveUIntBuilder : public AdaptiveIntBuilderBase { /// /// If passed, valid_bytes is of equal length to values, and any zero byte /// will be considered as a null for that slot - Status Append( - const uint64_t* values, int64_t length, const uint8_t* valid_bytes = nullptr); + Status Append(const uint64_t* values, int64_t length, + const uint8_t* valid_bytes = nullptr); Status ExpandIntSize(uint8_t new_int_size); Status Finish(std::shared_ptr* out) override; @@ -370,29 +398,7 @@ class ARROW_EXPORT AdaptiveUIntBuilder : public AdaptiveIntBuilderBase { Status ExpandIntSizeN(); }; -// Check if we would need to expand the underlying storage type -inline uint8_t expanded_int_size(int64_t val, uint8_t current_int_size) { - if (current_int_size == 8 || - (current_int_size < 8 && - (val > static_cast(std::numeric_limits::max()) || - val < static_cast(std::numeric_limits::min())))) { - return 8; - } else if (current_int_size == 4 || - (current_int_size < 4 && - (val > static_cast(std::numeric_limits::max()) || - val < static_cast(std::numeric_limits::min())))) { - return 4; - } else if (current_int_size == 2 || - (current_int_size == 1 && - (val > static_cast(std::numeric_limits::max()) || - val < static_cast(std::numeric_limits::min())))) { - return 2; - } else { - return 1; - } -} - -class ARROW_EXPORT AdaptiveIntBuilder : public AdaptiveIntBuilderBase { +class ARROW_EXPORT AdaptiveIntBuilder : public internal::AdaptiveIntBuilderBase { public: explicit AdaptiveIntBuilder(MemoryPool* pool); @@ -403,8 +409,10 @@ class ARROW_EXPORT AdaptiveIntBuilder : public AdaptiveIntBuilderBase { RETURN_NOT_OK(Reserve(1)); BitUtil::SetBit(null_bitmap_data_, length_); - uint8_t new_int_size = expanded_int_size(val, int_size_); - if (new_int_size != int_size_) { RETURN_NOT_OK(ExpandIntSize(new_int_size)); } + uint8_t new_int_size = internal::ExpandedIntSize(val, int_size_); + if (new_int_size != int_size_) { + RETURN_NOT_OK(ExpandIntSize(new_int_size)); + } switch (int_size_) { case 1: @@ -429,8 +437,8 @@ class ARROW_EXPORT AdaptiveIntBuilder : public AdaptiveIntBuilderBase { /// /// If passed, valid_bytes is of equal length to values, and any zero byte /// will be considered as a null for that slot - Status Append( - const int64_t* values, int64_t length, const uint8_t* valid_bytes = nullptr); + Status Append(const int64_t* values, int64_t length, + const uint8_t* valid_bytes = nullptr); Status ExpandIntSize(uint8_t new_int_size); Status Finish(std::shared_ptr* out) override; @@ -490,8 +498,8 @@ class 
ARROW_EXPORT BooleanBuilder : public ArrayBuilder { /// /// If passed, valid_bytes is of equal length to values, and any zero byte /// will be considered as a null for that slot - Status Append( - const uint8_t* values, int64_t length, const uint8_t* valid_bytes = nullptr); + Status Append(const uint8_t* values, int64_t length, + const uint8_t* valid_bytes = nullptr); Status Finish(std::shared_ptr* out) override; Status Init(int64_t capacity) override; @@ -526,7 +534,7 @@ class ARROW_EXPORT ListBuilder : public ArrayBuilder { /// Use this constructor to incrementally build the value array along with offsets and /// null bitmap. ListBuilder(MemoryPool* pool, std::unique_ptr value_builder, - const std::shared_ptr& type = nullptr); + const std::shared_ptr& type = nullptr); Status Init(int64_t elements) override; Status Resize(int64_t capacity) override; @@ -536,8 +544,8 @@ class ARROW_EXPORT ListBuilder : public ArrayBuilder { /// /// If passed, valid_bytes is of equal length to values, and any zero byte /// will be considered as a null for that slot - Status Append( - const int32_t* offsets, int64_t length, const uint8_t* valid_bytes = nullptr); + Status Append(const int32_t* offsets, int64_t length, + const uint8_t* valid_bytes = nullptr); /// \brief Start a new variable-length list slot /// @@ -626,8 +634,8 @@ class ARROW_EXPORT FixedSizeBinaryBuilder : public ArrayBuilder { FixedSizeBinaryBuilder(MemoryPool* pool, const std::shared_ptr& type); Status Append(const uint8_t* value); - Status Append( - const uint8_t* data, int64_t length, const uint8_t* valid_bytes = nullptr); + Status Append(const uint8_t* data, int64_t length, + const uint8_t* valid_bytes = nullptr); Status Append(const std::string& value); Status AppendNull(); @@ -672,7 +680,7 @@ class ARROW_EXPORT DecimalBuilder : public FixedSizeBinaryBuilder { class ARROW_EXPORT StructBuilder : public ArrayBuilder { public: StructBuilder(MemoryPool* pool, const std::shared_ptr& type, - std::vector>&& field_builders); + std::vector>&& field_builders); Status Finish(std::shared_ptr* out) override; @@ -808,7 +816,7 @@ class ARROW_EXPORT BinaryDictionaryBuilder : public DictionaryBuilder(value.c_str()), - static_cast(value.size()))); + static_cast(value.size()))); } }; @@ -829,7 +837,7 @@ class ARROW_EXPORT StringDictionaryBuilder : public DictionaryBuilder(value.c_str()), - static_cast(value.size()))); + static_cast(value.size()))); } }; @@ -837,10 +845,11 @@ class ARROW_EXPORT StringDictionaryBuilder : public DictionaryBuilder& type, - std::unique_ptr* out); + std::unique_ptr* out); Status ARROW_EXPORT MakeDictionaryBuilder(MemoryPool* pool, - const std::shared_ptr& type, std::shared_ptr* out); + const std::shared_ptr& type, + std::shared_ptr* out); } // namespace arrow diff --git a/cpp/src/arrow/compare.cc b/cpp/src/arrow/compare.cc index 1465e0b414f..dda5fdd95d0 100644 --- a/cpp/src/arrow/compare.cc +++ b/cpp/src/arrow/compare.cc @@ -38,10 +38,12 @@ namespace arrow { // ---------------------------------------------------------------------- // Public method implementations +namespace internal { + class RangeEqualsVisitor { public: RangeEqualsVisitor(const Array& right, int64_t left_start_idx, int64_t left_end_idx, - int64_t right_start_idx) + int64_t right_start_idx) : right_(right), left_start_idx_(left_start_idx), left_end_idx_(left_end_idx), @@ -71,7 +73,9 @@ class RangeEqualsVisitor { for (int64_t i = left_start_idx_, o_i = right_start_idx_; i < left_end_idx_; ++i, ++o_i) { const bool is_null = left.IsNull(i); - if (is_null != 
right.IsNull(o_i)) { return false; } + if (is_null != right.IsNull(o_i)) { + return false; + } if (is_null) continue; const int32_t begin_offset = left.value_offset(i); const int32_t end_offset = left.value_offset(i + 1); @@ -84,8 +88,8 @@ class RangeEqualsVisitor { if (end_offset - begin_offset > 0 && std::memcmp(left.value_data()->data() + begin_offset, - right.value_data()->data() + right_begin_offset, - static_cast(end_offset - begin_offset))) { + right.value_data()->data() + right_begin_offset, + static_cast(end_offset - begin_offset))) { return false; } } @@ -101,7 +105,9 @@ class RangeEqualsVisitor { for (int64_t i = left_start_idx_, o_i = right_start_idx_; i < left_end_idx_; ++i, ++o_i) { const bool is_null = left.IsNull(i); - if (is_null != right.IsNull(o_i)) { return false; } + if (is_null != right.IsNull(o_i)) { + return false; + } if (is_null) continue; const int32_t begin_offset = left.value_offset(i); const int32_t end_offset = left.value_offset(i + 1); @@ -111,8 +117,8 @@ class RangeEqualsVisitor { if (end_offset - begin_offset != right_end_offset - right_begin_offset) { return false; } - if (!left_values->RangeEquals( - begin_offset, end_offset, right_begin_offset, right_values)) { + if (!left_values->RangeEquals(begin_offset, end_offset, right_begin_offset, + right_values)) { return false; } } @@ -124,7 +130,9 @@ class RangeEqualsVisitor { bool equal_fields = true; for (int64_t i = left_start_idx_, o_i = right_start_idx_; i < left_end_idx_; ++i, ++o_i) { - if (left.IsNull(i) != right.IsNull(o_i)) { return false; } + if (left.IsNull(i) != right.IsNull(o_i)) { + return false; + } if (left.IsNull(i)) continue; for (int j = 0; j < left.num_fields(); ++j) { // TODO: really we should be comparing stretches of non-null data rather @@ -132,9 +140,11 @@ class RangeEqualsVisitor { const int64_t left_abs_index = i + left.offset(); const int64_t right_abs_index = o_i + right.offset(); - equal_fields = left.field(j)->RangeEquals( - left_abs_index, left_abs_index + 1, right_abs_index, right.field(j)); - if (!equal_fields) { return false; } + equal_fields = left.field(j)->RangeEquals(left_abs_index, left_abs_index + 1, + right_abs_index, right.field(j)); + if (!equal_fields) { + return false; + } } } return true; @@ -144,7 +154,9 @@ class RangeEqualsVisitor { const auto& right = static_cast(right_); const UnionMode union_mode = left.mode(); - if (union_mode != right.mode()) { return false; } + if (union_mode != right.mode()) { + return false; + } const auto& left_type = static_cast(*left.type()); @@ -154,7 +166,9 @@ class RangeEqualsVisitor { const std::vector& type_codes = left_type.type_codes(); for (size_t i = 0; i < type_codes.size(); ++i) { const uint8_t code = type_codes[i]; - if (code > max_code) { max_code = code; } + if (code > max_code) { + max_code = code; + } } // Store mapping in a vector for constant time lookups @@ -169,9 +183,13 @@ class RangeEqualsVisitor { uint8_t id, child_num; for (int64_t i = left_start_idx_, o_i = right_start_idx_; i < left_end_idx_; ++i, ++o_i) { - if (left.IsNull(i) != right.IsNull(o_i)) { return false; } + if (left.IsNull(i) != right.IsNull(o_i)) { + return false; + } if (left.IsNull(i)) continue; - if (left_ids[i] != right_ids[o_i]) { return false; } + if (left_ids[i] != right_ids[o_i]) { + return false; + } id = left_ids[i]; child_num = type_id_to_child_num[id]; @@ -183,14 +201,15 @@ class RangeEqualsVisitor { // rather than looking at one value at a time. 
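// [Editor's note: illustrative annotation, not part of the patch] The branch
// below handles the two union layouts differently: in a SPARSE union every
// child array has one slot per parent slot, so children are compared at the
// same absolute index, while in a DENSE union each parent slot stores an
// offset into the selected child, so the comparison goes through
// raw_value_offsets() before calling RangeEquals on that child.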
if (union_mode == UnionMode::SPARSE) { if (!left.child(child_num)->RangeEquals(left_abs_index, left_abs_index + 1, - right_abs_index, right.child(child_num))) { + right_abs_index, + right.child(child_num))) { return false; } } else { const int32_t offset = left.raw_value_offsets()[i]; const int32_t o_offset = right.raw_value_offsets()[o_i]; - if (!left.child(child_num)->RangeEquals( - offset, offset + 1, o_offset, right.child(child_num))) { + if (!left.child(child_num)->RangeEquals(offset, offset + 1, o_offset, + right.child(child_num))) { return false; } } @@ -211,9 +230,13 @@ class RangeEqualsVisitor { const uint8_t* left_data = nullptr; const uint8_t* right_data = nullptr; - if (left.values()) { left_data = left.raw_values() + left.offset() * width; } + if (left.values()) { + left_data = left.raw_values() + left.offset() * width; + } - if (right.values()) { right_data = right.raw_values() + right.offset() * width; } + if (right.values()) { + right_data = right.raw_values() + right.offset() * width; + } for (int64_t i = left_start_idx_, o_i = right_start_idx_; i < left_end_idx_; ++i, ++o_i) { @@ -241,9 +264,13 @@ class RangeEqualsVisitor { const uint8_t* left_data = nullptr; const uint8_t* right_data = nullptr; - if (left.values()) { left_data = left.raw_values() + left.offset() * width; } + if (left.values()) { + left_data = left.raw_values() + left.offset() * width; + } - if (right.values()) { right_data = right.raw_values() + right.offset() * width; } + if (right.values()) { + right_data = right.raw_values() + right.offset() * width; + } for (int64_t i = left_start_idx_, o_i = right_start_idx_; i < left_end_idx_; ++i, ++o_i) { @@ -301,8 +328,8 @@ class RangeEqualsVisitor { result_ = false; return Status::OK(); } - result_ = left.indices()->RangeEquals( - left_start_idx_, left_end_idx_, right_start_idx_, right.indices()); + result_ = left.indices()->RangeEquals(left_start_idx_, left_end_idx_, + right_start_idx_, right.indices()); return Status::OK(); } @@ -324,7 +351,9 @@ static bool IsEqualPrimitive(const PrimitiveArray& left, const PrimitiveArray& r const uint8_t* left_data = nullptr; const uint8_t* right_data = nullptr; - if (left.values()) { left_data = left.values()->data() + left.offset() * byte_width; } + if (left.values()) { + left_data = left.values()->data() + left.offset() * byte_width; + } if (right.values()) { right_data = right.values()->data() + right.offset() * byte_width; } @@ -341,13 +370,13 @@ static bool IsEqualPrimitive(const PrimitiveArray& left, const PrimitiveArray& r return true; } else { return memcmp(left_data, right_data, - static_cast(byte_width * left.length())) == 0; + static_cast(byte_width * left.length())) == 0; } } template -static inline bool CompareBuiltIn( - const Array& left, const Array& right, const T* ldata, const T* rdata) { +static inline bool CompareBuiltIn(const Array& left, const Array& right, const T* ldata, + const T* rdata) { if (left.null_count() > 0) { for (int64_t i = 0; i < left.length(); ++i) { if (left.IsNull(i) != right.IsNull(i)) { @@ -369,17 +398,21 @@ static bool IsEqualDecimal(const DecimalArray& left, const DecimalArray& right) const uint8_t* left_data = nullptr; const uint8_t* right_data = nullptr; - if (left.values()) { left_data = left.values()->data(); } - if (right.values()) { right_data = right.values()->data(); } + if (left.values()) { + left_data = left.values()->data(); + } + if (right.values()) { + right_data = right.values()->data(); + } const int32_t byte_width = left.byte_width(); if (byte_width == 4) { - return 
CompareBuiltIn(left, right, - reinterpret_cast(left_data) + loffset, + return CompareBuiltIn( + left, right, reinterpret_cast(left_data) + loffset, reinterpret_cast(right_data) + roffset); } else if (byte_width == 8) { - return CompareBuiltIn(left, right, - reinterpret_cast(left_data) + loffset, + return CompareBuiltIn( + left, right, reinterpret_cast(left_data) + loffset, reinterpret_cast(right_data) + roffset); } else { // 128-bit @@ -387,8 +420,12 @@ static bool IsEqualDecimal(const DecimalArray& left, const DecimalArray& right) // Must also compare sign bitmap const uint8_t* left_sign = nullptr; const uint8_t* right_sign = nullptr; - if (left.sign_bitmap()) { left_sign = left.sign_bitmap()->data(); } - if (right.sign_bitmap()) { right_sign = right.sign_bitmap()->data(); } + if (left.sign_bitmap()) { + left_sign = left.sign_bitmap()->data(); + } + if (right.sign_bitmap()) { + right_sign = right.sign_bitmap()->data(); + } for (int64_t i = 0; i < left.length(); ++i) { bool left_null = left.IsNull(i); @@ -434,7 +471,7 @@ class ArrayEqualsVisitor : public RangeEqualsVisitor { result_ = true; } else { result_ = BitmapEquals(left.values()->data(), left.offset(), right.values()->data(), - right.offset(), left.length()); + right.offset(), left.length()); } return Status::OK(); } @@ -442,7 +479,7 @@ class ArrayEqualsVisitor : public RangeEqualsVisitor { template typename std::enable_if::value && !std::is_base_of::value, - Status>::type + Status>::type Visit(const T& left) { result_ = IsEqualPrimitive(left, static_cast(right_)); return Status::OK(); @@ -458,8 +495,8 @@ class ArrayEqualsVisitor : public RangeEqualsVisitor { const auto& right = static_cast(right_); if (left.offset() == 0 && right.offset() == 0) { - return left.value_offsets()->Equals( - *right.value_offsets(), (left.length() + 1) * sizeof(int32_t)); + return left.value_offsets()->Equals(*right.value_offsets(), + (left.length() + 1) * sizeof(int32_t)); } else { // One of the arrays is sliced; logic is more complicated because the // value offsets are not both 0-based @@ -482,10 +519,16 @@ class ArrayEqualsVisitor : public RangeEqualsVisitor { const auto& right = static_cast(right_); bool equal_offsets = ValueOffsetsEqual(left); - if (!equal_offsets) { return false; } + if (!equal_offsets) { + return false; + } - if (!left.value_data() && !(right.value_data())) { return true; } - if (left.value_offset(left.length()) == 0) { return true; } + if (!left.value_data() && !(right.value_data())) { + return true; + } + if (left.value_offset(left.length()) == 0) { + return true; + } const uint8_t* left_data = left.value_data()->data(); const uint8_t* right_data = right.value_data()->data(); @@ -493,23 +536,25 @@ class ArrayEqualsVisitor : public RangeEqualsVisitor { if (left.null_count() == 0) { // Fast path for null count 0, single memcmp if (left.offset() == 0 && right.offset() == 0) { - return std::memcmp( - left_data, right_data, left.raw_value_offsets()[left.length()]) == 0; + return std::memcmp(left_data, right_data, + left.raw_value_offsets()[left.length()]) == 0; } else { const int64_t total_bytes = left.value_offset(left.length()) - left.value_offset(0); return std::memcmp(left_data + left.value_offset(0), - right_data + right.value_offset(0), - static_cast(total_bytes)) == 0; + right_data + right.value_offset(0), + static_cast(total_bytes)) == 0; } } else { // ARROW-537: Only compare data in non-null slots const int32_t* left_offsets = left.raw_value_offsets(); const int32_t* right_offsets = right.raw_value_offsets(); for (int64_t i 
= 0; i < left.length(); ++i) { - if (left.IsNull(i)) { continue; } + if (left.IsNull(i)) { + continue; + } if (std::memcmp(left_data + left_offsets[i], right_data + right_offsets[i], - left.value_length(i))) { + left.value_length(i))) { return false; } } @@ -530,8 +575,9 @@ class ArrayEqualsVisitor : public RangeEqualsVisitor { return Status::OK(); } - result_ = left.values()->RangeEquals(left.value_offset(0), - left.value_offset(left.length()), right.value_offset(0), right.values()); + result_ = + left.values()->RangeEquals(left.value_offset(0), left.value_offset(left.length()), + right.value_offset(0), right.values()); return Status::OK(); } @@ -547,15 +593,15 @@ class ArrayEqualsVisitor : public RangeEqualsVisitor { template typename std::enable_if::value, - Status>::type + Status>::type Visit(const T& left) { return RangeEqualsVisitor::Visit(left); } }; template -inline bool FloatingApproxEquals( - const NumericArray& left, const NumericArray& right) { +inline bool FloatingApproxEquals(const NumericArray& left, + const NumericArray& right) { using T = typename TYPE::c_type; const T* left_data = left.raw_values(); @@ -566,11 +612,15 @@ inline bool FloatingApproxEquals( if (left.null_count() > 0) { for (int64_t i = 0; i < left.length(); ++i) { if (left.IsNull(i)) continue; - if (fabs(left_data[i] - right_data[i]) > EPSILON) { return false; } + if (fabs(left_data[i] - right_data[i]) > EPSILON) { + return false; + } } } else { for (int64_t i = 0; i < left.length(); ++i) { - if (fabs(left_data[i] - right_data[i]) > EPSILON) { return false; } + if (fabs(left_data[i] - right_data[i]) > EPSILON) { + return false; + } } } return true; @@ -601,7 +651,7 @@ static bool BaseDataEquals(const Array& left, const Array& right) { } if (left.null_count() > 0 && left.null_count() < left.length()) { return BitmapEquals(left.null_bitmap()->data(), left.offset(), - right.null_bitmap()->data(), right.offset(), left.length()); + right.null_bitmap()->data(), right.offset(), left.length()); } return true; } @@ -625,63 +675,6 @@ inline Status ArrayEqualsImpl(const Array& left, const Array& right, bool* are_e return Status::OK(); } -Status ArrayEquals(const Array& left, const Array& right, bool* are_equal) { - return ArrayEqualsImpl(left, right, are_equal); -} - -Status ArrayApproxEquals(const Array& left, const Array& right, bool* are_equal) { - return ArrayEqualsImpl(left, right, are_equal); -} - -Status ArrayRangeEquals(const Array& left, const Array& right, int64_t left_start_idx, - int64_t left_end_idx, int64_t right_start_idx, bool* are_equal) { - if (&left == &right) { - *are_equal = true; - } else if (left.type_id() != right.type_id()) { - *are_equal = false; - } else if (left.length() == 0) { - *are_equal = true; - } else { - RangeEqualsVisitor visitor(right, left_start_idx, left_end_idx, right_start_idx); - RETURN_NOT_OK(VisitArrayInline(left, &visitor)); - *are_equal = visitor.result(); - } - return Status::OK(); -} - -// ---------------------------------------------------------------------- -// Implement TensorEquals - -Status TensorEquals(const Tensor& left, const Tensor& right, bool* are_equal) { - // The arrays are the same object - if (&left == &right) { - *are_equal = true; - } else if (left.type_id() != right.type_id()) { - *are_equal = false; - } else if (left.size() == 0) { - *are_equal = true; - } else { - if (!left.is_contiguous() || !right.is_contiguous()) { - return Status::NotImplemented( - "Comparison not implemented for non-contiguous tensors"); - } - - const auto& size_meta = 
dynamic_cast(*left.type()); - const int byte_width = size_meta.bit_width() / 8; - DCHECK_GT(byte_width, 0); - - const uint8_t* left_data = left.data()->data(); - const uint8_t* right_data = right.data()->data(); - - *are_equal = - memcmp(left_data, right_data, static_cast(byte_width * left.size())) == 0; - } - return Status::OK(); -} - -// ---------------------------------------------------------------------- -// Implement TypeEquals - class TypeEqualsVisitor { public: explicit TypeEqualsVisitor(const DataType& right) : right_(right), result_(false) {} @@ -705,7 +698,7 @@ class TypeEqualsVisitor { template typename std::enable_if::value || std::is_base_of::value, - Status>::type + Status>::type Visit(const T& type) { result_ = true; return Status::OK(); @@ -714,7 +707,7 @@ class TypeEqualsVisitor { template typename std::enable_if::value || std::is_base_of::value, - Status>::type + Status>::type Visit(const T& left) { const auto& right = static_cast(right_); result_ = left.unit() == right.unit(); @@ -787,6 +780,60 @@ class TypeEqualsVisitor { bool result_; }; +} // namespace internal + +Status ArrayEquals(const Array& left, const Array& right, bool* are_equal) { + return internal::ArrayEqualsImpl(left, right, are_equal); +} + +Status ArrayApproxEquals(const Array& left, const Array& right, bool* are_equal) { + return internal::ArrayEqualsImpl(left, right, are_equal); +} + +Status ArrayRangeEquals(const Array& left, const Array& right, int64_t left_start_idx, + int64_t left_end_idx, int64_t right_start_idx, bool* are_equal) { + if (&left == &right) { + *are_equal = true; + } else if (left.type_id() != right.type_id()) { + *are_equal = false; + } else if (left.length() == 0) { + *are_equal = true; + } else { + internal::RangeEqualsVisitor visitor(right, left_start_idx, left_end_idx, + right_start_idx); + RETURN_NOT_OK(VisitArrayInline(left, &visitor)); + *are_equal = visitor.result(); + } + return Status::OK(); +} + +Status TensorEquals(const Tensor& left, const Tensor& right, bool* are_equal) { + // The arrays are the same object + if (&left == &right) { + *are_equal = true; + } else if (left.type_id() != right.type_id()) { + *are_equal = false; + } else if (left.size() == 0) { + *are_equal = true; + } else { + if (!left.is_contiguous() || !right.is_contiguous()) { + return Status::NotImplemented( + "Comparison not implemented for non-contiguous tensors"); + } + + const auto& size_meta = dynamic_cast(*left.type()); + const int byte_width = size_meta.bit_width() / 8; + DCHECK_GT(byte_width, 0); + + const uint8_t* left_data = left.data()->data(); + const uint8_t* right_data = right.data()->data(); + + *are_equal = + memcmp(left_data, right_data, static_cast(byte_width * left.size())) == 0; + } + return Status::OK(); +} + Status TypeEquals(const DataType& left, const DataType& right, bool* are_equal) { // The arrays are the same object if (&left == &right) { @@ -794,7 +841,7 @@ Status TypeEquals(const DataType& left, const DataType& right, bool* are_equal) } else if (left.id() != right.id()) { *are_equal = false; } else { - TypeEqualsVisitor visitor(right); + internal::TypeEqualsVisitor visitor(right); RETURN_NOT_OK(VisitTypeInline(left, &visitor)); *are_equal = visitor.result(); } diff --git a/cpp/src/arrow/compare.h b/cpp/src/arrow/compare.h index 96a6435c5df..a36b55320b5 100644 --- a/cpp/src/arrow/compare.h +++ b/cpp/src/arrow/compare.h @@ -34,21 +34,22 @@ class Tensor; /// Returns true if the arrays are exactly equal Status ARROW_EXPORT ArrayEquals(const Array& left, const Array& right, 
bool* are_equal); -Status ARROW_EXPORT TensorEquals( - const Tensor& left, const Tensor& right, bool* are_equal); +Status ARROW_EXPORT TensorEquals(const Tensor& left, const Tensor& right, + bool* are_equal); /// Returns true if the arrays are approximately equal. For non-floating point /// types, this is equivalent to ArrayEquals(left, right) -Status ARROW_EXPORT ArrayApproxEquals( - const Array& left, const Array& right, bool* are_equal); +Status ARROW_EXPORT ArrayApproxEquals(const Array& left, const Array& right, + bool* are_equal); /// Returns true if indicated equal-length segment of arrays is exactly equal Status ARROW_EXPORT ArrayRangeEquals(const Array& left, const Array& right, - int64_t start_idx, int64_t end_idx, int64_t other_start_idx, bool* are_equal); + int64_t start_idx, int64_t end_idx, + int64_t other_start_idx, bool* are_equal); /// Returns true if the type metadata are exactly equal -Status ARROW_EXPORT TypeEquals( - const DataType& left, const DataType& right, bool* are_equal); +Status ARROW_EXPORT TypeEquals(const DataType& left, const DataType& right, + bool* are_equal); } // namespace arrow diff --git a/cpp/src/arrow/io/file.cc b/cpp/src/arrow/io/file.cc index 936655f26db..82e3ba8109c 100644 --- a/cpp/src/arrow/io/file.cc +++ b/cpp/src/arrow/io/file.cc @@ -123,8 +123,8 @@ constexpr const char* kRangeExceptionError = "Range exception during wide-char string conversion"; #endif -static inline Status CheckOpenResult( - int ret, int errno_actual, const char* filename, size_t filename_length) { +static inline Status CheckOpenResult(int ret, int errno_actual, const char* filename, + size_t filename_length) { if (ret == -1) { // TODO: errno codes to strings std::stringstream ss; @@ -134,12 +134,14 @@ static inline Status CheckOpenResult( // this requires c++11 std::wstring_convert, wchar_t> converter; - std::wstring wide_string( - reinterpret_cast(filename), filename_length / sizeof(wchar_t)); + std::wstring wide_string(reinterpret_cast(filename), + filename_length / sizeof(wchar_t)); try { std::string byte_string = converter.to_bytes(wide_string); ss << byte_string; - } catch (const std::range_error&) { ss << kRangeExceptionError; } + } catch (const std::range_error&) { + ss << kRangeExceptionError; + } #else ss << filename; #endif @@ -161,7 +163,9 @@ static inline int64_t lseek64_compat(int fd, int64_t pos, int whence) { #if defined(_MSC_VER) static inline Status ConvertToUtf16(const std::string& input, std::wstring* result) { - if (result == nullptr) { return Status::Invalid("Pointer to result is not valid"); } + if (result == nullptr) { + return Status::Invalid("Pointer to result is not valid"); + } if (input.empty()) { *result = std::wstring(); @@ -171,7 +175,9 @@ static inline Status ConvertToUtf16(const std::string& input, std::wstring* resu std::wstring_convert> utf16_converter; try { *result = utf16_converter.from_bytes(input); - } catch (const std::range_error&) { return Status::Invalid(kRangeExceptionError); } + } catch (const std::range_error&) { + return Status::Invalid(kRangeExceptionError); + } return Status::OK(); } #endif @@ -194,8 +200,8 @@ static inline Status FileOpenReadable(const std::string& filename, int* fd) { return CheckOpenResult(ret, errno_actual, filename.c_str(), filename.size()); } -static inline Status FileOpenWriteable( - const std::string& filename, bool write_only, bool truncate, int* fd) { +static inline Status FileOpenWriteable(const std::string& filename, bool write_only, + bool truncate, int* fd) { int ret; errno_t errno_actual = 0; 
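// [Editor's note: illustrative annotation, not part of the patch] The next
// hunk reformats the flag assembly in FileOpenWriteable: the file is always
// created in binary mode (_O_CREAT | _O_BINARY under MSVC, O_CREAT | O_BINARY
// otherwise), the truncate flag is added only when `truncate` is set, and
// read permission (_S_IREAD) is kept unless `write_only` is requested; only
// the brace style changes here, not the behaviour.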
@@ -205,9 +211,13 @@ static inline Status FileOpenWriteable( int oflag = _O_CREAT | _O_BINARY; int pmode = _S_IWRITE; - if (!write_only) { pmode |= _S_IREAD; } + if (!write_only) { + pmode |= _S_IREAD; + } - if (truncate) { oflag |= _O_TRUNC; } + if (truncate) { + oflag |= _O_TRUNC; + } if (write_only) { oflag |= _O_WRONLY; @@ -221,7 +231,9 @@ static inline Status FileOpenWriteable( #else int oflag = O_CREAT | O_BINARY; - if (truncate) { oflag |= O_TRUNC; } + if (truncate) { + oflag |= O_TRUNC; + } if (write_only) { oflag |= O_WRONLY; @@ -239,7 +251,9 @@ static inline Status FileTell(int fd, int64_t* pos) { #if defined(_MSC_VER) current_pos = _telli64(fd); - if (current_pos == -1) { return Status::IOError("_telli64 failed"); } + if (current_pos == -1) { + return Status::IOError("_telli64 failed"); + } #else current_pos = lseek64_compat(fd, 0, SEEK_CUR); CHECK_LSEEK(current_pos); @@ -255,10 +269,12 @@ static inline Status FileSeek(int fd, int64_t pos) { return Status::OK(); } -static inline Status FileRead( - int fd, uint8_t* buffer, int64_t nbytes, int64_t* bytes_read) { +static inline Status FileRead(int fd, uint8_t* buffer, int64_t nbytes, + int64_t* bytes_read) { #if defined(_MSC_VER) - if (nbytes > INT32_MAX) { return Status::IOError("Unable to read > 2GB blocks yet"); } + if (nbytes > INT32_MAX) { + return Status::IOError("Unable to read > 2GB blocks yet"); + } *bytes_read = static_cast(_read(fd, buffer, static_cast(nbytes))); #else *bytes_read = static_cast(read(fd, buffer, static_cast(nbytes))); @@ -323,7 +339,9 @@ static inline Status FileClose(int fd) { ret = static_cast(close(fd)); #endif - if (ret == -1) { return Status::IOError("error closing file"); } + if (ret == -1) { + return Status::IOError("error closing file"); + } return Status::OK(); } @@ -371,7 +389,9 @@ class OSFile { } Status Seek(int64_t pos) { - if (pos < 0) { return Status::Invalid("Invalid position"); } + if (pos < 0) { + return Status::Invalid("Invalid position"); + } return FileSeek(fd_, pos); } @@ -379,7 +399,9 @@ class OSFile { Status Write(const uint8_t* data, int64_t length) { std::lock_guard guard(lock_); - if (length < 0) { return Status::IOError("Length must be non-negative"); } + if (length < 0) { + return Status::IOError("Length must be non-negative"); + } return FileWrite(fd_, data, length); } @@ -421,7 +443,9 @@ class ReadableFile::ReadableFileImpl : public OSFile { int64_t bytes_read = 0; RETURN_NOT_OK(Read(nbytes, &bytes_read, buffer->mutable_data())); - if (bytes_read < nbytes) { RETURN_NOT_OK(buffer->Resize(bytes_read)); } + if (bytes_read < nbytes) { + RETURN_NOT_OK(buffer->Resize(bytes_read)); + } *out = buffer; return Status::OK(); } @@ -430,13 +454,9 @@ class ReadableFile::ReadableFileImpl : public OSFile { MemoryPool* pool_; }; -ReadableFile::ReadableFile(MemoryPool* pool) { - impl_.reset(new ReadableFileImpl(pool)); -} +ReadableFile::ReadableFile(MemoryPool* pool) { impl_.reset(new ReadableFileImpl(pool)); } -ReadableFile::~ReadableFile() { - DCHECK(impl_->Close().ok()); -} +ReadableFile::~ReadableFile() { DCHECK(impl_->Close().ok()); } Status ReadableFile::Open(const std::string& path, std::shared_ptr* file) { *file = std::shared_ptr(new ReadableFile(default_memory_pool())); @@ -444,18 +464,14 @@ Status ReadableFile::Open(const std::string& path, std::shared_ptr } Status ReadableFile::Open(const std::string& path, MemoryPool* memory_pool, - std::shared_ptr* file) { + std::shared_ptr* file) { *file = std::shared_ptr(new ReadableFile(memory_pool)); return (*file)->impl_->Open(path); } 
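// [Editor's note: illustrative sketch, not part of the patch] A minimal read
// path through the ReadableFile API shown above, assuming a valid `path`, the
// arrow/io/file.h header, and abbreviated error handling:
//
//   std::shared_ptr<arrow::io::ReadableFile> file;
//   arrow::Status st = arrow::io::ReadableFile::Open(path, &file);
//   if (!st.ok()) { /* handle error */ }
//   int64_t size = 0;
//   st = file->GetSize(&size);
//   std::vector<uint8_t> out(static_cast<size_t>(size));
//   int64_t bytes_read = 0;
//   st = file->Read(size, &bytes_read, out.data());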
-Status ReadableFile::Close() { - return impl_->Close(); -} +Status ReadableFile::Close() { return impl_->Close(); } -Status ReadableFile::Tell(int64_t* pos) { - return impl_->Tell(pos); -} +Status ReadableFile::Tell(int64_t* pos) { return impl_->Tell(pos); } Status ReadableFile::Read(int64_t nbytes, int64_t* bytes_read, uint8_t* out) { return impl_->Read(nbytes, bytes_read, out); @@ -470,17 +486,11 @@ Status ReadableFile::GetSize(int64_t* size) { return Status::OK(); } -Status ReadableFile::Seek(int64_t pos) { - return impl_->Seek(pos); -} +Status ReadableFile::Seek(int64_t pos) { return impl_->Seek(pos); } -bool ReadableFile::supports_zero_copy() const { - return false; -} +bool ReadableFile::supports_zero_copy() const { return false; } -int ReadableFile::file_descriptor() const { - return impl_->fd(); -} +int ReadableFile::file_descriptor() const { return impl_->fd(); } // ---------------------------------------------------------------------- // FileOutputStream @@ -492,42 +502,34 @@ class FileOutputStream::FileOutputStreamImpl : public OSFile { } }; -FileOutputStream::FileOutputStream() { - impl_.reset(new FileOutputStreamImpl()); -} +FileOutputStream::FileOutputStream() { impl_.reset(new FileOutputStreamImpl()); } FileOutputStream::~FileOutputStream() { // This can fail; better to explicitly call close DCHECK(impl_->Close().ok()); } -Status FileOutputStream::Open( - const std::string& path, std::shared_ptr* file) { +Status FileOutputStream::Open(const std::string& path, + std::shared_ptr* file) { return Open(path, false, file); } -Status FileOutputStream::Open( - const std::string& path, bool append, std::shared_ptr* file) { +Status FileOutputStream::Open(const std::string& path, bool append, + std::shared_ptr* file) { // private ctor *file = std::shared_ptr(new FileOutputStream()); return (*file)->impl_->Open(path, append); } -Status FileOutputStream::Close() { - return impl_->Close(); -} +Status FileOutputStream::Close() { return impl_->Close(); } -Status FileOutputStream::Tell(int64_t* pos) { - return impl_->Tell(pos); -} +Status FileOutputStream::Tell(int64_t* pos) { return impl_->Tell(pos); } Status FileOutputStream::Write(const uint8_t* data, int64_t length) { return impl_->Write(data, length); } -int FileOutputStream::file_descriptor() const { - return impl_->fd(); -} +int FileOutputStream::file_descriptor() const { return impl_->fd(); } // ---------------------------------------------------------------------- // Implement MemoryMappedFile @@ -567,7 +569,7 @@ class MemoryMappedFile::MemoryMap : public MutableBuffer { } void* result = mmap(nullptr, static_cast(file_->size()), prot_flags, map_mode, - file_->fd(), 0); + file_->fd(), 0); if (result == MAP_FAILED) { std::stringstream ss; ss << "Memory mapping file failed, errno: " << errno; @@ -585,7 +587,9 @@ class MemoryMappedFile::MemoryMap : public MutableBuffer { int64_t size() const { return size_; } Status Seek(int64_t position) { - if (position < 0) { return Status::Invalid("position is out of bounds"); } + if (position < 0) { + return Status::Invalid("position is out of bounds"); + } position_ = position; return Status::OK(); } @@ -610,8 +614,8 @@ class MemoryMappedFile::MemoryMap : public MutableBuffer { MemoryMappedFile::MemoryMappedFile() {} MemoryMappedFile::~MemoryMappedFile() {} -Status MemoryMappedFile::Create( - const std::string& path, int64_t size, std::shared_ptr* out) { +Status MemoryMappedFile::Create(const std::string& path, int64_t size, + std::shared_ptr* out) { std::shared_ptr file; 
RETURN_NOT_OK(FileOutputStream::Open(path, &file)); #ifdef _MSC_VER @@ -624,7 +628,7 @@ Status MemoryMappedFile::Create( } Status MemoryMappedFile::Open(const std::string& path, FileMode::type mode, - std::shared_ptr* out) { + std::shared_ptr* out) { std::shared_ptr result(new MemoryMappedFile()); result->memory_map_.reset(new MemoryMap()); @@ -644,9 +648,7 @@ Status MemoryMappedFile::Tell(int64_t* position) { return Status::OK(); } -Status MemoryMappedFile::Seek(int64_t position) { - return memory_map_->Seek(position); -} +Status MemoryMappedFile::Seek(int64_t position) { return memory_map_->Seek(position); } Status MemoryMappedFile::Close() { // munmap handled in pimpl dtor @@ -656,7 +658,9 @@ Status MemoryMappedFile::Close() { Status MemoryMappedFile::Read(int64_t nbytes, int64_t* bytes_read, uint8_t* out) { nbytes = std::max( 0, std::min(nbytes, memory_map_->size() - memory_map_->position())); - if (nbytes > 0) { std::memcpy(out, memory_map_->head(), static_cast(nbytes)); } + if (nbytes > 0) { + std::memcpy(out, memory_map_->head(), static_cast(nbytes)); + } *bytes_read = nbytes; memory_map_->advance(nbytes); return Status::OK(); @@ -675,9 +679,7 @@ Status MemoryMappedFile::Read(int64_t nbytes, std::shared_ptr* out) { return Status::OK(); } -bool MemoryMappedFile::supports_zero_copy() const { - return true; -} +bool MemoryMappedFile::supports_zero_copy() const { return true; } Status MemoryMappedFile::WriteAt(int64_t position, const uint8_t* data, int64_t nbytes) { std::lock_guard guard(lock_); @@ -708,9 +710,7 @@ Status MemoryMappedFile::WriteInternal(const uint8_t* data, int64_t nbytes) { return Status::OK(); } -int MemoryMappedFile::file_descriptor() const { - return memory_map_->fd(); -} +int MemoryMappedFile::file_descriptor() const { return memory_map_->fd(); } } // namespace io } // namespace arrow diff --git a/cpp/src/arrow/io/file.h b/cpp/src/arrow/io/file.h index f0be3cf9801..ba740f1e8f4 100644 --- a/cpp/src/arrow/io/file.h +++ b/cpp/src/arrow/io/file.h @@ -44,8 +44,8 @@ class ARROW_EXPORT FileOutputStream : public OutputStream { // truncated to 0 bytes, deleting any existing memory static Status Open(const std::string& path, std::shared_ptr* file); - static Status Open( - const std::string& path, bool append, std::shared_ptr* file); + static Status Open(const std::string& path, bool append, + std::shared_ptr* file); // OutputStream interface Status Close() override; @@ -73,7 +73,7 @@ class ARROW_EXPORT ReadableFile : public RandomAccessFile { // Open file with one's own memory pool for memory allocations static Status Open(const std::string& path, MemoryPool* memory_pool, - std::shared_ptr* file); + std::shared_ptr* file); Status Close() override; Status Tell(int64_t* position) override; @@ -107,11 +107,11 @@ class ARROW_EXPORT MemoryMappedFile : public ReadWriteFileInterface { ~MemoryMappedFile(); /// Create new file with indicated size, return in read/write mode - static Status Create( - const std::string& path, int64_t size, std::shared_ptr* out); + static Status Create(const std::string& path, int64_t size, + std::shared_ptr* out); static Status Open(const std::string& path, FileMode::type mode, - std::shared_ptr* out); + std::shared_ptr* out); Status Close() override; diff --git a/cpp/src/arrow/io/hdfs-internal.cc b/cpp/src/arrow/io/hdfs-internal.cc index 8b4a92b3967..35657df4620 100644 --- a/cpp/src/arrow/io/hdfs-internal.cc +++ b/cpp/src/arrow/io/hdfs-internal.cc @@ -59,9 +59,9 @@ static std::vector get_potential_libhdfs_paths(); static std::vector 
get_potential_libhdfs3_paths(); static arrow::Status try_dlopen(std::vector potential_paths, const char* name, #ifndef _WIN32 - void*& out_handle); + void*& out_handle); #else - HINSTANCE& out_handle); + HINSTANCE& out_handle); #endif static std::vector get_potential_libhdfs_paths() { @@ -88,7 +88,9 @@ static std::vector get_potential_libhdfs_paths() { } const char* libhdfs_dir = std::getenv("ARROW_LIBHDFS_DIR"); - if (libhdfs_dir != nullptr) { search_paths.push_back(fs::path(libhdfs_dir)); } + if (libhdfs_dir != nullptr) { + search_paths.push_back(fs::path(libhdfs_dir)); + } // All paths with file name for (auto& path : search_paths) { @@ -115,7 +117,9 @@ static std::vector get_potential_libhdfs3_paths() { std::vector search_paths = {fs::path(""), fs::path(".")}; const char* libhdfs3_dir = std::getenv("ARROW_LIBHDFS3_DIR"); - if (libhdfs3_dir != nullptr) { search_paths.push_back(fs::path(libhdfs3_dir)); } + if (libhdfs3_dir != nullptr) { + search_paths.push_back(fs::path(libhdfs3_dir)); + } // All paths with file name for (auto& path : search_paths) { @@ -188,8 +192,8 @@ static std::vector get_potential_libjvm_paths() { } #ifndef _WIN32 -static arrow::Status try_dlopen( - std::vector potential_paths, const char* name, void*& out_handle) { +static arrow::Status try_dlopen(std::vector potential_paths, const char* name, + void*& out_handle) { std::vector error_messages; for (auto& i : potential_paths) { @@ -219,8 +223,8 @@ static arrow::Status try_dlopen( } #else -static arrow::Status try_dlopen( - std::vector potential_paths, const char* name, HINSTANCE& out_handle) { +static arrow::Status try_dlopen(std::vector potential_paths, const char* name, + HINSTANCE& out_handle) { std::vector error_messages; for (auto& i : potential_paths) { @@ -278,13 +282,12 @@ static inline void* GetLibrarySymbol(void* handle, const char* symbol) { namespace arrow { namespace io { +namespace internal { static LibHdfsShim libhdfs_shim; static LibHdfsShim libhdfs3_shim; -hdfsBuilder* LibHdfsShim::NewBuilder(void) { - return this->hdfsNewBuilder(); -} +hdfsBuilder* LibHdfsShim::NewBuilder(void) { return this->hdfsNewBuilder(); } void LibHdfsShim::BuilderSetNameNode(hdfsBuilder* bld, const char* nn) { this->hdfsBuilderSetNameNode(bld, nn); @@ -298,8 +301,8 @@ void LibHdfsShim::BuilderSetUserName(hdfsBuilder* bld, const char* userName) { this->hdfsBuilderSetUserName(bld, userName); } -void LibHdfsShim::BuilderSetKerbTicketCachePath( - hdfsBuilder* bld, const char* kerbTicketCachePath) { +void LibHdfsShim::BuilderSetKerbTicketCachePath(hdfsBuilder* bld, + const char* kerbTicketCachePath) { this->hdfsBuilderSetKerbTicketCachePath(bld, kerbTicketCachePath); } @@ -307,12 +310,10 @@ hdfsFS LibHdfsShim::BuilderConnect(hdfsBuilder* bld) { return this->hdfsBuilderConnect(bld); } -int LibHdfsShim::Disconnect(hdfsFS fs) { - return this->hdfsDisconnect(fs); -} +int LibHdfsShim::Disconnect(hdfsFS fs) { return this->hdfsDisconnect(fs); } hdfsFile LibHdfsShim::OpenFile(hdfsFS fs, const char* path, int flags, int bufferSize, - short replication, tSize blocksize) { // NOLINT + short replication, tSize blocksize) { // NOLINT return this->hdfsOpenFile(fs, path, flags, bufferSize, replication, blocksize); } @@ -328,9 +329,7 @@ int LibHdfsShim::Seek(hdfsFS fs, hdfsFile file, tOffset desiredPos) { return this->hdfsSeek(fs, file, desiredPos); } -tOffset LibHdfsShim::Tell(hdfsFS fs, hdfsFile file) { - return this->hdfsTell(fs, file); -} +tOffset LibHdfsShim::Tell(hdfsFS fs, hdfsFile file) { return this->hdfsTell(fs, file); } tSize 
LibHdfsShim::Read(hdfsFS fs, hdfsFile file, void* buffer, tSize length) { return this->hdfsRead(fs, file, buffer, length); @@ -341,8 +340,8 @@ bool LibHdfsShim::HasPread() { return this->hdfsPread != nullptr; } -tSize LibHdfsShim::Pread( - hdfsFS fs, hdfsFile file, tOffset position, void* buffer, tSize length) { +tSize LibHdfsShim::Pread(hdfsFS fs, hdfsFile file, tOffset position, void* buffer, + tSize length) { GET_SYMBOL(this, hdfsPread); return this->hdfsPread(fs, file, position, buffer, length); } @@ -351,9 +350,7 @@ tSize LibHdfsShim::Write(hdfsFS fs, hdfsFile file, const void* buffer, tSize len return this->hdfsWrite(fs, file, buffer, length); } -int LibHdfsShim::Flush(hdfsFS fs, hdfsFile file) { - return this->hdfsFlush(fs, file); -} +int LibHdfsShim::Flush(hdfsFS fs, hdfsFile file) { return this->hdfsFlush(fs, file); } int LibHdfsShim::Available(hdfsFS fs, hdfsFile file) { GET_SYMBOL(this, hdfsAvailable); @@ -434,8 +431,8 @@ void LibHdfsShim::FreeFileInfo(hdfsFileInfo* hdfsFileInfo, int numEntries) { this->hdfsFreeFileInfo(hdfsFileInfo, numEntries); } -char*** LibHdfsShim::GetHosts( - hdfsFS fs, const char* path, tOffset start, tOffset length) { +char*** LibHdfsShim::GetHosts(hdfsFS fs, const char* path, tOffset start, + tOffset length) { GET_SYMBOL(this, hdfsGetHosts); if (this->hdfsGetHosts) { return this->hdfsGetHosts(fs, path, start, length); @@ -446,7 +443,9 @@ char*** LibHdfsShim::GetHosts( void LibHdfsShim::FreeHosts(char*** blockHosts) { GET_SYMBOL(this, hdfsFreeHosts); - if (this->hdfsFreeHosts) { this->hdfsFreeHosts(blockHosts); } + if (this->hdfsFreeHosts) { + this->hdfsFreeHosts(blockHosts); + } } tOffset LibHdfsShim::GetDefaultBlockSize(hdfsFS fs) { @@ -458,31 +457,17 @@ tOffset LibHdfsShim::GetDefaultBlockSize(hdfsFS fs) { } } -tOffset LibHdfsShim::GetCapacity(hdfsFS fs) { - return this->hdfsGetCapacity(fs); -} +tOffset LibHdfsShim::GetCapacity(hdfsFS fs) { return this->hdfsGetCapacity(fs); } -tOffset LibHdfsShim::GetUsed(hdfsFS fs) { - return this->hdfsGetUsed(fs); -} +tOffset LibHdfsShim::GetUsed(hdfsFS fs) { return this->hdfsGetUsed(fs); } -int LibHdfsShim::Chown( - hdfsFS fs, const char* path, const char* owner, const char* group) { - GET_SYMBOL(this, hdfsChown); - if (this->hdfsChown) { - return this->hdfsChown(fs, path, owner, group); - } else { - return 0; - } +int LibHdfsShim::Chown(hdfsFS fs, const char* path, const char* owner, + const char* group) { + return this->hdfsChown(fs, path, owner, group); } int LibHdfsShim::Chmod(hdfsFS fs, const char* path, short mode) { // NOLINT - GET_SYMBOL(this, hdfsChmod); - if (this->hdfsChmod) { - return this->hdfsChmod(fs, path, mode); - } else { - return 0; - } + return this->hdfsChmod(fs, path, mode); } int LibHdfsShim::Utime(hdfsFS fs, const char* path, tTime mtime, tTime atime) { @@ -510,6 +495,8 @@ Status LibHdfsShim::GetRequiredSymbols() { GET_SYMBOL_REQUIRED(this, hdfsGetUsed); GET_SYMBOL_REQUIRED(this, hdfsGetPathInfo); GET_SYMBOL_REQUIRED(this, hdfsListDirectory); + GET_SYMBOL_REQUIRED(this, hdfsChown); + GET_SYMBOL_REQUIRED(this, hdfsChmod); // File methods GET_SYMBOL_REQUIRED(this, hdfsCloseFile); @@ -570,5 +557,6 @@ Status ConnectLibHdfs3(LibHdfsShim** driver) { return shim->GetRequiredSymbols(); } +} // namespace internal } // namespace io } // namespace arrow diff --git a/cpp/src/arrow/io/hdfs-internal.h b/cpp/src/arrow/io/hdfs-internal.h index c5ea397af0b..f2de00de8b9 100644 --- a/cpp/src/arrow/io/hdfs-internal.h +++ b/cpp/src/arrow/io/hdfs-internal.h @@ -32,6 +32,7 @@ namespace arrow { class Status; 
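// [Editor's note: not part of the patch] Besides moving the shim under
// io::internal, the hdfs-internal.cc hunks above change behaviour slightly:
// LibHdfsShim::Chown and LibHdfsShim::Chmod previously resolved hdfsChown and
// hdfsChmod lazily via GET_SYMBOL and silently returned 0 when the symbol was
// missing; they now call the function pointers directly, and hdfsChown /
// hdfsChmod are added to GetRequiredSymbols(), so a libhdfs build lacking
// them should now be rejected when the required symbols are resolved.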
namespace io { +namespace internal { // NOTE(wesm): cpplint does not like use of short and other imprecise C types struct LibHdfsShim { @@ -45,22 +46,22 @@ struct LibHdfsShim { void (*hdfsBuilderSetNameNode)(hdfsBuilder* bld, const char* nn); void (*hdfsBuilderSetNameNodePort)(hdfsBuilder* bld, tPort port); void (*hdfsBuilderSetUserName)(hdfsBuilder* bld, const char* userName); - void (*hdfsBuilderSetKerbTicketCachePath)( - hdfsBuilder* bld, const char* kerbTicketCachePath); + void (*hdfsBuilderSetKerbTicketCachePath)(hdfsBuilder* bld, + const char* kerbTicketCachePath); hdfsFS (*hdfsBuilderConnect)(hdfsBuilder* bld); int (*hdfsDisconnect)(hdfsFS fs); hdfsFile (*hdfsOpenFile)(hdfsFS fs, const char* path, int flags, int bufferSize, - short replication, tSize blocksize); // NOLINT + short replication, tSize blocksize); // NOLINT int (*hdfsCloseFile)(hdfsFS fs, hdfsFile file); int (*hdfsExists)(hdfsFS fs, const char* path); int (*hdfsSeek)(hdfsFS fs, hdfsFile file, tOffset desiredPos); tOffset (*hdfsTell)(hdfsFS fs, hdfsFile file); tSize (*hdfsRead)(hdfsFS fs, hdfsFile file, void* buffer, tSize length); - tSize (*hdfsPread)( - hdfsFS fs, hdfsFile file, tOffset position, void* buffer, tSize length); + tSize (*hdfsPread)(hdfsFS fs, hdfsFile file, tOffset position, void* buffer, + tSize length); tSize (*hdfsWrite)(hdfsFS fs, hdfsFile file, const void* buffer, tSize length); int (*hdfsFlush)(hdfsFS fs, hdfsFile file); int (*hdfsAvailable)(hdfsFS fs, hdfsFile file); @@ -139,7 +140,7 @@ struct LibHdfsShim { int Disconnect(hdfsFS fs); hdfsFile OpenFile(hdfsFS fs, const char* path, int flags, int bufferSize, - short replication, tSize blocksize); // NOLINT + short replication, tSize blocksize); // NOLINT int CloseFile(hdfsFS fs, hdfsFile file); @@ -205,6 +206,7 @@ struct LibHdfsShim { Status ARROW_EXPORT ConnectLibHdfs(LibHdfsShim** driver); Status ARROW_EXPORT ConnectLibHdfs3(LibHdfsShim** driver); +} // namespace internal } // namespace io } // namespace arrow diff --git a/cpp/src/arrow/io/hdfs.cc b/cpp/src/arrow/io/hdfs.cc index 9ded9bc3f99..ba446b56e00 100644 --- a/cpp/src/arrow/io/hdfs.cc +++ b/cpp/src/arrow/io/hdfs.cc @@ -61,8 +61,8 @@ static constexpr int kDefaultHdfsBufferSize = 1 << 16; class HdfsAnyFileImpl { public: - void set_members( - const std::string& path, LibHdfsShim* driver, hdfsFS fs, hdfsFile handle) { + void set_members(const std::string& path, internal::LibHdfsShim* driver, hdfsFS fs, + hdfsFile handle) { path_ = path; driver_ = driver; fs_ = fs; @@ -88,7 +88,7 @@ class HdfsAnyFileImpl { protected: std::string path_; - LibHdfsShim* driver_; + internal::LibHdfsShim* driver_; // For threadsafety std::mutex lock_; @@ -118,7 +118,7 @@ class HdfsReadableFile::HdfsReadableFileImpl : public HdfsAnyFileImpl { tSize ret; if (driver_->HasPread()) { ret = driver_->Pread(fs_, file_, static_cast(position), - reinterpret_cast(buffer), static_cast(nbytes)); + reinterpret_cast(buffer), static_cast(nbytes)); } else { std::lock_guard guard(lock_); RETURN_NOT_OK(Seek(position)); @@ -136,7 +136,9 @@ class HdfsReadableFile::HdfsReadableFileImpl : public HdfsAnyFileImpl { int64_t bytes_read = 0; RETURN_NOT_OK(ReadAt(position, nbytes, &bytes_read, buffer->mutable_data())); - if (bytes_read < nbytes) { RETURN_NOT_OK(buffer->Resize(bytes_read)); } + if (bytes_read < nbytes) { + RETURN_NOT_OK(buffer->Resize(bytes_read)); + } *out = buffer; return Status::OK(); @@ -145,11 +147,14 @@ class HdfsReadableFile::HdfsReadableFileImpl : public HdfsAnyFileImpl { Status Read(int64_t nbytes, int64_t* bytes_read, 
uint8_t* buffer) { int64_t total_bytes = 0; while (total_bytes < nbytes) { - tSize ret = driver_->Read(fs_, file_, reinterpret_cast(buffer + total_bytes), + tSize ret = driver_->Read( + fs_, file_, reinterpret_cast(buffer + total_bytes), static_cast(std::min(buffer_size_, nbytes - total_bytes))); RETURN_NOT_OK(CheckReadResult(ret)); total_bytes += ret; - if (ret == 0) { break; } + if (ret == 0) { + break; + } } *bytes_read = total_bytes; @@ -162,7 +167,9 @@ class HdfsReadableFile::HdfsReadableFileImpl : public HdfsAnyFileImpl { int64_t bytes_read = 0; RETURN_NOT_OK(Read(nbytes, &bytes_read, buffer->mutable_data())); - if (bytes_read < nbytes) { RETURN_NOT_OK(buffer->Resize(bytes_read)); } + if (bytes_read < nbytes) { + RETURN_NOT_OK(buffer->Resize(bytes_read)); + } *out = buffer; return Status::OK(); @@ -170,7 +177,9 @@ class HdfsReadableFile::HdfsReadableFileImpl : public HdfsAnyFileImpl { Status GetSize(int64_t* size) { hdfsFileInfo* entry = driver_->GetPathInfo(fs_, path_.c_str()); - if (entry == nullptr) { return Status::IOError("HDFS: GetPathInfo failed"); } + if (entry == nullptr) { + return Status::IOError("HDFS: GetPathInfo failed"); + } *size = entry->mSize; driver_->FreeFileInfo(entry, 1); @@ -187,31 +196,27 @@ class HdfsReadableFile::HdfsReadableFileImpl : public HdfsAnyFileImpl { }; HdfsReadableFile::HdfsReadableFile(MemoryPool* pool) { - if (pool == nullptr) { pool = default_memory_pool(); } + if (pool == nullptr) { + pool = default_memory_pool(); + } impl_.reset(new HdfsReadableFileImpl(pool)); } -HdfsReadableFile::~HdfsReadableFile() { - DCHECK(impl_->Close().ok()); -} +HdfsReadableFile::~HdfsReadableFile() { DCHECK(impl_->Close().ok()); } -Status HdfsReadableFile::Close() { - return impl_->Close(); -} +Status HdfsReadableFile::Close() { return impl_->Close(); } -Status HdfsReadableFile::ReadAt( - int64_t position, int64_t nbytes, int64_t* bytes_read, uint8_t* buffer) { +Status HdfsReadableFile::ReadAt(int64_t position, int64_t nbytes, int64_t* bytes_read, + uint8_t* buffer) { return impl_->ReadAt(position, nbytes, bytes_read, buffer); } -Status HdfsReadableFile::ReadAt( - int64_t position, int64_t nbytes, std::shared_ptr* out) { +Status HdfsReadableFile::ReadAt(int64_t position, int64_t nbytes, + std::shared_ptr* out) { return impl_->ReadAt(position, nbytes, out); } -bool HdfsReadableFile::supports_zero_copy() const { - return false; -} +bool HdfsReadableFile::supports_zero_copy() const { return false; } Status HdfsReadableFile::Read(int64_t nbytes, int64_t* bytes_read, uint8_t* buffer) { return impl_->Read(nbytes, bytes_read, buffer); @@ -221,17 +226,11 @@ Status HdfsReadableFile::Read(int64_t nbytes, std::shared_ptr* buffer) { return impl_->Read(nbytes, buffer); } -Status HdfsReadableFile::GetSize(int64_t* size) { - return impl_->GetSize(size); -} +Status HdfsReadableFile::GetSize(int64_t* size) { return impl_->GetSize(size); } -Status HdfsReadableFile::Seek(int64_t position) { - return impl_->Seek(position); -} +Status HdfsReadableFile::Seek(int64_t position) { return impl_->Seek(position); } -Status HdfsReadableFile::Tell(int64_t* position) { - return impl_->Tell(position); -} +Status HdfsReadableFile::Tell(int64_t* position) { return impl_->Tell(position); } // ---------------------------------------------------------------------- // File writing @@ -259,28 +258,22 @@ class HdfsOutputStream::HdfsOutputStreamImpl : public HdfsAnyFileImpl { Status Write(const uint8_t* buffer, int64_t nbytes, int64_t* bytes_written) { std::lock_guard guard(lock_); - tSize ret = 
driver_->Write( - fs_, file_, reinterpret_cast(buffer), static_cast(nbytes)); + tSize ret = driver_->Write(fs_, file_, reinterpret_cast(buffer), + static_cast(nbytes)); CHECK_FAILURE(ret, "Write"); *bytes_written = ret; return Status::OK(); } }; -HdfsOutputStream::HdfsOutputStream() { - impl_.reset(new HdfsOutputStreamImpl()); -} +HdfsOutputStream::HdfsOutputStream() { impl_.reset(new HdfsOutputStreamImpl()); } -HdfsOutputStream::~HdfsOutputStream() { - DCHECK(impl_->Close().ok()); -} +HdfsOutputStream::~HdfsOutputStream() { DCHECK(impl_->Close().ok()); } -Status HdfsOutputStream::Close() { - return impl_->Close(); -} +Status HdfsOutputStream::Close() { return impl_->Close(); } -Status HdfsOutputStream::Write( - const uint8_t* buffer, int64_t nbytes, int64_t* bytes_read) { +Status HdfsOutputStream::Write(const uint8_t* buffer, int64_t nbytes, + int64_t* bytes_read) { return impl_->Write(buffer, nbytes, bytes_read); } @@ -289,13 +282,9 @@ Status HdfsOutputStream::Write(const uint8_t* buffer, int64_t nbytes) { return Write(buffer, nbytes, &bytes_written_dummy); } -Status HdfsOutputStream::Flush() { - return impl_->Flush(); -} +Status HdfsOutputStream::Flush() { return impl_->Flush(); } -Status HdfsOutputStream::Tell(int64_t* position) { - return impl_->Tell(position); -} +Status HdfsOutputStream::Tell(int64_t* position) { return impl_->Tell(position); } // ---------------------------------------------------------------------- // HDFS client @@ -319,9 +308,9 @@ static void SetPathInfo(const hdfsFileInfo* input, HdfsPathInfo* out) { } // Private implementation -class HdfsClient::HdfsClientImpl { +class HadoopFileSystem::HadoopFileSystemImpl { public: - HdfsClientImpl() {} + HadoopFileSystemImpl() {} Status Connect(const HdfsConnectionConfig* config) { if (config->driver == HdfsDriver::LIBHDFS3) { @@ -344,7 +333,9 @@ class HdfsClient::HdfsClientImpl { } fs_ = driver_->BuilderConnect(builder); - if (fs_ == nullptr) { return Status::IOError("HDFS connection failed"); } + if (fs_ == nullptr) { + return Status::IOError("HDFS connection failed"); + } namenode_host_ = config->host; port_ = config->port; user_ = config->user; @@ -395,7 +386,9 @@ class HdfsClient::HdfsClientImpl { Status GetPathInfo(const std::string& path, HdfsPathInfo* info) { hdfsFileInfo* entry = driver_->GetPathInfo(fs_, path.c_str()); - if (entry == nullptr) { return Status::IOError("HDFS: GetPathInfo failed"); } + if (entry == nullptr) { + return Status::IOError("HDFS: GetPathInfo failed"); + } SetPathInfo(entry, info); driver_->FreeFileInfo(entry, 1); @@ -403,6 +396,24 @@ class HdfsClient::HdfsClientImpl { return Status::OK(); } + Status Stat(const std::string& path, FileStatistics* stat) { + HdfsPathInfo info; + RETURN_NOT_OK(GetPathInfo(path, &info)); + + stat->size = info.size; + stat->kind = info.kind; + return Status::OK(); + } + + Status GetChildren(const std::string& path, std::vector* listing) { + std::vector detailed_listing; + RETURN_NOT_OK(ListDirectory(path, &detailed_listing)); + for (const auto& info : detailed_listing) { + listing->push_back(info.name); + } + return Status::OK(); + } + Status ListDirectory(const std::string& path, std::vector* listing) { int num_entries = 0; hdfsFileInfo* entries = driver_->ListDirectory(fs_, path.c_str(), &num_entries); @@ -435,7 +446,7 @@ class HdfsClient::HdfsClientImpl { } Status OpenReadable(const std::string& path, int32_t buffer_size, - std::shared_ptr* file) { + std::shared_ptr* file) { hdfsFile handle = driver_->OpenFile(fs_, path.c_str(), O_RDONLY, buffer_size, 0, 0); 
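Reviewer note: a usage sketch of the renamed client and the new generic calls added in this file (`Stat`, `GetChildren`), based only on the signatures visible in this diff. Host, port, user, and paths are placeholders, and error handling is abbreviated.

```cpp
// Sketch of connecting with HadoopFileSystem and using the new Stat/GetChildren
// methods. Connection parameters below are placeholders, not defaults.
#include <iostream>
#include <memory>
#include <string>
#include <vector>

#include "arrow/io/hdfs.h"
#include "arrow/status.h"

int main() {
  arrow::io::HdfsConnectionConfig conf;
  conf.host = "localhost";
  conf.port = 20500;
  conf.user = "hadoop";
  conf.driver = arrow::io::HdfsDriver::LIBHDFS;

  std::shared_ptr<arrow::io::HadoopFileSystem> fs;
  arrow::Status st = arrow::io::HadoopFileSystem::Connect(&conf, &fs);
  if (!st.ok()) {
    std::cerr << st.ToString() << std::endl;
    return 1;
  }

  arrow::io::FileStatistics stat;
  if (fs->Stat("/tmp", &stat).ok()) {
    std::cout << "kind=" << stat.kind << " size=" << stat.size << std::endl;
  }

  std::vector<std::string> children;
  if (fs->GetChildren("/tmp", &children).ok()) {
    for (const auto& name : children) std::cout << name << std::endl;
  }

  return fs->Disconnect().ok() ? 0 : 1;
}
```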
if (handle == nullptr) { @@ -454,13 +465,14 @@ class HdfsClient::HdfsClientImpl { } Status OpenWriteable(const std::string& path, bool append, int32_t buffer_size, - int16_t replication, int64_t default_block_size, - std::shared_ptr* file) { + int16_t replication, int64_t default_block_size, + std::shared_ptr* file) { int flags = O_WRONLY; if (append) flags |= O_APPEND; - hdfsFile handle = driver_->OpenFile(fs_, path.c_str(), flags, buffer_size, - replication, static_cast(default_block_size)); + hdfsFile handle = + driver_->OpenFile(fs_, path.c_str(), flags, buffer_size, replication, + static_cast(default_block_size)); if (handle == nullptr) { // TODO(wesm): determine cause of failure @@ -482,8 +494,20 @@ class HdfsClient::HdfsClientImpl { return Status::OK(); } + Status Chmod(const std::string& path, int mode) { + int ret = driver_->Chmod(fs_, path.c_str(), static_cast(mode)); // NOLINT + CHECK_FAILURE(ret, "Chmod"); + return Status::OK(); + } + + Status Chown(const std::string& path, const char* owner, const char* group) { + int ret = driver_->Chown(fs_, path.c_str(), owner, group); + CHECK_FAILURE(ret, "Chown"); + return Status::OK(); + } + private: - LibHdfsShim* driver_; + internal::LibHdfsShim* driver_; std::string namenode_host_; std::string user_; @@ -496,77 +520,92 @@ class HdfsClient::HdfsClientImpl { // ---------------------------------------------------------------------- // Public API for HDFSClient -HdfsClient::HdfsClient() { - impl_.reset(new HdfsClientImpl()); -} +HadoopFileSystem::HadoopFileSystem() { impl_.reset(new HadoopFileSystemImpl()); } -HdfsClient::~HdfsClient() {} +HadoopFileSystem::~HadoopFileSystem() {} -Status HdfsClient::Connect( - const HdfsConnectionConfig* config, std::shared_ptr* fs) { +Status HadoopFileSystem::Connect(const HdfsConnectionConfig* config, + std::shared_ptr* fs) { // ctor is private, make_shared will not work - *fs = std::shared_ptr(new HdfsClient()); + *fs = std::shared_ptr(new HadoopFileSystem()); RETURN_NOT_OK((*fs)->impl_->Connect(config)); return Status::OK(); } -Status HdfsClient::MakeDirectory(const std::string& path) { +Status HadoopFileSystem::MakeDirectory(const std::string& path) { return impl_->MakeDirectory(path); } -Status HdfsClient::Delete(const std::string& path, bool recursive) { +Status HadoopFileSystem::Delete(const std::string& path, bool recursive) { return impl_->Delete(path, recursive); } -Status HdfsClient::Disconnect() { - return impl_->Disconnect(); +Status HadoopFileSystem::DeleteDirectory(const std::string& path) { + return Delete(path, true); } -bool HdfsClient::Exists(const std::string& path) { - return impl_->Exists(path); -} +Status HadoopFileSystem::Disconnect() { return impl_->Disconnect(); } + +bool HadoopFileSystem::Exists(const std::string& path) { return impl_->Exists(path); } -Status HdfsClient::GetPathInfo(const std::string& path, HdfsPathInfo* info) { +Status HadoopFileSystem::GetPathInfo(const std::string& path, HdfsPathInfo* info) { return impl_->GetPathInfo(path, info); } -Status HdfsClient::GetCapacity(int64_t* nbytes) { +Status HadoopFileSystem::Stat(const std::string& path, FileStatistics* stat) { + return impl_->Stat(path, stat); +} + +Status HadoopFileSystem::GetCapacity(int64_t* nbytes) { return impl_->GetCapacity(nbytes); } -Status HdfsClient::GetUsed(int64_t* nbytes) { - return impl_->GetUsed(nbytes); +Status HadoopFileSystem::GetUsed(int64_t* nbytes) { return impl_->GetUsed(nbytes); } + +Status HadoopFileSystem::GetChildren(const std::string& path, + std::vector* listing) { + return 
impl_->GetChildren(path, listing); } -Status HdfsClient::ListDirectory( - const std::string& path, std::vector* listing) { +Status HadoopFileSystem::ListDirectory(const std::string& path, + std::vector* listing) { return impl_->ListDirectory(path, listing); } -Status HdfsClient::OpenReadable(const std::string& path, int32_t buffer_size, - std::shared_ptr* file) { +Status HadoopFileSystem::OpenReadable(const std::string& path, int32_t buffer_size, + std::shared_ptr* file) { return impl_->OpenReadable(path, buffer_size, file); } -Status HdfsClient::OpenReadable( - const std::string& path, std::shared_ptr* file) { +Status HadoopFileSystem::OpenReadable(const std::string& path, + std::shared_ptr* file) { return OpenReadable(path, kDefaultHdfsBufferSize, file); } -Status HdfsClient::OpenWriteable(const std::string& path, bool append, - int32_t buffer_size, int16_t replication, int64_t default_block_size, - std::shared_ptr* file) { - return impl_->OpenWriteable( - path, append, buffer_size, replication, default_block_size, file); +Status HadoopFileSystem::OpenWriteable(const std::string& path, bool append, + int32_t buffer_size, int16_t replication, + int64_t default_block_size, + std::shared_ptr* file) { + return impl_->OpenWriteable(path, append, buffer_size, replication, default_block_size, + file); } -Status HdfsClient::OpenWriteable( - const std::string& path, bool append, std::shared_ptr* file) { +Status HadoopFileSystem::OpenWriteable(const std::string& path, bool append, + std::shared_ptr* file) { return OpenWriteable(path, append, 0, 0, 0, file); } -Status HdfsClient::Rename(const std::string& src, const std::string& dst) { +Status HadoopFileSystem::Chmod(const std::string& path, int mode) { + return impl_->Chmod(path, mode); +} + +Status HadoopFileSystem::Chown(const std::string& path, const char* owner, + const char* group) { + return impl_->Chown(path, owner, group); +} + +Status HadoopFileSystem::Rename(const std::string& src, const std::string& dst) { return impl_->Rename(src, dst); } @@ -574,13 +613,13 @@ Status HdfsClient::Rename(const std::string& src, const std::string& dst) { // Allow public API users to check whether we are set up correctly Status HaveLibHdfs() { - LibHdfsShim* driver; - return ConnectLibHdfs(&driver); + internal::LibHdfsShim* driver; + return internal::ConnectLibHdfs(&driver); } Status HaveLibHdfs3() { - LibHdfsShim* driver; - return ConnectLibHdfs3(&driver); + internal::LibHdfsShim* driver; + return internal::ConnectLibHdfs3(&driver); } } // namespace io diff --git a/cpp/src/arrow/io/hdfs.h b/cpp/src/arrow/io/hdfs.h index f3de4a2bf17..1507ca969cf 100644 --- a/cpp/src/arrow/io/hdfs.h +++ b/cpp/src/arrow/io/hdfs.h @@ -34,7 +34,7 @@ class Status; namespace io { -class HdfsClient; +class HadoopFileSystem; class HdfsReadableFile; class HdfsOutputStream; @@ -66,23 +66,23 @@ struct HdfsConnectionConfig { HdfsDriver driver; }; -class ARROW_EXPORT HdfsClient : public FileSystemClient { +class ARROW_EXPORT HadoopFileSystem : public FileSystem { public: - ~HdfsClient(); + ~HadoopFileSystem(); // Connect to an HDFS cluster given a configuration // // @param config (in): configuration for connecting // @param fs (out): the created client // @returns Status - static Status Connect( - const HdfsConnectionConfig* config, std::shared_ptr* fs); + static Status Connect(const HdfsConnectionConfig* config, + std::shared_ptr* fs); // Create directory and all parents // // @param path (in): absolute HDFS path // @returns Status - Status MakeDirectory(const std::string& path); + 
Status MakeDirectory(const std::string& path) override; // Delete file or directory // @param path: absolute path to data @@ -90,6 +90,8 @@ class ARROW_EXPORT HdfsClient : public FileSystemClient { // @returns error status on failure Status Delete(const std::string& path, bool recursive = false); + Status DeleteDirectory(const std::string& path) override; + // Disconnect from cluster // // @returns Status @@ -112,18 +114,29 @@ class ARROW_EXPORT HdfsClient : public FileSystemClient { // @returns Status Status GetUsed(int64_t* nbytes); + Status GetChildren(const std::string& path, std::vector* listing) override; + Status ListDirectory(const std::string& path, std::vector* listing); - // @param path file path to change - // @param owner pass nullptr for no change - // @param group pass nullptr for no change + /// Change + /// + /// @param path file path to change + /// @param owner pass nullptr for no change + /// @param group pass nullptr for no change Status Chown(const std::string& path, const char* owner, const char* group); + /// Change path permissions + /// + /// \param path Absolute path in file system + /// \param mode Mode bitset + /// \return Status Status Chmod(const std::string& path, int mode); // Move file or directory from source path to destination path within the // current filesystem - Status Rename(const std::string& src, const std::string& dst); + Status Rename(const std::string& src, const std::string& dst) override; + + Status Stat(const std::string& path, FileStatistics* stat) override; // TODO(wesm): GetWorkingDirectory, SetWorkingDirectory @@ -132,7 +145,7 @@ class ARROW_EXPORT HdfsClient : public FileSystemClient { // // @param path complete file path Status OpenReadable(const std::string& path, int32_t buffer_size, - std::shared_ptr* file); + std::shared_ptr* file); Status OpenReadable(const std::string& path, std::shared_ptr* file); @@ -142,23 +155,28 @@ class ARROW_EXPORT HdfsClient : public FileSystemClient { // @param replication, 0 for default // @param default_block_size, 0 for default Status OpenWriteable(const std::string& path, bool append, int32_t buffer_size, - int16_t replication, int64_t default_block_size, - std::shared_ptr* file); + int16_t replication, int64_t default_block_size, + std::shared_ptr* file); - Status OpenWriteable( - const std::string& path, bool append, std::shared_ptr* file); + Status OpenWriteable(const std::string& path, bool append, + std::shared_ptr* file); private: friend class HdfsReadableFile; friend class HdfsOutputStream; - class ARROW_NO_EXPORT HdfsClientImpl; - std::unique_ptr impl_; + class ARROW_NO_EXPORT HadoopFileSystemImpl; + std::unique_ptr impl_; - HdfsClient(); - DISALLOW_COPY_AND_ASSIGN(HdfsClient); + HadoopFileSystem(); + DISALLOW_COPY_AND_ASSIGN(HadoopFileSystem); }; +// 0.6.0 +#ifndef ARROW_NO_DEPRECATED_API +using HdfsClient = HadoopFileSystem; +#endif + class ARROW_EXPORT HdfsReadableFile : public RandomAccessFile { public: ~HdfsReadableFile(); @@ -173,8 +191,8 @@ class ARROW_EXPORT HdfsReadableFile : public RandomAccessFile { Status Read(int64_t nbytes, std::shared_ptr* out) override; - Status ReadAt( - int64_t position, int64_t nbytes, int64_t* bytes_read, uint8_t* buffer) override; + Status ReadAt(int64_t position, int64_t nbytes, int64_t* bytes_read, + uint8_t* buffer) override; Status ReadAt(int64_t position, int64_t nbytes, std::shared_ptr* out) override; @@ -191,7 +209,7 @@ class ARROW_EXPORT HdfsReadableFile : public RandomAccessFile { class ARROW_NO_EXPORT HdfsReadableFileImpl; std::unique_ptr 
impl_; - friend class HdfsClient::HdfsClientImpl; + friend class HadoopFileSystem::HadoopFileSystemImpl; DISALLOW_COPY_AND_ASSIGN(HdfsReadableFile); }; @@ -216,7 +234,7 @@ class ARROW_EXPORT HdfsOutputStream : public OutputStream { class ARROW_NO_EXPORT HdfsOutputStreamImpl; std::unique_ptr impl_; - friend class HdfsClient::HdfsClientImpl; + friend class HadoopFileSystem::HadoopFileSystemImpl; HdfsOutputStream(); diff --git a/cpp/src/arrow/io/interfaces.cc b/cpp/src/arrow/io/interfaces.cc index 06957d4de56..57dc42d8a9b 100644 --- a/cpp/src/arrow/io/interfaces.cc +++ b/cpp/src/arrow/io/interfaces.cc @@ -29,32 +29,28 @@ namespace io { FileInterface::~FileInterface() {} -RandomAccessFile::RandomAccessFile() { - set_mode(FileMode::READ); -} +RandomAccessFile::RandomAccessFile() { set_mode(FileMode::READ); } -Status RandomAccessFile::ReadAt( - int64_t position, int64_t nbytes, int64_t* bytes_read, uint8_t* out) { +Status RandomAccessFile::ReadAt(int64_t position, int64_t nbytes, int64_t* bytes_read, + uint8_t* out) { std::lock_guard guard(lock_); RETURN_NOT_OK(Seek(position)); return Read(nbytes, bytes_read, out); } -Status RandomAccessFile::ReadAt( - int64_t position, int64_t nbytes, std::shared_ptr* out) { +Status RandomAccessFile::ReadAt(int64_t position, int64_t nbytes, + std::shared_ptr* out) { std::lock_guard guard(lock_); RETURN_NOT_OK(Seek(position)); return Read(nbytes, out); } Status Writeable::Write(const std::string& data) { - return Write( - reinterpret_cast(data.c_str()), static_cast(data.size())); + return Write(reinterpret_cast(data.c_str()), + static_cast(data.size())); } -Status Writeable::Flush() { - return Status::OK(); -} +Status Writeable::Flush() { return Status::OK(); } } // namespace io } // namespace arrow diff --git a/cpp/src/arrow/io/interfaces.h b/cpp/src/arrow/io/interfaces.h index b5a0bd85bf2..4bb7ebe2fd9 100644 --- a/cpp/src/arrow/io/interfaces.h +++ b/cpp/src/arrow/io/interfaces.h @@ -22,6 +22,7 @@ #include #include #include +#include #include "arrow/util/macros.h" #include "arrow/util/visibility.h" @@ -42,9 +43,29 @@ struct ObjectType { enum type { FILE, DIRECTORY }; }; -class ARROW_EXPORT FileSystemClient { +struct ARROW_EXPORT FileStatistics { + /// Size of file, -1 if finding length is unsupported + int64_t size; + ObjectType::type kind; + + FileStatistics() {} + FileStatistics(int64_t size, ObjectType::type kind) : size(size), kind(kind) {} +}; + +class ARROW_EXPORT FileSystem { public: - virtual ~FileSystemClient() {} + virtual ~FileSystem() {} + + virtual Status MakeDirectory(const std::string& path) = 0; + + virtual Status DeleteDirectory(const std::string& path) = 0; + + virtual Status GetChildren(const std::string& path, + std::vector* listing) = 0; + + virtual Status Rename(const std::string& src, const std::string& dst) = 0; + + virtual Status Stat(const std::string& path, FileStatistics* stat) = 0; }; class ARROW_EXPORT FileInterface { @@ -107,8 +128,8 @@ class ARROW_EXPORT RandomAccessFile : public InputStream, public Seekable { /// be overridden /// /// Default implementation is thread-safe - virtual Status ReadAt( - int64_t position, int64_t nbytes, int64_t* bytes_read, uint8_t* out); + virtual Status ReadAt(int64_t position, int64_t nbytes, int64_t* bytes_read, + uint8_t* out); /// Default implementation is thread-safe virtual Status ReadAt(int64_t position, int64_t nbytes, std::shared_ptr* out); diff --git a/cpp/src/arrow/io/io-file-test.cc b/cpp/src/arrow/io/io-file-test.cc index a077f8cb921..36c35700d64 100644 --- 
a/cpp/src/arrow/io/io-file-test.cc +++ b/cpp/src/arrow/io/io-file-test.cc @@ -43,9 +43,10 @@ static bool FileExists(const std::string& path) { #if defined(_MSC_VER) void InvalidParamHandler(const wchar_t* expr, const wchar_t* func, - const wchar_t* source_file, unsigned int source_line, uintptr_t reserved) { + const wchar_t* source_file, unsigned int source_line, + uintptr_t reserved) { wprintf(L"Invalid parameter in funcion %s. Source: %s line %d expression %s", func, - source_file, source_line, expr); + source_file, source_line, expr); } #endif @@ -61,7 +62,9 @@ static bool FileIsClosed(int fd) { int ret = static_cast(_close(fd)); return (ret == -1); #else - if (-1 != fcntl(fd, F_GETFD)) { return false; } + if (-1 != fcntl(fd, F_GETFD)) { + return false; + } return errno == EBADF; #endif } @@ -76,7 +79,9 @@ class FileTestFixture : public ::testing::Test { void TearDown() { EnsureFileDeleted(); } void EnsureFileDeleted() { - if (FileExists(path_)) { std::remove(path_.c_str()); } + if (FileExists(path_)) { + std::remove(path_.c_str()); + } } protected: @@ -382,7 +387,9 @@ TEST_F(TestReadableFile, ThreadSafety) { for (int i = 0; i < niter; ++i) { ASSERT_OK(file_->ReadAt(0, 3, &buffer)); - if (0 == memcmp(data.c_str(), buffer->data(), 3)) { correct_count += 1; } + if (0 == memcmp(data.c_str(), buffer->data(), 3)) { + correct_count += 1; + } } }; @@ -547,8 +554,8 @@ TEST_F(TestMemoryMappedFile, InvalidFile) { std::string non_existent_path = "invalid-file-name-asfd"; std::shared_ptr result; - ASSERT_RAISES( - IOError, MemoryMappedFile::Open(non_existent_path, FileMode::READ, &result)); + ASSERT_RAISES(IOError, + MemoryMappedFile::Open(non_existent_path, FileMode::READ, &result)); } TEST_F(TestMemoryMappedFile, CastableToFileInterface) { @@ -563,8 +570,8 @@ TEST_F(TestMemoryMappedFile, ThreadSafety) { std::shared_ptr file; ASSERT_OK(MemoryMappedFile::Open(path, FileMode::READWRITE, &file)); - ASSERT_OK(file->Write( - reinterpret_cast(data.c_str()), static_cast(data.size()))); + ASSERT_OK(file->Write(reinterpret_cast(data.c_str()), + static_cast(data.size()))); std::atomic correct_count(0); const int niter = 10000; @@ -574,7 +581,9 @@ TEST_F(TestMemoryMappedFile, ThreadSafety) { for (int i = 0; i < niter; ++i) { ASSERT_OK(file->ReadAt(0, 3, &buffer)); - if (0 == memcmp(data.c_str(), buffer->data(), 3)) { correct_count += 1; } + if (0 == memcmp(data.c_str(), buffer->data(), 3)) { + correct_count += 1; + } } }; diff --git a/cpp/src/arrow/io/io-hdfs-test.cc b/cpp/src/arrow/io/io-hdfs-test.cc index 74f80428c45..b6a40e094c9 100644 --- a/cpp/src/arrow/io/io-hdfs-test.cc +++ b/cpp/src/arrow/io/io-hdfs-test.cc @@ -48,7 +48,7 @@ struct PivotalDriver { }; template -class TestHdfsClient : public ::testing::Test { +class TestHadoopFileSystem : public ::testing::Test { public: Status MakeScratchDir() { if (client_->Exists(scratch_dir_)) { @@ -58,11 +58,11 @@ class TestHdfsClient : public ::testing::Test { } Status WriteDummyFile(const std::string& path, const uint8_t* buffer, int64_t size, - bool append = false, int buffer_size = 0, int16_t replication = 0, - int default_block_size = 0) { + bool append = false, int buffer_size = 0, int16_t replication = 0, + int default_block_size = 0) { std::shared_ptr file; - RETURN_NOT_OK(client_->OpenWriteable( - path, append, buffer_size, replication, default_block_size, &file)); + RETURN_NOT_OK(client_->OpenWriteable(path, append, buffer_size, replication, + default_block_size, &file)); RETURN_NOT_OK(file->Write(buffer, size)); RETURN_NOT_OK(file->Close()); @@ -84,12 
+84,13 @@ class TestHdfsClient : public ::testing::Test { // Set up shared state between unit tests void SetUp() { - LibHdfsShim* driver_shim; + internal::LibHdfsShim* driver_shim; client_ = nullptr; - scratch_dir_ = boost::filesystem::unique_path( - boost::filesystem::temp_directory_path() / "arrow-hdfs/scratch-%%%%") - .string(); + scratch_dir_ = + boost::filesystem::unique_path(boost::filesystem::temp_directory_path() / + "arrow-hdfs/scratch-%%%%") + .string(); loaded_driver_ = false; @@ -123,7 +124,7 @@ class TestHdfsClient : public ::testing::Test { conf_.port = port == nullptr ? 20500 : atoi(port); conf_.driver = DRIVER::type; - ASSERT_OK(HdfsClient::Connect(&conf_, &client_)); + ASSERT_OK(HadoopFileSystem::Connect(&conf_, &client_)); } void TearDown() { @@ -140,11 +141,11 @@ class TestHdfsClient : public ::testing::Test { // Resources shared amongst unit tests std::string scratch_dir_; - std::shared_ptr client_; + std::shared_ptr client_; }; template <> -std::string TestHdfsClient::HdfsAbsPath(const std::string& relpath) { +std::string TestHadoopFileSystem::HdfsAbsPath(const std::string& relpath) { std::stringstream ss; ss << relpath; return ss.str(); @@ -160,22 +161,24 @@ HdfsDriver JNIDriver::type = HdfsDriver::LIBHDFS; HdfsDriver PivotalDriver::type = HdfsDriver::LIBHDFS3; typedef ::testing::Types DriverTypes; -TYPED_TEST_CASE(TestHdfsClient, DriverTypes); +TYPED_TEST_CASE(TestHadoopFileSystem, DriverTypes); -TYPED_TEST(TestHdfsClient, ConnectsAgain) { +TYPED_TEST(TestHadoopFileSystem, ConnectsAgain) { SKIP_IF_NO_DRIVER(); - std::shared_ptr client; - ASSERT_OK(HdfsClient::Connect(&this->conf_, &client)); + std::shared_ptr client; + ASSERT_OK(HadoopFileSystem::Connect(&this->conf_, &client)); ASSERT_OK(client->Disconnect()); } -TYPED_TEST(TestHdfsClient, MakeDirectory) { +TYPED_TEST(TestHadoopFileSystem, MakeDirectory) { SKIP_IF_NO_DRIVER(); std::string path = this->ScratchPath("create-directory"); - if (this->client_->Exists(path)) { ASSERT_OK(this->client_->Delete(path, true)); } + if (this->client_->Exists(path)) { + ASSERT_OK(this->client_->Delete(path, true)); + } ASSERT_OK(this->client_->MakeDirectory(path)); ASSERT_TRUE(this->client_->Exists(path)); @@ -187,7 +190,7 @@ TYPED_TEST(TestHdfsClient, MakeDirectory) { ASSERT_RAISES(IOError, this->client_->ListDirectory(path, &listing)); } -TYPED_TEST(TestHdfsClient, GetCapacityUsed) { +TYPED_TEST(TestHadoopFileSystem, GetCapacityUsed) { SKIP_IF_NO_DRIVER(); // Who knows what is actually in your DFS cluster, but expect it to have @@ -200,7 +203,7 @@ TYPED_TEST(TestHdfsClient, GetCapacityUsed) { ASSERT_LT(0, nbytes); } -TYPED_TEST(TestHdfsClient, GetPathInfo) { +TYPED_TEST(TestHadoopFileSystem, GetPathInfo) { SKIP_IF_NO_DRIVER(); HdfsPathInfo info; @@ -230,7 +233,7 @@ TYPED_TEST(TestHdfsClient, GetPathInfo) { ASSERT_EQ(size, info.size); } -TYPED_TEST(TestHdfsClient, AppendToFile) { +TYPED_TEST(TestHadoopFileSystem, AppendToFile) { SKIP_IF_NO_DRIVER(); ASSERT_OK(this->MakeScratchDir()); @@ -249,7 +252,7 @@ TYPED_TEST(TestHdfsClient, AppendToFile) { ASSERT_EQ(size * 2, info.size); } -TYPED_TEST(TestHdfsClient, ListDirectory) { +TYPED_TEST(TestHadoopFileSystem, ListDirectory) { SKIP_IF_NO_DRIVER(); const int size = 100; @@ -289,7 +292,7 @@ TYPED_TEST(TestHdfsClient, ListDirectory) { } } -TYPED_TEST(TestHdfsClient, ReadableMethods) { +TYPED_TEST(TestHadoopFileSystem, ReadableMethods) { SKIP_IF_NO_DRIVER(); ASSERT_OK(this->MakeScratchDir()); @@ -336,7 +339,7 @@ TYPED_TEST(TestHdfsClient, ReadableMethods) { ASSERT_EQ(60, position); } 
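Reviewer note: the renames in these tests follow from `HadoopFileSystem` now deriving from the abstract `FileSystem` class added to interfaces.h earlier in this diff. A minimal sketch of implementing that interface directly is below; it assumes only the pure-virtual signatures introduced here, and the in-memory map behavior is purely illustrative.

```cpp
// Sketch: a trivial in-memory implementation of the FileSystem interface that
// this diff introduces in interfaces.h (HadoopFileSystem is the real subclass).
#include <map>
#include <string>
#include <vector>

#include "arrow/io/interfaces.h"
#include "arrow/status.h"

class InMemoryFileSystem : public arrow::io::FileSystem {
 public:
  arrow::Status MakeDirectory(const std::string& path) override {
    entries_[path] = arrow::io::FileStatistics(0, arrow::io::ObjectType::DIRECTORY);
    return arrow::Status::OK();
  }

  arrow::Status DeleteDirectory(const std::string& path) override {
    entries_.erase(path);
    return arrow::Status::OK();
  }

  arrow::Status GetChildren(const std::string& path,
                            std::vector<std::string>* listing) override {
    for (const auto& kv : entries_) {
      // Simplistic prefix match standing in for real path handling.
      if (kv.first.compare(0, path.size(), path) == 0) listing->push_back(kv.first);
    }
    return arrow::Status::OK();
  }

  arrow::Status Rename(const std::string& src, const std::string& dst) override {
    auto it = entries_.find(src);
    if (it == entries_.end()) return arrow::Status::IOError("no such path: " + src);
    entries_[dst] = it->second;
    entries_.erase(it);
    return arrow::Status::OK();
  }

  arrow::Status Stat(const std::string& path,
                     arrow::io::FileStatistics* stat) override {
    auto it = entries_.find(path);
    if (it == entries_.end()) return arrow::Status::IOError("no such path: " + path);
    *stat = it->second;
    return arrow::Status::OK();
  }

 private:
  std::map<std::string, arrow::io::FileStatistics> entries_;
};
```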
-TYPED_TEST(TestHdfsClient, LargeFile) { +TYPED_TEST(TestHadoopFileSystem, LargeFile) { SKIP_IF_NO_DRIVER(); ASSERT_OK(this->MakeScratchDir()); @@ -371,7 +374,7 @@ TYPED_TEST(TestHdfsClient, LargeFile) { ASSERT_EQ(size, bytes_read); } -TYPED_TEST(TestHdfsClient, RenameFile) { +TYPED_TEST(TestHadoopFileSystem, RenameFile) { SKIP_IF_NO_DRIVER(); ASSERT_OK(this->MakeScratchDir()); @@ -388,7 +391,32 @@ TYPED_TEST(TestHdfsClient, RenameFile) { ASSERT_TRUE(this->client_->Exists(dst_path)); } -TYPED_TEST(TestHdfsClient, ThreadSafety) { +TYPED_TEST(TestHadoopFileSystem, ChmodChown) { + SKIP_IF_NO_DRIVER(); + ASSERT_OK(this->MakeScratchDir()); + + auto path = this->ScratchPath("path-to-chmod"); + + int16_t mode = 0755; + const int size = 100; + + std::vector data = RandomData(size); + ASSERT_OK(this->WriteDummyFile(path, data.data(), size)); + + HdfsPathInfo info; + ASSERT_OK(this->client_->Chmod(path, mode)); + ASSERT_OK(this->client_->GetPathInfo(path, &info)); + ASSERT_EQ(mode, info.permissions); + + std::string owner = "hadoop"; + std::string group = "hadoop"; + ASSERT_OK(this->client_->Chown(path, owner.c_str(), group.c_str())); + ASSERT_OK(this->client_->GetPathInfo(path, &info)); + ASSERT_EQ("hadoop", info.owner); + ASSERT_EQ("hadoop", info.group); +} + +TYPED_TEST(TestHadoopFileSystem, ThreadSafety) { SKIP_IF_NO_DRIVER(); ASSERT_OK(this->MakeScratchDir()); @@ -396,7 +424,7 @@ TYPED_TEST(TestHdfsClient, ThreadSafety) { std::string data = "foobar"; ASSERT_OK(this->WriteDummyFile(src_path, reinterpret_cast(data.c_str()), - static_cast(data.size()))); + static_cast(data.size()))); std::shared_ptr file; ASSERT_OK(this->client_->OpenReadable(src_path, &file)); @@ -409,10 +437,14 @@ TYPED_TEST(TestHdfsClient, ThreadSafety) { std::shared_ptr buffer; if (i % 2 == 0) { ASSERT_OK(file->ReadAt(3, 3, &buffer)); - if (0 == memcmp(data.c_str() + 3, buffer->data(), 3)) { correct_count += 1; } + if (0 == memcmp(data.c_str() + 3, buffer->data(), 3)) { + correct_count += 1; + } } else { ASSERT_OK(file->ReadAt(0, 4, &buffer)); - if (0 == memcmp(data.c_str() + 0, buffer->data(), 4)) { correct_count += 1; } + if (0 == memcmp(data.c_str() + 0, buffer->data(), 4)) { + correct_count += 1; + } } } }; diff --git a/cpp/src/arrow/io/memory.cc b/cpp/src/arrow/io/memory.cc index 4d8bf63757d..50f3ddfaf65 100644 --- a/cpp/src/arrow/io/memory.cc +++ b/cpp/src/arrow/io/memory.cc @@ -46,7 +46,7 @@ BufferOutputStream::BufferOutputStream(const std::shared_ptr& b mutable_data_(buffer->mutable_data()) {} Status BufferOutputStream::Create(int64_t initial_capacity, MemoryPool* pool, - std::shared_ptr* out) { + std::shared_ptr* out) { std::shared_ptr buffer; RETURN_NOT_OK(AllocateResizableBuffer(pool, initial_capacity, &buffer)); *out = std::make_shared(buffer); @@ -55,7 +55,9 @@ Status BufferOutputStream::Create(int64_t initial_capacity, MemoryPool* pool, BufferOutputStream::~BufferOutputStream() { // This can fail, better to explicitly call close - if (buffer_) { DCHECK(Close().ok()); } + if (buffer_) { + DCHECK(Close().ok()); + } } Status BufferOutputStream::Close() { @@ -102,9 +104,7 @@ Status BufferOutputStream::Reserve(int64_t nbytes) { // ---------------------------------------------------------------------- // OutputStream that doesn't write anything -Status MockOutputStream::Close() { - return Status::OK(); -} +Status MockOutputStream::Close() { return Status::OK(); } Status MockOutputStream::Tell(int64_t* position) { *position = extent_bytes_written_; @@ -157,8 +157,8 @@ Status FixedSizeBufferWriter::Tell(int64_t* 
position) { Status FixedSizeBufferWriter::Write(const uint8_t* data, int64_t nbytes) { if (nbytes > memcopy_threshold_ && memcopy_num_threads_ > 1) { - parallel_memcopy(mutable_data_ + position_, data, nbytes, memcopy_blocksize_, - memcopy_num_threads_); + internal::parallel_memcopy(mutable_data_ + position_, data, nbytes, + memcopy_blocksize_, memcopy_num_threads_); } else { memcpy(mutable_data_ + position_, data, nbytes); } @@ -166,8 +166,8 @@ Status FixedSizeBufferWriter::Write(const uint8_t* data, int64_t nbytes) { return Status::OK(); } -Status FixedSizeBufferWriter::WriteAt( - int64_t position, const uint8_t* data, int64_t nbytes) { +Status FixedSizeBufferWriter::WriteAt(int64_t position, const uint8_t* data, + int64_t nbytes) { std::lock_guard guard(lock_); RETURN_NOT_OK(Seek(position)); return Write(data, nbytes); @@ -206,9 +206,7 @@ Status BufferReader::Tell(int64_t* position) { return Status::OK(); } -bool BufferReader::supports_zero_copy() const { - return true; -} +bool BufferReader::supports_zero_copy() const { return true; } Status BufferReader::Read(int64_t nbytes, int64_t* bytes_read, uint8_t* buffer) { memcpy(buffer, data_ + position_, nbytes); diff --git a/cpp/src/arrow/io/memory.h b/cpp/src/arrow/io/memory.h index 06384f0d4c4..1f817743647 100644 --- a/cpp/src/arrow/io/memory.h +++ b/cpp/src/arrow/io/memory.h @@ -45,7 +45,7 @@ class ARROW_EXPORT BufferOutputStream : public OutputStream { explicit BufferOutputStream(const std::shared_ptr& buffer); static Status Create(int64_t initial_capacity, MemoryPool* pool, - std::shared_ptr* out); + std::shared_ptr* out); ~BufferOutputStream(); diff --git a/cpp/src/arrow/io/test-common.h b/cpp/src/arrow/io/test-common.h index 438f378085f..a4974b77528 100644 --- a/cpp/src/arrow/io/test-common.h +++ b/cpp/src/arrow/io/test-common.h @@ -73,8 +73,8 @@ class MemoryMapFixture { tmp_files_.push_back(path); } - Status InitMemoryMap( - int64_t size, const std::string& path, std::shared_ptr* mmap) { + Status InitMemoryMap(int64_t size, const std::string& path, + std::shared_ptr* mmap) { RETURN_NOT_OK(MemoryMappedFile::Create(path, size, mmap)); tmp_files_.push_back(path); return Status::OK(); diff --git a/cpp/src/arrow/ipc/feather-internal.h b/cpp/src/arrow/ipc/feather-internal.h index 646c3b2f9f2..1b5924e3030 100644 --- a/cpp/src/arrow/ipc/feather-internal.h +++ b/cpp/src/arrow/ipc/feather-internal.h @@ -15,8 +15,8 @@ // specific language governing permissions and limitations // under the License. 
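Reviewer note: `FixedSizeBufferWriter::Write` above only dispatches to `internal::parallel_memcopy` once the write exceeds a size threshold and more than one thread is configured. The standalone sketch below illustrates that idea; it is not Arrow's helper (the real one also takes a block-size parameter), and the threshold values are arbitrary.

```cpp
// Standalone sketch of threshold-gated parallel memcpy: small writes use plain
// memcpy, large writes are split across threads.
#include <cstdint>
#include <cstring>
#include <thread>
#include <vector>

void ParallelCopy(uint8_t* dst, const uint8_t* src, int64_t nbytes, int num_threads,
                  int64_t threshold) {
  if (nbytes <= threshold || num_threads <= 1) {
    std::memcpy(dst, src, static_cast<size_t>(nbytes));
    return;
  }
  std::vector<std::thread> workers;
  const int64_t chunk = nbytes / num_threads;
  for (int i = 0; i < num_threads; ++i) {
    const int64_t offset = i * chunk;
    const int64_t len = (i == num_threads - 1) ? nbytes - offset : chunk;
    workers.emplace_back([=] {
      std::memcpy(dst + offset, src + offset, static_cast<size_t>(len));
    });
  }
  for (auto& t : workers) t.join();
}
```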
-/// Public API for the "Feather" file format, originally created at -/// http://github.com/wesm/feather +// Public API for the "Feather" file format, originally created at +// http://github.com/wesm/feather #ifndef ARROW_IPC_FEATHER_INTERNAL_H #define ARROW_IPC_FEATHER_INTERNAL_H @@ -49,7 +49,7 @@ struct ARROW_EXPORT ArrayMetadata { ArrayMetadata() {} ArrayMetadata(fbs::Type type, int64_t offset, int64_t length, int64_t null_count, - int64_t total_bytes) + int64_t total_bytes) : type(type), offset(offset), length(length), @@ -135,7 +135,9 @@ class ARROW_EXPORT TableMetadata { bool HasDescription() const { return table_->description() != 0; } std::string GetDescription() const { - if (!HasDescription()) { return std::string(""); } + if (!HasDescription()) { + return std::string(""); + } return table_->description()->str(); } @@ -153,7 +155,7 @@ class ARROW_EXPORT TableMetadata { static inline flatbuffers::Offset GetPrimitiveArray( FBB& fbb, const ArrayMetadata& array) { return fbs::CreatePrimitiveArray(fbb, array.type, fbs::Encoding_PLAIN, array.offset, - array.length, array.null_count, array.total_bytes); + array.length, array.null_count, array.total_bytes); } static inline fbs::TimeUnit ToFlatbufferEnum(TimeUnit::type unit) { diff --git a/cpp/src/arrow/ipc/feather-test.cc b/cpp/src/arrow/ipc/feather-test.cc index 029aae31ff5..b76b518788b 100644 --- a/cpp/src/arrow/ipc/feather-test.cc +++ b/cpp/src/arrow/ipc/feather-test.cc @@ -365,8 +365,8 @@ TEST_F(TestTableWriter, TimeTypes) { ArrayFromVector(is_valid, date_values_vec, &date_array); const auto& prim_values = static_cast(*values); - std::vector> buffers = { - prim_values.null_bitmap(), prim_values.values()}; + std::vector> buffers = {prim_values.null_bitmap(), + prim_values.values()}; std::vector> arrays; arrays.push_back(date_array->data()); @@ -400,7 +400,8 @@ TEST_F(TestTableWriter, PrimitiveNullRoundTrip) { ASSERT_OK(reader_->GetColumn(i, &col)); ASSERT_EQ(batch->column_name(i), col->name()); StringArray str_values(batch->column(i)->length(), nullptr, nullptr, - batch->column(i)->null_bitmap(), batch->column(i)->null_count()); + batch->column(i)->null_bitmap(), + batch->column(i)->null_count()); CheckArrays(str_values, *col->data()->chunk(0)); } } diff --git a/cpp/src/arrow/ipc/feather.cc b/cpp/src/arrow/ipc/feather.cc index 61b96e0c1dc..54771d3356b 100644 --- a/cpp/src/arrow/ipc/feather.cc +++ b/cpp/src/arrow/ipc/feather.cc @@ -61,26 +61,30 @@ static int64_t GetOutputLength(int64_t nbytes) { } static Status WritePadded(io::OutputStream* stream, const uint8_t* data, int64_t length, - int64_t* bytes_written) { + int64_t* bytes_written) { RETURN_NOT_OK(stream->Write(data, length)); int64_t remainder = PaddedLength(length) - length; - if (remainder != 0) { RETURN_NOT_OK(stream->Write(kPaddingBytes, remainder)); } + if (remainder != 0) { + RETURN_NOT_OK(stream->Write(kPaddingBytes, remainder)); + } *bytes_written = length + remainder; return Status::OK(); } /// For compability, we need to write any data sometimes just to keep producing /// files that can be read with an older reader. 
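Reviewer note: the `WritePadded`/`WritePaddedBlank` changes below keep the same arithmetic: round the byte count up to a fixed alignment and emit zero bytes for the remainder so the next buffer starts on an aligned offset. A tiny sketch of that arithmetic, assuming an 8-byte alignment for illustration (the actual constant is defined elsewhere in feather.cc):

```cpp
// Sketch of the padding arithmetic behind WritePadded/WritePaddedBlank.
#include <cstdint>
#include <cstdio>

constexpr int64_t kAlignment = 8;  // assumed value, for illustration only

int64_t PaddedLength(int64_t nbytes) {
  return ((nbytes + kAlignment - 1) / kAlignment) * kAlignment;
}

int main() {
  const int64_t length = 13;
  const int64_t remainder = PaddedLength(length) - length;
  // A writer would emit `length` data bytes followed by `remainder` zero bytes.
  std::printf("padded=%lld remainder=%lld\n",
              static_cast<long long>(PaddedLength(length)),
              static_cast<long long>(remainder));
  return 0;
}
```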
-static Status WritePaddedBlank( - io::OutputStream* stream, int64_t length, int64_t* bytes_written) { +static Status WritePaddedBlank(io::OutputStream* stream, int64_t length, + int64_t* bytes_written) { const uint8_t null = 0; for (int64_t i = 0; i < length; i++) { RETURN_NOT_OK(stream->Write(&null, 1)); } int64_t remainder = PaddedLength(length) - length; - if (remainder != 0) { RETURN_NOT_OK(stream->Write(kPaddingBytes, remainder)); } + if (remainder != 0) { + RETURN_NOT_OK(stream->Write(kPaddingBytes, remainder)); + } *bytes_written = length + remainder; return Status::OK(); } @@ -90,20 +94,22 @@ static Status WritePaddedBlank( TableBuilder::TableBuilder(int64_t num_rows) : finished_(false), num_rows_(num_rows) {} -FBB& TableBuilder::fbb() { - return fbb_; -} +FBB& TableBuilder::fbb() { return fbb_; } Status TableBuilder::Finish() { - if (finished_) { return Status::Invalid("can only call this once"); } + if (finished_) { + return Status::Invalid("can only call this once"); + } FBString desc = 0; - if (!description_.empty()) { desc = fbb_.CreateString(description_); } + if (!description_.empty()) { + desc = fbb_.CreateString(description_); + } flatbuffers::Offset metadata = 0; - auto root = fbs::CreateCTable( - fbb_, desc, num_rows_, fbb_.CreateVector(columns_), kFeatherVersion, metadata); + auto root = fbs::CreateCTable(fbb_, desc, num_rows_, fbb_.CreateVector(columns_), + kFeatherVersion, metadata); fbb_.Finish(root); finished_ = true; @@ -111,17 +117,15 @@ Status TableBuilder::Finish() { } std::shared_ptr TableBuilder::GetBuffer() const { - return std::make_shared( - fbb_.GetBufferPointer(), static_cast(fbb_.GetSize())); + return std::make_shared(fbb_.GetBufferPointer(), + static_cast(fbb_.GetSize())); } void TableBuilder::SetDescription(const std::string& description) { description_ = description; } -void TableBuilder::SetNumRows(int64_t num_rows) { - num_rows_ = num_rows; -} +void TableBuilder::SetNumRows(int64_t num_rows) { num_rows_ = num_rows; } void TableBuilder::add_column(const flatbuffers::Offset& col) { columns_.push_back(col); @@ -177,21 +181,17 @@ Status ColumnBuilder::Finish() { flatbuffers::Offset metadata = CreateColumnMetadata(); auto column = fbs::CreateColumn(buf, buf.CreateString(name_), values, - ToFlatbufferEnum(type_), // metadata_type - metadata, buf.CreateString(user_metadata_)); + ToFlatbufferEnum(type_), // metadata_type + metadata, buf.CreateString(user_metadata_)); // bad coupling, but OK for now parent_->add_column(column); return Status::OK(); } -void ColumnBuilder::SetValues(const ArrayMetadata& values) { - values_ = values; -} +void ColumnBuilder::SetValues(const ArrayMetadata& values) { values_ = values; } -void ColumnBuilder::SetUserMetadata(const std::string& data) { - user_metadata_ = data; -} +void ColumnBuilder::SetUserMetadata(const std::string& data) { user_metadata_ = data; } void ColumnBuilder::SetCategory(const ArrayMetadata& levels, bool ordered) { type_ = ColumnType::CATEGORY; @@ -209,18 +209,14 @@ void ColumnBuilder::SetTimestamp(TimeUnit::type unit, const std::string& timezon meta_timestamp_.timezone = timezone; } -void ColumnBuilder::SetDate() { - type_ = ColumnType::DATE; -} +void ColumnBuilder::SetDate() { type_ = ColumnType::DATE; } void ColumnBuilder::SetTime(TimeUnit::type unit) { type_ = ColumnType::TIME; meta_time_.unit = unit; } -FBB& ColumnBuilder::fbb() { - return *fbb_; -} +FBB& ColumnBuilder::fbb() { return *fbb_; } std::unique_ptr TableBuilder::AddColumn(const std::string& name) { return std::unique_ptr(new 
ColumnBuilder(this, name)); @@ -272,7 +268,7 @@ class TableReader::TableReaderImpl { } Status GetDataType(const fbs::PrimitiveArray* values, fbs::TypeMetadata metadata_type, - const void* metadata, std::shared_ptr* out) { + const void* metadata, std::shared_ptr* out) { #define PRIMITIVE_CASE(CAP_TYPE, FACTORY_FUNC) \ case fbs::Type_##CAP_TYPE: \ *out = FACTORY_FUNC(); \ @@ -342,7 +338,7 @@ class TableReader::TableReaderImpl { // @returns: a Buffer instance, the precise type will depend on the kind of // input data source (which may or may not have memory-map like semantics) Status LoadValues(const fbs::PrimitiveArray* meta, fbs::TypeMetadata metadata_type, - const void* metadata, std::shared_ptr* out) { + const void* metadata, std::shared_ptr* out) { std::shared_ptr type; RETURN_NOT_OK(GetDataType(meta, metadata_type, metadata, &type)); @@ -394,8 +390,8 @@ class TableReader::TableReaderImpl { // if (user_meta->size() > 0) { user_metadata_ = user_meta->str(); } std::shared_ptr values; - RETURN_NOT_OK(LoadValues( - col_meta->values(), col_meta->metadata_type(), col_meta->metadata(), &values)); + RETURN_NOT_OK(LoadValues(col_meta->values(), col_meta->metadata_type(), + col_meta->metadata(), &values)); out->reset(new Column(col_meta->name()->str(), values)); return Status::OK(); } @@ -410,41 +406,27 @@ class TableReader::TableReaderImpl { // ---------------------------------------------------------------------- // TableReader public API -TableReader::TableReader() { - impl_.reset(new TableReaderImpl()); -} +TableReader::TableReader() { impl_.reset(new TableReaderImpl()); } TableReader::~TableReader() {} Status TableReader::Open(const std::shared_ptr& source, - std::unique_ptr* out) { + std::unique_ptr* out) { out->reset(new TableReader()); return (*out)->impl_->Open(source); } -bool TableReader::HasDescription() const { - return impl_->HasDescription(); -} +bool TableReader::HasDescription() const { return impl_->HasDescription(); } -std::string TableReader::GetDescription() const { - return impl_->GetDescription(); -} +std::string TableReader::GetDescription() const { return impl_->GetDescription(); } -int TableReader::version() const { - return impl_->version(); -} +int TableReader::version() const { return impl_->version(); } -int64_t TableReader::num_rows() const { - return impl_->num_rows(); -} +int64_t TableReader::num_rows() const { return impl_->num_rows(); } -int64_t TableReader::num_columns() const { - return impl_->num_columns(); -} +int64_t TableReader::num_columns() const { return impl_->num_columns(); } -std::string TableReader::GetColumnName(int i) const { - return impl_->GetColumnName(i); -} +std::string TableReader::GetColumnName(int i) const { return impl_->GetColumnName(i); } Status TableReader::GetColumn(int i, std::shared_ptr* out) { return impl_->GetColumn(i, out); @@ -501,8 +483,8 @@ static Status SanitizeUnsupportedTypes(const Array& values, std::shared_ptr( - values.length(), nullptr, nullptr, values.null_bitmap(), values.null_count()); + *out = std::make_shared(values.length(), nullptr, nullptr, + values.null_bitmap(), values.null_count()); return Status::OK(); } else { return MakeArray(values.data(), out); @@ -537,8 +519,8 @@ class TableWriter::TableWriterImpl : public ArrayVisitor { // Footer: metadata length, magic bytes RETURN_NOT_OK( stream_->Write(reinterpret_cast(&buffer_size), sizeof(uint32_t))); - return stream_->Write( - reinterpret_cast(kFeatherMagicBytes), strlen(kFeatherMagicBytes)); + return stream_->Write(reinterpret_cast(kFeatherMagicBytes), + 
strlen(kFeatherMagicBytes)); } Status LoadArrayMetadata(const Array& values, ArrayMetadata* meta) { @@ -571,7 +553,7 @@ class TableWriter::TableWriterImpl : public ArrayVisitor { // byte boundary, and we write this much data into the stream if (values.null_bitmap()) { RETURN_NOT_OK(WritePadded(stream_.get(), values.null_bitmap()->data(), - values.null_bitmap()->size(), &bytes_written)); + values.null_bitmap()->size(), &bytes_written)); } else { RETURN_NOT_OK(WritePaddedBlank( stream_.get(), BitUtil::BytesForBits(values.length()), &bytes_written)); @@ -592,15 +574,17 @@ class TableWriter::TableWriterImpl : public ArrayVisitor { values_bytes = bin_values.raw_value_offsets()[values.length()]; // Write the variable-length offsets - RETURN_NOT_OK(WritePadded(stream_.get(), - reinterpret_cast(bin_values.raw_value_offsets()), - offset_bytes, &bytes_written)); + RETURN_NOT_OK(WritePadded(stream_.get(), reinterpret_cast( + bin_values.raw_value_offsets()), + offset_bytes, &bytes_written)); } else { RETURN_NOT_OK(WritePaddedBlank(stream_.get(), offset_bytes, &bytes_written)); } meta->total_bytes += bytes_written; - if (bin_values.value_data()) { values_buffer = bin_values.value_data()->data(); } + if (bin_values.value_data()) { + values_buffer = bin_values.value_data()->data(); + } } else { const auto& prim_values = static_cast(values); const auto& fw_type = static_cast(*values.type()); @@ -612,7 +596,9 @@ class TableWriter::TableWriterImpl : public ArrayVisitor { values_bytes = values.length() * fw_type.bit_width() / 8; } - if (prim_values.values()) { values_buffer = prim_values.values()->data(); } + if (prim_values.values()) { + values_buffer = prim_values.values()->data(); + } } if (values_buffer) { RETURN_NOT_OK( @@ -710,9 +696,9 @@ class TableWriter::TableWriterImpl : public ArrayVisitor { Status CheckStarted() { if (!initialized_stream_) { int64_t bytes_written_unused; - RETURN_NOT_OK( - WritePadded(stream_.get(), reinterpret_cast(kFeatherMagicBytes), - strlen(kFeatherMagicBytes), &bytes_written_unused)); + RETURN_NOT_OK(WritePadded(stream_.get(), + reinterpret_cast(kFeatherMagicBytes), + strlen(kFeatherMagicBytes), &bytes_written_unused)); initialized_stream_ = true; } return Status::OK(); @@ -728,33 +714,25 @@ class TableWriter::TableWriterImpl : public ArrayVisitor { Status AppendPrimitive(const PrimitiveArray& values, ArrayMetadata* out); }; -TableWriter::TableWriter() { - impl_.reset(new TableWriterImpl()); -} +TableWriter::TableWriter() { impl_.reset(new TableWriterImpl()); } TableWriter::~TableWriter() {} -Status TableWriter::Open( - const std::shared_ptr& stream, std::unique_ptr* out) { +Status TableWriter::Open(const std::shared_ptr& stream, + std::unique_ptr* out) { out->reset(new TableWriter()); return (*out)->impl_->Open(stream); } -void TableWriter::SetDescription(const std::string& desc) { - impl_->SetDescription(desc); -} +void TableWriter::SetDescription(const std::string& desc) { impl_->SetDescription(desc); } -void TableWriter::SetNumRows(int64_t num_rows) { - impl_->SetNumRows(num_rows); -} +void TableWriter::SetNumRows(int64_t num_rows) { impl_->SetNumRows(num_rows); } Status TableWriter::Append(const std::string& name, const Array& values) { return impl_->Append(name, values); } -Status TableWriter::Finalize() { - return impl_->Finalize(); -} +Status TableWriter::Finalize() { return impl_->Finalize(); } } // namespace feather } // namespace ipc diff --git a/cpp/src/arrow/ipc/feather.h b/cpp/src/arrow/ipc/feather.h index 4d59a8bbd54..2ab35a9556d 100644 --- 
a/cpp/src/arrow/ipc/feather.h +++ b/cpp/src/arrow/ipc/feather.h @@ -15,8 +15,8 @@ // specific language governing permissions and limitations // under the License. -/// Public API for the "Feather" file format, originally created at -/// http://github.com/wesm/feather +// Public API for the "Feather" file format, originally created at +// http://github.com/wesm/feather #ifndef ARROW_IPC_FEATHER_H #define ARROW_IPC_FEATHER_H @@ -56,7 +56,7 @@ class ARROW_EXPORT TableReader { ~TableReader(); static Status Open(const std::shared_ptr& source, - std::unique_ptr* out); + std::unique_ptr* out); // Optional table description // @@ -83,8 +83,8 @@ class ARROW_EXPORT TableWriter { public: ~TableWriter(); - static Status Open( - const std::shared_ptr& stream, std::unique_ptr* out); + static Status Open(const std::shared_ptr& stream, + std::unique_ptr* out); void SetDescription(const std::string& desc); void SetNumRows(int64_t num_rows); diff --git a/cpp/src/arrow/ipc/file-to-stream.cc b/cpp/src/arrow/ipc/file-to-stream.cc index a1feedc2126..4707c4fcdf0 100644 --- a/cpp/src/arrow/ipc/file-to-stream.cc +++ b/cpp/src/arrow/ipc/file-to-stream.cc @@ -15,11 +15,11 @@ // specific language governing permissions and limitations // under the License. +#include #include "arrow/io/file.h" #include "arrow/ipc/reader.h" #include "arrow/ipc/writer.h" #include "arrow/status.h" -#include #include "arrow/util/io-util.h" diff --git a/cpp/src/arrow/ipc/ipc-json-test.cc b/cpp/src/arrow/ipc/ipc-json-test.cc index 79344df46b2..35264fa02c5 100644 --- a/cpp/src/arrow/ipc/ipc-json-test.cc +++ b/cpp/src/arrow/ipc/ipc-json-test.cc @@ -77,7 +77,9 @@ void TestArrayRoundTrip(const Array& array) { rj::Document d; d.Parse(array_as_json); - if (d.HasParseError()) { FAIL() << "JSON parsing failed"; } + if (d.HasParseError()) { + FAIL() << "JSON parsing failed"; + } std::shared_ptr out; ASSERT_OK(internal::ReadArray(default_memory_pool(), d, array.type(), &out)); @@ -88,7 +90,8 @@ void TestArrayRoundTrip(const Array& array) { template void CheckPrimitive(const std::shared_ptr& type, - const std::vector& is_valid, const std::vector& values) { + const std::vector& is_valid, + const std::vector& values) { MemoryPool* pool = default_memory_pool(); typename TypeTraits::BuilderType builder(pool); @@ -108,16 +111,17 @@ void CheckPrimitive(const std::shared_ptr& type, TEST(TestJsonSchemaWriter, FlatTypes) { // TODO // field("f14", date32()) - std::vector> fields = {field("f0", int8()), - field("f1", int16(), false), field("f2", int32()), field("f3", int64(), false), - field("f4", uint8()), field("f5", uint16()), field("f6", uint32()), - field("f7", uint64()), field("f8", float32()), field("f9", float64()), - field("f10", utf8()), field("f11", binary()), field("f12", list(int32())), + std::vector> fields = { + field("f0", int8()), field("f1", int16(), false), field("f2", int32()), + field("f3", int64(), false), field("f4", uint8()), field("f5", uint16()), + field("f6", uint32()), field("f7", uint64()), field("f8", float32()), + field("f9", float64()), field("f10", utf8()), field("f11", binary()), + field("f12", list(int32())), field("f13", struct_({field("s1", int32()), field("s2", utf8())})), field("f15", date64()), field("f16", timestamp(TimeUnit::NANO)), field("f17", time64(TimeUnit::MICRO)), field("f18", union_({field("u1", int8()), field("u2", time32(TimeUnit::MILLI))}, - {0, 1}, UnionMode::DENSE))}; + {0, 1}, UnionMode::DENSE))}; Schema schema(fields); TestSchemaRoundTrip(schema); @@ -185,8 +189,8 @@ TEST(TestJsonArrayWriter, NestedTypes) { 
struct_({field("f1", int32()), field("f2", int32()), field("f3", int32())}); std::vector> fields = {values_array, values_array, values_array}; - StructArray struct_array( - struct_type, static_cast(struct_is_valid.size()), fields, struct_bitmap, 2); + StructArray struct_array(struct_type, static_cast(struct_is_valid.size()), fields, + struct_bitmap, 2); TestArrayRoundTrip(struct_array); } @@ -202,7 +206,7 @@ TEST(TestJsonArrayWriter, Unions) { // Data generation for test case below void MakeBatchArrays(const std::shared_ptr& schema, const int num_rows, - std::vector>* arrays) { + std::vector>* arrays) { std::vector is_valid; test::random_is_valid(num_rows, 0.25, &is_valid); @@ -266,8 +270,8 @@ TEST(TestJsonFileReadWrite, BasicRoundTrip) { std::unique_ptr reader; - auto buffer = std::make_shared( - reinterpret_cast(result.c_str()), static_cast(result.size())); + auto buffer = std::make_shared(reinterpret_cast(result.c_str()), + static_cast(result.size())); ASSERT_OK(JsonReader::Open(buffer, &reader)); ASSERT_TRUE(reader->schema()->Equals(*schema)); @@ -332,8 +336,8 @@ TEST(TestJsonFileReadWrite, MinimalFormatExample) { } )example"; - auto buffer = std::make_shared( - reinterpret_cast(example), strlen(example)); + auto buffer = std::make_shared(reinterpret_cast(example), + strlen(example)); std::unique_ptr reader; ASSERT_OK(JsonReader::Open(buffer, &reader)); @@ -361,9 +365,9 @@ TEST(TestJsonFileReadWrite, MinimalFormatExample) { #define BATCH_CASES() \ ::testing::Values(&MakeIntRecordBatch, &MakeListRecordBatch, &MakeNonNullRecordBatch, \ - &MakeZeroLengthRecordBatch, &MakeDeeplyNestedList, &MakeStringTypesRecordBatch, \ - &MakeStruct, &MakeUnion, &MakeDates, &MakeTimestamps, &MakeTimes, &MakeFWBinary, \ - &MakeDictionary); + &MakeZeroLengthRecordBatch, &MakeDeeplyNestedList, \ + &MakeStringTypesRecordBatch, &MakeStruct, &MakeUnion, &MakeDates, \ + &MakeTimestamps, &MakeTimes, &MakeFWBinary, &MakeDictionary); class TestJsonRoundTrip : public ::testing::TestWithParam { public: @@ -382,7 +386,7 @@ void CheckRoundtrip(const RecordBatch& batch) { ASSERT_OK(writer->Finish(&result)); auto buffer = std::make_shared(reinterpret_cast(result.c_str()), - static_cast(result.size())); + static_cast(result.size())); std::unique_ptr reader; ASSERT_OK(JsonReader::Open(buffer, &reader)); diff --git a/cpp/src/arrow/ipc/ipc-read-write-benchmark.cc b/cpp/src/arrow/ipc/ipc-read-write-benchmark.cc index c890d829849..a88120a248d 100644 --- a/cpp/src/arrow/ipc/ipc-read-write-benchmark.cc +++ b/cpp/src/arrow/ipc/ipc-read-write-benchmark.cc @@ -80,7 +80,7 @@ static void BM_WriteRecordBatch(benchmark::State& state) { // NOLINT non-const int32_t metadata_length; int64_t body_length; if (!ipc::WriteRecordBatch(*record_batch, 0, &stream, &metadata_length, &body_length, - default_memory_pool()) + default_memory_pool()) .ok()) { state.SkipWithError("Failed to write!"); } @@ -101,7 +101,7 @@ static void BM_ReadRecordBatch(benchmark::State& state) { // NOLINT non-const r int32_t metadata_length; int64_t body_length; if (!ipc::WriteRecordBatch(*record_batch, 0, &stream, &metadata_length, &body_length, - default_memory_pool()) + default_memory_pool()) .ok()) { state.SkipWithError("Failed to write!"); } diff --git a/cpp/src/arrow/ipc/ipc-read-write-test.cc b/cpp/src/arrow/ipc/ipc-read-write-test.cc index 2119ff74056..a6246c96f2d 100644 --- a/cpp/src/arrow/ipc/ipc-read-write-test.cc +++ b/cpp/src/arrow/ipc/ipc-read-write-test.cc @@ -126,40 +126,45 @@ TEST_F(TestSchemaMetadata, NestedFields) { CheckRoundtrip(schema, &memo); } 
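Reviewer note: the `BATCH_CASES()` macros being re-wrapped below feed a value-parameterized gtest suite with pointers to record-batch factory functions (`MakeIntRecordBatch`, `MakeListRecordBatch`, and so on). A minimal sketch of that pattern follows; the factory names and `FakeBatch` type are hypothetical stand-ins, not the IPC test utilities.

```cpp
// Sketch of the parameterized-test pattern behind BATCH_CASES(): each test
// parameter is a factory function that builds one test case.
#include <memory>
#include <gtest/gtest.h>

struct FakeBatch {
  int num_rows;
};

using MakeBatchFn = void (*)(std::shared_ptr<FakeBatch>* out);

static void MakeSmallBatch(std::shared_ptr<FakeBatch>* out) {
  *out = std::make_shared<FakeBatch>(FakeBatch{3});
}

static void MakeEmptyBatch(std::shared_ptr<FakeBatch>* out) {
  *out = std::make_shared<FakeBatch>(FakeBatch{0});
}

class RoundTripTest : public ::testing::TestWithParam<MakeBatchFn> {};

TEST_P(RoundTripTest, NonNegativeRowCount) {
  std::shared_ptr<FakeBatch> batch;
  GetParam()(&batch);  // build this parameter's test case
  ASSERT_GE(batch->num_rows, 0);
}

INSTANTIATE_TEST_CASE_P(AllCases, RoundTripTest,
                        ::testing::Values(&MakeSmallBatch, &MakeEmptyBatch));
```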
-#define BATCH_CASES() \ - ::testing::Values(&MakeIntRecordBatch, &MakeListRecordBatch, &MakeNonNullRecordBatch, \ - &MakeZeroLengthRecordBatch, &MakeDeeplyNestedList, &MakeStringTypesRecordBatch, \ - &MakeStruct, &MakeUnion, &MakeDictionary, &MakeDates, &MakeTimestamps, &MakeTimes, \ - &MakeFWBinary, &MakeBooleanBatch); +#define BATCH_CASES() \ + ::testing::Values(&MakeIntRecordBatch, &MakeListRecordBatch, &MakeNonNullRecordBatch, \ + &MakeZeroLengthRecordBatch, &MakeDeeplyNestedList, \ + &MakeStringTypesRecordBatch, &MakeStruct, &MakeUnion, \ + &MakeDictionary, &MakeDates, &MakeTimestamps, &MakeTimes, \ + &MakeFWBinary, &MakeBooleanBatch); static int g_file_number = 0; class IpcTestFixture : public io::MemoryMapFixture { public: Status DoStandardRoundTrip(const RecordBatch& batch, bool zero_data, - std::shared_ptr* batch_result) { + std::shared_ptr* batch_result) { int32_t metadata_length; int64_t body_length; const int64_t buffer_offset = 0; - if (zero_data) { RETURN_NOT_OK(ZeroMemoryMap(mmap_.get())); } + if (zero_data) { + RETURN_NOT_OK(ZeroMemoryMap(mmap_.get())); + } RETURN_NOT_OK(mmap_->Seek(0)); - RETURN_NOT_OK(WriteRecordBatch( - batch, buffer_offset, mmap_.get(), &metadata_length, &body_length, pool_)); + RETURN_NOT_OK(WriteRecordBatch(batch, buffer_offset, mmap_.get(), &metadata_length, + &body_length, pool_)); std::unique_ptr message; RETURN_NOT_OK(ReadMessage(0, metadata_length, mmap_.get(), &message)); io::BufferReader buffer_reader(message->body()); - return ReadRecordBatch( - *message->metadata(), batch.schema(), &buffer_reader, batch_result); + return ReadRecordBatch(*message->metadata(), batch.schema(), &buffer_reader, + batch_result); } - Status DoLargeRoundTrip( - const RecordBatch& batch, bool zero_data, std::shared_ptr* result) { - if (zero_data) { RETURN_NOT_OK(ZeroMemoryMap(mmap_.get())); } + Status DoLargeRoundTrip(const RecordBatch& batch, bool zero_data, + std::shared_ptr* result) { + if (zero_data) { + RETURN_NOT_OK(ZeroMemoryMap(mmap_.get())); + } RETURN_NOT_OK(mmap_->Seek(0)); std::shared_ptr file_writer; @@ -244,8 +249,8 @@ TEST_F(TestIpcRoundTrip, MetadataVersion) { const int64_t buffer_offset = 0; - ASSERT_OK(WriteRecordBatch( - *batch, buffer_offset, mmap_.get(), &metadata_length, &body_length, pool_)); + ASSERT_OK(WriteRecordBatch(*batch, buffer_offset, mmap_.get(), &metadata_length, + &body_length, pool_)); std::unique_ptr message; ASSERT_OK(ReadMessage(0, metadata_length, mmap_.get(), &message)); @@ -258,7 +263,9 @@ TEST_P(TestIpcRoundTrip, SliceRoundTrip) { ASSERT_OK((*GetParam())(&batch)); // NOLINT clang-tidy gtest issue // Skip the zero-length case - if (batch->num_rows() < 2) { return; } + if (batch->num_rows() < 2) { + return; + } auto sliced_batch = batch->Slice(2, 10); CheckRoundtrip(*sliced_batch, 1 << 20); @@ -282,8 +289,9 @@ TEST_P(TestIpcRoundTrip, ZeroLengthArrays) { ASSERT_OK(AllocateBuffer(pool_, sizeof(int32_t), &value_offsets)); *reinterpret_cast(value_offsets->mutable_data()) = 0; - std::shared_ptr bin_array = std::make_shared(0, value_offsets, - std::make_shared(nullptr, 0), std::make_shared(nullptr, 0)); + std::shared_ptr bin_array = std::make_shared( + 0, value_offsets, std::make_shared(nullptr, 0), + std::make_shared(nullptr, 0)); // null value_offsets std::shared_ptr bin_array2 = std::make_shared(0, nullptr, nullptr); @@ -357,8 +365,8 @@ TEST_F(TestWriteRecordBatch, SliceTruncatesBuffers) { std::shared_ptr offsets_buffer; ASSERT_OK( test::CopyBufferFromVector(type_offsets, default_memory_pool(), &offsets_buffer)); - a1 = 
std::make_shared( - dense_union_type, a0->length(), struct_children, ids_buffer, offsets_buffer); + a1 = std::make_shared(dense_union_type, a0->length(), struct_children, + ids_buffer, offsets_buffer); CheckArray(a1); } @@ -367,8 +375,8 @@ void TestGetRecordBatchSize(std::shared_ptr batch) { int32_t mock_metadata_length = -1; int64_t mock_body_length = -1; int64_t size = -1; - ASSERT_OK(WriteRecordBatch( - *batch, 0, &mock, &mock_metadata_length, &mock_body_length, default_memory_pool())); + ASSERT_OK(WriteRecordBatch(*batch, 0, &mock, &mock_metadata_length, &mock_body_length, + default_memory_pool())); ASSERT_OK(GetRecordBatchSize(*batch, &size)); ASSERT_EQ(mock.GetExtentBytesWritten(), size); } @@ -398,10 +406,10 @@ class RecursionLimits : public ::testing::Test, public io::MemoryMapFixture { void TearDown() { io::MemoryMapFixture::TearDown(); } Status WriteToMmap(int recursion_level, bool override_level, int32_t* metadata_length, - int64_t* body_length, std::shared_ptr* batch, - std::shared_ptr* schema) { + int64_t* body_length, std::shared_ptr* batch, + std::shared_ptr* schema) { const int batch_length = 5; - TypePtr type = int32(); + auto type = int32(); std::shared_ptr array; const bool include_nulls = true; RETURN_NOT_OK(MakeRandomInt32Array(1000, include_nulls, pool_, &array)); @@ -425,10 +433,10 @@ class RecursionLimits : public ::testing::Test, public io::MemoryMapFixture { if (override_level) { return WriteRecordBatch(**batch, 0, mmap_.get(), metadata_length, body_length, - pool_, recursion_level + 1); + pool_, recursion_level + 1); } else { - return WriteRecordBatch( - **batch, 0, mmap_.get(), metadata_length, body_length, pool_); + return WriteRecordBatch(**batch, 0, mmap_.get(), metadata_length, body_length, + pool_); } } @@ -442,8 +450,8 @@ TEST_F(RecursionLimits, WriteLimit) { int64_t body_length = -1; std::shared_ptr schema; std::shared_ptr batch; - ASSERT_RAISES(Invalid, - WriteToMmap((1 << 8) + 1, false, &metadata_length, &body_length, &batch, &schema)); + ASSERT_RAISES(Invalid, WriteToMmap((1 << 8) + 1, false, &metadata_length, &body_length, + &batch, &schema)); } TEST_F(RecursionLimits, ReadLimit) { @@ -454,8 +462,8 @@ TEST_F(RecursionLimits, ReadLimit) { const int recursion_depth = 64; std::shared_ptr batch; - ASSERT_OK(WriteToMmap( - recursion_depth, true, &metadata_length, &body_length, &batch, &schema)); + ASSERT_OK(WriteToMmap(recursion_depth, true, &metadata_length, &body_length, &batch, + &schema)); std::unique_ptr message; ASSERT_OK(ReadMessage(0, metadata_length, mmap_.get(), &message)); @@ -472,16 +480,16 @@ TEST_F(RecursionLimits, StressLimit) { int64_t body_length = -1; std::shared_ptr schema; std::shared_ptr batch; - ASSERT_OK(WriteToMmap( - recursion_depth, true, &metadata_length, &body_length, &batch, &schema)); + ASSERT_OK(WriteToMmap(recursion_depth, true, &metadata_length, &body_length, &batch, + &schema)); std::unique_ptr message; ASSERT_OK(ReadMessage(0, metadata_length, mmap_.get(), &message)); io::BufferReader reader(message->body()); std::shared_ptr result; - ASSERT_OK(ReadRecordBatch( - *message->metadata(), schema, recursion_depth + 1, &reader, &result)); + ASSERT_OK(ReadRecordBatch(*message->metadata(), schema, recursion_depth + 1, &reader, + &result)); *it_works = result->Equals(*batch); }; @@ -568,8 +576,8 @@ class TestStreamFormat : public ::testing::TestWithParam { } void TearDown() {} - Status RoundTripHelper( - const RecordBatch& batch, std::vector>* out_batches) { + Status RoundTripHelper(const RecordBatch& batch, + std::vector>* 
out_batches) { // Write the file std::shared_ptr writer; RETURN_NOT_OK(RecordBatchStreamWriter::Open(sink_.get(), batch.schema(), &writer)); @@ -589,7 +597,9 @@ class TestStreamFormat : public ::testing::TestWithParam { std::shared_ptr chunk; while (true) { RETURN_NOT_OK(reader->ReadNextRecordBatch(&chunk)); - if (chunk == nullptr) { break; } + if (chunk == nullptr) { + break; + } out_batches->emplace_back(chunk); } return Status::OK(); @@ -747,8 +757,8 @@ TEST_F(TestTensorRoundTrip, NonContiguous) { int32_t metadata_length; int64_t body_length; ASSERT_OK(mmap_->Seek(0)); - ASSERT_RAISES( - Invalid, WriteTensor(tensor, mmap_.get(), &metadata_length, &body_length)); + ASSERT_RAISES(Invalid, + WriteTensor(tensor, mmap_.get(), &metadata_length, &body_length)); } } // namespace ipc diff --git a/cpp/src/arrow/ipc/json-integration-test.cc b/cpp/src/arrow/ipc/json-integration-test.cc index 18f5dfaf570..035f7086e7e 100644 --- a/cpp/src/arrow/ipc/json-integration-test.cc +++ b/cpp/src/arrow/ipc/json-integration-test.cc @@ -40,7 +40,8 @@ DEFINE_string(arrow, "", "Arrow file name"); DEFINE_string(json, "", "JSON file name"); -DEFINE_string(mode, "VALIDATE", +DEFINE_string( + mode, "VALIDATE", "Mode of integration testing tool (ARROW_TO_JSON, JSON_TO_ARROW, VALIDATE)"); DEFINE_bool(integration, false, "Run in integration test mode"); DEFINE_bool(verbose, true, "Verbose output"); @@ -55,8 +56,8 @@ bool file_exists(const char* path) { } // Convert JSON file to IPC binary format -static Status ConvertJsonToArrow( - const std::string& json_path, const std::string& arrow_path) { +static Status ConvertJsonToArrow(const std::string& json_path, + const std::string& arrow_path) { std::shared_ptr in_file; std::shared_ptr out_file; @@ -89,8 +90,8 @@ static Status ConvertJsonToArrow( } // Convert IPC binary format to JSON -static Status ConvertArrowToJson( - const std::string& arrow_path, const std::string& json_path) { +static Status ConvertArrowToJson(const std::string& arrow_path, + const std::string& json_path) { std::shared_ptr in_file; std::shared_ptr out_file; @@ -116,11 +117,11 @@ static Status ConvertArrowToJson( std::string result; RETURN_NOT_OK(writer->Finish(&result)); return out_file->Write(reinterpret_cast(result.c_str()), - static_cast(result.size())); + static_cast(result.size())); } -static Status ValidateArrowVsJson( - const std::string& arrow_path, const std::string& json_path) { +static Status ValidateArrowVsJson(const std::string& arrow_path, + const std::string& json_path) { // Construct JSON reader std::shared_ptr json_file; RETURN_NOT_OK(io::ReadableFile::Open(json_path, &json_file)); @@ -151,7 +152,9 @@ static Status ValidateArrowVsJson( << "Arrow schema: \n" << arrow_schema->ToString(); - if (FLAGS_verbose) { std::cout << ss.str() << std::endl; } + if (FLAGS_verbose) { + std::cout << ss.str() << std::endl; + } return Status::Invalid("Schemas did not match"); } @@ -188,10 +191,14 @@ static Status ValidateArrowVsJson( } Status RunCommand(const std::string& json_path, const std::string& arrow_path, - const std::string& command) { - if (json_path == "") { return Status::Invalid("Must specify json file name"); } + const std::string& command) { + if (json_path == "") { + return Status::Invalid("Must specify json file name"); + } - if (arrow_path == "") { return Status::Invalid("Must specify arrow file name"); } + if (arrow_path == "") { + return Status::Invalid("Must specify arrow file name"); + } if (command == "ARROW_TO_JSON") { if (!file_exists(arrow_path.c_str())) { @@ -240,8 +247,8 @@ 
class TestJSONIntegration : public ::testing::Test { do { std::shared_ptr out; RETURN_NOT_OK(io::FileOutputStream::Open(path, &out)); - RETURN_NOT_OK(out->Write( - reinterpret_cast(data), static_cast(strlen(data)))); + RETURN_NOT_OK(out->Write(reinterpret_cast(data), + static_cast(strlen(data)))); } while (0); return Status::OK(); } diff --git a/cpp/src/arrow/ipc/json-internal.cc b/cpp/src/arrow/ipc/json-internal.cc index 69e4ae8d14a..175d75b7d1e 100644 --- a/cpp/src/arrow/ipc/json-internal.cc +++ b/cpp/src/arrow/ipc/json-internal.cc @@ -199,7 +199,7 @@ class SchemaWriter { typename std::enable_if::value || std::is_base_of::value || std::is_base_of::value, - void>::type + void>::type WriteTypeMetadata(const T& type) {} void WriteTypeMetadata(const Integer& type) { @@ -508,7 +508,7 @@ class ArrayWriter { } Status WriteChildren(const std::vector>& fields, - const std::vector>& arrays) { + const std::vector>& arrays) { writer_->Key("children"); writer_->StartArray(); for (size_t i = 0; i < fields.size(); ++i) { @@ -602,16 +602,16 @@ static Status GetObjectBool(const RjObject& obj, const std::string& key, bool* o return Status::OK(); } -static Status GetObjectString( - const RjObject& obj, const std::string& key, std::string* out) { +static Status GetObjectString(const RjObject& obj, const std::string& key, + std::string* out) { const auto& it = obj.FindMember(key); RETURN_NOT_STRING(key, it, obj); *out = it->value.GetString(); return Status::OK(); } -static Status GetInteger( - const rj::Value::ConstObject& json_type, std::shared_ptr* type) { +static Status GetInteger(const rj::Value::ConstObject& json_type, + std::shared_ptr* type) { const auto& it_bit_width = json_type.FindMember("bitWidth"); RETURN_NOT_INT("bitWidth", it_bit_width, json_type); @@ -642,8 +642,8 @@ static Status GetInteger( return Status::OK(); } -static Status GetFloatingPoint( - const RjObject& json_type, std::shared_ptr* type) { +static Status GetFloatingPoint(const RjObject& json_type, + std::shared_ptr* type) { const auto& it_precision = json_type.FindMember("precision"); RETURN_NOT_STRING("precision", it_precision, json_type); @@ -663,8 +663,8 @@ static Status GetFloatingPoint( return Status::OK(); } -static Status GetFixedSizeBinary( - const RjObject& json_type, std::shared_ptr* type) { +static Status GetFixedSizeBinary(const RjObject& json_type, + std::shared_ptr* type) { const auto& it_byte_width = json_type.FindMember("byteWidth"); RETURN_NOT_INT("byteWidth", it_byte_width, json_type); @@ -756,8 +756,8 @@ static Status GetTimestamp(const RjObject& json_type, std::shared_ptr* } static Status GetUnion(const RjObject& json_type, - const std::vector>& children, - std::shared_ptr* type) { + const std::vector>& children, + std::shared_ptr* type) { const auto& it_mode = json_type.FindMember("mode"); RETURN_NOT_STRING("mode", it_mode, json_type); @@ -790,8 +790,8 @@ static Status GetUnion(const RjObject& json_type, } static Status GetType(const RjObject& json_type, - const std::vector>& children, - std::shared_ptr* type) { + const std::vector>& children, + std::shared_ptr* type) { const auto& it_type_name = json_type.FindMember("name"); RETURN_NOT_STRING("name", it_type_name, json_type); @@ -831,10 +831,11 @@ static Status GetType(const RjObject& json_type, } static Status GetField(const rj::Value& obj, const DictionaryMemo* dictionary_memo, - std::shared_ptr* field); + std::shared_ptr* field); static Status GetFieldsFromArray(const rj::Value& obj, - const DictionaryMemo* dictionary_memo, std::vector>* fields) { + const 
DictionaryMemo* dictionary_memo, + std::vector>* fields) { const auto& values = obj.GetArray(); fields->resize(values.Size()); @@ -845,7 +846,7 @@ static Status GetFieldsFromArray(const rj::Value& obj, } static Status ParseDictionary(const RjObject& obj, int64_t* id, bool* is_ordered, - std::shared_ptr* index_type) { + std::shared_ptr* index_type) { int32_t int32_id; RETURN_NOT_OK(GetObjectInt(obj, "id", &int32_id)); *id = int32_id; @@ -866,8 +867,10 @@ static Status ParseDictionary(const RjObject& obj, int64_t* id, bool* is_ordered } static Status GetField(const rj::Value& obj, const DictionaryMemo* dictionary_memo, - std::shared_ptr* field) { - if (!obj.IsObject()) { return Status::Invalid("Field was not a JSON object"); } + std::shared_ptr* field) { + if (!obj.IsObject()) { + return Status::Invalid("Field was not a JSON object"); + } const auto& json_field = obj.GetObject(); std::string name; @@ -884,8 +887,8 @@ static Status GetField(const rj::Value& obj, const DictionaryMemo* dictionary_me int64_t dictionary_id; bool is_ordered; std::shared_ptr index_type; - RETURN_NOT_OK(ParseDictionary( - it_dictionary->value.GetObject(), &dictionary_id, &is_ordered, &index_type)); + RETURN_NOT_OK(ParseDictionary(it_dictionary->value.GetObject(), &dictionary_id, + &is_ordered, &index_type)); std::shared_ptr dictionary; RETURN_NOT_OK(dictionary_memo->GetDictionary(dictionary_id, &dictionary)); @@ -941,13 +944,13 @@ UnboxValue(const rj::Value& val) { class ArrayReader { public: explicit ArrayReader(const rj::Value& json_array, const std::shared_ptr& type, - MemoryPool* pool) + MemoryPool* pool) : json_array_(json_array), type_(type), pool_(pool) {} Status ParseTypeValues(const DataType& type); Status GetValidityBuffer(const std::vector& is_valid, int32_t* null_count, - std::shared_ptr* validity_buffer) { + std::shared_ptr* validity_buffer) { int length = static_cast(is_valid.size()); std::shared_ptr out_buffer; @@ -1024,7 +1027,9 @@ class ArrayReader { DCHECK(hex_string.size() % 2 == 0) << "Expected base16 hex string"; int32_t length = static_cast(hex_string.size()) / 2; - if (byte_buffer->size() < length) { RETURN_NOT_OK(byte_buffer->Resize(length)); } + if (byte_buffer->size() < length) { + RETURN_NOT_OK(byte_buffer->Resize(length)); + } const char* hex_data = hex_string.c_str(); uint8_t* byte_buffer_data = byte_buffer->mutable_data(); @@ -1078,8 +1083,8 @@ class ArrayReader { } template - Status GetIntArray( - const RjArray& json_array, const int32_t length, std::shared_ptr* out) { + Status GetIntArray(const RjArray& json_array, const int32_t length, + std::shared_ptr* out) { std::shared_ptr buffer; RETURN_NOT_OK(AllocateBuffer(pool_, length * sizeof(T), &buffer)); @@ -1102,15 +1107,15 @@ class ArrayReader { const auto& json_offsets = obj_->FindMember("OFFSET"); RETURN_NOT_ARRAY("OFFSET", json_offsets, *obj_); std::shared_ptr offsets_buffer; - RETURN_NOT_OK(GetIntArray( - json_offsets->value.GetArray(), length_ + 1, &offsets_buffer)); + RETURN_NOT_OK(GetIntArray(json_offsets->value.GetArray(), length_ + 1, + &offsets_buffer)); std::vector> children; RETURN_NOT_OK(GetChildren(*obj_, type, &children)); DCHECK_EQ(children.size(), 1); - result_ = std::make_shared( - type_, length_, offsets_buffer, children[0], validity_buffer, null_count); + result_ = std::make_shared(type_, length_, offsets_buffer, children[0], + validity_buffer, null_count); return Status::OK(); } @@ -1123,8 +1128,8 @@ class ArrayReader { std::vector> fields; RETURN_NOT_OK(GetChildren(*obj_, type, &fields)); - result_ = 
std::make_shared( - type_, length_, fields, validity_buffer, null_count); + result_ = std::make_shared(type_, length_, fields, validity_buffer, + null_count); return Status::OK(); } @@ -1154,7 +1159,7 @@ class ArrayReader { RETURN_NOT_OK(GetChildren(*obj_, type, &children)); result_ = std::make_shared(type_, length_, children, type_id_buffer, - offsets_buffer, validity_buffer, null_count); + offsets_buffer, validity_buffer, null_count); return Status::OK(); } @@ -1177,7 +1182,7 @@ class ArrayReader { } Status GetChildren(const RjObject& obj, const DataType& type, - std::vector>* array) { + std::vector>* array) { const auto& json_children = obj.FindMember("children"); RETURN_NOT_ARRAY("children", json_children, obj); const auto& json_children_arr = json_children->value.GetArray(); @@ -1280,7 +1285,8 @@ static Status GetDictionaryTypes(const RjArray& fields, DictionaryTypeMap* id_to } static Status ReadDictionary(const RjObject& obj, const DictionaryTypeMap& id_to_field, - MemoryPool* pool, int64_t* dictionary_id, std::shared_ptr* out) { + MemoryPool* pool, int64_t* dictionary_id, + std::shared_ptr* out) { int id; RETURN_NOT_OK(GetObjectInt(obj, "id", &id)); @@ -1312,7 +1318,7 @@ static Status ReadDictionary(const RjObject& obj, const DictionaryTypeMap& id_to } static Status ReadDictionaries(const rj::Value& doc, const DictionaryTypeMap& id_to_field, - MemoryPool* pool, DictionaryMemo* dictionary_memo) { + MemoryPool* pool, DictionaryMemo* dictionary_memo) { auto it = doc.FindMember("dictionaries"); if (it == doc.MemberEnd()) { // No dictionaries @@ -1334,8 +1340,8 @@ static Status ReadDictionaries(const rj::Value& doc, const DictionaryTypeMap& id return Status::OK(); } -Status ReadSchema( - const rj::Value& json_schema, MemoryPool* pool, std::shared_ptr* schema) { +Status ReadSchema(const rj::Value& json_schema, MemoryPool* pool, + std::shared_ptr* schema) { auto it = json_schema.FindMember("schema"); RETURN_NOT_OBJECT("schema", it, json_schema); const auto& obj_schema = it->value.GetObject(); @@ -1359,7 +1365,7 @@ Status ReadSchema( } Status ReadRecordBatch(const rj::Value& json_obj, const std::shared_ptr& schema, - MemoryPool* pool, std::shared_ptr* batch) { + MemoryPool* pool, std::shared_ptr* batch) { DCHECK(json_obj.IsObject()); const auto& batch_obj = json_obj.GetObject(); @@ -1409,14 +1415,16 @@ Status WriteArray(const std::string& name, const Array& array, RjWriter* json_wr } Status ReadArray(MemoryPool* pool, const rj::Value& json_array, - const std::shared_ptr& type, std::shared_ptr* array) { + const std::shared_ptr& type, std::shared_ptr* array) { ArrayReader converter(json_array, type, pool); return converter.GetArray(array); } Status ReadArray(MemoryPool* pool, const rj::Value& json_array, const Schema& schema, - std::shared_ptr* array) { - if (!json_array.IsObject()) { return Status::Invalid("Element was not a JSON object"); } + std::shared_ptr* array) { + if (!json_array.IsObject()) { + return Status::Invalid("Element was not a JSON object"); + } const auto& json_obj = json_array.GetObject(); diff --git a/cpp/src/arrow/ipc/json-internal.h b/cpp/src/arrow/ipc/json-internal.h index 5571d923396..9b641cd5332 100644 --- a/cpp/src/arrow/ipc/json-internal.h +++ b/cpp/src/arrow/ipc/json-internal.h @@ -99,17 +99,17 @@ Status WriteSchema(const Schema& schema, RjWriter* writer); Status WriteRecordBatch(const RecordBatch& batch, RjWriter* writer); Status WriteArray(const std::string& name, const Array& array, RjWriter* writer); -Status ReadSchema( - const rj::Value& json_obj, MemoryPool* 
pool, std::shared_ptr* schema); +Status ReadSchema(const rj::Value& json_obj, MemoryPool* pool, + std::shared_ptr* schema); Status ReadRecordBatch(const rj::Value& json_obj, const std::shared_ptr& schema, - MemoryPool* pool, std::shared_ptr* batch); + MemoryPool* pool, std::shared_ptr* batch); Status ReadArray(MemoryPool* pool, const rj::Value& json_obj, - const std::shared_ptr& type, std::shared_ptr* array); + const std::shared_ptr& type, std::shared_ptr* array); Status ReadArray(MemoryPool* pool, const rj::Value& json_obj, const Schema& schema, - std::shared_ptr* array); + std::shared_ptr* array); } // namespace internal } // namespace json diff --git a/cpp/src/arrow/ipc/json.cc b/cpp/src/arrow/ipc/json.cc index 36e343e5fb5..f57101a31a9 100644 --- a/cpp/src/arrow/ipc/json.cc +++ b/cpp/src/arrow/ipc/json.cc @@ -79,15 +79,13 @@ JsonWriter::JsonWriter(const std::shared_ptr& schema) { JsonWriter::~JsonWriter() {} -Status JsonWriter::Open( - const std::shared_ptr& schema, std::unique_ptr* writer) { +Status JsonWriter::Open(const std::shared_ptr& schema, + std::unique_ptr* writer) { *writer = std::unique_ptr(new JsonWriter(schema)); return (*writer)->impl_->Start(); } -Status JsonWriter::Finish(std::string* result) { - return impl_->Finish(result); -} +Status JsonWriter::Finish(std::string* result) { return impl_->Finish(result); } Status JsonWriter::WriteRecordBatch(const RecordBatch& batch) { return impl_->WriteRecordBatch(batch); @@ -103,8 +101,10 @@ class JsonReader::JsonReaderImpl { Status ParseAndReadSchema() { doc_.Parse(reinterpret_cast(data_->data()), - static_cast(data_->size())); - if (doc_.HasParseError()) { return Status::IOError("JSON parsing failed"); } + static_cast(data_->size())); + if (doc_.HasParseError()) { + return Status::IOError("JSON parsing failed"); + } RETURN_NOT_OK(json::internal::ReadSchema(doc_, pool_, &schema_)); @@ -120,8 +120,8 @@ class JsonReader::JsonReaderImpl { DCHECK_LT(i, static_cast(record_batches_->GetArray().Size())) << "i out of bounds"; - return json::internal::ReadRecordBatch( - record_batches_->GetArray()[i], schema_, pool_, batch); + return json::internal::ReadRecordBatch(record_batches_->GetArray()[i], schema_, pool_, + batch); } std::shared_ptr schema() const { return schema_; } @@ -145,24 +145,20 @@ JsonReader::JsonReader(MemoryPool* pool, const std::shared_ptr& data) { JsonReader::~JsonReader() {} -Status JsonReader::Open( - const std::shared_ptr& data, std::unique_ptr* reader) { +Status JsonReader::Open(const std::shared_ptr& data, + std::unique_ptr* reader) { return Open(default_memory_pool(), data, reader); } Status JsonReader::Open(MemoryPool* pool, const std::shared_ptr& data, - std::unique_ptr* reader) { + std::unique_ptr* reader) { *reader = std::unique_ptr(new JsonReader(pool, data)); return (*reader)->impl_->ParseAndReadSchema(); } -std::shared_ptr JsonReader::schema() const { - return impl_->schema(); -} +std::shared_ptr JsonReader::schema() const { return impl_->schema(); } -int JsonReader::num_record_batches() const { - return impl_->num_record_batches(); -} +int JsonReader::num_record_batches() const { return impl_->num_record_batches(); } Status JsonReader::ReadRecordBatch(int i, std::shared_ptr* batch) const { return impl_->ReadRecordBatch(i, batch); diff --git a/cpp/src/arrow/ipc/json.h b/cpp/src/arrow/ipc/json.h index 2ba27c7f2c3..be26f0233eb 100644 --- a/cpp/src/arrow/ipc/json.h +++ b/cpp/src/arrow/ipc/json.h @@ -41,8 +41,8 @@ class ARROW_EXPORT JsonWriter { public: ~JsonWriter(); - static Status Open( - const 
std::shared_ptr& schema, std::unique_ptr* out); + static Status Open(const std::shared_ptr& schema, + std::unique_ptr* out); Status WriteRecordBatch(const RecordBatch& batch); Status Finish(std::string* result); @@ -61,11 +61,11 @@ class ARROW_EXPORT JsonReader { ~JsonReader(); static Status Open(MemoryPool* pool, const std::shared_ptr& data, - std::unique_ptr* reader); + std::unique_ptr* reader); // Use the default memory pool - static Status Open( - const std::shared_ptr& data, std::unique_ptr* reader); + static Status Open(const std::shared_ptr& data, + std::unique_ptr* reader); std::shared_ptr schema() const; diff --git a/cpp/src/arrow/ipc/metadata.cc b/cpp/src/arrow/ipc/metadata.cc index 49c24c72727..20fd280db6d 100644 --- a/cpp/src/arrow/ipc/metadata.cc +++ b/cpp/src/arrow/ipc/metadata.cc @@ -58,8 +58,8 @@ static constexpr flatbuf::MetadataVersion kCurrentMetadataVersion = static constexpr flatbuf::MetadataVersion kMinMetadataVersion = flatbuf::MetadataVersion_V3; -static Status IntFromFlatbuffer( - const flatbuf::Int* int_data, std::shared_ptr* out) { +static Status IntFromFlatbuffer(const flatbuf::Int* int_data, + std::shared_ptr* out) { if (int_data->bitWidth() > 64) { return Status::NotImplemented("Integers with more than 64 bits not implemented"); } @@ -86,8 +86,8 @@ static Status IntFromFlatbuffer( return Status::OK(); } -static Status FloatFromFlatuffer( - const flatbuf::FloatingPoint* float_data, std::shared_ptr* out) { +static Status FloatFromFlatuffer(const flatbuf::FloatingPoint* float_data, + std::shared_ptr* out) { if (float_data->precision() == flatbuf::Precision_HALF) { *out = float16(); } else if (float_data->precision() == flatbuf::Precision_SINGLE) { @@ -100,7 +100,7 @@ static Status FloatFromFlatuffer( // Forward declaration static Status FieldToFlatbuffer(FBB& fbb, const std::shared_ptr& field, - DictionaryMemo* dictionary_memo, FieldOffset* offset); + DictionaryMemo* dictionary_memo, FieldOffset* offset); static Offset IntToFlatbuffer(FBB& fbb, int bitWidth, bool is_signed) { return flatbuf::CreateInt(fbb, bitWidth, is_signed).Union(); @@ -111,7 +111,8 @@ static Offset FloatToFlatbuffer(FBB& fbb, flatbuf::Precision precision) { } static Status AppendChildFields(FBB& fbb, const std::shared_ptr& type, - std::vector* out_children, DictionaryMemo* dictionary_memo) { + std::vector* out_children, + DictionaryMemo* dictionary_memo) { FieldOffset field; for (int i = 0; i < type->num_children(); ++i) { RETURN_NOT_OK(FieldToFlatbuffer(fbb, type->child(i), dictionary_memo, &field)); @@ -121,16 +122,16 @@ static Status AppendChildFields(FBB& fbb, const std::shared_ptr& type, } static Status ListToFlatbuffer(FBB& fbb, const std::shared_ptr& type, - std::vector* out_children, DictionaryMemo* dictionary_memo, - Offset* offset) { + std::vector* out_children, + DictionaryMemo* dictionary_memo, Offset* offset) { RETURN_NOT_OK(AppendChildFields(fbb, type, out_children, dictionary_memo)); *offset = flatbuf::CreateList(fbb).Union(); return Status::OK(); } static Status StructToFlatbuffer(FBB& fbb, const std::shared_ptr& type, - std::vector* out_children, DictionaryMemo* dictionary_memo, - Offset* offset) { + std::vector* out_children, + DictionaryMemo* dictionary_memo, Offset* offset) { RETURN_NOT_OK(AppendChildFields(fbb, type, out_children, dictionary_memo)); *offset = flatbuf::CreateStruct_(fbb).Union(); return Status::OK(); @@ -140,7 +141,8 @@ static Status StructToFlatbuffer(FBB& fbb, const std::shared_ptr& type // Union implementation static Status UnionFromFlatbuffer(const 
flatbuf::Union* union_data, - const std::vector>& children, std::shared_ptr* out) { + const std::vector>& children, + std::shared_ptr* out) { UnionMode mode = union_data->mode() == flatbuf::UnionMode_Sparse ? UnionMode::SPARSE : UnionMode::DENSE; @@ -163,8 +165,8 @@ static Status UnionFromFlatbuffer(const flatbuf::Union* union_data, } static Status UnionToFlatBuffer(FBB& fbb, const std::shared_ptr& type, - std::vector* out_children, DictionaryMemo* dictionary_memo, - Offset* offset) { + std::vector* out_children, + DictionaryMemo* dictionary_memo, Offset* offset) { RETURN_NOT_OK(AppendChildFields(fbb, type, out_children, dictionary_memo)); const auto& union_type = static_cast(*type); @@ -224,15 +226,16 @@ static inline TimeUnit::type FromFlatbufferUnit(flatbuf::TimeUnit unit) { } static Status TypeFromFlatbuffer(flatbuf::Type type, const void* type_data, - const std::vector>& children, std::shared_ptr* out) { + const std::vector>& children, + std::shared_ptr* out) { switch (type) { case flatbuf::Type_NONE: return Status::Invalid("Type metadata cannot be none"); case flatbuf::Type_Int: return IntFromFlatbuffer(static_cast(type_data), out); case flatbuf::Type_FloatingPoint: - return FloatFromFlatuffer( - static_cast(type_data), out); + return FloatFromFlatuffer(static_cast(type_data), + out); case flatbuf::Type_Binary: *out = binary(); return Status::OK(); @@ -301,8 +304,8 @@ static Status TypeFromFlatbuffer(flatbuf::Type type, const void* type_data, *out = std::make_shared(children); return Status::OK(); case flatbuf::Type_Union: - return UnionFromFlatbuffer( - static_cast(type_data), children, out); + return UnionFromFlatbuffer(static_cast(type_data), children, + out); default: return Status::Invalid("Unrecognized type"); } @@ -310,15 +313,17 @@ static Status TypeFromFlatbuffer(flatbuf::Type type, const void* type_data, // TODO(wesm): Convert this to visitor pattern static Status TypeToFlatbuffer(FBB& fbb, const std::shared_ptr& type, - std::vector* children, std::vector* layout, - flatbuf::Type* out_type, DictionaryMemo* dictionary_memo, Offset* offset) { + std::vector* children, + std::vector* layout, + flatbuf::Type* out_type, DictionaryMemo* dictionary_memo, + Offset* offset) { if (type->id() == Type::DICTIONARY) { // In this library, the dictionary "type" is a logical construct. 
Here we // pass through to the value type, as we've already captured the index // type in the DictionaryEncoding metadata in the parent field const auto& dict_type = static_cast(*type); return TypeToFlatbuffer(fbb, dict_type.dictionary()->type(), children, layout, - out_type, dictionary_memo, offset); + out_type, dictionary_memo, offset); } std::vector buffer_layout = type->GetBufferLayout(); @@ -436,7 +441,7 @@ static Status TypeToFlatbuffer(FBB& fbb, const std::shared_ptr& type, } static Status TensorTypeToFlatbuffer(FBB& fbb, const std::shared_ptr& type, - flatbuf::Type* out_type, Offset* offset) { + flatbuf::Type* out_type, Offset* offset) { switch (type->id()) { case Type::UINT8: INT_TO_FB_CASE(8, false); @@ -475,8 +480,8 @@ static Status TensorTypeToFlatbuffer(FBB& fbb, const std::shared_ptr& return Status::OK(); } -static DictionaryOffset GetDictionaryEncoding( - FBB& fbb, const DictionaryType& type, DictionaryMemo* memo) { +static DictionaryOffset GetDictionaryEncoding(FBB& fbb, const DictionaryType& type, + DictionaryMemo* memo) { int64_t dictionary_id = memo->GetId(type.dictionary()); // We assume that the dictionary index type (as an integer) has already been @@ -491,7 +496,7 @@ static DictionaryOffset GetDictionaryEncoding( } static Status FieldToFlatbuffer(FBB& fbb, const std::shared_ptr& field, - DictionaryMemo* dictionary_memo, FieldOffset* offset) { + DictionaryMemo* dictionary_memo, FieldOffset* offset) { auto fb_name = fbb.CreateString(field->name()); flatbuf::Type type_enum; @@ -500,8 +505,8 @@ static Status FieldToFlatbuffer(FBB& fbb, const std::shared_ptr& field, std::vector children; std::vector layout; - RETURN_NOT_OK(TypeToFlatbuffer( - fbb, field->type(), &children, &layout, &type_enum, dictionary_memo, &type_offset)); + RETURN_NOT_OK(TypeToFlatbuffer(fbb, field->type(), &children, &layout, &type_enum, + dictionary_memo, &type_offset)); auto fb_children = fbb.CreateVector(children); auto fb_layout = fbb.CreateVector(layout); @@ -513,13 +518,14 @@ static Status FieldToFlatbuffer(FBB& fbb, const std::shared_ptr& field, // TODO: produce the list of VectorTypes *offset = flatbuf::CreateField(fbb, fb_name, field->nullable(), type_enum, type_offset, - dictionary, fb_children, fb_layout); + dictionary, fb_children, fb_layout); return Status::OK(); } static Status FieldFromFlatbuffer(const flatbuf::Field* field, - const DictionaryMemo& dictionary_memo, std::shared_ptr* out) { + const DictionaryMemo& dictionary_memo, + std::shared_ptr* out) { std::shared_ptr type; const flatbuf::DictionaryEncoding* encoding = field->dictionary(); @@ -551,8 +557,8 @@ static Status FieldFromFlatbuffer(const flatbuf::Field* field, return Status::OK(); } -static Status FieldFromFlatbufferDictionary( - const flatbuf::Field* field, std::shared_ptr* out) { +static Status FieldFromFlatbufferDictionary(const flatbuf::Field* field, + std::shared_ptr* out) { // Need an empty memo to pass down for constructing children DictionaryMemo dummy_memo; @@ -584,7 +590,8 @@ flatbuf::Endianness endianness() { } static Status SchemaToFlatbuffer(FBB& fbb, const Schema& schema, - DictionaryMemo* dictionary_memo, flatbuffers::Offset* out) { + DictionaryMemo* dictionary_memo, + flatbuffers::Offset* out) { /// Fields std::vector field_offsets; for (int i = 0; i < schema.num_fields(); ++i) { @@ -609,8 +616,8 @@ static Status SchemaToFlatbuffer(FBB& fbb, const Schema& schema, key_value_offsets.push_back( flatbuf::CreateKeyValue(fbb, fbb.CreateString(key), fbb.CreateString(value))); } - *out = flatbuf::CreateSchema( - 
fbb, endianness(), fb_offsets, fbb.CreateVector(key_value_offsets)); + *out = flatbuf::CreateSchema(fbb, endianness(), fb_offsets, + fbb.CreateVector(key_value_offsets)); } else { *out = flatbuf::CreateSchema(fbb, endianness(), fb_offsets); } @@ -631,15 +638,16 @@ static Status WriteFlatbufferBuilder(FBB& fbb, std::shared_ptr* out) { } static Status WriteFBMessage(FBB& fbb, flatbuf::MessageHeader header_type, - flatbuffers::Offset header, int64_t body_length, std::shared_ptr* out) { - auto message = flatbuf::CreateMessage( - fbb, kCurrentMetadataVersion, header_type, header, body_length); + flatbuffers::Offset header, int64_t body_length, + std::shared_ptr* out) { + auto message = flatbuf::CreateMessage(fbb, kCurrentMetadataVersion, header_type, header, + body_length); fbb.Finish(message); return WriteFlatbufferBuilder(fbb, out); } -Status WriteSchemaMessage( - const Schema& schema, DictionaryMemo* dictionary_memo, std::shared_ptr* out) { +Status WriteSchemaMessage(const Schema& schema, DictionaryMemo* dictionary_memo, + std::shared_ptr* out) { FBB fbb; flatbuffers::Offset fb_schema; RETURN_NOT_OK(SchemaToFlatbuffer(fbb, schema, dictionary_memo, &fb_schema)); @@ -650,8 +658,8 @@ using FieldNodeVector = flatbuffers::Offset>; using BufferVector = flatbuffers::Offset>; -static Status WriteFieldNodes( - FBB& fbb, const std::vector& nodes, FieldNodeVector* out) { +static Status WriteFieldNodes(FBB& fbb, const std::vector& nodes, + FieldNodeVector* out) { std::vector fb_nodes; fb_nodes.reserve(nodes.size()); @@ -666,8 +674,8 @@ static Status WriteFieldNodes( return Status::OK(); } -static Status WriteBuffers( - FBB& fbb, const std::vector& buffers, BufferVector* out) { +static Status WriteBuffers(FBB& fbb, const std::vector& buffers, + BufferVector* out) { std::vector fb_buffers; fb_buffers.reserve(buffers.size()); @@ -680,8 +688,9 @@ static Status WriteBuffers( } static Status MakeRecordBatch(FBB& fbb, int64_t length, int64_t body_length, - const std::vector& nodes, const std::vector& buffers, - RecordBatchOffset* offset) { + const std::vector& nodes, + const std::vector& buffers, + RecordBatchOffset* offset) { FieldNodeVector fb_nodes; BufferVector fb_buffers; @@ -693,17 +702,18 @@ static Status MakeRecordBatch(FBB& fbb, int64_t length, int64_t body_length, } Status WriteRecordBatchMessage(int64_t length, int64_t body_length, - const std::vector& nodes, const std::vector& buffers, - std::shared_ptr* out) { + const std::vector& nodes, + const std::vector& buffers, + std::shared_ptr* out) { FBB fbb; RecordBatchOffset record_batch; RETURN_NOT_OK(MakeRecordBatch(fbb, length, body_length, nodes, buffers, &record_batch)); - return WriteFBMessage( - fbb, flatbuf::MessageHeader_RecordBatch, record_batch.Union(), body_length, out); + return WriteFBMessage(fbb, flatbuf::MessageHeader_RecordBatch, record_batch.Union(), + body_length, out); } -Status WriteTensorMessage( - const Tensor& tensor, int64_t buffer_start_offset, std::shared_ptr* out) { +Status WriteTensorMessage(const Tensor& tensor, int64_t buffer_start_offset, + std::shared_ptr* out) { using TensorDimOffset = flatbuffers::Offset; using TensorOffset = flatbuffers::Offset; @@ -727,19 +737,20 @@ Status WriteTensorMessage( TensorOffset fb_tensor = flatbuf::CreateTensor(fbb, fb_type_type, fb_type, fb_shape, fb_strides, &buffer); - return WriteFBMessage( - fbb, flatbuf::MessageHeader_Tensor, fb_tensor.Union(), body_length, out); + return WriteFBMessage(fbb, flatbuf::MessageHeader_Tensor, fb_tensor.Union(), + body_length, out); } Status 
WriteDictionaryMessage(int64_t id, int64_t length, int64_t body_length, - const std::vector& nodes, const std::vector& buffers, - std::shared_ptr* out) { + const std::vector& nodes, + const std::vector& buffers, + std::shared_ptr* out) { FBB fbb; RecordBatchOffset record_batch; RETURN_NOT_OK(MakeRecordBatch(fbb, length, body_length, nodes, buffers, &record_batch)); auto dictionary_batch = flatbuf::CreateDictionaryBatch(fbb, id, record_batch).Union(); - return WriteFBMessage( - fbb, flatbuf::MessageHeader_DictionaryBatch, dictionary_batch, body_length, out); + return WriteFBMessage(fbb, flatbuf::MessageHeader_DictionaryBatch, dictionary_batch, + body_length, out); } static flatbuffers::Offset> @@ -754,8 +765,8 @@ FileBlocksToFlatbuffer(FBB& fbb, const std::vector& blocks) { } Status WriteFileFooter(const Schema& schema, const std::vector& dictionaries, - const std::vector& record_batches, DictionaryMemo* dictionary_memo, - io::OutputStream* out) { + const std::vector& record_batches, + DictionaryMemo* dictionary_memo, io::OutputStream* out) { FBB fbb; flatbuffers::Offset fb_schema; @@ -764,8 +775,8 @@ Status WriteFileFooter(const Schema& schema, const std::vector& dicti auto fb_dictionaries = FileBlocksToFlatbuffer(fbb, dictionaries); auto fb_record_batches = FileBlocksToFlatbuffer(fbb, record_batches); - auto footer = flatbuf::CreateFooter( - fbb, kCurrentMetadataVersion, fb_schema, fb_dictionaries, fb_record_batches); + auto footer = flatbuf::CreateFooter(fbb, kCurrentMetadataVersion, fb_schema, + fb_dictionaries, fb_record_batches); fbb.Finish(footer); @@ -780,8 +791,8 @@ Status WriteFileFooter(const Schema& schema, const std::vector& dicti DictionaryMemo::DictionaryMemo() {} // Returns KeyError if dictionary not found -Status DictionaryMemo::GetDictionary( - int64_t id, std::shared_ptr* dictionary) const { +Status DictionaryMemo::GetDictionary(int64_t id, + std::shared_ptr* dictionary) const { auto it = id_to_dictionary_.find(id); if (it == id_to_dictionary_.end()) { std::stringstream ss; @@ -817,8 +828,8 @@ bool DictionaryMemo::HasDictionaryId(int64_t id) const { return it != id_to_dictionary_.end(); } -Status DictionaryMemo::AddDictionary( - int64_t id, const std::shared_ptr& dictionary) { +Status DictionaryMemo::AddDictionary(int64_t id, + const std::shared_ptr& dictionary) { if (HasDictionaryId(id)) { std::stringstream ss; ss << "Dictionary with id " << id << " already exists"; @@ -835,8 +846,8 @@ Status DictionaryMemo::AddDictionary( class Message::MessageImpl { public: - explicit MessageImpl( - const std::shared_ptr& metadata, const std::shared_ptr& body) + explicit MessageImpl(const std::shared_ptr& metadata, + const std::shared_ptr& body) : metadata_(metadata), message_(nullptr), body_(body) {} Status Open() { @@ -897,43 +908,35 @@ class Message::MessageImpl { std::shared_ptr body_; }; -Message::Message( - const std::shared_ptr& metadata, const std::shared_ptr& body) { +Message::Message(const std::shared_ptr& metadata, + const std::shared_ptr& body) { impl_.reset(new MessageImpl(metadata, body)); } Status Message::Open(const std::shared_ptr& metadata, - const std::shared_ptr& body, std::unique_ptr* out) { + const std::shared_ptr& body, std::unique_ptr* out) { out->reset(new Message(metadata, body)); return (*out)->impl_->Open(); } Message::~Message() {} -std::shared_ptr Message::body() const { - return impl_->body(); -} +std::shared_ptr Message::body() const { return impl_->body(); } -std::shared_ptr Message::metadata() const { - return impl_->metadata(); -} +std::shared_ptr 
Message::metadata() const { return impl_->metadata(); } -Message::Type Message::type() const { - return impl_->type(); -} +Message::Type Message::type() const { return impl_->type(); } -MetadataVersion Message::metadata_version() const { - return impl_->version(); -} +MetadataVersion Message::metadata_version() const { return impl_->version(); } -const void* Message::header() const { - return impl_->header(); -} +const void* Message::header() const { return impl_->header(); } bool Message::Equals(const Message& other) const { int64_t metadata_bytes = std::min(metadata()->size(), other.metadata()->size()); - if (!metadata()->Equals(*other.metadata(), metadata_bytes)) { return false; } + if (!metadata()->Equals(*other.metadata(), metadata_bytes)) { + return false; + } // Compare bodies, if they have them auto this_body = body(); @@ -1012,7 +1015,7 @@ Status GetDictionaryTypes(const void* opaque_schema, DictionaryTypeMap* id_to_fi } Status GetSchema(const void* opaque_schema, const DictionaryMemo& dictionary_memo, - std::shared_ptr* out) { + std::shared_ptr* out) { auto schema = static_cast(opaque_schema); int num_fields = static_cast(schema->fields()->size()); @@ -1036,8 +1039,8 @@ Status GetSchema(const void* opaque_schema, const DictionaryMemo& dictionary_mem } Status GetTensorMetadata(const Buffer& metadata, std::shared_ptr* type, - std::vector* shape, std::vector* strides, - std::vector* dim_names) { + std::vector* shape, std::vector* strides, + std::vector* dim_names) { auto message = flatbuf::GetMessage(metadata.data()); auto tensor = reinterpret_cast(message->header()); @@ -1068,7 +1071,8 @@ Status GetTensorMetadata(const Buffer& metadata, std::shared_ptr* type // Read and write messages static Status ReadFullMessage(const std::shared_ptr& metadata, - io::InputStream* stream, std::unique_ptr* message) { + io::InputStream* stream, + std::unique_ptr* message) { auto fb_message = flatbuf::GetMessage(metadata->data()); int64_t body_length = fb_message->bodyLength(); @@ -1087,7 +1091,7 @@ static Status ReadFullMessage(const std::shared_ptr& metadata, } Status ReadMessage(int64_t offset, int32_t metadata_length, io::RandomAccessFile* file, - std::unique_ptr* message) { + std::unique_ptr* message) { std::shared_ptr buffer; RETURN_NOT_OK(file->ReadAt(offset, metadata_length, &buffer)); @@ -1141,8 +1145,8 @@ InputStreamMessageReader::~InputStreamMessageReader() {} // ---------------------------------------------------------------------- // Implement message writing -Status WriteMessage( - const Buffer& message, io::OutputStream* file, int32_t* message_length) { +Status WriteMessage(const Buffer& message, io::OutputStream* file, + int32_t* message_length) { // Need to write 4 bytes (message size), the message, plus padding to // end on an 8-byte offset int64_t start_offset; @@ -1151,7 +1155,9 @@ Status WriteMessage( int32_t padded_message_length = static_cast(message.size()) + 4; const int32_t remainder = (padded_message_length + static_cast(start_offset)) % 8; - if (remainder != 0) { padded_message_length += 8 - remainder; } + if (remainder != 0) { + padded_message_length += 8 - remainder; + } // The returned message size includes the length prefix, the flatbuffer, // plus padding @@ -1167,7 +1173,9 @@ Status WriteMessage( // Write any padding int32_t padding = padded_message_length - static_cast(message.size()) - 4; - if (padding > 0) { RETURN_NOT_OK(file->Write(kPaddingBytes, padding)); } + if (padding > 0) { + RETURN_NOT_OK(file->Write(kPaddingBytes, padding)); + } return Status::OK(); } 
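// ----------------------------------------------------------------------------
// [Editorial sketch, not part of the patch] The WriteMessage hunk above keeps the
// existing framing rule: the 4-byte length prefix plus the serialized flatbuffer
// metadata are padded so that the following write starts on an 8-byte boundary.
// A standalone illustration of that arithmetic (hypothetical helper that mirrors
// the reformatted logic; assumes <cstdint>):

static int32_t PaddedMessageLength(int64_t start_offset, int32_t flatbuffer_size) {
  // Serialized flatbuffer plus its 4-byte length prefix.
  int32_t padded = flatbuffer_size + 4;
  const int32_t remainder = (padded + static_cast<int32_t>(start_offset)) % 8;
  if (remainder != 0) {
    padded += 8 - remainder;  // round up to the next multiple of 8
  }
  // Example: start_offset = 0, flatbuffer_size = 10 -> 14 rounded up to 16.
  return padded;
}
// ----------------------------------------------------------------------------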
diff --git a/cpp/src/arrow/ipc/metadata.h b/cpp/src/arrow/ipc/metadata.h index 614f7a6a922..90e4defd6a3 100644 --- a/cpp/src/arrow/ipc/metadata.h +++ b/cpp/src/arrow/ipc/metadata.h @@ -133,11 +133,14 @@ Status GetDictionaryTypes(const void* opaque_schema, DictionaryTypeMap* id_to_fi // Construct a complete Schema from the message. May be expensive for very // large schemas if you are only interested in a few fields Status ARROW_EXPORT GetSchema(const void* opaque_schema, - const DictionaryMemo& dictionary_memo, std::shared_ptr* out); + const DictionaryMemo& dictionary_memo, + std::shared_ptr* out); Status ARROW_EXPORT GetTensorMetadata(const Buffer& metadata, - std::shared_ptr* type, std::vector* shape, - std::vector* strides, std::vector* dim_names); + std::shared_ptr* type, + std::vector* shape, + std::vector* strides, + std::vector* dim_names); /// \brief An IPC message including metadata and body class ARROW_EXPORT Message { @@ -157,7 +160,7 @@ class ARROW_EXPORT Message { /// \param[in] body a buffer containing the message body, which may be nullptr /// \param[out] out the created message static Status Open(const std::shared_ptr& metadata, - const std::shared_ptr& body, std::unique_ptr* out); + const std::shared_ptr& body, std::unique_ptr* out); /// \brief Write length-prefixed metadata and body to output stream /// @@ -242,22 +245,23 @@ class ARROW_EXPORT InputStreamMessageReader : public MessageReader { /// \param[out] message the message read /// \return Status success or failure Status ARROW_EXPORT ReadMessage(int64_t offset, int32_t metadata_length, - io::RandomAccessFile* file, std::unique_ptr* message); + io::RandomAccessFile* file, + std::unique_ptr* message); /// \brief Read encapulated RPC message (metadata and body) from InputStream /// /// Read length-prefixed message with as-yet unknown length. Returns nullptr if /// there are not enough bytes available or the message length is 0 (e.g. 
EOS /// in a stream) -Status ARROW_EXPORT ReadMessage( - io::InputStream* stream, std::unique_ptr* message); +Status ARROW_EXPORT ReadMessage(io::InputStream* stream, + std::unique_ptr* message); /// Write a serialized message metadata with a length-prefix and padding to an /// 8-byte offset /// /// -Status ARROW_EXPORT WriteMessage( - const Buffer& message, io::OutputStream* file, int32_t* message_length); +Status ARROW_EXPORT WriteMessage(const Buffer& message, io::OutputStream* file, + int32_t* message_length); // Serialize arrow::Schema as a Flatbuffer // @@ -266,23 +270,26 @@ Status ARROW_EXPORT WriteMessage( // dictionary ids // \param[out] out the serialized arrow::Buffer // \return Status outcome -Status ARROW_EXPORT WriteSchemaMessage( - const Schema& schema, DictionaryMemo* dictionary_memo, std::shared_ptr* out); +Status ARROW_EXPORT WriteSchemaMessage(const Schema& schema, + DictionaryMemo* dictionary_memo, + std::shared_ptr* out); Status ARROW_EXPORT WriteRecordBatchMessage(int64_t length, int64_t body_length, - const std::vector& nodes, const std::vector& buffers, - std::shared_ptr* out); + const std::vector& nodes, + const std::vector& buffers, + std::shared_ptr* out); -Status ARROW_EXPORT WriteTensorMessage( - const Tensor& tensor, int64_t buffer_start_offset, std::shared_ptr* out); +Status ARROW_EXPORT WriteTensorMessage(const Tensor& tensor, int64_t buffer_start_offset, + std::shared_ptr* out); Status WriteDictionaryMessage(int64_t id, int64_t length, int64_t body_length, - const std::vector& nodes, const std::vector& buffers, - std::shared_ptr* out); + const std::vector& nodes, + const std::vector& buffers, + std::shared_ptr* out); Status WriteFileFooter(const Schema& schema, const std::vector& dictionaries, - const std::vector& record_batches, DictionaryMemo* dictionary_memo, - io::OutputStream* out); + const std::vector& record_batches, + DictionaryMemo* dictionary_memo, io::OutputStream* out); } // namespace ipc } // namespace arrow diff --git a/cpp/src/arrow/ipc/reader.cc b/cpp/src/arrow/ipc/reader.cc index 88ab33087b6..8ae82804c31 100644 --- a/cpp/src/arrow/ipc/reader.cc +++ b/cpp/src/arrow/ipc/reader.cc @@ -95,12 +95,12 @@ struct ArrayLoaderContext { }; static Status LoadArray(const std::shared_ptr& type, - ArrayLoaderContext* context, internal::ArrayData* out); + ArrayLoaderContext* context, internal::ArrayData* out); class ArrayLoader { public: ArrayLoader(const std::shared_ptr& type, internal::ArrayData* out, - ArrayLoaderContext* context) + ArrayLoaderContext* context) : type_(type), context_(context), out_(out) {} Status Load() { @@ -184,7 +184,7 @@ class ArrayLoader { typename std::enable_if::value && !std::is_base_of::value && !std::is_base_of::value, - Status>::type + Status>::type Visit(const T& type) { return LoadPrimitive(); } @@ -252,18 +252,18 @@ class ArrayLoader { }; static Status LoadArray(const std::shared_ptr& type, - ArrayLoaderContext* context, internal::ArrayData* out) { + ArrayLoaderContext* context, internal::ArrayData* out) { ArrayLoader loader(type, out, context); return loader.Load(); } Status ReadRecordBatch(const Buffer& metadata, const std::shared_ptr& schema, - io::RandomAccessFile* file, std::shared_ptr* out) { + io::RandomAccessFile* file, std::shared_ptr* out) { return ReadRecordBatch(metadata, schema, kMaxNestingDepth, file, out); } Status ReadRecordBatch(const Message& message, const std::shared_ptr& schema, - std::shared_ptr* out) { + std::shared_ptr* out) { io::BufferReader reader(message.body()); DCHECK_EQ(message.type(), 
Message::RECORD_BATCH); return ReadRecordBatch(*message.metadata(), schema, kMaxNestingDepth, &reader, out); @@ -273,8 +273,9 @@ Status ReadRecordBatch(const Message& message, const std::shared_ptr& sc // Array loading static Status LoadRecordBatchFromSource(const std::shared_ptr& schema, - int64_t num_rows, int max_recursion_depth, IpcComponentSource* source, - std::shared_ptr* out) { + int64_t num_rows, int max_recursion_depth, + IpcComponentSource* source, + std::shared_ptr* out) { ArrayLoaderContext context; context.source = source; context.field_index = 0; @@ -294,16 +295,17 @@ static Status LoadRecordBatchFromSource(const std::shared_ptr& schema, } static inline Status ReadRecordBatch(const flatbuf::RecordBatch* metadata, - const std::shared_ptr& schema, int max_recursion_depth, - io::RandomAccessFile* file, std::shared_ptr* out) { + const std::shared_ptr& schema, + int max_recursion_depth, io::RandomAccessFile* file, + std::shared_ptr* out) { IpcComponentSource source(metadata, file); - return LoadRecordBatchFromSource( - schema, metadata->length(), max_recursion_depth, &source, out); + return LoadRecordBatchFromSource(schema, metadata->length(), max_recursion_depth, + &source, out); } Status ReadRecordBatch(const Buffer& metadata, const std::shared_ptr& schema, - int max_recursion_depth, io::RandomAccessFile* file, - std::shared_ptr* out) { + int max_recursion_depth, io::RandomAccessFile* file, + std::shared_ptr* out) { auto message = flatbuf::GetMessage(metadata.data()); if (message->header_type() != flatbuf::MessageHeader_RecordBatch) { DCHECK_EQ(message->header_type(), flatbuf::MessageHeader_RecordBatch); @@ -313,7 +315,8 @@ Status ReadRecordBatch(const Buffer& metadata, const std::shared_ptr& sc } Status ReadDictionary(const Buffer& metadata, const DictionaryTypeMap& dictionary_types, - io::RandomAccessFile* file, int64_t* dictionary_id, std::shared_ptr* out) { + io::RandomAccessFile* file, int64_t* dictionary_id, + std::shared_ptr* out) { auto message = flatbuf::GetMessage(metadata.data()); auto dictionary_batch = reinterpret_cast(message->header()); @@ -347,7 +350,7 @@ Status ReadDictionary(const Buffer& metadata, const DictionaryTypeMap& dictionar } static Status ReadMessageAndValidate(MessageReader* reader, Message::Type expected_type, - bool allow_null, std::unique_ptr* message) { + bool allow_null, std::unique_ptr* message) { RETURN_NOT_OK(reader->ReadNextMessage(message)); if (!(*message) && !allow_null) { @@ -357,7 +360,9 @@ static Status ReadMessageAndValidate(MessageReader* reader, Message::Type expect return Status::Invalid(ss.str()); } - if ((*message) == nullptr) { return Status::OK(); } + if ((*message) == nullptr) { + return Status::OK(); + } if ((*message)->type() != expected_type) { std::stringstream ss; @@ -389,15 +394,15 @@ class RecordBatchStreamReader::RecordBatchStreamReaderImpl { Status ReadNextDictionary() { std::unique_ptr message; - RETURN_NOT_OK(ReadMessageAndValidate( - message_reader_.get(), Message::DICTIONARY_BATCH, false, &message)); + RETURN_NOT_OK(ReadMessageAndValidate(message_reader_.get(), Message::DICTIONARY_BATCH, + false, &message)); io::BufferReader reader(message->body()); std::shared_ptr dictionary; int64_t id; - RETURN_NOT_OK(ReadDictionary( - *message->metadata(), dictionary_types_, &reader, &id, &dictionary)); + RETURN_NOT_OK(ReadDictionary(*message->metadata(), dictionary_types_, &reader, &id, + &dictionary)); return dictionary_memo_.AddDictionary(id, dictionary); } @@ -420,8 +425,8 @@ class 
RecordBatchStreamReader::RecordBatchStreamReaderImpl { Status ReadNextRecordBatch(std::shared_ptr* batch) { std::unique_ptr message; - RETURN_NOT_OK(ReadMessageAndValidate( - message_reader_.get(), Message::RECORD_BATCH, true, &message)); + RETURN_NOT_OK(ReadMessageAndValidate(message_reader_.get(), Message::RECORD_BATCH, + true, &message)); if (message == nullptr) { // End of stream @@ -451,14 +456,14 @@ RecordBatchStreamReader::RecordBatchStreamReader() { RecordBatchStreamReader::~RecordBatchStreamReader() {} Status RecordBatchStreamReader::Open(std::unique_ptr message_reader, - std::shared_ptr* reader) { + std::shared_ptr* reader) { // Private ctor *reader = std::shared_ptr(new RecordBatchStreamReader()); return (*reader)->impl_->Open(std::move(message_reader)); } Status RecordBatchStreamReader::Open(const std::shared_ptr& stream, - std::shared_ptr* out) { + std::shared_ptr* out) { std::unique_ptr message_reader(new InputStreamMessageReader(stream)); return Open(std::move(message_reader), out); } @@ -502,8 +507,8 @@ class RecordBatchFileReader::RecordBatchFileReaderImpl { } // Now read the footer - RETURN_NOT_OK(file_->ReadAt( - footer_offset_ - footer_length - file_end_size, footer_length, &footer_buffer_)); + RETURN_NOT_OK(file_->ReadAt(footer_offset_ - footer_length - file_end_size, + footer_length, &footer_buffer_)); // TODO(wesm): Verify the footer footer_ = flatbuf::GetFooter(footer_buffer_->data()); @@ -568,7 +573,7 @@ class RecordBatchFileReader::RecordBatchFileReaderImpl { std::shared_ptr dictionary; int64_t dictionary_id; RETURN_NOT_OK(ReadDictionary(*message->metadata(), dictionary_fields_, &reader, - &dictionary_id, &dictionary)); + &dictionary_id, &dictionary)); RETURN_NOT_OK(dictionary_memo_->AddDictionary(dictionary_id, dictionary)); } @@ -610,37 +615,34 @@ RecordBatchFileReader::RecordBatchFileReader() { RecordBatchFileReader::~RecordBatchFileReader() {} Status RecordBatchFileReader::Open(const std::shared_ptr& file, - std::shared_ptr* reader) { + std::shared_ptr* reader) { int64_t footer_offset; RETURN_NOT_OK(file->GetSize(&footer_offset)); return Open(file, footer_offset, reader); } Status RecordBatchFileReader::Open(const std::shared_ptr& file, - int64_t footer_offset, std::shared_ptr* reader) { + int64_t footer_offset, + std::shared_ptr* reader) { *reader = std::shared_ptr(new RecordBatchFileReader()); return (*reader)->impl_->Open(file, footer_offset); } -std::shared_ptr RecordBatchFileReader::schema() const { - return impl_->schema(); -} +std::shared_ptr RecordBatchFileReader::schema() const { return impl_->schema(); } int RecordBatchFileReader::num_record_batches() const { return impl_->num_record_batches(); } -MetadataVersion RecordBatchFileReader::version() const { - return impl_->version(); -} +MetadataVersion RecordBatchFileReader::version() const { return impl_->version(); } -Status RecordBatchFileReader::ReadRecordBatch( - int i, std::shared_ptr* batch) { +Status RecordBatchFileReader::ReadRecordBatch(int i, + std::shared_ptr* batch) { return impl_->ReadRecordBatch(i, batch); } -static Status ReadContiguousPayload( - int64_t offset, io::RandomAccessFile* file, std::unique_ptr* message) { +static Status ReadContiguousPayload(int64_t offset, io::RandomAccessFile* file, + std::unique_ptr* message) { std::shared_ptr buffer; RETURN_NOT_OK(file->Seek(offset)); RETURN_NOT_OK(ReadMessage(file, message)); @@ -652,16 +654,16 @@ static Status ReadContiguousPayload( } Status ReadRecordBatch(const std::shared_ptr& schema, int64_t offset, - io::RandomAccessFile* file, 
std::shared_ptr* out) { + io::RandomAccessFile* file, std::shared_ptr* out) { std::unique_ptr message; RETURN_NOT_OK(ReadContiguousPayload(offset, file, &message)); io::BufferReader buffer_reader(message->body()); - return ReadRecordBatch( - *message->metadata(), schema, kMaxNestingDepth, &buffer_reader, out); + return ReadRecordBatch(*message->metadata(), schema, kMaxNestingDepth, &buffer_reader, + out); } -Status ReadTensor( - int64_t offset, io::RandomAccessFile* file, std::shared_ptr* out) { +Status ReadTensor(int64_t offset, io::RandomAccessFile* file, + std::shared_ptr* out) { // Respect alignment of Tensor messages (see WriteTensor) offset = PaddedLength(offset); std::unique_ptr message; diff --git a/cpp/src/arrow/ipc/reader.h b/cpp/src/arrow/ipc/reader.h index d6c26147501..c0d3fb1f185 100644 --- a/cpp/src/arrow/ipc/reader.h +++ b/cpp/src/arrow/ipc/reader.h @@ -72,7 +72,7 @@ class ARROW_EXPORT RecordBatchStreamReader : public RecordBatchReader { /// \param(out) out the created RecordBatchStreamReader object /// \return Status static Status Open(std::unique_ptr message_reader, - std::shared_ptr* out); + std::shared_ptr* out); /// \Create Record batch stream reader from InputStream /// @@ -80,7 +80,7 @@ class ARROW_EXPORT RecordBatchStreamReader : public RecordBatchReader { /// \param(out) out the created RecordBatchStreamReader object /// \return Status static Status Open(const std::shared_ptr& stream, - std::shared_ptr* out); + std::shared_ptr* out); std::shared_ptr schema() const override; Status ReadNextRecordBatch(std::shared_ptr* batch) override; @@ -103,7 +103,7 @@ class ARROW_EXPORT RecordBatchFileReader { // need only locate the end of the Arrow file stream to discover the metadata // and then proceed to read the data into memory. static Status Open(const std::shared_ptr& file, - std::shared_ptr* reader); + std::shared_ptr* reader); // If the file is embedded within some larger file or memory region, you can // pass the absolute memory offset to the end of the file (which contains the @@ -113,7 +113,8 @@ class ARROW_EXPORT RecordBatchFileReader { // @param file: the data source // @param footer_offset: the position of the end of the Arrow "file" static Status Open(const std::shared_ptr& file, - int64_t footer_offset, std::shared_ptr* reader); + int64_t footer_offset, + std::shared_ptr* reader); /// The schema includes any dictionaries std::shared_ptr schema() const; @@ -148,8 +149,9 @@ class ARROW_EXPORT RecordBatchFileReader { /// \param(in) file a random access file /// \param(out) out the read record batch Status ARROW_EXPORT ReadRecordBatch(const Buffer& metadata, - const std::shared_ptr& schema, io::RandomAccessFile* file, - std::shared_ptr* out); + const std::shared_ptr& schema, + io::RandomAccessFile* file, + std::shared_ptr* out); /// \brief Read record batch from fully encapulated Message /// @@ -158,7 +160,8 @@ Status ARROW_EXPORT ReadRecordBatch(const Buffer& metadata, /// \param[out] out the resulting RecordBatch /// \return Status Status ARROW_EXPORT ReadRecordBatch(const Message& message, - const std::shared_ptr& schema, std::shared_ptr* out); + const std::shared_ptr& schema, + std::shared_ptr* out); /// Read record batch from file given metadata and schema /// @@ -168,8 +171,9 @@ Status ARROW_EXPORT ReadRecordBatch(const Message& message, /// \param(in) max_recursion_depth the maximum permitted nesting depth /// \param(out) out the read record batch Status ARROW_EXPORT ReadRecordBatch(const Buffer& metadata, - const std::shared_ptr& schema, int 
max_recursion_depth, - io::RandomAccessFile* file, std::shared_ptr* out); + const std::shared_ptr& schema, + int max_recursion_depth, io::RandomAccessFile* file, + std::shared_ptr* out); /// Read record batch as encapsulated IPC message with metadata size prefix and /// header @@ -179,15 +183,16 @@ Status ARROW_EXPORT ReadRecordBatch(const Buffer& metadata, /// \param(in) file the file where the batch is located /// \param(out) out the read record batch Status ARROW_EXPORT ReadRecordBatch(const std::shared_ptr& schema, int64_t offset, - io::RandomAccessFile* file, std::shared_ptr* out); + io::RandomAccessFile* file, + std::shared_ptr* out); /// EXPERIMENTAL: Read arrow::Tensor as encapsulated IPC message in file /// /// \param(in) offset the file location of the start of the message /// \param(in) file the file where the batch is located /// \param(out) out the read tensor -Status ARROW_EXPORT ReadTensor( - int64_t offset, io::RandomAccessFile* file, std::shared_ptr* out); +Status ARROW_EXPORT ReadTensor(int64_t offset, io::RandomAccessFile* file, + std::shared_ptr* out); /// Backwards-compatibility for Arrow < 0.4.0 /// diff --git a/cpp/src/arrow/ipc/stream-to-file.cc b/cpp/src/arrow/ipc/stream-to-file.cc index de658839101..33719b3c89c 100644 --- a/cpp/src/arrow/ipc/stream-to-file.cc +++ b/cpp/src/arrow/ipc/stream-to-file.cc @@ -15,11 +15,11 @@ // specific language governing permissions and limitations // under the License. +#include #include "arrow/io/file.h" #include "arrow/ipc/reader.h" #include "arrow/ipc/writer.h" #include "arrow/status.h" -#include #include "arrow/util/io-util.h" diff --git a/cpp/src/arrow/ipc/test-common.h b/cpp/src/arrow/ipc/test-common.h index 67a41ba086b..cb827372d21 100644 --- a/cpp/src/arrow/ipc/test-common.h +++ b/cpp/src/arrow/ipc/test-common.h @@ -69,8 +69,8 @@ static inline void CompareBatch(const RecordBatch& left, const RecordBatch& righ } } -static inline void CompareArraysDetailed( - int index, const Array& result, const Array& expected) { +static inline void CompareArraysDetailed(int index, const Array& result, + const Array& expected) { if (!expected.Equals(result)) { std::stringstream pp_result; std::stringstream pp_expected; @@ -83,8 +83,8 @@ static inline void CompareArraysDetailed( } } -static inline void CompareBatchColumnsDetailed( - const RecordBatch& result, const RecordBatch& expected) { +static inline void CompareBatchColumnsDetailed(const RecordBatch& result, + const RecordBatch& expected) { for (int i = 0; i < expected.num_columns(); ++i) { auto left = result.column(i); auto right = expected.column(i); @@ -95,16 +95,16 @@ static inline void CompareBatchColumnsDetailed( const auto kListInt32 = list(int32()); const auto kListListInt32 = list(kListInt32); -Status MakeRandomInt32Array( - int64_t length, bool include_nulls, MemoryPool* pool, std::shared_ptr* out) { +Status MakeRandomInt32Array(int64_t length, bool include_nulls, MemoryPool* pool, + std::shared_ptr* out) { std::shared_ptr data; RETURN_NOT_OK(test::MakeRandomInt32PoolBuffer(length, pool, &data)); Int32Builder builder(pool, int32()); if (include_nulls) { std::shared_ptr valid_bytes; RETURN_NOT_OK(test::MakeRandomBytePoolBuffer(length, pool, &valid_bytes)); - RETURN_NOT_OK(builder.Append( - reinterpret_cast(data->data()), length, valid_bytes->data())); + RETURN_NOT_OK(builder.Append(reinterpret_cast(data->data()), length, + valid_bytes->data())); return builder.Finish(out); } RETURN_NOT_OK(builder.Append(reinterpret_cast(data->data()), length)); @@ -112,7 +112,8 @@ Status 
MakeRandomInt32Array( } Status MakeRandomListArray(const std::shared_ptr& child_array, int num_lists, - bool include_nulls, MemoryPool* pool, std::shared_ptr* out) { + bool include_nulls, MemoryPool* pool, + std::shared_ptr* out) { // Create the null list values std::vector valid_lists(num_lists); const double null_percent = include_nulls ? 0.1 : 0; @@ -129,15 +130,16 @@ Status MakeRandomListArray(const std::shared_ptr& child_array, int num_li test::rand_uniform_int(num_lists, seed, 0, max_list_size, list_sizes.data()); // make sure sizes are consistent with null std::transform(list_sizes.begin(), list_sizes.end(), valid_lists.begin(), - list_sizes.begin(), - [](int32_t size, int32_t valid) { return valid == 0 ? 0 : size; }); + list_sizes.begin(), + [](int32_t size, int32_t valid) { return valid == 0 ? 0 : size; }); std::partial_sum(list_sizes.begin(), list_sizes.end(), ++offsets.begin()); // Force invariants const int32_t child_length = static_cast(child_array->length()); offsets[0] = 0; std::replace_if(offsets.begin(), offsets.end(), - [child_length](int32_t offset) { return offset > child_length; }, child_length); + [child_length](int32_t offset) { return offset > child_length; }, + child_length); } offsets[num_lists] = static_cast(child_array->length()); @@ -148,14 +150,14 @@ Status MakeRandomListArray(const std::shared_ptr& child_array, int num_li RETURN_NOT_OK(test::CopyBufferFromVector(offsets, pool, &offsets_buffer)); *out = std::make_shared(list(child_array->type()), num_lists, offsets_buffer, - child_array, null_bitmap, kUnknownNullCount); + child_array, null_bitmap, kUnknownNullCount); return ValidateArray(**out); } typedef Status MakeRecordBatch(std::shared_ptr* out); -Status MakeRandomBooleanArray( - const int length, bool include_nulls, std::shared_ptr* out) { +Status MakeRandomBooleanArray(const int length, bool include_nulls, + std::shared_ptr* out) { std::vector values(length); test::random_null_bytes(length, 0.5, values.data()); std::shared_ptr data; @@ -210,10 +212,10 @@ Status MakeIntRecordBatch(std::shared_ptr* out) { } template -Status MakeRandomBinaryArray( - int64_t length, bool include_nulls, MemoryPool* pool, std::shared_ptr* out) { - const std::vector values = { - "", "", "abc", "123", "efg", "456!@#!@#", "12312"}; +Status MakeRandomBinaryArray(int64_t length, bool include_nulls, MemoryPool* pool, + std::shared_ptr* out) { + const std::vector values = {"", "", "abc", "123", + "efg", "456!@#!@#", "12312"}; Builder builder(pool); const size_t values_len = values.size(); for (int64_t i = 0; i < length; ++i) { @@ -223,7 +225,7 @@ Status MakeRandomBinaryArray( } else { const std::string& value = values[values_index]; RETURN_NOT_OK(builder.Append(reinterpret_cast(value.data()), - static_cast(value.size()))); + static_cast(value.size()))); } } return builder.Finish(out); @@ -331,7 +333,7 @@ Status MakeNonNullRecordBatch(std::shared_ptr* out) { Status MakeDeeplyNestedList(std::shared_ptr* out) { const int batch_length = 5; - TypePtr type = int32(); + auto type = int32(); MemoryPool* pool = default_memory_pool(); std::shared_ptr array; @@ -434,11 +436,12 @@ Status MakeUnion(std::shared_ptr* out) { // construct individual nullable/non-nullable struct arrays auto sparse_no_nulls = std::make_shared(sparse_type, length, sparse_children, type_ids_buffer); - auto sparse = std::make_shared( - sparse_type, length, sparse_children, type_ids_buffer, nullptr, null_bitmask, 1); + auto sparse = std::make_shared(sparse_type, length, sparse_children, + type_ids_buffer, nullptr, 
null_bitmask, 1); - auto dense = std::make_shared(dense_type, length, dense_children, - type_ids_buffer, offsets_buffer, null_bitmask, 1); + auto dense = + std::make_shared(dense_type, length, dense_children, type_ids_buffer, + offsets_buffer, null_bitmask, 1); // construct batch std::vector> arrays = {sparse_no_nulls, sparse, dense}; @@ -480,8 +483,8 @@ Status MakeDictionary(std::shared_ptr* out) { std::vector list_offsets = {0, 0, 2, 2, 5, 6, 9}; std::shared_ptr offsets, indices3; - ArrayFromVector( - std::vector(list_offsets.size(), true), list_offsets, &offsets); + ArrayFromVector(std::vector(list_offsets.size(), true), + list_offsets, &offsets); std::vector indices3_values = {0, 1, 2, 0, 1, 2, 0, 1, 2}; std::vector is_valid3(9, true); @@ -490,8 +493,8 @@ Status MakeDictionary(std::shared_ptr* out) { std::shared_ptr null_bitmap; RETURN_NOT_OK(test::GetBitmapFromVector(is_valid, &null_bitmap)); - std::shared_ptr a3 = std::make_shared(f3_type, length, - std::static_pointer_cast(offsets)->values(), + std::shared_ptr a3 = std::make_shared( + f3_type, length, std::static_pointer_cast(offsets)->values(), std::make_shared(f1_type, indices3), null_bitmap, 1); // Dictionary-encoded list of integer @@ -500,14 +503,15 @@ Status MakeDictionary(std::shared_ptr* out) { std::shared_ptr offsets4, values4, indices4; std::vector list_offsets4 = {0, 2, 2, 3}; - ArrayFromVector( - std::vector(4, true), list_offsets4, &offsets4); + ArrayFromVector(std::vector(4, true), list_offsets4, + &offsets4); std::vector list_values4 = {0, 1, 2}; ArrayFromVector(std::vector(3, true), list_values4, &values4); - auto dict3 = std::make_shared(f4_value_type, 3, - std::static_pointer_cast(offsets4)->values(), values4); + auto dict3 = std::make_shared( + f4_value_type, 3, std::static_pointer_cast(offsets4)->values(), + values4); std::vector indices4_values = {0, 1, 2, 0, 1, 2}; ArrayFromVector(is_valid, indices4_values, &indices4); @@ -516,9 +520,9 @@ Status MakeDictionary(std::shared_ptr* out) { auto a4 = std::make_shared(f4_type, indices4); // construct batch - std::shared_ptr schema(new Schema({field("dict1", f0_type), - field("sparse", f1_type), field("dense", f2_type), - field("list of encoded string", f3_type), field("encoded list", f4_type)})); + std::shared_ptr schema(new Schema( + {field("dict1", f0_type), field("sparse", f1_type), field("dense", f2_type), + field("list of encoded string", f3_type), field("encoded list", f4_type)})); std::vector> arrays = {a0, a1, a2, a3, a4}; @@ -575,7 +579,8 @@ Status MakeDates(std::shared_ptr* out) { ArrayFromVector(is_valid, date32_values, &date32_array); std::vector date64_values = {1489269000000, 1489270000000, 1489271000000, - 1489272000000, 1489272000000, 1489273000000, 1489274000000}; + 1489272000000, 1489272000000, 1489273000000, + 1489274000000}; std::shared_ptr date64_array; ArrayFromVector(is_valid, date64_values, &date64_array); @@ -592,7 +597,7 @@ Status MakeTimestamps(std::shared_ptr* out) { std::shared_ptr schema(new Schema({f0, f1, f2})); std::vector ts_values = {1489269000000, 1489270000000, 1489271000000, - 1489272000000, 1489272000000, 1489273000000}; + 1489272000000, 1489272000000, 1489273000000}; std::shared_ptr a0, a1, a2; ArrayFromVector(f0->type(), is_valid, ts_values, &a0); @@ -612,10 +617,10 @@ Status MakeTimes(std::shared_ptr* out) { auto f3 = field("f3", time64(TimeUnit::NANO)); std::shared_ptr schema(new Schema({f0, f1, f2, f3})); - std::vector t32_values = { - 1489269000, 1489270000, 1489271000, 1489272000, 1489272000, 1489273000}; + std::vector 
t32_values = {1489269000, 1489270000, 1489271000, + 1489272000, 1489272000, 1489273000}; std::vector t64_values = {1489269000000, 1489270000000, 1489271000000, - 1489272000000, 1489272000000, 1489273000000}; + 1489272000000, 1489272000000, 1489273000000}; std::shared_ptr a0, a1, a2, a3; ArrayFromVector(f0->type(), is_valid, t32_values, &a0); @@ -630,7 +635,7 @@ Status MakeTimes(std::shared_ptr* out) { template void AppendValues(const std::vector& is_valid, const std::vector& values, - BuilderType* builder) { + BuilderType* builder) { for (size_t i = 0; i < values.size(); ++i) { if (is_valid[i]) { ASSERT_OK(builder->Append(values[i])); diff --git a/cpp/src/arrow/ipc/writer.cc b/cpp/src/arrow/ipc/writer.cc index 14708a1e7a0..163b27b4433 100644 --- a/cpp/src/arrow/ipc/writer.cc +++ b/cpp/src/arrow/ipc/writer.cc @@ -45,8 +45,9 @@ namespace ipc { // Record batch write path static inline Status GetTruncatedBitmap(int64_t offset, int64_t length, - const std::shared_ptr input, MemoryPool* pool, - std::shared_ptr* buffer) { + const std::shared_ptr input, + MemoryPool* pool, + std::shared_ptr* buffer) { if (!input) { *buffer = input; return Status::OK(); @@ -63,8 +64,8 @@ static inline Status GetTruncatedBitmap(int64_t offset, int64_t length, template inline Status GetTruncatedBuffer(int64_t offset, int64_t length, - const std::shared_ptr input, MemoryPool* pool, - std::shared_ptr* buffer) { + const std::shared_ptr input, MemoryPool* pool, + std::shared_ptr* buffer) { if (!input) { *buffer = input; return Status::OK(); @@ -80,17 +81,19 @@ inline Status GetTruncatedBuffer(int64_t offset, int64_t length, return Status::OK(); } -static inline bool NeedTruncate( - int64_t offset, const Buffer* buffer, int64_t min_length) { +static inline bool NeedTruncate(int64_t offset, const Buffer* buffer, + int64_t min_length) { // buffer can be NULL - if (buffer == nullptr) { return false; } + if (buffer == nullptr) { + return false; + } return offset != 0 || min_length < buffer->size(); } class RecordBatchSerializer : public ArrayVisitor { public: RecordBatchSerializer(MemoryPool* pool, int64_t buffer_start_offset, - int max_recursion_depth, bool allow_64bit) + int max_recursion_depth, bool allow_64bit) : pool_(pool), max_recursion_depth_(max_recursion_depth), buffer_start_offset_(buffer_start_offset), @@ -114,8 +117,8 @@ class RecordBatchSerializer : public ArrayVisitor { if (arr.null_count() > 0) { std::shared_ptr bitmap; - RETURN_NOT_OK(GetTruncatedBitmap( - arr.offset(), arr.length(), arr.null_bitmap(), pool_, &bitmap)); + RETURN_NOT_OK(GetTruncatedBitmap(arr.offset(), arr.length(), arr.null_bitmap(), + pool_, &bitmap)); buffers_.push_back(bitmap); } else { // Push a dummy zero-length buffer, not to be copied @@ -175,14 +178,14 @@ class RecordBatchSerializer : public ArrayVisitor { } // Override this for writing dictionary metadata - virtual Status WriteMetadataMessage( - int64_t num_rows, int64_t body_length, std::shared_ptr* out) { - return WriteRecordBatchMessage( - num_rows, body_length, field_nodes_, buffer_meta_, out); + virtual Status WriteMetadataMessage(int64_t num_rows, int64_t body_length, + std::shared_ptr* out) { + return WriteRecordBatchMessage(num_rows, body_length, field_nodes_, buffer_meta_, + out); } Status Write(const RecordBatch& batch, io::OutputStream* dst, int32_t* metadata_length, - int64_t* body_length) { + int64_t* body_length) { RETURN_NOT_OK(Assemble(batch, body_length)); #ifndef NDEBUG @@ -216,9 +219,13 @@ class RecordBatchSerializer : public ArrayVisitor { padding = 
BitUtil::RoundUpToMultipleOf64(size) - size; } - if (size > 0) { RETURN_NOT_OK(dst->Write(buffer->data(), size)); } + if (size > 0) { + RETURN_NOT_OK(dst->Write(buffer->data(), size)); + } - if (padding > 0) { RETURN_NOT_OK(dst->Write(kPaddingBytes, padding)); } + if (padding > 0) { + RETURN_NOT_OK(dst->Write(kPaddingBytes, padding)); + } } #ifndef NDEBUG @@ -245,7 +252,7 @@ class RecordBatchSerializer : public ArrayVisitor { // Send padding if it's available const int64_t buffer_length = std::min(BitUtil::RoundUpToMultipleOf64(array.length() * type_width), - data->size() - byte_offset); + data->size() - byte_offset); data = SliceBuffer(data, byte_offset, buffer_length); } buffers_.push_back(data); @@ -253,8 +260,8 @@ class RecordBatchSerializer : public ArrayVisitor { } template - Status GetZeroBasedValueOffsets( - const ArrayType& array, std::shared_ptr* value_offsets) { + Status GetZeroBasedValueOffsets(const ArrayType& array, + std::shared_ptr* value_offsets) { // Share slicing logic between ListArray and BinaryArray auto offsets = array.value_offsets(); @@ -265,8 +272,8 @@ class RecordBatchSerializer : public ArrayVisitor { // b) slice the values array accordingly std::shared_ptr shifted_offsets; - RETURN_NOT_OK(AllocateBuffer( - pool_, sizeof(int32_t) * (array.length() + 1), &shifted_offsets)); + RETURN_NOT_OK(AllocateBuffer(pool_, sizeof(int32_t) * (array.length() + 1), + &shifted_offsets)); int32_t* dest_offsets = reinterpret_cast(shifted_offsets->mutable_data()); const int32_t start_offset = array.value_offset(0); @@ -392,13 +399,15 @@ class RecordBatchSerializer : public ArrayVisitor { const auto& type = static_cast(*array.type()); std::shared_ptr value_offsets; - RETURN_NOT_OK(GetTruncatedBuffer( - offset, length, array.value_offsets(), pool_, &value_offsets)); + RETURN_NOT_OK(GetTruncatedBuffer(offset, length, array.value_offsets(), + pool_, &value_offsets)); // The Union type codes are not necessary 0-indexed uint8_t max_code = 0; for (uint8_t code : type.type_codes()) { - if (code > max_code) { max_code = code; } + if (code > max_code) { + max_code = code; + } } // Allocate an array of child offsets. Set all to -1 to indicate that we @@ -424,7 +433,9 @@ class RecordBatchSerializer : public ArrayVisitor { for (int64_t i = 0; i < length; ++i) { const uint8_t code = type_ids[i]; int32_t shift = child_offsets[code]; - if (shift == -1) { child_offsets[code] = shift = unshifted_offsets[i]; } + if (shift == -1) { + child_offsets[code] = shift = unshifted_offsets[i]; + } shifted_offsets[i] = unshifted_offsets[i] - shift; // Update the child length to account for observed value @@ -486,14 +497,14 @@ class DictionaryWriter : public RecordBatchSerializer { public: using RecordBatchSerializer::RecordBatchSerializer; - Status WriteMetadataMessage( - int64_t num_rows, int64_t body_length, std::shared_ptr* out) override { - return WriteDictionaryMessage( - dictionary_id_, num_rows, body_length, field_nodes_, buffer_meta_, out); + Status WriteMetadataMessage(int64_t num_rows, int64_t body_length, + std::shared_ptr* out) override { + return WriteDictionaryMessage(dictionary_id_, num_rows, body_length, field_nodes_, + buffer_meta_, out); } Status Write(int64_t dictionary_id, const std::shared_ptr& dictionary, - io::OutputStream* dst, int32_t* metadata_length, int64_t* body_length) { + io::OutputStream* dst, int32_t* metadata_length, int64_t* body_length) { dictionary_id_ = dictionary_id; // Make a dummy record batch. 
A bit tedious as we have to make a schema @@ -516,27 +527,30 @@ Status AlignStreamPosition(io::OutputStream* stream) { int64_t position; RETURN_NOT_OK(stream->Tell(&position)); int64_t remainder = PaddedLength(position) - position; - if (remainder > 0) { return stream->Write(kPaddingBytes, remainder); } + if (remainder > 0) { + return stream->Write(kPaddingBytes, remainder); + } return Status::OK(); } Status WriteRecordBatch(const RecordBatch& batch, int64_t buffer_start_offset, - io::OutputStream* dst, int32_t* metadata_length, int64_t* body_length, - MemoryPool* pool, int max_recursion_depth, bool allow_64bit) { - RecordBatchSerializer writer( - pool, buffer_start_offset, max_recursion_depth, allow_64bit); + io::OutputStream* dst, int32_t* metadata_length, + int64_t* body_length, MemoryPool* pool, int max_recursion_depth, + bool allow_64bit) { + RecordBatchSerializer writer(pool, buffer_start_offset, max_recursion_depth, + allow_64bit); return writer.Write(batch, dst, metadata_length, body_length); } Status WriteLargeRecordBatch(const RecordBatch& batch, int64_t buffer_start_offset, - io::OutputStream* dst, int32_t* metadata_length, int64_t* body_length, - MemoryPool* pool) { + io::OutputStream* dst, int32_t* metadata_length, + int64_t* body_length, MemoryPool* pool) { return WriteRecordBatch(batch, buffer_start_offset, dst, metadata_length, body_length, - pool, kMaxNestingDepth, true); + pool, kMaxNestingDepth, true); } Status WriteTensor(const Tensor& tensor, io::OutputStream* dst, int32_t* metadata_length, - int64_t* body_length) { + int64_t* body_length) { if (!tensor.is_contiguous()) { return Status::Invalid("No support yet for writing non-contiguous tensors"); } @@ -556,8 +570,8 @@ Status WriteTensor(const Tensor& tensor, io::OutputStream* dst, int32_t* metadat } Status WriteDictionary(int64_t dictionary_id, const std::shared_ptr& dictionary, - int64_t buffer_start_offset, io::OutputStream* dst, int32_t* metadata_length, - int64_t* body_length, MemoryPool* pool) { + int64_t buffer_start_offset, io::OutputStream* dst, + int32_t* metadata_length, int64_t* body_length, MemoryPool* pool) { DictionaryWriter writer(pool, buffer_start_offset, kMaxNestingDepth, false); return writer.Write(dictionary_id, dictionary, dst, metadata_length, body_length); } @@ -568,7 +582,7 @@ Status GetRecordBatchSize(const RecordBatch& batch, int64_t* size) { int64_t body_length = 0; io::MockOutputStream dst; RETURN_NOT_OK(WriteRecordBatch(batch, 0, &dst, &metadata_length, &body_length, - default_memory_pool(), kMaxNestingDepth, true)); + default_memory_pool(), kMaxNestingDepth, true)); *size = dst.GetExtentBytesWritten(); return Status::OK(); } @@ -632,7 +646,9 @@ class RecordBatchStreamWriter::RecordBatchStreamWriterImpl { } Status CheckStarted() { - if (!started_) { return Start(); } + if (!started_) { + return Start(); + } return Status::OK(); } @@ -653,7 +669,7 @@ class RecordBatchStreamWriter::RecordBatchStreamWriterImpl { // Frame of reference in file format is 0, see ARROW-384 const int64_t buffer_start_offset = 0; RETURN_NOT_OK(WriteDictionary(entry.first, entry.second, buffer_start_offset, sink_, - &block->metadata_length, &block->body_length, pool_)); + &block->metadata_length, &block->body_length, pool_)); RETURN_NOT_OK(UpdatePosition()); DCHECK(position_ % 8 == 0) << "WriteDictionary did not perform aligned writes"; } @@ -668,9 +684,9 @@ class RecordBatchStreamWriter::RecordBatchStreamWriterImpl { // Frame of reference in file format is 0, see ARROW-384 const int64_t buffer_start_offset = 0; - 
RETURN_NOT_OK(arrow::ipc::WriteRecordBatch(batch, buffer_start_offset, sink_, - &block->metadata_length, &block->body_length, pool_, kMaxNestingDepth, - allow_64bit)); + RETURN_NOT_OK(arrow::ipc::WriteRecordBatch( + batch, buffer_start_offset, sink_, &block->metadata_length, &block->body_length, + pool_, kMaxNestingDepth, allow_64bit)); RETURN_NOT_OK(UpdatePosition()); DCHECK(position_ % 8 == 0) << "WriteRecordBatch did not perform aligned writes"; @@ -681,15 +697,17 @@ class RecordBatchStreamWriter::RecordBatchStreamWriterImpl { Status WriteRecordBatch(const RecordBatch& batch, bool allow_64bit) { // Push an empty FileBlock. Can be written in the footer later record_batches_.push_back({0, 0, 0}); - return WriteRecordBatch( - batch, allow_64bit, &record_batches_[record_batches_.size() - 1]); + return WriteRecordBatch(batch, allow_64bit, + &record_batches_[record_batches_.size() - 1]); } // Adds padding bytes if necessary to ensure all memory blocks are written on // 64-byte (or other alignment) boundaries. Status Align(int64_t alignment = kArrowAlignment) { int64_t remainder = PaddedLength(position_, alignment) - position_; - if (remainder > 0) { return Write(kPaddingBytes, remainder); } + if (remainder > 0) { + return Write(kPaddingBytes, remainder); + } return Status::OK(); } @@ -725,8 +743,8 @@ RecordBatchStreamWriter::RecordBatchStreamWriter() { RecordBatchStreamWriter::~RecordBatchStreamWriter() {} -Status RecordBatchStreamWriter::WriteRecordBatch( - const RecordBatch& batch, bool allow_64bit) { +Status RecordBatchStreamWriter::WriteRecordBatch(const RecordBatch& batch, + bool allow_64bit) { return impl_->WriteRecordBatch(batch, allow_64bit); } @@ -735,16 +753,14 @@ void RecordBatchStreamWriter::set_memory_pool(MemoryPool* pool) { } Status RecordBatchStreamWriter::Open(io::OutputStream* sink, - const std::shared_ptr& schema, - std::shared_ptr* out) { + const std::shared_ptr& schema, + std::shared_ptr* out) { // ctor is private *out = std::shared_ptr(new RecordBatchStreamWriter()); return (*out)->impl_->Open(sink, schema); } -Status RecordBatchStreamWriter::Close() { - return impl_->Close(); -} +Status RecordBatchStreamWriter::Close() { return impl_->Close(); } // ---------------------------------------------------------------------- // File writer implementation @@ -756,8 +772,8 @@ class RecordBatchFileWriter::RecordBatchFileWriterImpl Status Start() override { // It is only necessary to align to 8-byte boundary at the start of the file - RETURN_NOT_OK(Write( - reinterpret_cast(kArrowMagicBytes), strlen(kArrowMagicBytes))); + RETURN_NOT_OK(Write(reinterpret_cast(kArrowMagicBytes), + strlen(kArrowMagicBytes))); RETURN_NOT_OK(Align(8)); // We write the schema at the start of the file (and the end). 
This also @@ -768,21 +784,23 @@ class RecordBatchFileWriter::RecordBatchFileWriterImpl Status Close() override { // Write metadata int64_t initial_position = position_; - RETURN_NOT_OK(WriteFileFooter( - *schema_, dictionaries_, record_batches_, &dictionary_memo_, sink_)); + RETURN_NOT_OK(WriteFileFooter(*schema_, dictionaries_, record_batches_, + &dictionary_memo_, sink_)); RETURN_NOT_OK(UpdatePosition()); // Write footer length int32_t footer_length = static_cast(position_ - initial_position); - if (footer_length <= 0) { return Status::Invalid("Invalid file footer"); } + if (footer_length <= 0) { + return Status::Invalid("Invalid file footer"); + } RETURN_NOT_OK( Write(reinterpret_cast(&footer_length), sizeof(int32_t))); // Write magic bytes to end file - return Write( - reinterpret_cast(kArrowMagicBytes), strlen(kArrowMagicBytes)); + return Write(reinterpret_cast(kArrowMagicBytes), + strlen(kArrowMagicBytes)); } }; @@ -793,20 +811,19 @@ RecordBatchFileWriter::RecordBatchFileWriter() { RecordBatchFileWriter::~RecordBatchFileWriter() {} Status RecordBatchFileWriter::Open(io::OutputStream* sink, - const std::shared_ptr& schema, std::shared_ptr* out) { + const std::shared_ptr& schema, + std::shared_ptr* out) { *out = std::shared_ptr( new RecordBatchFileWriter()); // ctor is private return (*out)->impl_->Open(sink, schema); } -Status RecordBatchFileWriter::WriteRecordBatch( - const RecordBatch& batch, bool allow_64bit) { +Status RecordBatchFileWriter::WriteRecordBatch(const RecordBatch& batch, + bool allow_64bit) { return impl_->WriteRecordBatch(batch, allow_64bit); } -Status RecordBatchFileWriter::Close() { - return impl_->Close(); -} +Status RecordBatchFileWriter::Close() { return impl_->Close(); } } // namespace ipc } // namespace arrow diff --git a/cpp/src/arrow/ipc/writer.h b/cpp/src/arrow/ipc/writer.h index 899a1b2cc1e..c28dfe0afbb 100644 --- a/cpp/src/arrow/ipc/writer.h +++ b/cpp/src/arrow/ipc/writer.h @@ -85,7 +85,7 @@ class ARROW_EXPORT RecordBatchStreamWriter : public RecordBatchWriter { /// \param(out) out the created stream writer /// \return Status indicating success or failure static Status Open(io::OutputStream* sink, const std::shared_ptr& schema, - std::shared_ptr* out); + std::shared_ptr* out); Status WriteRecordBatch(const RecordBatch& batch, bool allow_64bit = false) override; Status Close() override; @@ -113,7 +113,7 @@ class ARROW_EXPORT RecordBatchFileWriter : public RecordBatchStreamWriter { /// \param(out) out the created stream writer /// \return Status indicating success or failure static Status Open(io::OutputStream* sink, const std::shared_ptr& schema, - std::shared_ptr* out); + std::shared_ptr* out); Status WriteRecordBatch(const RecordBatch& batch, bool allow_64bit = false) override; Status Close() override; @@ -145,14 +145,16 @@ class ARROW_EXPORT RecordBatchFileWriter : public RecordBatchStreamWriter { /// \param(out) body_length: the size of the contiguous buffer block plus /// padding bytes Status ARROW_EXPORT WriteRecordBatch(const RecordBatch& batch, - int64_t buffer_start_offset, io::OutputStream* dst, int32_t* metadata_length, - int64_t* body_length, MemoryPool* pool, int max_recursion_depth = kMaxNestingDepth, - bool allow_64bit = false); + int64_t buffer_start_offset, io::OutputStream* dst, + int32_t* metadata_length, int64_t* body_length, + MemoryPool* pool, + int max_recursion_depth = kMaxNestingDepth, + bool allow_64bit = false); // Write Array as a DictionaryBatch message Status WriteDictionary(int64_t dictionary_id, const std::shared_ptr& 
dictionary, - int64_t buffer_start_offset, io::OutputStream* dst, int32_t* metadata_length, - int64_t* body_length, MemoryPool* pool); + int64_t buffer_start_offset, io::OutputStream* dst, + int32_t* metadata_length, int64_t* body_length, MemoryPool* pool); // Compute the precise number of bytes needed in a contiguous memory segment to // write the record batch. This involves generating the complete serialized @@ -166,13 +168,14 @@ Status ARROW_EXPORT GetTensorSize(const Tensor& tensor, int64_t* size); /// EXPERIMENTAL: Write RecordBatch allowing lengths over INT32_MAX. This data /// may not be readable by all Arrow implementations Status ARROW_EXPORT WriteLargeRecordBatch(const RecordBatch& batch, - int64_t buffer_start_offset, io::OutputStream* dst, int32_t* metadata_length, - int64_t* body_length, MemoryPool* pool); + int64_t buffer_start_offset, + io::OutputStream* dst, int32_t* metadata_length, + int64_t* body_length, MemoryPool* pool); /// EXPERIMENTAL: Write arrow::Tensor as a contiguous message /// Status ARROW_EXPORT WriteTensor(const Tensor& tensor, io::OutputStream* dst, - int32_t* metadata_length, int64_t* body_length); + int32_t* metadata_length, int64_t* body_length); /// Backwards-compatibility for Arrow < 0.4.0 /// diff --git a/cpp/src/arrow/memory_pool-test.cc b/cpp/src/arrow/memory_pool-test.cc index 8a185abca71..52e48dbefab 100644 --- a/cpp/src/arrow/memory_pool-test.cc +++ b/cpp/src/arrow/memory_pool-test.cc @@ -27,9 +27,7 @@ class TestDefaultMemoryPool : public ::arrow::test::TestMemoryPoolBase { ::arrow::MemoryPool* memory_pool() override { return ::arrow::default_memory_pool(); } }; -TEST_F(TestDefaultMemoryPool, MemoryTracking) { - this->TestMemoryTracking(); -} +TEST_F(TestDefaultMemoryPool, MemoryTracking) { this->TestMemoryTracking(); } TEST_F(TestDefaultMemoryPool, OOM) { #ifndef ADDRESS_SANITIZER @@ -37,9 +35,7 @@ TEST_F(TestDefaultMemoryPool, OOM) { #endif } -TEST_F(TestDefaultMemoryPool, Reallocate) { - this->TestReallocate(); -} +TEST_F(TestDefaultMemoryPool, Reallocate) { this->TestReallocate(); } // Death tests and valgrind are known to not play well 100% of the time. 
See // googletest documentation @@ -53,7 +49,7 @@ TEST(DefaultMemoryPoolDeathTest, FreeLargeMemory) { #ifndef NDEBUG EXPECT_EXIT(pool->Free(data, 120), ::testing::ExitedWithCode(1), - ".*Check failed: \\(bytes_allocated_\\) >= \\(size\\)"); + ".*Check failed: \\(bytes_allocated_\\) >= \\(size\\)"); #endif pool->Free(data, 100); diff --git a/cpp/src/arrow/memory_pool.cc b/cpp/src/arrow/memory_pool.cc index e7de5c4fc58..769fc1037ee 100644 --- a/cpp/src/arrow/memory_pool.cc +++ b/cpp/src/arrow/memory_pool.cc @@ -17,12 +17,12 @@ #include "arrow/memory_pool.h" +#include #include #include #include #include #include -#include #include "arrow/status.h" #include "arrow/util/logging.h" @@ -60,8 +60,8 @@ Status AllocateAligned(int64_t size, uint8_t** out) { return Status::OutOfMemory(ss.str()); } #else - const int result = posix_memalign( - reinterpret_cast(out), kAlignment, static_cast(size)); + const int result = posix_memalign(reinterpret_cast(out), kAlignment, + static_cast(size)); if (result == ENOMEM) { std::stringstream ss; ss << "malloc of size " << size << " failed"; @@ -82,13 +82,9 @@ MemoryPool::MemoryPool() {} MemoryPool::~MemoryPool() {} -int64_t MemoryPool::max_memory() const { - return -1; -} +int64_t MemoryPool::max_memory() const { return -1; } -DefaultMemoryPool::DefaultMemoryPool() : bytes_allocated_(0) { - max_memory_ = 0; -} +DefaultMemoryPool::DefaultMemoryPool() : bytes_allocated_(0) { max_memory_ = 0; } Status DefaultMemoryPool::Allocate(int64_t size, uint8_t** out) { RETURN_NOT_OK(AllocateAligned(size, out)); @@ -96,7 +92,9 @@ Status DefaultMemoryPool::Allocate(int64_t size, uint8_t** out) { { std::lock_guard guard(lock_); - if (bytes_allocated_ > max_memory_) { max_memory_ = bytes_allocated_.load(); } + if (bytes_allocated_ > max_memory_) { + max_memory_ = bytes_allocated_.load(); + } } return Status::OK(); } @@ -128,15 +126,15 @@ Status DefaultMemoryPool::Reallocate(int64_t old_size, int64_t new_size, uint8_t bytes_allocated_ += new_size - old_size; { std::lock_guard guard(lock_); - if (bytes_allocated_ > max_memory_) { max_memory_ = bytes_allocated_.load(); } + if (bytes_allocated_ > max_memory_) { + max_memory_ = bytes_allocated_.load(); + } } return Status::OK(); } -int64_t DefaultMemoryPool::bytes_allocated() const { - return bytes_allocated_.load(); -} +int64_t DefaultMemoryPool::bytes_allocated() const { return bytes_allocated_.load(); } void DefaultMemoryPool::Free(uint8_t* buffer, int64_t size) { DCHECK_GE(bytes_allocated_, size); @@ -150,9 +148,7 @@ void DefaultMemoryPool::Free(uint8_t* buffer, int64_t size) { bytes_allocated_ -= size; } -int64_t DefaultMemoryPool::max_memory() const { - return max_memory_.load(); -} +int64_t DefaultMemoryPool::max_memory() const { return max_memory_.load(); } DefaultMemoryPool::~DefaultMemoryPool() {} diff --git a/cpp/src/arrow/pretty_print-test.cc b/cpp/src/arrow/pretty_print-test.cc index 10a91f5e4e4..049f5a58a68 100644 --- a/cpp/src/arrow/pretty_print-test.cc +++ b/cpp/src/arrow/pretty_print-test.cc @@ -57,7 +57,7 @@ void CheckArray(const Array& arr, int indent, const char* expected) { template void CheckPrimitive(int indent, const std::vector& is_valid, - const std::vector& values, const char* expected) { + const std::vector& values, const char* expected) { std::shared_ptr array; ArrayFromVector(is_valid, values, &array); CheckArray(*array, indent, expected); diff --git a/cpp/src/arrow/pretty_print.cc b/cpp/src/arrow/pretty_print.cc index 93f6ff0f363..aedad1228df 100644 --- a/cpp/src/arrow/pretty_print.cc +++ 
b/cpp/src/arrow/pretty_print.cc @@ -42,7 +42,9 @@ class ArrayPrinter { const T& array) { const auto data = array.raw_values(); for (int i = 0; i < array.length(); ++i) { - if (i > 0) { (*sink_) << ", "; } + if (i > 0) { + (*sink_) << ", "; + } if (array.IsNull(i)) { (*sink_) << "null"; } else { @@ -56,7 +58,9 @@ class ArrayPrinter { const T& array) { const auto data = array.raw_values(); for (int i = 0; i < array.length(); ++i) { - if (i > 0) { (*sink_) << ", "; } + if (i > 0) { + (*sink_) << ", "; + } if (array.IsNull(i)) { Write("null"); } else { @@ -71,7 +75,9 @@ class ArrayPrinter { WriteDataValues(const T& array) { int32_t length; for (int i = 0; i < array.length(); ++i) { - if (i > 0) { (*sink_) << ", "; } + if (i > 0) { + (*sink_) << ", "; + } if (array.IsNull(i)) { Write("null"); } else { @@ -87,7 +93,9 @@ class ArrayPrinter { WriteDataValues(const T& array) { int32_t length; for (int i = 0; i < array.length(); ++i) { - if (i > 0) { (*sink_) << ", "; } + if (i > 0) { + (*sink_) << ", "; + } if (array.IsNull(i)) { Write("null"); } else { @@ -102,7 +110,9 @@ class ArrayPrinter { WriteDataValues(const T& array) { int32_t width = array.byte_width(); for (int i = 0; i < array.length(); ++i) { - if (i > 0) { (*sink_) << ", "; } + if (i > 0) { + (*sink_) << ", "; + } if (array.IsNull(i)) { Write("null"); } else { @@ -116,7 +126,9 @@ class ArrayPrinter { inline typename std::enable_if::value, void>::type WriteDataValues(const T& array) { for (int i = 0; i < array.length(); ++i) { - if (i > 0) { (*sink_) << ", "; } + if (i > 0) { + (*sink_) << ", "; + } if (array.IsNull(i)) { Write("null"); } else { @@ -138,7 +150,7 @@ class ArrayPrinter { typename std::enable_if::value || std::is_base_of::value || std::is_base_of::value, - Status>::type + Status>::type Visit(const T& array) { OpenArray(); WriteDataValues(array); @@ -157,8 +169,8 @@ class ArrayPrinter { Newline(); Write("-- value_offsets: "); - Int32Array value_offsets( - array.length() + 1, array.value_offsets(), nullptr, 0, array.offset()); + Int32Array value_offsets(array.length() + 1, array.value_offsets(), nullptr, 0, + array.offset()); RETURN_NOT_OK(PrettyPrint(value_offsets, indent_ + 2, sink_)); Newline(); @@ -170,8 +182,8 @@ class ArrayPrinter { return Status::OK(); } - Status PrintChildren( - const std::vector>& fields, int64_t offset, int64_t length) { + Status PrintChildren(const std::vector>& fields, int64_t offset, + int64_t length) { for (size_t i = 0; i < fields.size(); ++i) { Newline(); std::stringstream ss; @@ -179,7 +191,9 @@ class ArrayPrinter { Write(ss.str()); std::shared_ptr field = fields[i]; - if (offset != 0) { field = field->Slice(offset, length); } + if (offset != 0) { + field = field->Slice(offset, length); + } RETURN_NOT_OK(PrettyPrint(*field, indent_ + 2, sink_)); } @@ -207,8 +221,8 @@ class ArrayPrinter { if (array.mode() == UnionMode::DENSE) { Newline(); Write("-- value_offsets: "); - Int32Array value_offsets( - array.length(), array.value_offsets(), nullptr, 0, array.offset()); + Int32Array value_offsets(array.length(), array.value_offsets(), nullptr, 0, + array.offset()); RETURN_NOT_OK(PrettyPrint(value_offsets, indent_ + 2, sink_)); } @@ -247,8 +261,8 @@ Status ArrayPrinter::WriteValidityBitmap(const Array& array) { Write("-- is_valid: "); if (array.null_count() > 0) { - BooleanArray is_valid( - array.length(), array.null_bitmap(), nullptr, 0, array.offset()); + BooleanArray is_valid(array.length(), array.null_bitmap(), nullptr, 0, + array.offset()); return PrettyPrint(is_valid, indent_ + 2, sink_); } 
else { Write("all not null"); @@ -256,20 +270,12 @@ Status ArrayPrinter::WriteValidityBitmap(const Array& array) { } } -void ArrayPrinter::OpenArray() { - (*sink_) << "["; -} -void ArrayPrinter::CloseArray() { - (*sink_) << "]"; -} +void ArrayPrinter::OpenArray() { (*sink_) << "["; } +void ArrayPrinter::CloseArray() { (*sink_) << "]"; } -void ArrayPrinter::Write(const char* data) { - (*sink_) << data; -} +void ArrayPrinter::Write(const char* data) { (*sink_) << data; } -void ArrayPrinter::Write(const std::string& data) { - (*sink_) << data; -} +void ArrayPrinter::Write(const std::string& data) { (*sink_) << data; } void ArrayPrinter::Newline() { (*sink_) << "\n"; diff --git a/cpp/src/arrow/python/arrow_to_pandas.cc b/cpp/src/arrow/python/arrow_to_pandas.cc index d40609fe3fa..86f82fdbd8d 100644 --- a/cpp/src/arrow/python/arrow_to_pandas.cc +++ b/cpp/src/arrow/python/arrow_to_pandas.cc @@ -56,6 +56,9 @@ namespace arrow { namespace py { +using internal::kPandasTimestampNull; +using internal::kNanosecondsInDay; + // ---------------------------------------------------------------------- // Utility code @@ -147,8 +150,8 @@ static inline PyArray_Descr* GetSafeNumPyDtype(int type) { return PyArray_DescrFromType(type); } } -static inline PyObject* NewArray1DFromType( - DataType* arrow_type, int type, int64_t length, void* data) { +static inline PyObject* NewArray1DFromType(DataType* arrow_type, int type, int64_t length, + void* data) { npy_intp dims[1] = {length}; PyArray_Descr* descr = GetSafeNumPyDtype(type); @@ -159,7 +162,8 @@ static inline PyObject* NewArray1DFromType( set_numpy_metadata(type, arrow_type, descr); return PyArray_NewFromDescr(&PyArray_Type, descr, 1, dims, nullptr, data, - NPY_ARRAY_OWNDATA | NPY_ARRAY_CARRAY | NPY_ARRAY_WRITEABLE, nullptr); + NPY_ARRAY_OWNDATA | NPY_ARRAY_CARRAY | NPY_ARRAY_WRITEABLE, + nullptr); } class PandasBlock { @@ -188,7 +192,7 @@ class PandasBlock { virtual Status Allocate() = 0; virtual Status Write(const std::shared_ptr& col, int64_t abs_placement, - int64_t rel_placement) = 0; + int64_t rel_placement) = 0; PyObject* block_arr() const { return block_arr_.obj(); } @@ -408,7 +412,9 @@ inline Status ConvertFixedSizeBinary(const ChunkedArray& data, PyObject** out_va inline Status ConvertStruct(const ChunkedArray& data, PyObject** out_values) { PyAcquireGIL lock; - if (data.num_chunks() <= 0) { return Status::OK(); } + if (data.num_chunks() <= 0) { + return Status::OK(); + } // ChunkedArray has at least one chunk auto arr = static_cast(data.chunk(0).get()); // Use it to cache the struct type and number of fields for all chunks @@ -467,8 +473,8 @@ inline Status ConvertStruct(const ChunkedArray& data, PyObject** out_values) { } template -inline Status ConvertListsLike( - const std::shared_ptr& col, PyObject** out_values) { +inline Status ConvertListsLike(const std::shared_ptr& col, + PyObject** out_values) { const ChunkedArray& data = *col->data().get(); auto list_type = std::static_pointer_cast(col->type()); @@ -532,8 +538,8 @@ inline void ConvertNumericNullable(const ChunkedArray& data, T na_value, T* out_ } template -inline void ConvertNumericNullableCast( - const ChunkedArray& data, OutType na_value, OutType* out_values) { +inline void ConvertNumericNullableCast(const ChunkedArray& data, OutType na_value, + OutType* out_values) { for (int c = 0; c < data.num_chunks(); c++) { const std::shared_ptr arr = data.chunk(c); auto prim_arr = static_cast(arr.get()); @@ -602,8 +608,8 @@ Status ValidateDecimalPrecision(int precision) { } template -Status 
RawDecimalToString( - const uint8_t* bytes, int precision, int scale, std::string* result) { +Status RawDecimalToString(const uint8_t* bytes, int precision, int scale, + std::string* result) { DCHECK_NE(bytes, nullptr); DCHECK_NE(result, nullptr); RETURN_NOT_OK(ValidateDecimalPrecision(precision)); @@ -613,13 +619,13 @@ Status RawDecimalToString( return Status::OK(); } -template Status RawDecimalToString( - const uint8_t*, int, int, std::string* result); -template Status RawDecimalToString( - const uint8_t*, int, int, std::string* result); +template Status RawDecimalToString(const uint8_t*, int, int, + std::string* result); +template Status RawDecimalToString(const uint8_t*, int, int, + std::string* result); Status RawDecimalToString(const uint8_t* bytes, int precision, int scale, - bool is_negative, std::string* result) { + bool is_negative, std::string* result) { DCHECK_NE(bytes, nullptr); DCHECK_NE(result, nullptr); RETURN_NOT_OK(ValidateDecimalPrecision(precision)); @@ -684,7 +690,7 @@ class ObjectBlock : public PandasBlock { Status Allocate() override { return AllocateNDArray(NPY_OBJECT); } Status Write(const std::shared_ptr& col, int64_t abs_placement, - int64_t rel_placement) override { + int64_t rel_placement) override { Type::type type = col->type()->id(); PyObject** out_buffer = @@ -749,11 +755,11 @@ class IntBlock : public PandasBlock { public: using PandasBlock::PandasBlock; Status Allocate() override { - return AllocateNDArray(arrow_traits::npy_type); + return AllocateNDArray(internal::arrow_traits::npy_type); } Status Write(const std::shared_ptr& col, int64_t abs_placement, - int64_t rel_placement) override { + int64_t rel_placement) override { Type::type type = col->type()->id(); C_TYPE* out_buffer = @@ -789,7 +795,7 @@ class Float32Block : public PandasBlock { Status Allocate() override { return AllocateNDArray(NPY_FLOAT32); } Status Write(const std::shared_ptr& col, int64_t abs_placement, - int64_t rel_placement) override { + int64_t rel_placement) override { Type::type type = col->type()->id(); if (type != Type::FLOAT) { @@ -813,7 +819,7 @@ class Float64Block : public PandasBlock { Status Allocate() override { return AllocateNDArray(NPY_FLOAT64); } Status Write(const std::shared_ptr& col, int64_t abs_placement, - int64_t rel_placement) override { + int64_t rel_placement) override { Type::type type = col->type()->id(); double* out_buffer = @@ -868,7 +874,7 @@ class BoolBlock : public PandasBlock { Status Allocate() override { return AllocateNDArray(NPY_BOOL); } Status Write(const std::shared_ptr& col, int64_t abs_placement, - int64_t rel_placement) override { + int64_t rel_placement) override { Type::type type = col->type()->id(); if (type != Type::BOOL) { @@ -903,7 +909,7 @@ class DatetimeBlock : public PandasBlock { Status Allocate() override { return AllocateDatetime(2); } Status Write(const std::shared_ptr& col, int64_t abs_placement, - int64_t rel_placement) override { + int64_t rel_placement) override { Type::type type = col->type()->id(); int64_t* out_buffer = @@ -978,18 +984,18 @@ class CategoricalBlock : public PandasBlock { public: explicit CategoricalBlock(int64_t num_rows) : PandasBlock(num_rows, 1) {} Status Allocate() override { - constexpr int npy_type = arrow_traits::npy_type; + constexpr int npy_type = internal::arrow_traits::npy_type; if (!(npy_type == NPY_INT8 || npy_type == NPY_INT16 || npy_type == NPY_INT32 || - npy_type == NPY_INT64)) { + npy_type == NPY_INT64)) { return Status::Invalid("Category indices must be signed integers"); } return 
AllocateNDArray(npy_type, 1); } Status Write(const std::shared_ptr& col, int64_t abs_placement, - int64_t rel_placement) override { - using T = typename arrow_traits::T; + int64_t rel_placement) override { + using T = typename internal::arrow_traits::T; T* out_values = reinterpret_cast(block_data_) + rel_placement * num_rows_; @@ -1036,7 +1042,7 @@ class CategoricalBlock : public PandasBlock { }; Status MakeBlock(PandasBlock::type type, int64_t num_rows, int num_columns, - std::shared_ptr* block) { + std::shared_ptr* block) { #define BLOCK_CASE(NAME, TYPE) \ case PandasBlock::NAME: \ *block = std::make_shared(num_rows, num_columns); \ @@ -1066,7 +1072,8 @@ Status MakeBlock(PandasBlock::type type, int64_t num_rows, int num_columns, } static inline Status MakeCategoricalBlock(const std::shared_ptr& type, - int64_t num_rows, std::shared_ptr* block) { + int64_t num_rows, + std::shared_ptr* block) { // All categoricals become a block with a single column auto dict_type = static_cast(type.get()); switch (dict_type->index_type()->id()) { @@ -1259,7 +1266,9 @@ class DataFrameBlockCreator { block = it->second; } else { auto it = this->blocks_.find(output_type); - if (it == this->blocks_.end()) { return Status::KeyError("No block allocated"); } + if (it == this->blocks_.end()) { + return Status::KeyError("No block allocated"); + } block = it->second; } return block->Write(col, i, rel_placement); @@ -1286,7 +1295,9 @@ class DataFrameBlockCreator { int column_num; while (!error_occurred) { column_num = task_counter.fetch_add(1); - if (column_num >= this->table_->num_columns()) { break; } + if (column_num >= this->table_->num_columns()) { + break; + } Status s = WriteColumn(column_num); if (!s.ok()) { std::lock_guard lock(error_mtx); @@ -1301,7 +1312,9 @@ class DataFrameBlockCreator { thread.join(); } - if (error_occurred) { return error; } + if (error_occurred) { + return error; + } } return Status::OK(); } @@ -1310,7 +1323,9 @@ class DataFrameBlockCreator { for (const auto& it : blocks) { PyObject* item; RETURN_NOT_OK(it.second->GetPyResult(&item)); - if (PyList_Append(list, item) < 0) { RETURN_IF_PYERROR(); } + if (PyList_Append(list, item) < 0) { + RETURN_IF_PYERROR(); + } // ARROW-1017; PyList_Append increments object refcount Py_DECREF(item); @@ -1369,7 +1384,7 @@ class ArrowDeserializer { template Status ConvertValuesZeroCopy(int npy_type, std::shared_ptr arr) { - typedef typename arrow_traits::T T; + typedef typename internal::arrow_traits::T T; auto prim_arr = static_cast(arr.get()); auto in_values = reinterpret_cast(prim_arr->raw_values()); @@ -1413,7 +1428,7 @@ class ArrowDeserializer { typename std::enable_if::value, Status>::type Visit(const Type& type) { constexpr int TYPE = Type::type_id; - using traits = arrow_traits; + using traits = internal::arrow_traits; typedef typename traits::T T; int npy_type = traits::npy_type; @@ -1432,10 +1447,10 @@ class ArrowDeserializer { template typename std::enable_if::value || std::is_base_of::value, - Status>::type + Status>::type Visit(const Type& type) { constexpr int TYPE = Type::type_id; - using traits = arrow_traits; + using traits = internal::arrow_traits; typedef typename traits::T T; @@ -1468,7 +1483,7 @@ class ArrowDeserializer { typename std::enable_if::value, Status>::type Visit( const Type& type) { constexpr int TYPE = Type::type_id; - using traits = arrow_traits; + using traits = internal::arrow_traits; typedef typename traits::T T; @@ -1523,7 +1538,7 @@ class ArrowDeserializer { if (data_.null_count() > 0) { return 
VisitObjects(ConvertBooleanWithNulls); } else { - RETURN_NOT_OK(AllocateOutput(arrow_traits::npy_type)); + RETURN_NOT_OK(AllocateOutput(internal::arrow_traits::npy_type)); auto out_values = reinterpret_cast(PyArray_DATA(arr_)); ConvertBooleanNoNulls(data_, out_values); } @@ -1603,22 +1618,22 @@ class ArrowDeserializer { PyObject* result_; }; -Status ConvertArrayToPandas( - const std::shared_ptr& arr, PyObject* py_ref, PyObject** out) { +Status ConvertArrayToPandas(const std::shared_ptr& arr, PyObject* py_ref, + PyObject** out) { static std::string dummy_name = "dummy"; auto field = std::make_shared(dummy_name, arr->type()); auto col = std::make_shared(field, arr); return ConvertColumnToPandas(col, py_ref, out); } -Status ConvertColumnToPandas( - const std::shared_ptr& col, PyObject* py_ref, PyObject** out) { +Status ConvertColumnToPandas(const std::shared_ptr& col, PyObject* py_ref, + PyObject** out) { ArrowDeserializer converter(col, py_ref); return converter.Convert(out); } -Status ConvertTableToPandas( - const std::shared_ptr& table, int nthreads, PyObject** out) { +Status ConvertTableToPandas(const std::shared_ptr
& table, int nthreads, + PyObject** out) { DataFrameBlockCreator helper(table); return helper.Convert(nthreads, out); } diff --git a/cpp/src/arrow/python/arrow_to_pandas.h b/cpp/src/arrow/python/arrow_to_pandas.h index c606dcbbe0a..5a99274a33e 100644 --- a/cpp/src/arrow/python/arrow_to_pandas.h +++ b/cpp/src/arrow/python/arrow_to_pandas.h @@ -40,12 +40,12 @@ class Table; namespace py { ARROW_EXPORT -Status ConvertArrayToPandas( - const std::shared_ptr& arr, PyObject* py_ref, PyObject** out); +Status ConvertArrayToPandas(const std::shared_ptr& arr, PyObject* py_ref, + PyObject** out); ARROW_EXPORT -Status ConvertColumnToPandas( - const std::shared_ptr& col, PyObject* py_ref, PyObject** out); +Status ConvertColumnToPandas(const std::shared_ptr& col, PyObject* py_ref, + PyObject** out); struct PandasOptions { bool strings_to_categorical; @@ -58,8 +58,8 @@ struct PandasOptions { // // tuple item: (indices: ndarray[int32], block: ndarray[TYPE, ndim=2]) ARROW_EXPORT -Status ConvertTableToPandas( - const std::shared_ptr
<Table>& table, int nthreads, PyObject** out); +Status ConvertTableToPandas(const std::shared_ptr<Table>
& table, int nthreads, + PyObject** out); } // namespace py } // namespace arrow diff --git a/cpp/src/arrow/python/builtin_convert.cc b/cpp/src/arrow/python/builtin_convert.cc index a76b6ba2553..6eaa37fb8ca 100644 --- a/cpp/src/arrow/python/builtin_convert.cc +++ b/cpp/src/arrow/python/builtin_convert.cc @@ -44,8 +44,8 @@ static inline bool IsPyInteger(PyObject* obj) { #endif } -Status InvalidConversion( - PyObject* obj, const std::string& expected_types, std::ostream* out) { +Status InvalidConversion(PyObject* obj, const std::string& expected_types, + std::ostream* out) { OwnedRef type(PyObject_Type(obj)); RETURN_IF_PYERROR(); DCHECK_NE(type.obj(), nullptr); @@ -161,7 +161,9 @@ class SeqVisitor { // co-recursive with VisitElem Status Visit(PyObject* obj, int level = 0) { - if (level > max_nesting_level_) { max_nesting_level_ = level; } + if (level > max_nesting_level_) { + max_nesting_level_ = level; + } // Loop through either a sequence or an iterator. if (PySequence_Check(obj)) { Py_ssize_t size = PySequence_Size(obj); @@ -226,7 +228,9 @@ class SeqVisitor { int max_observed_level() const { int result = 0; for (int i = 0; i < MAX_NESTING_LEVELS; ++i) { - if (nesting_histogram_[i] > 0) { result = i; } + if (nesting_histogram_[i] > 0) { + result = i; + } } return result; } @@ -235,7 +239,9 @@ class SeqVisitor { int num_nesting_levels() const { int result = 0; for (int i = 0; i < MAX_NESTING_LEVELS; ++i) { - if (nesting_histogram_[i] > 0) { ++result; } + if (nesting_histogram_[i] > 0) { + ++result; + } } return result; } @@ -300,13 +306,15 @@ Status InferArrowType(PyObject* obj, std::shared_ptr* out_type) { RETURN_NOT_OK(seq_visitor.Validate()); *out_type = seq_visitor.GetType(); - if (*out_type == nullptr) { return Status::TypeError("Unable to determine data type"); } + if (*out_type == nullptr) { + return Status::TypeError("Unable to determine data type"); + } return Status::OK(); } -Status InferArrowTypeAndSize( - PyObject* obj, int64_t* size, std::shared_ptr* out_type) { +Status InferArrowTypeAndSize(PyObject* obj, int64_t* size, + std::shared_ptr* out_type) { RETURN_NOT_OK(InferArrowSize(obj, size)); // For 0-length sequences, refuse to guess @@ -372,7 +380,9 @@ class TypedConverterVisitor : public TypedConverter { RETURN_NOT_OK(static_cast(this)->AppendItem(ref)); ++i; } - if (size != i) { RETURN_NOT_OK(this->typed_builder_->Resize(i)); } + if (size != i) { + RETURN_NOT_OK(this->typed_builder_->Resize(i)); + } } else { return Status::TypeError("Object is not a sequence or iterable"); } @@ -487,8 +497,9 @@ class FixedWidthBytesConverter inline Status AppendItem(const OwnedRef& item) { PyObject* bytes_obj; OwnedRef tmp; - Py_ssize_t expected_length = std::dynamic_pointer_cast( - typed_builder_->type())->byte_width(); + Py_ssize_t expected_length = + std::dynamic_pointer_cast(typed_builder_->type()) + ->byte_width(); if (item.obj() == Py_None) { RETURN_NOT_OK(typed_builder_->AppendNull()); return Status::OK(); @@ -636,7 +647,7 @@ Status ListConverter::Init(ArrayBuilder* builder) { } Status AppendPySequence(PyObject* obj, int64_t size, - const std::shared_ptr& type, ArrayBuilder* builder) { + const std::shared_ptr& type, ArrayBuilder* builder) { PyDateTime_IMPORT; std::shared_ptr converter = GetConverter(type); if (converter == nullptr) { @@ -656,7 +667,7 @@ Status ConvertPySequence(PyObject* obj, MemoryPool* pool, std::shared_ptr } Status ConvertPySequence(PyObject* obj, MemoryPool* pool, std::shared_ptr* out, - const std::shared_ptr& type, int64_t size) { + const std::shared_ptr& type, 
int64_t size) { // Handle NA / NullType case if (type->id() == Type::NA) { out->reset(new NullArray(size)); @@ -671,7 +682,7 @@ Status ConvertPySequence(PyObject* obj, MemoryPool* pool, std::shared_ptr } Status ConvertPySequence(PyObject* obj, MemoryPool* pool, std::shared_ptr* out, - const std::shared_ptr& type) { + const std::shared_ptr& type) { int64_t size; RETURN_NOT_OK(InferArrowSize(obj, &size)); return ConvertPySequence(obj, pool, out, type, size); diff --git a/cpp/src/arrow/python/builtin_convert.h b/cpp/src/arrow/python/builtin_convert.h index 4f84fbb7cac..cde7a1bd4cf 100644 --- a/cpp/src/arrow/python/builtin_convert.h +++ b/cpp/src/arrow/python/builtin_convert.h @@ -39,14 +39,15 @@ class Status; namespace py { -ARROW_EXPORT arrow::Status InferArrowType( - PyObject* obj, std::shared_ptr* out_type); +ARROW_EXPORT arrow::Status InferArrowType(PyObject* obj, + std::shared_ptr* out_type); ARROW_EXPORT arrow::Status InferArrowTypeAndSize( PyObject* obj, int64_t* size, std::shared_ptr* out_type); ARROW_EXPORT arrow::Status InferArrowSize(PyObject* obj, int64_t* size); ARROW_EXPORT arrow::Status AppendPySequence(PyObject* obj, int64_t size, - const std::shared_ptr& type, arrow::ArrayBuilder* builder); + const std::shared_ptr& type, + arrow::ArrayBuilder* builder); // Type and size inference ARROW_EXPORT @@ -55,19 +56,19 @@ Status ConvertPySequence(PyObject* obj, MemoryPool* pool, std::shared_ptr // Size inference ARROW_EXPORT Status ConvertPySequence(PyObject* obj, MemoryPool* pool, std::shared_ptr* out, - const std::shared_ptr& type); + const std::shared_ptr& type); // No inference ARROW_EXPORT Status ConvertPySequence(PyObject* obj, MemoryPool* pool, std::shared_ptr* out, - const std::shared_ptr& type, int64_t size); + const std::shared_ptr& type, int64_t size); ARROW_EXPORT -Status InvalidConversion( - PyObject* obj, const std::string& expected_type_name, std::ostream* out); +Status InvalidConversion(PyObject* obj, const std::string& expected_type_name, + std::ostream* out); -ARROW_EXPORT Status CheckPythonBytesAreFixedLength( - PyObject* obj, Py_ssize_t expected_length); +ARROW_EXPORT Status CheckPythonBytesAreFixedLength(PyObject* obj, + Py_ssize_t expected_length); } // namespace py } // namespace arrow diff --git a/cpp/src/arrow/python/config.cc b/cpp/src/arrow/python/config.cc index 3cec7c41a2f..bda7a7af163 100644 --- a/cpp/src/arrow/python/config.cc +++ b/cpp/src/arrow/python/config.cc @@ -16,7 +16,6 @@ // under the License. 
#include "arrow/python/platform.h" -#include #include "arrow/python/config.h" diff --git a/cpp/src/arrow/python/helpers.cc b/cpp/src/arrow/python/helpers.cc index 76ec3a1ba87..164e42e52e4 100644 --- a/cpp/src/arrow/python/helpers.cc +++ b/cpp/src/arrow/python/helpers.cc @@ -89,8 +89,8 @@ Status PythonDecimalToString(PyObject* python_decimal, std::string* out) { return Status::OK(); } -Status InferDecimalPrecisionAndScale( - PyObject* python_decimal, int* precision, int* scale) { +Status InferDecimalPrecisionAndScale(PyObject* python_decimal, int* precision, + int* scale) { // Call Python's str(decimal_object) OwnedRef str_obj(PyObject_Str(python_decimal)); RETURN_IF_PYERROR(); @@ -102,12 +102,12 @@ Status InferDecimalPrecisionAndScale( auto size = str.size; std::string c_string(bytes, size); - return FromString( - c_string, static_cast(nullptr), precision, scale); + return FromString(c_string, static_cast(nullptr), precision, + scale); } -Status DecimalFromString( - PyObject* decimal_constructor, const std::string& decimal_string, PyObject** out) { +Status DecimalFromString(PyObject* decimal_constructor, const std::string& decimal_string, + PyObject** out) { DCHECK_NE(decimal_constructor, nullptr); DCHECK_NE(out, nullptr); @@ -117,8 +117,8 @@ Status DecimalFromString( auto string_bytes = decimal_string.c_str(); DCHECK_NE(string_bytes, nullptr); - *out = PyObject_CallFunction( - decimal_constructor, const_cast("s#"), string_bytes, string_size); + *out = PyObject_CallFunction(decimal_constructor, const_cast("s#"), string_bytes, + string_size); RETURN_IF_PYERROR(); return Status::OK(); } diff --git a/cpp/src/arrow/python/helpers.h b/cpp/src/arrow/python/helpers.h index e0656699ce4..8b8c6673c8e 100644 --- a/cpp/src/arrow/python/helpers.h +++ b/cpp/src/arrow/python/helpers.h @@ -36,16 +36,17 @@ class OwnedRef; ARROW_EXPORT std::shared_ptr GetPrimitiveType(Type::type type); Status ARROW_EXPORT ImportModule(const std::string& module_name, OwnedRef* ref); -Status ARROW_EXPORT ImportFromModule( - const OwnedRef& module, const std::string& module_name, OwnedRef* ref); +Status ARROW_EXPORT ImportFromModule(const OwnedRef& module, + const std::string& module_name, OwnedRef* ref); Status ARROW_EXPORT PythonDecimalToString(PyObject* python_decimal, std::string* out); -Status ARROW_EXPORT InferDecimalPrecisionAndScale( - PyObject* python_decimal, int* precision = nullptr, int* scale = nullptr); +Status ARROW_EXPORT InferDecimalPrecisionAndScale(PyObject* python_decimal, + int* precision = nullptr, + int* scale = nullptr); -Status ARROW_EXPORT DecimalFromString( - PyObject* decimal_constructor, const std::string& decimal_string, PyObject** out); +Status ARROW_EXPORT DecimalFromString(PyObject* decimal_constructor, + const std::string& decimal_string, PyObject** out); } // namespace py } // namespace arrow diff --git a/cpp/src/arrow/python/init.cc b/cpp/src/arrow/python/init.cc index db648915465..dba293bbe23 100644 --- a/cpp/src/arrow/python/init.cc +++ b/cpp/src/arrow/python/init.cc @@ -21,6 +21,4 @@ #include "arrow/python/init.h" #include "arrow/python/numpy_interop.h" -int arrow_init_numpy() { - return arrow::py::import_numpy(); -} +int arrow_init_numpy() { return arrow::py::import_numpy(); } diff --git a/cpp/src/arrow/python/io.cc b/cpp/src/arrow/python/io.cc index a7193854c4d..4c73fd6401c 100644 --- a/cpp/src/arrow/python/io.cc +++ b/cpp/src/arrow/python/io.cc @@ -33,23 +33,19 @@ namespace py { // ---------------------------------------------------------------------- // Python file 
-PythonFile::PythonFile(PyObject* file) : file_(file) { - Py_INCREF(file_); -} +PythonFile::PythonFile(PyObject* file) : file_(file) { Py_INCREF(file_); } -PythonFile::~PythonFile() { - Py_DECREF(file_); -} +PythonFile::~PythonFile() { Py_DECREF(file_); } // This is annoying: because C++11 does not allow implicit conversion of string // literals to non-const char*, we need to go through some gymnastics to use // PyObject_CallMethod without a lot of pain (its arguments are non-const // char*) template -static inline PyObject* cpp_PyObject_CallMethod( - PyObject* obj, const char* method_name, const char* argspec, ArgTypes... args) { - return PyObject_CallMethod( - obj, const_cast(method_name), const_cast(argspec), args...); +static inline PyObject* cpp_PyObject_CallMethod(PyObject* obj, const char* method_name, + const char* argspec, ArgTypes... args) { + return PyObject_CallMethod(obj, const_cast(method_name), + const_cast(argspec), args...); } Status PythonFile::Close() { @@ -103,9 +99,7 @@ Status PythonFile::Tell(int64_t* position) { // ---------------------------------------------------------------------- // Seekable input stream -PyReadableFile::PyReadableFile(PyObject* file) { - file_.reset(new PythonFile(file)); -} +PyReadableFile::PyReadableFile(PyObject* file) { file_.reset(new PythonFile(file)); } PyReadableFile::~PyReadableFile() {} @@ -167,9 +161,7 @@ Status PyReadableFile::GetSize(int64_t* size) { return Status::OK(); } -bool PyReadableFile::supports_zero_copy() const { - return false; -} +bool PyReadableFile::supports_zero_copy() const { return false; } // ---------------------------------------------------------------------- // Output stream diff --git a/cpp/src/arrow/python/numpy_convert.cc b/cpp/src/arrow/python/numpy_convert.cc index c391b5d7a10..95d63b8fecb 100644 --- a/cpp/src/arrow/python/numpy_convert.cc +++ b/cpp/src/arrow/python/numpy_convert.cc @@ -38,7 +38,7 @@ namespace py { bool is_contiguous(PyObject* array) { if (PyArray_Check(array)) { return (PyArray_FLAGS(reinterpret_cast(array)) & - (NPY_ARRAY_C_CONTIGUOUS | NPY_ARRAY_F_CONTIGUOUS)) != 0; + (NPY_ARRAY_C_CONTIGUOUS | NPY_ARRAY_F_CONTIGUOUS)) != 0; } else { return false; } @@ -49,8 +49,12 @@ int cast_npy_type_compat(int type_num) { // U/LONGLONG to U/INT64 so things work properly. 
#if (NPY_INT64 == NPY_LONGLONG) && (NPY_SIZEOF_LONGLONG == 8) - if (type_num == NPY_LONGLONG) { type_num = NPY_INT64; } - if (type_num == NPY_ULONGLONG) { type_num = NPY_UINT64; } + if (type_num == NPY_LONGLONG) { + type_num = NPY_INT64; + } + if (type_num == NPY_ULONGLONG) { + type_num = NPY_UINT64; + } #endif return type_num; @@ -66,13 +70,13 @@ NumPyBuffer::NumPyBuffer(PyObject* ao) : Buffer(nullptr, 0) { size_ = PyArray_SIZE(ndarray) * PyArray_DESCR(ndarray)->elsize; capacity_ = size_; - if (PyArray_FLAGS(ndarray) & NPY_ARRAY_WRITEABLE) { is_mutable_ = true; } + if (PyArray_FLAGS(ndarray) & NPY_ARRAY_WRITEABLE) { + is_mutable_ = true; + } } } -NumPyBuffer::~NumPyBuffer() { - Py_XDECREF(arr_); -} +NumPyBuffer::~NumPyBuffer() { Py_XDECREF(arr_); } #define TO_ARROW_TYPE_CASE(NPY_NAME, FACTORY) \ case NPY_##NPY_NAME: \ @@ -198,7 +202,9 @@ Status NumPyDtypeToArrow(PyObject* dtype, std::shared_ptr* out) { #undef TO_ARROW_TYPE_CASE Status NdarrayToTensor(MemoryPool* pool, PyObject* ao, std::shared_ptr* out) { - if (!PyArray_Check(ao)) { return Status::TypeError("Did not pass ndarray object"); } + if (!PyArray_Check(ao)) { + return Status::TypeError("Did not pass ndarray object"); + } PyArrayObject* ndarray = reinterpret_cast(ao); @@ -242,18 +248,27 @@ Status TensorToNdarray(const Tensor& tensor, PyObject* base, PyObject** out) { } const void* immutable_data = nullptr; - if (tensor.data()) { immutable_data = tensor.data()->data(); } + if (tensor.data()) { + immutable_data = tensor.data()->data(); + } // Remove const =( void* mutable_data = const_cast(immutable_data); int array_flags = 0; - if (tensor.is_row_major()) { array_flags |= NPY_ARRAY_C_CONTIGUOUS; } - if (tensor.is_column_major()) { array_flags |= NPY_ARRAY_F_CONTIGUOUS; } - if (tensor.is_mutable()) { array_flags |= NPY_ARRAY_WRITEABLE; } + if (tensor.is_row_major()) { + array_flags |= NPY_ARRAY_C_CONTIGUOUS; + } + if (tensor.is_column_major()) { + array_flags |= NPY_ARRAY_F_CONTIGUOUS; + } + if (tensor.is_mutable()) { + array_flags |= NPY_ARRAY_WRITEABLE; + } - PyObject* result = PyArray_NewFromDescr(&PyArray_Type, dtype, tensor.ndim(), - npy_shape.data(), npy_strides.data(), mutable_data, array_flags, nullptr); + PyObject* result = + PyArray_NewFromDescr(&PyArray_Type, dtype, tensor.ndim(), npy_shape.data(), + npy_strides.data(), mutable_data, array_flags, nullptr); RETURN_IF_PYERROR() if (base != Py_None) { diff --git a/cpp/src/arrow/python/numpy_convert.h b/cpp/src/arrow/python/numpy_convert.h index a486646cdec..7b3b3b7c9a2 100644 --- a/cpp/src/arrow/python/numpy_convert.h +++ b/cpp/src/arrow/python/numpy_convert.h @@ -63,8 +63,8 @@ Status GetTensorType(PyObject* dtype, std::shared_ptr* out); ARROW_EXPORT Status GetNumPyType(const DataType& type, int* type_num); -ARROW_EXPORT Status NdarrayToTensor( - MemoryPool* pool, PyObject* ao, std::shared_ptr* out); +ARROW_EXPORT Status NdarrayToTensor(MemoryPool* pool, PyObject* ao, + std::shared_ptr* out); ARROW_EXPORT Status TensorToNdarray(const Tensor& tensor, PyObject* base, PyObject** out); diff --git a/cpp/src/arrow/python/pandas_to_arrow.cc b/cpp/src/arrow/python/pandas_to_arrow.cc index 1368c3605a4..2fbed1b8fdf 100644 --- a/cpp/src/arrow/python/pandas_to_arrow.cc +++ b/cpp/src/arrow/python/pandas_to_arrow.cc @@ -50,8 +50,14 @@ #include "arrow/python/util/datetime.h" namespace arrow { + +using internal::ArrayData; +using internal::MakeArray; + namespace py { +using internal::NumPyTypeSize; + // ---------------------------------------------------------------------- // Conversion 
utilities @@ -75,9 +81,7 @@ static inline bool PyObject_is_string(const PyObject* obj) { #endif } -static inline bool PyObject_is_float(const PyObject* obj) { - return PyFloat_Check(obj); -} +static inline bool PyObject_is_float(const PyObject* obj) { return PyFloat_Check(obj); } static inline bool PyObject_is_integer(const PyObject* obj) { return (!PyBool_Check(obj)) && PyArray_IsIntegerScalar(obj); @@ -85,7 +89,7 @@ static inline bool PyObject_is_integer(const PyObject* obj) { template static int64_t ValuesToBitmap(PyArrayObject* arr, uint8_t* bitmap) { - typedef npy_traits traits; + typedef internal::npy_traits traits; typedef typename traits::value_type T; int64_t null_count = 0; @@ -120,9 +124,9 @@ static int64_t MaskToBitmap(PyArrayObject* mask, int64_t length, uint8_t* bitmap } template -static int64_t ValuesToValidBytes( - const void* data, int64_t length, uint8_t* valid_bytes) { - typedef npy_traits traits; +static int64_t ValuesToValidBytes(const void* data, int64_t length, + uint8_t* valid_bytes) { + typedef internal::npy_traits traits; typedef typename traits::value_type T; int64_t null_count = 0; @@ -163,7 +167,8 @@ constexpr int64_t kBinaryMemoryLimit = std::numeric_limits::max(); /// be length of arr if fully consumed /// \param[out] have_bytes true if we encountered any PyBytes object static Status AppendObjectStrings(PyArrayObject* arr, PyArrayObject* mask, int64_t offset, - StringBuilder* builder, int64_t* end_offset, bool* have_bytes) { + StringBuilder* builder, int64_t* end_offset, + bool* have_bytes) { PyObject* obj; Ndarray1DIndexer objects(arr); @@ -210,8 +215,9 @@ static Status AppendObjectStrings(PyArrayObject* arr, PyArrayObject* mask, int64 } static Status AppendObjectFixedWidthBytes(PyArrayObject* arr, PyArrayObject* mask, - int byte_width, int64_t offset, FixedSizeBinaryBuilder* builder, - int64_t* end_offset) { + int byte_width, int64_t offset, + FixedSizeBinaryBuilder* builder, + int64_t* end_offset) { PyObject* obj; Ndarray1DIndexer objects(arr); @@ -245,8 +251,8 @@ static Status AppendObjectFixedWidthBytes(PyArrayObject* arr, PyArrayObject* mas } RETURN_NOT_OK(CheckPythonBytesAreFixedLength(obj, byte_width)); - if (ARROW_PREDICT_FALSE( - builder->value_data_length() + byte_width > kBinaryMemoryLimit)) { + if (ARROW_PREDICT_FALSE(builder->value_data_length() + byte_width > + kBinaryMemoryLimit)) { break; } RETURN_NOT_OK( @@ -263,13 +269,15 @@ static Status AppendObjectFixedWidthBytes(PyArrayObject* arr, PyArrayObject* mas class PandasConverter { public: - PandasConverter( - MemoryPool* pool, PyObject* ao, PyObject* mo, const std::shared_ptr& type) + PandasConverter(MemoryPool* pool, PyObject* ao, PyObject* mo, + const std::shared_ptr& type) : pool_(pool), type_(type), arr_(reinterpret_cast(ao)), mask_(nullptr) { - if (mo != nullptr && mo != Py_None) { mask_ = reinterpret_cast(mo); } + if (mo != nullptr && mo != Py_None) { + mask_ = reinterpret_cast(mo); + } length_ = static_cast(PyArray_SIZE(arr_)); } @@ -304,18 +312,20 @@ class PandasConverter { return Status::OK(); } - Status PushArray(const std::shared_ptr& data) { + Status PushArray(const std::shared_ptr& data) { std::shared_ptr result; - RETURN_NOT_OK(internal::MakeArray(data, &result)); + RETURN_NOT_OK(MakeArray(data, &result)); out_arrays_.emplace_back(std::move(result)); return Status::OK(); } template Status VisitNative() { - using traits = arrow_traits; + using traits = internal::arrow_traits; - if (mask_ != nullptr || traits::supports_nulls) { RETURN_NOT_OK(InitNullBitmap()); } + if (mask_ != 
nullptr || traits::supports_nulls) { + RETURN_NOT_OK(InitNullBitmap()); + } std::shared_ptr data; RETURN_NOT_OK(ConvertData(&data)); @@ -330,14 +340,14 @@ class PandasConverter { } BufferVector buffers = {null_bitmap_, data}; - return PushArray(std::make_shared( - type_, length_, std::move(buffers), null_count, 0)); + return PushArray( + std::make_shared(type_, length_, std::move(buffers), null_count, 0)); } template typename std::enable_if::value || std::is_same::value, - Status>::type + Status>::type Visit(const T& type) { return VisitNative(); } @@ -373,7 +383,9 @@ class PandasConverter { return Status::Invalid("only handle 1-dimensional arrays"); } - if (type_ == nullptr) { return Status::Invalid("Must pass data type"); } + if (type_ == nullptr) { + return Status::Invalid("Must pass data type"); + } // Visit the type to perform conversion return VisitTypeInline(*type_, this); @@ -385,8 +397,8 @@ class PandasConverter { // Conversion logic for various object dtype arrays template - Status ConvertTypedLists( - const std::shared_ptr& type, ListBuilder* builder, PyObject* list); + Status ConvertTypedLists(const std::shared_ptr& type, ListBuilder* builder, + PyObject* list); template Status ConvertDates(); @@ -397,8 +409,8 @@ class PandasConverter { Status ConvertObjectFixedWidthBytes(const std::shared_ptr& type); Status ConvertObjectIntegers(); Status ConvertLists(const std::shared_ptr& type); - Status ConvertLists( - const std::shared_ptr& type, ListBuilder* builder, PyObject* list); + Status ConvertLists(const std::shared_ptr& type, ListBuilder* builder, + PyObject* list); Status ConvertObjects(); Status ConvertDecimals(); Status ConvertTimes(); @@ -428,25 +440,27 @@ void CopyStrided(T* input_data, int64_t length, int64_t stride, T* output_data) } template <> -void CopyStrided( - PyObject** input_data, int64_t length, int64_t stride, PyObject** output_data) { +void CopyStrided(PyObject** input_data, int64_t length, int64_t stride, + PyObject** output_data) { int64_t j = 0; for (int64_t i = 0; i < length; ++i) { output_data[i] = input_data[j]; - if (output_data[i] != nullptr) { Py_INCREF(output_data[i]); } + if (output_data[i] != nullptr) { + Py_INCREF(output_data[i]); + } j += stride; } } template inline Status PandasConverter::ConvertData(std::shared_ptr* data) { - using traits = arrow_traits; + using traits = internal::arrow_traits; using T = typename traits::T; // Handle LONGLONG->INT64 and other fun things int type_num_compat = cast_npy_type_compat(PyArray_DESCR(arr_)->type_num); - if (numpy_type_size(traits::npy_type) != numpy_type_size(type_num_compat)) { + if (NumPyTypeSize(traits::npy_type) != NumPyTypeSize(type_num_compat)) { return Status::NotImplemented("NumPy type casts not yet implemented"); } @@ -458,7 +472,7 @@ inline Status PandasConverter::ConvertData(std::shared_ptr* data) { auto new_buffer = std::make_shared(pool_); RETURN_NOT_OK(new_buffer->Resize(sizeof(T) * length_)); CopyStrided(reinterpret_cast(PyArray_DATA(arr_)), length_, stride_elements, - reinterpret_cast(new_buffer->mutable_data())); + reinterpret_cast(new_buffer->mutable_data())); *data = new_buffer; } else { // Can zero-copy @@ -479,7 +493,9 @@ inline Status PandasConverter::ConvertData(std::shared_ptr* memset(bitmap, 0, nbytes); for (int i = 0; i < length_; ++i) { - if (values[i] > 0) { BitUtil::SetBit(bitmap, i); } + if (values[i] > 0) { + BitUtil::SetBit(bitmap, i); + } } *data = buffer; @@ -913,9 +929,9 @@ Status LoopPySequence(PyObject* sequence, T func) { } template -inline Status 
PandasConverter::ConvertTypedLists( - const std::shared_ptr& type, ListBuilder* builder, PyObject* list) { - typedef npy_traits traits; +inline Status PandasConverter::ConvertTypedLists(const std::shared_ptr& type, + ListBuilder* builder, PyObject* list) { + typedef internal::npy_traits traits; typedef typename traits::value_type T; typedef typename traits::BuilderClass BuilderT; @@ -1002,8 +1018,8 @@ inline Status PandasConverter::ConvertTypedLists( RETURN_NOT_OK(CheckFlatNumpyArray(numpy_array, NPY_OBJECT)); int64_t offset = 0; - RETURN_NOT_OK(AppendObjectStrings( - numpy_array, nullptr, 0, value_builder, &offset, &have_bytes)); + RETURN_NOT_OK(AppendObjectStrings(numpy_array, nullptr, 0, value_builder, &offset, + &have_bytes)); if (offset < PyArray_SIZE(numpy_array)) { return Status::Invalid("Array cell value exceeded 2GB"); } @@ -1032,8 +1048,8 @@ inline Status PandasConverter::ConvertTypedLists( return ConvertTypedLists(type, builder, list); \ } -Status PandasConverter::ConvertLists( - const std::shared_ptr& type, ListBuilder* builder, PyObject* list) { +Status PandasConverter::ConvertLists(const std::shared_ptr& type, + ListBuilder* builder, PyObject* list) { switch (type->id()) { LIST_CASE(UINT8, NPY_UINT8, UInt8Type) LIST_CASE(INT8, NPY_INT8, Int8Type) @@ -1080,7 +1096,7 @@ Status PandasConverter::ConvertLists(const std::shared_ptr& type) { } Status PandasToArrow(MemoryPool* pool, PyObject* ao, PyObject* mo, - const std::shared_ptr& type, std::shared_ptr* out) { + const std::shared_ptr& type, std::shared_ptr* out) { PandasConverter converter(pool, ao, mo, type); RETURN_NOT_OK(converter.Convert()); *out = converter.result()[0]; @@ -1088,7 +1104,8 @@ Status PandasToArrow(MemoryPool* pool, PyObject* ao, PyObject* mo, } Status PandasObjectsToArrow(MemoryPool* pool, PyObject* ao, PyObject* mo, - const std::shared_ptr& type, std::shared_ptr* out) { + const std::shared_ptr& type, + std::shared_ptr* out) { PandasConverter converter(pool, ao, mo, type); RETURN_NOT_OK(converter.ConvertObjects()); *out = std::make_shared(converter.result()); diff --git a/cpp/src/arrow/python/pandas_to_arrow.h b/cpp/src/arrow/python/pandas_to_arrow.h index 8f1862470bc..3e655ba3fee 100644 --- a/cpp/src/arrow/python/pandas_to_arrow.h +++ b/cpp/src/arrow/python/pandas_to_arrow.h @@ -38,7 +38,7 @@ namespace py { ARROW_EXPORT Status PandasToArrow(MemoryPool* pool, PyObject* ao, PyObject* mo, - const std::shared_ptr& type, std::shared_ptr* out); + const std::shared_ptr& type, std::shared_ptr* out); /// Convert dtype=object arrays. 
If target data type is not known, pass a type /// with nullptr @@ -50,7 +50,8 @@ Status PandasToArrow(MemoryPool* pool, PyObject* ao, PyObject* mo, /// \param[out] out a ChunkedArray, to accommodate chunked output ARROW_EXPORT Status PandasObjectsToArrow(MemoryPool* pool, PyObject* ao, PyObject* mo, - const std::shared_ptr& type, std::shared_ptr* out); + const std::shared_ptr& type, + std::shared_ptr* out); } // namespace py } // namespace arrow diff --git a/cpp/src/arrow/python/platform.h b/cpp/src/arrow/python/platform.h index a354b38f04c..ae394695fac 100644 --- a/cpp/src/arrow/python/platform.h +++ b/cpp/src/arrow/python/platform.h @@ -23,6 +23,7 @@ #include #include +#include // Work around C2528 error #if _MSC_VER >= 1900 diff --git a/cpp/src/arrow/python/pyarrow.cc b/cpp/src/arrow/python/pyarrow.cc index 5d88051117b..d080cc0a814 100644 --- a/cpp/src/arrow/python/pyarrow.cc +++ b/cpp/src/arrow/python/pyarrow.cc @@ -31,13 +31,9 @@ namespace { namespace arrow { namespace py { -int import_pyarrow() { - return ::import_pyarrow__lib(); -} +int import_pyarrow() { return ::import_pyarrow__lib(); } -bool is_buffer(PyObject* buffer) { - return ::pyarrow_is_buffer(buffer) != 0; -} +bool is_buffer(PyObject* buffer) { return ::pyarrow_is_buffer(buffer) != 0; } Status unwrap_buffer(PyObject* buffer, std::shared_ptr* out) { *out = ::pyarrow_unwrap_buffer(buffer); @@ -52,9 +48,7 @@ PyObject* wrap_buffer(const std::shared_ptr& buffer) { return ::pyarrow_wrap_buffer(buffer); } -bool is_data_type(PyObject* data_type) { - return ::pyarrow_is_data_type(data_type) != 0; -} +bool is_data_type(PyObject* data_type) { return ::pyarrow_is_data_type(data_type) != 0; } Status unwrap_data_type(PyObject* object, std::shared_ptr* out) { *out = ::pyarrow_unwrap_data_type(object); @@ -69,9 +63,7 @@ PyObject* wrap_data_type(const std::shared_ptr& type) { return ::pyarrow_wrap_data_type(type); } -bool is_field(PyObject* field) { - return ::pyarrow_is_field(field) != 0; -} +bool is_field(PyObject* field) { return ::pyarrow_is_field(field) != 0; } Status unwrap_field(PyObject* field, std::shared_ptr* out) { *out = ::pyarrow_unwrap_field(field); @@ -86,9 +78,7 @@ PyObject* wrap_field(const std::shared_ptr& field) { return ::pyarrow_wrap_field(field); } -bool is_schema(PyObject* schema) { - return ::pyarrow_is_schema(schema) != 0; -} +bool is_schema(PyObject* schema) { return ::pyarrow_is_schema(schema) != 0; } Status unwrap_schema(PyObject* schema, std::shared_ptr* out) { *out = ::pyarrow_unwrap_schema(schema); @@ -103,9 +93,7 @@ PyObject* wrap_schema(const std::shared_ptr& schema) { return ::pyarrow_wrap_schema(schema); } -bool is_array(PyObject* array) { - return ::pyarrow_is_array(array) != 0; -} +bool is_array(PyObject* array) { return ::pyarrow_is_array(array) != 0; } Status unwrap_array(PyObject* array, std::shared_ptr* out) { *out = ::pyarrow_unwrap_array(array); @@ -120,9 +108,7 @@ PyObject* wrap_array(const std::shared_ptr& array) { return ::pyarrow_wrap_array(array); } -bool is_tensor(PyObject* tensor) { - return ::pyarrow_is_tensor(tensor) != 0; -} +bool is_tensor(PyObject* tensor) { return ::pyarrow_is_tensor(tensor) != 0; } Status unwrap_tensor(PyObject* tensor, std::shared_ptr* out) { *out = ::pyarrow_unwrap_tensor(tensor); @@ -137,9 +123,7 @@ PyObject* wrap_tensor(const std::shared_ptr& tensor) { return ::pyarrow_wrap_tensor(tensor); } -bool is_column(PyObject* column) { - return ::pyarrow_is_column(column) != 0; -} +bool is_column(PyObject* column) { return ::pyarrow_is_column(column) != 0; } Status 
unwrap_column(PyObject* column, std::shared_ptr<Column>* out) { *out = ::pyarrow_unwrap_column(column); @@ -154,9 +138,7 @@ PyObject* wrap_column(const std::shared_ptr<Column>& column) { return ::pyarrow_wrap_column(column); } -bool is_table(PyObject* table) { - return ::pyarrow_is_table(table) != 0; -} +bool is_table(PyObject* table) { return ::pyarrow_is_table(table) != 0; } Status unwrap_table(PyObject* table, std::shared_ptr<Table>* out) { *out = ::pyarrow_unwrap_table(table); @@ -171,9 +153,7 @@ PyObject* wrap_table(const std::shared_ptr<Table>& table) { return ::pyarrow_wrap_table(table); } -bool is_record_batch(PyObject* batch) { - return ::pyarrow_is_batch(batch) != 0; -} +bool is_record_batch(PyObject* batch) { return ::pyarrow_is_batch(batch) != 0; } Status unwrap_record_batch(PyObject* batch, std::shared_ptr<RecordBatch>* out) { *out = ::pyarrow_unwrap_batch(batch); diff --git a/cpp/src/arrow/python/pyarrow.h b/cpp/src/arrow/python/pyarrow.h index 7278d1c2857..e6376270061 100644 --- a/cpp/src/arrow/python/pyarrow.h +++ b/cpp/src/arrow/python/pyarrow.h @@ -74,8 +74,8 @@ ARROW_EXPORT Status unwrap_table(PyObject* table, std::shared_ptr<Table>* out); ARROW_EXPORT PyObject* wrap_table(const std::shared_ptr<Table>
& table); ARROW_EXPORT bool is_record_batch(PyObject* batch); -ARROW_EXPORT Status unwrap_record_batch( - PyObject* batch, std::shared_ptr* out); +ARROW_EXPORT Status unwrap_record_batch(PyObject* batch, + std::shared_ptr* out); ARROW_EXPORT PyObject* wrap_record_batch(const std::shared_ptr& batch); } // namespace py diff --git a/cpp/src/arrow/python/python-test.cc b/cpp/src/arrow/python/python-test.cc index c0e555d4904..b50699d1ae9 100644 --- a/cpp/src/arrow/python/python-test.cc +++ b/cpp/src/arrow/python/python-test.cc @@ -36,9 +36,7 @@ namespace arrow { namespace py { -TEST(PyBuffer, InvalidInputObject) { - PyBuffer buffer(Py_None); -} +TEST(PyBuffer, InvalidInputObject) { PyBuffer buffer(Py_None); } TEST(DecimalTest, TestPythonDecimalToString) { PyAcquireGIL lock; @@ -58,8 +56,8 @@ TEST(DecimalTest, TestPythonDecimalToString) { auto c_string_size = decimal_string.size(); ASSERT_GT(c_string_size, 0); - OwnedRef pydecimal(PyObject_CallFunction( - Decimal.obj(), const_cast(format), c_string, c_string_size)); + OwnedRef pydecimal(PyObject_CallFunction(Decimal.obj(), const_cast(format), + c_string, c_string_size)); ASSERT_NE(pydecimal.obj(), nullptr); ASSERT_EQ(PyErr_Occurred(), nullptr); @@ -88,7 +86,8 @@ TEST(PandasConversionTest, TestObjectBlockWriteFails) { auto f3 = field("f3", utf8()); std::vector> fields = {f1, f2, f3}; std::vector> cols = {std::make_shared(f1, arr), - std::make_shared(f2, arr), std::make_shared(f3, arr)}; + std::make_shared(f2, arr), + std::make_shared(f3, arr)}; auto schema = std::make_shared(fields); auto table = std::make_shared
(schema, cols); diff --git a/cpp/src/arrow/python/type_traits.h b/cpp/src/arrow/python/type_traits.h index b6761ae0d26..2cbbdf4cf15 100644 --- a/cpp/src/arrow/python/type_traits.h +++ b/cpp/src/arrow/python/type_traits.h @@ -30,6 +30,7 @@ namespace arrow { namespace py { +namespace internal { template struct npy_traits {}; @@ -227,7 +228,7 @@ struct arrow_traits { static constexpr bool supports_nulls = true; }; -static inline int numpy_type_size(int npy_type) { +static inline int NumPyTypeSize(int npy_type) { switch (npy_type) { case NPY_BOOL: return 1; @@ -272,5 +273,6 @@ static inline int numpy_type_size(int npy_type) { return -1; } +} // namespace internal } // namespace py } // namespace arrow diff --git a/cpp/src/arrow/python/util/datetime.h b/cpp/src/arrow/python/util/datetime.h index d32421e8e36..de751510151 100644 --- a/cpp/src/arrow/python/util/datetime.h +++ b/cpp/src/arrow/python/util/datetime.h @@ -18,8 +18,8 @@ #ifndef PYARROW_UTIL_DATETIME_H #define PYARROW_UTIL_DATETIME_H -#include "arrow/python/platform.h" #include +#include "arrow/python/platform.h" namespace arrow { namespace py { @@ -31,8 +31,8 @@ static inline int64_t PyTime_to_us(PyObject* pytime) { PyDateTime_TIME_GET_MICROSECOND(pytime)); } -static inline Status PyTime_from_int( - int64_t val, const TimeUnit::type unit, PyObject** out) { +static inline Status PyTime_from_int(int64_t val, const TimeUnit::type unit, + PyObject** out) { int64_t hour = 0, minute = 0, second = 0, microsecond = 0; switch (unit) { case TimeUnit::NANO: @@ -65,7 +65,7 @@ static inline Status PyTime_from_int( break; } *out = PyTime_FromTime(static_cast(hour), static_cast(minute), - static_cast(second), static_cast(microsecond)); + static_cast(second), static_cast(microsecond)); return Status::OK(); } diff --git a/cpp/src/arrow/status.cc b/cpp/src/arrow/status.cc index 99897428eae..9b509b48351 100644 --- a/cpp/src/arrow/status.cc +++ b/cpp/src/arrow/status.cc @@ -33,7 +33,9 @@ void Status::CopyFrom(const State* state) { } std::string Status::CodeAsString() const { - if (state_ == NULL) { return "OK"; } + if (state_ == NULL) { + return "OK"; + } const char* type; switch (code()) { @@ -70,7 +72,9 @@ std::string Status::CodeAsString() const { std::string Status::ToString() const { std::string result(CodeAsString()); - if (state_ == NULL) { return result; } + if (state_ == NULL) { + return result; + } result += ": "; result += state_->msg; return result; diff --git a/cpp/src/arrow/status.h b/cpp/src/arrow/status.h index 1bea1fca84e..a02752f21e4 100644 --- a/cpp/src/arrow/status.h +++ b/cpp/src/arrow/status.h @@ -23,10 +23,12 @@ #include "arrow/util/visibility.h" // Return the given status if it is not OK. 
-#define ARROW_RETURN_NOT_OK(s) \ - do { \ - ::arrow::Status _s = (s); \ - if (ARROW_PREDICT_FALSE(!_s.ok())) { return _s; } \ +#define ARROW_RETURN_NOT_OK(s) \ + do { \ + ::arrow::Status _s = (s); \ + if (ARROW_PREDICT_FALSE(!_s.ok())) { \ + return _s; \ + } \ } while (0) // If 'to_call' returns a bad status, CHECK immediately with a logged message @@ -43,10 +45,12 @@ namespace arrow { -#define RETURN_NOT_OK(s) \ - do { \ - Status _s = (s); \ - if (ARROW_PREDICT_FALSE(!_s.ok())) { return _s; } \ +#define RETURN_NOT_OK(s) \ + do { \ + Status _s = (s); \ + if (ARROW_PREDICT_FALSE(!_s.ok())) { \ + return _s; \ + } \ } while (0) #define RETURN_NOT_OK_ELSE(s, else_) \ @@ -187,7 +191,9 @@ inline Status::Status(const Status& s) inline void Status::operator=(const Status& s) { // The following condition catches both aliasing (when this == &s), // and the common case where both s and *this are ok. - if (state_ != s.state_) { CopyFrom(s.state_); } + if (state_ != s.state_) { + CopyFrom(s.state_); + } } } // namespace arrow diff --git a/cpp/src/arrow/symbols.map b/cpp/src/arrow/symbols.map index cc8c9ba3c94..49511c6a749 100644 --- a/cpp/src/arrow/symbols.map +++ b/cpp/src/arrow/symbols.map @@ -1,14 +1,19 @@ -# Licensed under the Apache License, Version 2.0 (the "License"); -# you may not use this file except in compliance with the License. -# You may obtain a copy of the License at +# Licensed to the Apache Software Foundation (ASF) under one +# or more contributor license agreements. See the NOTICE file +# distributed with this work for additional information +# regarding copyright ownership. The ASF licenses this file +# to you under the Apache License, Version 2.0 (the +# "License"); you may not use this file except in compliance +# with the License. You may obtain a copy of the License at # # http://www.apache.org/licenses/LICENSE-2.0 # -# Unless required by applicable law or agreed to in writing, software -# distributed under the License is distributed on an "AS IS" BASIS, -# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. -# See the License for the specific language governing permissions and -# limitations under the License. See accompanying LICENSE file. +# Unless required by applicable law or agreed to in writing, +# software distributed under the License is distributed on an +# "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY +# KIND, either express or implied. See the License for the +# specific language governing permissions and limitations +# under the License. 
{ # Symbols marked as 'local' are not exported by the DSO and thus may not diff --git a/cpp/src/arrow/table-test.cc b/cpp/src/arrow/table-test.cc index e46fdc77cf7..8dba8c052e9 100644 --- a/cpp/src/arrow/table-test.cc +++ b/cpp/src/arrow/table-test.cc @@ -198,11 +198,11 @@ class TestTable : public TestBase { schema_ = std::make_shared(fields); arrays_ = {MakePrimitive(length), MakePrimitive(length), - MakePrimitive(length)}; + MakePrimitive(length)}; columns_ = {std::make_shared(schema_->field(0), arrays_[0]), - std::make_shared(schema_->field(1), arrays_[1]), - std::make_shared(schema_->field(2), arrays_[2])}; + std::make_shared(schema_->field(1), arrays_[1]), + std::make_shared(schema_->field(2), arrays_[2])}; } protected: @@ -412,8 +412,8 @@ TEST_F(TestTable, AddColumn) { ASSERT_OK(table.AddColumn(0, columns_[0], &result)); auto ex_schema = std::shared_ptr(new Schema( {schema_->field(0), schema_->field(0), schema_->field(1), schema_->field(2)})); - std::vector> ex_columns = { - table.column(0), table.column(0), table.column(1), table.column(2)}; + std::vector> ex_columns = {table.column(0), table.column(0), + table.column(1), table.column(2)}; ASSERT_TRUE(result->Equals(Table(ex_schema, ex_columns))); ASSERT_OK(table.AddColumn(1, columns_[0], &result)); diff --git a/cpp/src/arrow/table.cc b/cpp/src/arrow/table.cc index c09628ed395..665ce2d84de 100644 --- a/cpp/src/arrow/table.cc +++ b/cpp/src/arrow/table.cc @@ -43,8 +43,12 @@ ChunkedArray::ChunkedArray(const ArrayVector& chunks) : chunks_(chunks) { } bool ChunkedArray::Equals(const ChunkedArray& other) const { - if (length_ != other.length()) { return false; } - if (null_count_ != other.null_count()) { return false; } + if (length_ != other.length()) { + return false; + } + if (null_count_ != other.null_count()) { + return false; + } // Check contents of the underlying arrays. This checks for equality of // the underlying data independently of the chunk size. 
@@ -57,10 +61,10 @@ bool ChunkedArray::Equals(const ChunkedArray& other) const { while (elements_compared < length_) { const std::shared_ptr this_array = chunks_[this_chunk_idx]; const std::shared_ptr other_array = other.chunk(other_chunk_idx); - int64_t common_length = std::min( - this_array->length() - this_start_idx, other_array->length() - other_start_idx); + int64_t common_length = std::min(this_array->length() - this_start_idx, + other_array->length() - other_start_idx); if (!this_array->RangeEquals(this_start_idx, this_start_idx + common_length, - other_start_idx, other_array)) { + other_start_idx, other_array)) { return false; } @@ -85,8 +89,12 @@ bool ChunkedArray::Equals(const ChunkedArray& other) const { } bool ChunkedArray::Equals(const std::shared_ptr& other) const { - if (this == other.get()) { return true; } - if (!other) { return false; } + if (this == other.get()) { + return true; + } + if (!other) { + return false; + } return Equals(*other.get()); } @@ -107,18 +115,24 @@ Column::Column(const std::shared_ptr& field, const std::shared_ptr Column::Column(const std::string& name, const std::shared_ptr& data) : Column(::arrow::field(name, data->type()), data) {} -Column::Column( - const std::shared_ptr& field, const std::shared_ptr& data) +Column::Column(const std::shared_ptr& field, + const std::shared_ptr& data) : field_(field), data_(data) {} bool Column::Equals(const Column& other) const { - if (!field_->Equals(other.field())) { return false; } + if (!field_->Equals(other.field())) { + return false; + } return data_->Equals(other.data()); } bool Column::Equals(const std::shared_ptr& other) const { - if (this == other.get()) { return true; } - if (!other) { return false; } + if (this == other.get()) { + return true; + } + if (!other) { + return false; + } return Equals(*other.get()); } @@ -141,11 +155,13 @@ Status Column::ValidateData() { void AssertBatchValid(const RecordBatch& batch) { Status s = batch.Validate(); - if (!s.ok()) { DCHECK(false) << s.ToString(); } + if (!s.ok()) { + DCHECK(false) << s.ToString(); + } } RecordBatch::RecordBatch(const std::shared_ptr& schema, int64_t num_rows, - const std::vector>& columns) + const std::vector>& columns) : schema_(schema), num_rows_(num_rows), columns_(columns.size()) { for (size_t i = 0; i < columns.size(); ++i) { columns_[i] = columns[i]->data(); @@ -153,7 +169,7 @@ RecordBatch::RecordBatch(const std::shared_ptr& schema, int64_t num_rows } RecordBatch::RecordBatch(const std::shared_ptr& schema, int64_t num_rows, - std::vector>&& columns) + std::vector>&& columns) : schema_(schema), num_rows_(num_rows), columns_(columns.size()) { for (size_t i = 0; i < columns.size(); ++i) { columns_[i] = columns[i]->data(); @@ -161,11 +177,11 @@ RecordBatch::RecordBatch(const std::shared_ptr& schema, int64_t num_rows } RecordBatch::RecordBatch(const std::shared_ptr& schema, int64_t num_rows, - std::vector>&& columns) + std::vector>&& columns) : schema_(schema), num_rows_(num_rows), columns_(std::move(columns)) {} RecordBatch::RecordBatch(const std::shared_ptr& schema, int64_t num_rows, - const std::vector>& columns) + const std::vector>& columns) : schema_(schema), num_rows_(num_rows), columns_(columns) {} std::shared_ptr RecordBatch::column(int i) const { @@ -184,7 +200,9 @@ bool RecordBatch::Equals(const RecordBatch& other) const { } for (int i = 0; i < num_columns(); ++i) { - if (!column(i)->Equals(other.column(i))) { return false; } + if (!column(i)->Equals(other.column(i))) { + return false; + } } return true; @@ -196,7 +214,9 @@ bool 
RecordBatch::ApproxEquals(const RecordBatch& other) const { } for (int i = 0; i < num_columns(); ++i) { - if (!column(i)->ApproxEquals(other.column(i))) { return false; } + if (!column(i)->ApproxEquals(other.column(i))) { + return false; + } } return true; @@ -253,7 +273,7 @@ Status RecordBatch::Validate() const { // Table methods Table::Table(const std::shared_ptr& schema, - const std::vector>& columns) + const std::vector>& columns) : schema_(schema), columns_(columns) { if (columns.size() == 0) { num_rows_ = 0; @@ -263,7 +283,7 @@ Table::Table(const std::shared_ptr& schema, } Table::Table(const std::shared_ptr& schema, - const std::vector>& columns, int64_t num_rows) + const std::vector>& columns, int64_t num_rows) : schema_(schema), columns_(columns), num_rows_(num_rows) {} std::shared_ptr
<Table> Table::ReplaceSchemaMetadata( @@ -273,7 +293,7 @@ std::shared_ptr<Table> Table::ReplaceSchemaMetadata( } Status Table::FromRecordBatches(const std::vector<std::shared_ptr<RecordBatch>>& batches, - std::shared_ptr<Table>* table) { + std::shared_ptr<Table>
* table) { if (batches.size() == 0) { return Status::Invalid("Must pass at least one record batch"); } @@ -307,9 +327,11 @@ Status Table::FromRecordBatches(const std::vector<std::shared_ptr<RecordBatch>>& return Status::OK(); } -Status ConcatenateTables( - const std::vector<std::shared_ptr<Table>>& tables, std::shared_ptr<Table>* table) { - if (tables.size() == 0) { return Status::Invalid("Must pass at least one table"); } +Status ConcatenateTables(const std::vector<std::shared_ptr<Table>>& tables, + std::shared_ptr<Table>
* table) { + if (tables.size() == 0) { + return Status::Invalid("Must pass at least one table"); + } std::shared_ptr schema = tables[0]->schema(); @@ -343,12 +365,20 @@ Status ConcatenateTables( } bool Table::Equals(const Table& other) const { - if (this == &other) { return true; } - if (!schema_->Equals(*other.schema())) { return false; } - if (static_cast(columns_.size()) != other.num_columns()) { return false; } + if (this == &other) { + return true; + } + if (!schema_->Equals(*other.schema())) { + return false; + } + if (static_cast(columns_.size()) != other.num_columns()) { + return false; + } for (int i = 0; i < static_cast(columns_.size()); i++) { - if (!columns_[i]->Equals(other.column(i))) { return false; } + if (!columns_[i]->Equals(other.column(i))) { + return false; + } } return true; } @@ -357,13 +387,15 @@ Status Table::RemoveColumn(int i, std::shared_ptr
<Table>* out) const { std::shared_ptr<Schema> new_schema; RETURN_NOT_OK(schema_->RemoveField(i, &new_schema)); - *out = std::make_shared<Table>(new_schema, DeleteVectorElement(columns_, i)); + *out = std::make_shared<Table>(new_schema, internal::DeleteVectorElement(columns_, i)); return Status::OK(); } -Status Table::AddColumn( - int i, const std::shared_ptr<Column>& col, std::shared_ptr<Table>
* out) const { - if (i < 0 || i > num_columns() + 1) { return Status::Invalid("Invalid column index."); } +Status Table::AddColumn(int i, const std::shared_ptr<Column>& col, + std::shared_ptr<Table>* out) const { + if (i < 0 || i > num_columns() + 1) { + return Status::Invalid("Invalid column index."); + } if (col == nullptr) { std::stringstream ss; ss << "Column " << i << " was null"; @@ -379,7 +411,8 @@ Status Table::AddColumn( std::shared_ptr<Schema> new_schema; RETURN_NOT_OK(schema_->AddField(i, col->field(), &new_schema)); - *out = std::make_shared<Table>(new_schema, AddVectorElement(columns_, i, col)); + *out = + std::make_shared<Table>(new_schema, internal::AddVectorElement(columns_, i, col)); return Status::OK(); } @@ -407,7 +440,8 @@ Status Table::ValidateColumns() const { } Status ARROW_EXPORT MakeTable(const std::shared_ptr<Schema>& schema, - const std::vector<std::shared_ptr<Array>>& arrays, std::shared_ptr<Table>* table) { + const std::vector<std::shared_ptr<Array>>& arrays, + std::shared_ptr<Table>
* table) { // Make sure the length of the schema corresponds to the length of the vector if (schema->num_fields() != static_cast(arrays.size())) { std::stringstream ss; diff --git a/cpp/src/arrow/table.h b/cpp/src/arrow/table.h index 7ada0e9709f..6afd618da04 100644 --- a/cpp/src/arrow/table.h +++ b/cpp/src/arrow/table.h @@ -121,11 +121,11 @@ class ARROW_EXPORT RecordBatch { /// num_rows RecordBatch(const std::shared_ptr& schema, int64_t num_rows, - const std::vector>& columns); + const std::vector>& columns); /// \brief Deprecated move constructor for a vector of Array instances RecordBatch(const std::shared_ptr& schema, int64_t num_rows, - std::vector>&& columns); + std::vector>&& columns); /// \brief Construct record batch from vector of internal data structures /// \since 0.5.0 @@ -138,12 +138,12 @@ class ARROW_EXPORT RecordBatch { /// should be equal to the length of each field /// \param columns the data for the batch's columns RecordBatch(const std::shared_ptr& schema, int64_t num_rows, - std::vector>&& columns); + std::vector>&& columns); /// \brief Construct record batch by copying vector of array data /// \since 0.5.0 RecordBatch(const std::shared_ptr& schema, int64_t num_rows, - const std::vector>& columns); + const std::vector>& columns); bool Equals(const RecordBatch& other) const; @@ -194,14 +194,14 @@ class ARROW_EXPORT Table { public: // If columns is zero-length, the table's number of rows is zero Table(const std::shared_ptr& schema, - const std::vector>& columns); + const std::vector>& columns); // num_rows is a parameter to allow for tables of a particular size not // having any materialized columns. Each column should therefore have the // same length as num_rows -- you can validate this using // Table::ValidateColumns Table(const std::shared_ptr& schema, - const std::vector>& columns, int64_t num_rows); + const std::vector>& columns, int64_t num_rows); // Construct table from RecordBatch, but only if all of the batch schemas are // equal. Returns Status::Invalid if there is some problem @@ -221,8 +221,8 @@ class ARROW_EXPORT Table { Status RemoveColumn(int i, std::shared_ptr
<Table>* out) const; /// Add column to the table, producing a new Table - Status AddColumn( - int i, const std::shared_ptr<Column>& column, std::shared_ptr<Table>* out) const; + Status AddColumn(int i, const std::shared_ptr<Column>& column, + std::shared_ptr<Table>
* out) const; /// \brief Replace schema key-value metadata with new metadata (EXPERIMENTAL) /// \since 0.5.0 @@ -252,11 +252,12 @@ class ARROW_EXPORT Table { // Construct table from multiple input tables. Return Status::Invalid if // schemas are not equal -Status ARROW_EXPORT ConcatenateTables( - const std::vector>& tables, std::shared_ptr
<Table>* table); +Status ARROW_EXPORT ConcatenateTables(const std::vector<std::shared_ptr<Table>>& tables, + std::shared_ptr<Table>* table); Status ARROW_EXPORT MakeTable(const std::shared_ptr<Schema>& schema, - const std::vector<std::shared_ptr<Array>>& arrays, std::shared_ptr<Table>* table); + const std::vector<std::shared_ptr<Array>>& arrays, + std::shared_ptr<Table>
* table); } // namespace arrow diff --git a/cpp/src/arrow/tensor.cc b/cpp/src/arrow/tensor.cc index bcd9d8d94c6..31b1a359219 100644 --- a/cpp/src/arrow/tensor.cc +++ b/cpp/src/arrow/tensor.cc @@ -35,7 +35,8 @@ namespace arrow { static void ComputeRowMajorStrides(const FixedWidthType& type, - const std::vector& shape, std::vector* strides) { + const std::vector& shape, + std::vector* strides) { int64_t remaining = type.bit_width() / 8; for (int64_t dimsize : shape) { remaining *= dimsize; @@ -53,7 +54,8 @@ static void ComputeRowMajorStrides(const FixedWidthType& type, } static void ComputeColumnMajorStrides(const FixedWidthType& type, - const std::vector& shape, std::vector* strides) { + const std::vector& shape, + std::vector* strides) { int64_t total = type.bit_width() / 8; for (int64_t dimsize : shape) { if (dimsize == 0) { @@ -69,8 +71,8 @@ static void ComputeColumnMajorStrides(const FixedWidthType& type, /// Constructor with strides and dimension names Tensor::Tensor(const std::shared_ptr& type, const std::shared_ptr& data, - const std::vector& shape, const std::vector& strides, - const std::vector& dim_names) + const std::vector& shape, const std::vector& strides, + const std::vector& dim_names) : type_(type), data_(data), shape_(shape), strides_(strides), dim_names_(dim_names) { DCHECK(is_tensor_supported(type->id())); if (shape.size() > 0 && strides.size() == 0) { @@ -79,11 +81,11 @@ Tensor::Tensor(const std::shared_ptr& type, const std::shared_ptr& type, const std::shared_ptr& data, - const std::vector& shape, const std::vector& strides) + const std::vector& shape, const std::vector& strides) : Tensor(type, data, shape, strides, {}) {} Tensor::Tensor(const std::shared_ptr& type, const std::shared_ptr& data, - const std::vector& shape) + const std::vector& shape) : Tensor(type, data, shape, {}, {}) {} const std::string& Tensor::dim_name(int i) const { @@ -100,9 +102,7 @@ int64_t Tensor::size() const { return std::accumulate(shape_.begin(), shape_.end(), 1LL, std::multiplies()); } -bool Tensor::is_contiguous() const { - return is_row_major() || is_column_major(); -} +bool Tensor::is_contiguous() const { return is_row_major() || is_column_major(); } bool Tensor::is_row_major() const { std::vector c_strides; @@ -118,14 +118,14 @@ bool Tensor::is_column_major() const { return strides_ == f_strides; } -Type::type Tensor::type_id() const { - return type_->id(); -} +Type::type Tensor::type_id() const { return type_->id(); } bool Tensor::Equals(const Tensor& other) const { bool are_equal = false; Status error = TensorEquals(*this, other, &are_equal); - if (!error.ok()) { DCHECK(false) << "Tensors not comparable: " << error.ToString(); } + if (!error.ok()) { + DCHECK(false) << "Tensors not comparable: " << error.ToString(); + } return are_equal; } diff --git a/cpp/src/arrow/tensor.h b/cpp/src/arrow/tensor.h index 371f5911a43..b074b8c309b 100644 --- a/cpp/src/arrow/tensor.h +++ b/cpp/src/arrow/tensor.h @@ -62,16 +62,16 @@ class ARROW_EXPORT Tensor { /// Constructor with no dimension names or strides, data assumed to be row-major Tensor(const std::shared_ptr& type, const std::shared_ptr& data, - const std::vector& shape); + const std::vector& shape); /// Constructor with non-negative strides Tensor(const std::shared_ptr& type, const std::shared_ptr& data, - const std::vector& shape, const std::vector& strides); + const std::vector& shape, const std::vector& strides); /// Constructor with strides and dimension names Tensor(const std::shared_ptr& type, const std::shared_ptr& data, - const 
std::vector& shape, const std::vector& strides, - const std::vector& dim_names); + const std::vector& shape, const std::vector& strides, + const std::vector& dim_names); std::shared_ptr type() const { return type_; } std::shared_ptr data() const { return data_; } diff --git a/cpp/src/arrow/test-common.h b/cpp/src/arrow/test-common.h index b3e5af86d4b..4ce06408d17 100644 --- a/cpp/src/arrow/test-common.h +++ b/cpp/src/arrow/test-common.h @@ -70,7 +70,7 @@ class TestBuilder : public ::testing::Test { public: void SetUp() { pool_ = default_memory_pool(); - type_ = TypePtr(new UInt8Type()); + type_ = uint8(); builder_.reset(new UInt8Builder(pool_)); builder_nn_.reset(new UInt8Builder(pool_)); } @@ -78,7 +78,7 @@ class TestBuilder : public ::testing::Test { protected: MemoryPool* pool_; - TypePtr type_; + std::shared_ptr type_; std::unique_ptr builder_; std::unique_ptr builder_nn_; }; diff --git a/cpp/src/arrow/test-util.h b/cpp/src/arrow/test-util.h index 2bc66252671..1a3376cee60 100644 --- a/cpp/src/arrow/test-util.h +++ b/cpp/src/arrow/test-util.h @@ -39,16 +39,20 @@ #include "arrow/util/logging.h" #include "arrow/util/random.h" -#define ASSERT_RAISES(ENUM, expr) \ - do { \ - ::arrow::Status s = (expr); \ - if (!s.Is##ENUM()) { FAIL() << s.ToString(); } \ +#define ASSERT_RAISES(ENUM, expr) \ + do { \ + ::arrow::Status s = (expr); \ + if (!s.Is##ENUM()) { \ + FAIL() << s.ToString(); \ + } \ } while (0) -#define ASSERT_OK(expr) \ - do { \ - ::arrow::Status s = (expr); \ - if (!s.ok()) { FAIL() << s.ToString(); } \ +#define ASSERT_OK(expr) \ + do { \ + ::arrow::Status s = (expr); \ + if (!s.ok()) { \ + FAIL() << s.ToString(); \ + } \ } while (0) #define ASSERT_OK_NO_THROW(expr) ASSERT_NO_THROW(ASSERT_OK(expr)) @@ -59,10 +63,12 @@ EXPECT_TRUE(s.ok()); \ } while (0) -#define ABORT_NOT_OK(s) \ - do { \ - ::arrow::Status _s = (s); \ - if (ARROW_PREDICT_FALSE(!_s.ok())) { exit(-1); } \ +#define ABORT_NOT_OK(s) \ + do { \ + ::arrow::Status _s = (s); \ + if (ARROW_PREDICT_FALSE(!_s.ok())) { \ + exit(-1); \ + } \ } while (0); namespace arrow { @@ -85,8 +91,8 @@ void randint(int64_t N, T lower, T upper, std::vector* out) { } template -void random_real( - int64_t n, uint32_t seed, T min_value, T max_value, std::vector* out) { +void random_real(int64_t n, uint32_t seed, T min_value, T max_value, + std::vector* out) { std::mt19937 gen(seed); std::uniform_real_distribution d(min_value, max_value); for (int64_t i = 0; i < n; ++i) { @@ -96,13 +102,13 @@ void random_real( template std::shared_ptr GetBufferFromVector(const std::vector& values) { - return std::make_shared( - reinterpret_cast(values.data()), values.size() * sizeof(T)); + return std::make_shared(reinterpret_cast(values.data()), + values.size() * sizeof(T)); } template -inline Status CopyBufferFromVector( - const std::vector& values, MemoryPool* pool, std::shared_ptr* result) { +inline Status CopyBufferFromVector(const std::vector& values, MemoryPool* pool, + std::shared_ptr* result) { int64_t nbytes = static_cast(values.size()) * sizeof(T); auto buffer = std::make_shared(pool); @@ -114,8 +120,8 @@ inline Status CopyBufferFromVector( } template -static inline Status GetBitmapFromVector( - const std::vector& is_valid, std::shared_ptr* result) { +static inline Status GetBitmapFromVector(const std::vector& is_valid, + std::shared_ptr* result) { size_t length = is_valid.size(); std::shared_ptr buffer; @@ -123,7 +129,9 @@ static inline Status GetBitmapFromVector( uint8_t* bitmap = buffer->mutable_data(); for (size_t i = 0; i < static_cast(length); ++i) 
{ - if (is_valid[i]) { BitUtil::SetBit(bitmap, i); } + if (is_valid[i]) { + BitUtil::SetBit(bitmap, i); + } } *result = buffer; @@ -139,8 +147,8 @@ static inline void random_null_bytes(int64_t n, double pct_null, uint8_t* null_b } } -static inline void random_is_valid( - int64_t n, double pct_null, std::vector* is_valid) { +static inline void random_is_valid(int64_t n, double pct_null, + std::vector* is_valid) { Random rng(random_seed()); for (int64_t i = 0; i < n; ++i) { is_valid->push_back(rng.NextDoubleFraction() > pct_null); @@ -178,24 +186,28 @@ void rand_uniform_int(int64_t n, uint32_t seed, T min_value, T max_value, T* out static inline int64_t null_count(const std::vector& valid_bytes) { int64_t result = 0; for (size_t i = 0; i < valid_bytes.size(); ++i) { - if (valid_bytes[i] == 0) { ++result; } + if (valid_bytes[i] == 0) { + ++result; + } } return result; } Status MakeRandomInt32PoolBuffer(int64_t length, MemoryPool* pool, - std::shared_ptr* pool_buffer, uint32_t seed = 0) { + std::shared_ptr* pool_buffer, + uint32_t seed = 0) { DCHECK(pool); auto data = std::make_shared(pool); RETURN_NOT_OK(data->Resize(length * sizeof(int32_t))); test::rand_uniform_int(length, seed, 0, std::numeric_limits::max(), - reinterpret_cast(data->mutable_data())); + reinterpret_cast(data->mutable_data())); *pool_buffer = data; return Status::OK(); } Status MakeRandomBytePoolBuffer(int64_t length, MemoryPool* pool, - std::shared_ptr* pool_buffer, uint32_t seed = 0) { + std::shared_ptr* pool_buffer, + uint32_t seed = 0) { auto bytes = std::make_shared(pool); RETURN_NOT_OK(bytes->Resize(length)); test::random_bytes(length, seed, bytes->mutable_data()); @@ -207,8 +219,8 @@ Status MakeRandomBytePoolBuffer(int64_t length, MemoryPool* pool, template void ArrayFromVector(const std::shared_ptr& type, - const std::vector& is_valid, const std::vector& values, - std::shared_ptr* out) { + const std::vector& is_valid, const std::vector& values, + std::shared_ptr* out) { MemoryPool* pool = default_memory_pool(); typename TypeTraits::BuilderType builder(pool, type); for (size_t i = 0; i < values.size(); ++i) { @@ -223,7 +235,7 @@ void ArrayFromVector(const std::shared_ptr& type, template void ArrayFromVector(const std::vector& is_valid, const std::vector& values, - std::shared_ptr* out) { + std::shared_ptr* out) { MemoryPool* pool = default_memory_pool(); typename TypeTraits::BuilderType builder(pool); for (size_t i = 0; i < values.size(); ++i) { @@ -248,7 +260,7 @@ void ArrayFromVector(const std::vector& values, std::shared_ptr* template Status MakeArray(const std::vector& valid_bytes, const std::vector& values, - int64_t size, Builder* builder, std::shared_ptr* out) { + int64_t size, Builder* builder, std::shared_ptr* out) { // Append the first 1000 for (int64_t i = 0; i < size; ++i) { if (valid_bytes[i] > 0) { diff --git a/cpp/src/arrow/type-test.cc b/cpp/src/arrow/type-test.cc index 7f3adef6337..6b86b4d2f10 100644 --- a/cpp/src/arrow/type-test.cc +++ b/cpp/src/arrow/type-test.cc @@ -345,16 +345,16 @@ TEST(TestTimestampType, ToString) { } TEST(TestNestedType, Equals) { - auto create_struct = []( - std::string inner_name, std::string struct_name) -> shared_ptr { + auto create_struct = [](std::string inner_name, + std::string struct_name) -> shared_ptr { auto f_type = field(inner_name, int32()); vector> fields = {f_type}; auto s_type = std::make_shared(fields); return field(struct_name, s_type); }; - auto create_union = []( - std::string inner_name, std::string union_name) -> shared_ptr { + auto create_union = 
[](std::string inner_name, + std::string union_name) -> shared_ptr { auto f_type = field(inner_name, int32()); vector> fields = {f_type}; vector codes = {Type::INT32}; diff --git a/cpp/src/arrow/type.cc b/cpp/src/arrow/type.cc index 623c1934f87..b8489d44cdb 100644 --- a/cpp/src/arrow/type.cc +++ b/cpp/src/arrow/type.cc @@ -37,7 +37,7 @@ std::shared_ptr Field::AddMetadata( } Status Field::AddMetadata(const std::shared_ptr& metadata, - std::shared_ptr* out) const { + std::shared_ptr* out) const { *out = AddMetadata(metadata); return Status::OK(); } @@ -47,7 +47,9 @@ std::shared_ptr Field::RemoveMetadata() const { } bool Field::Equals(const Field& other) const { - if (this == &other) { return true; } + if (this == &other) { + return true; + } if (this->name_ == other.name_ && this->nullable_ == other.nullable_ && this->type_->Equals(*other.type_.get())) { if (metadata_ == nullptr && other.metadata_ == nullptr) { @@ -68,7 +70,9 @@ bool Field::Equals(const std::shared_ptr& other) const { std::string Field::ToString() const { std::stringstream ss; ss << this->name_ << ": " << this->type_->ToString(); - if (!this->nullable_) { ss << " not null"; } + if (!this->nullable_) { + ss << " not null"; + } return ss.str(); } @@ -77,34 +81,28 @@ DataType::~DataType() {} bool DataType::Equals(const DataType& other) const { bool are_equal = false; Status error = TypeEquals(*this, other, &are_equal); - if (!error.ok()) { DCHECK(false) << "Types not comparable: " << error.ToString(); } + if (!error.ok()) { + DCHECK(false) << "Types not comparable: " << error.ToString(); + } return are_equal; } bool DataType::Equals(const std::shared_ptr& other) const { - if (!other) { return false; } + if (!other) { + return false; + } return Equals(*other.get()); } -std::string BooleanType::ToString() const { - return name(); -} +std::string BooleanType::ToString() const { return name(); } -FloatingPoint::Precision HalfFloatType::precision() const { - return FloatingPoint::HALF; -} +FloatingPoint::Precision HalfFloatType::precision() const { return FloatingPoint::HALF; } -FloatingPoint::Precision FloatType::precision() const { - return FloatingPoint::SINGLE; -} +FloatingPoint::Precision FloatType::precision() const { return FloatingPoint::SINGLE; } -FloatingPoint::Precision DoubleType::precision() const { - return FloatingPoint::DOUBLE; -} +FloatingPoint::Precision DoubleType::precision() const { return FloatingPoint::DOUBLE; } -std::string StringType::ToString() const { - return std::string("string"); -} +std::string StringType::ToString() const { return std::string("string"); } std::string ListType::ToString() const { std::stringstream s; @@ -112,13 +110,9 @@ std::string ListType::ToString() const { return s.str(); } -std::string BinaryType::ToString() const { - return std::string("binary"); -} +std::string BinaryType::ToString() const { return std::string("binary"); } -int FixedSizeBinaryType::bit_width() const { - return CHAR_BIT * byte_width(); -} +int FixedSizeBinaryType::bit_width() const { return CHAR_BIT * byte_width(); } std::string FixedSizeBinaryType::ToString() const { std::stringstream ss; @@ -130,7 +124,9 @@ std::string StructType::ToString() const { std::stringstream s; s << "struct<"; for (int i = 0; i < this->num_children(); ++i) { - if (i > 0) { s << ", "; } + if (i > 0) { + s << ", "; + } std::shared_ptr field = this->child(i); s << field->name() << ": " << field->type()->ToString(); } @@ -148,13 +144,9 @@ Date32Type::Date32Type() : DateType(Type::DATE32, DateUnit::DAY) {} Date64Type::Date64Type() : 
DateType(Type::DATE64, DateUnit::MILLI) {} -std::string Date64Type::ToString() const { - return std::string("date64[ms]"); -} +std::string Date64Type::ToString() const { return std::string("date64[ms]"); } -std::string Date32Type::ToString() const { - return std::string("date32[day]"); -} +std::string Date32Type::ToString() const { return std::string("date32[day]"); } // ---------------------------------------------------------------------- // Time types @@ -190,7 +182,9 @@ std::string Time64Type::ToString() const { std::string TimestampType::ToString() const { std::stringstream ss; ss << "timestamp[" << this->unit_; - if (this->timezone_.size() > 0) { ss << ", tz=" << this->timezone_; } + if (this->timezone_.size() > 0) { + ss << ", tz=" << this->timezone_; + } ss << "]"; return ss.str(); } @@ -199,7 +193,7 @@ std::string TimestampType::ToString() const { // Union type UnionType::UnionType(const std::vector>& fields, - const std::vector& type_codes, UnionMode mode) + const std::vector& type_codes, UnionMode mode) : NestedType(Type::UNION), mode_(mode), type_codes_(type_codes) { children_ = fields; } @@ -214,7 +208,9 @@ std::string UnionType::ToString() const { } for (size_t i = 0; i < children_.size(); ++i) { - if (i) { s << ", "; } + if (i) { + s << ", "; + } s << children_[i]->ToString() << "=" << static_cast(type_codes_[i]); } s << ">"; @@ -225,7 +221,7 @@ std::string UnionType::ToString() const { // DictionaryType DictionaryType::DictionaryType(const std::shared_ptr& index_type, - const std::shared_ptr& dictionary, bool ordered) + const std::shared_ptr& dictionary, bool ordered) : FixedWidthType(Type::DICTIONARY), index_type_(index_type), dictionary_(dictionary), @@ -235,9 +231,7 @@ int DictionaryType::bit_width() const { return static_cast(index_type_.get())->bit_width(); } -std::shared_ptr DictionaryType::dictionary() const { - return dictionary_; -} +std::shared_ptr DictionaryType::dictionary() const { return dictionary_; } std::string DictionaryType::ToString() const { std::stringstream ss; @@ -249,23 +243,27 @@ std::string DictionaryType::ToString() const { // ---------------------------------------------------------------------- // Null type -std::string NullType::ToString() const { - return name(); -} +std::string NullType::ToString() const { return name(); } // ---------------------------------------------------------------------- // Schema implementation Schema::Schema(const std::vector>& fields, - const std::shared_ptr& metadata) + const std::shared_ptr& metadata) : fields_(fields), metadata_(metadata) {} bool Schema::Equals(const Schema& other) const { - if (this == &other) { return true; } + if (this == &other) { + return true; + } - if (num_fields() != other.num_fields()) { return false; } + if (num_fields() != other.num_fields()) { + return false; + } for (int i = 0; i < num_fields(); ++i) { - if (!field(i)->Equals(*other.field(i).get())) { return false; } + if (!field(i)->Equals(*other.field(i).get())) { + return false; + } } return true; } @@ -290,12 +288,13 @@ int64_t Schema::GetFieldIndex(const std::string& name) const { } } -Status Schema::AddField( - int i, const std::shared_ptr& field, std::shared_ptr* out) const { +Status Schema::AddField(int i, const std::shared_ptr& field, + std::shared_ptr* out) const { DCHECK_GE(i, 0); DCHECK_LE(i, this->num_fields()); - *out = std::make_shared(AddVectorElement(fields_, i, field), metadata_); + *out = + std::make_shared(internal::AddVectorElement(fields_, i, field), metadata_); return Status::OK(); } @@ -305,7 +304,7 @@ 
std::shared_ptr Schema::AddMetadata( } Status Schema::AddMetadata(const std::shared_ptr& metadata, - std::shared_ptr* out) const { + std::shared_ptr* out) const { *out = AddMetadata(metadata); return Status::OK(); } @@ -318,7 +317,7 @@ Status Schema::RemoveField(int i, std::shared_ptr* out) const { DCHECK_GE(i, 0); DCHECK_LT(i, this->num_fields()); - *out = std::make_shared(DeleteVectorElement(fields_, i), metadata_); + *out = std::make_shared(internal::DeleteVectorElement(fields_, i), metadata_); return Status::OK(); } @@ -327,7 +326,9 @@ std::string Schema::ToString() const { int i = 0; for (auto field : fields_) { - if (i > 0) { buffer << std::endl; } + if (i > 0) { + buffer << std::endl; + } buffer << field->ToString(); ++i; } @@ -422,18 +423,18 @@ std::shared_ptr struct_(const std::vector>& fie } std::shared_ptr union_(const std::vector>& child_fields, - const std::vector& type_codes, UnionMode mode) { + const std::vector& type_codes, UnionMode mode) { return std::make_shared(child_fields, type_codes, mode); } std::shared_ptr dictionary(const std::shared_ptr& index_type, - const std::shared_ptr& dict_values) { + const std::shared_ptr& dict_values) { return std::make_shared(index_type, dict_values); } std::shared_ptr field(const std::string& name, - const std::shared_ptr& type, bool nullable, - const std::shared_ptr& metadata) { + const std::shared_ptr& type, bool nullable, + const std::shared_ptr& metadata) { return std::make_shared(name, type, nullable, metadata); } @@ -454,9 +455,7 @@ std::vector FixedWidthType::GetBufferLayout() const { return {kValidityBuffer, BufferDescr(BufferType::DATA, bit_width())}; } -std::vector NullType::GetBufferLayout() const { - return {}; -} +std::vector NullType::GetBufferLayout() const { return {}; } std::vector BinaryType::GetBufferLayout() const { return {kValidityBuffer, kOffsetBuffer, kValues8}; @@ -474,9 +473,7 @@ std::vector ListType::GetBufferLayout() const { return {kValidityBuffer, kOffsetBuffer}; } -std::vector StructType::GetBufferLayout() const { - return {kValidityBuffer}; -} +std::vector StructType::GetBufferLayout() const { return {kValidityBuffer}; } std::vector UnionType::GetBufferLayout() const { if (mode_ == UnionMode::SPARSE) { diff --git a/cpp/src/arrow/type.h b/cpp/src/arrow/type.h index fffb840e3ce..45d97fdb32b 100644 --- a/cpp/src/arrow/type.h +++ b/cpp/src/arrow/type.h @@ -162,7 +162,8 @@ class ARROW_EXPORT DataType { DISALLOW_COPY_AND_ASSIGN(DataType); }; -typedef std::shared_ptr TypePtr; +// TODO(wesm): Remove this from parquet-cpp +using TypePtr = std::shared_ptr; class ARROW_EXPORT FixedWidthType : public DataType { public: @@ -204,15 +205,15 @@ class NoExtraMeta {}; class ARROW_EXPORT Field { public: Field(const std::string& name, const std::shared_ptr& type, - bool nullable = true, - const std::shared_ptr& metadata = nullptr) + bool nullable = true, + const std::shared_ptr& metadata = nullptr) : name_(name), type_(type), nullable_(nullable), metadata_(metadata) {} std::shared_ptr metadata() const { return metadata_; } /// \deprecated Status AddMetadata(const std::shared_ptr& metadata, - std::shared_ptr* out) const; + std::shared_ptr* out) const; std::shared_ptr AddMetadata( const std::shared_ptr& metadata) const; @@ -241,7 +242,7 @@ class ARROW_EXPORT Field { std::shared_ptr metadata_; }; -typedef std::shared_ptr FieldPtr; +namespace detail { template class ARROW_EXPORT CTypeImpl : public BASE { @@ -260,6 +261,13 @@ class ARROW_EXPORT CTypeImpl : public BASE { std::string ToString() const override { return 
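The Schema::AddField and factory-function hunks above keep the Status-plus-out-parameter convention. A minimal sketch of that API (not part of the patch; it assumes the 0.5-era arrow/type.h and arrow/status.h headers):

#include <iostream>
#include <memory>
#include <vector>

#include "arrow/status.h"
#include "arrow/type.h"

int main() {
  // Build a two-field schema with the field() factory reformatted above.
  auto schema = std::make_shared<arrow::Schema>(
      std::vector<std::shared_ptr<arrow::Field>>{
          arrow::field("id", arrow::int64(), /*nullable=*/false),
          arrow::field("value", arrow::float64())});

  // AddField returns Status and writes the enlarged schema through an
  // out-parameter, as in the signature change above.
  std::shared_ptr<arrow::Schema> with_ts;
  arrow::Status st = schema->AddField(
      1, arrow::field("ts", arrow::timestamp(arrow::TimeUnit::MILLI)), &with_ts);
  if (!st.ok()) {
    std::cerr << st.ToString() << std::endl;
    return 1;
  }
  std::cout << with_ts->ToString() << std::endl;
  return 0;
}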
std::string(DERIVED::name()); } }; +template +class IntegerTypeImpl : public detail::CTypeImpl { + bool is_signed() const override { return std::is_signed::value; } +}; + +} // namespace detail + class ARROW_EXPORT NullType : public DataType, public NoExtraMeta { public: static constexpr Type::type type_id = Type::NA; @@ -274,11 +282,6 @@ class ARROW_EXPORT NullType : public DataType, public NoExtraMeta { std::vector GetBufferLayout() const override; }; -template -class IntegerTypeImpl : public CTypeImpl { - bool is_signed() const override { return std::is_signed::value; } -}; - class ARROW_EXPORT BooleanType : public FixedWidthType, public NoExtraMeta { public: static constexpr Type::type type_id = Type::BOOL; @@ -292,65 +295,70 @@ class ARROW_EXPORT BooleanType : public FixedWidthType, public NoExtraMeta { static std::string name() { return "bool"; } }; -class ARROW_EXPORT UInt8Type : public IntegerTypeImpl { +class ARROW_EXPORT UInt8Type + : public detail::IntegerTypeImpl { public: static std::string name() { return "uint8"; } }; -class ARROW_EXPORT Int8Type : public IntegerTypeImpl { +class ARROW_EXPORT Int8Type + : public detail::IntegerTypeImpl { public: static std::string name() { return "int8"; } }; class ARROW_EXPORT UInt16Type - : public IntegerTypeImpl { + : public detail::IntegerTypeImpl { public: static std::string name() { return "uint16"; } }; -class ARROW_EXPORT Int16Type : public IntegerTypeImpl { +class ARROW_EXPORT Int16Type + : public detail::IntegerTypeImpl { public: static std::string name() { return "int16"; } }; class ARROW_EXPORT UInt32Type - : public IntegerTypeImpl { + : public detail::IntegerTypeImpl { public: static std::string name() { return "uint32"; } }; -class ARROW_EXPORT Int32Type : public IntegerTypeImpl { +class ARROW_EXPORT Int32Type + : public detail::IntegerTypeImpl { public: static std::string name() { return "int32"; } }; class ARROW_EXPORT UInt64Type - : public IntegerTypeImpl { + : public detail::IntegerTypeImpl { public: static std::string name() { return "uint64"; } }; -class ARROW_EXPORT Int64Type : public IntegerTypeImpl { +class ARROW_EXPORT Int64Type + : public detail::IntegerTypeImpl { public: static std::string name() { return "int64"; } }; class ARROW_EXPORT HalfFloatType - : public CTypeImpl { + : public detail::CTypeImpl { public: Precision precision() const override; static std::string name() { return "halffloat"; } }; class ARROW_EXPORT FloatType - : public CTypeImpl { + : public detail::CTypeImpl { public: Precision precision() const override; static std::string name() { return "float"; } }; class ARROW_EXPORT DoubleType - : public CTypeImpl { + : public detail::CTypeImpl { public: Precision precision() const override; static std::string name() { return "double"; } @@ -489,7 +497,7 @@ class ARROW_EXPORT UnionType : public NestedType { static constexpr Type::type type_id = Type::UNION; UnionType(const std::vector>& fields, - const std::vector& type_codes, UnionMode mode = UnionMode::SPARSE); + const std::vector& type_codes, UnionMode mode = UnionMode::SPARSE); std::string ToString() const override; static std::string name() { return "union"; } @@ -669,7 +677,7 @@ class ARROW_EXPORT DictionaryType : public FixedWidthType { static constexpr Type::type type_id = Type::DICTIONARY; DictionaryType(const std::shared_ptr& index_type, - const std::shared_ptr& dictionary, bool ordered = false); + const std::shared_ptr& dictionary, bool ordered = false); int bit_width() const override; @@ -699,7 +707,7 @@ class ARROW_EXPORT DictionaryType : 
public FixedWidthType { class ARROW_EXPORT Schema { public: explicit Schema(const std::vector>& fields, - const std::shared_ptr& metadata = nullptr); + const std::shared_ptr& metadata = nullptr); virtual ~Schema() = default; /// Returns true if all of the schema fields are equal @@ -724,13 +732,13 @@ class ARROW_EXPORT Schema { /// \brief Render a string representation of the schema suitable for debugging std::string ToString() const; - Status AddField( - int i, const std::shared_ptr& field, std::shared_ptr* out) const; + Status AddField(int i, const std::shared_ptr& field, + std::shared_ptr* out) const; Status RemoveField(int i, std::shared_ptr* out) const; /// \deprecated Status AddMetadata(const std::shared_ptr& metadata, - std::shared_ptr* out) const; + std::shared_ptr* out) const; /// \brief Replace key-value metadata with new metadata /// @@ -761,8 +769,8 @@ std::shared_ptr ARROW_EXPORT list(const std::shared_ptr& value_ std::shared_ptr ARROW_EXPORT list(const std::shared_ptr& value_type); std::shared_ptr ARROW_EXPORT timestamp(TimeUnit::type unit); -std::shared_ptr ARROW_EXPORT timestamp( - TimeUnit::type unit, const std::string& timezone); +std::shared_ptr ARROW_EXPORT timestamp(TimeUnit::type unit, + const std::string& timezone); /// Unit can be either SECOND or MILLI std::shared_ptr ARROW_EXPORT time32(TimeUnit::type unit); @@ -770,18 +778,18 @@ std::shared_ptr ARROW_EXPORT time32(TimeUnit::type unit); /// Unit can be either MICRO or NANO std::shared_ptr ARROW_EXPORT time64(TimeUnit::type unit); -std::shared_ptr ARROW_EXPORT struct_( - const std::vector>& fields); +std::shared_ptr ARROW_EXPORT +struct_(const std::vector>& fields); -std::shared_ptr ARROW_EXPORT union_( - const std::vector>& child_fields, - const std::vector& type_codes, UnionMode mode = UnionMode::SPARSE); +std::shared_ptr ARROW_EXPORT +union_(const std::vector>& child_fields, + const std::vector& type_codes, UnionMode mode = UnionMode::SPARSE); std::shared_ptr ARROW_EXPORT dictionary( const std::shared_ptr& index_type, const std::shared_ptr& values); -std::shared_ptr ARROW_EXPORT field(const std::string& name, - const std::shared_ptr& type, bool nullable = true, +std::shared_ptr ARROW_EXPORT field( + const std::string& name, const std::shared_ptr& type, bool nullable = true, const std::shared_ptr& metadata = nullptr); // ---------------------------------------------------------------------- diff --git a/cpp/src/arrow/type_traits.h b/cpp/src/arrow/type_traits.h index 3e8ea23432b..973b0e15c54 100644 --- a/cpp/src/arrow/type_traits.h +++ b/cpp/src/arrow/type_traits.h @@ -296,6 +296,8 @@ struct TypeTraits { constexpr static bool is_parameter_free = false; }; +namespace detail { + // Not all type classes have a c_type template struct as_void { @@ -319,10 +321,13 @@ GET_ATTR(TypeClass, void); #undef GET_ATTR -#define PRIMITIVE_TRAITS(T) \ - using TypeClass = typename std::conditional::value, T, \ - typename GetAttr_TypeClass::type>::type; \ - using c_type = typename GetAttr_c_type::type; +} // namespace detail + +#define PRIMITIVE_TRAITS(T) \ + using TypeClass = \ + typename std::conditional::value, T, \ + typename detail::GetAttr_TypeClass::type>::type; \ + using c_type = typename detail::GetAttr_c_type::type; template struct IsUnsignedInt { diff --git a/cpp/src/arrow/util/bit-stream-utils.h b/cpp/src/arrow/util/bit-stream-utils.h index 537fdc3045c..d312fef4d7d 100644 --- a/cpp/src/arrow/util/bit-stream-utils.h +++ b/cpp/src/arrow/util/bit-stream-utils.h @@ -20,9 +20,9 @@ #ifndef ARROW_UTIL_BIT_STREAM_UTILS_H #define 
ARROW_UTIL_BIT_STREAM_UTILS_H +#include #include #include -#include #include "arrow/util/bit-util.h" #include "arrow/util/bpacking.h" @@ -227,15 +227,17 @@ inline bool BitWriter::PutVlqInt(uint32_t v) { return result; } +namespace detail { + template inline void GetValue_(int num_bits, T* v, int max_bytes, const uint8_t* buffer, - int* bit_offset, int* byte_offset, uint64_t* buffered_values) { + int* bit_offset, int* byte_offset, uint64_t* buffered_values) { #ifdef _MSC_VER #pragma warning(push) #pragma warning(disable : 4800) #endif - *v = static_cast( - BitUtil::TrailingBits(*buffered_values, *bit_offset + num_bits) >> *bit_offset); + *v = static_cast(BitUtil::TrailingBits(*buffered_values, *bit_offset + num_bits) >> + *bit_offset); #ifdef _MSC_VER #pragma warning(pop) #endif @@ -264,6 +266,8 @@ inline void GetValue_(int num_bits, T* v, int max_bytes, const uint8_t* buffer, } } +} // namespace detail + template inline bool BitReader::GetValue(int num_bits, T* v) { return GetBatch(num_bits, v, 1) == 1; @@ -291,14 +295,15 @@ inline int BitReader::GetBatch(int num_bits, T* v, int batch_size) { int i = 0; if (UNLIKELY(bit_offset != 0)) { for (; i < batch_size && bit_offset != 0; ++i) { - GetValue_(num_bits, &v[i], max_bytes, buffer, &bit_offset, &byte_offset, - &buffered_values); + detail::GetValue_(num_bits, &v[i], max_bytes, buffer, &bit_offset, &byte_offset, + &buffered_values); } } if (sizeof(T) == 4) { - int num_unpacked = unpack32(reinterpret_cast(buffer + byte_offset), - reinterpret_cast(v + i), batch_size - i, num_bits); + int num_unpacked = + internal::unpack32(reinterpret_cast(buffer + byte_offset), + reinterpret_cast(v + i), batch_size - i, num_bits); i += num_unpacked; byte_offset += num_unpacked * num_bits / 8; } else { @@ -306,9 +311,12 @@ inline int BitReader::GetBatch(int num_bits, T* v, int batch_size) { uint32_t unpack_buffer[buffer_size]; while (i < batch_size) { int unpack_size = std::min(buffer_size, batch_size - i); - int num_unpacked = unpack32(reinterpret_cast(buffer + byte_offset), - unpack_buffer, unpack_size, num_bits); - if (num_unpacked == 0) { break; } + int num_unpacked = + internal::unpack32(reinterpret_cast(buffer + byte_offset), + unpack_buffer, unpack_size, num_bits); + if (num_unpacked == 0) { + break; + } for (int k = 0; k < num_unpacked; ++k) { #ifdef _MSC_VER #pragma warning(push) @@ -332,8 +340,8 @@ inline int BitReader::GetBatch(int num_bits, T* v, int batch_size) { } for (; i < batch_size; ++i) { - GetValue_( - num_bits, &v[i], max_bytes, buffer, &bit_offset, &byte_offset, &buffered_values); + detail::GetValue_(num_bits, &v[i], max_bytes, buffer, &bit_offset, &byte_offset, + &buffered_values); } bit_offset_ = bit_offset; diff --git a/cpp/src/arrow/util/bit-util-test.cc b/cpp/src/arrow/util/bit-util-test.cc index cd945585ba2..231bf54a2a3 100644 --- a/cpp/src/arrow/util/bit-util-test.cc +++ b/cpp/src/arrow/util/bit-util-test.cc @@ -35,7 +35,9 @@ namespace arrow { static void EnsureCpuInfoInitialized() { - if (!CpuInfo::initialized()) { CpuInfo::Init(); } + if (!CpuInfo::initialized()) { + CpuInfo::Init(); + } } TEST(BitUtilTests, TestIsMultipleOf64) { @@ -68,11 +70,13 @@ TEST(BitUtilTests, TestNextPower2) { ASSERT_EQ(1LL << 62, NextPower2((1LL << 62) - 1)); } -static inline int64_t SlowCountBits( - const uint8_t* data, int64_t bit_offset, int64_t length) { +static inline int64_t SlowCountBits(const uint8_t* data, int64_t bit_offset, + int64_t length) { int64_t count = 0; for (int64_t i = bit_offset; i < bit_offset + length; ++i) { - if 
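For the bit-stream hunks above, which move GetValue_ and the unpack calls into detail/internal namespaces, here is a small roundtrip sketch. It is not part of the patch and assumes the Impala-derived BitWriter/BitReader interface (PutValue, Flush, GetBatch) declared in this header.

#include <cstdint>
#include <iostream>

#include "arrow/util/bit-stream-utils.h"

int main() {
  uint8_t buffer[16] = {0};

  // Write eight 3-bit values...
  arrow::BitWriter writer(buffer, static_cast<int>(sizeof(buffer)));
  for (uint64_t v = 0; v < 8; ++v) {
    writer.PutValue(v, /*num_bits=*/3);
  }
  writer.Flush();

  // ...and read them back through GetBatch(), whose unpack path now calls
  // internal::unpack32.
  int32_t out[8];
  arrow::BitReader reader(buffer, static_cast<int>(sizeof(buffer)));
  int n = reader.GetBatch(/*num_bits=*/3, out, /*batch_size=*/8);
  std::cout << "decoded " << n << " values, last = " << out[7] << std::endl;  // 8, 7
  return 0;
}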
(BitUtil::GetBit(data, i)) { ++count; } + if (BitUtil::GetBit(data, i)) { + ++count; + } } return count; } @@ -175,9 +179,9 @@ TEST(BitUtil, TrailingBits) { EXPECT_EQ(BitUtil::TrailingBits(BOOST_BINARY(1 1 1 1 1 1 1 1), 0), 0); EXPECT_EQ(BitUtil::TrailingBits(BOOST_BINARY(1 1 1 1 1 1 1 1), 1), 1); EXPECT_EQ(BitUtil::TrailingBits(BOOST_BINARY(1 1 1 1 1 1 1 1), 64), - BOOST_BINARY(1 1 1 1 1 1 1 1)); + BOOST_BINARY(1 1 1 1 1 1 1 1)); EXPECT_EQ(BitUtil::TrailingBits(BOOST_BINARY(1 1 1 1 1 1 1 1), 100), - BOOST_BINARY(1 1 1 1 1 1 1 1)); + BOOST_BINARY(1 1 1 1 1 1 1 1)); EXPECT_EQ(BitUtil::TrailingBits(0, 1), 0); EXPECT_EQ(BitUtil::TrailingBits(0, 64), 0); EXPECT_EQ(BitUtil::TrailingBits(1LL << 63, 0), 0); @@ -193,12 +197,12 @@ TEST(BitUtil, ByteSwap) { EXPECT_EQ(BitUtil::ByteSwap(static_cast(0x11223344)), 0x44332211); EXPECT_EQ(BitUtil::ByteSwap(static_cast(0)), 0); - EXPECT_EQ( - BitUtil::ByteSwap(static_cast(0x1122334455667788)), 0x8877665544332211); + EXPECT_EQ(BitUtil::ByteSwap(static_cast(0x1122334455667788)), + 0x8877665544332211); EXPECT_EQ(BitUtil::ByteSwap(static_cast(0)), 0); - EXPECT_EQ( - BitUtil::ByteSwap(static_cast(0x1122334455667788)), 0x8877665544332211); + EXPECT_EQ(BitUtil::ByteSwap(static_cast(0x1122334455667788)), + 0x8877665544332211); EXPECT_EQ(BitUtil::ByteSwap(static_cast(0)), 0); EXPECT_EQ(BitUtil::ByteSwap(static_cast(0x1122)), 0x2211); diff --git a/cpp/src/arrow/util/bit-util.cc b/cpp/src/arrow/util/bit-util.cc index 5bbec6f2311..f255f95f30a 100644 --- a/cpp/src/arrow/util/bit-util.cc +++ b/cpp/src/arrow/util/bit-util.cc @@ -36,12 +36,14 @@ namespace arrow { void BitUtil::FillBitsFromBytes(const std::vector& bytes, uint8_t* bits) { for (size_t i = 0; i < bytes.size(); ++i) { - if (bytes[i] > 0) { SetBit(bits, i); } + if (bytes[i] > 0) { + SetBit(bits, i); + } } } -Status BitUtil::BytesToBits( - const std::vector& bytes, std::shared_ptr* out) { +Status BitUtil::BytesToBits(const std::vector& bytes, + std::shared_ptr* out) { int64_t bit_length = BitUtil::BytesForBits(bytes.size()); std::shared_ptr buffer; @@ -65,7 +67,9 @@ int64_t CountSetBits(const uint8_t* data, int64_t bit_offset, int64_t length) { // The number of bits until fast_count_start const int64_t initial_bits = std::min(length, fast_count_start - bit_offset); for (int64_t i = bit_offset; i < bit_offset + initial_bits; ++i) { - if (BitUtil::GetBit(data, i)) { ++count; } + if (BitUtil::GetBit(data, i)) { + ++count; + } } const int64_t fast_counts = (length - initial_bits) / pop_len; @@ -85,21 +89,23 @@ int64_t CountSetBits(const uint8_t* data, int64_t bit_offset, int64_t length) { // versions of popcount but the code complexity is likely not worth it) const int64_t tail_index = bit_offset + initial_bits + fast_counts * pop_len; for (int64_t i = tail_index; i < bit_offset + length; ++i) { - if (BitUtil::GetBit(data, i)) { ++count; } + if (BitUtil::GetBit(data, i)) { + ++count; + } } return count; } -Status GetEmptyBitmap( - MemoryPool* pool, int64_t length, std::shared_ptr* result) { +Status GetEmptyBitmap(MemoryPool* pool, int64_t length, + std::shared_ptr* result) { RETURN_NOT_OK(AllocateBuffer(pool, BitUtil::BytesForBits(length), result)); memset((*result)->mutable_data(), 0, static_cast((*result)->size())); return Status::OK(); } Status CopyBitmap(MemoryPool* pool, const uint8_t* data, int64_t offset, int64_t length, - std::shared_ptr* out) { + std::shared_ptr* out) { std::shared_ptr buffer; RETURN_NOT_OK(GetEmptyBitmap(pool, length, &buffer)); uint8_t* dest = buffer->mutable_data(); @@ -111,12 +117,14 @@ 
Status CopyBitmap(MemoryPool* pool, const uint8_t* data, int64_t offset, int64_t } bool BitmapEquals(const uint8_t* left, int64_t left_offset, const uint8_t* right, - int64_t right_offset, int64_t bit_length) { + int64_t right_offset, int64_t bit_length) { if (left_offset % 8 == 0 && right_offset % 8 == 0) { // byte aligned, can use memcmp bool bytes_equal = std::memcmp(left + left_offset / 8, right + right_offset / 8, - bit_length / 8) == 0; - if (!bytes_equal) { return false; } + bit_length / 8) == 0; + if (!bytes_equal) { + return false; + } for (int64_t i = (bit_length / 8) * 8; i < bit_length; ++i) { if (BitUtil::GetBit(left, left_offset + i) != BitUtil::GetBit(right, right_offset + i)) { diff --git a/cpp/src/arrow/util/bit-util.h b/cpp/src/arrow/util/bit-util.h index d055c751d16..fc360bae4e4 100644 --- a/cpp/src/arrow/util/bit-util.h +++ b/cpp/src/arrow/util/bit-util.h @@ -66,6 +66,8 @@ namespace arrow { // // We add a partial stub implementation here +namespace detail { + template struct make_unsigned {}; @@ -89,6 +91,8 @@ struct make_unsigned { typedef uint64_t type; }; +} // namespace detail + class Buffer; class MemoryPool; class MutableBuffer; @@ -101,17 +105,11 @@ static constexpr uint8_t kBitmask[] = {1, 2, 4, 8, 16, 32, 64, 128}; // the ~i byte version of kBitmaks static constexpr uint8_t kFlippedBitmask[] = {254, 253, 251, 247, 239, 223, 191, 127}; -static inline int64_t CeilByte(int64_t size) { - return (size + 7) & ~7; -} +static inline int64_t CeilByte(int64_t size) { return (size + 7) & ~7; } -static inline int64_t BytesForBits(int64_t size) { - return CeilByte(size) / 8; -} +static inline int64_t BytesForBits(int64_t size) { return CeilByte(size) / 8; } -static inline int64_t Ceil2Bytes(int64_t size) { - return (size + 15) & ~15; -} +static inline int64_t Ceil2Bytes(int64_t size) { return (size + 15) & ~15; } static inline bool GetBit(const uint8_t* bits, int64_t i) { return (bits[i / 8] & kBitmask[i % 8]) != 0; @@ -125,13 +123,13 @@ static inline void ClearBit(uint8_t* bits, int64_t i) { bits[i / 8] &= kFlippedBitmask[i % 8]; } -static inline void SetBit(uint8_t* bits, int64_t i) { - bits[i / 8] |= kBitmask[i % 8]; -} +static inline void SetBit(uint8_t* bits, int64_t i) { bits[i / 8] |= kBitmask[i % 8]; } /// Set bit if is_set is true, but cannot clear bit static inline void SetArrayBit(uint8_t* bits, int i, bool is_set) { - if (is_set) { SetBit(bits, i); } + if (is_set) { + SetBit(bits, i); + } } static inline void SetBitTo(uint8_t* bits, int64_t i, bool bit_is_set) { @@ -168,13 +166,9 @@ static inline int64_t NextPower2(int64_t n) { return n; } -static inline bool IsMultipleOf64(int64_t n) { - return (n & 63) == 0; -} +static inline bool IsMultipleOf64(int64_t n) { return (n & 63) == 0; } -static inline bool IsMultipleOf8(int64_t n) { - return (n & 7) == 0; -} +static inline bool IsMultipleOf8(int64_t n) { return (n & 7) == 0; } /// Returns the ceil of value/divisor static inline int64_t Ceil(int64_t value, int64_t divisor) { @@ -206,34 +200,22 @@ static inline int RoundDownToPowerOf2(int value, int factor) { /// Specialized round up and down functions for frequently used factors, /// like 8 (bits->bytes), 32 (bits->i32), and 64 (bits->i64). /// Returns the rounded up number of bytes that fit the number of bits. -static inline uint32_t RoundUpNumBytes(uint32_t bits) { - return (bits + 7) >> 3; -} +static inline uint32_t RoundUpNumBytes(uint32_t bits) { return (bits + 7) >> 3; } /// Returns the rounded down number of bytes that fit the number of bits. 
-static inline uint32_t RoundDownNumBytes(uint32_t bits) { - return bits >> 3; -} +static inline uint32_t RoundDownNumBytes(uint32_t bits) { return bits >> 3; } /// Returns the rounded up to 32 multiple. Used for conversions of bits to i32. -static inline uint32_t RoundUpNumi32(uint32_t bits) { - return (bits + 31) >> 5; -} +static inline uint32_t RoundUpNumi32(uint32_t bits) { return (bits + 31) >> 5; } /// Returns the rounded up 32 multiple. -static inline uint32_t RoundDownNumi32(uint32_t bits) { - return bits >> 5; -} +static inline uint32_t RoundDownNumi32(uint32_t bits) { return bits >> 5; } /// Returns the rounded up to 64 multiple. Used for conversions of bits to i64. -static inline uint32_t RoundUpNumi64(uint32_t bits) { - return (bits + 63) >> 6; -} +static inline uint32_t RoundUpNumi64(uint32_t bits) { return (bits + 63) >> 6; } /// Returns the rounded down to 64 multiple. -static inline uint32_t RoundDownNumi64(uint32_t bits) { - return bits >> 6; -} +static inline uint32_t RoundDownNumi64(uint32_t bits) { return bits >> 6; } static inline int64_t RoundUpToMultipleOf64(int64_t num) { // TODO(wesm): is this definitely needed? @@ -242,7 +224,9 @@ static inline int64_t RoundUpToMultipleOf64(int64_t num) { constexpr int64_t force_carry_addend = round_to - 1; constexpr int64_t truncate_bitmask = ~(round_to - 1); constexpr int64_t max_roundable_num = std::numeric_limits::max() - round_to; - if (num <= max_roundable_num) { return (num + force_carry_addend) & truncate_bitmask; } + if (num <= max_roundable_num) { + return (num + force_carry_addend) & truncate_bitmask; + } // handle overflow case. This should result in a malloc error upstream return num; } @@ -252,8 +236,7 @@ static inline int64_t RoundUpToMultipleOf64(int64_t num) { /// might be a much faster way to implement this. static inline int PopcountNoHw(uint64_t x) { int count = 0; - for (; x != 0; ++count) - x &= x - 1; + for (; x != 0; ++count) x &= x - 1; return count; } @@ -274,7 +257,7 @@ static inline int Popcount(uint64_t x) { template static inline int PopcountSigned(T v) { // Converting to same-width unsigned then extending preserves the bit pattern. - return BitUtil::Popcount(static_cast::type>(v)); + return BitUtil::Popcount(static_cast::type>(v)); } /// Returns the 'num_bits' least-significant bits of 'v'. @@ -297,21 +280,16 @@ static inline int Log2(uint64_t x) { // (floor(log2(n)) = MSB(n) (0-indexed)) --x; int result = 1; - while (x >>= 1) - ++result; + while (x >>= 1) ++result; return result; } /// Swaps the byte order (i.e. endianess) -static inline int64_t ByteSwap(int64_t value) { - return ARROW_BYTE_SWAP64(value); -} +static inline int64_t ByteSwap(int64_t value) { return ARROW_BYTE_SWAP64(value); } static inline uint64_t ByteSwap(uint64_t value) { return static_cast(ARROW_BYTE_SWAP64(value)); } -static inline int32_t ByteSwap(int32_t value) { - return ARROW_BYTE_SWAP32(value); -} +static inline int32_t ByteSwap(int32_t value) { return ARROW_BYTE_SWAP32(value); } static inline uint32_t ByteSwap(uint32_t value) { return static_cast(ARROW_BYTE_SWAP32(value)); } @@ -352,84 +330,36 @@ static inline void ByteSwap(void* dst, const void* src, int len) { /// Converts to big endian format (if not already in big endian) from the /// machine's native endian format. 
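A short sketch of the bitmap helpers being reformatted here; SetBit, GetBit and CountSetBits are all part of the arrow/util/bit-util.h surface shown above, so only the example itself is new.

#include <cstdint>
#include <iostream>

#include "arrow/util/bit-util.h"

int main() {
  uint8_t bitmap[2] = {0, 0};

  // Set bits 1, 3 and 9, then count them over the full 16-bit range with
  // CountSetBits(), whose byte-aligned fast path is reformatted above.
  arrow::BitUtil::SetBit(bitmap, 1);
  arrow::BitUtil::SetBit(bitmap, 3);
  arrow::BitUtil::SetBit(bitmap, 9);

  std::cout << arrow::BitUtil::GetBit(bitmap, 3) << std::endl;                  // 1
  std::cout << arrow::CountSetBits(bitmap, /*bit_offset=*/0, /*length=*/16)
            << std::endl;                                                       // 3
  return 0;
}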
#if __BYTE_ORDER == __LITTLE_ENDIAN -static inline int64_t ToBigEndian(int64_t value) { - return ByteSwap(value); -} -static inline uint64_t ToBigEndian(uint64_t value) { - return ByteSwap(value); -} -static inline int32_t ToBigEndian(int32_t value) { - return ByteSwap(value); -} -static inline uint32_t ToBigEndian(uint32_t value) { - return ByteSwap(value); -} -static inline int16_t ToBigEndian(int16_t value) { - return ByteSwap(value); -} -static inline uint16_t ToBigEndian(uint16_t value) { - return ByteSwap(value); -} +static inline int64_t ToBigEndian(int64_t value) { return ByteSwap(value); } +static inline uint64_t ToBigEndian(uint64_t value) { return ByteSwap(value); } +static inline int32_t ToBigEndian(int32_t value) { return ByteSwap(value); } +static inline uint32_t ToBigEndian(uint32_t value) { return ByteSwap(value); } +static inline int16_t ToBigEndian(int16_t value) { return ByteSwap(value); } +static inline uint16_t ToBigEndian(uint16_t value) { return ByteSwap(value); } #else -static inline int64_t ToBigEndian(int64_t val) { - return val; -} -static inline uint64_t ToBigEndian(uint64_t val) { - return val; -} -static inline int32_t ToBigEndian(int32_t val) { - return val; -} -static inline uint32_t ToBigEndian(uint32_t val) { - return val; -} -static inline int16_t ToBigEndian(int16_t val) { - return val; -} -static inline uint16_t ToBigEndian(uint16_t val) { - return val; -} +static inline int64_t ToBigEndian(int64_t val) { return val; } +static inline uint64_t ToBigEndian(uint64_t val) { return val; } +static inline int32_t ToBigEndian(int32_t val) { return val; } +static inline uint32_t ToBigEndian(uint32_t val) { return val; } +static inline int16_t ToBigEndian(int16_t val) { return val; } +static inline uint16_t ToBigEndian(uint16_t val) { return val; } #endif /// Converts from big endian format to the machine's native endian format. 
#if __BYTE_ORDER == __LITTLE_ENDIAN -static inline int64_t FromBigEndian(int64_t value) { - return ByteSwap(value); -} -static inline uint64_t FromBigEndian(uint64_t value) { - return ByteSwap(value); -} -static inline int32_t FromBigEndian(int32_t value) { - return ByteSwap(value); -} -static inline uint32_t FromBigEndian(uint32_t value) { - return ByteSwap(value); -} -static inline int16_t FromBigEndian(int16_t value) { - return ByteSwap(value); -} -static inline uint16_t FromBigEndian(uint16_t value) { - return ByteSwap(value); -} +static inline int64_t FromBigEndian(int64_t value) { return ByteSwap(value); } +static inline uint64_t FromBigEndian(uint64_t value) { return ByteSwap(value); } +static inline int32_t FromBigEndian(int32_t value) { return ByteSwap(value); } +static inline uint32_t FromBigEndian(uint32_t value) { return ByteSwap(value); } +static inline int16_t FromBigEndian(int16_t value) { return ByteSwap(value); } +static inline uint16_t FromBigEndian(uint16_t value) { return ByteSwap(value); } #else -static inline int64_t FromBigEndian(int64_t val) { - return val; -} -static inline uint64_t FromBigEndian(uint64_t val) { - return val; -} -static inline int32_t FromBigEndian(int32_t val) { - return val; -} -static inline uint32_t FromBigEndian(uint32_t val) { - return val; -} -static inline int16_t FromBigEndian(int16_t val) { - return val; -} -static inline uint16_t FromBigEndian(uint16_t val) { - return val; -} +static inline int64_t FromBigEndian(int64_t val) { return val; } +static inline uint64_t FromBigEndian(uint64_t val) { return val; } +static inline int32_t FromBigEndian(int32_t val) { return val; } +static inline uint32_t FromBigEndian(uint32_t val) { return val; } +static inline int16_t FromBigEndian(int16_t val) { return val; } +static inline uint16_t FromBigEndian(uint16_t val) { return val; } #endif // Logical right shift for signed integer types @@ -438,7 +368,7 @@ static inline uint16_t FromBigEndian(uint16_t val) { template static T ShiftRightLogical(T v, int shift) { // Conversion to unsigned ensures most significant bits always filled with 0's - return static_cast::type>(v) >> shift; + return static_cast::type>(v) >> shift; } void FillBitsFromBytes(const std::vector& bytes, uint8_t* bits); @@ -449,8 +379,8 @@ ARROW_EXPORT Status BytesToBits(const std::vector&, std::shared_ptr* result); +Status ARROW_EXPORT GetEmptyBitmap(MemoryPool* pool, int64_t length, + std::shared_ptr* result); /// Copy a bit range of an existing bitmap /// @@ -462,7 +392,7 @@ Status ARROW_EXPORT GetEmptyBitmap( /// /// \return Status message Status ARROW_EXPORT CopyBitmap(MemoryPool* pool, const uint8_t* bitmap, int64_t offset, - int64_t length, std::shared_ptr* out); + int64_t length, std::shared_ptr* out); /// Compute the number of 1's in the given data array /// @@ -471,11 +401,12 @@ Status ARROW_EXPORT CopyBitmap(MemoryPool* pool, const uint8_t* bitmap, int64_t /// \param[in] length the number of bits to inspect in the bitmap relative to the offset /// /// \return The number of set (1) bits in the range -int64_t ARROW_EXPORT CountSetBits( - const uint8_t* data, int64_t bit_offset, int64_t length); +int64_t ARROW_EXPORT CountSetBits(const uint8_t* data, int64_t bit_offset, + int64_t length); bool ARROW_EXPORT BitmapEquals(const uint8_t* left, int64_t left_offset, - const uint8_t* right, int64_t right_offset, int64_t bit_length); + const uint8_t* right, int64_t right_offset, + int64_t bit_length); } // namespace arrow #endif // ARROW_UTIL_BIT_UTIL_H diff --git 
a/cpp/src/arrow/util/bpacking.h b/cpp/src/arrow/util/bpacking.h index fce5f55224c..14258cff6e4 100644 --- a/cpp/src/arrow/util/bpacking.h +++ b/cpp/src/arrow/util/bpacking.h @@ -20,12 +20,9 @@ // https://github.com/lemire/FrameOfReference/blob/6ccaf9e97160f9a3b299e23a8ef739e711ef0c71/src/bpacking.cpp // The original copyright notice follows. -/** -* -* This code is released under the -* Apache License Version 2.0 http://www.apache.org/licenses/. -* (c) Daniel Lemire 2013 -*/ +// This code is released under the +// Apache License Version 2.0 http://www.apache.org/licenses/. +// (c) Daniel Lemire 2013 #ifndef ARROW_UTIL_BPACKING_H #define ARROW_UTIL_BPACKING_H @@ -33,6 +30,7 @@ #include "arrow/util/logging.h" namespace arrow { +namespace internal { inline const uint32_t* unpack1_32(const uint32_t* in, uint32_t* out) { *out = ((*in) >> 0) & 1; @@ -3199,136 +3197,103 @@ inline int unpack32(const uint32_t* in, uint32_t* out, int batch_size, int num_b switch (num_bits) { case 0: - for (int i = 0; i < num_loops; ++i) - in = nullunpacker32(in, out + i * 32); + for (int i = 0; i < num_loops; ++i) in = nullunpacker32(in, out + i * 32); break; case 1: - for (int i = 0; i < num_loops; ++i) - in = unpack1_32(in, out + i * 32); + for (int i = 0; i < num_loops; ++i) in = unpack1_32(in, out + i * 32); break; case 2: - for (int i = 0; i < num_loops; ++i) - in = unpack2_32(in, out + i * 32); + for (int i = 0; i < num_loops; ++i) in = unpack2_32(in, out + i * 32); break; case 3: - for (int i = 0; i < num_loops; ++i) - in = unpack3_32(in, out + i * 32); + for (int i = 0; i < num_loops; ++i) in = unpack3_32(in, out + i * 32); break; case 4: - for (int i = 0; i < num_loops; ++i) - in = unpack4_32(in, out + i * 32); + for (int i = 0; i < num_loops; ++i) in = unpack4_32(in, out + i * 32); break; case 5: - for (int i = 0; i < num_loops; ++i) - in = unpack5_32(in, out + i * 32); + for (int i = 0; i < num_loops; ++i) in = unpack5_32(in, out + i * 32); break; case 6: - for (int i = 0; i < num_loops; ++i) - in = unpack6_32(in, out + i * 32); + for (int i = 0; i < num_loops; ++i) in = unpack6_32(in, out + i * 32); break; case 7: - for (int i = 0; i < num_loops; ++i) - in = unpack7_32(in, out + i * 32); + for (int i = 0; i < num_loops; ++i) in = unpack7_32(in, out + i * 32); break; case 8: - for (int i = 0; i < num_loops; ++i) - in = unpack8_32(in, out + i * 32); + for (int i = 0; i < num_loops; ++i) in = unpack8_32(in, out + i * 32); break; case 9: - for (int i = 0; i < num_loops; ++i) - in = unpack9_32(in, out + i * 32); + for (int i = 0; i < num_loops; ++i) in = unpack9_32(in, out + i * 32); break; case 10: - for (int i = 0; i < num_loops; ++i) - in = unpack10_32(in, out + i * 32); + for (int i = 0; i < num_loops; ++i) in = unpack10_32(in, out + i * 32); break; case 11: - for (int i = 0; i < num_loops; ++i) - in = unpack11_32(in, out + i * 32); + for (int i = 0; i < num_loops; ++i) in = unpack11_32(in, out + i * 32); break; case 12: - for (int i = 0; i < num_loops; ++i) - in = unpack12_32(in, out + i * 32); + for (int i = 0; i < num_loops; ++i) in = unpack12_32(in, out + i * 32); break; case 13: - for (int i = 0; i < num_loops; ++i) - in = unpack13_32(in, out + i * 32); + for (int i = 0; i < num_loops; ++i) in = unpack13_32(in, out + i * 32); break; case 14: - for (int i = 0; i < num_loops; ++i) - in = unpack14_32(in, out + i * 32); + for (int i = 0; i < num_loops; ++i) in = unpack14_32(in, out + i * 32); break; case 15: - for (int i = 0; i < num_loops; ++i) - in = unpack15_32(in, out + i * 32); + for (int i = 0; i < 
num_loops; ++i) in = unpack15_32(in, out + i * 32); break; case 16: - for (int i = 0; i < num_loops; ++i) - in = unpack16_32(in, out + i * 32); + for (int i = 0; i < num_loops; ++i) in = unpack16_32(in, out + i * 32); break; case 17: - for (int i = 0; i < num_loops; ++i) - in = unpack17_32(in, out + i * 32); + for (int i = 0; i < num_loops; ++i) in = unpack17_32(in, out + i * 32); break; case 18: - for (int i = 0; i < num_loops; ++i) - in = unpack18_32(in, out + i * 32); + for (int i = 0; i < num_loops; ++i) in = unpack18_32(in, out + i * 32); break; case 19: - for (int i = 0; i < num_loops; ++i) - in = unpack19_32(in, out + i * 32); + for (int i = 0; i < num_loops; ++i) in = unpack19_32(in, out + i * 32); break; case 20: - for (int i = 0; i < num_loops; ++i) - in = unpack20_32(in, out + i * 32); + for (int i = 0; i < num_loops; ++i) in = unpack20_32(in, out + i * 32); break; case 21: - for (int i = 0; i < num_loops; ++i) - in = unpack21_32(in, out + i * 32); + for (int i = 0; i < num_loops; ++i) in = unpack21_32(in, out + i * 32); break; case 22: - for (int i = 0; i < num_loops; ++i) - in = unpack22_32(in, out + i * 32); + for (int i = 0; i < num_loops; ++i) in = unpack22_32(in, out + i * 32); break; case 23: - for (int i = 0; i < num_loops; ++i) - in = unpack23_32(in, out + i * 32); + for (int i = 0; i < num_loops; ++i) in = unpack23_32(in, out + i * 32); break; case 24: - for (int i = 0; i < num_loops; ++i) - in = unpack24_32(in, out + i * 32); + for (int i = 0; i < num_loops; ++i) in = unpack24_32(in, out + i * 32); break; case 25: - for (int i = 0; i < num_loops; ++i) - in = unpack25_32(in, out + i * 32); + for (int i = 0; i < num_loops; ++i) in = unpack25_32(in, out + i * 32); break; case 26: - for (int i = 0; i < num_loops; ++i) - in = unpack26_32(in, out + i * 32); + for (int i = 0; i < num_loops; ++i) in = unpack26_32(in, out + i * 32); break; case 27: - for (int i = 0; i < num_loops; ++i) - in = unpack27_32(in, out + i * 32); + for (int i = 0; i < num_loops; ++i) in = unpack27_32(in, out + i * 32); break; case 28: - for (int i = 0; i < num_loops; ++i) - in = unpack28_32(in, out + i * 32); + for (int i = 0; i < num_loops; ++i) in = unpack28_32(in, out + i * 32); break; case 29: - for (int i = 0; i < num_loops; ++i) - in = unpack29_32(in, out + i * 32); + for (int i = 0; i < num_loops; ++i) in = unpack29_32(in, out + i * 32); break; case 30: - for (int i = 0; i < num_loops; ++i) - in = unpack30_32(in, out + i * 32); + for (int i = 0; i < num_loops; ++i) in = unpack30_32(in, out + i * 32); break; case 31: - for (int i = 0; i < num_loops; ++i) - in = unpack31_32(in, out + i * 32); + for (int i = 0; i < num_loops; ++i) in = unpack31_32(in, out + i * 32); break; case 32: - for (int i = 0; i < num_loops; ++i) - in = unpack32_32(in, out + i * 32); + for (int i = 0; i < num_loops; ++i) in = unpack32_32(in, out + i * 32); break; default: DCHECK(false) << "Unsupported num_bits"; @@ -3337,6 +3302,7 @@ inline int unpack32(const uint32_t* in, uint32_t* out, int batch_size, int num_b return batch_size; } -}; // namespace arrow +} // namespace internal +} // namespace arrow #endif // ARROW_UTIL_BPACKING_H diff --git a/cpp/src/arrow/util/compression-test.cc b/cpp/src/arrow/util/compression-test.cc index f7739fc6dd7..64896dd6a4a 100644 --- a/cpp/src/arrow/util/compression-test.cc +++ b/cpp/src/arrow/util/compression-test.cc @@ -15,8 +15,8 @@ // specific language governing permissions and limitations // under the License. 
-#include #include +#include #include #include @@ -43,25 +43,25 @@ void CheckCodecRoundtrip(const vector& data) { // compress with c1 int64_t actual_size; - ASSERT_OK(c1->Compress( - data.size(), &data[0], max_compressed_len, &compressed[0], &actual_size)); + ASSERT_OK(c1->Compress(data.size(), &data[0], max_compressed_len, &compressed[0], + &actual_size)); compressed.resize(actual_size); // decompress with c2 - ASSERT_OK(c2->Decompress( - compressed.size(), &compressed[0], decompressed.size(), &decompressed[0])); + ASSERT_OK(c2->Decompress(compressed.size(), &compressed[0], decompressed.size(), + &decompressed[0])); ASSERT_EQ(data, decompressed); // compress with c2 int64_t actual_size2; - ASSERT_OK(c2->Compress( - data.size(), &data[0], max_compressed_len, &compressed[0], &actual_size2)); + ASSERT_OK(c2->Compress(data.size(), &data[0], max_compressed_len, &compressed[0], + &actual_size2)); ASSERT_EQ(actual_size2, actual_size); // decompress with c1 - ASSERT_OK(c1->Decompress( - compressed.size(), &compressed[0], decompressed.size(), &decompressed[0])); + ASSERT_OK(c1->Decompress(compressed.size(), &compressed[0], decompressed.size(), + &decompressed[0])); ASSERT_EQ(data, decompressed); } @@ -76,24 +76,14 @@ void CheckCodec() { } } -TEST(TestCompressors, Snappy) { - CheckCodec(); -} +TEST(TestCompressors, Snappy) { CheckCodec(); } -TEST(TestCompressors, Brotli) { - CheckCodec(); -} +TEST(TestCompressors, Brotli) { CheckCodec(); } -TEST(TestCompressors, GZip) { - CheckCodec(); -} +TEST(TestCompressors, GZip) { CheckCodec(); } -TEST(TestCompressors, ZSTD) { - CheckCodec(); -} +TEST(TestCompressors, ZSTD) { CheckCodec(); } -TEST(TestCompressors, Lz4) { - CheckCodec(); -} +TEST(TestCompressors, Lz4) { CheckCodec(); } } // namespace arrow diff --git a/cpp/src/arrow/util/compression.h b/cpp/src/arrow/util/compression.h index 19c61179a50..ae187a7fcdf 100644 --- a/cpp/src/arrow/util/compression.h +++ b/cpp/src/arrow/util/compression.h @@ -37,10 +37,11 @@ class ARROW_EXPORT Codec { static Status Create(Compression::type codec, std::unique_ptr* out); virtual Status Decompress(int64_t input_len, const uint8_t* input, int64_t output_len, - uint8_t* output_buffer) = 0; + uint8_t* output_buffer) = 0; virtual Status Compress(int64_t input_len, const uint8_t* input, - int64_t output_buffer_len, uint8_t* output_buffer, int64_t* output_length) = 0; + int64_t output_buffer_len, uint8_t* output_buffer, + int64_t* output_length) = 0; virtual int64_t MaxCompressedLen(int64_t input_len, const uint8_t* input) = 0; diff --git a/cpp/src/arrow/util/compression_brotli.cc b/cpp/src/arrow/util/compression_brotli.cc index c03573bc46c..e4639083dfa 100644 --- a/cpp/src/arrow/util/compression_brotli.cc +++ b/cpp/src/arrow/util/compression_brotli.cc @@ -33,8 +33,8 @@ namespace arrow { // ---------------------------------------------------------------------- // Brotli implementation -Status BrotliCodec::Decompress( - int64_t input_len, const uint8_t* input, int64_t output_len, uint8_t* output_buffer) { +Status BrotliCodec::Decompress(int64_t input_len, const uint8_t* input, + int64_t output_len, uint8_t* output_buffer) { size_t output_size = output_len; if (BrotliDecoderDecompress(input_len, input, &output_size, output_buffer) != BROTLI_DECODER_RESULT_SUCCESS) { @@ -48,12 +48,13 @@ int64_t BrotliCodec::MaxCompressedLen(int64_t input_len, const uint8_t* input) { } Status BrotliCodec::Compress(int64_t input_len, const uint8_t* input, - int64_t output_buffer_len, uint8_t* output_buffer, int64_t* output_length) { + int64_t 
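The Codec interface reformatted above keeps the compress/decompress contract that CheckCodecRoundtrip exercises. A hedged roundtrip sketch follows; it assumes Compression::SNAPPY is among the enum values and that Snappy support is compiled in.

#include <cstdint>
#include <iostream>
#include <memory>
#include <string>
#include <vector>

#include "arrow/status.h"
#include "arrow/util/compression.h"

int main() {
  std::unique_ptr<arrow::Codec> codec;
  arrow::Status st = arrow::Codec::Create(arrow::Compression::SNAPPY, &codec);
  if (!st.ok()) {
    std::cerr << st.ToString() << std::endl;
    return 1;
  }

  std::string text = "repeated repeated repeated repeated payload";
  const auto* input = reinterpret_cast<const uint8_t*>(text.data());

  // Compress into a buffer sized by MaxCompressedLen(), then decompress back.
  std::vector<uint8_t> compressed(codec->MaxCompressedLen(text.size(), input));
  int64_t compressed_len = 0;
  st = codec->Compress(text.size(), input, compressed.size(), compressed.data(),
                       &compressed_len);
  if (!st.ok()) {
    return 1;
  }

  std::vector<uint8_t> roundtrip(text.size());
  st = codec->Decompress(compressed_len, compressed.data(), roundtrip.size(),
                         roundtrip.data());
  std::cout << st.ok() << " " << text.size() << " -> " << compressed_len << std::endl;
  return 0;
}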
output_buffer_len, uint8_t* output_buffer, + int64_t* output_length) { size_t output_len = output_buffer_len; // TODO: Make quality configurable. We use 8 as a default as it is the best // trade-off for Parquet workload if (BrotliEncoderCompress(8, BROTLI_DEFAULT_WINDOW, BROTLI_DEFAULT_MODE, input_len, - input, &output_len, output_buffer) == BROTLI_FALSE) { + input, &output_len, output_buffer) == BROTLI_FALSE) { return Status::IOError("Brotli compression failure."); } *output_length = output_len; diff --git a/cpp/src/arrow/util/compression_brotli.h b/cpp/src/arrow/util/compression_brotli.h index 08bd3379e34..9e92cb106d4 100644 --- a/cpp/src/arrow/util/compression_brotli.h +++ b/cpp/src/arrow/util/compression_brotli.h @@ -30,10 +30,10 @@ namespace arrow { class ARROW_EXPORT BrotliCodec : public Codec { public: Status Decompress(int64_t input_len, const uint8_t* input, int64_t output_len, - uint8_t* output_buffer) override; + uint8_t* output_buffer) override; Status Compress(int64_t input_len, const uint8_t* input, int64_t output_buffer_len, - uint8_t* output_buffer, int64_t* output_length) override; + uint8_t* output_buffer, int64_t* output_length) override; int64_t MaxCompressedLen(int64_t input_len, const uint8_t* input) override; diff --git a/cpp/src/arrow/util/compression_lz4.cc b/cpp/src/arrow/util/compression_lz4.cc index 65eaa08946e..295e9a438f7 100644 --- a/cpp/src/arrow/util/compression_lz4.cc +++ b/cpp/src/arrow/util/compression_lz4.cc @@ -32,12 +32,14 @@ namespace arrow { // ---------------------------------------------------------------------- // Lz4 implementation -Status Lz4Codec::Decompress( - int64_t input_len, const uint8_t* input, int64_t output_len, uint8_t* output_buffer) { - int64_t decompressed_size = LZ4_decompress_safe(reinterpret_cast(input), - reinterpret_cast(output_buffer), static_cast(input_len), - static_cast(output_len)); - if (decompressed_size < 1) { return Status::IOError("Corrupt Lz4 compressed data."); } +Status Lz4Codec::Decompress(int64_t input_len, const uint8_t* input, int64_t output_len, + uint8_t* output_buffer) { + int64_t decompressed_size = LZ4_decompress_safe( + reinterpret_cast(input), reinterpret_cast(output_buffer), + static_cast(input_len), static_cast(output_len)); + if (decompressed_size < 1) { + return Status::IOError("Corrupt Lz4 compressed data."); + } return Status::OK(); } @@ -46,11 +48,14 @@ int64_t Lz4Codec::MaxCompressedLen(int64_t input_len, const uint8_t* input) { } Status Lz4Codec::Compress(int64_t input_len, const uint8_t* input, - int64_t output_buffer_len, uint8_t* output_buffer, int64_t* output_length) { - *output_length = LZ4_compress_default(reinterpret_cast(input), - reinterpret_cast(output_buffer), static_cast(input_len), - static_cast(output_buffer_len)); - if (*output_length < 1) { return Status::IOError("Lz4 compression failure."); } + int64_t output_buffer_len, uint8_t* output_buffer, + int64_t* output_length) { + *output_length = LZ4_compress_default( + reinterpret_cast(input), reinterpret_cast(output_buffer), + static_cast(input_len), static_cast(output_buffer_len)); + if (*output_length < 1) { + return Status::IOError("Lz4 compression failure."); + } return Status::OK(); } diff --git a/cpp/src/arrow/util/compression_lz4.h b/cpp/src/arrow/util/compression_lz4.h index 9668fec126b..0af228963f3 100644 --- a/cpp/src/arrow/util/compression_lz4.h +++ b/cpp/src/arrow/util/compression_lz4.h @@ -30,10 +30,10 @@ namespace arrow { class ARROW_EXPORT Lz4Codec : public Codec { public: Status Decompress(int64_t input_len, const 
uint8_t* input, int64_t output_len, - uint8_t* output_buffer) override; + uint8_t* output_buffer) override; Status Compress(int64_t input_len, const uint8_t* input, int64_t output_buffer_len, - uint8_t* output_buffer, int64_t* output_length) override; + uint8_t* output_buffer, int64_t* output_length) override; int64_t MaxCompressedLen(int64_t input_len, const uint8_t* input) override; diff --git a/cpp/src/arrow/util/compression_snappy.cc b/cpp/src/arrow/util/compression_snappy.cc index db2b6735510..947ffe559bd 100644 --- a/cpp/src/arrow/util/compression_snappy.cc +++ b/cpp/src/arrow/util/compression_snappy.cc @@ -37,10 +37,11 @@ namespace arrow { // ---------------------------------------------------------------------- // Snappy implementation -Status SnappyCodec::Decompress( - int64_t input_len, const uint8_t* input, int64_t output_len, uint8_t* output_buffer) { +Status SnappyCodec::Decompress(int64_t input_len, const uint8_t* input, + int64_t output_len, uint8_t* output_buffer) { if (!snappy::RawUncompress(reinterpret_cast(input), - static_cast(input_len), reinterpret_cast(output_buffer))) { + static_cast(input_len), + reinterpret_cast(output_buffer))) { return Status::IOError("Corrupt snappy compressed data."); } return Status::OK(); @@ -51,11 +52,12 @@ int64_t SnappyCodec::MaxCompressedLen(int64_t input_len, const uint8_t* input) { } Status SnappyCodec::Compress(int64_t input_len, const uint8_t* input, - int64_t output_buffer_len, uint8_t* output_buffer, int64_t* output_length) { + int64_t output_buffer_len, uint8_t* output_buffer, + int64_t* output_length) { size_t output_len; snappy::RawCompress(reinterpret_cast(input), - static_cast(input_len), reinterpret_cast(output_buffer), - &output_len); + static_cast(input_len), + reinterpret_cast(output_buffer), &output_len); *output_length = static_cast(output_len); return Status::OK(); } diff --git a/cpp/src/arrow/util/compression_snappy.h b/cpp/src/arrow/util/compression_snappy.h index 25281e1a97a..5cc10c470af 100644 --- a/cpp/src/arrow/util/compression_snappy.h +++ b/cpp/src/arrow/util/compression_snappy.h @@ -29,10 +29,10 @@ namespace arrow { class ARROW_EXPORT SnappyCodec : public Codec { public: Status Decompress(int64_t input_len, const uint8_t* input, int64_t output_len, - uint8_t* output_buffer) override; + uint8_t* output_buffer) override; Status Compress(int64_t input_len, const uint8_t* input, int64_t output_buffer_len, - uint8_t* output_buffer, int64_t* output_length) override; + uint8_t* output_buffer, int64_t* output_length) override; int64_t MaxCompressedLen(int64_t input_len, const uint8_t* input) override; diff --git a/cpp/src/arrow/util/compression_zlib.cc b/cpp/src/arrow/util/compression_zlib.cc index 3ff33b82028..ae6627ea644 100644 --- a/cpp/src/arrow/util/compression_zlib.cc +++ b/cpp/src/arrow/util/compression_zlib.cc @@ -69,7 +69,7 @@ class GZipCodec::GZipCodecImpl { window_bits += GZIP_CODEC; } if ((ret = deflateInit2(&stream_, Z_DEFAULT_COMPRESSION, Z_DEFLATED, window_bits, 9, - Z_DEFAULT_STRATEGY)) != Z_OK) { + Z_DEFAULT_STRATEGY)) != Z_OK) { std::stringstream ss; ss << "zlib deflateInit failed: " << std::string(stream_.msg); return Status::IOError(ss.str()); @@ -79,7 +79,9 @@ class GZipCodec::GZipCodecImpl { } void EndCompressor() { - if (compressor_initialized_) { (void)deflateEnd(&stream_); } + if (compressor_initialized_) { + (void)deflateEnd(&stream_); + } compressor_initialized_ = false; } @@ -100,13 +102,17 @@ class GZipCodec::GZipCodecImpl { } void EndDecompressor() { - if (decompressor_initialized_) { 
(void)inflateEnd(&stream_); } + if (decompressor_initialized_) { + (void)inflateEnd(&stream_); + } decompressor_initialized_ = false; } Status Decompress(int64_t input_length, const uint8_t* input, int64_t output_length, - uint8_t* output) { - if (!decompressor_initialized_) { RETURN_NOT_OK(InitDecompressor()); } + uint8_t* output) { + if (!decompressor_initialized_) { + RETURN_NOT_OK(InitDecompressor()); + } if (output_length == 0) { // The zlib library does not allow *output to be NULL, even when output_length // is 0 (inflate() will return Z_STREAM_ERROR). We don't consider this an @@ -168,8 +174,10 @@ class GZipCodec::GZipCodecImpl { } Status Compress(int64_t input_length, const uint8_t* input, int64_t output_buffer_len, - uint8_t* output, int64_t* output_length) { - if (!compressor_initialized_) { RETURN_NOT_OK(InitCompressor()); } + uint8_t* output, int64_t* output_length) { + if (!compressor_initialized_) { + RETURN_NOT_OK(InitCompressor()); + } stream_.next_in = const_cast(reinterpret_cast(input)); stream_.avail_in = static_cast(input_length); stream_.next_out = reinterpret_cast(output); @@ -218,14 +226,12 @@ class GZipCodec::GZipCodecImpl { bool decompressor_initialized_; }; -GZipCodec::GZipCodec(Format format) { - impl_.reset(new GZipCodecImpl(format)); -} +GZipCodec::GZipCodec(Format format) { impl_.reset(new GZipCodecImpl(format)); } GZipCodec::~GZipCodec() {} Status GZipCodec::Decompress(int64_t input_length, const uint8_t* input, - int64_t output_buffer_len, uint8_t* output) { + int64_t output_buffer_len, uint8_t* output) { return impl_->Decompress(input_length, input, output_buffer_len, output); } @@ -234,12 +240,11 @@ int64_t GZipCodec::MaxCompressedLen(int64_t input_length, const uint8_t* input) } Status GZipCodec::Compress(int64_t input_length, const uint8_t* input, - int64_t output_buffer_len, uint8_t* output, int64_t* output_length) { + int64_t output_buffer_len, uint8_t* output, + int64_t* output_length) { return impl_->Compress(input_length, input, output_buffer_len, output, output_length); } -const char* GZipCodec::name() const { - return "gzip"; -} +const char* GZipCodec::name() const { return "gzip"; } } // namespace arrow diff --git a/cpp/src/arrow/util/compression_zlib.h b/cpp/src/arrow/util/compression_zlib.h index 517a06175ec..f55d6689edf 100644 --- a/cpp/src/arrow/util/compression_zlib.h +++ b/cpp/src/arrow/util/compression_zlib.h @@ -40,10 +40,10 @@ class ARROW_EXPORT GZipCodec : public Codec { virtual ~GZipCodec(); Status Decompress(int64_t input_len, const uint8_t* input, int64_t output_len, - uint8_t* output_buffer) override; + uint8_t* output_buffer) override; Status Compress(int64_t input_len, const uint8_t* input, int64_t output_buffer_len, - uint8_t* output_buffer, int64_t* output_length) override; + uint8_t* output_buffer, int64_t* output_length) override; int64_t MaxCompressedLen(int64_t input_len, const uint8_t* input) override; diff --git a/cpp/src/arrow/util/compression_zstd.cc b/cpp/src/arrow/util/compression_zstd.cc index 5511cb9dd8f..ac6e9065d22 100644 --- a/cpp/src/arrow/util/compression_zstd.cc +++ b/cpp/src/arrow/util/compression_zstd.cc @@ -32,10 +32,11 @@ namespace arrow { // ---------------------------------------------------------------------- // ZSTD implementation -Status ZSTDCodec::Decompress( - int64_t input_len, const uint8_t* input, int64_t output_len, uint8_t* output_buffer) { - int64_t decompressed_size = ZSTD_decompress(output_buffer, - static_cast(output_len), input, static_cast(input_len)); +Status 
ZSTDCodec::Decompress(int64_t input_len, const uint8_t* input, int64_t output_len, + uint8_t* output_buffer) { + int64_t decompressed_size = + ZSTD_decompress(output_buffer, static_cast(output_len), input, + static_cast(input_len)); if (decompressed_size != output_len) { return Status::IOError("Corrupt ZSTD compressed data."); } @@ -47,9 +48,10 @@ int64_t ZSTDCodec::MaxCompressedLen(int64_t input_len, const uint8_t* input) { } Status ZSTDCodec::Compress(int64_t input_len, const uint8_t* input, - int64_t output_buffer_len, uint8_t* output_buffer, int64_t* output_length) { + int64_t output_buffer_len, uint8_t* output_buffer, + int64_t* output_length) { *output_length = ZSTD_compress(output_buffer, static_cast(output_buffer_len), - input, static_cast(input_len), 1); + input, static_cast(input_len), 1); if (ZSTD_isError(*output_length)) { return Status::IOError("ZSTD compression failure."); } diff --git a/cpp/src/arrow/util/compression_zstd.h b/cpp/src/arrow/util/compression_zstd.h index 2356d5862e0..6e40e19d280 100644 --- a/cpp/src/arrow/util/compression_zstd.h +++ b/cpp/src/arrow/util/compression_zstd.h @@ -30,10 +30,10 @@ namespace arrow { class ARROW_EXPORT ZSTDCodec : public Codec { public: Status Decompress(int64_t input_len, const uint8_t* input, int64_t output_len, - uint8_t* output_buffer) override; + uint8_t* output_buffer) override; Status Compress(int64_t input_len, const uint8_t* input, int64_t output_buffer_len, - uint8_t* output_buffer, int64_t* output_length) override; + uint8_t* output_buffer, int64_t* output_length) override; int64_t MaxCompressedLen(int64_t input_len, const uint8_t* input) override; diff --git a/cpp/src/arrow/util/cpu-info.cc b/cpp/src/arrow/util/cpu-info.cc index c0fc8bdddf4..b0667cb33ad 100644 --- a/cpp/src/arrow/util/cpu-info.cc +++ b/cpp/src/arrow/util/cpu-info.cc @@ -30,6 +30,10 @@ #include #endif +#ifdef _WIN32 +#include +#endif + #include #include @@ -62,7 +66,9 @@ static struct { string name; int64_t flag; } flag_mappings[] = { - {"ssse3", CpuInfo::SSSE3}, {"sse4_1", CpuInfo::SSE4_1}, {"sse4_2", CpuInfo::SSE4_2}, + {"ssse3", CpuInfo::SSSE3}, + {"sse4_1", CpuInfo::SSE4_1}, + {"sse4_2", CpuInfo::SSE4_2}, {"popcnt", CpuInfo::POPCNT}, }; static const int64_t num_flags = sizeof(flag_mappings) / sizeof(flag_mappings[0]); @@ -74,15 +80,66 @@ static const int64_t num_flags = sizeof(flag_mappings) / sizeof(flag_mappings[0] int64_t ParseCPUFlags(const string& values) { int64_t flags = 0; for (int i = 0; i < num_flags; ++i) { - if (contains(values, flag_mappings[i].name)) { flags |= flag_mappings[i].flag; } + if (contains(values, flag_mappings[i].name)) { + flags |= flag_mappings[i].flag; + } } return flags; } +#ifdef _WIN32 +bool RetrieveCacheSize(int64_t* cache_sizes) { + if (!cache_sizes) { + return false; + } + PSYSTEM_LOGICAL_PROCESSOR_INFORMATION buffer = nullptr; + PSYSTEM_LOGICAL_PROCESSOR_INFORMATION buffer_position = nullptr; + DWORD buffer_size = 0; + DWORD offset = 0; + typedef BOOL(WINAPI * GetLogicalProcessorInformationFuncPointer)(void*, void*); + GetLogicalProcessorInformationFuncPointer func_pointer = + (GetLogicalProcessorInformationFuncPointer)GetProcAddress( + GetModuleHandle("kernel32"), "GetLogicalProcessorInformation"); + + if (!func_pointer) { + return false; + } + + // Get buffer size + if (func_pointer(buffer, &buffer_size) && GetLastError() != ERROR_INSUFFICIENT_BUFFER) + return false; + + buffer = (PSYSTEM_LOGICAL_PROCESSOR_INFORMATION)malloc(buffer_size); + + if (!buffer || !func_pointer(buffer, &buffer_size)) { + return false; + } 
+ + buffer_position = buffer; + while (offset + sizeof(SYSTEM_LOGICAL_PROCESSOR_INFORMATION) <= buffer_size) { + if (RelationCache == buffer_position->Relationship) { + PCACHE_DESCRIPTOR cache = &buffer_position->Cache; + if (cache->Level >= 1 && cache->Level <= 3) { + cache_sizes[cache->Level - 1] += cache->Size; + } + } + offset += sizeof(SYSTEM_LOGICAL_PROCESSOR_INFORMATION); + buffer_position++; + } + + if (buffer) { + free(buffer); + } + return true; +} +#endif + void CpuInfo::Init() { std::lock_guard cpuinfo_lock(cpuinfo_mutex); - if (initialized()) { return; } + if (initialized()) { + return; + } string line; string name; @@ -93,6 +150,16 @@ void CpuInfo::Init() { memset(&cache_sizes_, 0, sizeof(cache_sizes_)); +#ifdef _WIN32 + SYSTEM_INFO system_info; + GetSystemInfo(&system_info); + num_cores = system_info.dwNumberOfProcessors; + + LARGE_INTEGER performance_frequency; + if (QueryPerformanceFrequency(&performance_frequency)) { + max_mhz = static_cast(performance_frequency.QuadPart); + } +#else // Read from /proc/cpuinfo std::ifstream cpuinfo("/proc/cpuinfo", std::ios::in); while (cpuinfo) { @@ -120,6 +187,7 @@ void CpuInfo::Init() { } } if (cpuinfo.is_open()) cpuinfo.close(); +#endif #ifdef __APPLE__ // On Mac OS X use sysctl() to get the cache sizes @@ -131,22 +199,19 @@ void CpuInfo::Init() { for (size_t i = 0; i < 3; ++i) { cache_sizes_[i] = data[i]; } +#elif _WIN32 + if (!RetrieveCacheSize(cache_sizes_)) { + SetDefaultCacheSize(); + } #else -#ifndef _SC_LEVEL1_DCACHE_SIZE - // Provide reasonable default values if no info - cache_sizes_[0] = 32 * 1024; // Level 1: 32k - cache_sizes_[1] = 256 * 1024; // Level 2: 256k - cache_sizes_[2] = 3072 * 1024; // Level 3: 3M -#else - // Call sysconf to query for the cache sizes - cache_sizes_[0] = sysconf(_SC_LEVEL1_DCACHE_SIZE); - cache_sizes_[1] = sysconf(_SC_LEVEL2_CACHE_SIZE); - cache_sizes_[2] = sysconf(_SC_LEVEL3_CACHE_SIZE); -#endif + SetDefaultCacheSize(); #endif if (max_mhz != 0) { - cycles_per_ms_ = static_cast(max_mhz) * 1000; + cycles_per_ms_ = static_cast(max_mhz); +#ifndef _WIN32 + cycles_per_ms_ *= 1000; +#endif } else { cycles_per_ms_ = 1000000; } @@ -203,4 +268,18 @@ std::string CpuInfo::model_name() { return model_name_; } +void CpuInfo::SetDefaultCacheSize() { +#ifndef _SC_LEVEL1_DCACHE_SIZE + // Provide reasonable default values if no info + cache_sizes_[0] = 32 * 1024; // Level 1: 32k + cache_sizes_[1] = 256 * 1024; // Level 2: 256k + cache_sizes_[2] = 3072 * 1024; // Level 3: 3M +#else + // Call sysconf to query for the cache sizes + cache_sizes_[0] = sysconf(_SC_LEVEL1_DCACHE_SIZE); + cache_sizes_[1] = sysconf(_SC_LEVEL2_CACHE_SIZE); + cache_sizes_[2] = sysconf(_SC_LEVEL3_CACHE_SIZE); +#endif +} + } // namespace arrow diff --git a/cpp/src/arrow/util/cpu-info.h b/cpp/src/arrow/util/cpu-info.h index 06800fc2755..f4bc8c35e34 100644 --- a/cpp/src/arrow/util/cpu-info.h +++ b/cpp/src/arrow/util/cpu-info.h @@ -78,6 +78,9 @@ class ARROW_EXPORT CpuInfo { static bool initialized() { return initialized_; } private: + /// Inits CPU cache size variables with default values + static void SetDefaultCacheSize(); + static bool initialized_; static int64_t hardware_flags_; static int64_t original_hardware_flags_; diff --git a/cpp/src/arrow/util/decimal.cc b/cpp/src/arrow/util/decimal.cc index 72ede35bef9..1a12e20f9f9 100644 --- a/cpp/src/arrow/util/decimal.cc +++ b/cpp/src/arrow/util/decimal.cc @@ -21,8 +21,8 @@ namespace arrow { namespace decimal { template -ARROW_EXPORT Status FromString( - const std::string& s, Decimal* out, int* 
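A small sketch of the CpuInfo entry points touched above. Init(), initialized() and model_name() appear in these hunks; IsSupported() and the CpuInfo::SSE4_2 flag are assumed from the Impala-derived class and are not part of this patch.

#include <iostream>

#include "arrow/util/cpu-info.h"

int main() {
  // Init() is guarded by the initialized() check, matching the test helper above.
  if (!arrow::CpuInfo::initialized()) {
    arrow::CpuInfo::Init();
  }
  std::cout << "cpu: " << arrow::CpuInfo::model_name() << std::endl;
  // Assumed query API; SSE4_2 is one of the flags listed in flag_mappings.
  std::cout << "sse4.2: " << arrow::CpuInfo::IsSupported(arrow::CpuInfo::SSE4_2)
            << std::endl;
  return 0;
}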
precision, int* scale) { +ARROW_EXPORT Status FromString(const std::string& s, Decimal* out, int* precision, + int* scale) { // Implements this regex: "(\\+?|-?)((0*)(\\d*))(\\.(\\d+))?"; if (s.empty()) { return Status::Invalid("Empty string cannot be converted to decimal"); @@ -34,7 +34,9 @@ ARROW_EXPORT Status FromString( char first_char = *charp; if (first_char == '+' || first_char == '-') { - if (first_char == '-') { sign = -1; } + if (first_char == '-') { + sign = -1; + } ++charp; } @@ -55,7 +57,9 @@ ARROW_EXPORT Status FromString( // all zeros and no decimal point if (charp == end) { - if (out != nullptr) { out->value = static_cast(0); } + if (out != nullptr) { + out->value = static_cast(0); + } // Not sure what other libraries assign precision to for this case (this case of // a string consisting only of one or more zeros) @@ -63,7 +67,9 @@ ARROW_EXPORT Status FromString( *precision = static_cast(charp - numeric_string_start); } - if (scale != nullptr) { *scale = 0; } + if (scale != nullptr) { + *scale = 0; + } return Status::OK(); } @@ -127,22 +133,26 @@ ARROW_EXPORT Status FromString( *precision = static_cast(whole_part.size() + fractional_part.size()); } - if (scale != nullptr) { *scale = static_cast(fractional_part.size()); } + if (scale != nullptr) { + *scale = static_cast(fractional_part.size()); + } - if (out != nullptr) { StringToInteger(whole_part, fractional_part, sign, &out->value); } + if (out != nullptr) { + StringToInteger(whole_part, fractional_part, sign, &out->value); + } return Status::OK(); } -template ARROW_EXPORT Status FromString( - const std::string& s, Decimal32* out, int* precision, int* scale); -template ARROW_EXPORT Status FromString( - const std::string& s, Decimal64* out, int* precision, int* scale); -template ARROW_EXPORT Status FromString( - const std::string& s, Decimal128* out, int* precision, int* scale); +template ARROW_EXPORT Status FromString(const std::string& s, Decimal32* out, + int* precision, int* scale); +template ARROW_EXPORT Status FromString(const std::string& s, Decimal64* out, + int* precision, int* scale); +template ARROW_EXPORT Status FromString(const std::string& s, Decimal128* out, + int* precision, int* scale); -void StringToInteger( - const std::string& whole, const std::string& fractional, int8_t sign, int32_t* out) { +void StringToInteger(const std::string& whole, const std::string& fractional, int8_t sign, + int32_t* out) { DCHECK(sign == -1 || sign == 1); DCHECK_NE(out, nullptr); DCHECK(!whole.empty() || !fractional.empty()); @@ -150,12 +160,14 @@ void StringToInteger( *out = std::stoi(whole, nullptr, 10) * static_cast(pow(10.0, static_cast(fractional.size()))); } - if (!fractional.empty()) { *out += std::stoi(fractional, nullptr, 10); } + if (!fractional.empty()) { + *out += std::stoi(fractional, nullptr, 10); + } *out *= sign; } -void StringToInteger( - const std::string& whole, const std::string& fractional, int8_t sign, int64_t* out) { +void StringToInteger(const std::string& whole, const std::string& fractional, int8_t sign, + int64_t* out) { DCHECK(sign == -1 || sign == 1); DCHECK_NE(out, nullptr); DCHECK(!whole.empty() || !fractional.empty()); @@ -163,12 +175,14 @@ void StringToInteger( *out = static_cast(std::stoll(whole, nullptr, 10)) * static_cast(pow(10.0, static_cast(fractional.size()))); } - if (!fractional.empty()) { *out += std::stoll(fractional, nullptr, 10); } + if (!fractional.empty()) { + *out += std::stoll(fractional, nullptr, 10); + } *out *= sign; } -void StringToInteger( - const std::string& whole, 
const std::string& fractional, int8_t sign, int128_t* out) { +void StringToInteger(const std::string& whole, const std::string& fractional, int8_t sign, + int128_t* out) { DCHECK(sign == -1 || sign == 1); DCHECK_NE(out, nullptr); DCHECK(!whole.empty() || !fractional.empty()); @@ -200,7 +214,9 @@ void FromBytes(const uint8_t* bytes, bool is_negative, Decimal128* decimal) { int128_t::backend_type& backend(decimal_value.backend()); backend.resize(LIMBS_IN_INT128, LIMBS_IN_INT128); std::memcpy(backend.limbs(), bytes, BYTES_IN_128_BITS); - if (is_negative) { decimal->value = -decimal->value; } + if (is_negative) { + decimal->value = -decimal->value; + } } void ToBytes(const Decimal32& value, uint8_t** bytes) { diff --git a/cpp/src/arrow/util/decimal.h b/cpp/src/arrow/util/decimal.h index 0d84ba89db9..20142faea3e 100644 --- a/cpp/src/arrow/util/decimal.h +++ b/cpp/src/arrow/util/decimal.h @@ -37,16 +37,16 @@ using boost::multiprecision::int128_t; template struct ARROW_EXPORT Decimal; -ARROW_EXPORT void StringToInteger( - const std::string& whole, const std::string& fractional, int8_t sign, int32_t* out); -ARROW_EXPORT void StringToInteger( - const std::string& whole, const std::string& fractional, int8_t sign, int64_t* out); -ARROW_EXPORT void StringToInteger( - const std::string& whole, const std::string& fractional, int8_t sign, int128_t* out); +ARROW_EXPORT void StringToInteger(const std::string& whole, const std::string& fractional, + int8_t sign, int32_t* out); +ARROW_EXPORT void StringToInteger(const std::string& whole, const std::string& fractional, + int8_t sign, int64_t* out); +ARROW_EXPORT void StringToInteger(const std::string& whole, const std::string& fractional, + int8_t sign, int128_t* out); template ARROW_EXPORT Status FromString(const std::string& s, Decimal* out, - int* precision = nullptr, int* scale = nullptr); + int* precision = nullptr, int* scale = nullptr); template struct ARROW_EXPORT Decimal { @@ -85,8 +85,8 @@ struct ARROW_EXPORT DecimalPrecision { }; template -ARROW_EXPORT std::string ToString( - const Decimal& decimal_value, int precision, int scale) { +ARROW_EXPORT std::string ToString(const Decimal& decimal_value, int precision, + int scale) { T value = decimal_value.value; // Decimal values are sent to clients as strings so in the interest of @@ -108,8 +108,8 @@ ARROW_EXPORT std::string ToString( if (scale > 0) { int remaining_scale = scale; do { - str[--last_char_idx] = static_cast( - (remaining_value % 10) + static_cast('0')); // Ascii offset + str[--last_char_idx] = static_cast((remaining_value % 10) + + static_cast('0')); // Ascii offset remaining_value /= 10; } while (--remaining_scale > 0); str[--last_char_idx] = '.'; diff --git a/cpp/src/arrow/util/key_value_metadata.cc b/cpp/src/arrow/util/key_value_metadata.cc index 8bddd5d0164..6877a6a5382 100644 --- a/cpp/src/arrow/util/key_value_metadata.cc +++ b/cpp/src/arrow/util/key_value_metadata.cc @@ -48,8 +48,8 @@ KeyValueMetadata::KeyValueMetadata( const std::unordered_map& map) : keys_(UnorderedMapKeys(map)), values_(UnorderedMapValues(map)) {} -KeyValueMetadata::KeyValueMetadata( - const std::vector& keys, const std::vector& values) +KeyValueMetadata::KeyValueMetadata(const std::vector& keys, + const std::vector& values) : keys_(keys), values_(values) { DCHECK_EQ(keys.size(), values.size()); } diff --git a/cpp/src/arrow/util/key_value_metadata.h b/cpp/src/arrow/util/key_value_metadata.h index a2a4623aee7..3d602131684 100644 --- a/cpp/src/arrow/util/key_value_metadata.h +++ 
b/cpp/src/arrow/util/key_value_metadata.h @@ -32,8 +32,8 @@ namespace arrow { class ARROW_EXPORT KeyValueMetadata { public: KeyValueMetadata(); - KeyValueMetadata( - const std::vector& keys, const std::vector& values); + KeyValueMetadata(const std::vector& keys, + const std::vector& values); explicit KeyValueMetadata(const std::unordered_map& map); virtual ~KeyValueMetadata() = default; diff --git a/cpp/src/arrow/util/logging.h b/cpp/src/arrow/util/logging.h index b6181219dba..89e69f932d5 100644 --- a/cpp/src/arrow/util/logging.h +++ b/cpp/src/arrow/util/logging.h @@ -50,32 +50,25 @@ namespace arrow { #define DCHECK(condition) \ ARROW_IGNORE_EXPR(condition) \ - while (false) \ - ::arrow::internal::NullLog() + while (false) ::arrow::internal::NullLog() #define DCHECK_EQ(val1, val2) \ ARROW_IGNORE_EXPR(val1) \ - while (false) \ - ::arrow::internal::NullLog() + while (false) ::arrow::internal::NullLog() #define DCHECK_NE(val1, val2) \ ARROW_IGNORE_EXPR(val1) \ - while (false) \ - ::arrow::internal::NullLog() + while (false) ::arrow::internal::NullLog() #define DCHECK_LE(val1, val2) \ ARROW_IGNORE_EXPR(val1) \ - while (false) \ - ::arrow::internal::NullLog() + while (false) ::arrow::internal::NullLog() #define DCHECK_LT(val1, val2) \ ARROW_IGNORE_EXPR(val1) \ - while (false) \ - ::arrow::internal::NullLog() + while (false) ::arrow::internal::NullLog() #define DCHECK_GE(val1, val2) \ ARROW_IGNORE_EXPR(val1) \ - while (false) \ - ::arrow::internal::NullLog() + while (false) ::arrow::internal::NullLog() #define DCHECK_GT(val1, val2) \ ARROW_IGNORE_EXPR(val1) \ - while (false) \ - ::arrow::internal::NullLog() + while (false) ::arrow::internal::NullLog() #else #define ARROW_DFATAL ARROW_FATAL @@ -107,14 +100,20 @@ class CerrLog { has_logged_(false) {} virtual ~CerrLog() { - if (has_logged_) { std::cerr << std::endl; } - if (severity_ == ARROW_FATAL) { std::exit(1); } + if (has_logged_) { + std::cerr << std::endl; + } + if (severity_ == ARROW_FATAL) { + std::exit(1); + } } template CerrLog& operator<<(const T& t) { - has_logged_ = true; - std::cerr << t; + if (severity_ != ARROW_DEBUG) { + has_logged_ = true; + std::cerr << t; + } return *this; } @@ -131,7 +130,9 @@ class FatalLog : public CerrLog { : CerrLog(ARROW_FATAL){} // NOLINT [[noreturn]] ~FatalLog() { - if (has_logged_) { std::cerr << std::endl; } + if (has_logged_) { + std::cerr << std::endl; + } std::exit(1); } }; diff --git a/cpp/src/arrow/util/memory.h b/cpp/src/arrow/util/memory.h index c5c17ef907c..fef6b315779 100644 --- a/cpp/src/arrow/util/memory.h +++ b/cpp/src/arrow/util/memory.h @@ -22,6 +22,7 @@ #include namespace arrow { +namespace internal { uint8_t* pointer_logical_and(const uint8_t* address, uintptr_t bits) { uintptr_t value = reinterpret_cast(address); @@ -31,7 +32,7 @@ uint8_t* pointer_logical_and(const uint8_t* address, uintptr_t bits) { // A helper function for doing memcpy with multiple threads. This is required // to saturate the memory bandwidth of modern cpus. void parallel_memcopy(uint8_t* dst, const uint8_t* src, int64_t nbytes, - uintptr_t block_size, int num_threads) { + uintptr_t block_size, int num_threads) { std::vector threadpool(num_threads); uint8_t* left = pointer_logical_and(src + block_size - 1, ~(block_size - 1)); uint8_t* right = pointer_logical_and(src + nbytes, ~(block_size - 1)); @@ -52,18 +53,21 @@ void parallel_memcopy(uint8_t* dst, const uint8_t* src, int64_t nbytes, // Start all threads first and handle leftovers while threads run. 
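Aside: the reflowed DCHECK macros in the logging.h hunk above all rely on the same idiom: in release builds the streamed message is swallowed by a no-op logger sitting behind while (false), so nothing is evaluated or printed. A self-contained sketch of that idiom follows; NullSink and MY_DCHECK are illustrative names, and Arrow additionally routes the condition through ARROW_IGNORE_EXPR, which is omitted here.

// Sketch of the "while (false) null-logger" idiom behind the DCHECK macros above.
#include <iostream>

struct NullSink {
  template <typename T>
  NullSink& operator<<(const T&) { return *this; }  // discard anything streamed in
};

// The while (false) body is never entered, so the stream expression never runs.
#define MY_DCHECK(condition) \
  while (false) NullSink()

int main() {
  MY_DCHECK(2 + 2 == 5) << "never evaluated, never printed";
  std::cout << "program continues normally" << std::endl;
  return 0;
}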
for (int i = 0; i < num_threads; i++) { - threadpool[i] = std::thread( - memcpy, dst + prefix + i * chunk_size, left + i * chunk_size, chunk_size); + threadpool[i] = std::thread(memcpy, dst + prefix + i * chunk_size, + left + i * chunk_size, chunk_size); } memcpy(dst, src, prefix); memcpy(dst + prefix + num_threads * chunk_size, right, suffix); for (auto& t : threadpool) { - if (t.joinable()) { t.join(); } + if (t.joinable()) { + t.join(); + } } } +} // namespace internal } // namespace arrow #endif // ARROW_UTIL_MEMORY_H diff --git a/cpp/src/arrow/util/random.h b/cpp/src/arrow/util/random.h index 31f2b0680fe..2e05a73033d 100644 --- a/cpp/src/arrow/util/random.h +++ b/cpp/src/arrow/util/random.h @@ -12,13 +12,14 @@ #include namespace arrow { - -namespace random_internal { +namespace internal { +namespace random { static const uint32_t M = 2147483647L; // 2^31-1 const double kTwoPi = 6.283185307179586476925286; -} // namespace random_internal +} // namespace random +} // namespace internal // A very simple random number generator. Not especially good at // generating truly random bits, but good enough for our needs in this @@ -27,7 +28,9 @@ class Random { public: explicit Random(uint32_t s) : seed_(s & 0x7fffffffu) { // Avoid bad seeds. - if (seed_ == 0 || seed_ == random_internal::M) { seed_ = 1; } + if (seed_ == 0 || seed_ == internal::random::M) { + seed_ = 1; + } } // Next pseudo-random 32-bit unsigned integer. @@ -44,11 +47,13 @@ class Random { uint64_t product = seed_ * A; // Compute (product % M) using the fact that ((x << 31) % M) == x. - seed_ = static_cast((product >> 31) + (product & random_internal::M)); + seed_ = static_cast((product >> 31) + (product & internal::random::M)); // The first reduction may overflow by 1 bit, so we may need to // repeat. mod == M is not possible; using > allows the faster // sign-bit-based test. - if (seed_ > random_internal::M) { seed_ -= random_internal::M; } + if (seed_ > internal::random::M) { + seed_ -= internal::random::M; + } return seed_; } @@ -95,16 +100,16 @@ class Random { // Adapted from WebRTC source code at: // webrtc/trunk/modules/video_coding/main/test/test_util.cc double Normal(double mean, double std_dev) { - double uniform1 = (Next() + 1.0) / (random_internal::M + 1.0); - double uniform2 = (Next() + 1.0) / (random_internal::M + 1.0); - return ( - mean + - std_dev * sqrt(-2 * ::log(uniform1)) * cos(random_internal::kTwoPi * uniform2)); + double uniform1 = (Next() + 1.0) / (internal::random::M + 1.0); + double uniform2 = (Next() + 1.0) / (internal::random::M + 1.0); + return (mean + + std_dev * sqrt(-2 * ::log(uniform1)) * + cos(internal::random::kTwoPi * uniform2)); } // Return a random number between 0.0 and 1.0 inclusive. double NextDoubleFraction() { - return Next() / static_cast(random_internal::M + 1.0); + return Next() / static_cast(internal::random::M + 1.0); } private: diff --git a/cpp/src/arrow/util/rle-encoding-test.cc b/cpp/src/arrow/util/rle-encoding-test.cc index 7c9b33c3494..7549b874355 100644 --- a/cpp/src/arrow/util/rle-encoding-test.cc +++ b/cpp/src/arrow/util/rle-encoding-test.cc @@ -178,7 +178,7 @@ TEST(BitArray, TestMixed) { // exactly 'expected_encoding'. // if expected_len is not -1, it will validate the encoded size is correct. 
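Aside: the Random class touched above advances its state with a Lehmer-style multiplicative congruential generator and avoids a division by using the identity ((x << 31) % M) == x for M = 2^31 - 1. A stand-alone sketch of that step follows; only M appears in the hunks shown, so the classic Park-Miller multiplier 16807 used below is an assumption standing in for the constant A defined elsewhere in the header.

// Sketch of the modulo-free Lehmer generator step used by Random::Next() above.
// The multiplier value is an assumption; only M = 2^31 - 1 appears in the diff.
#include <cstdint>
#include <iostream>

int main() {
  const uint64_t M = 2147483647ULL;  // 2^31 - 1, as in the random header above
  const uint64_t A = 16807ULL;       // assumed Park-Miller multiplier
  uint32_t seed = 12345;

  for (int i = 0; i < 5; ++i) {
    uint64_t product = seed * A;
    // (product % M) computed via ((x << 31) % M) == x: fold the high bits back in.
    seed = static_cast<uint32_t>((product >> 31) + (product & M));
    // The fold can overshoot by at most one multiple of M, so one subtraction fixes it.
    if (seed > M) seed -= M;
    std::cout << seed << std::endl;
  }
  return 0;
}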
void ValidateRle(const vector& values, int bit_width, uint8_t* expected_encoding, - int expected_len) { + int expected_len) { const int len = 64 * 1024; uint8_t buffer[len]; EXPECT_LE(expected_len, len); @@ -190,7 +190,9 @@ void ValidateRle(const vector& values, int bit_width, uint8_t* expected_enc } int encoded_len = encoder.Flush(); - if (expected_len != -1) { EXPECT_EQ(encoded_len, expected_len); } + if (expected_len != -1) { + EXPECT_EQ(encoded_len, expected_len); + } if (expected_encoding != NULL) { EXPECT_EQ(memcmp(buffer, expected_encoding, expected_len), 0); } @@ -211,7 +213,7 @@ void ValidateRle(const vector& values, int bit_width, uint8_t* expected_enc RleDecoder decoder(buffer, len, bit_width); vector values_read(values.size()); ASSERT_EQ(values.size(), - decoder.GetBatch(values_read.data(), static_cast(values.size()))); + decoder.GetBatch(values_read.data(), static_cast(values.size()))); EXPECT_EQ(values, values_read); } } @@ -224,7 +226,9 @@ bool CheckRoundTrip(const vector& values, int bit_width) { RleEncoder encoder(buffer, len, bit_width); for (size_t i = 0; i < values.size(); ++i) { bool result = encoder.Put(values[i]); - if (!result) { return false; } + if (!result) { + return false; + } } int encoded_len = encoder.Flush(); int out = 0; @@ -233,7 +237,9 @@ bool CheckRoundTrip(const vector& values, int bit_width) { RleDecoder decoder(buffer, encoded_len, bit_width); for (size_t i = 0; i < values.size(); ++i) { EXPECT_TRUE(decoder.Get(&out)); - if (values[i] != out) { return false; } + if (values[i] != out) { + return false; + } } } @@ -245,7 +251,9 @@ bool CheckRoundTrip(const vector& values, int bit_width) { decoder.GetBatch(values_read.data(), static_cast(values.size()))) { return false; } - if (values != values_read) { return false; } + if (values != values_read) { + return false; + } } return true; @@ -294,8 +302,8 @@ TEST(Rle, SpecificSequences) { ValidateRle(values, 1, expected_buffer, 1 + num_groups); for (int width = 2; width <= MAX_WIDTH; ++width) { int num_values = static_cast(BitUtil::Ceil(100, 8)) * 8; - ValidateRle( - values, width, NULL, 1 + static_cast(BitUtil::Ceil(width * num_values, 8))); + ValidateRle(values, width, NULL, + 1 + static_cast(BitUtil::Ceil(width * num_values, 8))); } } @@ -352,8 +360,7 @@ TEST(Rle, BitWidthZeroLiteral) { // group but flush before finishing. 
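Aside, as a quick worked check of the expected-length expression passed to ValidateRle above: with num_values = Ceil(100, 8) * 8 = 104 values at bit width 2, the encoded payload occupies Ceil(2 * 104, 8) = 26 bytes, plus one indicator byte, so the test expects 27 encoded bytes for that width.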
TEST(BitRle, Flush) { vector values; - for (int i = 0; i < 16; ++i) - values.push_back(1); + for (int i = 0; i < 16; ++i) values.push_back(1); values.push_back(0); ValidateRle(values, 1, NULL, -1); values.push_back(1); @@ -385,7 +392,9 @@ TEST(BitRle, Random) { for (int i = 0; i < ngroups; ++i) { int group_size = dist(gen); - if (group_size > max_group_size) { group_size = 1; } + if (group_size > max_group_size) { + group_size = 1; + } for (int i = 0; i < group_size; ++i) { values.push_back(parity); } diff --git a/cpp/src/arrow/util/rle-encoding.h b/cpp/src/arrow/util/rle-encoding.h index 9ec62351446..e69077807df 100644 --- a/cpp/src/arrow/util/rle-encoding.h +++ b/cpp/src/arrow/util/rle-encoding.h @@ -21,8 +21,8 @@ #ifndef ARROW_UTIL_RLE_ENCODING_H #define ARROW_UTIL_RLE_ENCODING_H -#include #include +#include #include "arrow/util/bit-stream-utils.h" #include "arrow/util/bit-util.h" @@ -122,7 +122,8 @@ class RleDecoder { /// Like GetBatchWithDict but add spacing for null entries template int GetBatchWithDictSpaced(const T* dictionary, T* values, int batch_size, - int null_count, const uint8_t* valid_bits, int64_t valid_bits_offset); + int null_count, const uint8_t* valid_bits, + int64_t valid_bits_offset); protected: BitReader bit_reader_; @@ -289,7 +290,7 @@ inline int RleDecoder::GetBatch(T* values, int batch_size) { int repeat_batch = std::min(batch_size - values_read, static_cast(repeat_count_)); std::fill(values + values_read, values + values_read + repeat_batch, - static_cast(current_value_)); + static_cast(current_value_)); repeat_count_ -= repeat_batch; values_read += repeat_batch; } else if (literal_count_ > 0) { @@ -318,7 +319,7 @@ inline int RleDecoder::GetBatchWithDict(const T* dictionary, T* values, int batc int repeat_batch = std::min(batch_size - values_read, static_cast(repeat_count_)); std::fill(values + values_read, values + values_read + repeat_batch, - dictionary[current_value_]); + dictionary[current_value_]); repeat_count_ -= repeat_batch; values_read += repeat_batch; } else if (literal_count_ > 0) { @@ -345,8 +346,9 @@ inline int RleDecoder::GetBatchWithDict(const T* dictionary, T* values, int batc template inline int RleDecoder::GetBatchWithDictSpaced(const T* dictionary, T* values, - int batch_size, int null_count, const uint8_t* valid_bits, - int64_t valid_bits_offset) { + int batch_size, int null_count, + const uint8_t* valid_bits, + int64_t valid_bits_offset) { DCHECK_GE(bit_width_, 0); int values_read = 0; int remaining_nulls = null_count; @@ -379,8 +381,8 @@ inline int RleDecoder::GetBatchWithDictSpaced(const T* dictionary, T* values, std::fill(values + values_read, values + values_read + repeat_batch, value); values_read += repeat_batch; } else if (literal_count_ > 0) { - int literal_batch = std::min( - batch_size - values_read - remaining_nulls, static_cast(literal_count_)); + int literal_batch = std::min(batch_size - values_read - remaining_nulls, + static_cast(literal_count_)); // Decode the literals constexpr int kBufferSize = 1024; @@ -434,7 +436,7 @@ bool RleDecoder::NextCounts() { repeat_count_ = indicator_value >> 1; bool result = bit_reader_.GetAligned(static_cast(BitUtil::Ceil(bit_width_, 8)), - reinterpret_cast(¤t_value_)); + reinterpret_cast(¤t_value_)); DCHECK(result); } return true; @@ -509,8 +511,8 @@ inline void RleEncoder::FlushRepeatedRun() { // The lsb of 0 indicates this is a repeated run int32_t indicator_value = repeat_count_ << 1 | 0; result &= bit_writer_.PutVlqInt(indicator_value); - result &= bit_writer_.PutAligned( - 
current_value_, static_cast(BitUtil::Ceil(bit_width_, 8))); + result &= bit_writer_.PutAligned(current_value_, + static_cast(BitUtil::Ceil(bit_width_, 8))); DCHECK(result); num_buffered_values_ = 0; repeat_count_ = 0; @@ -552,7 +554,7 @@ inline void RleEncoder::FlushBufferedValues(bool done) { inline int RleEncoder::Flush() { if (literal_count_ > 0 || repeat_count_ > 0 || num_buffered_values_ > 0) { bool all_repeat = literal_count_ == 0 && (repeat_count_ == num_buffered_values_ || - num_buffered_values_ == 0); + num_buffered_values_ == 0); // There is something pending, figure out if it's a repeated or literal run if (repeat_count_ > 0 && all_repeat) { FlushRepeatedRun(); diff --git a/cpp/src/arrow/util/sse-util.h b/cpp/src/arrow/util/sse-util.h index 570c4057a75..a0ec8a2e939 100644 --- a/cpp/src/arrow/util/sse-util.h +++ b/cpp/src/arrow/util/sse-util.h @@ -53,8 +53,8 @@ static const int STRCMP_MODE = /// Precomputed mask values up to 16 bits. static const int SSE_BITMASK[CHARS_PER_128_BIT_REGISTER] = { - 1 << 0, 1 << 1, 1 << 2, 1 << 3, 1 << 4, 1 << 5, 1 << 6, 1 << 7, 1 << 8, 1 << 9, - 1 << 10, 1 << 11, 1 << 12, 1 << 13, 1 << 14, 1 << 15, + 1 << 0, 1 << 1, 1 << 2, 1 << 3, 1 << 4, 1 << 5, 1 << 6, 1 << 7, + 1 << 8, 1 << 9, 1 << 10, 1 << 11, 1 << 12, 1 << 13, 1 << 14, 1 << 15, }; } // namespace SSEUtil diff --git a/cpp/src/arrow/util/stl-util-test.cc b/cpp/src/arrow/util/stl-util-test.cc index 526520e7a2d..629eb24c3d9 100644 --- a/cpp/src/arrow/util/stl-util-test.cc +++ b/cpp/src/arrow/util/stl-util-test.cc @@ -25,6 +25,7 @@ #include "arrow/test-util.h" namespace arrow { +namespace internal { TEST(StlUtilTest, VectorAddRemoveTest) { std::vector values; @@ -57,4 +58,5 @@ TEST(StlUtilTest, VectorAddRemoveTest) { EXPECT_TRUE(result3.empty()); } +} // namespace internal } // namespace arrow diff --git a/cpp/src/arrow/util/stl.h b/cpp/src/arrow/util/stl.h index d58689b7488..27c1778680c 100644 --- a/cpp/src/arrow/util/stl.h +++ b/cpp/src/arrow/util/stl.h @@ -23,6 +23,7 @@ #include "arrow/util/logging.h" namespace arrow { +namespace internal { template inline std::vector DeleteVectorElement(const std::vector& values, size_t index) { @@ -40,8 +41,8 @@ inline std::vector DeleteVectorElement(const std::vector& values, size_t i } template -inline std::vector AddVectorElement( - const std::vector& values, size_t index, const T& new_element) { +inline std::vector AddVectorElement(const std::vector& values, size_t index, + const T& new_element) { DCHECK_LE(index, values.size()); std::vector out; out.reserve(values.size() + 1); @@ -55,6 +56,7 @@ inline std::vector AddVectorElement( return out; } +} // namespace internal } // namespace arrow #endif // ARROW_UTIL_STL_H diff --git a/cpp/src/arrow/util/string.h b/cpp/src/arrow/util/string.h index 5d9fdc88ced..6e70ddcccef 100644 --- a/cpp/src/arrow/util/string.h +++ b/cpp/src/arrow/util/string.h @@ -46,7 +46,9 @@ static inline Status ParseHexValue(const char* data, uint8_t* out) { const char* pos2 = std::lower_bound(kAsciiTable, kAsciiTable + 16, c2); // Error checking - if (*pos1 != c1 || *pos2 != c2) { return Status::Invalid("Encountered non-hex digit"); } + if (*pos1 != c1 || *pos2 != c2) { + return Status::Invalid("Encountered non-hex digit"); + } *out = static_cast((pos1 - kAsciiTable) << 4 | (pos2 - kAsciiTable)); return Status::OK(); diff --git a/cpp/src/arrow/visitor.cc b/cpp/src/arrow/visitor.cc index 117578965cc..203ed6d4af9 100644 --- a/cpp/src/arrow/visitor.cc +++ b/cpp/src/arrow/visitor.cc @@ -59,6 +59,8 @@ Status ArrayVisitor::Visit(const 
DecimalArray& array) { return Status::NotImplemented("decimal"); } +#undef ARRAY_VISITOR_DEFAULT + // ---------------------------------------------------------------------- // Default implementations of TypeVisitor methods @@ -95,4 +97,6 @@ TYPE_VISITOR_DEFAULT(StructType); TYPE_VISITOR_DEFAULT(UnionType); TYPE_VISITOR_DEFAULT(DictionaryType); +#undef TYPE_VISITOR_DEFAULT + } // namespace arrow diff --git a/cpp/src/arrow/visitor_inline.h b/cpp/src/arrow/visitor_inline.h index 7478950b894..54f9e880b83 100644 --- a/cpp/src/arrow/visitor_inline.h +++ b/cpp/src/arrow/visitor_inline.h @@ -65,6 +65,8 @@ inline Status VisitTypeInline(const DataType& type, VISITOR* visitor) { return Status::NotImplemented("Type not implemented"); } +#undef TYPE_VISIT_INLINE + #define ARRAY_VISIT_INLINE(TYPE_CLASS) \ case TYPE_CLASS::type_id: \ return visitor->Visit( \ diff --git a/cpp/src/plasma/.gitignore b/cpp/src/plasma/.gitignore new file mode 100644 index 00000000000..163b5c56e91 --- /dev/null +++ b/cpp/src/plasma/.gitignore @@ -0,0 +1,18 @@ +# Licensed to the Apache Software Foundation (ASF) under one +# or more contributor license agreements. See the NOTICE file +# distributed with this work for additional information +# regarding copyright ownership. The ASF licenses this file +# to you under the Apache License, Version 2.0 (the +# "License"); you may not use this file except in compliance +# with the License. You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, +# software distributed under the License is distributed on an +# "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY +# KIND, either express or implied. See the License for the +# specific language governing permissions and limitations +# under the License. + +*_generated.h diff --git a/cpp/src/plasma/CMakeLists.txt b/cpp/src/plasma/CMakeLists.txt index 4ff3beba779..7e91202623e 100644 --- a/cpp/src/plasma/CMakeLists.txt +++ b/cpp/src/plasma/CMakeLists.txt @@ -19,16 +19,13 @@ cmake_minimum_required(VERSION 2.8) project(plasma) +set(CMAKE_MODULE_PATH ${CMAKE_MODULE_PATH} "${CMAKE_SOURCE_DIR}/../python/cmake_modules") + find_package(PythonLibsNew REQUIRED) find_package(Threads) -option(PLASMA_PYTHON - "Build the Plasma Python extensions" - OFF) - -if(APPLE) - SET(CMAKE_SHARED_LIBRARY_SUFFIX ".so") -endif(APPLE) +set(PLASMA_SO_VERSION "0") +set(PLASMA_ABI_VERSION "${PLASMA_SO_VERSION}.0.0") include_directories(SYSTEM ${PYTHON_INCLUDE_DIRS}) include_directories("${FLATBUFFERS_INCLUDE_DIR}" "${CMAKE_CURRENT_LIST_DIR}/" "${CMAKE_CURRENT_LIST_DIR}/thirdparty/" "${CMAKE_CURRENT_LIST_DIR}/../") @@ -40,7 +37,7 @@ set(CMAKE_C_FLAGS "${CMAKE_C_FLAGS} -Wno-conversion") # Compile flatbuffers set(PLASMA_FBS_SRC "${CMAKE_CURRENT_LIST_DIR}/format/plasma.fbs" "${CMAKE_CURRENT_LIST_DIR}/format/common.fbs") -set(OUTPUT_DIR ${CMAKE_CURRENT_LIST_DIR}/format/) +set(OUTPUT_DIR ${CMAKE_CURRENT_LIST_DIR}/) set(PLASMA_FBS_OUTPUT_FILES "${OUTPUT_DIR}/common_generated.h" @@ -69,8 +66,6 @@ endif() set(CMAKE_CXX_FLAGS "${CMAKE_CXX_FLAGS} -fPIC") -set_source_files_properties(extension.cc PROPERTIES COMPILE_FLAGS -Wno-strict-aliasing) - set(PLASMA_SRCS client.cc common.cc @@ -92,22 +87,53 @@ ADD_ARROW_LIB(plasma # The optimization flag -O3 is suggested by dlmalloc.c, which is #included in # malloc.cc; we set it here regardless of whether we do a debug or release build. 
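Aside: the #undef lines added to visitor.cc and visitor_inline.h above close out a common pattern: a macro stamps out one identical NotImplemented default per type and is then undefined so the name cannot leak into other translation units. A self-contained sketch of the pattern follows; the type and macro names are illustrative, not Arrow's.

// Sketch of the "stamp out defaults with a macro, then #undef it" pattern from visitor.cc.
#include <iostream>
#include <string>

struct IntType {};
struct DoubleType {};

struct Visitor {
  std::string Visit(const IntType&);
  std::string Visit(const DoubleType&);
};

// One macro generates an identical default body per type...
#define VISITOR_DEFAULT(TYPE_CLASS)                     \
  std::string Visitor::Visit(const TYPE_CLASS&) {       \
    return "not implemented: " #TYPE_CLASS;             \
  }

VISITOR_DEFAULT(IntType)
VISITOR_DEFAULT(DoubleType)

// ...and is removed again so it cannot collide with definitions elsewhere.
#undef VISITOR_DEFAULT

int main() {
  Visitor v;
  std::cout << v.Visit(IntType{}) << std::endl;
  return 0;
}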
-set_source_files_properties(malloc.cc PROPERTIES COMPILE_FLAGS "-Wno-error -O3") +set_source_files_properties(malloc.cc PROPERTIES + COMPILE_FLAGS "-O3") + +if ("${COMPILER_FAMILY}" STREQUAL "clang") + set_property(SOURCE malloc.cc + APPEND_STRING + PROPERTY COMPILE_FLAGS + " -Wno-parentheses-equality -Wno-shorten-64-to-32") +endif() + +if ("${COMPILER_FAMILY}" STREQUAL "gcc") + set_property(SOURCE malloc.cc + APPEND_STRING + PROPERTY COMPILE_FLAGS + " -Wno-conversion") +endif() add_executable(plasma_store store.cc) target_link_libraries(plasma_store plasma_static) +# Headers: top level +install(FILES + common.h + common_generated.h + client.h + events.h + plasma.h + plasma_generated.h + protocol.h + DESTINATION "${CMAKE_INSTALL_INCLUDEDIR}/plasma") + +# Plasma store +install(TARGETS plasma_store DESTINATION ${CMAKE_INSTALL_BINDIR}) + +# pkg-config support +configure_file(plasma.pc.in + "${CMAKE_CURRENT_BINARY_DIR}/plasma.pc" + @ONLY) +install( + FILES "${CMAKE_CURRENT_BINARY_DIR}/plasma.pc" + DESTINATION "${CMAKE_INSTALL_LIBDIR}/pkgconfig/") + +####################################### +# Unit tests +####################################### + ADD_ARROW_TEST(test/serialization_tests) ARROW_TEST_LINK_LIBRARIES(test/serialization_tests plasma_static) ADD_ARROW_TEST(test/client_tests) ARROW_TEST_LINK_LIBRARIES(test/client_tests plasma_static) - -if(PLASMA_PYTHON) - add_library(plasma_extension SHARED extension.cc) - - if(APPLE) - target_link_libraries(plasma_extension plasma_static "-undefined dynamic_lookup") - else(APPLE) - target_link_libraries(plasma_extension plasma_static -Wl,--whole-archive ${FLATBUFFERS_STATIC_LIB} -Wl,--no-whole-archive) - endif(APPLE) -endif() diff --git a/cpp/src/plasma/client.cc b/cpp/src/plasma/client.cc index dcb78e7ec52..8ea62c6e553 100644 --- a/cpp/src/plasma/client.cc +++ b/cpp/src/plasma/client.cc @@ -51,11 +51,31 @@ #define XXH64_DEFAULT_SEED 0 +namespace plasma { + // Number of threads used for memcopy and hash computations. constexpr int64_t kThreadPoolSize = 8; constexpr int64_t kBytesInMB = 1 << 20; static std::vector threadpool_(kThreadPoolSize); +struct ObjectInUseEntry { + /// A count of the number of times this client has called PlasmaClient::Create + /// or + /// PlasmaClient::Get on this object ID minus the number of calls to + /// PlasmaClient::Release. + /// When this count reaches zero, we remove the entry from the ObjectsInUse + /// and decrement a count in the relevant ClientMmapTableEntry. + int count; + /// Cached information to read the object. + PlasmaObject object; + /// A flag representing whether the object has been sealed. + bool is_sealed; +}; + +PlasmaClient::PlasmaClient() {} + +PlasmaClient::~PlasmaClient() {} + // If the file descriptor fd has been mmapped in this client process before, // return the pointer that was returned by mmap, otherwise mmap it and store the // pointer in a hash table. @@ -68,7 +88,9 @@ uint8_t* PlasmaClient::lookup_or_mmap(int fd, int store_fd_val, int64_t map_size uint8_t* result = reinterpret_cast( mmap(NULL, map_size, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0)); // TODO(pcm): Don't fail here, instead return a Status. 
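Aside: lookup_or_mmap in the client code above maps the descriptor received from the store with mmap and, per the TODO, still aborts on failure. A minimal stand-alone sketch of the same mapping call, reporting an error instead of logging fatally, follows; the temporary backing file and the program structure are illustrative only.

// Sketch of the mmap/MAP_FAILED handling in lookup_or_mmap; file path is a placeholder.
#include <fcntl.h>
#include <sys/mman.h>
#include <unistd.h>
#include <cstdint>
#include <cstring>
#include <iostream>

int main() {
  const int64_t map_size = 4096;
  int fd = open("/tmp/plasma_mmap_sketch", O_RDWR | O_CREAT, 0600);
  if (fd < 0 || ftruncate(fd, map_size) != 0) {
    std::cerr << "could not prepare backing file" << std::endl;
    return 1;
  }

  // Same protection and sharing flags the client uses for store segments.
  uint8_t* result = reinterpret_cast<uint8_t*>(
      mmap(NULL, map_size, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0));
  close(fd);  // the mapping stays valid after the descriptor is closed
  if (result == MAP_FAILED) {
    std::cerr << "mmap failed" << std::endl;  // the TODO suggests returning a Status here
    return 1;
  }

  std::memset(result, 0, map_size);  // the mapping is usable like ordinary memory
  munmap(result, map_size);
  return 0;
}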
- if (result == MAP_FAILED) { ARROW_LOG(FATAL) << "mmap failed"; } + if (result == MAP_FAILED) { + ARROW_LOG(FATAL) << "mmap failed"; + } close(fd); ClientMmapTableEntry& entry = mmap_table_[store_fd_val]; entry.pointer = result; @@ -86,8 +108,8 @@ uint8_t* PlasmaClient::lookup_mmapped_file(int store_fd_val) { return entry->second.pointer; } -void PlasmaClient::increment_object_count( - const ObjectID& object_id, PlasmaObject* object, bool is_sealed) { +void PlasmaClient::increment_object_count(const ObjectID& object_id, PlasmaObject* object, + bool is_sealed) { // Increment the count of the object to track the fact that it is being used. // The corresponding decrement should happen in PlasmaClient::Release. auto elem = objects_in_use_.find(object_id); @@ -122,7 +144,7 @@ void PlasmaClient::increment_object_count( } Status PlasmaClient::Create(const ObjectID& object_id, int64_t data_size, - uint8_t* metadata, int64_t metadata_size, uint8_t** data) { + uint8_t* metadata, int64_t metadata_size, uint8_t** data) { ARROW_LOG(DEBUG) << "called plasma_create on conn " << store_conn_ << " with size " << data_size << " and metadata size " << metadata_size; RETURN_NOT_OK(SendCreateRequest(store_conn_, object_id, data_size, metadata_size)); @@ -130,7 +152,7 @@ Status PlasmaClient::Create(const ObjectID& object_id, int64_t data_size, RETURN_NOT_OK(PlasmaReceive(store_conn_, MessageType_PlasmaCreateReply, &buffer)); ObjectID id; PlasmaObject object; - RETURN_NOT_OK(ReadCreateReply(buffer.data(), &id, &object)); + RETURN_NOT_OK(ReadCreateReply(buffer.data(), buffer.size(), &id, &object)); // If the CreateReply included an error, then the store will not send a file // descriptor. int fd = recv_fd(store_conn_); @@ -163,7 +185,7 @@ Status PlasmaClient::Create(const ObjectID& object_id, int64_t data_size, } Status PlasmaClient::Get(const ObjectID* object_ids, int64_t num_objects, - int64_t timeout_ms, ObjectBuffer* object_buffers) { + int64_t timeout_ms, ObjectBuffer* object_buffers) { // Fill out the info for the objects that are already in use locally. bool all_present = true; for (int i = 0; i < num_objects; ++i) { @@ -193,7 +215,9 @@ Status PlasmaClient::Get(const ObjectID* object_ids, int64_t num_objects, } } - if (all_present) { return Status::OK(); } + if (all_present) { + return Status::OK(); + } // If we get here, then the objects aren't all currently in use by this // client, so we need to send a request to the plasma store. @@ -203,8 +227,8 @@ Status PlasmaClient::Get(const ObjectID* object_ids, int64_t num_objects, std::vector received_object_ids(num_objects); std::vector object_data(num_objects); PlasmaObject* object; - RETURN_NOT_OK(ReadGetReply( - buffer.data(), received_object_ids.data(), object_data.data(), num_objects)); + RETURN_NOT_OK(ReadGetReply(buffer.data(), buffer.size(), received_object_ids.data(), + object_data.data(), num_objects)); for (int i = 0; i < num_objects; ++i) { DCHECK(received_object_ids[i] == object_ids[i]); @@ -300,13 +324,17 @@ Status PlasmaClient::PerformRelease(const ObjectID& object_id) { } Status PlasmaClient::Release(const ObjectID& object_id) { + // If the client is already disconnected, ignore release requests. + if (store_conn_ < 0) { + return Status::OK(); + } // Add the new object to the release history. release_history_.push_front(object_id); // If there are too many bytes in use by the client or if there are too many // pending release calls, and there are at least some pending release calls in // the release_history list, then release some objects. 
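Aside: increment_object_count and the ObjectInUseEntry struct above implement per-object reference counting on the client side: each Create or Get bumps a counter, each Release decrements it, and the entry is dropped when the count reaches zero. A reduced sketch of that bookkeeping follows, using plain standard containers and an illustrative ObjectId type rather than Arrow's actual members.

// Reduced sketch of the client-side in-use bookkeeping described above.
#include <iostream>
#include <string>
#include <unordered_map>

struct InUseEntry {
  int count = 0;          // Create/Get calls minus Release calls
  bool is_sealed = false;
};

using ObjectId = std::string;
std::unordered_map<ObjectId, InUseEntry> objects_in_use;

void IncrementObjectCount(const ObjectId& id, bool is_sealed) {
  auto& entry = objects_in_use[id];  // creates the entry on first use
  entry.count += 1;
  entry.is_sealed = is_sealed;
}

void ReleaseObject(const ObjectId& id) {
  auto it = objects_in_use.find(id);
  if (it == objects_in_use.end()) return;
  if (--it->second.count == 0) {
    objects_in_use.erase(it);        // last reference gone: drop the entry
  }
}

int main() {
  IncrementObjectCount("obj-1", /*is_sealed=*/true);
  IncrementObjectCount("obj-1", /*is_sealed=*/true);
  ReleaseObject("obj-1");
  std::cout << "entries in use: " << objects_in_use.size() << std::endl;  // 1
  ReleaseObject("obj-1");
  std::cout << "entries in use: " << objects_in_use.size() << std::endl;  // 0
  return 0;
}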
while ((in_use_object_bytes_ > std::min(kL3CacheSizeBytes, store_capacity_ / 100) || - release_history_.size() > config_.release_delay) && + release_history_.size() > config_.release_delay) && release_history_.size() > 0) { // Perform a release for the object ID for the first pending release. RETURN_NOT_OK(PerformRelease(release_history_.back())); @@ -328,7 +356,8 @@ Status PlasmaClient::Contains(const ObjectID& object_id, bool* has_object) { std::vector buffer; RETURN_NOT_OK(PlasmaReceive(store_conn_, MessageType_PlasmaContainsReply, &buffer)); ObjectID object_id2; - RETURN_NOT_OK(ReadContainsReply(buffer.data(), &object_id2, has_object)); + RETURN_NOT_OK( + ReadContainsReply(buffer.data(), buffer.size(), &object_id2, has_object)); } return Status::OK(); } @@ -340,8 +369,9 @@ static void ComputeBlockHash(const unsigned char* data, int64_t nbytes, uint64_t *hash = XXH64_digest(&hash_state); } -static inline bool compute_object_hash_parallel( - XXH64_state_t* hash_state, const unsigned char* data, int64_t nbytes) { +static inline bool compute_object_hash_parallel(XXH64_state_t* hash_state, + const unsigned char* data, + int64_t nbytes) { // Note that this function will likely be faster if the address of data is // aligned on a 64-byte boundary. const int num_threads = kThreadPoolSize; @@ -356,16 +386,18 @@ static inline bool compute_object_hash_parallel( // Each thread gets a "chunk" of k blocks, except the suffix thread. for (int i = 0; i < num_threads; i++) { - threadpool_[i] = std::thread(ComputeBlockHash, - reinterpret_cast(data_address) + i * chunk_size, chunk_size, - &threadhash[i]); + threadpool_[i] = std::thread( + ComputeBlockHash, reinterpret_cast(data_address) + i * chunk_size, + chunk_size, &threadhash[i]); } - ComputeBlockHash( - reinterpret_cast(right_address), suffix, &threadhash[num_threads]); + ComputeBlockHash(reinterpret_cast(right_address), suffix, + &threadhash[num_threads]); // Join the threads. for (auto& t : threadpool_) { - if (t.joinable()) { t.join(); } + if (t.joinable()) { + t.join(); + } } XXH64_update(hash_state, (unsigned char*)threadhash, sizeof(threadhash)); @@ -376,32 +408,16 @@ static uint64_t compute_object_hash(const ObjectBuffer& obj_buffer) { XXH64_state_t hash_state; XXH64_reset(&hash_state, XXH64_DEFAULT_SEED); if (obj_buffer.data_size >= kBytesInMB) { - compute_object_hash_parallel( - &hash_state, (unsigned char*)obj_buffer.data, obj_buffer.data_size); + compute_object_hash_parallel(&hash_state, (unsigned char*)obj_buffer.data, + obj_buffer.data_size); } else { XXH64_update(&hash_state, (unsigned char*)obj_buffer.data, obj_buffer.data_size); } - XXH64_update( - &hash_state, (unsigned char*)obj_buffer.metadata, obj_buffer.metadata_size); + XXH64_update(&hash_state, (unsigned char*)obj_buffer.metadata, + obj_buffer.metadata_size); return XXH64_digest(&hash_state); } -bool plasma_compute_object_hash( - PlasmaClient* conn, ObjectID object_id, unsigned char* digest) { - // Get the plasma object data. We pass in a timeout of 0 to indicate that - // the operation should timeout immediately. - ObjectBuffer object_buffer; - ARROW_CHECK_OK(conn->Get(&object_id, 1, 0, &object_buffer)); - // If the object was not retrieved, return false. - if (object_buffer.data_size == -1) { return false; } - // Compute the hash. - uint64_t hash = compute_object_hash(object_buffer); - memcpy(digest, &hash, sizeof(hash)); - // Release the plasma object. 
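Aside: compute_object_hash above feeds the object's data and metadata through xxHash's streaming interface, splitting the data across threads only when it exceeds 1 MB. The single-threaded core is just three xxHash calls; a minimal sketch follows, with the vectors standing in for a plasma object's data and metadata and the include adjusted for a system xxhash header rather than the bundled copy.

// Minimal sketch of the streaming XXH64 usage in compute_object_hash (single threaded).
#define XXH_STATIC_LINKING_ONLY  // allow stack-allocating XXH64_state_t, as the client does
#include <xxhash.h>
#include <cstdint>
#include <iostream>
#include <vector>

int main() {
  std::vector<unsigned char> data(1024, 0xAB);     // stands in for obj_buffer.data
  std::vector<unsigned char> metadata(16, 0x01);   // stands in for obj_buffer.metadata

  XXH64_state_t hash_state;
  XXH64_reset(&hash_state, 0);                                   // seed 0, as in the client
  XXH64_update(&hash_state, data.data(), data.size());           // hash the payload
  XXH64_update(&hash_state, metadata.data(), metadata.size());   // then the metadata
  uint64_t digest = XXH64_digest(&hash_state);

  std::cout << "digest: " << std::hex << digest << std::endl;
  return 0;
}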
- ARROW_CHECK_OK(conn->Release(object_id)); - return true; -} - Status PlasmaClient::Seal(const ObjectID& object_id) { // Make sure this client has a reference to the object before sending the // request to Plasma. @@ -413,7 +429,7 @@ Status PlasmaClient::Seal(const ObjectID& object_id) { object_entry->second->is_sealed = true; /// Send the seal request to Plasma. static unsigned char digest[kDigestSize]; - ARROW_CHECK(plasma_compute_object_hash(this, object_id, &digest[0])); + RETURN_NOT_OK(Hash(object_id, &digest[0])); RETURN_NOT_OK(SendSealRequest(store_conn_, object_id, &digest[0])); // We call PlasmaClient::Release to decrement the number of instances of this // object @@ -436,7 +452,23 @@ Status PlasmaClient::Evict(int64_t num_bytes, int64_t& num_bytes_evicted) { std::vector buffer; int64_t type; RETURN_NOT_OK(ReadMessage(store_conn_, &type, &buffer)); - return ReadEvictReply(buffer.data(), num_bytes_evicted); + return ReadEvictReply(buffer.data(), buffer.size(), num_bytes_evicted); +} + +Status PlasmaClient::Hash(const ObjectID& object_id, uint8_t* digest) { + // Get the plasma object data. We pass in a timeout of 0 to indicate that + // the operation should timeout immediately. + ObjectBuffer object_buffer; + RETURN_NOT_OK(Get(&object_id, 1, 0, &object_buffer)); + // If the object was not retrieved, return false. + if (object_buffer.data_size == -1) { + return Status::PlasmaObjectNonexistent("Object not found"); + } + // Compute the hash. + uint64_t hash = compute_object_hash(object_buffer); + memcpy(digest, &hash, sizeof(hash)); + // Release the plasma object. + return Release(object_id); } Status PlasmaClient::Subscribe(int* fd) { @@ -459,11 +491,33 @@ Status PlasmaClient::Subscribe(int* fd) { return Status::OK(); } +Status PlasmaClient::GetNotification(int fd, ObjectID* object_id, int64_t* data_size, + int64_t* metadata_size) { + uint8_t* notification = read_message_async(fd); + if (notification == NULL) { + return Status::IOError("Failed to read object notification from Plasma socket"); + } + auto object_info = flatbuffers::GetRoot(notification); + ARROW_CHECK(object_info->object_id()->size() == sizeof(ObjectID)); + memcpy(object_id, object_info->object_id()->data(), sizeof(ObjectID)); + if (object_info->is_deletion()) { + *data_size = -1; + *metadata_size = -1; + } else { + *data_size = object_info->data_size(); + *metadata_size = object_info->metadata_size(); + } + delete[] notification; + return Status::OK(); +} + Status PlasmaClient::Connect(const std::string& store_socket_name, - const std::string& manager_socket_name, int release_delay) { - store_conn_ = connect_ipc_sock_retry(store_socket_name, -1, -1); + const std::string& manager_socket_name, int release_delay, + int num_retries) { + RETURN_NOT_OK(ConnectIpcSocketRetry(store_socket_name, num_retries, -1, &store_conn_)); if (manager_socket_name != "") { - manager_conn_ = connect_ipc_sock_retry(manager_socket_name, -1, -1); + RETURN_NOT_OK( + ConnectIpcSocketRetry(manager_socket_name, num_retries, -1, &manager_conn_)); } else { manager_conn_ = -1; } @@ -473,7 +527,7 @@ Status PlasmaClient::Connect(const std::string& store_socket_name, RETURN_NOT_OK(SendConnectRequest(store_conn_)); std::vector buffer; RETURN_NOT_OK(PlasmaReceive(store_conn_, MessageType_PlasmaConnectReply, &buffer)); - RETURN_NOT_OK(ReadConnectReply(buffer.data(), &store_capacity_)); + RETURN_NOT_OK(ReadConnectReply(buffer.data(), buffer.size(), &store_capacity_)); return Status::OK(); } @@ -485,7 +539,11 @@ Status PlasmaClient::Disconnect() { // Close 
the connections to Plasma. The Plasma store will release the objects // that were in use by us when handling the SIGPIPE. close(store_conn_); - if (manager_conn_ >= 0) { close(manager_conn_); } + store_conn_ = -1; + if (manager_conn_ >= 0) { + close(manager_conn_); + manager_conn_ = -1; + } return Status::OK(); } @@ -500,9 +558,7 @@ Status PlasmaClient::Fetch(int num_object_ids, const ObjectID* object_ids) { return SendFetchRequest(manager_conn_, object_ids, num_object_ids); } -int PlasmaClient::get_manager_fd() { - return manager_conn_; -} +int PlasmaClient::get_manager_fd() { return manager_conn_; } Status PlasmaClient::Info(const ObjectID& object_id, int* object_status) { ARROW_CHECK(manager_conn_ >= 0); @@ -511,13 +567,14 @@ Status PlasmaClient::Info(const ObjectID& object_id, int* object_status) { std::vector buffer; RETURN_NOT_OK(PlasmaReceive(manager_conn_, MessageType_PlasmaStatusReply, &buffer)); ObjectID id; - RETURN_NOT_OK(ReadStatusReply(buffer.data(), &id, object_status, 1)); + RETURN_NOT_OK(ReadStatusReply(buffer.data(), buffer.size(), &id, object_status, 1)); ARROW_CHECK(object_id == id); return Status::OK(); } Status PlasmaClient::Wait(int64_t num_object_requests, ObjectRequest* object_requests, - int num_ready_objects, int64_t timeout_ms, int* num_objects_ready) { + int num_ready_objects, int64_t timeout_ms, + int* num_objects_ready) { ARROW_CHECK(manager_conn_ >= 0); ARROW_CHECK(num_object_requests > 0); ARROW_CHECK(num_ready_objects > 0); @@ -529,10 +586,11 @@ Status PlasmaClient::Wait(int64_t num_object_requests, ObjectRequest* object_req } RETURN_NOT_OK(SendWaitRequest(manager_conn_, object_requests, num_object_requests, - num_ready_objects, timeout_ms)); + num_ready_objects, timeout_ms)); std::vector buffer; RETURN_NOT_OK(PlasmaReceive(manager_conn_, MessageType_PlasmaWaitReply, &buffer)); - RETURN_NOT_OK(ReadWaitReply(buffer.data(), object_requests, &num_ready_objects)); + RETURN_NOT_OK( + ReadWaitReply(buffer.data(), buffer.size(), object_requests, &num_ready_objects)); *num_objects_ready = 0; for (int i = 0; i < num_object_requests; ++i) { @@ -540,7 +598,9 @@ Status PlasmaClient::Wait(int64_t num_object_requests, ObjectRequest* object_req int status = object_requests[i].status; switch (type) { case PLASMA_QUERY_LOCAL: - if (status == ObjectStatus_Local) { *num_objects_ready += 1; } + if (status == ObjectStatus_Local) { + *num_objects_ready += 1; + } break; case PLASMA_QUERY_ANYWHERE: if (status == ObjectStatus_Local || status == ObjectStatus_Remote) { @@ -555,3 +615,5 @@ Status PlasmaClient::Wait(int64_t num_object_requests, ObjectRequest* object_req } return Status::OK(); } + +} // namespace plasma diff --git a/cpp/src/plasma/client.h b/cpp/src/plasma/client.h index fb3a161795d..50ec55f5ec8 100644 --- a/cpp/src/plasma/client.h +++ b/cpp/src/plasma/client.h @@ -22,12 +22,18 @@ #include #include +#include #include +#include -#include "plasma/plasma.h" +#include "arrow/status.h" +#include "arrow/util/visibility.h" +#include "plasma/common.h" using arrow::Status; +namespace plasma { + #define PLASMA_DEFAULT_RELEASE_DELAY 64 // Use 100MB as an overestimate of the L3 cache size. @@ -63,22 +69,16 @@ struct ClientMmapTableEntry { int count; }; -struct ObjectInUseEntry { - /// A count of the number of times this client has called PlasmaClient::Create - /// or - /// PlasmaClient::Get on this object ID minus the number of calls to - /// PlasmaClient::Release. 
- /// When this count reaches zero, we remove the entry from the ObjectsInUse - /// and decrement a count in the relevant ClientMmapTableEntry. - int count; - /// Cached information to read the object. - PlasmaObject object; - /// A flag representing whether the object has been sealed. - bool is_sealed; -}; +struct ObjectInUseEntry; +struct ObjectRequest; +struct PlasmaObject; -class PlasmaClient { +class ARROW_EXPORT PlasmaClient { public: + PlasmaClient(); + + ~PlasmaClient(); + /// Connect to the local plasma store and plasma manager. Return /// the resulting connection. /// @@ -89,9 +89,11 @@ class PlasmaClient { /// function will not connect to a manager. /// @param release_delay Number of released objects that are kept around /// and not evicted to avoid too many munmaps. + /// @param num_retries number of attempts to connect to IPC socket, default 50 /// @return The return status. Status Connect(const std::string& store_socket_name, - const std::string& manager_socket_name, int release_delay); + const std::string& manager_socket_name, int release_delay, + int num_retries = -1); /// Create an object in the Plasma Store. Any metadata for this object must be /// be passed in when the object is created. @@ -108,7 +110,7 @@ class PlasmaClient { /// @param data The address of the newly created object will be written here. /// @return The return status. Status Create(const ObjectID& object_id, int64_t data_size, uint8_t* metadata, - int64_t metadata_size, uint8_t** data); + int64_t metadata_size, uint8_t** data); /// Get some objects from the Plasma Store. This function will block until the /// objects have all been created and sealed in the Plasma Store or the @@ -126,7 +128,7 @@ class PlasmaClient { /// size field is -1, then the object was not retrieved. /// @return The return status. Status Get(const ObjectID* object_ids, int64_t num_objects, int64_t timeout_ms, - ObjectBuffer* object_buffers); + ObjectBuffer* object_buffers); /// Tell Plasma that the client no longer needs the object. This should be /// called @@ -177,10 +179,18 @@ class PlasmaClient { /// @return The return status. Status Evict(int64_t num_bytes, int64_t& num_bytes_evicted); + /// Compute the hash of an object in the object store. + /// + /// @param conn The object containing the connection state. + /// @param object_id The ID of the object we want to hash. + /// @param digest A pointer at which to return the hash digest of the object. + /// The pointer must have at least kDigestSize bytes allocated. + /// @return The return status. + Status Hash(const ObjectID& object_id, uint8_t* digest); + /// Subscribe to notifications when objects are sealed in the object store. /// Whenever an object is sealed, a message will be written to the client - /// socket - /// that is returned by this method. + /// socket that is returned by this method. /// /// @param fd Out parameter for the file descriptor the client should use to /// read notifications @@ -188,6 +198,16 @@ class PlasmaClient { /// @return The return status. Status Subscribe(int* fd); + /// Receive next object notification for this client if Subscribe has been called. + /// + /// @param fd The file descriptor we are reading the notification from. + /// @param object_id Out parameter, the object_id of the object that was sealed. + /// @param data_size Out parameter, the data size of the object that was sealed. + /// @param metadata_size Out parameter, the metadata size of the object that was sealed. + /// @return The return status. 
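Aside: taken together, the reworked client header describes the usual producer/consumer flow: Connect, Create, fill the returned buffer, Seal, then Get and Release. A hedged end-to-end sketch against that API follows. It assumes a store is already listening on the named socket, the socket path and payload are placeholders, the include set is approximate (ObjectBuffer's definition lives elsewhere in the plasma headers), and only calls whose signatures appear in this diff are used.

// Sketch of the PlasmaClient flow declared above; error handling is reduced to
// ARROW_CHECK_OK-style aborts and the object contents are arbitrary.
#include <cstdint>
#include <cstring>
#include "arrow/util/logging.h"
#include "plasma/client.h"
#include "plasma/plasma.h"  // for ObjectBuffer (include set is approximate)

using plasma::ObjectBuffer;
using plasma::ObjectID;
using plasma::PlasmaClient;

int main() {
  PlasmaClient client;
  // Connect to a running store; no manager, default release delay.
  ARROW_CHECK_OK(client.Connect("/tmp/plasma_store_socket", "",
                                PLASMA_DEFAULT_RELEASE_DELAY));

  // Create a 100-byte object, write into the returned pointer, then seal it.
  ObjectID object_id = ObjectID::from_random();
  int64_t data_size = 100;
  uint8_t* data = nullptr;
  ARROW_CHECK_OK(client.Create(object_id, data_size, nullptr, 0, &data));
  std::memset(data, 7, data_size);
  ARROW_CHECK_OK(client.Seal(object_id));

  // Read it back (timeout 0: the object is already sealed locally) and release it.
  ObjectBuffer object_buffer;
  ARROW_CHECK_OK(client.Get(&object_id, 1, 0, &object_buffer));
  ARROW_CHECK_OK(client.Release(object_id));
  ARROW_CHECK_OK(client.Disconnect());
  return 0;
}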
+ Status GetNotification(int fd, ObjectID* object_id, int64_t* data_size, + int64_t* metadata_size); + /// Disconnect from the local plasma instance, including the local store and /// manager. /// @@ -253,7 +273,7 @@ class PlasmaClient { /// min_num_ready_objects this means that timeout expired. /// @return The return status. Status Wait(int64_t num_object_requests, ObjectRequest* object_requests, - int num_ready_objects, int64_t timeout_ms, int* num_objects_ready); + int num_ready_objects, int64_t timeout_ms, int* num_objects_ready); /// Transfer local object to a different plasma manager. /// @@ -297,8 +317,8 @@ class PlasmaClient { uint8_t* lookup_mmapped_file(int store_fd_val); - void increment_object_count( - const ObjectID& object_id, PlasmaObject* object, bool is_sealed); + void increment_object_count(const ObjectID& object_id, PlasmaObject* object, + bool is_sealed); /// File descriptor of the Unix domain socket that connects to the store. int store_conn_; @@ -330,14 +350,6 @@ class PlasmaClient { int64_t store_capacity_; }; -/// Compute the hash of an object in the object store. -/// -/// @param conn The object containing the connection state. -/// @param object_id The ID of the object we want to hash. -/// @param digest A pointer at which to return the hash digest of the object. -/// The pointer must have at least DIGEST_SIZE bytes allocated. -/// @return A boolean representing whether the hash operation succeeded. -bool plasma_compute_object_hash( - PlasmaClient* conn, ObjectID object_id, unsigned char* digest); +} // namespace plasma #endif // PLASMA_CLIENT_H diff --git a/cpp/src/plasma/common.cc b/cpp/src/plasma/common.cc index a09a963fa47..d7a79650785 100644 --- a/cpp/src/plasma/common.cc +++ b/cpp/src/plasma/common.cc @@ -19,7 +19,9 @@ #include -#include "format/plasma_generated.h" +#include "plasma/plasma_generated.h" + +namespace plasma { using arrow::Status; @@ -39,13 +41,9 @@ UniqueID UniqueID::from_binary(const std::string& binary) { return id; } -const uint8_t* UniqueID::data() const { - return id_; -} +const uint8_t* UniqueID::data() const { return id_; } -uint8_t* UniqueID::mutable_data() { - return id_; -} +uint8_t* UniqueID::mutable_data() { return id_; } std::string UniqueID::binary() const { return std::string(reinterpret_cast(id_), kUniqueIDSize); @@ -81,3 +79,8 @@ Status plasma_error_status(int plasma_error) { } return Status::OK(); } + +ARROW_EXPORT int ObjectStatusLocal = ObjectStatus_Local; +ARROW_EXPORT int ObjectStatusRemote = ObjectStatus_Remote; + +} // namespace plasma diff --git a/cpp/src/plasma/common.h b/cpp/src/plasma/common.h index 85dc74bf86e..2b71da67015 100644 --- a/cpp/src/plasma/common.h +++ b/cpp/src/plasma/common.h @@ -29,9 +29,11 @@ #include "arrow/status.h" #include "arrow/util/logging.h" +namespace plasma { + constexpr int64_t kUniqueIDSize = 20; -class UniqueID { +class ARROW_EXPORT UniqueID { public: static UniqueID from_random(); static UniqueID from_binary(const std::string& binary); @@ -60,4 +62,39 @@ typedef UniqueID ObjectID; arrow::Status plasma_error_status(int plasma_error); +/// Size of object hash digests. +constexpr int64_t kDigestSize = sizeof(uint64_t); + +/// Object request data structure. Used for Wait. +struct ObjectRequest { + /// The ID of the requested object. If ID_NIL request any object. + ObjectID object_id; + /// Request associated to the object. It can take one of the following values: + /// - PLASMA_QUERY_LOCAL: return if or when the object is available in the + /// local Plasma Store. 
+ /// - PLASMA_QUERY_ANYWHERE: return if or when the object is available in + /// the system (i.e., either in the local or a remote Plasma Store). + int type; + /// Object status. Same as the status returned by plasma_status() function + /// call. This is filled in by plasma_wait_for_objects1(): + /// - ObjectStatus_Local: object is ready at the local Plasma Store. + /// - ObjectStatus_Remote: object is ready at a remote Plasma Store. + /// - ObjectStatus_Nonexistent: object does not exist in the system. + /// - PLASMA_CLIENT_IN_TRANSFER, if the object is currently being scheduled + /// for being transferred or it is transferring. + int status; +}; + +enum ObjectRequestType { + /// Query for object in the local plasma store. + PLASMA_QUERY_LOCAL = 1, + /// Query for object in the local plasma store or in a remote plasma store. + PLASMA_QUERY_ANYWHERE +}; + +extern int ObjectStatusLocal; +extern int ObjectStatusRemote; + +} // namespace plasma + #endif // PLASMA_COMMON_H diff --git a/cpp/src/plasma/events.cc b/cpp/src/plasma/events.cc index a9f7356e1f6..4e4ecfaaaca 100644 --- a/cpp/src/plasma/events.cc +++ b/cpp/src/plasma/events.cc @@ -19,34 +19,37 @@ #include -void EventLoop::file_event_callback( - aeEventLoop* loop, int fd, void* context, int events) { +namespace plasma { + +void EventLoop::FileEventCallback(aeEventLoop* loop, int fd, void* context, int events) { FileCallback* callback = reinterpret_cast(context); (*callback)(events); } -int EventLoop::timer_event_callback(aeEventLoop* loop, TimerID timer_id, void* context) { +int EventLoop::TimerEventCallback(aeEventLoop* loop, TimerID timer_id, void* context) { TimerCallback* callback = reinterpret_cast(context); return (*callback)(timer_id); } constexpr int kInitialEventLoopSize = 1024; -EventLoop::EventLoop() { - loop_ = aeCreateEventLoop(kInitialEventLoopSize); -} +EventLoop::EventLoop() { loop_ = aeCreateEventLoop(kInitialEventLoopSize); } -bool EventLoop::add_file_event(int fd, int events, const FileCallback& callback) { - if (file_callbacks_.find(fd) != file_callbacks_.end()) { return false; } +bool EventLoop::AddFileEvent(int fd, int events, const FileCallback& callback) { + if (file_callbacks_.find(fd) != file_callbacks_.end()) { + return false; + } auto data = std::unique_ptr(new FileCallback(callback)); void* context = reinterpret_cast(data.get()); // Try to add the file descriptor. - int err = aeCreateFileEvent(loop_, fd, events, EventLoop::file_event_callback, context); + int err = aeCreateFileEvent(loop_, fd, events, EventLoop::FileEventCallback, context); // If it cannot be added, increase the size of the event loop. if (err == AE_ERR && errno == ERANGE) { err = aeResizeSetSize(loop_, 3 * aeGetSetSize(loop_) / 2); - if (err != AE_OK) { return false; } - err = aeCreateFileEvent(loop_, fd, events, EventLoop::file_event_callback, context); + if (err != AE_OK) { + return false; + } + err = aeCreateFileEvent(loop_, fd, events, EventLoop::FileEventCallback, context); } // In any case, test if there were errors. 
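Aside: the ObjectRequest struct and ObjectRequestType enum introduced in common.h above are consumed by PlasmaClient::Wait, whose signature appears earlier in this diff. A hedged sketch of filling one request and inspecting the status the manager writes back follows; the socket paths and the object ID are placeholders, and a manager connection is assumed since Wait requires one.

// Sketch of building an ObjectRequest for PlasmaClient::Wait, per the declarations above.
#include <iostream>
#include "arrow/util/logging.h"
#include "plasma/client.h"

using plasma::ObjectID;
using plasma::ObjectRequest;
using plasma::PlasmaClient;

int main() {
  PlasmaClient client;
  // Wait requires a manager connection in addition to the store connection.
  ARROW_CHECK_OK(client.Connect("/tmp/plasma_store_socket", "/tmp/plasma_manager_socket",
                                PLASMA_DEFAULT_RELEASE_DELAY));

  // One request: report when this object becomes available in the local store.
  ObjectRequest request;
  request.object_id = ObjectID::from_random();  // placeholder ID
  request.type = plasma::PLASMA_QUERY_LOCAL;

  int num_objects_ready = 0;
  ARROW_CHECK_OK(client.Wait(1, &request, 1, /*timeout_ms=*/1000, &num_objects_ready));

  // After Wait returns, each request's status field has been filled in.
  if (request.status == plasma::ObjectStatusLocal) {
    std::cout << "object is ready in the local store" << std::endl;
  }
  ARROW_CHECK_OK(client.Disconnect());
  return 0;
}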
if (err == AE_OK) { @@ -56,26 +59,31 @@ bool EventLoop::add_file_event(int fd, int events, const FileCallback& callback) return false; } -void EventLoop::remove_file_event(int fd) { +void EventLoop::RemoveFileEvent(int fd) { aeDeleteFileEvent(loop_, fd, AE_READABLE | AE_WRITABLE); file_callbacks_.erase(fd); } -void EventLoop::run() { - aeMain(loop_); +void EventLoop::Start() { aeMain(loop_); } + +void EventLoop::Stop() { + aeStop(loop_); + aeDeleteEventLoop(loop_); } -int64_t EventLoop::add_timer(int64_t timeout, const TimerCallback& callback) { +int64_t EventLoop::AddTimer(int64_t timeout, const TimerCallback& callback) { auto data = std::unique_ptr(new TimerCallback(callback)); void* context = reinterpret_cast(data.get()); int64_t timer_id = - aeCreateTimeEvent(loop_, timeout, EventLoop::timer_event_callback, context, NULL); + aeCreateTimeEvent(loop_, timeout, EventLoop::TimerEventCallback, context, NULL); timer_callbacks_.emplace(timer_id, std::move(data)); return timer_id; } -int EventLoop::remove_timer(int64_t timer_id) { +int EventLoop::RemoveTimer(int64_t timer_id) { int err = aeDeleteTimeEvent(loop_, timer_id); timer_callbacks_.erase(timer_id); return err; } + +} // namespace plasma diff --git a/cpp/src/plasma/events.h b/cpp/src/plasma/events.h index bd93d6bb2a6..42914848645 100644 --- a/cpp/src/plasma/events.h +++ b/cpp/src/plasma/events.h @@ -26,6 +26,8 @@ extern "C" { #include "ae/ae.h" } +namespace plasma { + /// Constant specifying that the timer is done and it will be removed. constexpr int kEventLoopTimerDone = AE_NOMORE; @@ -59,13 +61,13 @@ class EventLoop { /// @param events The flags for events we are listening to (read or write). /// @param callback The callback that will be called when the event happens. /// @return Returns true if the event handler was added successfully. - bool add_file_event(int fd, int events, const FileCallback& callback); + bool AddFileEvent(int fd, int events, const FileCallback& callback); /// Remove a file event handler from the event loop. /// /// @param fd The file descriptor of the event handler. /// @return Void. - void remove_file_event(int fd); + void RemoveFileEvent(int fd); /// Register a handler that will be called after a time slice of /// "timeout" milliseconds. @@ -73,27 +75,32 @@ class EventLoop { /// @param timeout The timeout in milliseconds. /// @param callback The callback for the timeout. /// @return The ID of the newly created timer. - int64_t add_timer(int64_t timeout, const TimerCallback& callback); + int64_t AddTimer(int64_t timeout, const TimerCallback& callback); /// Remove a timer handler from the event loop. /// /// @param timer_id The ID of the timer that is to be removed. /// @return The ae.c error code. TODO(pcm): needs to be standardized - int remove_timer(int64_t timer_id); + int RemoveTimer(int64_t timer_id); - /// Run the event loop. + /// \brief Run the event loop. /// /// @return Void. 
- void run(); + void Start(); + + /// \brief Stop the event loop + void Stop(); private: - static void file_event_callback(aeEventLoop* loop, int fd, void* context, int events); + static void FileEventCallback(aeEventLoop* loop, int fd, void* context, int events); - static int timer_event_callback(aeEventLoop* loop, TimerID timer_id, void* context); + static int TimerEventCallback(aeEventLoop* loop, TimerID timer_id, void* context); aeEventLoop* loop_; std::unordered_map> file_callbacks_; std::unordered_map> timer_callbacks_; }; +} // namespace plasma + #endif // PLASMA_EVENTS diff --git a/cpp/src/plasma/eviction_policy.cc b/cpp/src/plasma/eviction_policy.cc index 4ae6384d425..6c2309f1709 100644 --- a/cpp/src/plasma/eviction_policy.cc +++ b/cpp/src/plasma/eviction_policy.cc @@ -19,6 +19,8 @@ #include +namespace plasma { + void LRUCache::add(const ObjectID& key, int64_t size) { auto it = item_map_.find(key); ARROW_CHECK(it == item_map_.end()); @@ -34,8 +36,8 @@ void LRUCache::remove(const ObjectID& key) { item_map_.erase(it); } -int64_t LRUCache::choose_objects_to_evict( - int64_t num_bytes_required, std::vector* objects_to_evict) { +int64_t LRUCache::choose_objects_to_evict(int64_t num_bytes_required, + std::vector* objects_to_evict) { int64_t bytes_evicted = 0; auto it = item_list_.end(); while (bytes_evicted < num_bytes_required && it != item_list_.begin()) { @@ -49,8 +51,8 @@ int64_t LRUCache::choose_objects_to_evict( EvictionPolicy::EvictionPolicy(PlasmaStoreInfo* store_info) : memory_used_(0), store_info_(store_info) {} -int64_t EvictionPolicy::choose_objects_to_evict( - int64_t num_bytes_required, std::vector* objects_to_evict) { +int64_t EvictionPolicy::choose_objects_to_evict(int64_t num_bytes_required, + std::vector* objects_to_evict) { int64_t bytes_evicted = cache_.choose_objects_to_evict(num_bytes_required, objects_to_evict); /* Update the LRU cache. */ @@ -67,8 +69,8 @@ void EvictionPolicy::object_created(const ObjectID& object_id) { cache_.add(object_id, entry->info.data_size + entry->info.metadata_size); } -bool EvictionPolicy::require_space( - int64_t size, std::vector* objects_to_evict) { +bool EvictionPolicy::require_space(int64_t size, + std::vector* objects_to_evict) { /* Check if there is enough space to create the object. */ int64_t required_space = memory_used_ + size - store_info_->memory_capacity; int64_t num_bytes_evicted; @@ -93,15 +95,17 @@ bool EvictionPolicy::require_space( return num_bytes_evicted >= required_space; } -void EvictionPolicy::begin_object_access( - const ObjectID& object_id, std::vector* objects_to_evict) { +void EvictionPolicy::begin_object_access(const ObjectID& object_id, + std::vector* objects_to_evict) { /* If the object is in the LRU cache, remove it. 
*/ cache_.remove(object_id); } -void EvictionPolicy::end_object_access( - const ObjectID& object_id, std::vector* objects_to_evict) { +void EvictionPolicy::end_object_access(const ObjectID& object_id, + std::vector* objects_to_evict) { auto entry = store_info_->objects[object_id].get(); /* Add the object to the LRU cache.*/ cache_.add(object_id, entry->info.data_size + entry->info.metadata_size); } + +} // namespace plasma diff --git a/cpp/src/plasma/eviction_policy.h b/cpp/src/plasma/eviction_policy.h index 3815fc6652f..dd1c873466e 100644 --- a/cpp/src/plasma/eviction_policy.h +++ b/cpp/src/plasma/eviction_policy.h @@ -26,6 +26,8 @@ #include "plasma/common.h" #include "plasma/plasma.h" +namespace plasma { + // ==== The eviction policy ==== // // This file contains declaration for all functions and data structures that @@ -40,8 +42,8 @@ class LRUCache { void remove(const ObjectID& key); - int64_t choose_objects_to_evict( - int64_t num_bytes_required, std::vector* objects_to_evict); + int64_t choose_objects_to_evict(int64_t num_bytes_required, + std::vector* objects_to_evict); private: /// A doubly-linked list containing the items in the cache and @@ -93,8 +95,8 @@ class EvictionPolicy { /// @param objects_to_evict The object IDs that were chosen for eviction will /// be stored into this vector. /// @return Void. - void begin_object_access( - const ObjectID& object_id, std::vector* objects_to_evict); + void begin_object_access(const ObjectID& object_id, + std::vector* objects_to_evict); /// This method will be called whenever an object in the Plasma store that was /// being used is no longer being used. When this method is called, the @@ -105,8 +107,8 @@ class EvictionPolicy { /// @param objects_to_evict The object IDs that were chosen for eviction will /// be stored into this vector. /// @return Void. - void end_object_access( - const ObjectID& object_id, std::vector* objects_to_evict); + void end_object_access(const ObjectID& object_id, + std::vector* objects_to_evict); /// Choose some objects to evict from the Plasma store. When this method is /// called, the eviction policy will assume that the objects chosen to be @@ -119,8 +121,8 @@ class EvictionPolicy { /// @param objects_to_evict The object IDs that were chosen for eviction will /// be stored into this vector. /// @return The total number of bytes of space chosen to be evicted. - int64_t choose_objects_to_evict( - int64_t num_bytes_required, std::vector* objects_to_evict); + int64_t choose_objects_to_evict(int64_t num_bytes_required, + std::vector* objects_to_evict); private: /// The amount of memory (in bytes) currently being used. @@ -131,4 +133,6 @@ class EvictionPolicy { LRUCache cache_; }; +} // namespace plasma + #endif // PLASMA_EVICTION_POLICY_H diff --git a/cpp/src/plasma/extension.cc b/cpp/src/plasma/extension.cc deleted file mode 100644 index 5d61e337c10..00000000000 --- a/cpp/src/plasma/extension.cc +++ /dev/null @@ -1,456 +0,0 @@ -// Licensed to the Apache Software Foundation (ASF) under one -// or more contributor license agreements. See the NOTICE file -// distributed with this work for additional information -// regarding copyright ownership. The ASF licenses this file -// to you under the Apache License, Version 2.0 (the -// "License"); you may not use this file except in compliance -// with the License. 
You may obtain a copy of the License at -// -// http://www.apache.org/licenses/LICENSE-2.0 -// -// Unless required by applicable law or agreed to in writing, -// software distributed under the License is distributed on an -// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY -// KIND, either express or implied. See the License for the -// specific language governing permissions and limitations -// under the License. - -#include "plasma/extension.h" - -#include -#include - -#include "plasma/client.h" -#include "plasma/common.h" -#include "plasma/io.h" -#include "plasma/protocol.h" - -PyObject* PlasmaOutOfMemoryError; -PyObject* PlasmaObjectExistsError; - -PyObject* PyPlasma_connect(PyObject* self, PyObject* args) { - const char* store_socket_name; - const char* manager_socket_name; - int release_delay; - if (!PyArg_ParseTuple( - args, "ssi", &store_socket_name, &manager_socket_name, &release_delay)) { - return NULL; - } - PlasmaClient* client = new PlasmaClient(); - ARROW_CHECK_OK(client->Connect(store_socket_name, manager_socket_name, release_delay)); - - return PyCapsule_New(client, "plasma", NULL); -} - -PyObject* PyPlasma_disconnect(PyObject* self, PyObject* args) { - PyObject* client_capsule; - if (!PyArg_ParseTuple(args, "O", &client_capsule)) { return NULL; } - PlasmaClient* client; - ARROW_CHECK(PyObjectToPlasmaClient(client_capsule, &client)); - ARROW_CHECK_OK(client->Disconnect()); - /* We use the context of the connection capsule to indicate if the connection - * is still active (if the context is NULL) or if it is closed (if the context - * is (void*) 0x1). This is neccessary because the primary pointer of the - * capsule cannot be NULL. */ - PyCapsule_SetContext(client_capsule, reinterpret_cast(0x1)); - Py_RETURN_NONE; -} - -PyObject* PyPlasma_create(PyObject* self, PyObject* args) { - PlasmaClient* client; - ObjectID object_id; - Py_ssize_t size; - PyObject* metadata; - if (!PyArg_ParseTuple(args, "O&O&nO", PyObjectToPlasmaClient, &client, - PyStringToUniqueID, &object_id, &size, &metadata)) { - return NULL; - } - if (!PyByteArray_Check(metadata)) { - PyErr_SetString(PyExc_TypeError, "metadata must be a bytearray"); - return NULL; - } - uint8_t* data; - Status s = client->Create(object_id, size, - reinterpret_cast(PyByteArray_AsString(metadata)), - PyByteArray_Size(metadata), &data); - if (s.IsPlasmaObjectExists()) { - PyErr_SetString(PlasmaObjectExistsError, - "An object with this ID already exists in the plasma " - "store."); - return NULL; - } - if (s.IsPlasmaStoreFull()) { - PyErr_SetString(PlasmaOutOfMemoryError, - "The plasma store ran out of memory and could not create " - "this object."); - return NULL; - } - ARROW_CHECK(s.ok()); - -#if PY_MAJOR_VERSION >= 3 - return PyMemoryView_FromMemory(reinterpret_cast(data), size, PyBUF_WRITE); -#else - return PyBuffer_FromReadWriteMemory(reinterpret_cast(data), size); -#endif -} - -PyObject* PyPlasma_hash(PyObject* self, PyObject* args) { - PlasmaClient* client; - ObjectID object_id; - if (!PyArg_ParseTuple(args, "O&O&", PyObjectToPlasmaClient, &client, PyStringToUniqueID, - &object_id)) { - return NULL; - } - unsigned char digest[kDigestSize]; - bool success = plasma_compute_object_hash(client, object_id, digest); - if (success) { - PyObject* digest_string = - PyBytes_FromStringAndSize(reinterpret_cast(digest), kDigestSize); - return digest_string; - } else { - Py_RETURN_NONE; - } -} - -PyObject* PyPlasma_seal(PyObject* self, PyObject* args) { - PlasmaClient* client; - ObjectID object_id; - if (!PyArg_ParseTuple(args, 
"O&O&", PyObjectToPlasmaClient, &client, PyStringToUniqueID, - &object_id)) { - return NULL; - } - ARROW_CHECK_OK(client->Seal(object_id)); - Py_RETURN_NONE; -} - -PyObject* PyPlasma_release(PyObject* self, PyObject* args) { - PlasmaClient* client; - ObjectID object_id; - if (!PyArg_ParseTuple(args, "O&O&", PyObjectToPlasmaClient, &client, PyStringToUniqueID, - &object_id)) { - return NULL; - } - ARROW_CHECK_OK(client->Release(object_id)); - Py_RETURN_NONE; -} - -PyObject* PyPlasma_get(PyObject* self, PyObject* args) { - PlasmaClient* client; - PyObject* object_id_list; - Py_ssize_t timeout_ms; - if (!PyArg_ParseTuple( - args, "O&On", PyObjectToPlasmaClient, &client, &object_id_list, &timeout_ms)) { - return NULL; - } - - Py_ssize_t num_object_ids = PyList_Size(object_id_list); - std::vector object_ids(num_object_ids); - std::vector object_buffers(num_object_ids); - - for (int i = 0; i < num_object_ids; ++i) { - PyStringToUniqueID(PyList_GetItem(object_id_list, i), &object_ids[i]); - } - - Py_BEGIN_ALLOW_THREADS; - ARROW_CHECK_OK( - client->Get(object_ids.data(), num_object_ids, timeout_ms, object_buffers.data())); - Py_END_ALLOW_THREADS; - - PyObject* returns = PyList_New(num_object_ids); - for (int i = 0; i < num_object_ids; ++i) { - if (object_buffers[i].data_size != -1) { - /* The object was retrieved, so return the object. */ - PyObject* t = PyTuple_New(2); - Py_ssize_t data_size = static_cast(object_buffers[i].data_size); - Py_ssize_t metadata_size = static_cast(object_buffers[i].metadata_size); -#if PY_MAJOR_VERSION >= 3 - char* data = reinterpret_cast(object_buffers[i].data); - char* metadata = reinterpret_cast(object_buffers[i].metadata); - PyTuple_SET_ITEM(t, 0, PyMemoryView_FromMemory(data, data_size, PyBUF_READ)); - PyTuple_SET_ITEM( - t, 1, PyMemoryView_FromMemory(metadata, metadata_size, PyBUF_READ)); -#else - void* data = reinterpret_cast(object_buffers[i].data); - void* metadata = reinterpret_cast(object_buffers[i].metadata); - PyTuple_SET_ITEM(t, 0, PyBuffer_FromMemory(data, data_size)); - PyTuple_SET_ITEM(t, 1, PyBuffer_FromMemory(metadata, metadata_size)); -#endif - ARROW_CHECK(PyList_SetItem(returns, i, t) == 0); - } else { - /* The object was not retrieved, so just add None to the list of return - * values. 
*/ - Py_INCREF(Py_None); - ARROW_CHECK(PyList_SetItem(returns, i, Py_None) == 0); - } - } - return returns; -} - -PyObject* PyPlasma_contains(PyObject* self, PyObject* args) { - PlasmaClient* client; - ObjectID object_id; - if (!PyArg_ParseTuple(args, "O&O&", PyObjectToPlasmaClient, &client, PyStringToUniqueID, - &object_id)) { - return NULL; - } - bool has_object; - ARROW_CHECK_OK(client->Contains(object_id, &has_object)); - - if (has_object) { - Py_RETURN_TRUE; - } else { - Py_RETURN_FALSE; - } -} - -PyObject* PyPlasma_fetch(PyObject* self, PyObject* args) { - PlasmaClient* client; - PyObject* object_id_list; - if (!PyArg_ParseTuple(args, "O&O", PyObjectToPlasmaClient, &client, &object_id_list)) { - return NULL; - } - if (client->get_manager_fd() == -1) { - PyErr_SetString(PyExc_RuntimeError, "Not connected to the plasma manager"); - return NULL; - } - Py_ssize_t n = PyList_Size(object_id_list); - ObjectID* object_ids = new ObjectID[n]; - for (int i = 0; i < n; ++i) { - PyStringToUniqueID(PyList_GetItem(object_id_list, i), &object_ids[i]); - } - ARROW_CHECK_OK(client->Fetch(static_cast(n), object_ids)); - delete[] object_ids; - Py_RETURN_NONE; -} - -PyObject* PyPlasma_wait(PyObject* self, PyObject* args) { - PlasmaClient* client; - PyObject* object_id_list; - Py_ssize_t timeout; - int num_returns; - if (!PyArg_ParseTuple(args, "O&Oni", PyObjectToPlasmaClient, &client, &object_id_list, - &timeout, &num_returns)) { - return NULL; - } - Py_ssize_t n = PyList_Size(object_id_list); - - if (client->get_manager_fd() == -1) { - PyErr_SetString(PyExc_RuntimeError, "Not connected to the plasma manager"); - return NULL; - } - if (num_returns < 0) { - PyErr_SetString( - PyExc_RuntimeError, "The argument num_returns cannot be less than zero."); - return NULL; - } - if (num_returns > n) { - PyErr_SetString(PyExc_RuntimeError, - "The argument num_returns cannot be greater than len(object_ids)"); - return NULL; - } - int64_t threshold = 1 << 30; - if (timeout > threshold) { - PyErr_SetString( - PyExc_RuntimeError, "The argument timeout cannot be greater than 2 ** 30."); - return NULL; - } - - std::vector object_requests(n); - for (int i = 0; i < n; ++i) { - ARROW_CHECK(PyStringToUniqueID(PyList_GetItem(object_id_list, i), - &object_requests[i].object_id) == 1); - object_requests[i].type = PLASMA_QUERY_ANYWHERE; - } - /* Drop the global interpreter lock while we are waiting, so other threads can - * run. */ - int num_return_objects; - Py_BEGIN_ALLOW_THREADS; - ARROW_CHECK_OK( - client->Wait(n, object_requests.data(), num_returns, timeout, &num_return_objects)); - Py_END_ALLOW_THREADS; - - int num_to_return = std::min(num_return_objects, num_returns); - PyObject* ready_ids = PyList_New(num_to_return); - PyObject* waiting_ids = PySet_New(object_id_list); - int num_returned = 0; - for (int i = 0; i < n; ++i) { - if (num_returned == num_to_return) { break; } - if (object_requests[i].status == ObjectStatus_Local || - object_requests[i].status == ObjectStatus_Remote) { - PyObject* ready = PyBytes_FromStringAndSize( - reinterpret_cast(&object_requests[i].object_id), - sizeof(object_requests[i].object_id)); - PyList_SetItem(ready_ids, num_returned, ready); - PySet_Discard(waiting_ids, ready); - num_returned += 1; - } else { - ARROW_CHECK(object_requests[i].status == ObjectStatus_Nonexistent); - } - } - ARROW_CHECK(num_returned == num_to_return); - /* Return both the ready IDs and the remaining IDs. 
*/ - PyObject* t = PyTuple_New(2); - PyTuple_SetItem(t, 0, ready_ids); - PyTuple_SetItem(t, 1, waiting_ids); - return t; -} - -PyObject* PyPlasma_evict(PyObject* self, PyObject* args) { - PlasmaClient* client; - Py_ssize_t num_bytes; - if (!PyArg_ParseTuple(args, "O&n", PyObjectToPlasmaClient, &client, &num_bytes)) { - return NULL; - } - int64_t evicted_bytes; - ARROW_CHECK_OK(client->Evict(static_cast(num_bytes), evicted_bytes)); - return PyLong_FromSsize_t(static_cast(evicted_bytes)); -} - -PyObject* PyPlasma_delete(PyObject* self, PyObject* args) { - PlasmaClient* client; - ObjectID object_id; - if (!PyArg_ParseTuple(args, "O&O&", PyObjectToPlasmaClient, &client, PyStringToUniqueID, - &object_id)) { - return NULL; - } - ARROW_CHECK_OK(client->Delete(object_id)); - Py_RETURN_NONE; -} - -PyObject* PyPlasma_transfer(PyObject* self, PyObject* args) { - PlasmaClient* client; - ObjectID object_id; - const char* addr; - int port; - if (!PyArg_ParseTuple(args, "O&O&si", PyObjectToPlasmaClient, &client, - PyStringToUniqueID, &object_id, &addr, &port)) { - return NULL; - } - - if (client->get_manager_fd() == -1) { - PyErr_SetString(PyExc_RuntimeError, "Not connected to the plasma manager"); - return NULL; - } - - ARROW_CHECK_OK(client->Transfer(addr, port, object_id)); - Py_RETURN_NONE; -} - -PyObject* PyPlasma_subscribe(PyObject* self, PyObject* args) { - PlasmaClient* client; - if (!PyArg_ParseTuple(args, "O&", PyObjectToPlasmaClient, &client)) { return NULL; } - - int sock; - ARROW_CHECK_OK(client->Subscribe(&sock)); - return PyLong_FromLong(sock); -} - -PyObject* PyPlasma_receive_notification(PyObject* self, PyObject* args) { - int plasma_sock; - - if (!PyArg_ParseTuple(args, "i", &plasma_sock)) { return NULL; } - /* Receive object notification from the plasma connection socket. If the - * object was added, return a tuple of its fields: ObjectID, data_size, - * metadata_size. If the object was deleted, data_size and metadata_size will - * be set to -1. */ - uint8_t* notification = read_message_async(plasma_sock); - if (notification == NULL) { - PyErr_SetString( - PyExc_RuntimeError, "Failed to read object notification from Plasma socket"); - return NULL; - } - auto object_info = flatbuffers::GetRoot(notification); - /* Construct a tuple from object_info and return. 
*/ - PyObject* t = PyTuple_New(3); - PyTuple_SetItem(t, 0, PyBytes_FromStringAndSize(object_info->object_id()->data(), - object_info->object_id()->size())); - if (object_info->is_deletion()) { - PyTuple_SetItem(t, 1, PyLong_FromLong(-1)); - PyTuple_SetItem(t, 2, PyLong_FromLong(-1)); - } else { - PyTuple_SetItem(t, 1, PyLong_FromLong(object_info->data_size())); - PyTuple_SetItem(t, 2, PyLong_FromLong(object_info->metadata_size())); - } - - delete[] notification; - return t; -} - -static PyMethodDef plasma_methods[] = { - {"connect", PyPlasma_connect, METH_VARARGS, "Connect to plasma."}, - {"disconnect", PyPlasma_disconnect, METH_VARARGS, "Disconnect from plasma."}, - {"create", PyPlasma_create, METH_VARARGS, "Create a new plasma object."}, - {"hash", PyPlasma_hash, METH_VARARGS, "Compute the hash of a plasma object."}, - {"seal", PyPlasma_seal, METH_VARARGS, "Seal a plasma object."}, - {"get", PyPlasma_get, METH_VARARGS, "Get a plasma object."}, - {"contains", PyPlasma_contains, METH_VARARGS, - "Does the plasma store contain this plasma object?"}, - {"fetch", PyPlasma_fetch, METH_VARARGS, - "Fetch the object from another plasma manager instance."}, - {"wait", PyPlasma_wait, METH_VARARGS, - "Wait until num_returns objects in object_ids are ready."}, - {"evict", PyPlasma_evict, METH_VARARGS, - "Evict some objects until we recover some number of bytes."}, - {"release", PyPlasma_release, METH_VARARGS, "Release the plasma object."}, - {"delete", PyPlasma_delete, METH_VARARGS, "Delete a plasma object."}, - {"transfer", PyPlasma_transfer, METH_VARARGS, - "Transfer object to another plasma manager."}, - {"subscribe", PyPlasma_subscribe, METH_VARARGS, - "Subscribe to the plasma notification socket."}, - {"receive_notification", PyPlasma_receive_notification, METH_VARARGS, - "Receive next notification from plasma notification socket."}, - {NULL} /* Sentinel */ -}; - -#if PY_MAJOR_VERSION >= 3 -static struct PyModuleDef moduledef = { - PyModuleDef_HEAD_INIT, "libplasma", /* m_name */ - "A Python client library for plasma.", /* m_doc */ - 0, /* m_size */ - plasma_methods, /* m_methods */ - NULL, /* m_reload */ - NULL, /* m_traverse */ - NULL, /* m_clear */ - NULL, /* m_free */ -}; -#endif - -#if PY_MAJOR_VERSION >= 3 -#define INITERROR return NULL -#else -#define INITERROR return -#endif - -#ifndef PyMODINIT_FUNC /* declarations for DLL import/export */ -#define PyMODINIT_FUNC void -#endif - -#if PY_MAJOR_VERSION >= 3 -#define MOD_INIT(name) PyMODINIT_FUNC PyInit_##name(void) -#else -#define MOD_INIT(name) PyMODINIT_FUNC init##name(void) -#endif - -MOD_INIT(libplasma) { -#if PY_MAJOR_VERSION >= 3 - PyObject* m = PyModule_Create(&moduledef); -#else - PyObject* m = - Py_InitModule3("libplasma", plasma_methods, "A Python client library for plasma."); -#endif - - /* Create a custom exception for when an object ID is reused. */ - char plasma_object_exists_error[] = "plasma_object_exists.error"; - PlasmaObjectExistsError = PyErr_NewException(plasma_object_exists_error, NULL, NULL); - Py_INCREF(PlasmaObjectExistsError); - PyModule_AddObject(m, "plasma_object_exists_error", PlasmaObjectExistsError); - /* Create a custom exception for when the plasma store is out of memory. 
*/ - char plasma_out_of_memory_error[] = "plasma_out_of_memory.error"; - PlasmaOutOfMemoryError = PyErr_NewException(plasma_out_of_memory_error, NULL, NULL); - Py_INCREF(PlasmaOutOfMemoryError); - PyModule_AddObject(m, "plasma_out_of_memory_error", PlasmaOutOfMemoryError); - -#if PY_MAJOR_VERSION >= 3 - return m; -#endif -} diff --git a/cpp/src/plasma/extension.h b/cpp/src/plasma/extension.h deleted file mode 100644 index cee30abb359..00000000000 --- a/cpp/src/plasma/extension.h +++ /dev/null @@ -1,50 +0,0 @@ -// Licensed to the Apache Software Foundation (ASF) under one -// or more contributor license agreements. See the NOTICE file -// distributed with this work for additional information -// regarding copyright ownership. The ASF licenses this file -// to you under the Apache License, Version 2.0 (the -// "License"); you may not use this file except in compliance -// with the License. You may obtain a copy of the License at -// -// http://www.apache.org/licenses/LICENSE-2.0 -// -// Unless required by applicable law or agreed to in writing, -// software distributed under the License is distributed on an -// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY -// KIND, either express or implied. See the License for the -// specific language governing permissions and limitations -// under the License. - -#ifndef PLASMA_EXTENSION_H -#define PLASMA_EXTENSION_H - -#undef _XOPEN_SOURCE -#undef _POSIX_C_SOURCE -#include - -#include "bytesobject.h" // NOLINT - -#include "plasma/client.h" -#include "plasma/common.h" - -static int PyObjectToPlasmaClient(PyObject* object, PlasmaClient** client) { - if (PyCapsule_IsValid(object, "plasma")) { - *client = reinterpret_cast(PyCapsule_GetPointer(object, "plasma")); - return 1; - } else { - PyErr_SetString(PyExc_TypeError, "must be a 'plasma' capsule"); - return 0; - } -} - -int PyStringToUniqueID(PyObject* object, ObjectID* object_id) { - if (PyBytes_Check(object)) { - memcpy(object_id, PyBytes_AsString(object), sizeof(ObjectID)); - return 1; - } else { - PyErr_SetString(PyExc_TypeError, "must be a 20 character string"); - return 0; - } -} - -#endif // PLASMA_EXTENSION_H diff --git a/cpp/src/plasma/io.cc b/cpp/src/plasma/io.cc index 5875ebb7ae6..9bb43399082 100644 --- a/cpp/src/plasma/io.cc +++ b/cpp/src/plasma/io.cc @@ -17,6 +17,11 @@ #include "plasma/io.h" +#include +#include + +#include "arrow/status.h" + #include "plasma/common.h" using arrow::Status; @@ -29,6 +34,8 @@ using arrow::Status; #define NUM_CONNECT_ATTEMPTS 50 #define CONNECT_TIMEOUT_MS 100 +namespace plasma { + Status WriteBytes(int fd, uint8_t* cursor, size_t length) { ssize_t nbytes = 0; size_t bytesleft = length; @@ -38,7 +45,9 @@ Status WriteBytes(int fd, uint8_t* cursor, size_t length) { * advance the cursor, and decrease the amount left to write. 
*/ nbytes = write(fd, cursor + offset, bytesleft); if (nbytes < 0) { - if (errno == EAGAIN || errno == EWOULDBLOCK || errno == EINTR) { continue; } + if (errno == EAGAIN || errno == EWOULDBLOCK || errno == EINTR) { + continue; + } return Status::IOError(std::string(strerror(errno))); } else if (nbytes == 0) { return Status::IOError("Encountered unexpected EOF"); @@ -67,7 +76,9 @@ Status ReadBytes(int fd, uint8_t* cursor, size_t length) { while (bytesleft > 0) { nbytes = read(fd, cursor + offset, bytesleft); if (nbytes < 0) { - if (errno == EAGAIN || errno == EWOULDBLOCK || errno == EINTR) { continue; } + if (errno == EAGAIN || errno == EWOULDBLOCK || errno == EINTR) { + continue; + } return Status::IOError(std::string(strerror(errno))); } else if (0 == nbytes) { return Status::IOError("Encountered unexpected EOF"); @@ -83,14 +94,16 @@ Status ReadBytes(int fd, uint8_t* cursor, size_t length) { Status ReadMessage(int fd, int64_t* type, std::vector* buffer) { int64_t version; RETURN_NOT_OK_ELSE(ReadBytes(fd, reinterpret_cast(&version), sizeof(version)), - *type = DISCONNECT_CLIENT); + *type = DISCONNECT_CLIENT); ARROW_CHECK(version == PLASMA_PROTOCOL_VERSION) << "version = " << version; size_t length; RETURN_NOT_OK_ELSE(ReadBytes(fd, reinterpret_cast(type), sizeof(*type)), - *type = DISCONNECT_CLIENT); + *type = DISCONNECT_CLIENT); RETURN_NOT_OK_ELSE(ReadBytes(fd, reinterpret_cast(&length), sizeof(length)), - *type = DISCONNECT_CLIENT); - if (length > buffer->size()) { buffer->resize(length); } + *type = DISCONNECT_CLIENT); + if (length > buffer->size()) { + buffer->resize(length); + } RETURN_NOT_OK_ELSE(ReadBytes(fd, buffer->data(), length), *type = DISCONNECT_CLIENT); return Status::OK(); } @@ -105,7 +118,7 @@ int bind_ipc_sock(const std::string& pathname, bool shall_listen) { /* Tell the system to allow the port to be reused. */ int on = 1; if (setsockopt(socket_fd, SOL_SOCKET, SO_REUSEADDR, reinterpret_cast(&on), - sizeof(on)) < 0) { + sizeof(on)) < 0) { ARROW_LOG(ERROR) << "setsockopt failed for pathname " << pathname; close(socket_fd); return -1; @@ -134,25 +147,36 @@ int bind_ipc_sock(const std::string& pathname, bool shall_listen) { return socket_fd; } -int connect_ipc_sock_retry( - const std::string& pathname, int num_retries, int64_t timeout) { +Status ConnectIpcSocketRetry(const std::string& pathname, int num_retries, + int64_t timeout, int* fd) { /* Pick the default values if the user did not specify. */ - if (num_retries < 0) { num_retries = NUM_CONNECT_ATTEMPTS; } - if (timeout < 0) { timeout = CONNECT_TIMEOUT_MS; } + if (num_retries < 0) { + num_retries = NUM_CONNECT_ATTEMPTS; + } + if (timeout < 0) { + timeout = CONNECT_TIMEOUT_MS; + } - int fd = -1; + *fd = -1; for (int num_attempts = 0; num_attempts < num_retries; ++num_attempts) { - fd = connect_ipc_sock(pathname); - if (fd >= 0) { break; } + *fd = connect_ipc_sock(pathname); + if (*fd >= 0) { + break; + } if (num_attempts == 0) { - ARROW_LOG(ERROR) << "Connection to socket failed for pathname " << pathname; + ARROW_LOG(ERROR) << "Connection to IPC socket failed for pathname " << pathname + << ", retrying " << num_retries << " times"; } /* Sleep for timeout milliseconds. */ usleep(static_cast(timeout * 1000)); } /* If we could not connect to the socket, exit. 
*/ - if (fd == -1) { ARROW_LOG(FATAL) << "Could not connect to socket " << pathname; } - return fd; + if (*fd == -1) { + std::stringstream ss; + ss << "Could not connect to socket " << pathname; + return Status::IOError(ss.str()); + } + return Status::OK(); } int connect_ipc_sock(const std::string& pathname) { @@ -210,3 +234,5 @@ uint8_t* read_message_async(int sock) { } return message; } + +} // namespace plasma diff --git a/cpp/src/plasma/io.h b/cpp/src/plasma/io.h index 43c3fb53549..ef96c06ccea 100644 --- a/cpp/src/plasma/io.h +++ b/cpp/src/plasma/io.h @@ -34,22 +34,29 @@ #define PLASMA_PROTOCOL_VERSION 0x0000000000000000 #define DISCONNECT_CLIENT 0 -arrow::Status WriteBytes(int fd, uint8_t* cursor, size_t length); +namespace plasma { -arrow::Status WriteMessage(int fd, int64_t type, int64_t length, uint8_t* bytes); +using arrow::Status; -arrow::Status ReadBytes(int fd, uint8_t* cursor, size_t length); +Status WriteBytes(int fd, uint8_t* cursor, size_t length); -arrow::Status ReadMessage(int fd, int64_t* type, std::vector* buffer); +Status WriteMessage(int fd, int64_t type, int64_t length, uint8_t* bytes); + +Status ReadBytes(int fd, uint8_t* cursor, size_t length); + +Status ReadMessage(int fd, int64_t* type, std::vector* buffer); int bind_ipc_sock(const std::string& pathname, bool shall_listen); int connect_ipc_sock(const std::string& pathname); -int connect_ipc_sock_retry(const std::string& pathname, int num_retries, int64_t timeout); +Status ConnectIpcSocketRetry(const std::string& pathname, int num_retries, + int64_t timeout, int* fd); int AcceptClient(int socket_fd); uint8_t* read_message_async(int sock); +} // namespace plasma + #endif // PLASMA_IO_H diff --git a/cpp/src/plasma/malloc.cc b/cpp/src/plasma/malloc.cc index 97c9a16c0c0..77a8afea754 100644 --- a/cpp/src/plasma/malloc.cc +++ b/cpp/src/plasma/malloc.cc @@ -69,13 +69,9 @@ std::unordered_map mmap_records; constexpr int GRANULARITY_MULTIPLIER = 2; -static void* pointer_advance(void* p, ptrdiff_t n) { - return (unsigned char*)p + n; -} +static void* pointer_advance(void* p, ptrdiff_t n) { return (unsigned char*)p + n; } -static void* pointer_retreat(void* p, ptrdiff_t n) { - return (unsigned char*)p - n; -} +static void* pointer_retreat(void* p, ptrdiff_t n) { return (unsigned char*)p - n; } static ptrdiff_t pointer_distance(void const* pfrom, void const* pto) { return (unsigned char const*)pto - (unsigned char const*)pfrom; @@ -87,8 +83,8 @@ int create_buffer(int64_t size) { int fd; #ifdef _WIN32 if (!CreateFileMapping(INVALID_HANDLE_VALUE, NULL, PAGE_READWRITE, - (DWORD)((uint64_t)size >> (CHAR_BIT * sizeof(DWORD))), (DWORD)(uint64_t)size, - NULL)) { + (DWORD)((uint64_t)size >> (CHAR_BIT * sizeof(DWORD))), + (DWORD)(uint64_t)size, NULL)) { fd = -1; } #else @@ -127,7 +123,9 @@ void* fake_mmap(size_t size) { int fd = create_buffer(size); ARROW_CHECK(fd >= 0) << "Failed to create buffer during mmap"; void* pointer = mmap(NULL, size, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0); - if (pointer == MAP_FAILED) { return pointer; } + if (pointer == MAP_FAILED) { + return pointer; + } /* Increase dlmalloc's allocation granularity directly. 
*/ mparams.granularity *= GRANULARITY_MULTIPLIER; @@ -156,7 +154,9 @@ int fake_munmap(void* addr, int64_t size) { } int r = munmap(addr, size); - if (r == 0) { close(entry->second.fd); } + if (r == 0) { + close(entry->second.fd); + } mmap_records.erase(entry); return r; diff --git a/cpp/src/plasma/plasma.cc b/cpp/src/plasma/plasma.cc index 559d8e7f2a6..87082817f12 100644 --- a/cpp/src/plasma/plasma.cc +++ b/cpp/src/plasma/plasma.cc @@ -24,8 +24,12 @@ #include "plasma/common.h" #include "plasma/protocol.h" +namespace plasma { + int warn_if_sigpipe(int status, int client_sock) { - if (status >= 0) { return 0; } + if (status >= 0) { + return 0; + } if (errno == EPIPE || errno == EBADF || errno == ECONNRESET) { ARROW_LOG(WARNING) << "Received SIGPIPE, BAD FILE DESCRIPTOR, or ECONNRESET when " "sending a message to client on fd " @@ -56,9 +60,13 @@ uint8_t* create_object_info_buffer(ObjectInfoT* object_info) { return notification; } -ObjectTableEntry* get_object_table_entry( - PlasmaStoreInfo* store_info, const ObjectID& object_id) { +ObjectTableEntry* get_object_table_entry(PlasmaStoreInfo* store_info, + const ObjectID& object_id) { auto it = store_info->objects.find(object_id); - if (it == store_info->objects.end()) { return NULL; } + if (it == store_info->objects.end()) { + return NULL; + } return it->second.get(); } + +} // namespace plasma diff --git a/cpp/src/plasma/plasma.h b/cpp/src/plasma/plasma.h index 275d0c7a416..d60e5a83630 100644 --- a/cpp/src/plasma/plasma.h +++ b/cpp/src/plasma/plasma.h @@ -32,8 +32,10 @@ #include "arrow/status.h" #include "arrow/util/logging.h" -#include "format/common_generated.h" #include "plasma/common.h" +#include "plasma/common_generated.h" + +namespace plasma { #define HANDLE_SIGPIPE(s, fd_) \ do { \ @@ -54,47 +56,23 @@ /// Allocation granularity used in plasma for object allocation. #define BLOCK_SIZE 64 -/// Size of object hash digests. -constexpr int64_t kDigestSize = sizeof(uint64_t); - struct Client; -/// Object request data structure. Used in the plasma_wait_for_objects() -/// argument. -typedef struct { - /// The ID of the requested object. If ID_NIL request any object. - ObjectID object_id; - /// Request associated to the object. It can take one of the following values: - /// - PLASMA_QUERY_LOCAL: return if or when the object is available in the - /// local Plasma Store. - /// - PLASMA_QUERY_ANYWHERE: return if or when the object is available in - /// the system (i.e., either in the local or a remote Plasma Store). - int type; - /// Object status. Same as the status returned by plasma_status() function - /// call. This is filled in by plasma_wait_for_objects1(): - /// - ObjectStatus_Local: object is ready at the local Plasma Store. - /// - ObjectStatus_Remote: object is ready at a remote Plasma Store. - /// - ObjectStatus_Nonexistent: object does not exist in the system. - /// - PLASMA_CLIENT_IN_TRANSFER, if the object is currently being scheduled - /// for being transferred or it is transferring. - int status; -} ObjectRequest; - /// Mapping from object IDs to type and status of the request. typedef std::unordered_map ObjectRequestMap; /// Handle to access memory mapped file and map it into client address space. -typedef struct { +struct object_handle { /// The file descriptor of the memory mapped file in the store. It is used as /// a unique identifier of the file in the client to look up the corresponding /// file descriptor on the client's side. int store_fd; /// The size in bytes of the memory mapped file. 
int64_t mmap_size; -} object_handle; +}; // TODO(pcm): Replace this by the flatbuffers message PlasmaObjectSpec. -typedef struct { +struct PlasmaObject { /// Handle for memory mapped file the object is stored in. object_handle handle; /// The offset in bytes in the memory mapped file of the data. @@ -105,28 +83,21 @@ typedef struct { int64_t data_size; /// The size in bytes of the metadata. int64_t metadata_size; -} PlasmaObject; +}; -typedef enum { +enum object_state { /// Object was created but not sealed in the local Plasma Store. PLASMA_CREATED = 1, /// Object is sealed and stored in the local Plasma Store. PLASMA_SEALED -} object_state; +}; -typedef enum { +enum object_status { /// The object was not found. OBJECT_NOT_FOUND = 0, /// The object was found. OBJECT_FOUND = 1 -} object_status; - -typedef enum { - /// Query for object in the local plasma store. - PLASMA_QUERY_LOCAL = 1, - /// Query for object in the local plasma store or in a remote plasma store. - PLASMA_QUERY_ANYWHERE -} object_request_type; +}; /// This type is used by the Plasma store. It is here because it is exposed to /// the eviction policy. @@ -167,8 +138,8 @@ struct PlasmaStoreInfo { /// @param object_id The object_id of the entry we are looking for. /// @return The entry associated with the object_id or NULL if the object_id /// is not present. -ObjectTableEntry* get_object_table_entry( - PlasmaStoreInfo* store_info, const ObjectID& object_id); +ObjectTableEntry* get_object_table_entry(PlasmaStoreInfo* store_info, + const ObjectID& object_id); /// Print a warning if the status is less than zero. This should be used to check /// the success of messages sent to plasma clients. We print a warning instead of @@ -188,4 +159,6 @@ int warn_if_sigpipe(int status, int client_sock); uint8_t* create_object_info_buffer(ObjectInfoT* object_info); +} // namespace plasma + #endif // PLASMA_PLASMA_H diff --git a/cpp/src/plasma/plasma.pc.in b/cpp/src/plasma/plasma.pc.in new file mode 100644 index 00000000000..d86868939f3 --- /dev/null +++ b/cpp/src/plasma/plasma.pc.in @@ -0,0 +1,30 @@ +# Licensed to the Apache Software Foundation (ASF) under one +# or more contributor license agreements. See the NOTICE file +# distributed with this work for additional information +# regarding copyright ownership. The ASF licenses this file +# to you under the Apache License, Version 2.0 (the +# "License"); you may not use this file except in compliance +# with the License. You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, +# software distributed under the License is distributed on an +# "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY +# KIND, either express or implied. See the License for the +# specific language governing permissions and limitations +# under the License. + +prefix=@CMAKE_INSTALL_PREFIX@ +libdir=${prefix}/@CMAKE_INSTALL_LIBDIR@ +includedir=${prefix}/include + +so_version=@PLASMA_SO_VERSION@ +abi_version=@PLASMA_ABI_VERSION@ +executable=${prefix}/@CMAKE_INSTALL_BINDIR@/plasma_store + +Name: Plasma +Description: Plasma is an in-memory object store and cache for big data. 
+Version: @PLASMA_VERSION@ +Libs: -L${libdir} -lplasma +Cflags: -I${includedir} diff --git a/cpp/src/plasma/protocol.cc b/cpp/src/plasma/protocol.cc index 246aa297360..77bc8b7aae3 100644 --- a/cpp/src/plasma/protocol.cc +++ b/cpp/src/plasma/protocol.cc @@ -18,16 +18,18 @@ #include "plasma/protocol.h" #include "flatbuffers/flatbuffers.h" -#include "format/plasma_generated.h" +#include "plasma/plasma_generated.h" #include "plasma/common.h" #include "plasma/io.h" +namespace plasma { + using flatbuffers::uoffset_t; flatbuffers::Offset>> to_flatbuffer(flatbuffers::FlatBufferBuilder* fbb, const ObjectID* object_ids, - int64_t num_objects) { + int64_t num_objects) { std::vector> results; for (int64_t i = 0; i < num_objects; i++) { results.push_back(fbb->CreateString(object_ids[i].binary())); @@ -45,45 +47,48 @@ Status PlasmaReceive(int sock, int64_t message_type, std::vector* buffe template Status PlasmaSend(int sock, int64_t message_type, flatbuffers::FlatBufferBuilder* fbb, - const Message& message) { + const Message& message) { fbb->Finish(message); return WriteMessage(sock, message_type, fbb->GetSize(), fbb->GetBufferPointer()); } // Create messages. -Status SendCreateRequest( - int sock, ObjectID object_id, int64_t data_size, int64_t metadata_size) { +Status SendCreateRequest(int sock, ObjectID object_id, int64_t data_size, + int64_t metadata_size) { flatbuffers::FlatBufferBuilder fbb; - auto message = CreatePlasmaCreateRequest( - fbb, fbb.CreateString(object_id.binary()), data_size, metadata_size); + auto message = CreatePlasmaCreateRequest(fbb, fbb.CreateString(object_id.binary()), + data_size, metadata_size); return PlasmaSend(sock, MessageType_PlasmaCreateRequest, &fbb, message); } -Status ReadCreateRequest( - uint8_t* data, ObjectID* object_id, int64_t* data_size, int64_t* metadata_size) { +Status ReadCreateRequest(uint8_t* data, size_t size, ObjectID* object_id, + int64_t* data_size, int64_t* metadata_size) { DCHECK(data); auto message = flatbuffers::GetRoot(data); + DCHECK(verify_flatbuffer(message, data, size)); *data_size = message->data_size(); *metadata_size = message->metadata_size(); *object_id = ObjectID::from_binary(message->object_id()->str()); return Status::OK(); } -Status SendCreateReply( - int sock, ObjectID object_id, PlasmaObject* object, int error_code) { +Status SendCreateReply(int sock, ObjectID object_id, PlasmaObject* object, + int error_code) { flatbuffers::FlatBufferBuilder fbb; PlasmaObjectSpec plasma_object(object->handle.store_fd, object->handle.mmap_size, - object->data_offset, object->data_size, object->metadata_offset, - object->metadata_size); - auto message = CreatePlasmaCreateReply( - fbb, fbb.CreateString(object_id.binary()), &plasma_object, (PlasmaError)error_code); + object->data_offset, object->data_size, + object->metadata_offset, object->metadata_size); + auto message = CreatePlasmaCreateReply(fbb, fbb.CreateString(object_id.binary()), + &plasma_object, (PlasmaError)error_code); return PlasmaSend(sock, MessageType_PlasmaCreateReply, &fbb, message); } -Status ReadCreateReply(uint8_t* data, ObjectID* object_id, PlasmaObject* object) { +Status ReadCreateReply(uint8_t* data, size_t size, ObjectID* object_id, + PlasmaObject* object) { DCHECK(data); auto message = flatbuffers::GetRoot(data); + DCHECK(verify_flatbuffer(message, data, size)); *object_id = ObjectID::from_binary(message->object_id()->str()); object->handle.store_fd = message->plasma_object()->segment_index(); object->handle.mmap_size = message->plasma_object()->mmap_size(); @@ -104,9 +109,11 
@@ Status SendSealRequest(int sock, ObjectID object_id, unsigned char* digest) { return PlasmaSend(sock, MessageType_PlasmaSealRequest, &fbb, message); } -Status ReadSealRequest(uint8_t* data, ObjectID* object_id, unsigned char* digest) { +Status ReadSealRequest(uint8_t* data, size_t size, ObjectID* object_id, + unsigned char* digest) { DCHECK(data); auto message = flatbuffers::GetRoot(data); + DCHECK(verify_flatbuffer(message, data, size)); *object_id = ObjectID::from_binary(message->object_id()->str()); ARROW_CHECK(message->digest()->size() == kDigestSize); memcpy(digest, message->digest()->data(), kDigestSize); @@ -115,14 +122,15 @@ Status ReadSealRequest(uint8_t* data, ObjectID* object_id, unsigned char* digest Status SendSealReply(int sock, ObjectID object_id, int error) { flatbuffers::FlatBufferBuilder fbb; - auto message = CreatePlasmaSealReply( - fbb, fbb.CreateString(object_id.binary()), (PlasmaError)error); + auto message = CreatePlasmaSealReply(fbb, fbb.CreateString(object_id.binary()), + (PlasmaError)error); return PlasmaSend(sock, MessageType_PlasmaSealReply, &fbb, message); } -Status ReadSealReply(uint8_t* data, ObjectID* object_id) { +Status ReadSealReply(uint8_t* data, size_t size, ObjectID* object_id) { DCHECK(data); auto message = flatbuffers::GetRoot(data); + DCHECK(verify_flatbuffer(message, data, size)); *object_id = ObjectID::from_binary(message->object_id()->str()); return plasma_error_status(message->error()); } @@ -131,27 +139,29 @@ Status ReadSealReply(uint8_t* data, ObjectID* object_id) { Status SendReleaseRequest(int sock, ObjectID object_id) { flatbuffers::FlatBufferBuilder fbb; - auto message = CreatePlasmaSealRequest(fbb, fbb.CreateString(object_id.binary())); + auto message = CreatePlasmaReleaseRequest(fbb, fbb.CreateString(object_id.binary())); return PlasmaSend(sock, MessageType_PlasmaReleaseRequest, &fbb, message); } -Status ReadReleaseRequest(uint8_t* data, ObjectID* object_id) { +Status ReadReleaseRequest(uint8_t* data, size_t size, ObjectID* object_id) { DCHECK(data); auto message = flatbuffers::GetRoot(data); + DCHECK(verify_flatbuffer(message, data, size)); *object_id = ObjectID::from_binary(message->object_id()->str()); return Status::OK(); } Status SendReleaseReply(int sock, ObjectID object_id, int error) { flatbuffers::FlatBufferBuilder fbb; - auto message = CreatePlasmaReleaseReply( - fbb, fbb.CreateString(object_id.binary()), (PlasmaError)error); + auto message = CreatePlasmaReleaseReply(fbb, fbb.CreateString(object_id.binary()), + (PlasmaError)error); return PlasmaSend(sock, MessageType_PlasmaReleaseReply, &fbb, message); } -Status ReadReleaseReply(uint8_t* data, ObjectID* object_id) { +Status ReadReleaseReply(uint8_t* data, size_t size, ObjectID* object_id) { DCHECK(data); auto message = flatbuffers::GetRoot(data); + DCHECK(verify_flatbuffer(message, data, size)); *object_id = ObjectID::from_binary(message->object_id()->str()); return plasma_error_status(message->error()); } @@ -164,23 +174,25 @@ Status SendDeleteRequest(int sock, ObjectID object_id) { return PlasmaSend(sock, MessageType_PlasmaDeleteRequest, &fbb, message); } -Status ReadDeleteRequest(uint8_t* data, ObjectID* object_id) { +Status ReadDeleteRequest(uint8_t* data, size_t size, ObjectID* object_id) { DCHECK(data); auto message = flatbuffers::GetRoot(data); + DCHECK(verify_flatbuffer(message, data, size)); *object_id = ObjectID::from_binary(message->object_id()->str()); return Status::OK(); } Status SendDeleteReply(int sock, ObjectID object_id, int error) { 
flatbuffers::FlatBufferBuilder fbb; - auto message = CreatePlasmaDeleteReply( - fbb, fbb.CreateString(object_id.binary()), (PlasmaError)error); + auto message = CreatePlasmaDeleteReply(fbb, fbb.CreateString(object_id.binary()), + (PlasmaError)error); return PlasmaSend(sock, MessageType_PlasmaDeleteReply, &fbb, message); } -Status ReadDeleteReply(uint8_t* data, ObjectID* object_id) { +Status ReadDeleteReply(uint8_t* data, size_t size, ObjectID* object_id) { DCHECK(data); auto message = flatbuffers::GetRoot(data); + DCHECK(verify_flatbuffer(message, data, size)); *object_id = ObjectID::from_binary(message->object_id()->str()); return plasma_error_status(message->error()); } @@ -194,34 +206,38 @@ Status SendStatusRequest(int sock, const ObjectID* object_ids, int64_t num_objec return PlasmaSend(sock, MessageType_PlasmaStatusRequest, &fbb, message); } -Status ReadStatusRequest(uint8_t* data, ObjectID object_ids[], int64_t num_objects) { +Status ReadStatusRequest(uint8_t* data, size_t size, ObjectID object_ids[], + int64_t num_objects) { DCHECK(data); auto message = flatbuffers::GetRoot(data); + DCHECK(verify_flatbuffer(message, data, size)); for (uoffset_t i = 0; i < num_objects; ++i) { object_ids[i] = ObjectID::from_binary(message->object_ids()->Get(i)->str()); } return Status::OK(); } -Status SendStatusReply( - int sock, ObjectID object_ids[], int object_status[], int64_t num_objects) { +Status SendStatusReply(int sock, ObjectID object_ids[], int object_status[], + int64_t num_objects) { flatbuffers::FlatBufferBuilder fbb; auto message = CreatePlasmaStatusReply(fbb, to_flatbuffer(&fbb, object_ids, num_objects), - fbb.CreateVector(object_status, num_objects)); + fbb.CreateVector(object_status, num_objects)); return PlasmaSend(sock, MessageType_PlasmaStatusReply, &fbb, message); } -int64_t ReadStatusReply_num_objects(uint8_t* data) { +int64_t ReadStatusReply_num_objects(uint8_t* data, size_t size) { DCHECK(data); auto message = flatbuffers::GetRoot(data); + DCHECK(verify_flatbuffer(message, data, size)); return message->object_ids()->size(); } -Status ReadStatusReply( - uint8_t* data, ObjectID object_ids[], int object_status[], int64_t num_objects) { +Status ReadStatusReply(uint8_t* data, size_t size, ObjectID object_ids[], + int object_status[], int64_t num_objects) { DCHECK(data); auto message = flatbuffers::GetRoot(data); + DCHECK(verify_flatbuffer(message, data, size)); for (uoffset_t i = 0; i < num_objects; ++i) { object_ids[i] = ObjectID::from_binary(message->object_ids()->Get(i)->str()); } @@ -239,9 +255,10 @@ Status SendContainsRequest(int sock, ObjectID object_id) { return PlasmaSend(sock, MessageType_PlasmaContainsRequest, &fbb, message); } -Status ReadContainsRequest(uint8_t* data, ObjectID* object_id) { +Status ReadContainsRequest(uint8_t* data, size_t size, ObjectID* object_id) { DCHECK(data); auto message = flatbuffers::GetRoot(data); + DCHECK(verify_flatbuffer(message, data, size)); *object_id = ObjectID::from_binary(message->object_id()->str()); return Status::OK(); } @@ -253,9 +270,11 @@ Status SendContainsReply(int sock, ObjectID object_id, bool has_object) { return PlasmaSend(sock, MessageType_PlasmaContainsReply, &fbb, message); } -Status ReadContainsReply(uint8_t* data, ObjectID* object_id, bool* has_object) { +Status ReadContainsReply(uint8_t* data, size_t size, ObjectID* object_id, + bool* has_object) { DCHECK(data); auto message = flatbuffers::GetRoot(data); + DCHECK(verify_flatbuffer(message, data, size)); *object_id = ObjectID::from_binary(message->object_id()->str()); 
*has_object = message->has_object(); return Status::OK(); @@ -269,9 +288,7 @@ Status SendConnectRequest(int sock) { return PlasmaSend(sock, MessageType_PlasmaConnectRequest, &fbb, message); } -Status ReadConnectRequest(uint8_t* data) { - return Status::OK(); -} +Status ReadConnectRequest(uint8_t* data) { return Status::OK(); } Status SendConnectReply(int sock, int64_t memory_capacity) { flatbuffers::FlatBufferBuilder fbb; @@ -279,9 +296,10 @@ Status SendConnectReply(int sock, int64_t memory_capacity) { return PlasmaSend(sock, MessageType_PlasmaConnectReply, &fbb, message); } -Status ReadConnectReply(uint8_t* data, int64_t* memory_capacity) { +Status ReadConnectReply(uint8_t* data, size_t size, int64_t* memory_capacity) { DCHECK(data); auto message = flatbuffers::GetRoot(data); + DCHECK(verify_flatbuffer(message, data, size)); *memory_capacity = message->memory_capacity(); return Status::OK(); } @@ -294,9 +312,10 @@ Status SendEvictRequest(int sock, int64_t num_bytes) { return PlasmaSend(sock, MessageType_PlasmaEvictRequest, &fbb, message); } -Status ReadEvictRequest(uint8_t* data, int64_t* num_bytes) { +Status ReadEvictRequest(uint8_t* data, size_t size, int64_t* num_bytes) { DCHECK(data); auto message = flatbuffers::GetRoot(data); + DCHECK(verify_flatbuffer(message, data, size)); *num_bytes = message->num_bytes(); return Status::OK(); } @@ -307,27 +326,29 @@ Status SendEvictReply(int sock, int64_t num_bytes) { return PlasmaSend(sock, MessageType_PlasmaEvictReply, &fbb, message); } -Status ReadEvictReply(uint8_t* data, int64_t& num_bytes) { +Status ReadEvictReply(uint8_t* data, size_t size, int64_t& num_bytes) { DCHECK(data); auto message = flatbuffers::GetRoot(data); + DCHECK(verify_flatbuffer(message, data, size)); num_bytes = message->num_bytes(); return Status::OK(); } // Get messages. 
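Each `Read*` helper in protocol.cc now receives the buffer size alongside the data pointer and runs `DCHECK(verify_flatbuffer(message, data, size))` before touching any field. The following is a minimal, standalone sketch of that verification pattern using the FlatBuffers verifier API directly; `PlasmaEvictReply` is used here only as a stand-in for whichever generated table is being read, not as the literal helper added by this patch:

```cpp
// Sketch of the read-side verification pattern: bound a flatbuffers::Verifier
// by the buffer size and run the generated Verify() method before any field
// is accessed, so a truncated or corrupted message is rejected up front.
#include <cstddef>
#include <cstdint>

#include "flatbuffers/flatbuffers.h"
#include "plasma/plasma_generated.h"  // generated Plasma* tables (assumed name)

namespace plasma {

bool VerifyEvictReply(const uint8_t* data, size_t size) {
  flatbuffers::Verifier verifier(data, size);
  auto message = flatbuffers::GetRoot<PlasmaEvictReply>(data);
  return message->Verify(verifier);
}

}  // namespace plasma
```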
-Status SendGetRequest( - int sock, const ObjectID* object_ids, int64_t num_objects, int64_t timeout_ms) { +Status SendGetRequest(int sock, const ObjectID* object_ids, int64_t num_objects, + int64_t timeout_ms) { flatbuffers::FlatBufferBuilder fbb; - auto message = CreatePlasmaGetRequest( - fbb, to_flatbuffer(&fbb, object_ids, num_objects), timeout_ms); + auto message = CreatePlasmaGetRequest(fbb, to_flatbuffer(&fbb, object_ids, num_objects), + timeout_ms); return PlasmaSend(sock, MessageType_PlasmaGetRequest, &fbb, message); } -Status ReadGetRequest( - uint8_t* data, std::vector& object_ids, int64_t* timeout_ms) { +Status ReadGetRequest(uint8_t* data, size_t size, std::vector& object_ids, + int64_t* timeout_ms) { DCHECK(data); auto message = flatbuffers::GetRoot(data); + DCHECK(verify_flatbuffer(message, data, size)); for (uoffset_t i = 0; i < message->object_ids()->size(); ++i) { auto object_id = message->object_ids()->Get(i)->str(); object_ids.push_back(ObjectID::from_binary(object_id)); @@ -336,7 +357,8 @@ Status ReadGetRequest( return Status::OK(); } -Status SendGetReply(int sock, ObjectID object_ids[], +Status SendGetReply( + int sock, ObjectID object_ids[], std::unordered_map& plasma_objects, int64_t num_objects) { flatbuffers::FlatBufferBuilder fbb; @@ -345,18 +367,20 @@ Status SendGetReply(int sock, ObjectID object_ids[], for (int i = 0; i < num_objects; ++i) { const PlasmaObject& object = plasma_objects[object_ids[i]]; objects.push_back(PlasmaObjectSpec(object.handle.store_fd, object.handle.mmap_size, - object.data_offset, object.data_size, object.metadata_offset, - object.metadata_size)); + object.data_offset, object.data_size, + object.metadata_offset, object.metadata_size)); } - auto message = CreatePlasmaGetReply(fbb, to_flatbuffer(&fbb, object_ids, num_objects), - fbb.CreateVectorOfStructs(objects.data(), num_objects)); + auto message = + CreatePlasmaGetReply(fbb, to_flatbuffer(&fbb, object_ids, num_objects), + fbb.CreateVectorOfStructs(objects.data(), num_objects)); return PlasmaSend(sock, MessageType_PlasmaGetReply, &fbb, message); } -Status ReadGetReply(uint8_t* data, ObjectID object_ids[], PlasmaObject plasma_objects[], - int64_t num_objects) { +Status ReadGetReply(uint8_t* data, size_t size, ObjectID object_ids[], + PlasmaObject plasma_objects[], int64_t num_objects) { DCHECK(data); auto message = flatbuffers::GetRoot(data); + DCHECK(verify_flatbuffer(message, data, size)); for (uoffset_t i = 0; i < num_objects; ++i) { object_ids[i] = ObjectID::from_binary(message->object_ids()->Get(i)->str()); } @@ -381,9 +405,10 @@ Status SendFetchRequest(int sock, const ObjectID* object_ids, int64_t num_object return PlasmaSend(sock, MessageType_PlasmaFetchRequest, &fbb, message); } -Status ReadFetchRequest(uint8_t* data, std::vector& object_ids) { +Status ReadFetchRequest(uint8_t* data, size_t size, std::vector& object_ids) { DCHECK(data); auto message = flatbuffers::GetRoot(data); + DCHECK(verify_flatbuffer(message, data, size)); for (uoffset_t i = 0; i < message->object_ids()->size(); ++i) { object_ids.push_back(ObjectID::from_binary(message->object_ids()->Get(i)->str())); } @@ -393,25 +418,26 @@ Status ReadFetchRequest(uint8_t* data, std::vector& object_ids) { // Wait messages. 
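On the send side, every helper follows the same shape captured by the `PlasmaSend` template earlier in this file: serialize the message with a `FlatBufferBuilder`, then write it with the framing helper from io.h. Below is a self-contained sketch of what that template boils down to for one concrete message; the release request is used because its builder call appears in the hunk above, and the free-standing function name is hypothetical:

```cpp
// Sketch of the send-side pattern: build the flatbuffer, finish it, and hand
// the raw bytes to WriteMessage, which prepends the protocol version, message
// type and length.
#include <cstdint>

#include "flatbuffers/flatbuffers.h"

#include "plasma/common.h"            // ObjectID
#include "plasma/io.h"                // WriteMessage, Status
#include "plasma/plasma_generated.h"  // CreatePlasmaReleaseRequest, MessageType_*

namespace plasma {

Status SendReleaseRequestSketch(int sock, const ObjectID& object_id) {
  flatbuffers::FlatBufferBuilder fbb;
  auto message =
      CreatePlasmaReleaseRequest(fbb, fbb.CreateString(object_id.binary()));
  fbb.Finish(message);
  return WriteMessage(sock, MessageType_PlasmaReleaseRequest, fbb.GetSize(),
                      fbb.GetBufferPointer());
}

}  // namespace plasma
```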
Status SendWaitRequest(int sock, ObjectRequest object_requests[], int64_t num_requests, - int num_ready_objects, int64_t timeout_ms) { + int num_ready_objects, int64_t timeout_ms) { flatbuffers::FlatBufferBuilder fbb; std::vector> object_request_specs; for (int i = 0; i < num_requests; i++) { - object_request_specs.push_back(CreateObjectRequestSpec(fbb, - fbb.CreateString(object_requests[i].object_id.binary()), + object_request_specs.push_back(CreateObjectRequestSpec( + fbb, fbb.CreateString(object_requests[i].object_id.binary()), object_requests[i].type)); } - auto message = CreatePlasmaWaitRequest( - fbb, fbb.CreateVector(object_request_specs), num_ready_objects, timeout_ms); + auto message = CreatePlasmaWaitRequest(fbb, fbb.CreateVector(object_request_specs), + num_ready_objects, timeout_ms); return PlasmaSend(sock, MessageType_PlasmaWaitRequest, &fbb, message); } -Status ReadWaitRequest(uint8_t* data, ObjectRequestMap& object_requests, - int64_t* timeout_ms, int* num_ready_objects) { +Status ReadWaitRequest(uint8_t* data, size_t size, ObjectRequestMap& object_requests, + int64_t* timeout_ms, int* num_ready_objects) { DCHECK(data); auto message = flatbuffers::GetRoot(data); + DCHECK(verify_flatbuffer(message, data, size)); *num_ready_objects = message->num_ready_objects(); *timeout_ms = message->timeout(); @@ -419,14 +445,14 @@ Status ReadWaitRequest(uint8_t* data, ObjectRequestMap& object_requests, ObjectID object_id = ObjectID::from_binary(message->object_requests()->Get(i)->object_id()->str()); ObjectRequest object_request({object_id, message->object_requests()->Get(i)->type(), - ObjectStatus_Nonexistent}); + ObjectStatus_Nonexistent}); object_requests[object_id] = object_request; } return Status::OK(); } -Status SendWaitReply( - int sock, const ObjectRequestMap& object_requests, int num_ready_objects) { +Status SendWaitReply(int sock, const ObjectRequestMap& object_requests, + int num_ready_objects) { flatbuffers::FlatBufferBuilder fbb; std::vector> object_replies; @@ -441,11 +467,12 @@ Status SendWaitReply( return PlasmaSend(sock, MessageType_PlasmaWaitReply, &fbb, message); } -Status ReadWaitReply( - uint8_t* data, ObjectRequest object_requests[], int* num_ready_objects) { +Status ReadWaitReply(uint8_t* data, size_t size, ObjectRequest object_requests[], + int* num_ready_objects) { DCHECK(data); auto message = flatbuffers::GetRoot(data); + DCHECK(verify_flatbuffer(message, data, size)); *num_ready_objects = message->num_ready_objects(); for (int i = 0; i < *num_ready_objects; i++) { object_requests[i].object_id = @@ -473,9 +500,11 @@ Status SendDataRequest(int sock, ObjectID object_id, const char* address, int po return PlasmaSend(sock, MessageType_PlasmaDataRequest, &fbb, message); } -Status ReadDataRequest(uint8_t* data, ObjectID* object_id, char** address, int* port) { +Status ReadDataRequest(uint8_t* data, size_t size, ObjectID* object_id, char** address, + int* port) { DCHECK(data); auto message = flatbuffers::GetRoot(data); + DCHECK(verify_flatbuffer(message, data, size)); DCHECK(message->object_id()->size() == sizeof(ObjectID)); *object_id = ObjectID::from_binary(message->object_id()->str()); *address = strdup(message->address()->c_str()); @@ -483,20 +512,23 @@ Status ReadDataRequest(uint8_t* data, ObjectID* object_id, char** address, int* return Status::OK(); } -Status SendDataReply( - int sock, ObjectID object_id, int64_t object_size, int64_t metadata_size) { +Status SendDataReply(int sock, ObjectID object_id, int64_t object_size, + int64_t metadata_size) { 
flatbuffers::FlatBufferBuilder fbb; - auto message = CreatePlasmaDataReply( - fbb, fbb.CreateString(object_id.binary()), object_size, metadata_size); + auto message = CreatePlasmaDataReply(fbb, fbb.CreateString(object_id.binary()), + object_size, metadata_size); return PlasmaSend(sock, MessageType_PlasmaDataReply, &fbb, message); } -Status ReadDataReply( - uint8_t* data, ObjectID* object_id, int64_t* object_size, int64_t* metadata_size) { +Status ReadDataReply(uint8_t* data, size_t size, ObjectID* object_id, + int64_t* object_size, int64_t* metadata_size) { DCHECK(data); auto message = flatbuffers::GetRoot(data); + DCHECK(verify_flatbuffer(message, data, size)); *object_id = ObjectID::from_binary(message->object_id()->str()); *object_size = (int64_t)message->object_size(); *metadata_size = (int64_t)message->metadata_size(); return Status::OK(); } + +} // namespace plasma diff --git a/cpp/src/plasma/protocol.h b/cpp/src/plasma/protocol.h index 5d9d1367514..af4b13978c6 100644 --- a/cpp/src/plasma/protocol.h +++ b/cpp/src/plasma/protocol.h @@ -21,135 +21,148 @@ #include #include "arrow/status.h" -#include "format/plasma_generated.h" #include "plasma/plasma.h" +#include "plasma/plasma_generated.h" + +namespace plasma { using arrow::Status; +template +bool verify_flatbuffer(T* object, uint8_t* data, size_t size) { + flatbuffers::Verifier verifier(data, size); + return object->Verify(verifier); +} + /* Plasma receive message. */ Status PlasmaReceive(int sock, int64_t message_type, std::vector* buffer); /* Plasma Create message functions. */ -Status SendCreateRequest( - int sock, ObjectID object_id, int64_t data_size, int64_t metadata_size); +Status SendCreateRequest(int sock, ObjectID object_id, int64_t data_size, + int64_t metadata_size); -Status ReadCreateRequest( - uint8_t* data, ObjectID* object_id, int64_t* data_size, int64_t* metadata_size); +Status ReadCreateRequest(uint8_t* data, size_t size, ObjectID* object_id, + int64_t* data_size, int64_t* metadata_size); Status SendCreateReply(int sock, ObjectID object_id, PlasmaObject* object, int error); -Status ReadCreateReply(uint8_t* data, ObjectID* object_id, PlasmaObject* object); +Status ReadCreateReply(uint8_t* data, size_t size, ObjectID* object_id, + PlasmaObject* object); /* Plasma Seal message functions. */ Status SendSealRequest(int sock, ObjectID object_id, unsigned char* digest); -Status ReadSealRequest(uint8_t* data, ObjectID* object_id, unsigned char* digest); +Status ReadSealRequest(uint8_t* data, size_t size, ObjectID* object_id, + unsigned char* digest); Status SendSealReply(int sock, ObjectID object_id, int error); -Status ReadSealReply(uint8_t* data, ObjectID* object_id); +Status ReadSealReply(uint8_t* data, size_t size, ObjectID* object_id); /* Plasma Get message functions. 
*/ -Status SendGetRequest( - int sock, const ObjectID* object_ids, int64_t num_objects, int64_t timeout_ms); +Status SendGetRequest(int sock, const ObjectID* object_ids, int64_t num_objects, + int64_t timeout_ms); -Status ReadGetRequest( - uint8_t* data, std::vector& object_ids, int64_t* timeout_ms); +Status ReadGetRequest(uint8_t* data, size_t size, std::vector& object_ids, + int64_t* timeout_ms); -Status SendGetReply(int sock, ObjectID object_ids[], +Status SendGetReply( + int sock, ObjectID object_ids[], std::unordered_map& plasma_objects, int64_t num_objects); -Status ReadGetReply(uint8_t* data, ObjectID object_ids[], PlasmaObject plasma_objects[], - int64_t num_objects); +Status ReadGetReply(uint8_t* data, size_t size, ObjectID object_ids[], + PlasmaObject plasma_objects[], int64_t num_objects); /* Plasma Release message functions. */ Status SendReleaseRequest(int sock, ObjectID object_id); -Status ReadReleaseRequest(uint8_t* data, ObjectID* object_id); +Status ReadReleaseRequest(uint8_t* data, size_t size, ObjectID* object_id); Status SendReleaseReply(int sock, ObjectID object_id, int error); -Status ReadReleaseReply(uint8_t* data, ObjectID* object_id); +Status ReadReleaseReply(uint8_t* data, size_t size, ObjectID* object_id); /* Plasma Delete message functions. */ Status SendDeleteRequest(int sock, ObjectID object_id); -Status ReadDeleteRequest(uint8_t* data, ObjectID* object_id); +Status ReadDeleteRequest(uint8_t* data, size_t size, ObjectID* object_id); Status SendDeleteReply(int sock, ObjectID object_id, int error); -Status ReadDeleteReply(uint8_t* data, ObjectID* object_id); +Status ReadDeleteReply(uint8_t* data, size_t size, ObjectID* object_id); /* Satus messages. */ Status SendStatusRequest(int sock, const ObjectID* object_ids, int64_t num_objects); -Status ReadStatusRequest(uint8_t* data, ObjectID object_ids[], int64_t num_objects); +Status ReadStatusRequest(uint8_t* data, size_t size, ObjectID object_ids[], + int64_t num_objects); -Status SendStatusReply( - int sock, ObjectID object_ids[], int object_status[], int64_t num_objects); +Status SendStatusReply(int sock, ObjectID object_ids[], int object_status[], + int64_t num_objects); -int64_t ReadStatusReply_num_objects(uint8_t* data); +int64_t ReadStatusReply_num_objects(uint8_t* data, size_t size); -Status ReadStatusReply( - uint8_t* data, ObjectID object_ids[], int object_status[], int64_t num_objects); +Status ReadStatusReply(uint8_t* data, size_t size, ObjectID object_ids[], + int object_status[], int64_t num_objects); /* Plasma Constains message functions. */ Status SendContainsRequest(int sock, ObjectID object_id); -Status ReadContainsRequest(uint8_t* data, ObjectID* object_id); +Status ReadContainsRequest(uint8_t* data, size_t size, ObjectID* object_id); Status SendContainsReply(int sock, ObjectID object_id, bool has_object); -Status ReadContainsReply(uint8_t* data, ObjectID* object_id, bool* has_object); +Status ReadContainsReply(uint8_t* data, size_t size, ObjectID* object_id, + bool* has_object); /* Plasma Connect message functions. */ Status SendConnectRequest(int sock); -Status ReadConnectRequest(uint8_t* data); +Status ReadConnectRequest(uint8_t* data, size_t size); Status SendConnectReply(int sock, int64_t memory_capacity); -Status ReadConnectReply(uint8_t* data, int64_t* memory_capacity); +Status ReadConnectReply(uint8_t* data, size_t size, int64_t* memory_capacity); /* Plasma Evict message functions (no reply so far). 
*/ Status SendEvictRequest(int sock, int64_t num_bytes); -Status ReadEvictRequest(uint8_t* data, int64_t* num_bytes); +Status ReadEvictRequest(uint8_t* data, size_t size, int64_t* num_bytes); Status SendEvictReply(int sock, int64_t num_bytes); -Status ReadEvictReply(uint8_t* data, int64_t& num_bytes); +Status ReadEvictReply(uint8_t* data, size_t size, int64_t& num_bytes); /* Plasma Fetch Remote message functions. */ Status SendFetchRequest(int sock, const ObjectID* object_ids, int64_t num_objects); -Status ReadFetchRequest(uint8_t* data, std::vector& object_ids); +Status ReadFetchRequest(uint8_t* data, size_t size, std::vector& object_ids); /* Plasma Wait message functions. */ Status SendWaitRequest(int sock, ObjectRequest object_requests[], int64_t num_requests, - int num_ready_objects, int64_t timeout_ms); + int num_ready_objects, int64_t timeout_ms); -Status ReadWaitRequest(uint8_t* data, ObjectRequestMap& object_requests, - int64_t* timeout_ms, int* num_ready_objects); +Status ReadWaitRequest(uint8_t* data, size_t size, ObjectRequestMap& object_requests, + int64_t* timeout_ms, int* num_ready_objects); -Status SendWaitReply( - int sock, const ObjectRequestMap& object_requests, int num_ready_objects); +Status SendWaitReply(int sock, const ObjectRequestMap& object_requests, + int num_ready_objects); -Status ReadWaitReply( - uint8_t* data, ObjectRequest object_requests[], int* num_ready_objects); +Status ReadWaitReply(uint8_t* data, size_t size, ObjectRequest object_requests[], + int* num_ready_objects); /* Plasma Subscribe message functions. */ @@ -159,12 +172,15 @@ Status SendSubscribeRequest(int sock); Status SendDataRequest(int sock, ObjectID object_id, const char* address, int port); -Status ReadDataRequest(uint8_t* data, ObjectID* object_id, char** address, int* port); +Status ReadDataRequest(uint8_t* data, size_t size, ObjectID* object_id, char** address, + int* port); + +Status SendDataReply(int sock, ObjectID object_id, int64_t object_size, + int64_t metadata_size); -Status SendDataReply( - int sock, ObjectID object_id, int64_t object_size, int64_t metadata_size); +Status ReadDataReply(uint8_t* data, size_t size, ObjectID* object_id, + int64_t* object_size, int64_t* metadata_size); -Status ReadDataReply( - uint8_t* data, ObjectID* object_id, int64_t* object_size, int64_t* metadata_size); +} // namespace plasma #endif /* PLASMA_PROTOCOL */ diff --git a/cpp/src/plasma/store.cc b/cpp/src/plasma/store.cc index 9394e3de310..9f4b98c0ee7 100644 --- a/cpp/src/plasma/store.cc +++ b/cpp/src/plasma/store.cc @@ -49,12 +49,14 @@ #include #include -#include "format/common_generated.h" #include "plasma/common.h" +#include "plasma/common_generated.h" #include "plasma/fling.h" #include "plasma/io.h" #include "plasma/malloc.h" +namespace plasma { + extern "C" { void* dlmalloc(size_t bytes); void* dlmemalign(size_t alignment, size_t bytes); @@ -87,8 +89,8 @@ GetRequest::GetRequest(Client* client, const std::vector& object_ids) object_ids(object_ids.begin(), object_ids.end()), objects(object_ids.size()), num_satisfied(0) { - std::unordered_set unique_ids( - object_ids.begin(), object_ids.end()); + std::unordered_set unique_ids(object_ids.begin(), + object_ids.end()); num_objects_to_wait_for = unique_ids.size(); } @@ -116,7 +118,9 @@ PlasmaStore::~PlasmaStore() { // object's list of clients, otherwise do nothing. void PlasmaStore::add_client_to_object_clients(ObjectTableEntry* entry, Client* client) { // Check if this client is already using the object. 
- if (entry->clients.find(client) != entry->clients.end()) { return; } + if (entry->clients.find(client) != entry->clients.end()) { + return; + } // If there are no other clients using this object, notify the eviction policy // that the object is being used. if (entry->clients.size() == 0) { @@ -131,7 +135,8 @@ void PlasmaStore::add_client_to_object_clients(ObjectTableEntry* entry, Client* // Create a new object buffer in the hash table. int PlasmaStore::create_object(const ObjectID& object_id, int64_t data_size, - int64_t metadata_size, Client* client, PlasmaObject* result) { + int64_t metadata_size, Client* client, + PlasmaObject* result) { ARROW_LOG(DEBUG) << "creating object " << object_id.hex(); if (store_info_.objects.count(object_id) != 0) { // There is already an object with the same ID in the Plasma Store, so @@ -158,7 +163,9 @@ int PlasmaStore::create_object(const ObjectID& object_id, int64_t data_size, delete_objects(objects_to_evict); // Return an error to the client if not enough space could be freed to // create the object. - if (!success) { return PlasmaError_OutOfMemory; } + if (!success) { + return PlasmaError_OutOfMemory; + } } } while (pointer == NULL); int fd; @@ -210,7 +217,7 @@ void PlasmaObject_init(PlasmaObject* object, ObjectTableEntry* entry) { void PlasmaStore::return_from_get(GetRequest* get_req) { // Send the get reply to the client. Status s = SendGetReply(get_req->client->fd, &get_req->object_ids[0], get_req->objects, - get_req->object_ids.size()); + get_req->object_ids.size()); warn_if_sigpipe(s.ok() ? 0 : -1, get_req->client->fd); // If we successfully sent the get reply message to the client, then also send // the file descriptors. @@ -247,10 +254,14 @@ void PlasmaStore::return_from_get(GetRequest* get_req) { auto& get_requests = object_get_requests_[object_id]; // Erase get_req from the vector. auto it = std::find(get_requests.begin(), get_requests.end(), get_req); - if (it != get_requests.end()) { get_requests.erase(it); } + if (it != get_requests.end()) { + get_requests.erase(it); + } } // Remove the get request. - if (get_req->timer != -1) { ARROW_CHECK(loop_->remove_timer(get_req->timer) == AE_OK); } + if (get_req->timer != -1) { + ARROW_CHECK(loop_->RemoveTimer(get_req->timer) == AE_OK); + } delete get_req; } @@ -285,8 +296,9 @@ void PlasmaStore::update_object_get_requests(const ObjectID& object_id) { object_get_requests_.erase(object_id); } -void PlasmaStore::process_get_request( - Client* client, const std::vector& object_ids, int64_t timeout_ms) { +void PlasmaStore::process_get_request(Client* client, + const std::vector& object_ids, + int64_t timeout_ms) { // Create a get request for this object. GetRequest* get_req = new GetRequest(client, object_ids); @@ -318,15 +330,15 @@ void PlasmaStore::process_get_request( } else if (timeout_ms != -1) { // Set a timer that will cause the get request to return to the client. Note // that a timeout of -1 is used to indicate that no timer should be set. 
- get_req->timer = loop_->add_timer(timeout_ms, [this, get_req](int64_t timer_id) { + get_req->timer = loop_->AddTimer(timeout_ms, [this, get_req](int64_t timer_id) { return_from_get(get_req); return kEventLoopTimerDone; }); } } -int PlasmaStore::remove_client_from_object_clients( - ObjectTableEntry* entry, Client* client) { +int PlasmaStore::remove_client_from_object_clients(ObjectTableEntry* entry, + Client* client) { auto it = entry->clients.find(client); if (it != entry->clients.end()) { entry->clients.erase(it); @@ -400,34 +412,40 @@ void PlasmaStore::delete_objects(const std::vector& object_ids) { void PlasmaStore::connect_client(int listener_sock) { int client_fd = AcceptClient(listener_sock); - // This is freed in disconnect_client. + Client* client = new Client(client_fd); + connected_clients_[client_fd] = std::unique_ptr(client); + // Add a callback to handle events on this socket. // TODO(pcm): Check return value. - loop_->add_file_event(client_fd, kEventLoopRead, [this, client](int events) { + loop_->AddFileEvent(client_fd, kEventLoopRead, [this, client](int events) { Status s = process_message(client); - if (!s.ok()) { ARROW_LOG(FATAL) << "Failed to process file event: " << s; } + if (!s.ok()) { + ARROW_LOG(FATAL) << "Failed to process file event: " << s; + } }); ARROW_LOG(DEBUG) << "New connection with fd " << client_fd; } -void PlasmaStore::disconnect_client(Client* client) { - ARROW_CHECK(client != NULL); - ARROW_CHECK(client->fd > 0); - loop_->remove_file_event(client->fd); +void PlasmaStore::disconnect_client(int client_fd) { + ARROW_CHECK(client_fd > 0); + auto it = connected_clients_.find(client_fd); + ARROW_CHECK(it != connected_clients_.end()); + loop_->RemoveFileEvent(client_fd); // Close the socket. - close(client->fd); - ARROW_LOG(INFO) << "Disconnecting client on fd " << client->fd; + close(client_fd); + ARROW_LOG(INFO) << "Disconnecting client on fd " << client_fd; // If this client was using any objects, remove it from the appropriate // lists. for (const auto& entry : store_info_.objects) { - remove_client_from_object_clients(entry.second.get(), client); + remove_client_from_object_clients(entry.second.get(), it->second.get()); } + // Note, the store may still attempt to send a message to the disconnected // client (for example, when an object ID that the client was waiting for // is ready). In these cases, the attempt to send the message will fail, but // the store should just ignore the failure. - delete client; + connected_clients_.erase(it); } /// Send notifications about sealed objects to the subscribers. This is called @@ -464,8 +482,9 @@ void PlasmaStore::send_notifications(int client_fd) { // at the end of the method. // TODO(pcm): Introduce status codes and check in case the file descriptor // is added twice. - loop_->add_file_event(client_fd, kEventLoopWrite, - [this, client_fd](int events) { send_notifications(client_fd); }); + loop_->AddFileEvent(client_fd, kEventLoopWrite, [this, client_fd](int events) { + send_notifications(client_fd); + }); break; } else { ARROW_LOG(WARNING) << "Failed to send notification to client on fd " << client_fd; @@ -480,7 +499,8 @@ void PlasmaStore::send_notifications(int client_fd) { delete[] notification; } // Remove the sent notifications from the array. 
- it->second.object_notifications.erase(it->second.object_notifications.begin(), + it->second.object_notifications.erase( + it->second.object_notifications.begin(), it->second.object_notifications.begin() + num_processed); // Stop sending notifications if the pipe was broken. @@ -490,7 +510,9 @@ void PlasmaStore::send_notifications(int client_fd) { } // If we have sent all notifications, remove the fd from the event loop. - if (it->second.object_notifications.empty()) { loop_->remove_file_event(client_fd); } + if (it->second.object_notifications.empty()) { + loop_->RemoveFileEvent(client_fd); + } } void PlasmaStore::push_notification(ObjectInfoT* object_info) { @@ -535,6 +557,7 @@ Status PlasmaStore::process_message(Client* client) { ARROW_CHECK(s.ok() || s.IsIOError()); uint8_t* input = input_buffer_.data(); + size_t input_size = input_buffer_.size(); ObjectID object_id; PlasmaObject object; // TODO(pcm): Get rid of the following. @@ -545,11 +568,12 @@ Status PlasmaStore::process_message(Client* client) { case MessageType_PlasmaCreateRequest: { int64_t data_size; int64_t metadata_size; - RETURN_NOT_OK(ReadCreateRequest(input, &object_id, &data_size, &metadata_size)); + RETURN_NOT_OK( + ReadCreateRequest(input, input_size, &object_id, &data_size, &metadata_size)); int error_code = create_object(object_id, data_size, metadata_size, client, &object); - HANDLE_SIGPIPE( - SendCreateReply(client->fd, object_id, &object, error_code), client->fd); + HANDLE_SIGPIPE(SendCreateReply(client->fd, object_id, &object, error_code), + client->fd); if (error_code == PlasmaError_OK) { warn_if_sigpipe(send_fd(client->fd, object.handle.store_fd), client->fd); } @@ -557,15 +581,15 @@ Status PlasmaStore::process_message(Client* client) { case MessageType_PlasmaGetRequest: { std::vector object_ids_to_get; int64_t timeout_ms; - RETURN_NOT_OK(ReadGetRequest(input, object_ids_to_get, &timeout_ms)); + RETURN_NOT_OK(ReadGetRequest(input, input_size, object_ids_to_get, &timeout_ms)); process_get_request(client, object_ids_to_get, timeout_ms); } break; case MessageType_PlasmaReleaseRequest: - RETURN_NOT_OK(ReadReleaseRequest(input, &object_id)); + RETURN_NOT_OK(ReadReleaseRequest(input, input_size, &object_id)); release_object(object_id, client); break; case MessageType_PlasmaContainsRequest: - RETURN_NOT_OK(ReadContainsRequest(input, &object_id)); + RETURN_NOT_OK(ReadContainsRequest(input, input_size, &object_id)); if (contains_object(object_id) == OBJECT_FOUND) { HANDLE_SIGPIPE(SendContainsReply(client->fd, object_id, 1), client->fd); } else { @@ -574,13 +598,13 @@ Status PlasmaStore::process_message(Client* client) { break; case MessageType_PlasmaSealRequest: { unsigned char digest[kDigestSize]; - RETURN_NOT_OK(ReadSealRequest(input, &object_id, &digest[0])); + RETURN_NOT_OK(ReadSealRequest(input, input_size, &object_id, &digest[0])); seal_object(object_id, &digest[0]); } break; case MessageType_PlasmaEvictRequest: { // This code path should only be used for testing. 
int64_t num_bytes; - RETURN_NOT_OK(ReadEvictRequest(input, &num_bytes)); + RETURN_NOT_OK(ReadEvictRequest(input, input_size, &num_bytes)); std::vector objects_to_evict; int64_t num_bytes_evicted = eviction_policy_.choose_objects_to_evict(num_bytes, &objects_to_evict); @@ -591,12 +615,12 @@ Status PlasmaStore::process_message(Client* client) { subscribe_to_updates(client); break; case MessageType_PlasmaConnectRequest: { - HANDLE_SIGPIPE( - SendConnectReply(client->fd, store_info_.memory_capacity), client->fd); + HANDLE_SIGPIPE(SendConnectReply(client->fd, store_info_.memory_capacity), + client->fd); } break; case DISCONNECT_CLIENT: ARROW_LOG(DEBUG) << "Disconnecting client on fd " << client->fd; - disconnect_client(client); + disconnect_client(client->fd); break; default: // This code should be unreachable. @@ -605,28 +629,61 @@ Status PlasmaStore::process_message(Client* client) { return Status::OK(); } -// Report "success" to valgrind. -void signal_handler(int signal) { - if (signal == SIGTERM) { exit(0); } +class PlasmaStoreRunner { + public: + PlasmaStoreRunner() {} + + void Start(char* socket_name, int64_t system_memory) { + // Create the event loop. + loop_.reset(new EventLoop); + store_.reset(new PlasmaStore(loop_.get(), system_memory)); + int socket = bind_ipc_sock(socket_name, true); + // TODO(pcm): Check return value. + ARROW_CHECK(socket >= 0); + + loop_->AddFileEvent(socket, kEventLoopRead, [this, socket](int events) { + this->store_->connect_client(socket); + }); + loop_->Start(); + } + + void Shutdown() { + loop_->Stop(); + loop_ = nullptr; + store_ = nullptr; + } + + private: + std::unique_ptr loop_; + std::unique_ptr store_; +}; + +static PlasmaStoreRunner* g_runner = nullptr; + +void HandleSignal(int signal) { + if (signal == SIGTERM) { + if (g_runner != nullptr) { + g_runner->Shutdown(); + } + // Report "success" to valgrind. + exit(0); + } } void start_server(char* socket_name, int64_t system_memory) { // Ignore SIGPIPE signals. If we don't do this, then when we attempt to write // to a client that has already died, the store could die. signal(SIGPIPE, SIG_IGN); - // Create the event loop. - EventLoop loop; - PlasmaStore store(&loop, system_memory); - int socket = bind_ipc_sock(socket_name, true); - ARROW_CHECK(socket >= 0); - // TODO(pcm): Check return value. - loop.add_file_event(socket, kEventLoopRead, - [&store, socket](int events) { store.connect_client(socket); }); - loop.run(); + + PlasmaStoreRunner runner; + g_runner = &runner; + signal(SIGTERM, HandleSignal); + runner.Start(socket_name, system_memory); } +} // namespace plasma + int main(int argc, char* argv[]) { - signal(SIGTERM, signal_handler); char* socket_name = NULL; int64_t system_memory = -1; int c; @@ -677,7 +734,7 @@ int main(int argc, char* argv[]) { #endif // Make it so dlmalloc fails if we try to request more memory than is // available. - dlmalloc_set_footprint_limit((size_t)system_memory); + plasma::dlmalloc_set_footprint_limit((size_t)system_memory); ARROW_LOG(DEBUG) << "starting server listening on " << socket_name; - start_server(socket_name, system_memory); + plasma::start_server(socket_name, system_memory); } diff --git a/cpp/src/plasma/store.h b/cpp/src/plasma/store.h index 8bd94265410..fb732a1375d 100644 --- a/cpp/src/plasma/store.h +++ b/cpp/src/plasma/store.h @@ -27,6 +27,8 @@ #include "plasma/plasma.h" #include "plasma/protocol.h" +namespace plasma { + struct GetRequest; struct NotificationQueue { @@ -64,7 +66,7 @@ class PlasmaStore { /// cannot create the object. 
In this case, the client should not call /// plasma_release. int create_object(const ObjectID& object_id, int64_t data_size, int64_t metadata_size, - Client* client, PlasmaObject* result); + Client* client, PlasmaObject* result); /// Delete objects that have been created in the hash table. This should only /// be called on objects that are returned by the eviction policy to evict. @@ -85,8 +87,8 @@ class PlasmaStore { /// @param object_ids Object IDs of the objects to be gotten. /// @param timeout_ms The timeout for the get request in milliseconds. /// @return Void. - void process_get_request( - Client* client, const std::vector& object_ids, int64_t timeout_ms); + void process_get_request(Client* client, const std::vector& object_ids, + int64_t timeout_ms); /// Seal an object. The object is now immutable and can be accessed with get. /// @@ -125,9 +127,9 @@ class PlasmaStore { /// Disconnect a client from the PlasmaStore. /// - /// @param client The client that is disconnected. + /// @param client_fd The client file descriptor that is disconnected. /// @return Void. - void disconnect_client(Client* client); + void disconnect_client(int client_fd); void send_notifications(int client_fd); @@ -164,6 +166,10 @@ class PlasmaStore { /// TODO(pcm): Consider putting this into the Client data structure and /// reorganize the code slightly. std::unordered_map pending_notifications_; + + std::unordered_map> connected_clients_; }; +} // namespace plasma + #endif // PLASMA_STORE_H diff --git a/cpp/src/plasma/test/client_tests.cc b/cpp/src/plasma/test/client_tests.cc index 29b5b135144..02b38321451 100644 --- a/cpp/src/plasma/test/client_tests.cc +++ b/cpp/src/plasma/test/client_tests.cc @@ -29,7 +29,9 @@ #include "plasma/plasma.h" #include "plasma/protocol.h" -std::string g_test_executable; // NOLINT +namespace plasma { + +std::string test_executable; // NOLINT class TestPlasmaStore : public ::testing::Test { public: @@ -37,7 +39,7 @@ class TestPlasmaStore : public ::testing::Test { // stdout of the object store. Consider changing that. void SetUp() { std::string plasma_directory = - g_test_executable.substr(0, g_test_executable.find_last_of("/")); + test_executable.substr(0, test_executable.find_last_of("/")); std::string plasma_command = plasma_directory + "/plasma_store -m 1000000000 -s /tmp/store 1> /dev/null 2> /dev/null &"; @@ -125,8 +127,10 @@ TEST_F(TestPlasmaStore, MultipleGetTest) { ASSERT_EQ(object_buffer[1].data[0], 2); } +} // namespace plasma + int main(int argc, char** argv) { ::testing::InitGoogleTest(&argc, argv); - g_test_executable = std::string(argv[0]); + plasma::test_executable = std::string(argv[0]); return RUN_ALL_TESTS(); } diff --git a/cpp/src/plasma/test/serialization_tests.cc b/cpp/src/plasma/test/serialization_tests.cc index 325cead06e7..c76f5ce1092 100644 --- a/cpp/src/plasma/test/serialization_tests.cc +++ b/cpp/src/plasma/test/serialization_tests.cc @@ -25,6 +25,8 @@ #include "plasma/plasma.h" #include "plasma/protocol.h" +namespace plasma { + /** * Create a temporary file. Needs to be closed by the caller. 
* @@ -80,8 +82,8 @@ TEST(PlasmaSerialization, CreateRequest) { ObjectID object_id2; int64_t data_size2; int64_t metadata_size2; - ARROW_CHECK_OK( - ReadCreateRequest(data.data(), &object_id2, &data_size2, &metadata_size2)); + ARROW_CHECK_OK(ReadCreateRequest(data.data(), data.size(), &object_id2, &data_size2, + &metadata_size2)); ASSERT_EQ(data_size1, data_size2); ASSERT_EQ(metadata_size1, metadata_size2); ASSERT_EQ(object_id1, object_id2); @@ -97,7 +99,7 @@ TEST(PlasmaSerialization, CreateReply) { ObjectID object_id2; PlasmaObject object2; memset(&object2, 0, sizeof(object2)); - ARROW_CHECK_OK(ReadCreateReply(data.data(), &object_id2, &object2)); + ARROW_CHECK_OK(ReadCreateReply(data.data(), data.size(), &object_id2, &object2)); ASSERT_EQ(object_id1, object_id2); ASSERT_EQ(memcmp(&object1, &object2, sizeof(object1)), 0); close(fd); @@ -112,7 +114,7 @@ TEST(PlasmaSerialization, SealRequest) { std::vector data = read_message_from_file(fd, MessageType_PlasmaSealRequest); ObjectID object_id2; unsigned char digest2[kDigestSize]; - ARROW_CHECK_OK(ReadSealRequest(data.data(), &object_id2, &digest2[0])); + ARROW_CHECK_OK(ReadSealRequest(data.data(), data.size(), &object_id2, &digest2[0])); ASSERT_EQ(object_id1, object_id2); ASSERT_EQ(memcmp(&digest1[0], &digest2[0], kDigestSize), 0); close(fd); @@ -124,7 +126,7 @@ TEST(PlasmaSerialization, SealReply) { ARROW_CHECK_OK(SendSealReply(fd, object_id1, PlasmaError_ObjectExists)); std::vector data = read_message_from_file(fd, MessageType_PlasmaSealReply); ObjectID object_id2; - Status s = ReadSealReply(data.data(), &object_id2); + Status s = ReadSealReply(data.data(), data.size(), &object_id2); ASSERT_EQ(object_id1, object_id2); ASSERT_TRUE(s.IsPlasmaObjectExists()); close(fd); @@ -140,7 +142,8 @@ TEST(PlasmaSerialization, GetRequest) { std::vector data = read_message_from_file(fd, MessageType_PlasmaGetRequest); std::vector object_ids_return; int64_t timeout_ms_return; - ARROW_CHECK_OK(ReadGetRequest(data.data(), object_ids_return, &timeout_ms_return)); + ARROW_CHECK_OK( + ReadGetRequest(data.data(), data.size(), object_ids_return, &timeout_ms_return)); ASSERT_EQ(object_ids[0], object_ids_return[0]); ASSERT_EQ(object_ids[1], object_ids_return[1]); ASSERT_EQ(timeout_ms, timeout_ms_return); @@ -160,16 +163,16 @@ TEST(PlasmaSerialization, GetReply) { ObjectID object_ids_return[2]; PlasmaObject plasma_objects_return[2]; memset(&plasma_objects_return, 0, sizeof(plasma_objects_return)); - ARROW_CHECK_OK( - ReadGetReply(data.data(), object_ids_return, &plasma_objects_return[0], 2)); + ARROW_CHECK_OK(ReadGetReply(data.data(), data.size(), object_ids_return, + &plasma_objects_return[0], 2)); ASSERT_EQ(object_ids[0], object_ids_return[0]); ASSERT_EQ(object_ids[1], object_ids_return[1]); ASSERT_EQ(memcmp(&plasma_objects[object_ids[0]], &plasma_objects_return[0], - sizeof(PlasmaObject)), - 0); + sizeof(PlasmaObject)), + 0); ASSERT_EQ(memcmp(&plasma_objects[object_ids[1]], &plasma_objects_return[1], - sizeof(PlasmaObject)), - 0); + sizeof(PlasmaObject)), + 0); close(fd); } @@ -180,7 +183,7 @@ TEST(PlasmaSerialization, ReleaseRequest) { std::vector data = read_message_from_file(fd, MessageType_PlasmaReleaseRequest); ObjectID object_id2; - ARROW_CHECK_OK(ReadReleaseRequest(data.data(), &object_id2)); + ARROW_CHECK_OK(ReadReleaseRequest(data.data(), data.size(), &object_id2)); ASSERT_EQ(object_id1, object_id2); close(fd); } @@ -191,7 +194,7 @@ TEST(PlasmaSerialization, ReleaseReply) { ARROW_CHECK_OK(SendReleaseReply(fd, object_id1, PlasmaError_ObjectExists)); 
std::vector data = read_message_from_file(fd, MessageType_PlasmaReleaseReply); ObjectID object_id2; - Status s = ReadReleaseReply(data.data(), &object_id2); + Status s = ReadReleaseReply(data.data(), data.size(), &object_id2); ASSERT_EQ(object_id1, object_id2); ASSERT_TRUE(s.IsPlasmaObjectExists()); close(fd); @@ -203,7 +206,7 @@ TEST(PlasmaSerialization, DeleteRequest) { ARROW_CHECK_OK(SendDeleteRequest(fd, object_id1)); std::vector data = read_message_from_file(fd, MessageType_PlasmaDeleteRequest); ObjectID object_id2; - ARROW_CHECK_OK(ReadDeleteRequest(data.data(), &object_id2)); + ARROW_CHECK_OK(ReadDeleteRequest(data.data(), data.size(), &object_id2)); ASSERT_EQ(object_id1, object_id2); close(fd); } @@ -215,7 +218,7 @@ TEST(PlasmaSerialization, DeleteReply) { ARROW_CHECK_OK(SendDeleteReply(fd, object_id1, error1)); std::vector data = read_message_from_file(fd, MessageType_PlasmaDeleteReply); ObjectID object_id2; - Status s = ReadDeleteReply(data.data(), &object_id2); + Status s = ReadDeleteReply(data.data(), data.size(), &object_id2); ASSERT_EQ(object_id1, object_id2); ASSERT_TRUE(s.IsPlasmaObjectExists()); close(fd); @@ -230,7 +233,8 @@ TEST(PlasmaSerialization, StatusRequest) { ARROW_CHECK_OK(SendStatusRequest(fd, object_ids, num_objects)); std::vector data = read_message_from_file(fd, MessageType_PlasmaStatusRequest); ObjectID object_ids_read[num_objects]; - ARROW_CHECK_OK(ReadStatusRequest(data.data(), object_ids_read, num_objects)); + ARROW_CHECK_OK( + ReadStatusRequest(data.data(), data.size(), object_ids_read, num_objects)); ASSERT_EQ(object_ids[0], object_ids_read[0]); ASSERT_EQ(object_ids[1], object_ids_read[1]); close(fd); @@ -244,11 +248,11 @@ TEST(PlasmaSerialization, StatusReply) { int object_statuses[2] = {42, 43}; ARROW_CHECK_OK(SendStatusReply(fd, object_ids, object_statuses, 2)); std::vector data = read_message_from_file(fd, MessageType_PlasmaStatusReply); - int64_t num_objects = ReadStatusReply_num_objects(data.data()); + int64_t num_objects = ReadStatusReply_num_objects(data.data(), data.size()); ObjectID object_ids_read[num_objects]; int object_statuses_read[num_objects]; - ARROW_CHECK_OK( - ReadStatusReply(data.data(), object_ids_read, object_statuses_read, num_objects)); + ARROW_CHECK_OK(ReadStatusReply(data.data(), data.size(), object_ids_read, + object_statuses_read, num_objects)); ASSERT_EQ(object_ids[0], object_ids_read[0]); ASSERT_EQ(object_ids[1], object_ids_read[1]); ASSERT_EQ(object_statuses[0], object_statuses_read[0]); @@ -262,7 +266,7 @@ TEST(PlasmaSerialization, EvictRequest) { ARROW_CHECK_OK(SendEvictRequest(fd, num_bytes)); std::vector data = read_message_from_file(fd, MessageType_PlasmaEvictRequest); int64_t num_bytes_received; - ARROW_CHECK_OK(ReadEvictRequest(data.data(), &num_bytes_received)); + ARROW_CHECK_OK(ReadEvictRequest(data.data(), data.size(), &num_bytes_received)); ASSERT_EQ(num_bytes, num_bytes_received); close(fd); } @@ -273,7 +277,7 @@ TEST(PlasmaSerialization, EvictReply) { ARROW_CHECK_OK(SendEvictReply(fd, num_bytes)); std::vector data = read_message_from_file(fd, MessageType_PlasmaEvictReply); int64_t num_bytes_received; - ARROW_CHECK_OK(ReadEvictReply(data.data(), num_bytes_received)); + ARROW_CHECK_OK(ReadEvictReply(data.data(), data.size(), num_bytes_received)); ASSERT_EQ(num_bytes, num_bytes_received); close(fd); } @@ -286,7 +290,7 @@ TEST(PlasmaSerialization, FetchRequest) { ARROW_CHECK_OK(SendFetchRequest(fd, object_ids, 2)); std::vector data = read_message_from_file(fd, MessageType_PlasmaFetchRequest); std::vector 
object_ids_read; - ARROW_CHECK_OK(ReadFetchRequest(data.data(), object_ids_read)); + ARROW_CHECK_OK(ReadFetchRequest(data.data(), data.size(), object_ids_read)); ASSERT_EQ(object_ids[0], object_ids_read[0]); ASSERT_EQ(object_ids[1], object_ids_read[1]); close(fd); @@ -301,15 +305,15 @@ TEST(PlasmaSerialization, WaitRequest) { const int num_ready_objects_in = 1; int64_t timeout_ms = 1000; - ARROW_CHECK_OK(SendWaitRequest( - fd, &object_requests_in[0], num_objects_in, num_ready_objects_in, timeout_ms)); + ARROW_CHECK_OK(SendWaitRequest(fd, &object_requests_in[0], num_objects_in, + num_ready_objects_in, timeout_ms)); /* Read message back. */ std::vector data = read_message_from_file(fd, MessageType_PlasmaWaitRequest); int num_ready_objects_out; int64_t timeout_ms_read; ObjectRequestMap object_requests_out; - ARROW_CHECK_OK(ReadWaitRequest( - data.data(), object_requests_out, &timeout_ms_read, &num_ready_objects_out)); + ARROW_CHECK_OK(ReadWaitRequest(data.data(), data.size(), object_requests_out, + &timeout_ms_read, &num_ready_objects_out)); ASSERT_EQ(num_objects_in, object_requests_out.size()); ASSERT_EQ(num_ready_objects_out, num_ready_objects_in); for (int i = 0; i < num_objects_in; i++) { @@ -338,7 +342,8 @@ TEST(PlasmaSerialization, WaitReply) { std::vector data = read_message_from_file(fd, MessageType_PlasmaWaitReply); ObjectRequest objects_out[2]; int num_objects_out; - ARROW_CHECK_OK(ReadWaitReply(data.data(), &objects_out[0], &num_objects_out)); + ARROW_CHECK_OK( + ReadWaitReply(data.data(), data.size(), &objects_out[0], &num_objects_out)); ASSERT_EQ(num_objects_in, num_objects_out); for (int i = 0; i < num_objects_out; i++) { /* Each object request must appear exactly once. */ @@ -362,7 +367,8 @@ TEST(PlasmaSerialization, DataRequest) { ObjectID object_id2; char* address2; int port2; - ARROW_CHECK_OK(ReadDataRequest(data.data(), &object_id2, &address2, &port2)); + ARROW_CHECK_OK( + ReadDataRequest(data.data(), data.size(), &object_id2, &address2, &port2)); ASSERT_EQ(object_id1, object_id2); ASSERT_EQ(strcmp(address1, address2), 0); ASSERT_EQ(port1, port2); @@ -381,8 +387,11 @@ TEST(PlasmaSerialization, DataReply) { ObjectID object_id2; int64_t object_size2; int64_t metadata_size2; - ARROW_CHECK_OK(ReadDataReply(data.data(), &object_id2, &object_size2, &metadata_size2)); + ARROW_CHECK_OK(ReadDataReply(data.data(), data.size(), &object_id2, &object_size2, + &metadata_size2)); ASSERT_EQ(object_id1, object_id2); ASSERT_EQ(object_size1, object_size2); ASSERT_EQ(metadata_size1, metadata_size2); } + +} // namespace plasma diff --git a/cpp/src/plasma/thirdparty/dlmalloc.c b/cpp/src/plasma/thirdparty/dlmalloc.c index 84ccbd28fc4..7f3fd639649 100644 --- a/cpp/src/plasma/thirdparty/dlmalloc.c +++ b/cpp/src/plasma/thirdparty/dlmalloc.c @@ -521,6 +521,7 @@ MAX_RELEASE_CHECK_RATE default: 4095 unless not HAVE_MMAP improvement at the expense of carrying around more memory. 
*/ + /* Version identifier to allow people to support multiple versions */ #ifndef DLMALLOC_VERSION #define DLMALLOC_VERSION 20806 @@ -584,9 +585,21 @@ MAX_RELEASE_CHECK_RATE default: 4095 unless not HAVE_MMAP /* The maximum possible size_t value has all bits set */ #define MAX_SIZE_T (~(size_t)0) +#if (defined(USE_RECURSIVE_LOCKS) && USE_RECURSIVE_LOCKS != 0) +#define RECURSIVE_LOCKS_ENABLED 1 +#else +#define RECURSIVE_LOCKS_ENABLED 0 +#endif + +#if (defined(USE_RECURSIVE_LOCKS) && USE_RECURSIVE_LOCKS != 0) +#define SPIN_LOCKS_ENABLED 1 +#else +#define SPIN_LOCKS_ENABLED 0 +#endif + #ifndef USE_LOCKS /* ensure true if spin or recursive locks set */ -#define USE_LOCKS ((defined(USE_SPIN_LOCKS) && USE_SPIN_LOCKS != 0) || \ - (defined(USE_RECURSIVE_LOCKS) && USE_RECURSIVE_LOCKS != 0)) +#define USE_LOCKS ((SPIN_LOCKS_ENABLED != 0) || \ + (RECURSIVE_LOCKS_ENABLED != 0)) #endif /* USE_LOCKS */ #if USE_LOCKS /* Spin locks for gcc >= 4.1, older gcc on x86, MSC >= 1310 */ @@ -645,7 +658,9 @@ MAX_RELEASE_CHECK_RATE default: 4095 unless not HAVE_MMAP #ifndef HAVE_MREMAP #ifdef linux #define HAVE_MREMAP 1 +#ifndef _GNU_SOURCE #define _GNU_SOURCE /* Turns on mremap() definition */ +#endif /* _GNU_SOURCE */ #else /* linux */ #define HAVE_MREMAP 0 #endif /* linux */ diff --git a/dev/make_changelog.py b/dev/make_changelog.py index 47127903b7b..b4b0070df8e 100644 --- a/dev/make_changelog.py +++ b/dev/make_changelog.py @@ -74,6 +74,7 @@ def format_changelog_website(issues, out): CATEGORIES = { 'New Feature': NEW_FEATURE, 'Improvement': NEW_FEATURE, + 'Wish': NEW_FEATURE, 'Task': NEW_FEATURE, 'Test': NEW_FEATURE, 'Bug': BUGFIX diff --git a/format/Guidelines.md b/format/Guidelines.md index c75da9f98be..ff3a63d9a2f 100644 --- a/format/Guidelines.md +++ b/format/Guidelines.md @@ -1,15 +1,20 @@ # Implementation guidelines diff --git a/format/IPC.md b/format/IPC.md index 7d689216d55..3fd234e4aa1 100644 --- a/format/IPC.md +++ b/format/IPC.md @@ -1,15 +1,20 @@ # Interprocess messaging / communication (IPC) diff --git a/format/Layout.md b/format/Layout.md index 1e817ff1375..b62b1565a75 100644 --- a/format/Layout.md +++ b/format/Layout.md @@ -1,15 +1,20 @@ # Arrow: Physical memory layout diff --git a/format/Metadata.md b/format/Metadata.md index 18fac527470..80ca08ae13f 100644 --- a/format/Metadata.md +++ b/format/Metadata.md @@ -1,15 +1,20 @@ # Metadata: Logical types, schemas, data headers diff --git a/format/README.md b/format/README.md index 3aa8fdd6d4d..c87ac2a00d6 100644 --- a/format/README.md +++ b/format/README.md @@ -1,15 +1,20 @@ ## Arrow specification documents diff --git a/format/Schema.fbs b/format/Schema.fbs index a7e802b9dcb..186f8e362bd 100644 --- a/format/Schema.fbs +++ b/format/Schema.fbs @@ -44,6 +44,35 @@ table FixedSizeList { listSize: int; } +/// A Map is a logical nested type that is represented as +/// +/// List> +/// +/// In this layout, the keys and values are each respectively contiguous. We do +/// not constrain the key and value types, so the application is responsible +/// for ensuring that the keys are hashable and unique. Whether the keys are sorted +/// may be set in the metadata for this field +/// +/// In a Field with Map type, the Field has a child Struct field, which then +/// has two children: key type and the second the value type. 
The names of the +/// child fields may be respectively "entry", "key", and "value", but this is +/// not enforced +/// +/// Map +/// - child[0] entry: Struct +/// - child[0] key: K +/// - child[1] value: V +/// +/// Neither the "entry" field nor the "key" field may be nullable. +/// +/// The metadata is structured so that Arrow systems without special handling +/// for Map can make Map an alias for List. The "layout" attribute for the Map +/// field must have the same contents as a List. +table Map { + /// Set to true if the keys within each value are sorted + keysSorted: bool; +} + enum UnionMode:short { Sparse, Dense } /// A union is a complex type with children in Field @@ -170,7 +199,8 @@ union Type { Struct_, Union, FixedSizeBinary, - FixedSizeList + FixedSizeList, + Map } /// ---------------------------------------------------------------------- diff --git a/integration/README.md b/integration/README.md index 6005b62c41c..5b6ea45ff73 100644 --- a/integration/README.md +++ b/integration/README.md @@ -1,15 +1,20 @@ # Arrow integration testing diff --git a/integration/integration_test.py b/integration/integration_test.py index 215ba58232a..b7f1609935e 100644 --- a/integration/integration_test.py +++ b/integration/integration_test.py @@ -945,7 +945,7 @@ def get_static_json_files(): def run_all_tests(debug=False): - testers = [CPPTester(debug=debug)] # , JavaTester(debug=debug)] + testers = [CPPTester(debug=debug), JavaTester(debug=debug)] static_json_files = get_static_json_files() generated_json_files = get_generated_json_files() json_files = static_json_files + generated_json_files diff --git a/java/README.md b/java/README.md index a57e35afbbd..dd4f9245156 100644 --- a/java/README.md +++ b/java/README.md @@ -1,15 +1,20 @@ # Arrow Java @@ -25,4 +30,4 @@ install: ``` cd java mvn install -``` +``` diff --git a/java/memory/src/main/java/io/netty/buffer/ArrowBuf.java b/java/memory/src/main/java/io/netty/buffer/ArrowBuf.java index 0328a167190..09886a6ffe3 100644 --- a/java/memory/src/main/java/io/netty/buffer/ArrowBuf.java +++ b/java/memory/src/main/java/io/netty/buffer/ArrowBuf.java @@ -208,7 +208,7 @@ public ArrowBuf retain(BufferAllocator target) { * that carries an association with the underlying memory of this ArrowBuf. If this ArrowBuf is * connected to the * owning BufferLedger of this memory, that memory ownership/accounting will be transferred to - * the taret allocator. If + * the target allocator. If * this ArrowBuf does not currently own the memory underlying it (and is only associated with * it), this does not * transfer any ownership to the newly created ArrowBuf. 
diff --git a/java/memory/src/main/java/org/apache/arrow/memory/BaseAllocator.java b/java/memory/src/main/java/org/apache/arrow/memory/BaseAllocator.java index ddc78f03f0f..be0ba77f5b2 100644 --- a/java/memory/src/main/java/org/apache/arrow/memory/BaseAllocator.java +++ b/java/memory/src/main/java/org/apache/arrow/memory/BaseAllocator.java @@ -640,7 +640,7 @@ private void dumpBuffers(final StringBuilder sb, final Set ledgerS continue; } final UnsafeDirectLittleEndian udle = ledger.getUnderlying(); - sb.append("UnsafeDirectLittleEndian[dentityHashCode == "); + sb.append("UnsafeDirectLittleEndian[identityHashCode == "); sb.append(Integer.toString(System.identityHashCode(udle))); sb.append("] size "); sb.append(Integer.toString(udle.capacity())); diff --git a/java/pom.xml b/java/pom.xml index 2613a441045..de2113e397e 100644 --- a/java/pom.xml +++ b/java/pom.xml @@ -30,7 +30,7 @@ ${project.basedir}/target/generated-sources 4.11 - 1.7.6 + 1.7.25 18.0 2 2.7.1 @@ -93,7 +93,7 @@ - false + false **/*.log **/*.css @@ -231,7 +231,7 @@ pl.project13.maven git-commit-id-plugin - 2.1.9 + 2.2.2 for-jars @@ -520,7 +520,7 @@ ch.qos.logback logback-classic - 1.0.13 + 1.2.3 test diff --git a/java/vector/src/main/codegen/templates/NullableValueVectors.java b/java/vector/src/main/codegen/templates/NullableValueVectors.java index 092097bb2bd..624ba9d2cec 100644 --- a/java/vector/src/main/codegen/templates/NullableValueVectors.java +++ b/java/vector/src/main/codegen/templates/NullableValueVectors.java @@ -44,9 +44,10 @@ * NB: this class is automatically generated from ${.template_name} and ValueVectorTypes.tdd using FreeMarker. */ @SuppressWarnings("unused") -public final class ${className} extends BaseDataValueVector implements <#if type.major == "VarLen">VariableWidth<#else>FixedWidthVector, NullableVector, FieldVector { +public final class ${className} extends BaseValueVector implements <#if type.major == "VarLen">VariableWidth<#else>FixedWidthVector, NullableVector, FieldVector { private static final org.slf4j.Logger logger = org.slf4j.LoggerFactory.getLogger(${className}.class); +protected final static byte[] emptyByteArray = new byte[]{}; private final FieldReader reader = new ${minor.class}ReaderImpl(${className}.this); private final String bitsField = "$bits$"; @@ -217,7 +218,6 @@ public int getBufferSizeFor(final int valueCount) { + bits.getBufferSizeFor(valueCount); } - @Override public ArrowBuf getBuffer() { return values.getBuffer(); } @@ -286,7 +286,6 @@ public void reset() { bits.zeroVector(); mutator.reset(); accessor.reset(); - super.reset(); } @Override @@ -314,12 +313,10 @@ public void allocateNew(int valueCount) { accessor.reset(); } - @Override public void reset() { bits.zeroVector(); mutator.reset(); accessor.reset(); - super.reset(); } /** @@ -332,6 +329,11 @@ public void zeroVector() { } + @Override + public TransferPair getTransferPair(String ref, BufferAllocator allocator, CallBack callBack) { + return getTransferPair(ref, allocator); + } + @Override public TransferPair getTransferPair(BufferAllocator allocator){ return new TransferImpl(name, allocator); @@ -540,7 +542,7 @@ public void set(int index, <#if type.major == "VarLen">byte[]<#elseif (type.widt <#if type.major == "VarLen"> - private void fillEmpties(int index){ + public void fillEmpties(int index){ final ${valuesName}.Mutator valuesMutator = values.getMutator(); for (int i = lastSet + 1; i < index; i++) { valuesMutator.setSafe(i, emptyByteArray); @@ -699,6 +701,22 @@ public void reset(){ setCount = 0; <#if type.major = 
"VarLen">lastSet = -1; } + + public void setLastSet(int value) { + <#if type.major = "VarLen"> + lastSet = value; + <#else> + throw new UnsupportedOperationException(); + + } + + public int getLastSet() { + <#if type.major != "VarLen"> + throw new UnsupportedOperationException(); + <#else> + return lastSet; + + } } } diff --git a/java/vector/src/main/codegen/templates/UnionVector.java b/java/vector/src/main/codegen/templates/UnionVector.java index aa9d34d6e26..eabe42a7c4c 100644 --- a/java/vector/src/main/codegen/templates/UnionVector.java +++ b/java/vector/src/main/codegen/templates/UnionVector.java @@ -321,10 +321,8 @@ public void transfer() { @Override public void splitAndTransfer(int startIndex, int length) { - to.allocateNew(); - for (int i = 0; i < length; i++) { - to.copyFromSafe(startIndex + i, i, org.apache.arrow.vector.complex.UnionVector.this); - } + internalMapVectorTransferPair.splitAndTransfer(startIndex, length); + typeVectorTransferPair.splitAndTransfer(startIndex, length); to.getMutator().setValueCount(length); } diff --git a/java/vector/src/main/java/org/apache/arrow/vector/BaseDataValueVector.java b/java/vector/src/main/java/org/apache/arrow/vector/BaseDataValueVector.java index 6d7d3f04a6d..0fea719da88 100644 --- a/java/vector/src/main/java/org/apache/arrow/vector/BaseDataValueVector.java +++ b/java/vector/src/main/java/org/apache/arrow/vector/BaseDataValueVector.java @@ -30,8 +30,6 @@ public abstract class BaseDataValueVector extends BaseValueVector implements BufferBacked { - protected final static byte[] emptyByteArray = new byte[]{}; // Nullable vectors use this - public static void load(ArrowFieldNode fieldNode, List vectors, List buffers) { int expectedSize = vectors.size(); if (buffers.size() != expectedSize) { diff --git a/java/vector/src/main/java/org/apache/arrow/vector/BitVector.java b/java/vector/src/main/java/org/apache/arrow/vector/BitVector.java index 82cbd47d758..f34ef2c2a22 100644 --- a/java/vector/src/main/java/org/apache/arrow/vector/BitVector.java +++ b/java/vector/src/main/java/org/apache/arrow/vector/BitVector.java @@ -261,32 +261,34 @@ public void splitAndTransferTo(int startIndex, int length, BitVector target) { int firstByte = getByteIndex(startIndex); int byteSize = getSizeFromCount(length); int offset = startIndex % 8; - if (offset == 0) { - target.clear(); - // slice - if (target.data != null) { - target.data.release(); - } - target.data = data.slice(firstByte, byteSize); - target.data.retain(1); - } else { - // Copy data - // When the first bit starts from the middle of a byte (offset != 0), copy data from src BitVector. - // Each byte in the target is composed by a part in i-th byte, another part in (i+1)-th byte. - // The last byte copied to target is a bit tricky : - // 1) if length requires partly byte (length % 8 !=0), copy the remaining bits only. - // 2) otherwise, copy the last byte in the same way as to the prior bytes. - target.clear(); - target.allocateNew(length); - // TODO maybe do this one word at a time, rather than byte? 
- for(int i = 0; i < byteSize - 1; i++) { - target.data.setByte(i, (((this.data.getByte(firstByte + i) & 0xFF) >>> offset) + (this.data.getByte(firstByte + i + 1) << (8 - offset)))); - } - if (length % 8 != 0) { - target.data.setByte(byteSize - 1, ((this.data.getByte(firstByte + byteSize - 1) & 0xFF) >>> offset)); + if (length > 0) { + if (offset == 0) { + target.clear(); + // slice + if (target.data != null) { + target.data.release(); + } + target.data = data.slice(firstByte, byteSize); + target.data.retain(1); } else { - target.data.setByte(byteSize - 1, - (((this.data.getByte(firstByte + byteSize - 1) & 0xFF) >>> offset) + (this.data.getByte(firstByte + byteSize) << (8 - offset)))); + // Copy data + // When the first bit starts from the middle of a byte (offset != 0), copy data from src BitVector. + // Each byte in the target is composed by a part in i-th byte, another part in (i+1)-th byte. + // The last byte copied to target is a bit tricky : + // 1) if length requires partly byte (length % 8 !=0), copy the remaining bits only. + // 2) otherwise, copy the last byte in the same way as to the prior bytes. + target.clear(); + target.allocateNew(length); + // TODO maybe do this one word at a time, rather than byte? + for (int i = 0; i < byteSize - 1; i++) { + target.data.setByte(i, (((this.data.getByte(firstByte + i) & 0xFF) >>> offset) + (this.data.getByte(firstByte + i + 1) << (8 - offset)))); + } + if (length % 8 != 0) { + target.data.setByte(byteSize - 1, ((this.data.getByte(firstByte + byteSize - 1) & 0xFF) >>> offset)); + } else { + target.data.setByte(byteSize - 1, + (((this.data.getByte(firstByte + byteSize - 1) & 0xFF) >>> offset) + (this.data.getByte(firstByte + byteSize) << (8 - offset)))); + } } } target.getMutator().setValueCount(length); diff --git a/java/vector/src/main/java/org/apache/arrow/vector/ValueVector.java b/java/vector/src/main/java/org/apache/arrow/vector/ValueVector.java index 2e83836b646..3812c0b2fc3 100644 --- a/java/vector/src/main/java/org/apache/arrow/vector/ValueVector.java +++ b/java/vector/src/main/java/org/apache/arrow/vector/ValueVector.java @@ -47,7 +47,7 @@ *
  • you should never write to a vector once it has been read.
  • * * - * Please note that the current implementation doesn't enfore those rules, hence we may find few places that + * Please note that the current implementation doesn't enforce those rules, hence we may find few places that * deviate from these rules (e.g. offset vectors in Variable Length and Repeated vector) * * This interface "should" strive to guarantee this order of operation: diff --git a/java/vector/src/main/java/org/apache/arrow/vector/complex/AbstractContainerVector.java b/java/vector/src/main/java/org/apache/arrow/vector/complex/AbstractContainerVector.java index 7f8e6796285..2aeeca25f0e 100644 --- a/java/vector/src/main/java/org/apache/arrow/vector/complex/AbstractContainerVector.java +++ b/java/vector/src/main/java/org/apache/arrow/vector/complex/AbstractContainerVector.java @@ -79,7 +79,7 @@ protected T typeify(ValueVector v, Class clazz) { if (clazz.isAssignableFrom(v.getClass())) { return clazz.cast(v); } - throw new IllegalStateException(String.format("Vector requested [%s] was different than type stored [%s]. Arrow doesn't yet support hetergenous types.", clazz.getSimpleName(), v.getClass().getSimpleName())); + throw new IllegalStateException(String.format("Vector requested [%s] was different than type stored [%s]. Arrow doesn't yet support heterogenous types.", clazz.getSimpleName(), v.getClass().getSimpleName())); } protected boolean supportsDirectRead() { diff --git a/java/vector/src/main/java/org/apache/arrow/vector/complex/ListVector.java b/java/vector/src/main/java/org/apache/arrow/vector/complex/ListVector.java index 4ab624f3694..fdeac397165 100644 --- a/java/vector/src/main/java/org/apache/arrow/vector/complex/ListVector.java +++ b/java/vector/src/main/java/org/apache/arrow/vector/complex/ListVector.java @@ -38,6 +38,7 @@ import org.apache.arrow.vector.FieldVector; import org.apache.arrow.vector.UInt4Vector; import org.apache.arrow.vector.ValueVector; +import org.apache.arrow.vector.VarCharVector; import org.apache.arrow.vector.ZeroVector; import org.apache.arrow.vector.complex.impl.ComplexCopier; import org.apache.arrow.vector.complex.impl.UnionListReader; @@ -179,7 +180,11 @@ public TransferPair makeTransferPair(ValueVector target) { private class TransferImpl implements TransferPair { ListVector to; - TransferPair pairs[] = new TransferPair[3]; + TransferPair bitsTransferPair; + TransferPair offsetsTransferPair; + TransferPair dataTransferPair; + + TransferPair[] pairs; public TransferImpl(String name, BufferAllocator allocator, CallBack callBack) { this(new ListVector(name, allocator, fieldType, callBack)); @@ -188,12 +193,13 @@ public TransferImpl(String name, BufferAllocator allocator, CallBack callBack) { public TransferImpl(ListVector to) { this.to = to; to.addOrGetVector(vector.getField().getFieldType()); - pairs[0] = offsets.makeTransferPair(to.offsets); - pairs[1] = bits.makeTransferPair(to.bits); + offsetsTransferPair = offsets.makeTransferPair(to.offsets); + bitsTransferPair = bits.makeTransferPair(to.bits); if (to.getDataVector() instanceof ZeroVector) { to.addOrGetVector(vector.getField().getFieldType()); } - pairs[2] = getDataVector().makeTransferPair(to.getDataVector()); + dataTransferPair = getDataVector().makeTransferPair(to.getDataVector()); + pairs = new TransferPair[] { bitsTransferPair, offsetsTransferPair, dataTransferPair }; } @Override @@ -206,10 +212,20 @@ public void transfer() { @Override public void splitAndTransfer(int startIndex, int length) { - to.allocateNew(); - for (int i = 0; i < length; i++) { - copyValueSafe(startIndex + 
i, i); + UInt4Vector.Accessor offsetVectorAccessor = ListVector.this.offsets.getAccessor(); + final int startPoint = offsetVectorAccessor.get(startIndex); + final int sliceLength = offsetVectorAccessor.get(startIndex + length) - startPoint; + to.clear(); + to.offsets.allocateNew(length + 1); + offsetVectorAccessor = ListVector.this.offsets.getAccessor(); + final UInt4Vector.Mutator targetOffsetVectorMutator = to.offsets.getMutator(); + for (int i = 0; i < length + 1; i++) { + targetOffsetVectorMutator.set(i, offsetVectorAccessor.get(startIndex + i) - startPoint); } + bitsTransferPair.splitAndTransfer(startIndex, length); + dataTransferPair.splitAndTransfer(startPoint, sliceLength); + to.lastSet = length; + to.mutator.setValueCount(length); } @Override @@ -393,6 +409,12 @@ public void setValueCount(int valueCount) { vector.getMutator().setValueCount(childValueCount); bits.getMutator().setValueCount(valueCount); } + + public void setLastSet(int value) { + lastSet = value; + } + + public int getLastSet() { return lastSet; } } } diff --git a/java/vector/src/main/java/org/apache/arrow/vector/complex/StateTool.java b/java/vector/src/main/java/org/apache/arrow/vector/complex/StateTool.java index 852c72c5497..05a79d24295 100644 --- a/java/vector/src/main/java/org/apache/arrow/vector/complex/StateTool.java +++ b/java/vector/src/main/java/org/apache/arrow/vector/complex/StateTool.java @@ -28,7 +28,7 @@ public static > void check(T currentState, T... expectedStates return; } } - throw new IllegalArgumentException(String.format("Expected to be in one of these states %s but was actuall in state %s", Arrays.toString(expectedStates), currentState)); + throw new IllegalArgumentException(String.format("Expected to be in one of these states %s but was actually in state %s", Arrays.toString(expectedStates), currentState)); } } diff --git a/java/vector/src/test/java/org/apache/arrow/vector/TestBitVector.java b/java/vector/src/test/java/org/apache/arrow/vector/TestBitVector.java index f2343c88e70..194b78585fa 100644 --- a/java/vector/src/test/java/org/apache/arrow/vector/TestBitVector.java +++ b/java/vector/src/test/java/org/apache/arrow/vector/TestBitVector.java @@ -20,6 +20,8 @@ import static org.junit.Assert.assertEquals; import org.apache.arrow.memory.BufferAllocator; +import org.apache.arrow.memory.RootAllocator; +import org.apache.arrow.vector.util.TransferPair; import org.junit.After; import org.junit.Before; import org.junit.Test; @@ -31,7 +33,7 @@ public class TestBitVector { @Before public void init() { - allocator = new DirtyRootAllocator(Long.MAX_VALUE, (byte) 100); + allocator = new RootAllocator(Long.MAX_VALUE); } @After @@ -63,4 +65,80 @@ public void testBitVectorCopyFromSafe() { } } + @Test + public void testSplitAndTransfer() throws Exception { + + try (final BitVector sourceVector = new BitVector("bitvector", allocator)) { + final BitVector.Mutator sourceMutator = sourceVector.getMutator(); + final BitVector.Accessor sourceAccessor = sourceVector.getAccessor(); + + sourceVector.allocateNew(40); + + /* populate the bitvector -- 010101010101010101010101..... 
*/ + for(int i = 0; i < 40; i++) { + if((i & 1) == 1) { + sourceMutator.set(i, 1); + } + else { + sourceMutator.set(i, 0); + } + } + + sourceMutator.setValueCount(40); + + /* check the vector output */ + for(int i = 0; i < 40; i++) { + int result = sourceAccessor.get(i); + if((i & 1) == 1) { + assertEquals(Integer.toString(1), Integer.toString(result)); + } + else { + assertEquals(Integer.toString(0), Integer.toString(result)); + } + } + + final TransferPair transferPair = sourceVector.getTransferPair(allocator); + final BitVector toVector = (BitVector)transferPair.getTo(); + final BitVector.Accessor toAccessor = toVector.getAccessor(); + final BitVector.Mutator toMutator = toVector.getMutator(); + + /* + * form test cases such that we cover: + * + * (1) the start index is exactly where a particular byte starts in the source bit vector + * (2) the start index is randomly positioned within a byte in the source bit vector + * (2.1) the length is a multiple of 8 + * (2.2) the length is not a multiple of 8 + */ + final int[][] transferLengths = { {0, 8}, /* (1) */ + {8, 10}, /* (1) */ + {18, 0}, /* zero length scenario */ + {18, 8}, /* (2.1) */ + {26, 0}, /* zero length scenario */ + {26, 14} /* (2.2) */ + }; + + for (final int[] transferLength : transferLengths) { + final int start = transferLength[0]; + final int length = transferLength[1]; + + transferPair.splitAndTransfer(start, length); + + /* check the toVector output after doing splitAndTransfer */ + for (int i = 0; i < length; i++) { + int result = toAccessor.get(i); + if((i & 1) == 1) { + assertEquals(Integer.toString(1), Integer.toString(result)); + } + else { + assertEquals(Integer.toString(0), Integer.toString(result)); + } + } + + toVector.clear(); + } + + sourceVector.close(); + } + } } diff --git a/java/vector/src/test/java/org/apache/arrow/vector/TestListVector.java b/java/vector/src/test/java/org/apache/arrow/vector/TestListVector.java index 11be3298f75..a1762c466ce 100644 --- a/java/vector/src/test/java/org/apache/arrow/vector/TestListVector.java +++ b/java/vector/src/test/java/org/apache/arrow/vector/TestListVector.java @@ -17,15 +17,29 @@ */ package org.apache.arrow.vector; +import static org.junit.Assert.assertEquals; +import static org.junit.Assert.assertNull; +import static org.junit.Assert.assertTrue; +import static org.junit.Assert.assertFalse; + import org.apache.arrow.memory.BufferAllocator; import org.apache.arrow.vector.complex.ListVector; import org.apache.arrow.vector.complex.impl.UnionListWriter; +import org.apache.arrow.vector.complex.impl.UnionListReader; import org.apache.arrow.vector.complex.reader.FieldReader; +import org.apache.arrow.vector.complex.writer.FieldWriter; +import org.apache.arrow.vector.holders.NullableBigIntHolder; +import org.apache.arrow.vector.types.Types; +import org.apache.arrow.vector.types.Types.*; +import org.apache.arrow.vector.types.pojo.FieldType; +import org.apache.arrow.vector.util.TransferPair; import org.junit.After; import org.junit.Assert; import org.junit.Before; import org.junit.Test; +import java.util.List; + public class TestListVector { private BufferAllocator allocator; @@ -80,4 +94,343 @@ public void testCopyFrom() throws Exception { Assert.assertTrue("shouldn't be null", reader.isSet()); } } + + @Test + public void testSetLastSetUsage() throws Exception { + try (ListVector listVector = ListVector.empty("input", allocator)) { + + /* Explicitly add the dataVector */ + MinorType type = MinorType.BIGINT; + listVector.addOrGetVector(FieldType.nullable(type.getType())); + + 
/* allocate memory */ + listVector.allocateNew(); + + /* get inner vectors; bitVector and offsetVector */ + List innerVectors = listVector.getFieldInnerVectors(); + BitVector bitVector = (BitVector)innerVectors.get(0); + UInt4Vector offsetVector = (UInt4Vector)innerVectors.get(1); + + /* get the underlying data vector -- NullableBigIntVector */ + NullableBigIntVector dataVector = (NullableBigIntVector)listVector.getDataVector(); + + /* check current lastSet */ + assertEquals(Integer.toString(0), Integer.toString(listVector.getMutator().getLastSet())); + + int index = 0; + int offset = 0; + + /* write [10, 11, 12] to the list vector at index */ + bitVector.getMutator().setSafe(index, 1); + dataVector.getMutator().setSafe(0, 1, 10); + dataVector.getMutator().setSafe(1, 1, 11); + dataVector.getMutator().setSafe(2, 1, 12); + offsetVector.getMutator().setSafe(index + 1, 3); + + index += 1; + + /* write [13, 14] to the list vector at index 1 */ + bitVector.getMutator().setSafe(index, 1); + dataVector.getMutator().setSafe(3, 1, 13); + dataVector.getMutator().setSafe(4, 1, 14); + offsetVector.getMutator().setSafe(index + 1, 5); + + index += 1; + + /* write [15, 16, 17] to the list vector at index 2 */ + bitVector.getMutator().setSafe(index, 1); + dataVector.getMutator().setSafe(5, 1, 15); + dataVector.getMutator().setSafe(6, 1, 16); + dataVector.getMutator().setSafe(7, 1, 17); + offsetVector.getMutator().setSafe(index + 1, 8); + + /* check current lastSet */ + assertEquals(Integer.toString(0), Integer.toString(listVector.getMutator().getLastSet())); + + /* set lastset and arbitrary valuecount for list vector. + * + * NOTE: if we don't execute setLastSet() before setLastValueCount(), then + * the latter will corrupt the offsetVector and thus the accessor will not + * retrieve the correct values from underlying dataVector. Run the test + * by commenting out next line and we should see failures from 5th assert + * onwards. This is why doing setLastSet() is important before setValueCount() + * once the vector has been loaded. + * + * Another important thing to remember is the value of lastSet itself. + * Even though the listVector has elements till index 2 only, the lastSet should + * be set as 3. This is because the offsetVector has valid offsets filled till index 3. + * If we do setLastSet(2), the offsetVector at index 3 will contain incorrect value + * after execution of setValueCount(). + * + * correct state of the listVector + * bitvector {1, 1, 1, 0, 0.... } + * offsetvector {0, 3, 5, 8, 8, 8.....} + * datavector { [10, 11, 12], + * [13, 14], + * [15, 16, 17] + * } + * + * if we don't do setLastSet() before setValueCount --> incorrect state + * bitvector {1, 1, 1, 0, 0.... } + * offsetvector {0, 0, 0, 0, 0, 0.....} + * datavector { [10, 11, 12], + * [13, 14], + * [15, 16, 17] + * } + * + * if we do setLastSet(2) before setValueCount --> incorrect state + * bitvector {1, 1, 1, 0, 0.... 
} + * offsetvector {0, 3, 5, 5, 5, 5.....} + * datavector { [10, 11, 12], + * [13, 14], + * [15, 16, 17] + * } + */ + listVector.getMutator().setLastSet(3); + listVector.getMutator().setValueCount(10); + + /* check the vector output */ + final UInt4Vector.Accessor offsetAccessor = offsetVector.getAccessor(); + final ValueVector.Accessor valueAccessor = dataVector.getAccessor(); + + index = 0; + offset = offsetAccessor.get(index); + assertEquals(Integer.toString(0), Integer.toString(offset)); + + Object actual = valueAccessor.getObject(offset); + assertEquals(new Long(10), (Long)actual); + offset++; + actual = valueAccessor.getObject(offset); + assertEquals(new Long(11), (Long)actual); + offset++; + actual = valueAccessor.getObject(offset); + assertEquals(new Long(12), (Long)actual); + + index++; + offset = offsetAccessor.get(index); + assertEquals(Integer.toString(3), Integer.toString(offset)); + + actual = valueAccessor.getObject(offset); + assertEquals(new Long(13), (Long)actual); + offset++; + actual = valueAccessor.getObject(offset); + assertEquals(new Long(14), (Long)actual); + + index++; + offset = offsetAccessor.get(index); + assertEquals(Integer.toString(5), Integer.toString(offset)); + + actual = valueAccessor.getObject(offsetAccessor.get(index)); + assertEquals(new Long(15), (Long)actual); + offset++; + actual = valueAccessor.getObject(offset); + assertEquals(new Long(16), (Long)actual); + offset++; + actual = valueAccessor.getObject(offset); + assertEquals(new Long(17), (Long)actual); + + index++; + offset = offsetAccessor.get(index); + assertEquals(Integer.toString(8), Integer.toString(offset)); + + actual = valueAccessor.getObject(offsetAccessor.get(index)); + assertNull(actual); + } + } + + @Test + public void testSplitAndTransfer() throws Exception { + try (ListVector listVector = ListVector.empty("sourceVector", allocator)) { + + /* Explicitly add the dataVector */ + MinorType type = MinorType.BIGINT; + listVector.addOrGetVector(FieldType.nullable(type.getType())); + + UnionListWriter listWriter = listVector.getWriter(); + + /* allocate memory */ + listWriter.allocate(); + + /* populate data */ + listWriter.setPosition(0); + listWriter.startList(); + listWriter.bigInt().writeBigInt(10); + listWriter.bigInt().writeBigInt(11); + listWriter.bigInt().writeBigInt(12); + listWriter.endList(); + + listWriter.setPosition(1); + listWriter.startList(); + listWriter.bigInt().writeBigInt(13); + listWriter.bigInt().writeBigInt(14); + listWriter.endList(); + + listWriter.setPosition(2); + listWriter.startList(); + listWriter.bigInt().writeBigInt(15); + listWriter.bigInt().writeBigInt(16); + listWriter.bigInt().writeBigInt(17); + listWriter.bigInt().writeBigInt(18); + listWriter.endList(); + + listWriter.setPosition(3); + listWriter.startList(); + listWriter.bigInt().writeBigInt(19); + listWriter.endList(); + + listWriter.setPosition(4); + listWriter.startList(); + listWriter.bigInt().writeBigInt(20); + listWriter.bigInt().writeBigInt(21); + listWriter.bigInt().writeBigInt(22); + listWriter.bigInt().writeBigInt(23); + listWriter.endList(); + + listVector.getMutator().setValueCount(5); + + assertEquals(5, listVector.getMutator().getLastSet()); + + /* get offsetVector */ + UInt4Vector offsetVector = (UInt4Vector)listVector.getOffsetVector(); + + /* get dataVector */ + NullableBigIntVector dataVector = (NullableBigIntVector)listVector.getDataVector(); + + /* check the vector output */ + final UInt4Vector.Accessor offsetAccessor = offsetVector.getAccessor(); + final ValueVector.Accessor 
valueAccessor = dataVector.getAccessor(); + + int index = 0; + int offset = 0; + Object actual = null; + + /* index 0 */ + assertFalse(listVector.getAccessor().isNull(index)); + offset = offsetAccessor.get(index); + assertEquals(Integer.toString(0), Integer.toString(offset)); + + actual = valueAccessor.getObject(offset); + assertEquals(new Long(10), (Long)actual); + offset++; + actual = valueAccessor.getObject(offset); + assertEquals(new Long(11), (Long)actual); + offset++; + actual = valueAccessor.getObject(offset); + assertEquals(new Long(12), (Long)actual); + + /* index 1 */ + index++; + assertFalse(listVector.getAccessor().isNull(index)); + offset = offsetAccessor.get(index); + assertEquals(Integer.toString(3), Integer.toString(offset)); + + actual = valueAccessor.getObject(offset); + assertEquals(new Long(13), (Long)actual); + offset++; + actual = valueAccessor.getObject(offset); + assertEquals(new Long(14), (Long)actual); + + /* index 2 */ + index++; + assertFalse(listVector.getAccessor().isNull(index)); + offset = offsetAccessor.get(index); + assertEquals(Integer.toString(5), Integer.toString(offset)); + + actual = valueAccessor.getObject(offset); + assertEquals(new Long(15), (Long)actual); + offset++; + actual = valueAccessor.getObject(offset); + assertEquals(new Long(16), (Long)actual); + offset++; + actual = valueAccessor.getObject(offset); + assertEquals(new Long(17), (Long)actual); + offset++; + actual = valueAccessor.getObject(offset); + assertEquals(new Long(18), (Long)actual); + + /* index 3 */ + index++; + assertFalse(listVector.getAccessor().isNull(index)); + offset = offsetAccessor.get(index); + assertEquals(Integer.toString(9), Integer.toString(offset)); + + actual = valueAccessor.getObject(offset); + assertEquals(new Long(19), (Long)actual); + + /* index 4 */ + index++; + assertFalse(listVector.getAccessor().isNull(index)); + offset = offsetAccessor.get(index); + assertEquals(Integer.toString(10), Integer.toString(offset)); + + actual = valueAccessor.getObject(offset); + assertEquals(new Long(20), (Long)actual); + offset++; + actual = valueAccessor.getObject(offset); + assertEquals(new Long(21), (Long)actual); + offset++; + actual = valueAccessor.getObject(offset); + assertEquals(new Long(22), (Long)actual); + offset++; + actual = valueAccessor.getObject(offset); + assertEquals(new Long(23), (Long)actual); + + /* index 5 */ + index++; + assertTrue(listVector.getAccessor().isNull(index)); + offset = offsetAccessor.get(index); + assertEquals(Integer.toString(14), Integer.toString(offset)); + + /* do split and transfer */ + try (ListVector toVector = ListVector.empty("toVector", allocator)) { + + TransferPair transferPair = listVector.makeTransferPair(toVector); + + int[][] transferLengths = { {0, 2}, + {3, 1}, + {4, 1} + }; + + for (final int[] transferLength : transferLengths) { + int start = transferLength[0]; + int splitLength = transferLength[1]; + + int dataLength1 = 0; + int dataLength2 = 0; + + int offset1 = 0; + int offset2 = 0; + + transferPair.splitAndTransfer(start, splitLength); + + /* get offsetVector of toVector */ + UInt4Vector offsetVector1 = (UInt4Vector)toVector.getOffsetVector(); + UInt4Vector.Accessor offsetAccessor1 = offsetVector1.getAccessor(); + + /* get dataVector of toVector */ + NullableBigIntVector dataVector1 = (NullableBigIntVector)toVector.getDataVector(); + NullableBigIntVector.Accessor valueAccessor1 = dataVector1.getAccessor(); + + for(int i = 0; i < splitLength; i++) { + dataLength1 = offsetAccessor.get(start + i + 1) - 
offsetAccessor.get(start + i); + dataLength2 = offsetAccessor1.get(i + 1) - offsetAccessor1.get(i); + + assertEquals("Different data lengths at index: " + i + " and start: " + start, + dataLength1, dataLength2); + + offset1 = offsetAccessor.get(start + i); + offset2 = offsetAccessor1.get(i); + + for(int j = 0; j < dataLength1; j++) { + assertEquals("Different data at indexes: " + offset1 + " and " + offset2, + valueAccessor.getObject(offset1), valueAccessor1.getObject(offset2)); + + offset1++; + offset2++; + } + } + } + } + } + } } diff --git a/java/vector/src/test/java/org/apache/arrow/vector/TestUnionVector.java b/java/vector/src/test/java/org/apache/arrow/vector/TestUnionVector.java index a5b90ee90b8..a5159242d76 100644 --- a/java/vector/src/test/java/org/apache/arrow/vector/TestUnionVector.java +++ b/java/vector/src/test/java/org/apache/arrow/vector/TestUnionVector.java @@ -24,6 +24,7 @@ import org.apache.arrow.vector.holders.NullableBitHolder; import org.apache.arrow.vector.holders.NullableIntHolder; import org.apache.arrow.vector.holders.NullableUInt4Holder; +import org.apache.arrow.vector.holders.NullableFloat4Holder; import org.apache.arrow.vector.types.Types; import org.apache.arrow.vector.types.Types.MinorType; import org.apache.arrow.vector.util.TransferPair; @@ -117,6 +118,179 @@ public void testTransfer() throws Exception { } } + @Test + public void testSplitAndTransfer() throws Exception { + try (UnionVector sourceVector = new UnionVector(EMPTY_SCHEMA_PATH, allocator, null)) { + final UnionVector.Mutator sourceMutator = sourceVector.getMutator(); + final UnionVector.Accessor sourceAccessor = sourceVector.getAccessor(); + + sourceVector.allocateNew(); + + /* populate the UnionVector */ + sourceMutator.setType(0, MinorType.INT); + sourceMutator.setSafe(0, newIntHolder(5)); + sourceMutator.setType(1, MinorType.INT); + sourceMutator.setSafe(1, newIntHolder(10)); + sourceMutator.setType(2, MinorType.INT); + sourceMutator.setSafe(2, newIntHolder(15)); + sourceMutator.setType(3, MinorType.INT); + sourceMutator.setSafe(3, newIntHolder(20)); + sourceMutator.setType(4, MinorType.INT); + sourceMutator.setSafe(4, newIntHolder(25)); + sourceMutator.setType(5, MinorType.INT); + sourceMutator.setSafe(5, newIntHolder(30)); + sourceMutator.setType(6, MinorType.INT); + sourceMutator.setSafe(6, newIntHolder(35)); + sourceMutator.setType(7, MinorType.INT); + sourceMutator.setSafe(7, newIntHolder(40)); + sourceMutator.setType(8, MinorType.INT); + sourceMutator.setSafe(8, newIntHolder(45)); + sourceMutator.setType(9, MinorType.INT); + sourceMutator.setSafe(9, newIntHolder(50)); + sourceMutator.setValueCount(10); + + /* check the vector output */ + assertEquals(10, sourceAccessor.getValueCount()); + assertEquals(false, sourceAccessor.isNull(0)); + assertEquals(5, sourceAccessor.getObject(0)); + assertEquals(false, sourceAccessor.isNull(1)); + assertEquals(10, sourceAccessor.getObject(1)); + assertEquals(false, sourceAccessor.isNull(2)); + assertEquals(15, sourceAccessor.getObject(2)); + assertEquals(false, sourceAccessor.isNull(3)); + assertEquals(20, sourceAccessor.getObject(3)); + assertEquals(false, sourceAccessor.isNull(4)); + assertEquals(25, sourceAccessor.getObject(4)); + assertEquals(false, sourceAccessor.isNull(5)); + assertEquals(30, sourceAccessor.getObject(5)); + assertEquals(false, sourceAccessor.isNull(6)); + assertEquals(35, sourceAccessor.getObject(6)); + assertEquals(false, sourceAccessor.isNull(7)); + assertEquals(40, sourceAccessor.getObject(7)); + assertEquals(false, 
sourceAccessor.isNull(8)); + assertEquals(45, sourceAccessor.getObject(8)); + assertEquals(false, sourceAccessor.isNull(9)); + assertEquals(50, sourceAccessor.getObject(9)); + + try(UnionVector toVector = new UnionVector(EMPTY_SCHEMA_PATH, allocator, null)) { + + final TransferPair transferPair = sourceVector.makeTransferPair(toVector); + final UnionVector.Accessor toAccessor = toVector.getAccessor(); + + final int[][] transferLengths = { {0, 3}, + {3, 1}, + {4, 2}, + {6, 1}, + {7, 1}, + {8, 2} + }; + + for (final int[] transferLength : transferLengths) { + final int start = transferLength[0]; + final int length = transferLength[1]; + + transferPair.splitAndTransfer(start, length); + + /* check the toVector output after doing the splitAndTransfer */ + for (int i = 0; i < length; i++) { + assertEquals("Different data at indexes: " + (start + i) + "and " + i, sourceAccessor.getObject(start + i), + toAccessor.getObject(i)); + } + } + } + } + } + + @Test + public void testSplitAndTransferWithMixedVectors() throws Exception { + try (UnionVector sourceVector = new UnionVector(EMPTY_SCHEMA_PATH, allocator, null)) { + final UnionVector.Mutator sourceMutator = sourceVector.getMutator(); + final UnionVector.Accessor sourceAccessor = sourceVector.getAccessor(); + + sourceVector.allocateNew(); + + /* populate the UnionVector */ + sourceMutator.setType(0, MinorType.INT); + sourceMutator.setSafe(0, newIntHolder(5)); + + sourceMutator.setType(1, MinorType.FLOAT4); + sourceMutator.setSafe(1, newFloat4Holder(5.5f)); + + sourceMutator.setType(2, MinorType.INT); + sourceMutator.setSafe(2, newIntHolder(10)); + + sourceMutator.setType(3, MinorType.FLOAT4); + sourceMutator.setSafe(3, newFloat4Holder(10.5f)); + + sourceMutator.setType(4, MinorType.INT); + sourceMutator.setSafe(4, newIntHolder(15)); + + sourceMutator.setType(5, MinorType.FLOAT4); + sourceMutator.setSafe(5, newFloat4Holder(15.5f)); + + sourceMutator.setType(6, MinorType.INT); + sourceMutator.setSafe(6, newIntHolder(20)); + + sourceMutator.setType(7, MinorType.FLOAT4); + sourceMutator.setSafe(7, newFloat4Holder(20.5f)); + + sourceMutator.setType(8, MinorType.INT); + sourceMutator.setSafe(8, newIntHolder(30)); + + sourceMutator.setType(9, MinorType.FLOAT4); + sourceMutator.setSafe(9, newFloat4Holder(30.5f)); + sourceMutator.setValueCount(10); + + /* check the vector output */ + assertEquals(10, sourceAccessor.getValueCount()); + assertEquals(false, sourceAccessor.isNull(0)); + assertEquals(5, sourceAccessor.getObject(0)); + assertEquals(false, sourceAccessor.isNull(1)); + assertEquals(5.5f, sourceAccessor.getObject(1)); + assertEquals(false, sourceAccessor.isNull(2)); + assertEquals(10, sourceAccessor.getObject(2)); + assertEquals(false, sourceAccessor.isNull(3)); + assertEquals(10.5f, sourceAccessor.getObject(3)); + assertEquals(false, sourceAccessor.isNull(4)); + assertEquals(15, sourceAccessor.getObject(4)); + assertEquals(false, sourceAccessor.isNull(5)); + assertEquals(15.5f, sourceAccessor.getObject(5)); + assertEquals(false, sourceAccessor.isNull(6)); + assertEquals(20, sourceAccessor.getObject(6)); + assertEquals(false, sourceAccessor.isNull(7)); + assertEquals(20.5f, sourceAccessor.getObject(7)); + assertEquals(false, sourceAccessor.isNull(8)); + assertEquals(30, sourceAccessor.getObject(8)); + assertEquals(false, sourceAccessor.isNull(9)); + assertEquals(30.5f, sourceAccessor.getObject(9)); + + try(UnionVector toVector = new UnionVector(EMPTY_SCHEMA_PATH, allocator, null)) { + + final TransferPair transferPair = 
sourceVector.makeTransferPair(toVector); + final UnionVector.Accessor toAccessor = toVector.getAccessor(); + + final int[][] transferLengths = { {0, 2}, + {2, 1}, + {3, 2}, + {5, 3}, + {8, 2} + }; + + for (final int[] transferLength : transferLengths) { + final int start = transferLength[0]; + final int length = transferLength[1]; + + transferPair.splitAndTransfer(start, length); + + /* check the toVector output after doing the splitAndTransfer */ + for (int i = 0; i < length; i++) { + assertEquals("Different values at index: " + i, sourceAccessor.getObject(start + i), toAccessor.getObject(i)); + } + } + } + } + } + private static NullableIntHolder newIntHolder(int value) { final NullableIntHolder holder = new NullableIntHolder(); holder.isSet = 1; @@ -130,4 +304,11 @@ private static NullableBitHolder newBitHolder(boolean value) { holder.value = value ? 1 : 0; return holder; } + + private static NullableFloat4Holder newFloat4Holder(float value) { + final NullableFloat4Holder holder = new NullableFloat4Holder(); + holder.isSet = 1; + holder.value = value; + return holder; + } } diff --git a/java/vector/src/test/java/org/apache/arrow/vector/TestValueVector.java b/java/vector/src/test/java/org/apache/arrow/vector/TestValueVector.java index 63543b09329..0f41c2dd790 100644 --- a/java/vector/src/test/java/org/apache/arrow/vector/TestValueVector.java +++ b/java/vector/src/test/java/org/apache/arrow/vector/TestValueVector.java @@ -26,11 +26,15 @@ import java.nio.charset.Charset; import java.util.List; +import java.util.ArrayList; import org.apache.arrow.memory.BufferAllocator; import org.apache.arrow.memory.RootAllocator; + +import org.apache.arrow.vector.schema.ArrowRecordBatch; import org.apache.arrow.vector.schema.TypeLayout; import org.apache.arrow.vector.types.Types.MinorType; +import org.apache.arrow.vector.types.pojo.Schema; import org.apache.arrow.vector.types.pojo.ArrowType; import org.apache.arrow.vector.types.pojo.Field; import org.junit.After; @@ -56,6 +60,9 @@ public void init() { private final static byte[] STR1 = "AAAAA1".getBytes(utf8Charset); private final static byte[] STR2 = "BBBBBBBBB2".getBytes(utf8Charset); private final static byte[] STR3 = "CCCC3".getBytes(utf8Charset); + private final static byte[] STR4 = "DDDDDDDD4".getBytes(utf8Charset); + private final static byte[] STR5 = "EEE5".getBytes(utf8Charset); + private final static byte[] STR6 = "FFFFF6".getBytes(utf8Charset); @After public void terminate() throws Exception { @@ -509,11 +516,231 @@ public void testCopyFromWithNulls() { } else { assertEquals(Integer.toString(i), vector2.getAccessor().getObject(i).toString()); } + } + } + } + + @Test + public void testSetLastSetUsage() { + try (final NullableVarCharVector vector = new NullableVarCharVector("myvector", allocator)) { + + final NullableVarCharVector.Mutator mutator = vector.getMutator(); + vector.allocateNew(1024 * 10, 1024); + + setBytes(0, STR1, vector); + setBytes(1, STR2, vector); + setBytes(2, STR3, vector); + setBytes(3, STR4, vector); + setBytes(4, STR5, vector); + setBytes(5, STR6, vector); + + /* Check current lastSet */ + assertEquals(Integer.toString(-1), Integer.toString(mutator.getLastSet())); + + /* Check the vector output */ + final NullableVarCharVector.Accessor accessor = vector.getAccessor(); + assertArrayEquals(STR1, accessor.get(0)); + assertArrayEquals(STR2, accessor.get(1)); + assertArrayEquals(STR3, accessor.get(2)); + assertArrayEquals(STR4, accessor.get(3)); + assertArrayEquals(STR5, accessor.get(4)); + assertArrayEquals(STR6, 
accessor.get(5));
+
+      /*
+       * If we don't call setLastSet(5) before setValueCount(), then the latter will corrupt
+       * the value vector by filling in all positions [0, valueCount-1] with empty byte arrays.
+       * Run the test with the next line commented out and we should see incorrect vector output.
+       */
+      mutator.setLastSet(5);
+      mutator.setValueCount(20);
+
+      /* Check the vector output again */
+      assertArrayEquals(STR1, accessor.get(0));
+      assertArrayEquals(STR2, accessor.get(1));
+      assertArrayEquals(STR3, accessor.get(2));
+      assertArrayEquals(STR4, accessor.get(3));
+      assertArrayEquals(STR5, accessor.get(4));
+      assertArrayEquals(STR6, accessor.get(5));
+    }
+  }
+
+  @Test
+  public void testVectorLoadUnload() {
+
+    try (final NullableVarCharVector vector1 = new NullableVarCharVector("myvector", allocator)) {
+
+      final NullableVarCharVector.Mutator mutator1 = vector1.getMutator();
+
+      vector1.allocateNew(1024 * 10, 1024);
+
+      mutator1.set(0, STR1);
+      mutator1.set(1, STR2);
+      mutator1.set(2, STR3);
+      mutator1.set(3, STR4);
+      mutator1.set(4, STR5);
+      mutator1.set(5, STR6);
+      assertEquals(Integer.toString(5), Integer.toString(mutator1.getLastSet()));
+      mutator1.setValueCount(15);
+      assertEquals(Integer.toString(14), Integer.toString(mutator1.getLastSet()));
+
+      /* Check the vector output */
+      final NullableVarCharVector.Accessor accessor1 = vector1.getAccessor();
+      assertArrayEquals(STR1, accessor1.get(0));
+      assertArrayEquals(STR2, accessor1.get(1));
+      assertArrayEquals(STR3, accessor1.get(2));
+      assertArrayEquals(STR4, accessor1.get(3));
+      assertArrayEquals(STR5, accessor1.get(4));
+      assertArrayEquals(STR6, accessor1.get(5));
+
+      Field field = vector1.getField();
+      String fieldName = field.getName();
+
+      List fields = new ArrayList();
+      List fieldVectors = new ArrayList();
+
+      fields.add(field);
+      fieldVectors.add(vector1);
+
+      Schema schema = new Schema(fields);
+
+      VectorSchemaRoot schemaRoot1 = new VectorSchemaRoot(schema, fieldVectors, accessor1.getValueCount());
+      VectorUnloader vectorUnloader = new VectorUnloader(schemaRoot1);
+
+      try (
+          ArrowRecordBatch recordBatch = vectorUnloader.getRecordBatch();
+          BufferAllocator finalVectorsAllocator = allocator.newChildAllocator("new vector", 0, Long.MAX_VALUE);
+          VectorSchemaRoot schemaRoot2 = VectorSchemaRoot.create(schema, finalVectorsAllocator);
+      ) {
+
+        VectorLoader vectorLoader = new VectorLoader(schemaRoot2);
+        vectorLoader.load(recordBatch);
+
+        NullableVarCharVector vector2 = (NullableVarCharVector)schemaRoot2.getVector(fieldName);
+        NullableVarCharVector.Mutator mutator2 = vector2.getMutator();
+
+        /*
+         * lastSet would have internally been set by VectorLoader.load() when it invokes
+         * loadFieldBuffers.
+ */ + assertEquals(Integer.toString(14), Integer.toString(mutator2.getLastSet())); + mutator2.setValueCount(25); + assertEquals(Integer.toString(24), Integer.toString(mutator2.getLastSet())); + + /* Check the vector output */ + final NullableVarCharVector.Accessor accessor2 = vector2.getAccessor(); + assertArrayEquals(STR1, accessor2.get(0)); + assertArrayEquals(STR2, accessor2.get(1)); + assertArrayEquals(STR3, accessor2.get(2)); + assertArrayEquals(STR4, accessor2.get(3)); + assertArrayEquals(STR5, accessor2.get(4)); + assertArrayEquals(STR6, accessor2.get(5)); } + } + } + @Test + public void testFillEmptiesUsage() { + try (final NullableVarCharVector vector = new NullableVarCharVector("myvector", allocator)) { + final NullableVarCharVector.Mutator mutator = vector.getMutator(); + + vector.allocateNew(1024 * 10, 1024); + + setBytes(0, STR1, vector); + setBytes(1, STR2, vector); + setBytes(2, STR3, vector); + setBytes(3, STR4, vector); + setBytes(4, STR5, vector); + setBytes(5, STR6, vector); + + /* Check current lastSet */ + assertEquals(Integer.toString(-1), Integer.toString(mutator.getLastSet())); + + /* Check the vector output */ + final NullableVarCharVector.Accessor accessor = vector.getAccessor(); + assertArrayEquals(STR1, accessor.get(0)); + assertArrayEquals(STR2, accessor.get(1)); + assertArrayEquals(STR3, accessor.get(2)); + assertArrayEquals(STR4, accessor.get(3)); + assertArrayEquals(STR5, accessor.get(4)); + assertArrayEquals(STR6, accessor.get(5)); + + mutator.setLastSet(5); + /* fill empty byte arrays from index [6, 9] */ + mutator.fillEmpties(10); + + /* Check current lastSet */ + assertEquals(Integer.toString(9), Integer.toString(mutator.getLastSet())); + + /* Check the vector output */ + assertArrayEquals(STR1, accessor.get(0)); + assertArrayEquals(STR2, accessor.get(1)); + assertArrayEquals(STR3, accessor.get(2)); + assertArrayEquals(STR4, accessor.get(3)); + assertArrayEquals(STR5, accessor.get(4)); + assertArrayEquals(STR6, accessor.get(5)); + assertEquals(Integer.toString(0), Integer.toString(accessor.getValueLength(6))); + assertEquals(Integer.toString(0), Integer.toString(accessor.getValueLength(7))); + assertEquals(Integer.toString(0), Integer.toString(accessor.getValueLength(8))); + assertEquals(Integer.toString(0), Integer.toString(accessor.getValueLength(9))); + + setBytes(10, STR1, vector); + setBytes(11, STR2, vector); + + mutator.setLastSet(11); + /* fill empty byte arrays from index [12, 14] */ + mutator.setValueCount(15); + + /* Check current lastSet */ + assertEquals(Integer.toString(14), Integer.toString(mutator.getLastSet())); + + /* Check the vector output */ + assertArrayEquals(STR1, accessor.get(0)); + assertArrayEquals(STR2, accessor.get(1)); + assertArrayEquals(STR3, accessor.get(2)); + assertArrayEquals(STR4, accessor.get(3)); + assertArrayEquals(STR5, accessor.get(4)); + assertArrayEquals(STR6, accessor.get(5)); + assertEquals(Integer.toString(0), Integer.toString(accessor.getValueLength(6))); + assertEquals(Integer.toString(0), Integer.toString(accessor.getValueLength(7))); + assertEquals(Integer.toString(0), Integer.toString(accessor.getValueLength(8))); + assertEquals(Integer.toString(0), Integer.toString(accessor.getValueLength(9))); + assertArrayEquals(STR1, accessor.get(10)); + assertArrayEquals(STR2, accessor.get(11)); + assertEquals(Integer.toString(0), Integer.toString(accessor.getValueLength(12))); + assertEquals(Integer.toString(0), Integer.toString(accessor.getValueLength(13))); + assertEquals(Integer.toString(0), 
Integer.toString(accessor.getValueLength(14))); + + /* Check offsets */ + final UInt4Vector.Accessor offsetAccessor = vector.values.offsetVector.getAccessor(); + assertEquals(Integer.toString(0), Integer.toString(offsetAccessor.get(0))); + assertEquals(Integer.toString(6), Integer.toString(offsetAccessor.get(1))); + assertEquals(Integer.toString(16), Integer.toString(offsetAccessor.get(2))); + assertEquals(Integer.toString(21), Integer.toString(offsetAccessor.get(3))); + assertEquals(Integer.toString(30), Integer.toString(offsetAccessor.get(4))); + assertEquals(Integer.toString(34), Integer.toString(offsetAccessor.get(5))); + + assertEquals(Integer.toString(40), Integer.toString(offsetAccessor.get(6))); + assertEquals(Integer.toString(40), Integer.toString(offsetAccessor.get(7))); + assertEquals(Integer.toString(40), Integer.toString(offsetAccessor.get(8))); + assertEquals(Integer.toString(40), Integer.toString(offsetAccessor.get(9))); + assertEquals(Integer.toString(40), Integer.toString(offsetAccessor.get(10))); + + assertEquals(Integer.toString(46), Integer.toString(offsetAccessor.get(11))); + assertEquals(Integer.toString(56), Integer.toString(offsetAccessor.get(12))); + + assertEquals(Integer.toString(56), Integer.toString(offsetAccessor.get(13))); + assertEquals(Integer.toString(56), Integer.toString(offsetAccessor.get(14))); + assertEquals(Integer.toString(56), Integer.toString(offsetAccessor.get(15))); } } + public static void setBytes(int index, byte[] bytes, NullableVarCharVector vector) { + final int currentOffset = vector.values.offsetVector.getAccessor().get(index); + + vector.bits.getMutator().setToOne(index); + vector.values.offsetVector.getMutator().set(index + 1, currentOffset + bytes.length); + vector.values.data.setBytes(currentOffset, bytes, 0, bytes.length); + } } diff --git a/js/README.md b/js/README.md index de9070c59aa..38e8fafcea9 100644 --- a/js/README.md +++ b/js/README.md @@ -1,15 +1,20 @@ ### Installation diff --git a/js/flatbuffers.sh b/js/flatbuffers.sh index 55967f84a9f..0f8e3f9fe99 100755 --- a/js/flatbuffers.sh +++ b/js/flatbuffers.sh @@ -1,16 +1,21 @@ #!/bin/bash -# Licensed under the Apache License, Version 2.0 (the "License"); -# you may not use this file except in compliance with the License. -# You may obtain a copy of the License at +# Licensed to the Apache Software Foundation (ASF) under one +# or more contributor license agreements. See the NOTICE file +# distributed with this work for additional information +# regarding copyright ownership. The ASF licenses this file +# to you under the Apache License, Version 2.0 (the +# "License"); you may not use this file except in compliance +# with the License. You may obtain a copy of the License at # -# http://www.apache.org/licenses/LICENSE-2.0 +# http://www.apache.org/licenses/LICENSE-2.0 # -# Unless required by applicable law or agreed to in writing, software -# distributed under the License is distributed on an "AS IS" BASIS, -# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. -# See the License for the specific language governing permissions and -# limitations under the License. See accompanying LICENSE file. +# Unless required by applicable law or agreed to in writing, +# software distributed under the License is distributed on an +# "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY +# KIND, either express or implied. See the License for the +# specific language governing permissions and limitations +# under the License. echo "Compiling flatbuffer schemas..." 
mkdir -p lib lib-esm diff --git a/js/webpack.config.js b/js/webpack.config.js index b9c3e83a890..aa123bd39f9 100644 --- a/js/webpack.config.js +++ b/js/webpack.config.js @@ -1,14 +1,19 @@ -// Licensed under the Apache License, Version 2.0 (the "License"); -// you may not use this file except in compliance with the License. -// You may obtain a copy of the License at +// Licensed to the Apache Software Foundation (ASF) under one +// or more contributor license agreements. See the NOTICE file +// distributed with this work for additional information +// regarding copyright ownership. The ASF licenses this file +// to you under the Apache License, Version 2.0 (the +// "License"); you may not use this file except in compliance +// with the License. You may obtain a copy of the License at // -// http://www.apache.org/licenses/LICENSE-2.0 +// http://www.apache.org/licenses/LICENSE-2.0 // -// Unless required by applicable law or agreed to in writing, software -// distributed under the License is distributed on an "AS IS" BASIS, -// WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. -// See the License for the specific language governing permissions and -// limitations under the License. See accompanying LICENSE file. +// Unless required by applicable law or agreed to in writing, +// software distributed under the License is distributed on an +// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY +// KIND, either express or implied. See the License for the +// specific language governing permissions and limitations +// under the License. var path = require('path'); var UglifyJSPlugin = require('uglifyjs-webpack-plugin'); diff --git a/python/.gitignore b/python/.gitignore index 6c0d5a93cd3..1bf20c4ca52 100644 --- a/python/.gitignore +++ b/python/.gitignore @@ -34,3 +34,9 @@ coverage.xml # benchmark working dir .asv pyarrow/_table_api.h + +# manylinux1 temporary files +manylinux1/arrow + +# plasma store +pyarrow/plasma_store \ No newline at end of file diff --git a/python/CMakeLists.txt b/python/CMakeLists.txt index 224147d8b5c..bfae157ed6b 100644 --- a/python/CMakeLists.txt +++ b/python/CMakeLists.txt @@ -51,9 +51,14 @@ if("${CMAKE_SOURCE_DIR}" STREQUAL "${CMAKE_CURRENT_SOURCE_DIR}") option(PYARROW_BUILD_PARQUET "Build the PyArrow Parquet integration" OFF) + option(PYARROW_BUILD_PLASMA + "Build the PyArrow Plasma integration" + OFF) option(PYARROW_BUNDLE_ARROW_CPP "Bundle the Arrow C++ libraries" OFF) + set(PYARROW_CXXFLAGS "" CACHE STRING + "Compiler flags to append when compiling Arrow") endif() find_program(CCACHE_FOUND ccache) @@ -72,6 +77,7 @@ include(CompilerInfo) # Add common flags set(CMAKE_CXX_FLAGS "${CXX_COMMON_FLAGS} ${CMAKE_CXX_FLAGS}") +set(CMAKE_CXX_FLAGS "${CMAKE_CXX_FLAGS} ${PYARROW_CXXFLAGS}") if (NOT MSVC) # Enable perf and other tools to work properly @@ -79,6 +85,13 @@ if (NOT MSVC) # Suppress Cython warnings set(CMAKE_CXX_FLAGS "${CMAKE_CXX_FLAGS} -Wno-unused-variable") +else() + # MSVC version of -Wno-return-type-c-linkage + set(CMAKE_CXX_FLAGS "${CMAKE_CXX_FLAGS} /wd4190") + + # Cython generates some bitshift expressions that MSVC does not like in + # __Pyx_PyFloat_DivideObjC + set(CMAKE_CXX_FLAGS "${CMAKE_CXX_FLAGS} /wd4293") endif() if ("${COMPILER_FAMILY}" STREQUAL "clang") @@ -90,80 +103,18 @@ if ("${COMPILER_FAMILY}" STREQUAL "clang") set(CMAKE_CXX_FLAGS "${CMAKE_CXX_FLAGS} -Qunused-arguments") # Cython warnings in clang - set(CMAKE_CXX_FLAGS "${CMAKE_CXX_FLAGS} -Wno-parentheses-equality -Wno-constant-logical-operand") -endif() + 
set(CMAKE_CXX_FLAGS "${CMAKE_CXX_FLAGS} -Wno-parentheses-equality") + set(CMAKE_CXX_FLAGS "${CMAKE_CXX_FLAGS} -Wno-constant-logical-operand") + set(CMAKE_CXX_FLAGS "${CMAKE_CXX_FLAGS} -Wno-missing-declarations") -set(PYARROW_LINK "a") + # We have public Cython APIs which return C++ types, which are in an extern + # "C" blog (no symbol mangling) and clang doesn't like this + set(CMAKE_CXX_FLAGS "${CMAKE_CXX_FLAGS} -Wno-return-type-c-linkage") +endif() # For any C code, use the same flags. set(CMAKE_C_FLAGS "${CMAKE_CXX_FLAGS}") -# Code coverage -if ("${PYARROW_GENERATE_COVERAGE}") - if("${CMAKE_CXX_COMPILER}" MATCHES ".*clang.*") - # There appears to be some bugs in clang 3.3 which cause code coverage - # to have link errors, not locating the llvm_gcda_* symbols. - # This should be fixed in llvm 3.4 with http://llvm.org/viewvc/llvm-project?view=revision&revision=184666 - message(SEND_ERROR "Cannot currently generate coverage with clang") - endif() - set(CMAKE_CXX_FLAGS "${CMAKE_CXX_FLAGS} --coverage -DCOVERAGE_BUILD") - - # For coverage to work properly, we need to use static linkage. Otherwise, - # __gcov_flush() doesn't properly flush coverage from every module. - # See http://stackoverflow.com/questions/28164543/using-gcov-flush-within-a-library-doesnt-force-the-other-modules-to-yield-gc - if("${PYARROW_LINK}" STREQUAL "a") - message("Using static linking for coverage build") - set(PYARROW_LINK "s") - elseif("${PYARROW_LINK}" STREQUAL "d") - message(SEND_ERROR "Cannot use coverage with static linking") - endif() -endif() - -# If we still don't know what kind of linking to perform, choose based on -# build type (developers like fast builds). -if ("${PYARROW_LINK}" STREQUAL "a") - if ("${CMAKE_BUILD_TYPE}" STREQUAL "DEBUG" OR - "${CMAKE_BUILD_TYPE}" STREQUAL "FASTDEBUG") - message("Using dynamic linking for ${CMAKE_BUILD_TYPE} builds") - set(PYARROW_LINK "d") - else() - message("Using static linking for ${CMAKE_BUILD_TYPE} builds") - set(PYARROW_LINK "s") - endif() -endif() - -# Are we using the gold linker? It doesn't work with dynamic linking as -# weak symbols aren't properly overridden, causing tcmalloc to be omitted. -# Let's flag this as an error in RELEASE builds (we shouldn't release a -# product like this). -# -# See https://sourceware.org/bugzilla/show_bug.cgi?id=16979 for details. -# -# The gold linker is only for ELF binaries, which OSX doesn't use. We can -# just skip. -if (NOT APPLE AND NOT MSVC) - execute_process(COMMAND ${CMAKE_CXX_COMPILER} -Wl,--version OUTPUT_VARIABLE LINKER_OUTPUT) -endif () - -if (LINKER_OUTPUT MATCHES "gold") - if ("${PYARROW_LINK}" STREQUAL "d" AND - "${CMAKE_BUILD_TYPE}" STREQUAL "RELEASE") - message(SEND_ERROR "Cannot use gold with dynamic linking in a RELEASE build " - "as it would cause tcmalloc symbols to get dropped") - else() - message("Using gold linker") - endif() - set(PYARROW_USING_GOLD 1) -else() - message("Using ld linker") -endif() - -# Having set PYARROW_LINK due to build type and/or sanitizer, it's now safe to -# act on its value. 
-if ("${PYARROW_LINK}" STREQUAL "d") - set(BUILD_SHARED_LIBS ON) -endif() - # set compile output directory string (TOLOWER ${CMAKE_BUILD_TYPE} BUILD_SUBDIR_NAME) @@ -333,6 +284,29 @@ if (PYARROW_BUILD_PARQUET) _parquet) endif() +## Plasma +if (PYARROW_BUILD_PLASMA) + find_package(Plasma) + + if(NOT PLASMA_FOUND) + message(FATAL_ERROR "Unable to locate Plasma libraries") + endif() + + include_directories(SYSTEM ${PLASMA_INCLUDE_DIR}) + ADD_THIRDPARTY_LIB(libplasma + SHARED_LIB ${PLASMA_SHARED_LIB}) + + if (PYARROW_BUNDLE_ARROW_CPP) + bundle_arrow_lib(PLASMA_SHARED_LIB) + endif() + set(LINK_LIBS + ${LINK_LIBS} + libplasma_shared) + + set(CYTHON_EXTENSIONS ${CYTHON_EXTENSIONS} plasma) + file(COPY ${PLASMA_EXECUTABLE} DESTINATION ${BUILD_OUTPUT_ROOT_DIRECTORY}) +endif() + ############################################################ # Setup and build Cython modules ############################################################ diff --git a/python/README.md b/python/README.md index 816fbf0c85d..29d213babd9 100644 --- a/python/README.md +++ b/python/README.md @@ -1,15 +1,20 @@ ## Python library for Apache Arrow diff --git a/python/asv.conf.json b/python/asv.conf.json index 0c059fd79c1..2a1dd42aba1 100644 --- a/python/asv.conf.json +++ b/python/asv.conf.json @@ -1,14 +1,19 @@ -// Licensed under the Apache License, Version 2.0 (the "License"); -// you may not use this file except in compliance with the License. -// You may obtain a copy of the License at +// Licensed to the Apache Software Foundation (ASF) under one +// or more contributor license agreements. See the NOTICE file +// distributed with this work for additional information +// regarding copyright ownership. The ASF licenses this file +// to you under the Apache License, Version 2.0 (the +// "License"); you may not use this file except in compliance +// with the License. You may obtain a copy of the License at // // http://www.apache.org/licenses/LICENSE-2.0 // -// Unless required by applicable law or agreed to in writing, software -// distributed under the License is distributed on an "AS IS" BASIS, -// WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. -// See the License for the specific language governing permissions and -// limitations under the License. See accompanying LICENSE file. +// Unless required by applicable law or agreed to in writing, +// software distributed under the License is distributed on an +// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY +// KIND, either express or implied. See the License for the +// specific language governing permissions and limitations +// under the License. { // The version of the config file format. Do not change, unless diff --git a/python/cmake_modules/FindParquet.cmake b/python/cmake_modules/FindParquet.cmake index 88dca2ed646..0339ec56ae2 100644 --- a/python/cmake_modules/FindParquet.cmake +++ b/python/cmake_modules/FindParquet.cmake @@ -60,6 +60,8 @@ if(PARQUET_HOME) PATHS ${PARQUET_HOME} NO_DEFAULT_PATH PATH_SUFFIXES "lib") get_filename_component(PARQUET_LIBS ${PARQUET_LIBRARIES} PATH ) + set(PARQUET_ABI_VERSION "1.0.0") + set(PARQUET_SO_VERSION "1") else() pkg_check_modules(PARQUET parquet) if (PARQUET_FOUND) diff --git a/python/cmake_modules/FindPlasma.cmake b/python/cmake_modules/FindPlasma.cmake new file mode 100644 index 00000000000..3acaa348bff --- /dev/null +++ b/python/cmake_modules/FindPlasma.cmake @@ -0,0 +1,99 @@ +# Licensed to the Apache Software Foundation (ASF) under one +# or more contributor license agreements. 
See the NOTICE file +# distributed with this work for additional information +# regarding copyright ownership. The ASF licenses this file +# to you under the Apache License, Version 2.0 (the +# "License"); you may not use this file except in compliance +# with the License. You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, +# software distributed under the License is distributed on an +# "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY +# KIND, either express or implied. See the License for the +# specific language governing permissions and limitations +# under the License. + +# - Find PLASMA (plasma/client.h, libplasma.a, libplasma.so) +# This module defines +# PLASMA_INCLUDE_DIR, directory containing headers +# PLASMA_LIBS, directory containing plasma libraries +# PLASMA_STATIC_LIB, path to libplasma.a +# PLASMA_SHARED_LIB, path to libplasma's shared library +# PLASMA_SHARED_IMP_LIB, path to libplasma's import library (MSVC only) +# PLASMA_FOUND, whether plasma has been found + +include(FindPkgConfig) + +if ("$ENV{ARROW_HOME}" STREQUAL "") + pkg_check_modules(PLASMA plasma) + if (PLASMA_FOUND) + pkg_get_variable(PLASMA_EXECUTABLE plasma executable) + pkg_get_variable(PLASMA_ABI_VERSION plasma abi_version) + message(STATUS "Plasma ABI version: ${PLASMA_ABI_VERSION}") + pkg_get_variable(PLASMA_SO_VERSION plasma so_version) + message(STATUS "Plasma SO version: ${PLASMA_SO_VERSION}") + set(PLASMA_INCLUDE_DIR ${PLASMA_INCLUDE_DIRS}) + set(PLASMA_LIBS ${PLASMA_LIBRARY_DIRS}) + set(PLASMA_SEARCH_LIB_PATH ${PLASMA_LIBRARY_DIRS}) + endif() +else() + set(PLASMA_HOME "$ENV{ARROW_HOME}") + + set(PLASMA_EXECUTABLE ${PLASMA_HOME}/bin/plasma_store) + + set(PLASMA_SEARCH_HEADER_PATHS + ${PLASMA_HOME}/include + ) + + set(PLASMA_SEARCH_LIB_PATH + ${PLASMA_HOME}/lib + ) + + find_path(PLASMA_INCLUDE_DIR plasma/client.h PATHS + ${PLASMA_SEARCH_HEADER_PATHS} + # make sure we don't accidentally pick up a different version + NO_DEFAULT_PATH + ) +endif() + +find_library(PLASMA_LIB_PATH NAMES plasma + PATHS + ${PLASMA_SEARCH_LIB_PATH} + NO_DEFAULT_PATH) +get_filename_component(PLASMA_LIBS ${PLASMA_LIB_PATH} DIRECTORY) + +if (PLASMA_INCLUDE_DIR AND PLASMA_LIBS) + set(PLASMA_FOUND TRUE) + set(PLASMA_LIB_NAME plasma) + + set(PLASMA_STATIC_LIB ${PLASMA_LIBS}/lib${PLASMA_LIB_NAME}.a) + + set(PLASMA_SHARED_LIB ${PLASMA_LIBS}/lib${PLASMA_LIB_NAME}${CMAKE_SHARED_LIBRARY_SUFFIX}) +endif() + +if (PLASMA_FOUND) + if (NOT Plasma_FIND_QUIETLY) + message(STATUS "Found the Plasma core library: ${PLASMA_LIB_PATH}") + message(STATUS "Found Plasma executable: ${PLASMA_EXECUTABLE}") + endif () +else () + if (NOT Plasma_FIND_QUIETLY) + set(PLASMA_ERR_MSG "Could not find the Plasma library. 
Looked for headers") + set(PLASMA_ERR_MSG "${PLASMA_ERR_MSG} in ${PLASMA_SEARCH_HEADER_PATHS}, and for libs") + set(PLASMA_ERR_MSG "${PLASMA_ERR_MSG} in ${PLASMA_SEARCH_LIB_PATH}") + if (Plasma_FIND_REQUIRED) + message(FATAL_ERROR "${PLASMA_ERR_MSG}") + else (Plasma_FIND_REQUIRED) + message(STATUS "${PLASMA_ERR_MSG}") + endif (Plasma_FIND_REQUIRED) + endif () + set(PLASMA_FOUND FALSE) +endif () + +mark_as_advanced( + PLASMA_INCLUDE_DIR + PLASMA_STATIC_LIB + PLASMA_SHARED_LIB +) diff --git a/python/doc/Benchmarks.md b/python/doc/Benchmarks.md index 1c368018582..c84bf0dc1eb 100644 --- a/python/doc/Benchmarks.md +++ b/python/doc/Benchmarks.md @@ -1,15 +1,20 @@ ## Benchmark Requirements diff --git a/python/doc/Makefile b/python/doc/Makefile index 65d6a4df3b2..1b9f707021a 100644 --- a/python/doc/Makefile +++ b/python/doc/Makefile @@ -1,15 +1,21 @@ - -# Licensed under the Apache License, Version 2.0 (the "License"); -# you may not use this file except in compliance with the License. -# You may obtain a copy of the License at +# Licensed to the Apache Software Foundation (ASF) under one +# or more contributor license agreements. See the NOTICE file +# distributed with this work for additional information +# regarding copyright ownership. The ASF licenses this file +# to you under the Apache License, Version 2.0 (the +# "License"); you may not use this file except in compliance +# with the License. You may obtain a copy of the License at # # http://www.apache.org/licenses/LICENSE-2.0 # -# Unless required by applicable law or agreed to in writing, software -# distributed under the License is distributed on an "AS IS" BASIS, -# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. -# See the License for the specific language governing permissions and -# limitations under the License. See accompanying LICENSE file. +# Unless required by applicable law or agreed to in writing, +# software distributed under the License is distributed on an +# "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY +# KIND, either express or implied. See the License for the +# specific language governing permissions and limitations +# under the License. + +# # Makefile for Sphinx documentation # diff --git a/python/doc/source/api.rst b/python/doc/source/api.rst index c52d400cef1..fd1cb728d98 100644 --- a/python/doc/source/api.rst +++ b/python/doc/source/api.rst @@ -164,6 +164,18 @@ Input / Output and Shared Memory create_memory_map PythonFile +File Systems +------------ + +.. autosummary:: + :toctree: generated/ + + hdfs.connect + LocalFileSystem + +.. class:: HadoopFileSystem + :noindex: + .. _api.ipc: Interprocess Communication and Messaging @@ -212,6 +224,20 @@ Type Classes Field Schema +.. currentmodule:: pyarrow.plasma + +.. _api.plasma: + +In-Memory Object Store +---------------------- + +.. autosummary:: + :toctree: generated/ + + ObjectID + PlasmaClient + PlasmaBuffer + .. currentmodule:: pyarrow.parquet .. _api.parquet: @@ -225,5 +251,8 @@ Apache Parquet ParquetDataset ParquetFile read_table + read_metadata + read_pandas + read_schema write_metadata write_table diff --git a/python/doc/source/conf.py b/python/doc/source/conf.py index c7f098fc0d5..25e6d5e2d44 100644 --- a/python/doc/source/conf.py +++ b/python/doc/source/conf.py @@ -1,16 +1,21 @@ # -*- coding: utf-8 -*- # -# Licensed under the Apache License, Version 2.0 (the "License"); -# you may not use this file except in compliance with the License. 
-# You may obtain a copy of the License at +# Licensed to the Apache Software Foundation (ASF) under one +# or more contributor license agreements. See the NOTICE file +# distributed with this work for additional information +# regarding copyright ownership. The ASF licenses this file +# to you under the Apache License, Version 2.0 (the +# "License"); you may not use this file except in compliance +# with the License. You may obtain a copy of the License at # # http://www.apache.org/licenses/LICENSE-2.0 # -# Unless required by applicable law or agreed to in writing, software -# distributed under the License is distributed on an "AS IS" BASIS, -# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. -# See the License for the specific language governing permissions and -# limitations under the License. See accompanying LICENSE file. +# Unless required by applicable law or agreed to in writing, +# software distributed under the License is distributed on an +# "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY +# KIND, either express or implied. See the License for the +# specific language governing permissions and limitations +# under the License. # # This file is execfile()d with the current directory set to its # containing dir. @@ -57,7 +62,8 @@ ] # Show members for classes in .. autosummary -autodoc_default_flags = ['members', 'undoc-members', 'show-inheritance', 'inherited-members'] +autodoc_default_flags = ['members', 'undoc-members', 'show-inheritance', + 'inherited-members'] # numpydoc configuration napoleon_use_rtype = False diff --git a/python/doc/source/development.rst b/python/doc/source/development.rst index b5aba6c53ef..d0a1c544dd0 100644 --- a/python/doc/source/development.rst +++ b/python/doc/source/development.rst @@ -84,7 +84,7 @@ from conda-forge: conda create -y -q -n pyarrow-dev \ python=3.6 numpy six setuptools cython pandas pytest \ cmake flatbuffers rapidjson boost-cpp thrift-cpp snappy zlib \ - brotli jemalloc -c conda-forge + brotli jemalloc lz4-c zstd -c conda-forge source activate pyarrow-dev @@ -267,7 +267,6 @@ Now, we build and install Arrow C++ libraries -DCMAKE_INSTALL_PREFIX=%ARROW_HOME% ^ -DCMAKE_BUILD_TYPE=Release ^ -DARROW_BUILD_TESTS=off ^ - -DARROW_ZLIB_VENDORED=off ^ -DARROW_PYTHON=on .. cmake --build . --target INSTALL --config Release cd ..\.. diff --git a/python/doc/source/filesystems.rst b/python/doc/source/filesystems.rst index 61c03c57dfa..c0530f93c2c 100644 --- a/python/doc/source/filesystems.rst +++ b/python/doc/source/filesystems.rst @@ -15,8 +15,8 @@ .. specific language governing permissions and limitations .. under the License. -Filesystem Interfaces -===================== +File System Interfaces +====================== In this section, we discuss filesystem-like interfaces in PyArrow. @@ -31,12 +31,14 @@ System. You connect like so: .. code-block:: python import pyarrow as pa - hdfs = pa.HdfsClient(host, port, user=user, kerb_ticket=ticket_cache_path) + fs = pa.hdfs.connect(host, port, user=user, kerb_ticket=ticket_cache_path) + with fs.open(path, 'rb') as f: + # Do something with f -By default, ``pyarrow.HdfsClient`` uses libhdfs, a JNI-based interface to the -Java Hadoop client. This library is loaded **at runtime** (rather than at link -/ library load time, since the library may not be in your LD_LIBRARY_PATH), and -relies on some environment variables. +By default, ``pyarrow.hdfs.HadoopFileSystem`` uses libhdfs, a JNI-based +interface to the Java Hadoop client. 
This library is loaded **at runtime** +(rather than at link / library load time, since the library may not be in your +LD_LIBRARY_PATH), and relies on some environment variables. * ``HADOOP_HOME``: the root of your installed Hadoop distribution. Often has `lib/native/libhdfs.so`. @@ -56,5 +58,33 @@ You can also use libhdfs3, a thirdparty C++ library for HDFS from Pivotal Labs: .. code-block:: python - hdfs3 = pa.HdfsClient(host, port, user=user, kerb_ticket=ticket_cache_path, - driver='libhdfs3') + fs = pa.hdfs.connect(host, port, user=user, kerb_ticket=ticket_cache_path, + driver='libhdfs3') + +HDFS API +~~~~~~~~ + +.. currentmodule:: pyarrow + +.. autosummary:: + :toctree: generated/ + + hdfs.connect + HadoopFileSystem.cat + HadoopFileSystem.chmod + HadoopFileSystem.chown + HadoopFileSystem.delete + HadoopFileSystem.df + HadoopFileSystem.disk_usage + HadoopFileSystem.download + HadoopFileSystem.exists + HadoopFileSystem.get_capacity + HadoopFileSystem.get_space_used + HadoopFileSystem.info + HadoopFileSystem.ls + HadoopFileSystem.mkdir + HadoopFileSystem.open + HadoopFileSystem.rename + HadoopFileSystem.rm + HadoopFileSystem.upload + HdfsFile diff --git a/python/doc/source/memory.rst b/python/doc/source/memory.rst index ccc6298b661..f18919999e0 100644 --- a/python/doc/source/memory.rst +++ b/python/doc/source/memory.rst @@ -226,10 +226,3 @@ file interfaces that can read and write to Arrow Buffers. reader.read(7) These have similar semantics to Python's built-in ``io.BytesIO``. - -Hadoop Filesystem ------------------ - -:class:`~pyarrow.HdfsFile` is an implementation of :class:`~pyarrow.NativeFile` -that can read and write to the Hadoop filesytem. Read more in the -:ref:`Filesystems Section `. diff --git a/python/manylinux1/Dockerfile-x86_64 b/python/manylinux1/Dockerfile-x86_64 index 13919a24087..6d72ec7538c 100644 --- a/python/manylinux1/Dockerfile-x86_64 +++ b/python/manylinux1/Dockerfile-x86_64 @@ -1,14 +1,19 @@ -# Licensed under the Apache License, Version 2.0 (the "License"); -# you may not use this file except in compliance with the License. -# You may obtain a copy of the License at +# Licensed to the Apache Software Foundation (ASF) under one +# or more contributor license agreements. See the NOTICE file +# distributed with this work for additional information +# regarding copyright ownership. The ASF licenses this file +# to you under the Apache License, Version 2.0 (the +# "License"); you may not use this file except in compliance +# with the License. You may obtain a copy of the License at # # http://www.apache.org/licenses/LICENSE-2.0 # -# Unless required by applicable law or agreed to in writing, software -# distributed under the License is distributed on an "AS IS" BASIS, -# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. -# See the License for the specific language governing permissions and -# limitations under the License. See accompanying LICENSE file. +# Unless required by applicable law or agreed to in writing, +# software distributed under the License is distributed on an +# "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY +# KIND, either express or implied. See the License for the +# specific language governing permissions and limitations +# under the License. 
FROM quay.io/xhochy/arrow_manylinux1_x86_64_base:latest ADD arrow /arrow diff --git a/python/manylinux1/Dockerfile-x86_64_base b/python/manylinux1/Dockerfile-x86_64_base index cdd13e2e933..0160aa4eea5 100644 --- a/python/manylinux1/Dockerfile-x86_64_base +++ b/python/manylinux1/Dockerfile-x86_64_base @@ -1,14 +1,19 @@ -# Licensed under the Apache License, Version 2.0 (the "License"); -# you may not use this file except in compliance with the License. -# You may obtain a copy of the License at +# Licensed to the Apache Software Foundation (ASF) under one +# or more contributor license agreements. See the NOTICE file +# distributed with this work for additional information +# regarding copyright ownership. The ASF licenses this file +# to you under the Apache License, Version 2.0 (the +# "License"); you may not use this file except in compliance +# with the License. You may obtain a copy of the License at # # http://www.apache.org/licenses/LICENSE-2.0 # -# Unless required by applicable law or agreed to in writing, software -# distributed under the License is distributed on an "AS IS" BASIS, -# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. -# See the License for the specific language governing permissions and -# limitations under the License. See accompanying LICENSE file. +# Unless required by applicable law or agreed to in writing, +# software distributed under the License is distributed on an +# "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY +# KIND, either express or implied. See the License for the +# specific language governing permissions and limitations +# under the License. FROM quay.io/pypa/manylinux1_x86_64:latest # Install dependencies diff --git a/python/manylinux1/README.md b/python/manylinux1/README.md index 2e7f56bd620..a74f7a27b93 100644 --- a/python/manylinux1/README.md +++ b/python/manylinux1/README.md @@ -1,15 +1,20 @@ ## Manylinux1 wheels for Apache Arrow diff --git a/python/manylinux1/build_arrow.sh b/python/manylinux1/build_arrow.sh index 8c6bda9550e..5a21e36e4d7 100755 --- a/python/manylinux1/build_arrow.sh +++ b/python/manylinux1/build_arrow.sh @@ -1,16 +1,21 @@ #!/bin/bash # -# Licensed under the Apache License, Version 2.0 (the "License"); -# you may not use this file except in compliance with the License. -# You may obtain a copy of the License at +# Licensed to the Apache Software Foundation (ASF) under one +# or more contributor license agreements. See the NOTICE file +# distributed with this work for additional information +# regarding copyright ownership. The ASF licenses this file +# to you under the Apache License, Version 2.0 (the +# "License"); you may not use this file except in compliance +# with the License. You may obtain a copy of the License at # # http://www.apache.org/licenses/LICENSE-2.0 # -# Unless required by applicable law or agreed to in writing, software -# distributed under the License is distributed on an "AS IS" BASIS, -# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. -# See the License for the specific language governing permissions and -# limitations under the License. See accompanying LICENSE file. +# Unless required by applicable law or agreed to in writing, +# software distributed under the License is distributed on an +# "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY +# KIND, either express or implied. See the License for the +# specific language governing permissions and limitations +# under the License. 
#
# Usage:
#   docker run --rm -v $PWD:/io arrow-base-x86_64 /io/build_arrow.sh
@@ -35,6 +40,7 @@ cd /arrow/python
# PyArrow build configuration
export PYARROW_BUILD_TYPE='release'
export PYARROW_WITH_PARQUET=1
+export PYARROW_WITH_PLASMA=1
export PYARROW_BUNDLE_ARROW_CPP=1
# Need as otherwise arrow_io is sometimes not linked
export LDFLAGS="-Wl,--no-as-needed"
@@ -52,7 +58,7 @@ for PYTHON in ${PYTHON_VERSIONS}; do
  ARROW_BUILD_DIR=/arrow/cpp/build-PY${PYTHON}
  mkdir -p "${ARROW_BUILD_DIR}"
  pushd "${ARROW_BUILD_DIR}"
-  PATH="$(cpython_path $PYTHON)/bin:$PATH" cmake -DCMAKE_BUILD_TYPE=Release -DCMAKE_INSTALL_PREFIX=/arrow-dist -DARROW_BUILD_TESTS=OFF -DARROW_BUILD_SHARED=ON -DARROW_BOOST_USE_SHARED=OFF -DARROW_JEMALLOC=ON -DARROW_RPATH_ORIGIN=ON -DARROW_JEMALLOC_USE_SHARED=OFF -DARROW_PYTHON=ON -DPythonInterp_FIND_VERSION=${PYTHON} ..
+  PATH="$(cpython_path $PYTHON)/bin:$PATH" cmake -DCMAKE_BUILD_TYPE=Release -DCMAKE_INSTALL_PREFIX=/arrow-dist -DARROW_BUILD_TESTS=OFF -DARROW_BUILD_SHARED=ON -DARROW_BOOST_USE_SHARED=OFF -DARROW_JEMALLOC=ON -DARROW_RPATH_ORIGIN=ON -DARROW_JEMALLOC_USE_SHARED=OFF -DARROW_PYTHON=ON -DPythonInterp_FIND_VERSION=${PYTHON} -DARROW_PLASMA=ON ..
  make -j5 install
  popd
@@ -65,6 +71,7 @@ for PYTHON in ${PYTHON_VERSIONS}; do
  echo "=== (${PYTHON}) Test the existence of optional modules ==="
  $PIPI_IO -r requirements.txt
  PATH="$PATH:$(cpython_path $PYTHON)/bin" $PYTHON_INTERPRETER -c "import pyarrow.parquet"
+  PATH="$PATH:$(cpython_path $PYTHON)/bin" $PYTHON_INTERPRETER -c "import pyarrow.plasma"
  echo "=== (${PYTHON}) Tag the wheel with manylinux1 ==="
  mkdir -p repaired_wheels/
@@ -73,9 +80,11 @@ for PYTHON in ${PYTHON_VERSIONS}; do
  echo "=== (${PYTHON}) Testing manylinux1 wheel ==="
  source /venv-test-${PYTHON}/bin/activate
  pip install repaired_wheels/*.whl
-  py.test --parquet /venv-test-${PYTHON}/lib/*/site-packages/pyarrow
+
+  # ARROW-1264; for some reason the added test case causes a segfault inside
+  # the Docker container when writing an error message to stderr
+  py.test --parquet /venv-test-${PYTHON}/lib/*/site-packages/pyarrow -v -s --disable-plasma
  deactivate
  mv repaired_wheels/*.whl /io/dist
done
-
diff --git a/python/manylinux1/scripts/build_boost.sh b/python/manylinux1/scripts/build_boost.sh
index 6a313366494..3c11f3aeb6f 100755
--- a/python/manylinux1/scripts/build_boost.sh
+++ b/python/manylinux1/scripts/build_boost.sh
@@ -1,16 +1,20 @@
#!/bin/bash -ex
-# Licensed under the Apache License, Version 2.0 (the "License");
-# you may not use this file except in compliance with the License.
-# You may obtain a copy of the License at
+# Licensed to the Apache Software Foundation (ASF) under one
+# or more contributor license agreements. See the NOTICE file
+# distributed with this work for additional information
+# regarding copyright ownership. The ASF licenses this file
+# to you under the Apache License, Version 2.0 (the
+# "License"); you may not use this file except in compliance
+# with the License. You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
-# Unless required by applicable law or agreed to in writing, software
-# distributed under the License is distributed on an "AS IS" BASIS,
-# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
-# See the License for the specific language governing permissions and
-# limitations under the License. See accompanying LICENSE file.
- +# Unless required by applicable law or agreed to in writing, +# software distributed under the License is distributed on an +# "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY +# KIND, either express or implied. See the License for the +# specific language governing permissions and limitations +# under the License. wget --no-check-certificate http://downloads.sourceforge.net/project/boost/boost/1.60.0/boost_1_60_0.tar.gz -O /boost_1_60_0.tar.gz tar xf boost_1_60_0.tar.gz diff --git a/python/manylinux1/scripts/build_brotli.sh b/python/manylinux1/scripts/build_brotli.sh index 4b4cbf17ca9..9a1eca7b780 100755 --- a/python/manylinux1/scripts/build_brotli.sh +++ b/python/manylinux1/scripts/build_brotli.sh @@ -1,15 +1,20 @@ #!/bin/bash -ex -# Licensed under the Apache License, Version 2.0 (the "License"); -# you may not use this file except in compliance with the License. -# You may obtain a copy of the License at +# Licensed to the Apache Software Foundation (ASF) under one +# or more contributor license agreements. See the NOTICE file +# distributed with this work for additional information +# regarding copyright ownership. The ASF licenses this file +# to you under the Apache License, Version 2.0 (the +# "License"); you may not use this file except in compliance +# with the License. You may obtain a copy of the License at # # http://www.apache.org/licenses/LICENSE-2.0 # -# Unless required by applicable law or agreed to in writing, software -# distributed under the License is distributed on an "AS IS" BASIS, -# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. -# See the License for the specific language governing permissions and -# limitations under the License. See accompanying LICENSE file. +# Unless required by applicable law or agreed to in writing, +# software distributed under the License is distributed on an +# "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY +# KIND, either express or implied. See the License for the +# specific language governing permissions and limitations +# under the License. export BROTLI_VERSION="0.6.0" wget "https://github.com/google/brotli/archive/v${BROTLI_VERSION}.tar.gz" -O brotli-${BROTLI_VERSION}.tar.gz diff --git a/python/manylinux1/scripts/build_ccache.sh b/python/manylinux1/scripts/build_ccache.sh index 6ad5d29f832..681adecd9a7 100755 --- a/python/manylinux1/scripts/build_ccache.sh +++ b/python/manylinux1/scripts/build_ccache.sh @@ -1,15 +1,20 @@ #!/bin/bash -ex -# Licensed under the Apache License, Version 2.0 (the "License"); -# you may not use this file except in compliance with the License. -# You may obtain a copy of the License at +# Licensed to the Apache Software Foundation (ASF) under one +# or more contributor license agreements. See the NOTICE file +# distributed with this work for additional information +# regarding copyright ownership. The ASF licenses this file +# to you under the Apache License, Version 2.0 (the +# "License"); you may not use this file except in compliance +# with the License. You may obtain a copy of the License at # # http://www.apache.org/licenses/LICENSE-2.0 # -# Unless required by applicable law or agreed to in writing, software -# distributed under the License is distributed on an "AS IS" BASIS, -# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. -# See the License for the specific language governing permissions and -# limitations under the License. See accompanying LICENSE file. 
+# Unless required by applicable law or agreed to in writing, +# software distributed under the License is distributed on an +# "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY +# KIND, either express or implied. See the License for the +# specific language governing permissions and limitations +# under the License. wget https://www.samba.org/ftp/ccache/ccache-3.3.4.tar.bz2 -O ccache-3.3.4.tar.bz2 tar xf ccache-3.3.4.tar.bz2 diff --git a/python/manylinux1/scripts/build_flatbuffers.sh b/python/manylinux1/scripts/build_flatbuffers.sh index 7703855b6ef..683a89ce5c4 100755 --- a/python/manylinux1/scripts/build_flatbuffers.sh +++ b/python/manylinux1/scripts/build_flatbuffers.sh @@ -1,15 +1,20 @@ #!/bin/bash -ex -# Licensed under the Apache License, Version 2.0 (the "License"); -# you may not use this file except in compliance with the License. -# You may obtain a copy of the License at +# Licensed to the Apache Software Foundation (ASF) under one +# or more contributor license agreements. See the NOTICE file +# distributed with this work for additional information +# regarding copyright ownership. The ASF licenses this file +# to you under the Apache License, Version 2.0 (the +# "License"); you may not use this file except in compliance +# with the License. You may obtain a copy of the License at # # http://www.apache.org/licenses/LICENSE-2.0 # -# Unless required by applicable law or agreed to in writing, software -# distributed under the License is distributed on an "AS IS" BASIS, -# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. -# See the License for the specific language governing permissions and -# limitations under the License. See accompanying LICENSE file. +# Unless required by applicable law or agreed to in writing, +# software distributed under the License is distributed on an +# "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY +# KIND, either express or implied. See the License for the +# specific language governing permissions and limitations +# under the License. wget https://github.com/google/flatbuffers/archive/v1.6.0.tar.gz -O flatbuffers-1.6.0.tar.gz tar xf flatbuffers-1.6.0.tar.gz diff --git a/python/manylinux1/scripts/build_gtest.sh b/python/manylinux1/scripts/build_gtest.sh index 3427bed091e..4ce20c1fb44 100755 --- a/python/manylinux1/scripts/build_gtest.sh +++ b/python/manylinux1/scripts/build_gtest.sh @@ -1,15 +1,20 @@ #!/bin/bash -ex -# Licensed under the Apache License, Version 2.0 (the "License"); -# you may not use this file except in compliance with the License. -# You may obtain a copy of the License at +# Licensed to the Apache Software Foundation (ASF) under one +# or more contributor license agreements. See the NOTICE file +# distributed with this work for additional information +# regarding copyright ownership. The ASF licenses this file +# to you under the Apache License, Version 2.0 (the +# "License"); you may not use this file except in compliance +# with the License. You may obtain a copy of the License at # # http://www.apache.org/licenses/LICENSE-2.0 # -# Unless required by applicable law or agreed to in writing, software -# distributed under the License is distributed on an "AS IS" BASIS, -# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. -# See the License for the specific language governing permissions and -# limitations under the License. See accompanying LICENSE file. 
+# Unless required by applicable law or agreed to in writing, +# software distributed under the License is distributed on an +# "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY +# KIND, either express or implied. See the License for the +# specific language governing permissions and limitations +# under the License. wget https://github.com/google/googletest/archive/release-1.7.0.tar.gz -O googletest-release-1.7.0.tar.gz tar xf googletest-release-1.7.0.tar.gz diff --git a/python/manylinux1/scripts/build_jemalloc.sh b/python/manylinux1/scripts/build_jemalloc.sh index 8153baa097e..1bf1a06b27e 100755 --- a/python/manylinux1/scripts/build_jemalloc.sh +++ b/python/manylinux1/scripts/build_jemalloc.sh @@ -1,15 +1,20 @@ #!/bin/bash -ex -# Licensed under the Apache License, Version 2.0 (the "License"); -# you may not use this file except in compliance with the License. -# You may obtain a copy of the License at +# Licensed to the Apache Software Foundation (ASF) under one +# or more contributor license agreements. See the NOTICE file +# distributed with this work for additional information +# regarding copyright ownership. The ASF licenses this file +# to you under the Apache License, Version 2.0 (the +# "License"); you may not use this file except in compliance +# with the License. You may obtain a copy of the License at # # http://www.apache.org/licenses/LICENSE-2.0 # -# Unless required by applicable law or agreed to in writing, software -# distributed under the License is distributed on an "AS IS" BASIS, -# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. -# See the License for the specific language governing permissions and -# limitations under the License. See accompanying LICENSE file. +# Unless required by applicable law or agreed to in writing, +# software distributed under the License is distributed on an +# "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY +# KIND, either express or implied. See the License for the +# specific language governing permissions and limitations +# under the License. wget https://github.com/jemalloc/jemalloc/releases/download/4.4.0/jemalloc-4.4.0.tar.bz2 -O jemalloc-4.4.0.tar.bz2 tar xf jemalloc-4.4.0.tar.bz2 diff --git a/python/manylinux1/scripts/build_lz4.sh b/python/manylinux1/scripts/build_lz4.sh index 975a3015412..8242a5fe27e 100755 --- a/python/manylinux1/scripts/build_lz4.sh +++ b/python/manylinux1/scripts/build_lz4.sh @@ -1,15 +1,20 @@ #!/bin/bash -ex -# Licensed under the Apache License, Version 2.0 (the "License"); -# you may not use this file except in compliance with the License. -# You may obtain a copy of the License at +# Licensed to the Apache Software Foundation (ASF) under one +# or more contributor license agreements. See the NOTICE file +# distributed with this work for additional information +# regarding copyright ownership. The ASF licenses this file +# to you under the Apache License, Version 2.0 (the +# "License"); you may not use this file except in compliance +# with the License. You may obtain a copy of the License at # # http://www.apache.org/licenses/LICENSE-2.0 # -# Unless required by applicable law or agreed to in writing, software -# distributed under the License is distributed on an "AS IS" BASIS, -# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. -# See the License for the specific language governing permissions and -# limitations under the License. See accompanying LICENSE file. 
+# Unless required by applicable law or agreed to in writing, +# software distributed under the License is distributed on an +# "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY +# KIND, either express or implied. See the License for the +# specific language governing permissions and limitations +# under the License. export LZ4_VERSION="1.7.5" export PREFIX="/usr" diff --git a/python/manylinux1/scripts/build_openssl.sh b/python/manylinux1/scripts/build_openssl.sh index 3bcb2b9a053..1a54d72f046 100755 --- a/python/manylinux1/scripts/build_openssl.sh +++ b/python/manylinux1/scripts/build_openssl.sh @@ -1,15 +1,20 @@ #!/bin/bash -ex -# Licensed under the Apache License, Version 2.0 (the "License"); -# you may not use this file except in compliance with the License. -# You may obtain a copy of the License at +# Licensed to the Apache Software Foundation (ASF) under one +# or more contributor license agreements. See the NOTICE file +# distributed with this work for additional information +# regarding copyright ownership. The ASF licenses this file +# to you under the Apache License, Version 2.0 (the +# "License"); you may not use this file except in compliance +# with the License. You may obtain a copy of the License at # # http://www.apache.org/licenses/LICENSE-2.0 # -# Unless required by applicable law or agreed to in writing, software -# distributed under the License is distributed on an "AS IS" BASIS, -# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. -# See the License for the specific language governing permissions and -# limitations under the License. See accompanying LICENSE file. +# Unless required by applicable law or agreed to in writing, +# software distributed under the License is distributed on an +# "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY +# KIND, either express or implied. See the License for the +# specific language governing permissions and limitations +# under the License. wget --no-check-certificate https://www.openssl.org/source/openssl-1.0.2k.tar.gz -O openssl-1.0.2k.tar.gz tar xf openssl-1.0.2k.tar.gz diff --git a/python/manylinux1/scripts/build_snappy.sh b/python/manylinux1/scripts/build_snappy.sh index 973b4ff7d80..5392e14a33a 100755 --- a/python/manylinux1/scripts/build_snappy.sh +++ b/python/manylinux1/scripts/build_snappy.sh @@ -1,15 +1,20 @@ #!/bin/bash -ex -# Licensed under the Apache License, Version 2.0 (the "License"); -# you may not use this file except in compliance with the License. -# You may obtain a copy of the License at +# Licensed to the Apache Software Foundation (ASF) under one +# or more contributor license agreements. See the NOTICE file +# distributed with this work for additional information +# regarding copyright ownership. The ASF licenses this file +# to you under the Apache License, Version 2.0 (the +# "License"); you may not use this file except in compliance +# with the License. You may obtain a copy of the License at # # http://www.apache.org/licenses/LICENSE-2.0 # -# Unless required by applicable law or agreed to in writing, software -# distributed under the License is distributed on an "AS IS" BASIS, -# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. -# See the License for the specific language governing permissions and -# limitations under the License. See accompanying LICENSE file. 
+# Unless required by applicable law or agreed to in writing, +# software distributed under the License is distributed on an +# "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY +# KIND, either express or implied. See the License for the +# specific language governing permissions and limitations +# under the License. export SNAPPY_VERSION="1.1.3" wget "https://github.com/google/snappy/releases/download/${SNAPPY_VERSION}/snappy-${SNAPPY_VERSION}.tar.gz" -O snappy-${SNAPPY_VERSION}.tar.gz diff --git a/python/manylinux1/scripts/build_thrift.sh b/python/manylinux1/scripts/build_thrift.sh index 1db74585548..28aa75b7413 100755 --- a/python/manylinux1/scripts/build_thrift.sh +++ b/python/manylinux1/scripts/build_thrift.sh @@ -1,15 +1,20 @@ #!/bin/bash -ex -# Licensed under the Apache License, Version 2.0 (the "License"); -# you may not use this file except in compliance with the License. -# You may obtain a copy of the License at +# Licensed to the Apache Software Foundation (ASF) under one +# or more contributor license agreements. See the NOTICE file +# distributed with this work for additional information +# regarding copyright ownership. The ASF licenses this file +# to you under the Apache License, Version 2.0 (the +# "License"); you may not use this file except in compliance +# with the License. You may obtain a copy of the License at # # http://www.apache.org/licenses/LICENSE-2.0 # -# Unless required by applicable law or agreed to in writing, software -# distributed under the License is distributed on an "AS IS" BASIS, -# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. -# See the License for the specific language governing permissions and -# limitations under the License. See accompanying LICENSE file. +# Unless required by applicable law or agreed to in writing, +# software distributed under the License is distributed on an +# "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY +# KIND, either express or implied. See the License for the +# specific language governing permissions and limitations +# under the License. export THRIFT_VERSION=0.10.0 wget http://archive.apache.org/dist/thrift/${THRIFT_VERSION}/thrift-${THRIFT_VERSION}.tar.gz diff --git a/python/manylinux1/scripts/build_virtualenvs.sh b/python/manylinux1/scripts/build_virtualenvs.sh index 60d6580de0c..ddedcf61fde 100755 --- a/python/manylinux1/scripts/build_virtualenvs.sh +++ b/python/manylinux1/scripts/build_virtualenvs.sh @@ -1,15 +1,20 @@ #!/bin/bash -e -# Licensed under the Apache License, Version 2.0 (the "License"); -# you may not use this file except in compliance with the License. -# You may obtain a copy of the License at +# Licensed to the Apache Software Foundation (ASF) under one +# or more contributor license agreements. See the NOTICE file +# distributed with this work for additional information +# regarding copyright ownership. The ASF licenses this file +# to you under the Apache License, Version 2.0 (the +# "License"); you may not use this file except in compliance +# with the License. You may obtain a copy of the License at # # http://www.apache.org/licenses/LICENSE-2.0 # -# Unless required by applicable law or agreed to in writing, software -# distributed under the License is distributed on an "AS IS" BASIS, -# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. -# See the License for the specific language governing permissions and -# limitations under the License. See accompanying LICENSE file. 
+# Unless required by applicable law or agreed to in writing, +# software distributed under the License is distributed on an +# "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY +# KIND, either express or implied. See the License for the +# specific language governing permissions and limitations +# under the License. # Build upon the scripts in https://github.com/matthew-brett/manylinux-builds # * Copyright (c) 2013-2016, Matt Terry and Matthew Brett (BSD 2-clause) diff --git a/python/manylinux1/scripts/build_zstd.sh b/python/manylinux1/scripts/build_zstd.sh index 268e5c8894c..ef0e267757e 100755 --- a/python/manylinux1/scripts/build_zstd.sh +++ b/python/manylinux1/scripts/build_zstd.sh @@ -1,15 +1,20 @@ #!/bin/bash -ex -# Licensed under the Apache License, Version 2.0 (the "License"); -# you may not use this file except in compliance with the License. -# You may obtain a copy of the License at +# Licensed to the Apache Software Foundation (ASF) under one +# or more contributor license agreements. See the NOTICE file +# distributed with this work for additional information +# regarding copyright ownership. The ASF licenses this file +# to you under the Apache License, Version 2.0 (the +# "License"); you may not use this file except in compliance +# with the License. You may obtain a copy of the License at # # http://www.apache.org/licenses/LICENSE-2.0 # -# Unless required by applicable law or agreed to in writing, software -# distributed under the License is distributed on an "AS IS" BASIS, -# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. -# See the License for the specific language governing permissions and -# limitations under the License. See accompanying LICENSE file. +# Unless required by applicable law or agreed to in writing, +# software distributed under the License is distributed on an +# "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY +# KIND, either express or implied. See the License for the +# specific language governing permissions and limitations +# under the License. 
export ZSTD_VERSION="1.2.0" export CFLAGS="${CFLAGS} -O3 -fPIC" diff --git a/python/pyarrow/__init__.py b/python/pyarrow/__init__.py index e3d783aee58..8d4a214ba26 100644 --- a/python/pyarrow/__init__.py +++ b/python/pyarrow/__init__.py @@ -68,6 +68,7 @@ Date32Value, Date64Value, TimestampValue) from pyarrow.lib import (HdfsFile, NativeFile, PythonFile, + FixedSizeBufferOutputStream, Buffer, BufferReader, BufferOutputStream, OSFile, MemoryMappedFile, memory_map, frombuffer, @@ -87,7 +88,10 @@ ArrowTypeError) -from pyarrow.filesystem import Filesystem, HdfsClient, LocalFilesystem +from pyarrow.filesystem import FileSystem, LocalFileSystem + +from pyarrow.hdfs import HadoopFileSystem +import pyarrow.hdfs as hdfs from pyarrow.ipc import (Message, MessageReader, RecordBatchFileReader, RecordBatchFileWriter, @@ -99,23 +103,13 @@ open_file, serialize_pandas, deserialize_pandas) - -localfs = LocalFilesystem.get_instance() +localfs = LocalFileSystem.get_instance() # ---------------------------------------------------------------------- # 0.4.0 deprecations -import warnings - -def _deprecate_class(old_name, new_name, klass, next_version='0.5.0'): - msg = ('pyarrow.{0} has been renamed to ' - '{1}, will be removed in {2}' - .format(old_name, new_name, next_version)) - def deprecated_factory(*args, **kwargs): - warnings.warn(msg, FutureWarning) - return klass(*args) - return deprecated_factory +from pyarrow.util import _deprecate_class FileReader = _deprecate_class('FileReader', 'RecordBatchFileReader', @@ -136,3 +130,7 @@ def deprecated_factory(*args, **kwargs): InMemoryOutputStream = _deprecate_class('InMemoryOutputStream', 'BufferOutputStream', BufferOutputStream, '0.5.0') + +# Backwards compatibility with pyarrow < 0.6.0 +HdfsClient = _deprecate_class('HdfsClient', 'pyarrow.hdfs.connect', + hdfs.connect, '0.6.0') diff --git a/python/pyarrow/_config.pyx b/python/pyarrow/_config.pyx index e5fdbef8af5..a2d2d719e68 100644 --- a/python/pyarrow/_config.pyx +++ b/python/pyarrow/_config.pyx @@ -1,14 +1,19 @@ -# Licensed under the Apache License, Version 2.0 (the "License"); -# you may not use this file except in compliance with the License. -# You may obtain a copy of the License at +# Licensed to the Apache Software Foundation (ASF) under one +# or more contributor license agreements. See the NOTICE file +# distributed with this work for additional information +# regarding copyright ownership. The ASF licenses this file +# to you under the Apache License, Version 2.0 (the +# "License"); you may not use this file except in compliance +# with the License. You may obtain a copy of the License at # # http://www.apache.org/licenses/LICENSE-2.0 # -# Unless required by applicable law or agreed to in writing, software -# distributed under the License is distributed on an "AS IS" BASIS, -# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. -# See the License for the specific language governing permissions and -# limitations under the License. See accompanying LICENSE file. +# Unless required by applicable law or agreed to in writing, +# software distributed under the License is distributed on an +# "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY +# KIND, either express or implied. See the License for the +# specific language governing permissions and limitations +# under the License. 
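# --- Illustrative sketch (not part of the patch): the deprecation-shim
# pattern that pyarrow/__init__.py uses above to keep pyarrow.HdfsClient
# working while pointing users at pyarrow.hdfs.connect. The real helper now
# lives in pyarrow.util; the names below are stand-ins.
import warnings


def _deprecate_class(old_name, new_name, klass, next_version='0.6.0'):
    msg = ('pyarrow.{0} has been renamed to {1}, will be removed in {2}'
           .format(old_name, new_name, next_version))

    def deprecated_factory(*args, **kwargs):
        # Emit a FutureWarning, then forward construction to the new object
        warnings.warn(msg, FutureWarning)
        return klass(*args, **kwargs)

    return deprecated_factory


class NewThing(object):
    pass


OldThing = _deprecate_class('OldThing', 'NewThing', NewThing)
obj = OldThing()   # still constructs a NewThing, but warns about the rename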
# cython: profile=False # distutils: language = c++ diff --git a/python/pyarrow/_parquet.pyx b/python/pyarrow/_parquet.pyx index bbe52033526..c940122da5d 100644 --- a/python/pyarrow/_parquet.pyx +++ b/python/pyarrow/_parquet.pyx @@ -542,6 +542,7 @@ cdef ParquetCompression compression_from_name(str name): cdef class ParquetWriter: cdef: unique_ptr[FileWriter] writer + shared_ptr[OutputStream] sink cdef readonly: object use_dictionary @@ -555,14 +556,13 @@ cdef class ParquetWriter: MemoryPool memory_pool=None, use_deprecated_int96_timestamps=False): cdef: shared_ptr[FileOutputStream] filestream - shared_ptr[OutputStream] sink shared_ptr[WriterProperties] properties if isinstance(where, six.string_types): check_status(FileOutputStream.Open(tobytes(where), &filestream)) - sink = filestream + self.sink = filestream else: - get_writer(where, &sink) + get_writer(where, &self.sink) self.use_dictionary = use_dictionary self.compression = compression @@ -582,7 +582,7 @@ cdef class ParquetWriter: check_status( FileWriter.Open(deref(schema.schema), maybe_unbox_memory_pool(memory_pool), - sink, properties, arrow_properties, + self.sink, properties, arrow_properties, &self.writer)) cdef void _set_int96_support(self, ArrowWriterProperties.Builder* props): @@ -629,11 +629,14 @@ cdef class ParquetWriter: cdef CTable* ctable = table.table if row_group_size is None or row_group_size == -1: - row_group_size = ctable.num_rows() + if ctable.num_rows() > 0: + row_group_size = ctable.num_rows() + else: + row_group_size = 1 elif row_group_size == 0: raise ValueError('Row group size cannot be 0') - cdef int c_row_group_size = row_group_size + cdef int64_t c_row_group_size = row_group_size with nogil: check_status(self.writer.get() diff --git a/python/pyarrow/array.pxi b/python/pyarrow/array.pxi index efbe36f80b3..67418aa5eac 100644 --- a/python/pyarrow/array.pxi +++ b/python/pyarrow/array.pxi @@ -89,6 +89,23 @@ def array(object sequence, DataType type=None, MemoryPool memory_pool=None, return pyarrow_wrap_array(sp_array) +def _normalize_slice(object arrow_obj, slice key): + cdef Py_ssize_t n = len(arrow_obj) + + start = key.start or 0 + while start < 0: + start += n + + stop = key.stop if key.stop is not None else n + while stop < 0: + stop += n + + step = key.step or 1 + if step != 1: + raise IndexError('only slices with step 1 supported') + else: + return arrow_obj.slice(start, stop - start) + cdef class Array: @@ -230,23 +247,10 @@ cdef class Array: raise NotImplemented def __getitem__(self, key): - cdef: - Py_ssize_t n = len(self) + cdef Py_ssize_t n = len(self) if PySlice_Check(key): - start = key.start or 0 - while start < 0: - start += n - - stop = key.stop if key.stop is not None else n - while stop < 0: - stop += n - - step = key.step or 1 - if step != 1: - raise IndexError('only slices with step 1 supported') - else: - return self.slice(start, stop - start) + return _normalize_slice(self, key) while key < 0: key += len(self) diff --git a/python/pyarrow/error.pxi b/python/pyarrow/error.pxi index 259aeb074e3..8a3f57d209a 100644 --- a/python/pyarrow/error.pxi +++ b/python/pyarrow/error.pxi @@ -48,6 +48,18 @@ class ArrowNotImplementedError(NotImplementedError, ArrowException): pass +class PlasmaObjectExists(ArrowException): + pass + + +class PlasmaObjectNonexistent(ArrowException): + pass + + +class PlasmaStoreFull(ArrowException): + pass + + cdef int check_status(const CStatus& status) nogil except -1: if status.ok(): return 0 @@ -66,5 +78,11 @@ cdef int check_status(const CStatus& status) nogil except -1: 
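# --- Illustrative sketch (not part of the patch): the slice normalization
# that _normalize_slice above performs before delegating to
# Array.slice(offset, length). FakeArray is a stand-in for a pyarrow Array.
def normalize_slice(arrow_obj, key):
    n = len(arrow_obj)

    start = key.start or 0
    while start < 0:
        start += n

    stop = key.stop if key.stop is not None else n
    while stop < 0:
        stop += n

    step = key.step or 1
    if step != 1:
        raise IndexError('only slices with step 1 supported')
    # Arrow expresses slices as (offset, length) rather than (start, stop)
    return arrow_obj.slice(start, stop - start)


class FakeArray(list):
    def slice(self, offset, length):
        return FakeArray(self[offset:offset + length])


assert normalize_slice(FakeArray([1, 2, 3, 4]), slice(-3, None)) == [2, 3, 4]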
raise ArrowNotImplementedError(message) elif status.IsTypeError(): raise ArrowTypeError(message) + elif status.IsPlasmaObjectExists(): + raise PlasmaObjectExists(message) + elif status.IsPlasmaObjectNonexistent(): + raise PlasmaObjectNonexistent(message) + elif status.IsPlasmaStoreFull(): + raise PlasmaStoreFull(message) else: raise ArrowException(message) diff --git a/python/pyarrow/filesystem.py b/python/pyarrow/filesystem.py index 9fa4f762777..8d2d8fcd342 100644 --- a/python/pyarrow/filesystem.py +++ b/python/pyarrow/filesystem.py @@ -17,15 +17,26 @@ from os.path import join as pjoin import os +import posixpath from pyarrow.util import implements -import pyarrow.lib as lib -class Filesystem(object): +class FileSystem(object): """ Abstract filesystem interface """ + def cat(self, path): + """ + Return contents of file as a bytes object + + Returns + ------- + contents : bytes + """ + with self.open(path, 'rb') as f: + return f.read() + def ls(self, path): """ Return list of file paths @@ -44,6 +55,68 @@ def delete(self, path, recursive=False): """ raise NotImplementedError + def disk_usage(self, path): + """ + Compute bytes used by all contents under indicated path in file tree + + Parameters + ---------- + path : string + Can be a file path or directory + + Returns + ------- + usage : int + """ + path_info = self.stat(path) + if path_info['kind'] == 'file': + return path_info['size'] + + total = 0 + for root, directories, files in self.walk(path): + for child_path in files: + abspath = self._path_join(root, child_path) + total += self.stat(abspath)['size'] + + return total + + def _path_join(self, *args): + return self.pathsep.join(args) + + def stat(self, path): + """ + + Returns + ------- + stat : dict + """ + raise NotImplementedError('FileSystem.stat') + + def rm(self, path, recursive=False): + """ + Alias for FileSystem.delete + """ + return self.delete(path, recursive=recursive) + + def mv(self, path, new_path): + """ + Alias for FileSystem.rename + """ + return self.rename(path, new_path) + + def rename(self, path, new_path): + """ + Rename file, like UNIX mv command + + Parameters + ---------- + path : string + Path to alter + new_path : string + Path to move to + """ + raise NotImplementedError('FileSystem.rename') + def mkdir(self, path, create_parents=True): raise NotImplementedError @@ -96,44 +169,51 @@ def read_parquet(self, path, columns=None, metadata=None, schema=None, return dataset.read(columns=columns, nthreads=nthreads, use_pandas_metadata=use_pandas_metadata) + def open(self, path, mode='rb'): + """ + Open file for reading or writing + """ + raise NotImplementedError + @property def pathsep(self): return '/' -class LocalFilesystem(Filesystem): +class LocalFileSystem(FileSystem): _instance = None @classmethod def get_instance(cls): if cls._instance is None: - cls._instance = LocalFilesystem() + cls._instance = LocalFileSystem() return cls._instance - @implements(Filesystem.ls) + @implements(FileSystem.ls) def ls(self, path): return sorted(pjoin(path, x) for x in os.listdir(path)) - @implements(Filesystem.mkdir) + @implements(FileSystem.mkdir) def mkdir(self, path, create_parents=True): if create_parents: os.makedirs(path) else: os.mkdir(path) - @implements(Filesystem.isdir) + @implements(FileSystem.isdir) def isdir(self, path): return os.path.isdir(path) - @implements(Filesystem.isfile) + @implements(FileSystem.isfile) def isfile(self, path): return os.path.isfile(path) - @implements(Filesystem.exists) + @implements(FileSystem.exists) def exists(self, path): return 
os.path.exists(path) + @implements(FileSystem.open) def open(self, path, mode='rb'): """ Open file for reading or writing @@ -144,68 +224,103 @@ def open(self, path, mode='rb'): def pathsep(self): return os.path.sep + def walk(self, top_dir): + """ + Directory tree generator, see os.walk + """ + return os.walk(top_dir) + -class HdfsClient(lib._HdfsClient, Filesystem): +class DaskFileSystem(FileSystem): """ - Connect to an HDFS cluster. All parameters are optional and should - only be set if the defaults need to be overridden. - - Authentication should be automatic if the HDFS cluster uses Kerberos. - However, if a username is specified, then the ticket cache will likely - be required. - - Parameters - ---------- - host : NameNode. Set to "default" for fs.defaultFS from core-site.xml. - port : NameNode's port. Set to 0 for default or logical (HA) nodes. - user : Username when connecting to HDFS; None implies login user. - kerb_ticket : Path to Kerberos ticket cache. - driver : {'libhdfs', 'libhdfs3'}, default 'libhdfs' - Connect using libhdfs (JNI-based) or libhdfs3 (3rd-party C++ - library from Pivotal Labs) - - Notes - ----- - The first time you call this method, it will take longer than usual due - to JNI spin-up time. - - Returns - ------- - client : HDFSClient + Wraps s3fs Dask filesystem implementation like s3fs, gcsfs, etc. """ - def __init__(self, host="default", port=0, user=None, kerb_ticket=None, - driver='libhdfs'): - self._connect(host, port, user, kerb_ticket, driver) + def __init__(self, fs): + self.fs = fs - @implements(Filesystem.isdir) + @implements(FileSystem.isdir) def isdir(self, path): - return lib._HdfsClient.isdir(self, path) + raise NotImplementedError("Unsupported file system API") - @implements(Filesystem.isfile) + @implements(FileSystem.isfile) def isfile(self, path): - return lib._HdfsClient.isfile(self, path) + raise NotImplementedError("Unsupported file system API") - @implements(Filesystem.delete) + @implements(FileSystem.delete) def delete(self, path, recursive=False): - return lib._HdfsClient.delete(self, path, recursive) + return self.fs.rm(path, recursive=recursive) - @implements(Filesystem.mkdir) - def mkdir(self, path, create_parents=True): - return lib._HdfsClient.mkdir(self, path) + @implements(FileSystem.mkdir) + def mkdir(self, path): + return self.fs.mkdir(path) + + @implements(FileSystem.open) + def open(self, path, mode='rb'): + """ + Open file for reading or writing + """ + return self.fs.open(path, mode=mode) - def ls(self, path, full_info=False): + def ls(self, path, detail=False): + return self.fs.ls(path, detail=detail) + + def walk(self, top_path): + """ + Directory tree generator, like os.walk """ - Retrieve directory contents and metadata, if requested. 
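# --- Illustrative usage sketch (not part of the patch) for the reworked
# FileSystem base class above: cat() reads a whole file through open(),
# rm()/mv() alias delete()/rename(), and disk_usage() walks the tree using
# stat() (implemented by the HDFS filesystem, not yet by the local one).
# The /tmp path is only an example.
from pyarrow.filesystem import LocalFileSystem

fs = LocalFileSystem.get_instance()

with open('/tmp/pyarrow-fs-example.bin', 'wb') as f:
    f.write(b'hello arrow')

print(fs.ls('/tmp'))                          # sorted child paths
print(fs.cat('/tmp/pyarrow-fs-example.bin'))  # b'hello arrow'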
+ return self.fs.walk(top_path) - Parameters - ---------- - path : HDFS path - full_info : boolean, default False - If False, only return list of paths - Returns - ------- - result : list of dicts (full_info=True) or strings (full_info=False) +class S3FSWrapper(DaskFileSystem): + + @implements(FileSystem.isdir) + def isdir(self, path): + try: + contents = self.fs.ls(path) + if len(contents) == 1 and contents[0] == path: + return False + else: + return True + except OSError: + return False + + @implements(FileSystem.isfile) + def isfile(self, path): + try: + contents = self.fs.ls(path) + return len(contents) == 1 and contents[0] == path + except OSError: + return False + + def walk(self, path, refresh=False): + """ + Directory tree generator, like os.walk + + Generator version of what is in s3fs, which yields a flattened list of + files """ - return lib._HdfsClient.ls(self, path, full_info) + path = path.replace('s3://', '') + directories = set() + files = set() + + for key in list(self.fs._ls(path, refresh=refresh)): + path = key['Key'] + if key['StorageClass'] == 'DIRECTORY': + directories.add(path) + elif key['StorageClass'] == 'BUCKET': + pass + else: + files.add(path) + + # s3fs creates duplicate 'DIRECTORY' entries + files = sorted([posixpath.split(f)[1] for f in files + if f not in directories]) + directories = sorted([posixpath.split(x)[1] + for x in directories]) + + yield path, directories, files + + for directory in directories: + for tup in self.walk(directory, refresh=refresh): + yield tup diff --git a/python/pyarrow/hdfs.py b/python/pyarrow/hdfs.py new file mode 100644 index 00000000000..855cc1e76bd --- /dev/null +++ b/python/pyarrow/hdfs.py @@ -0,0 +1,137 @@ +# Licensed to the Apache Software Foundation (ASF) under one +# or more contributor license agreements. See the NOTICE file +# distributed with this work for additional information +# regarding copyright ownership. The ASF licenses this file +# to you under the Apache License, Version 2.0 (the +# "License"); you may not use this file except in compliance +# with the License. You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, +# software distributed under the License is distributed on an +# "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY +# KIND, either express or implied. See the License for the +# specific language governing permissions and limitations +# under the License. + +import posixpath + +from pyarrow.util import implements +from pyarrow.filesystem import FileSystem +import pyarrow.lib as lib + + +class HadoopFileSystem(lib.HadoopFileSystem, FileSystem): + """ + FileSystem interface for HDFS cluster. 
See pyarrow.hdfs.connect for full + connection details + """ + + def __init__(self, host="default", port=0, user=None, kerb_ticket=None, + driver='libhdfs'): + self._connect(host, port, user, kerb_ticket, driver) + + @implements(FileSystem.isdir) + def isdir(self, path): + return super(HadoopFileSystem, self).isdir(path) + + @implements(FileSystem.isfile) + def isfile(self, path): + return super(HadoopFileSystem, self).isfile(path) + + @implements(FileSystem.delete) + def delete(self, path, recursive=False): + return super(HadoopFileSystem, self).delete(path, recursive) + + @implements(FileSystem.mkdir) + def mkdir(self, path, create_parents=True): + return super(HadoopFileSystem, self).mkdir(path) + + @implements(FileSystem.rename) + def rename(self, path, new_path): + return super(HadoopFileSystem, self).rename(path, new_path) + + def ls(self, path, detail=False): + """ + Retrieve directory contents and metadata, if requested. + + Parameters + ---------- + path : HDFS path + detail : boolean, default False + If False, only return list of paths + + Returns + ------- + result : list of dicts (detail=True) or strings (detail=False) + """ + return super(HadoopFileSystem, self).ls(path, detail) + + def walk(self, top_path): + """ + Directory tree generator for HDFS, like os.walk + + Parameters + ---------- + top_path : string + Root directory for tree traversal + + Returns + ------- + Generator yielding 3-tuple (dirpath, dirnames, filename) + """ + contents = self.ls(top_path, detail=True) + + directories, files = _libhdfs_walk_files_dirs(top_path, contents) + yield top_path, directories, files + for dirname in directories: + for tup in self.walk(self._path_join(top_path, dirname)): + yield tup + + +def _libhdfs_walk_files_dirs(top_path, contents): + files = [] + directories = [] + for c in contents: + scrubbed_name = posixpath.split(c['name'])[1] + if c['kind'] == 'file': + files.append(scrubbed_name) + else: + directories.append(scrubbed_name) + + return directories, files + + +def connect(host="default", port=0, user=None, kerb_ticket=None, + driver='libhdfs'): + """ + Connect to an HDFS cluster. All parameters are optional and should + only be set if the defaults need to be overridden. + + Authentication should be automatic if the HDFS cluster uses Kerberos. + However, if a username is specified, then the ticket cache will likely + be required. + + Parameters + ---------- + host : NameNode. Set to "default" for fs.defaultFS from core-site.xml. + port : NameNode's port. Set to 0 for default or logical (HA) nodes. + user : Username when connecting to HDFS; None implies login user. + kerb_ticket : Path to Kerberos ticket cache. + driver : {'libhdfs', 'libhdfs3'}, default 'libhdfs' + Connect using libhdfs (JNI-based) or libhdfs3 (3rd-party C++ + library from Apache HAWQ (incubating) ) + + Notes + ----- + The first time you call this method, it will take longer than usual due + to JNI spin-up time. 
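# --- Illustrative usage sketch (not part of the patch) for the new
# pyarrow.hdfs module defined above; the host, port and paths are
# placeholders and assume a reachable HDFS cluster with libhdfs available.
import pyarrow.hdfs as hdfs

fs = hdfs.connect(host='default', port=0)     # returns a HadoopFileSystem

print(fs.ls('/user', detail=False))           # plain listing of child paths
for dirpath, dirnames, filenames in fs.walk('/user'):
    print(dirpath, dirnames, filenames)       # os.walk-style traversal

fs.close()                                    # disconnect from the cluster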
+ + Returns + ------- + filesystem : HadoopFileSystem + """ + fs = HadoopFileSystem(host=host, port=port, user=user, + kerb_ticket=kerb_ticket, driver=driver) + return fs diff --git a/python/pyarrow/includes/common.pxd b/python/pyarrow/includes/common.pxd index 3487d48ce9b..637a133afb0 100644 --- a/python/pyarrow/includes/common.pxd +++ b/python/pyarrow/includes/common.pxd @@ -50,6 +50,9 @@ cdef extern from "arrow/api.h" namespace "arrow" nogil: c_bool IsKeyError() c_bool IsNotImplemented() c_bool IsTypeError() + c_bool IsPlasmaObjectExists() + c_bool IsPlasmaObjectNonexistent() + c_bool IsPlasmaStoreFull() cdef inline object PyObject_to_object(PyObject* o): diff --git a/python/pyarrow/includes/libarrow.pxd b/python/pyarrow/includes/libarrow.pxd index edf50ad54e7..8d7e27915ee 100644 --- a/python/pyarrow/includes/libarrow.pxd +++ b/python/pyarrow/includes/libarrow.pxd @@ -148,9 +148,15 @@ cdef extern from "arrow/api.h" namespace "arrow" nogil: CLoggingMemoryPool(CMemoryPool*) cdef cppclass CBuffer" arrow::Buffer": + CBuffer(const uint8_t* data, int64_t size) uint8_t* data() int64_t size() shared_ptr[CBuffer] parent() + c_bool is_mutable() const + + cdef cppclass CMutableBuffer" arrow::MutableBuffer"(CBuffer): + CMutableBuffer(const uint8_t* data, int64_t size) + uint8_t* mutable_data() cdef cppclass ResizableBuffer(CBuffer): CStatus Resize(int64_t nbytes) @@ -363,7 +369,7 @@ cdef extern from "arrow/api.h" namespace "arrow" nogil: shared_ptr[CTable]* table) int num_columns() - int num_rows() + int64_t num_rows() c_bool Equals(const CTable& other) @@ -407,6 +413,10 @@ cdef extern from "arrow/io/interfaces.h" namespace "arrow::io" nogil: ObjectType_FILE" arrow::io::ObjectType::FILE" ObjectType_DIRECTORY" arrow::io::ObjectType::DIRECTORY" + cdef cppclass FileStatistics: + int64_t size + ObjectType kind + cdef cppclass FileInterface: CStatus Close() CStatus Tell(int64_t* position) @@ -444,6 +454,9 @@ cdef extern from "arrow/io/interfaces.h" namespace "arrow::io" nogil: WriteableFile): pass + cdef cppclass FileSystem: + CStatus Stat(const c_string& path, FileStatistics* stat) + cdef extern from "arrow/io/file.h" namespace "arrow::io" nogil: @@ -511,10 +524,10 @@ cdef extern from "arrow/io/hdfs.h" namespace "arrow::io" nogil: cdef cppclass HdfsOutputStream(OutputStream): pass - cdef cppclass CHdfsClient" arrow::io::HdfsClient": + cdef cppclass CHadoopFileSystem" arrow::io::HadoopFileSystem"(FileSystem): @staticmethod CStatus Connect(const HdfsConnectionConfig* config, - shared_ptr[CHdfsClient]* client) + shared_ptr[CHadoopFileSystem]* client) CStatus MakeDirectory(const c_string& path) @@ -524,6 +537,10 @@ cdef extern from "arrow/io/hdfs.h" namespace "arrow::io" nogil: c_bool Exists(const c_string& path) + CStatus Chmod(const c_string& path, int mode) + CStatus Chown(const c_string& path, const char* owner, + const char* group) + CStatus GetCapacity(int64_t* nbytes) CStatus GetUsed(int64_t* nbytes) @@ -558,6 +575,9 @@ cdef extern from "arrow/io/memory.h" namespace "arrow::io" nogil: CMockOutputStream() int64_t GetExtentBytesWritten() + cdef cppclass CFixedSizeBufferWriter" arrow::io::FixedSizeBufferWriter"(WriteableFile): + CFixedSizeBufferWriter(const shared_ptr[CBuffer]& buffer) + cdef extern from "arrow/ipc/api.h" namespace "arrow::ipc" nogil: enum MessageType" arrow::ipc::Message::Type": diff --git a/python/pyarrow/io-hdfs.pxi b/python/pyarrow/io-hdfs.pxi new file mode 100644 index 00000000000..8ac4e8c2319 --- /dev/null +++ b/python/pyarrow/io-hdfs.pxi @@ -0,0 +1,468 @@ +# Licensed to the 
Apache Software Foundation (ASF) under one +# or more contributor license agreements. See the NOTICE file +# distributed with this work for additional information +# regarding copyright ownership. The ASF licenses this file +# to you under the Apache License, Version 2.0 (the +# "License"); you may not use this file except in compliance +# with the License. You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, +# software distributed under the License is distributed on an +# "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY +# KIND, either express or implied. See the License for the +# specific language governing permissions and limitations +# under the License. + +# ---------------------------------------------------------------------- +# HDFS IO implementation + +_HDFS_PATH_RE = re.compile('hdfs://(.*):(\d+)(.*)') + +try: + # Python 3 + from queue import Queue, Empty as QueueEmpty, Full as QueueFull +except ImportError: + from Queue import Queue, Empty as QueueEmpty, Full as QueueFull + + +def have_libhdfs(): + try: + check_status(HaveLibHdfs()) + return True + except: + return False + + +def have_libhdfs3(): + try: + check_status(HaveLibHdfs3()) + return True + except: + return False + + +def strip_hdfs_abspath(path): + m = _HDFS_PATH_RE.match(path) + if m: + return m.group(3) + else: + return path + + +cdef class HadoopFileSystem: + cdef: + shared_ptr[CHadoopFileSystem] client + + cdef readonly: + bint is_open + + def __cinit__(self): + pass + + def _connect(self, host, port, user, kerb_ticket, driver): + cdef HdfsConnectionConfig conf + + if host is not None: + conf.host = tobytes(host) + conf.port = port + if user is not None: + conf.user = tobytes(user) + if kerb_ticket is not None: + conf.kerb_ticket = tobytes(kerb_ticket) + + if driver == 'libhdfs': + check_status(HaveLibHdfs()) + conf.driver = HdfsDriver_LIBHDFS + else: + check_status(HaveLibHdfs3()) + conf.driver = HdfsDriver_LIBHDFS3 + + with nogil: + check_status(CHadoopFileSystem.Connect(&conf, &self.client)) + self.is_open = True + + @classmethod + def connect(cls, *args, **kwargs): + return cls(*args, **kwargs) + + def __dealloc__(self): + if self.is_open: + self.close() + + def close(self): + """ + Disconnect from the HDFS cluster + """ + self._ensure_client() + with nogil: + check_status(self.client.get().Disconnect()) + self.is_open = False + + cdef _ensure_client(self): + if self.client.get() == NULL: + raise IOError('HDFS client improperly initialized') + elif not self.is_open: + raise IOError('HDFS client is closed') + + def exists(self, path): + """ + Returns True if the path is known to the cluster, False if it does not + (or there is an RPC error) + """ + self._ensure_client() + + cdef c_string c_path = tobytes(path) + cdef c_bool result + with nogil: + result = self.client.get().Exists(c_path) + return result + + def isdir(self, path): + cdef HdfsPathInfo info + self._path_info(path, &info) + return info.kind == ObjectType_DIRECTORY + + def isfile(self, path): + cdef HdfsPathInfo info + self._path_info(path, &info) + return info.kind == ObjectType_FILE + + def get_capacity(self): + """ + Get reported total capacity of file system + + Returns + ------- + capacity : int + """ + cdef int64_t capacity = 0 + with nogil: + check_status(self.client.get().GetCapacity(&capacity)) + return capacity + + def get_space_used(self): + """ + Get space used on file system + + Returns + ------- + space_used : int + """ + cdef 
int64_t space_used = 0 + with nogil: + check_status(self.client.get().GetUsed(&space_used)) + return space_used + + def df(self): + """ + Return free space on disk, like the UNIX df command + + Returns + ------- + space : int + """ + return self.get_capacity() - self.get_space_used() + + def rename(self, path, new_path): + cdef c_string c_path = tobytes(path) + cdef c_string c_new_path = tobytes(new_path) + with nogil: + check_status(self.client.get().Rename(c_path, c_new_path)) + + def info(self, path): + """ + Return detailed HDFS information for path + + Parameters + ---------- + path : string + Path to file or directory + + Returns + ------- + path_info : dict + """ + cdef HdfsPathInfo info + self._path_info(path, &info) + return { + 'path': frombytes(info.name), + 'owner': frombytes(info.owner), + 'group': frombytes(info.group), + 'size': info.size, + 'block_size': info.block_size, + 'last_modified': info.last_modified_time, + 'last_accessed': info.last_access_time, + 'replication': info.replication, + 'permissions': info.permissions, + 'kind': ('directory' if info.kind == ObjectType_DIRECTORY + else 'file') + } + + def stat(self, path): + """ + Return basic file system statistics about path + + Parameters + ---------- + path : string + Path to file or directory + + Returns + ------- + stat : dict + """ + cdef FileStatistics info + cdef c_string c_path = tobytes(path) + with nogil: + check_status(self.client.get() + .Stat(c_path, &info)) + return { + 'size': info.size, + 'kind': ('directory' if info.kind == ObjectType_DIRECTORY + else 'file') + } + + cdef _path_info(self, path, HdfsPathInfo* info): + cdef c_string c_path = tobytes(path) + + with nogil: + check_status(self.client.get() + .GetPathInfo(c_path, info)) + + + def ls(self, path, bint full_info): + cdef: + c_string c_path = tobytes(path) + vector[HdfsPathInfo] listing + list results = [] + int i + + self._ensure_client() + + with nogil: + check_status(self.client.get() + .ListDirectory(c_path, &listing)) + + cdef const HdfsPathInfo* info + for i in range( listing.size()): + info = &listing[i] + + # Try to trim off the hdfs://HOST:PORT piece + name = strip_hdfs_abspath(frombytes(info.name)) + + if full_info: + kind = ('file' if info.kind == ObjectType_FILE + else 'directory') + + results.append({ + 'kind': kind, + 'name': name, + 'owner': frombytes(info.owner), + 'group': frombytes(info.group), + 'list_modified_time': info.last_modified_time, + 'list_access_time': info.last_access_time, + 'size': info.size, + 'replication': info.replication, + 'block_size': info.block_size, + 'permissions': info.permissions + }) + else: + results.append(name) + + return results + + def chmod(self, path, mode): + """ + Change file permissions + + Parameters + ---------- + path : string + absolute path to file or directory + mode : int + POSIX-like bitmask + """ + self._ensure_client() + cdef c_string c_path = tobytes(path) + cdef int c_mode = mode + with nogil: + check_status(self.client.get() + .Chmod(c_path, c_mode)) + + def chown(self, path, owner=None, group=None): + """ + Change file permissions + + Parameters + ---------- + path : string + absolute path to file or directory + owner : string, default None + New owner, None for no change + group : string, default None + New group, None for no change + """ + cdef: + c_string c_path + c_string c_owner + c_string c_group + const char* c_owner_ptr = NULL + const char* c_group_ptr = NULL + + self._ensure_client() + + c_path = tobytes(path) + if owner is not None: + c_owner = tobytes(owner) + 
c_owner_ptr = c_owner.c_str() + + if group is not None: + c_group = tobytes(group) + c_group_ptr = c_group.c_str() + + with nogil: + check_status(self.client.get() + .Chown(c_path, c_owner_ptr, c_group_ptr)) + + def mkdir(self, path): + """ + Create indicated directory and any necessary parent directories + """ + self._ensure_client() + cdef c_string c_path = tobytes(path) + with nogil: + check_status(self.client.get() + .MakeDirectory(c_path)) + + def delete(self, path, bint recursive=False): + """ + Delete the indicated file or directory + + Parameters + ---------- + path : string + recursive : boolean, default False + If True, also delete child paths for directories + """ + self._ensure_client() + + cdef c_string c_path = tobytes(path) + with nogil: + check_status(self.client.get() + .Delete(c_path, recursive == 1)) + + def open(self, path, mode='rb', buffer_size=None, replication=None, + default_block_size=None): + """ + Open HDFS file for reading or writing + + Parameters + ---------- + mode : string + Must be one of 'rb', 'wb', 'ab' + + Returns + ------- + handle : HdfsFile + """ + self._ensure_client() + + cdef HdfsFile out = HdfsFile() + + if mode not in ('rb', 'wb', 'ab'): + raise Exception("Mode must be 'rb' (read), " + "'wb' (write, new file), or 'ab' (append)") + + cdef c_string c_path = tobytes(path) + cdef c_bool append = False + + # 0 in libhdfs means "use the default" + cdef int32_t c_buffer_size = buffer_size or 0 + cdef int16_t c_replication = replication or 0 + cdef int64_t c_default_block_size = default_block_size or 0 + + cdef shared_ptr[HdfsOutputStream] wr_handle + cdef shared_ptr[HdfsReadableFile] rd_handle + + if mode in ('wb', 'ab'): + if mode == 'ab': + append = True + + with nogil: + check_status( + self.client.get() + .OpenWriteable(c_path, append, c_buffer_size, + c_replication, c_default_block_size, + &wr_handle)) + + out.wr_file = wr_handle + + out.is_readable = False + out.is_writeable = 1 + else: + with nogil: + check_status(self.client.get() + .OpenReadable(c_path, &rd_handle)) + + out.rd_file = rd_handle + out.is_readable = True + out.is_writeable = 0 + + if c_buffer_size == 0: + c_buffer_size = 2 ** 16 + + out.mode = mode + out.buffer_size = c_buffer_size + out.parent = _HdfsFileNanny(self, out) + out.is_open = True + out.own_file = True + + return out + + def download(self, path, stream, buffer_size=None): + with self.open(path, 'rb') as f: + f.download(stream, buffer_size=buffer_size) + + def upload(self, path, stream, buffer_size=None): + """ + Upload file-like object to HDFS path + """ + with self.open(path, 'wb') as f: + f.upload(stream, buffer_size=buffer_size) + + +# ARROW-404: Helper class to ensure that files are closed before the +# client. 
During deallocation of the extension class, the attributes are +# decref'd which can cause the client to get closed first if the file has the +# last remaining reference +cdef class _HdfsFileNanny: + cdef: + object client + object file_handle_ref + + def __cinit__(self, client, file_handle): + import weakref + self.client = client + self.file_handle_ref = weakref.ref(file_handle) + + def __dealloc__(self): + fh = self.file_handle_ref() + if fh: + fh.close() + # avoid cyclic GC + self.file_handle_ref = None + self.client = None + + +cdef class HdfsFile(NativeFile): + cdef readonly: + int32_t buffer_size + object mode + object parent + + cdef object __weakref__ + + def __dealloc__(self): + self.parent = None diff --git a/python/pyarrow/io.pxi b/python/pyarrow/io.pxi index 8b213a33053..211c2a3e6e9 100644 --- a/python/pyarrow/io.pxi +++ b/python/pyarrow/io.pxi @@ -106,6 +106,9 @@ cdef class NativeFile: raise IOError("file not open") def size(self): + """ + Return file size + """ cdef int64_t size self._assert_readable() with nogil: @@ -113,6 +116,9 @@ cdef class NativeFile: return size def tell(self): + """ + Return current stream position + """ cdef int64_t position with nogil: if self.is_readable: @@ -121,10 +127,46 @@ cdef class NativeFile: check_status(self.wr_file.get().Tell(&position)) return position - def seek(self, int64_t position): + def seek(self, int64_t position, int whence=0): + """ + Change current file stream position + + Parameters + ---------- + position : int + Byte offset, interpreted relative to value of whence argument + whence : int, default 0 + Point of reference for seek offset + + Notes + ----- + Values of whence: + * 0 -- start of stream (the default); offset should be zero or positive + * 1 -- current stream position; offset may be negative + * 2 -- end of stream; offset is usually negative + + Returns + ------- + new_position : the new absolute stream position + """ + cdef int64_t offset self._assert_readable() with nogil: - check_status(self.rd_file.get().Seek(position)) + if whence == 0: + offset = position + elif whence == 1: + check_status(self.rd_file.get().Tell(&offset)) + offset = offset + position + elif whence == 2: + check_status(self.rd_file.get().GetSize(&offset)) + offset = offset + position + else: + with gil: + raise ValueError("Invalid value of whence: {0}" + .format(whence)) + check_status(self.rd_file.get().Seek(offset)) + + return self.tell() def write(self, data): """ @@ -144,6 +186,18 @@ cdef class NativeFile: check_status(self.wr_file.get().Write(buf, bufsize)) def read(self, nbytes=None): + """ + Read indicated number of bytes from file, or read all remaining bytes + if no argument passed + + Parameters + ---------- + nbytes : int, default None + + Returns + ------- + data : bytes + """ cdef: int64_t c_nbytes int64_t bytes_read = 0 @@ -473,6 +527,15 @@ cdef class OSFile(NativeFile): self.wr_file = handle +cdef class FixedSizeBufferOutputStream(NativeFile): + + def __cinit__(self, Buffer buffer): + self.wr_file.reset(new CFixedSizeBufferWriter(buffer.buffer)) + self.is_readable = 0 + self.is_writeable = 1 + self.is_open = True + + # ---------------------------------------------------------------------- # Arrow buffers @@ -523,7 +586,10 @@ cdef class Buffer: buffer.len = self.size buffer.ndim = 1 buffer.obj = self - buffer.readonly = 1 + if self.buffer.get().is_mutable(): + buffer.readonly = 0 + else: + buffer.readonly = 1 buffer.shape = self.shape buffer.strides = self.strides buffer.suboffsets = NULL @@ -540,6 +606,15 @@ cdef class Buffer: 
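# --- Illustrative sketch (not part of the patch) of the whence-aware seek()
# added to NativeFile above: whence=0 seeks from the start, 1 from the current
# position, 2 from the end of the stream. A BufferReader over an in-memory
# buffer is used here purely as a convenient readable NativeFile.
import pyarrow as pa

f = pa.BufferReader(pa.frombuffer(b'0123456789'))

f.seek(4)           # whence=0 (default): absolute position 4
f.seek(2, 1)        # whence=1: 4 + 2 -> position 6
f.seek(-3, 2)       # whence=2: size 10 - 3 -> position 7
print(f.tell())     # 7
print(f.read())     # b'789'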
p[0] = self.buffer.get().data() return self.size + def __getwritebuffer__(self, Py_ssize_t idx, void **p): + if not self.buffer.get().is_mutable(): + raise SystemError("trying to write an immutable buffer") + if idx != 0: + raise SystemError("accessing non-existent buffer segment") + if p != NULL: + p[0] = self.buffer.get().data() + return self.size + cdef shared_ptr[PoolBuffer] allocate_buffer(CMemoryPool* pool): cdef shared_ptr[PoolBuffer] result @@ -658,301 +733,3 @@ cdef get_writer(object source, shared_ptr[OutputStream]* writer): else: raise TypeError('Unable to read from object of type: {0}' .format(type(source))) - -# ---------------------------------------------------------------------- -# HDFS IO implementation - -_HDFS_PATH_RE = re.compile('hdfs://(.*):(\d+)(.*)') - -try: - # Python 3 - from queue import Queue, Empty as QueueEmpty, Full as QueueFull -except ImportError: - from Queue import Queue, Empty as QueueEmpty, Full as QueueFull - - -def have_libhdfs(): - try: - check_status(HaveLibHdfs()) - return True - except: - return False - - -def have_libhdfs3(): - try: - check_status(HaveLibHdfs3()) - return True - except: - return False - - -def strip_hdfs_abspath(path): - m = _HDFS_PATH_RE.match(path) - if m: - return m.group(3) - else: - return path - - -cdef class _HdfsClient: - cdef: - shared_ptr[CHdfsClient] client - - cdef readonly: - bint is_open - - def __cinit__(self): - pass - - def _connect(self, host, port, user, kerb_ticket, driver): - cdef HdfsConnectionConfig conf - - if host is not None: - conf.host = tobytes(host) - conf.port = port - if user is not None: - conf.user = tobytes(user) - if kerb_ticket is not None: - conf.kerb_ticket = tobytes(kerb_ticket) - - if driver == 'libhdfs': - check_status(HaveLibHdfs()) - conf.driver = HdfsDriver_LIBHDFS - else: - check_status(HaveLibHdfs3()) - conf.driver = HdfsDriver_LIBHDFS3 - - with nogil: - check_status(CHdfsClient.Connect(&conf, &self.client)) - self.is_open = True - - @classmethod - def connect(cls, *args, **kwargs): - return cls(*args, **kwargs) - - def __dealloc__(self): - if self.is_open: - self.close() - - def close(self): - """ - Disconnect from the HDFS cluster - """ - self._ensure_client() - with nogil: - check_status(self.client.get().Disconnect()) - self.is_open = False - - cdef _ensure_client(self): - if self.client.get() == NULL: - raise IOError('HDFS client improperly initialized') - elif not self.is_open: - raise IOError('HDFS client is closed') - - def exists(self, path): - """ - Returns True if the path is known to the cluster, False if it does not - (or there is an RPC error) - """ - self._ensure_client() - - cdef c_string c_path = tobytes(path) - cdef c_bool result - with nogil: - result = self.client.get().Exists(c_path) - return result - - def isdir(self, path): - cdef HdfsPathInfo info - self._path_info(path, &info) - return info.kind == ObjectType_DIRECTORY - - def isfile(self, path): - cdef HdfsPathInfo info - self._path_info(path, &info) - return info.kind == ObjectType_FILE - - cdef _path_info(self, path, HdfsPathInfo* info): - cdef c_string c_path = tobytes(path) - - with nogil: - check_status(self.client.get() - .GetPathInfo(c_path, info)) - - - def ls(self, path, bint full_info): - cdef: - c_string c_path = tobytes(path) - vector[HdfsPathInfo] listing - list results = [] - int i - - self._ensure_client() - - with nogil: - check_status(self.client.get() - .ListDirectory(c_path, &listing)) - - cdef const HdfsPathInfo* info - for i in range( listing.size()): - info = &listing[i] - - # Try to 
trim off the hdfs://HOST:PORT piece - name = strip_hdfs_abspath(frombytes(info.name)) - - if full_info: - kind = ('file' if info.kind == ObjectType_FILE - else 'directory') - - results.append({ - 'kind': kind, - 'name': name, - 'owner': frombytes(info.owner), - 'group': frombytes(info.group), - 'list_modified_time': info.last_modified_time, - 'list_access_time': info.last_access_time, - 'size': info.size, - 'replication': info.replication, - 'block_size': info.block_size, - 'permissions': info.permissions - }) - else: - results.append(name) - - return results - - def mkdir(self, path): - """ - Create indicated directory and any necessary parent directories - """ - self._ensure_client() - - cdef c_string c_path = tobytes(path) - with nogil: - check_status(self.client.get() - .MakeDirectory(c_path)) - - def delete(self, path, bint recursive=False): - """ - Delete the indicated file or directory - - Parameters - ---------- - path : string - recursive : boolean, default False - If True, also delete child paths for directories - """ - self._ensure_client() - - cdef c_string c_path = tobytes(path) - with nogil: - check_status(self.client.get() - .Delete(c_path, recursive)) - - def open(self, path, mode='rb', buffer_size=None, replication=None, - default_block_size=None): - """ - Parameters - ---------- - mode : string, 'rb', 'wb', 'ab' - """ - self._ensure_client() - - cdef HdfsFile out = HdfsFile() - - if mode not in ('rb', 'wb', 'ab'): - raise Exception("Mode must be 'rb' (read), " - "'wb' (write, new file), or 'ab' (append)") - - cdef c_string c_path = tobytes(path) - cdef c_bool append = False - - # 0 in libhdfs means "use the default" - cdef int32_t c_buffer_size = buffer_size or 0 - cdef int16_t c_replication = replication or 0 - cdef int64_t c_default_block_size = default_block_size or 0 - - cdef shared_ptr[HdfsOutputStream] wr_handle - cdef shared_ptr[HdfsReadableFile] rd_handle - - if mode in ('wb', 'ab'): - if mode == 'ab': - append = True - - with nogil: - check_status( - self.client.get() - .OpenWriteable(c_path, append, c_buffer_size, - c_replication, c_default_block_size, - &wr_handle)) - - out.wr_file = wr_handle - - out.is_readable = False - out.is_writeable = 1 - else: - with nogil: - check_status(self.client.get() - .OpenReadable(c_path, &rd_handle)) - - out.rd_file = rd_handle - out.is_readable = True - out.is_writeable = 0 - - if c_buffer_size == 0: - c_buffer_size = 2 ** 16 - - out.mode = mode - out.buffer_size = c_buffer_size - out.parent = _HdfsFileNanny(self, out) - out.is_open = True - out.own_file = True - - return out - - def download(self, path, stream, buffer_size=None): - with self.open(path, 'rb') as f: - f.download(stream, buffer_size=buffer_size) - - def upload(self, path, stream, buffer_size=None): - """ - Upload file-like object to HDFS path - """ - with self.open(path, 'wb') as f: - f.upload(stream, buffer_size=buffer_size) - - -# ARROW-404: Helper class to ensure that files are closed before the -# client. 
During deallocation of the extension class, the attributes are -# decref'd which can cause the client to get closed first if the file has the -# last remaining reference -cdef class _HdfsFileNanny: - cdef: - object client - object file_handle_ref - - def __cinit__(self, client, file_handle): - import weakref - self.client = client - self.file_handle_ref = weakref.ref(file_handle) - - def __dealloc__(self): - fh = self.file_handle_ref() - if fh: - fh.close() - # avoid cyclic GC - self.file_handle_ref = None - self.client = None - - -cdef class HdfsFile(NativeFile): - cdef readonly: - int32_t buffer_size - object mode - object parent - - cdef object __weakref__ - - def __dealloc__(self): - self.parent = None diff --git a/python/pyarrow/lib.pyx b/python/pyarrow/lib.pyx index 59903088817..4df2fcd64f6 100644 --- a/python/pyarrow/lib.pyx +++ b/python/pyarrow/lib.pyx @@ -114,6 +114,7 @@ include "table.pxi" # File IO include "io.pxi" +include "io-hdfs.pxi" # IPC / Messaging include "ipc.pxi" diff --git a/python/pyarrow/pandas_compat.py b/python/pyarrow/pandas_compat.py index 9b2a5c4c60d..cd7ad477826 100644 --- a/python/pyarrow/pandas_compat.py +++ b/python/pyarrow/pandas_compat.py @@ -155,7 +155,7 @@ def index_level_name(index, i): return '__index_level_{:d}__'.format(i) -def construct_metadata(df, index_levels, preserve_index, types): +def construct_metadata(df, column_names, index_levels, preserve_index, types): """Returns a dictionary containing enough metadata to reconstruct a pandas DataFrame as an Arrow Table, including index columns. @@ -170,41 +170,77 @@ def construct_metadata(df, index_levels, preserve_index, types): ------- dict """ - ncolumns = len(df.columns) + ncolumns = len(column_names) df_types = types[:ncolumns] index_types = types[ncolumns:ncolumns + len(index_levels)] + + column_metadata = [ + get_column_metadata(df[col_name], name=sanitized_name, + arrow_type=arrow_type) + for col_name, sanitized_name, arrow_type in + zip(df.columns, column_names, df_types) + ] + + if preserve_index: + index_column_names = [index_level_name(level, i) + for i, level in enumerate(index_levels)] + index_column_metadata = [ + get_column_metadata(level, name=index_level_name(level, i), + arrow_type=arrow_type) + for i, (level, arrow_type) in enumerate(zip(index_levels, + index_types)) + ] + else: + index_column_names = index_column_metadata = [] + return { - b'pandas': json.dumps( - { - 'index_columns': [ - index_level_name(level, i) - for i, level in enumerate(index_levels) - ] if preserve_index else [], - 'columns': [ - get_column_metadata( - df[name], - name=name, - arrow_type=arrow_type - ) - for name, arrow_type in zip(df.columns, df_types) - ] + ( - [ - get_column_metadata( - level, - name=index_level_name(level, i), - arrow_type=arrow_type - ) - for i, (level, arrow_type) in enumerate( - zip(index_levels, index_types) - ) - ] if preserve_index else [] - ), - 'pandas_version': pd.__version__, - } - ).encode('utf8') + b'pandas': json.dumps({ + 'index_columns': index_column_names, + 'columns': column_metadata + index_column_metadata, + 'pandas_version': pd.__version__ + }).encode('utf8') } +def dataframe_to_arrays(df, timestamps_to_ms, schema, preserve_index): + names = [] + arrays = [] + index_columns = [] + types = [] + type = None + + if preserve_index: + n = len(getattr(df.index, 'levels', [df.index])) + index_columns.extend(df.index.get_level_values(i) for i in range(n)) + + for name in df.columns: + col = df[name] + if not isinstance(name, six.string_types): + name = str(name) + + if 
schema is not None: + field = schema.field_by_name(name) + type = getattr(field, "type", None) + + array = pa.Array.from_pandas( + col, type=type, timestamps_to_ms=timestamps_to_ms + ) + arrays.append(array) + names.append(name) + types.append(array.type) + + for i, column in enumerate(index_columns): + array = pa.Array.from_pandas(column, timestamps_to_ms=timestamps_to_ms) + arrays.append(array) + names.append(index_level_name(column, i)) + types.append(array.type) + + metadata = construct_metadata( + df, names, index_columns, preserve_index, types + ) + return names, arrays, metadata + + def table_to_blockmanager(table, nthreads=1): import pandas.core.internals as _int from pyarrow.compat import DatetimeTZDtype diff --git a/python/pyarrow/parquet.py b/python/pyarrow/parquet.py index fea73978e3e..6d39a2354f6 100644 --- a/python/pyarrow/parquet.py +++ b/python/pyarrow/parquet.py @@ -22,7 +22,7 @@ import numpy as np -from pyarrow.filesystem import LocalFilesystem +from pyarrow.filesystem import FileSystem, LocalFileSystem from pyarrow._parquet import (ParquetReader, FileMetaData, # noqa RowGroupMetaData, ParquetSchema, ParquetWriter) @@ -403,7 +403,7 @@ class ParquetManifest(object): """ def __init__(self, dirpath, filesystem=None, pathsep='/', partition_scheme='hive'): - self.filesystem = filesystem or LocalFilesystem.get_instance() + self.filesystem = filesystem or LocalFileSystem.get_instance() self.pathsep = pathsep self.dirpath = dirpath self.partition_scheme = partition_scheme @@ -416,40 +416,41 @@ def __init__(self, dirpath, filesystem=None, pathsep='/', self._visit_level(0, self.dirpath, []) def _visit_level(self, level, base_path, part_keys): - directories = [] - files = [] fs = self.filesystem - if not fs.isdir(base_path): - raise ValueError('"{0}" is not a directory'.format(base_path)) - - for path in sorted(fs.ls(base_path)): - if fs.isfile(path): - if _is_parquet_file(path): - files.append(path) - elif path.endswith('_common_metadata'): - self.common_metadata_path = path - elif path.endswith('_metadata'): - self.metadata_path = path - elif not self._should_silently_exclude(path): - print('Ignoring path: {0}'.format(path)) - elif fs.isdir(path): - directories.append(path) + _, directories, files = next(fs.walk(base_path)) + + filtered_files = [] + for path in files: + full_path = self.pathsep.join((base_path, path)) + if _is_parquet_file(path): + filtered_files.append(full_path) + elif path.endswith('_common_metadata'): + self.common_metadata_path = full_path + elif path.endswith('_metadata'): + self.metadata_path = full_path + elif not self._should_silently_exclude(path): + print('Ignoring path: {0}'.format(full_path)) # ARROW-1079: Filter out "private" directories starting with underscore - directories = [x for x in directories if not _is_private_directory(x)] + filtered_directories = [self.pathsep.join((base_path, x)) + for x in directories + if not _is_private_directory(x)] + + filtered_files.sort() + filtered_directories.sort() - if len(files) > 0 and len(directories) > 0: + if len(files) > 0 and len(filtered_directories) > 0: raise ValueError('Found files in an intermediate ' 'directory: {0}'.format(base_path)) - elif len(directories) > 0: - self._visit_directories(level, directories, part_keys) + elif len(filtered_directories) > 0: + self._visit_directories(level, filtered_directories, part_keys) else: - self._push_pieces(files, part_keys) + self._push_pieces(filtered_files, part_keys) - def _should_silently_exclude(self, path): - _, tail = path.rsplit(self.pathsep, 1) - 
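construct_metadata() and the relocated dataframe_to_arrays() above serialize everything needed to rebuild the DataFrame as JSON under the b'pandas' key of the Arrow schema metadata. A quick way to inspect the result, assuming only that pandas is installed (the column name is arbitrary):

    import json
    import pandas as pd
    import pyarrow as pa

    table = pa.Table.from_pandas(pd.DataFrame({'a': [1, 2, 3]}))
    meta = json.loads(table.schema.metadata[b'pandas'].decode('utf8'))
    # top-level keys: 'index_columns', 'columns', 'pandas_version'
    print(meta['index_columns'])    # e.g. ['__index_level_0__'] for an unnamed index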
return tail.endswith('.crc') or tail in EXCLUDED_PARQUET_PATHS + def _should_silently_exclude(self, file_name): + return (file_name.endswith('.crc') or + file_name in EXCLUDED_PARQUET_PATHS) def _visit_directories(self, level, directories, part_keys): for path in directories: @@ -505,7 +506,7 @@ class ParquetDataset(object): ---------- path_or_paths : str or List[str] A directory name, single file name, or list of file names - filesystem : Filesystem, default None + filesystem : FileSystem, default None If nothing passed, paths assumed to be found in the local on-disk filesystem metadata : pyarrow.parquet.FileMetaData @@ -521,9 +522,9 @@ class ParquetDataset(object): def __init__(self, path_or_paths, filesystem=None, schema=None, metadata=None, split_row_groups=False, validate_schema=True): if filesystem is None: - self.fs = LocalFilesystem.get_instance() + self.fs = LocalFileSystem.get_instance() else: - self.fs = filesystem + self.fs = _ensure_filesystem(filesystem) self.paths = path_or_paths @@ -630,7 +631,7 @@ def _get_common_pandas_metadata(self): return keyvalues.get(b'pandas', None) def _get_open_file_func(self): - if self.fs is None or isinstance(self.fs, LocalFilesystem): + if self.fs is None or isinstance(self.fs, LocalFileSystem): def open_file(path, meta=None): return ParquetFile(path, metadata=meta, common_metadata=self.common_metadata) @@ -642,6 +643,18 @@ def open_file(path, meta=None): return open_file +def _ensure_filesystem(fs): + if not isinstance(fs, FileSystem): + if type(fs).__name__ == 'S3FileSystem': + from pyarrow.filesystem import S3FSWrapper + return S3FSWrapper(fs) + else: + raise IOError('Unrecognized filesystem: {0}' + .format(type(fs))) + else: + return fs + + def _make_manifest(path_or_paths, fs, pathsep='/'): partitions = None metadata_path = None @@ -703,7 +716,7 @@ def read_table(source, columns=None, nthreads=1, metadata=None, Content of the file as a table (of columns) """ if is_string(source): - fs = LocalFilesystem.get_instance() + fs = LocalFileSystem.get_instance() if fs.isdir(source): return fs.read_parquet(source, columns=columns, metadata=metadata) @@ -769,9 +782,22 @@ def write_table(table, where, row_group_size=None, version='1.0', compression=compression, version=version, use_deprecated_int96_timestamps=use_deprecated_int96_timestamps) - writer = ParquetWriter(where, table.schema, **options) - writer.write_table(table, row_group_size=row_group_size) - writer.close() + + writer = None + try: + writer = ParquetWriter(where, table.schema, **options) + writer.write_table(table, row_group_size=row_group_size) + except: + if writer is not None: + writer.close() + if isinstance(where, six.string_types): + try: + os.remove(where) + except os.error: + pass + raise + else: + writer.close() def write_metadata(schema, where, version='1.0', @@ -792,3 +818,33 @@ def write_metadata(schema, where, version='1.0', ) writer = ParquetWriter(where, schema, **options) writer.close() + + +def read_metadata(where): + """ + Read FileMetadata from footer of a single Parquet file + + Parameters + ---------- + where : string (filepath) or file-like object + + Returns + ------- + metadata : FileMetadata + """ + return ParquetFile(where).metadata + + +def read_schema(where): + """ + Read effective Arrow schema from Parquet file metadata + + Parameters + ---------- + where : string (filepath) or file-like object + + Returns + ------- + schema : pyarrow.Schema + """ + return ParquetFile(where).schema.to_arrow_schema() diff --git a/python/pyarrow/plasma.pyx 
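The new module-level read_metadata() and read_schema() helpers are thin shortcuts around ParquetFile for callers that only need the footer. A small sketch, assuming a Parquet file already exists at the (made-up) path:

    import pyarrow.parquet as pq

    meta = pq.read_metadata('example.parquet')      # FileMetaData from the file footer
    schema = pq.read_schema('example.parquet')      # the effective Arrow schema
    print(meta)
    print(schema)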
b/python/pyarrow/plasma.pyx new file mode 100644 index 00000000000..dd62d473b00 --- /dev/null +++ b/python/pyarrow/plasma.pyx @@ -0,0 +1,588 @@ +# Licensed to the Apache Software Foundation (ASF) under one +# or more contributor license agreements. See the NOTICE file +# distributed with this work for additional information +# regarding copyright ownership. The ASF licenses this file +# to you under the Apache License, Version 2.0 (the +# "License"); you may not use this file except in compliance +# with the License. You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, +# software distributed under the License is distributed on an +# "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY +# KIND, either express or implied. See the License for the +# specific language governing permissions and limitations +# under the License. + +# cython: profile=False +# distutils: language = c++ +# cython: embedsignature = True + +from libcpp cimport bool as c_bool, nullptr +from libcpp.memory cimport shared_ptr, unique_ptr, make_shared +from libcpp.string cimport string as c_string +from libcpp.vector cimport vector as c_vector +from libc.stdint cimport int64_t, uint8_t, uintptr_t +from cpython.pycapsule cimport * + +from pyarrow.lib cimport Buffer, NativeFile, check_status +from pyarrow.includes.libarrow cimport (CMutableBuffer, CBuffer, + CFixedSizeBufferWriter, CStatus) + + +PLASMA_WAIT_TIMEOUT = 2 ** 30 + + +cdef extern from "plasma/common.h" nogil: + + cdef cppclass CUniqueID" plasma::UniqueID": + + @staticmethod + CUniqueID from_binary(const c_string& binary) + + c_bool operator==(const CUniqueID& rhs) const + + c_string hex() const + + c_string binary() const + + cdef struct CObjectRequest" plasma::ObjectRequest": + CUniqueID object_id + int type + int status + + +cdef extern from "plasma/common.h": + cdef int64_t kDigestSize" plasma::kDigestSize" + + cdef enum ObjectRequestType: + PLASMA_QUERY_LOCAL"plasma::PLASMA_QUERY_LOCAL", + PLASMA_QUERY_ANYWHERE"plasma::PLASMA_QUERY_ANYWHERE" + + cdef int ObjectStatusLocal"plasma::ObjectStatusLocal"; + cdef int ObjectStatusRemote"plasma::ObjectStatusRemote"; + +cdef extern from "plasma/client.h" nogil: + + cdef cppclass CPlasmaClient" plasma::PlasmaClient": + + CPlasmaClient() + + CStatus Connect(const c_string& store_socket_name, + const c_string& manager_socket_name, + int release_delay, int num_retries) + + CStatus Create(const CUniqueID& object_id, int64_t data_size, + const uint8_t* metadata, int64_t metadata_size, + uint8_t** data) + + CStatus Get(const CUniqueID* object_ids, int64_t num_objects, + int64_t timeout_ms, CObjectBuffer* object_buffers) + + CStatus Seal(const CUniqueID& object_id) + + CStatus Evict(int64_t num_bytes, int64_t& num_bytes_evicted) + + CStatus Hash(const CUniqueID& object_id, uint8_t* digest) + + CStatus Release(const CUniqueID& object_id) + + CStatus Contains(const CUniqueID& object_id, c_bool* has_object) + + CStatus Subscribe(int* fd) + + CStatus GetNotification(int fd, CUniqueID* object_id, + int64_t* data_size, int64_t* metadata_size) + + CStatus Disconnect() + + CStatus Fetch(int num_object_ids, const CUniqueID* object_ids) + + CStatus Wait(int64_t num_object_requests, + CObjectRequest* object_requests, + int num_ready_objects, int64_t timeout_ms, + int* num_objects_ready); + + CStatus Transfer(const char* addr, int port, + const CUniqueID& object_id) + + +cdef extern from "plasma/client.h" nogil: + + cdef struct 
CObjectBuffer" plasma::ObjectBuffer": + int64_t data_size + uint8_t* data + int64_t metadata_size + uint8_t* metadata + + +def make_object_id(object_id): + return ObjectID(object_id) + + +cdef class ObjectID: + """ + An ObjectID represents a string of bytes used to identify Plasma objects. + """ + + cdef: + CUniqueID data + + def __cinit__(self, object_id): + self.data = CUniqueID.from_binary(object_id) + + def __richcmp__(ObjectID self, ObjectID object_id, operation): + if operation != 2: + raise ValueError("operation != 2 (only equality is supported)") + return self.data == object_id.data + + def __hash__(self): + return hash(self.data.binary()) + + def __repr__(self): + return "ObjectID(" + self.data.hex().decode() + ")" + + def __reduce__(self): + return (make_object_id, (self.data.binary(),)) + + def binary(self): + """ + Return the binary representation of this ObjectID. + + Returns + ------- + bytes + Binary representation of the ObjectID. + """ + return self.data.binary() + + +cdef class PlasmaBuffer(Buffer): + """ + This is the type returned by calls to get with a PlasmaClient. + + We define our own class instead of directly returning a buffer object so + that we can add a custom destructor which notifies Plasma that the object + is no longer being used, so the memory in the Plasma store backing the + object can potentially be freed. + + Attributes + ---------- + object_id : ObjectID + The ID of the object in the buffer. + client : PlasmaClient + The PlasmaClient that we use to communicate with the store and manager. + """ + + cdef: + ObjectID object_id + PlasmaClient client + + def __cinit__(self, ObjectID object_id, PlasmaClient client): + """ + Initialize a PlasmaBuffer. + """ + self.object_id = object_id + self.client = client + + def __dealloc__(self): + """ + Notify Plasma that the object is no longer needed. + + If the plasma client has been shut down, then don't do anything. + """ + self.client.release(self.object_id) + + +cdef class PlasmaClient: + """ + The PlasmaClient is used to interface with a plasma store and manager. + + The PlasmaClient can ask the PlasmaStore to allocate a new buffer, seal a + buffer, and get a buffer. Buffers are referred to by object IDs, which are + strings. 
+ """ + + cdef: + shared_ptr[CPlasmaClient] client + int notification_fd + c_string store_socket_name + c_string manager_socket_name + + def __cinit__(self): + self.client.reset(new CPlasmaClient()) + self.notification_fd = -1 + self.store_socket_name = "" + self.manager_socket_name = "" + + cdef _get_object_buffers(self, object_ids, int64_t timeout_ms, + c_vector[CObjectBuffer]* result): + cdef c_vector[CUniqueID] ids + cdef ObjectID object_id + for object_id in object_ids: + ids.push_back(object_id.data) + result[0].resize(ids.size()) + with nogil: + check_status(self.client.get().Get(ids.data(), ids.size(), + timeout_ms, result[0].data())) + + cdef _make_plasma_buffer(self, ObjectID object_id, uint8_t* data, + int64_t size): + cdef shared_ptr[CBuffer] buffer + buffer.reset(new CBuffer(data, size)) + result = PlasmaBuffer(object_id, self) + result.init(buffer) + return result + + cdef _make_mutable_plasma_buffer(self, ObjectID object_id, uint8_t* data, + int64_t size): + cdef shared_ptr[CBuffer] buffer + buffer.reset(new CMutableBuffer(data, size)) + result = PlasmaBuffer(object_id, self) + result.init(buffer) + return result + + @property + def store_socket_name(self): + return self.store_socket_name.decode() + + @property + def manager_socket_name(self): + return self.manager_socket_name.decode() + + def create(self, ObjectID object_id, int64_t data_size, + c_string metadata=b""): + """ + Create a new buffer in the PlasmaStore for a particular object ID. + + The returned buffer is mutable until seal is called. + + Parameters + ---------- + object_id : ObjectID + The object ID used to identify an object. + size : int + The size in bytes of the created buffer. + metadata : bytes + An optional string of bytes encoding whatever metadata the user + wishes to encode. + + Raises + ------ + PlasmaObjectExists + This exception is raised if the object could not be created because + there already is an object with the same ID in the plasma store. + + PlasmaStoreFull: This exception is raised if the object could + not be created because the plasma store is unable to evict + enough objects to create room for it. + """ + cdef uint8_t* data + with nogil: + check_status(self.client.get().Create(object_id.data, data_size, + (metadata.data()), + metadata.size(), &data)) + return self._make_mutable_plasma_buffer(object_id, data, data_size) + + def get(self, object_ids, timeout_ms=-1): + """ + Returns data buffer from the PlasmaStore based on object ID. + + If the object has not been sealed yet, this call will block. The + retrieved buffer is immutable. + + Parameters + ---------- + object_ids : list + A list of ObjectIDs used to identify some objects. + timeout_ms :int + The number of milliseconds that the get call should block before + timing out and returning. Pass -1 if the call should block and 0 + if the call should return immediately. + + Returns + ------- + list + List of PlasmaBuffers for the data associated with the object_ids + and None if the object was not available. + """ + cdef c_vector[CObjectBuffer] object_buffers + self._get_object_buffers(object_ids, timeout_ms, &object_buffers) + result = [] + for i in range(object_buffers.size()): + if object_buffers[i].data_size != -1: + result.append(self._make_plasma_buffer( + object_ids[i], object_buffers[i].data, + object_buffers[i].data_size)) + else: + result.append(None) + return result + + def get_metadata(self, object_ids, timeout_ms=-1): + """ + Returns metadata buffer from the PlasmaStore based on object ID. 
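The typical object life cycle with these methods: create() returns a buffer that is writable through the buffer protocol (thanks to the Buffer mutability change earlier in this patch), seal() freezes it, and get() hands back immutable PlasmaBuffers, or None for IDs that do not become available within the timeout. A sketch, assuming a plasma store is already running on the made-up socket /tmp/plasma:

    import numpy as np
    import pyarrow.plasma as plasma

    client = plasma.connect('/tmp/plasma', '', 64)        # socket path is an assumption
    oid = plasma.ObjectID(np.random.bytes(20))

    buf = client.create(oid, 10, b'example-metadata')     # mutable until sealed
    np.frombuffer(buf, dtype='uint8')[:] = np.arange(10, dtype='uint8')
    client.seal(oid)

    [sealed] = client.get([oid])                          # immutable PlasmaBuffer
    missing = client.get([plasma.ObjectID(np.random.bytes(20))], timeout_ms=0)
    assert missing == [None]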
+ + If the object has not been sealed yet, this call will block. The + retrieved buffer is immutable. + + Parameters + ---------- + object_ids : list + A list of ObjectIDs used to identify some objects. + timeout_ms : int + The number of milliseconds that the get call should block before + timing out and returning. Pass -1 if the call should block and 0 + if the call should return immediately. + + Returns + ------- + list + List of PlasmaBuffers for the metadata associated with the + object_ids and None if the object was not available. + """ + cdef c_vector[CObjectBuffer] object_buffers + self._get_object_buffers(object_ids, timeout_ms, &object_buffers) + result = [] + for i in range(object_buffers.size()): + result.append(self._make_plasma_buffer( + object_ids[i], object_buffers[i].metadata, + object_buffers[i].metadata_size)) + return result + + def seal(self, ObjectID object_id): + """ + Seal the buffer in the PlasmaStore for a particular object ID. + + Once a buffer has been sealed, the buffer is immutable and can only be + accessed through get. + + Parameters + ---------- + object_id : ObjectID + A string used to identify an object. + """ + with nogil: + check_status(self.client.get().Seal(object_id.data)) + + def release(self, ObjectID object_id): + """ + Notify Plasma that the object is no longer needed. + + Parameters + ---------- + object_id : ObjectID + A string used to identify an object. + """ + with nogil: + check_status(self.client.get().Release(object_id.data)) + + def contains(self, ObjectID object_id): + """ + Check if the object is present and sealed in the PlasmaStore. + + Parameters + ---------- + object_id : ObjectID + A string used to identify an object. + """ + cdef c_bool is_contained + with nogil: + check_status(self.client.get().Contains(object_id.data, + &is_contained)) + return is_contained + + def hash(self, ObjectID object_id): + """ + Compute the checksum of an object in the object store. + + Parameters + ---------- + object_id : ObjectID + A string used to identify an object. + + Returns + ------- + bytes + A digest string object's hash. If the object isn't in the object + store, the string will have length zero. + """ + cdef c_vector[uint8_t] digest = c_vector[uint8_t](kDigestSize) + with nogil: + check_status(self.client.get().Hash(object_id.data, + digest.data())) + return bytes(digest[:]) + + def evict(self, int64_t num_bytes): + """ + Evict some objects until to recover some bytes. + + Recover at least num_bytes bytes if possible. + + Parameters + ---------- + num_bytes : int + The number of bytes to attempt to recover. + """ + cdef int64_t num_bytes_evicted = -1 + with nogil: + check_status(self.client.get().Evict(num_bytes, num_bytes_evicted)) + return num_bytes_evicted + + def transfer(self, address, int port, ObjectID object_id): + """ + Transfer local object with id object_id to another plasma instance + + Parameters + ---------- + addr : str + IPv4 address of the plasma instance the object is sent to. + port : int + Port number of the plasma instance the object is sent to. + object_id : str + A string used to identify an object. + """ + cdef c_string addr = address.encode() + with nogil: + check_status(self.client.get() + .Transfer(addr.c_str(), port, object_id.data)) + + def fetch(self, object_ids): + """ + Fetch the objects with the given IDs from other plasma managers. + + Parameters + ---------- + object_ids : list + A list of strings used to identify the objects. 
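contains(), hash() and evict() round out the bookkeeping side of the client. A brief sketch under the same assumption of a store running at /tmp/plasma:

    import numpy as np
    import pyarrow.plasma as plasma

    client = plasma.connect('/tmp/plasma', '', 64)        # socket path is an assumption
    oid = plasma.ObjectID(np.random.bytes(20))
    client.create(oid, 1000)
    client.seal(oid)

    assert client.contains(oid)        # only sealed objects are reported as present
    digest = client.hash(oid)          # checksum of the object's contents
    freed = client.evict(2000)         # ask the store to evict roughly 2 kB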
+ """ + cdef c_vector[CUniqueID] ids + cdef ObjectID object_id + for object_id in object_ids: + ids.push_back(object_id.data) + with nogil: + check_status(self.client.get().Fetch(ids.size(), ids.data())) + + def wait(self, object_ids, int64_t timeout=PLASMA_WAIT_TIMEOUT, + int num_returns=1): + """ + Wait until num_returns objects in object_ids are ready. + Currently, the object ID arguments to wait must be unique. + + Parameters + ---------- + object_ids : list + List of object IDs to wait for. + timeout :int + Return to the caller after timeout milliseconds. + num_returns : int + We are waiting for this number of objects to be ready. + + Returns + ------- + list + List of object IDs that are ready. + list + List of object IDs we might still wait on. + """ + # Check that the object ID arguments are unique. The plasma manager + # currently crashes if given duplicate object IDs. + if len(object_ids) != len(set(object_ids)): + raise Exception("Wait requires a list of unique object IDs.") + cdef int64_t num_object_requests = len(object_ids) + cdef c_vector[CObjectRequest] object_requests = ( + c_vector[CObjectRequest](num_object_requests)) + cdef int num_objects_ready = 0 + cdef ObjectID object_id + for i, object_id in enumerate(object_ids): + object_requests[i].object_id = object_id.data + object_requests[i].type = PLASMA_QUERY_ANYWHERE + with nogil: + check_status(self.client.get().Wait(num_object_requests, + object_requests.data(), + num_returns, timeout, + &num_objects_ready)) + cdef int num_to_return = min(num_objects_ready, num_returns); + ready_ids = [] + waiting_ids = set(object_ids) + cdef int num_returned = 0 + for i in range(len(object_ids)): + if num_returned == num_to_return: + break + if (object_requests[i].status == ObjectStatusLocal or + object_requests[i].status == ObjectStatusRemote): + ready_ids.append( + ObjectID(object_requests[i].object_id.binary())) + waiting_ids.discard( + ObjectID(object_requests[i].object_id.binary())) + num_returned += 1 + return ready_ids, list(waiting_ids) + + def subscribe(self): + """Subscribe to notifications about sealed objects.""" + with nogil: + check_status(self.client.get().Subscribe(&self.notification_fd)) + + def get_next_notification(self): + """ + Get the next notification from the notification socket. + + Returns + ------- + ObjectID + The object ID of the object that was stored. + int + The data size of the object that was stored. + int + The metadata size of the object that was stored. + """ + cdef ObjectID object_id = ObjectID(20 * b"\0") + cdef int64_t data_size + cdef int64_t metadata_size + with nogil: + check_status(self.client.get() + .GetNotification(self.notification_fd, + &object_id.data, + &data_size, + &metadata_size)) + return object_id, data_size, metadata_size + + def to_capsule(self): + return PyCapsule_New(self.client.get(), "plasma", NULL) + + def disconnect(self): + """ + Disconnect this client from the Plasma store. + """ + with nogil: + check_status(self.client.get().Disconnect()) + + +def connect(store_socket_name, manager_socket_name, int release_delay, + int num_retries=-1): + """ + Return a new PlasmaClient that is connected a plasma store and + optionally a manager. + + Parameters + ---------- + store_socket_name : str + Name of the socket the plasma store is listening at. + manager_socket_name : str + Name of the socket the plasma manager is listening at. + release_delay : int + The maximum number of objects that the client will keep and + delay releasing (for caching reasons). 
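subscribe() plus get_next_notification() give a simple way to learn when objects are sealed, for instance by a different process. A rough sketch, again assuming a running store; the blocking call returns once some object is sealed:

    import pyarrow.plasma as plasma

    client = plasma.connect('/tmp/plasma', '', 64)        # socket path is an assumption
    client.subscribe()
    # blocks until a client seals an object in this store
    oid, data_size, metadata_size = client.get_next_notification()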
+ num_retries : int, default -1 + Number of times tor ty to connect to plasma store. Default value of -1 + uses the default (50) + """ + cdef PlasmaClient result = PlasmaClient() + result.store_socket_name = store_socket_name.encode() + result.manager_socket_name = manager_socket_name.encode() + with nogil: + check_status(result.client.get() + .Connect(result.store_socket_name, + result.manager_socket_name, + release_delay, num_retries)) + return result diff --git a/python/pyarrow/scalar.pxi b/python/pyarrow/scalar.pxi index dec5341ca4a..1f72070cb7e 100644 --- a/python/pyarrow/scalar.pxi +++ b/python/pyarrow/scalar.pxi @@ -169,7 +169,6 @@ cdef class Time64Value(ArrayValue): CTime64Type* dtype = ap.type().get() cdef int64_t val = ap.Value(self.index) - print(val) if dtype.unit() == TimeUnit_MICRO: return (datetime.datetime(1970, 1, 1) + datetime.timedelta(microseconds=val)).time() diff --git a/python/pyarrow/table.pxi b/python/pyarrow/table.pxi index 6188e90616b..6277761b7d6 100644 --- a/python/pyarrow/table.pxi +++ b/python/pyarrow/table.pxi @@ -286,7 +286,7 @@ cdef int _schema_from_arrays( c_string c_name vector[shared_ptr[CField]] fields shared_ptr[CDataType] type_ - int K = len(arrays) + Py_ssize_t K = len(arrays) fields.resize(K) @@ -317,51 +317,6 @@ cdef int _schema_from_arrays( return 0 -cdef tuple _dataframe_to_arrays( - df, - bint timestamps_to_ms, - Schema schema, - bint preserve_index -): - cdef: - list names = [] - list arrays = [] - list index_columns = [] - list types = [] - DataType type = None - dict metadata - Py_ssize_t i - Py_ssize_t n - - if preserve_index: - n = len(getattr(df.index, 'levels', [df.index])) - index_columns.extend(df.index.get_level_values(i) for i in range(n)) - - for name in df.columns: - col = df[name] - if schema is not None: - field = schema.field_by_name(name) - type = getattr(field, "type", None) - - array = Array.from_pandas( - col, type=type, timestamps_to_ms=timestamps_to_ms - ) - arrays.append(array) - names.append(name) - types.append(array.type) - - for i, column in enumerate(index_columns): - array = Array.from_pandas(column, timestamps_to_ms=timestamps_to_ms) - arrays.append(array) - names.append(pdcompat.index_level_name(column, i)) - types.append(array.type) - - metadata = pdcompat.construct_metadata( - df, index_columns, preserve_index, types - ) - return names, arrays, metadata - - cdef class RecordBatch: """ Batch of rows of columns of equal length @@ -475,8 +430,13 @@ cdef class RecordBatch: ) return pyarrow_wrap_array(self.batch.column(i)) - def __getitem__(self, i): - return self.column(i) + def __getitem__(self, key): + cdef: + Py_ssize_t start, stop + if isinstance(key, slice): + return _normalize_slice(self, key) + else: + return self.column(key) def slice(self, offset=0, length=None): """ @@ -565,7 +525,7 @@ cdef class RecordBatch: ------- pyarrow.RecordBatch """ - names, arrays, metadata = _dataframe_to_arrays( + names, arrays, metadata = pdcompat.dataframe_to_arrays( df, False, schema, preserve_index ) return cls.from_arrays(arrays, names, metadata) @@ -743,7 +703,7 @@ cdef class Table: >>> pa.Table.from_pandas(df) """ - names, arrays, metadata = _dataframe_to_arrays( + names, arrays, metadata = pdcompat.dataframe_to_arrays( df, timestamps_to_ms=timestamps_to_ms, schema=schema, @@ -773,7 +733,7 @@ cdef class Table: vector[shared_ptr[CColumn]] columns shared_ptr[CSchema] schema shared_ptr[CTable] table - size_t K = len(arrays) + int i, K = len(arrays) _schema_from_arrays(arrays, names, metadata, &schema) @@ -881,7 +841,7 @@ 
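On the table.pxi side, RecordBatch.__getitem__ now accepts slice keys in addition to integer column indices; a slice appears to go through _normalize_slice and yield a row-wise view, while an integer still selects a column. A tentative sketch of the intended use:

    import pandas as pd
    import pyarrow as pa

    batch = pa.RecordBatch.from_pandas(pd.DataFrame({'a': [1, 2, 3, 4]}))
    col = batch[0]            # integer key: first column, as before
    rows = batch[1:3]         # slice key: rows 1..2 as a new RecordBatch
    assert rows.num_rows == 2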
cdef class Table: self._check_nullptr() return pyarrow_wrap_schema(self.table.schema()) - def column(self, int64_t i): + def column(self, int i): """ Select a column by its numeric index. @@ -895,8 +855,8 @@ cdef class Table: """ cdef: Column column = Column() - int64_t num_columns = self.num_columns - int64_t index + int num_columns = self.num_columns + int index self._check_nullptr() if not -num_columns <= i < num_columns: diff --git a/python/pyarrow/tests/conftest.py b/python/pyarrow/tests/conftest.py index 2aeeab7294c..c6bd6c9b3a2 100644 --- a/python/pyarrow/tests/conftest.py +++ b/python/pyarrow/tests/conftest.py @@ -18,12 +18,22 @@ from pytest import skip -groups = ['hdfs', 'parquet', 'large_memory'] +groups = [ + 'hdfs', + 'parquet', + 'plasma', + 'large_memory', + 's3', +] + defaults = { 'hdfs': False, + 'large_memory': False, 'parquet': False, - 'large_memory': False + 'plasma': False, + 'large_memory': False, + 's3': False, } try: @@ -33,6 +43,13 @@ pass +try: + import pyarrow.plasma as plasma # noqa + defaults['plasma'] = True +except ImportError: + pass + + def pytest_configure(config): pass @@ -43,6 +60,11 @@ def pytest_addoption(parser): default=defaults[group], help=('Enable the {0} test group'.format(group))) + for group in groups: + parser.addoption('--disable-{0}'.format(group), action='store_true', + default=False, + help=('Disable the {0} test group'.format(group))) + for group in groups: parser.addoption('--only-{0}'.format(group), action='store_true', default=False, @@ -54,12 +76,14 @@ def pytest_runtest_setup(item): for group in groups: only_flag = '--only-{0}'.format(group) + disable_flag = '--disable-{0}'.format(group) flag = '--{0}'.format(group) if item.config.getoption(only_flag): only_set = True elif getattr(item.obj, group, None): - if not item.config.getoption(flag): + if (item.config.getoption(disable_flag) or + not item.config.getoption(flag)): skip('{0} NOT enabled'.format(flag)) if only_set: diff --git a/python/pyarrow/tests/test_convert_pandas.py b/python/pyarrow/tests/test_convert_pandas.py index 43e0bad5e3d..d4886585633 100644 --- a/python/pyarrow/tests/test_convert_pandas.py +++ b/python/pyarrow/tests/test_convert_pandas.py @@ -109,6 +109,11 @@ def test_all_none_category(self): df['a'] = df['a'].astype('category') self._check_pandas_roundtrip(df) + def test_non_string_columns(self): + df = pd.DataFrame({0: [1, 2, 3]}) + table = pa.Table.from_pandas(df) + assert table.column(0).name == '0' + def test_float_no_nulls(self): data = {} fields = [] diff --git a/python/pyarrow/tests/test_feather.py b/python/pyarrow/tests/test_feather.py index 93d67365c8a..a7013ba5998 100644 --- a/python/pyarrow/tests/test_feather.py +++ b/python/pyarrow/tests/test_feather.py @@ -1,16 +1,19 @@ -# Copyright 2016 Feather Developers +# Licensed to the Apache Software Foundation (ASF) under one +# or more contributor license agreements. See the NOTICE file +# distributed with this work for additional information +# regarding copyright ownership. The ASF licenses this file +# to you under the Apache License, Version 2.0 (the +# "License"); you may not use this file except in compliance +# with the License. You may obtain a copy of the License at # -# Licensed under the Apache License, Version 2.0 (the "License"); -# you may not use this file except in compliance with the License. 
-# You may obtain a copy of the License at +# http://www.apache.org/licenses/LICENSE-2.0 # -# http://www.apache.org/licenses/LICENSE-2.0 -# -# Unless required by applicable law or agreed to in writing, software -# distributed under the License is distributed on an "AS IS" BASIS, -# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. -# See the License for the specific language governing permissions and -# limitations under the License. +# Unless required by applicable law or agreed to in writing, +# software distributed under the License is distributed on an +# "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY +# KIND, either express or implied. See the License for the +# specific language governing permissions and limitations +# under the License. import os import sys diff --git a/python/pyarrow/tests/test_hdfs.py b/python/pyarrow/tests/test_hdfs.py index cea02fbecc7..79638f2c64d 100644 --- a/python/pyarrow/tests/test_hdfs.py +++ b/python/pyarrow/tests/test_hdfs.py @@ -43,7 +43,7 @@ def hdfs_test_client(driver='libhdfs'): raise ValueError('Env variable ARROW_HDFS_TEST_PORT was not ' 'an integer') - return pa.HdfsClient(host, port, user, driver=driver) + return pa.hdfs.connect(host, port, user, driver=driver) @pytest.mark.hdfs @@ -72,7 +72,26 @@ def tearDownClass(cls): cls.hdfs.delete(cls.tmp_path, recursive=True) cls.hdfs.close() - def test_hdfs_close(self): + def test_cat(self): + path = pjoin(self.tmp_path, 'cat-test') + + data = b'foobarbaz' + with self.hdfs.open(path, 'wb') as f: + f.write(data) + + contents = self.hdfs.cat(path) + assert contents == data + + def test_capacity_space(self): + capacity = self.hdfs.get_capacity() + space_used = self.hdfs.get_space_used() + disk_free = self.hdfs.df() + + assert capacity > 0 + assert capacity > space_used + assert disk_free == (capacity - space_used) + + def test_close(self): client = hdfs_test_client() assert client.is_open client.close() @@ -81,7 +100,7 @@ def test_hdfs_close(self): with pytest.raises(Exception): client.ls('/') - def test_hdfs_mkdir(self): + def test_mkdir(self): path = pjoin(self.tmp_path, 'test-dir/test-dir') parent_path = pjoin(self.tmp_path, 'test-dir') @@ -91,7 +110,64 @@ def test_hdfs_mkdir(self): self.hdfs.delete(parent_path, recursive=True) assert not self.hdfs.exists(path) - def test_hdfs_ls(self): + def test_mv_rename(self): + path = pjoin(self.tmp_path, 'mv-test') + new_path = pjoin(self.tmp_path, 'mv-new-test') + + data = b'foobarbaz' + with self.hdfs.open(path, 'wb') as f: + f.write(data) + + assert self.hdfs.exists(path) + self.hdfs.mv(path, new_path) + assert not self.hdfs.exists(path) + assert self.hdfs.exists(new_path) + + assert self.hdfs.cat(new_path) == data + + self.hdfs.rename(new_path, path) + assert self.hdfs.cat(path) == data + + def test_info(self): + path = pjoin(self.tmp_path, 'info-base') + file_path = pjoin(path, 'ex') + self.hdfs.mkdir(path) + + data = b'foobarbaz' + with self.hdfs.open(file_path, 'wb') as f: + f.write(data) + + path_info = self.hdfs.info(path) + file_path_info = self.hdfs.info(file_path) + + assert path_info['kind'] == 'directory' + + assert file_path_info['kind'] == 'file' + assert file_path_info['size'] == len(data) + + def test_disk_usage(self): + path = pjoin(self.tmp_path, 'disk-usage-base') + p1 = pjoin(path, 'p1') + p2 = pjoin(path, 'p2') + + subdir = pjoin(path, 'subdir') + p3 = pjoin(subdir, 'p3') + + if self.hdfs.exists(path): + self.hdfs.delete(path, True) + + self.hdfs.mkdir(path) + self.hdfs.mkdir(subdir) + + data = b'foobarbaz' + + 
for file_path in [p1, p2, p3]: + with self.hdfs.open(file_path, 'wb') as f: + f.write(data) + + assert self.hdfs.disk_usage(path) == len(data) * 3 + + def test_ls(self): base_path = pjoin(self.tmp_path, 'ls-test') self.hdfs.mkdir(base_path) @@ -106,7 +182,12 @@ def test_hdfs_ls(self): contents = sorted(self.hdfs.ls(base_path, False)) assert contents == [dir_path, f1_path] - def test_hdfs_download_upload(self): + def test_chmod_chown(self): + path = pjoin(self.tmp_path, 'chmod-test') + with self.hdfs.open(path, 'wb') as f: + f.write(b'a' * 10) + + def test_download_upload(self): base_path = pjoin(self.tmp_path, 'upload-test') data = b'foobarbaz' @@ -120,7 +201,7 @@ def test_hdfs_download_upload(self): out_buf.seek(0) assert out_buf.getvalue() == data - def test_hdfs_file_context_manager(self): + def test_file_context_manager(self): path = pjoin(self.tmp_path, 'ctx-manager') data = b'foo' @@ -132,7 +213,7 @@ def test_hdfs_file_context_manager(self): result = f.read(10) assert result == data - def test_hdfs_read_whole_file(self): + def test_read_whole_file(self): path = pjoin(self.tmp_path, 'read-whole-file') data = b'foo' * 1000 @@ -145,7 +226,7 @@ def test_hdfs_read_whole_file(self): assert result == data @test_parquet.parquet - def test_hdfs_read_multiple_parquet_files(self): + def test_read_multiple_parquet_files(self): import pyarrow.parquet as pq nfiles = 10 @@ -191,7 +272,7 @@ def check_driver(cls): if not pa.have_libhdfs(): pytest.fail('No libhdfs available on system') - def test_hdfs_orphaned_file(self): + def test_orphaned_file(self): hdfs = hdfs_test_client() file_path = self._make_test_file(hdfs, 'orphaned_file_test', 'fname', 'foobarbaz') diff --git a/python/pyarrow/tests/test_io.py b/python/pyarrow/tests/test_io.py index 835f50874f7..c81a0485ce1 100644 --- a/python/pyarrow/tests/test_io.py +++ b/python/pyarrow/tests/test_io.py @@ -323,6 +323,15 @@ def _check_native_file_reader(FACTORY, sample_data): assert f.tell() == len(data) + 1 assert f.read(5) == b'' + # Test whence argument of seek, ARROW-1287 + assert f.seek(3) == 3 + assert f.seek(3, os.SEEK_CUR) == 6 + assert f.tell() == 6 + + ex_length = len(data) - 2 + assert f.seek(-2, os.SEEK_END) == ex_length + assert f.tell() == ex_length + def test_memory_map_reader(sample_disk_data): _check_native_file_reader(pa.memory_map, sample_disk_data) diff --git a/python/pyarrow/tests/test_ipc.py b/python/pyarrow/tests/test_ipc.py index bcaca6df777..3ad369c31f4 100644 --- a/python/pyarrow/tests/test_ipc.py +++ b/python/pyarrow/tests/test_ipc.py @@ -360,7 +360,7 @@ def test_pandas_serialize_round_trip_multi_index(): @pytest.mark.xfail( - raises=TypeError, + raises=AssertionError, reason='Non string columns are not supported', ) def test_pandas_serialize_round_trip_not_string_columns(): diff --git a/python/pyarrow/tests/test_parquet.py b/python/pyarrow/tests/test_parquet.py index 40e44b352ac..ab3b26cd4e0 100644 --- a/python/pyarrow/tests/test_parquet.py +++ b/python/pyarrow/tests/test_parquet.py @@ -23,7 +23,7 @@ import pytest from pyarrow.compat import guid, u -from pyarrow.filesystem import LocalFilesystem +from pyarrow.filesystem import LocalFileSystem import pyarrow as pa from .pandas_examples import dataframe_with_arrays, dataframe_with_lists @@ -124,9 +124,8 @@ def test_pandas_parquet_custom_metadata(tmpdir): assert b'pandas' in arrow_table.schema.metadata _write_table(arrow_table, filename.strpath, version="2.0") - pf = pq.ParquetFile(filename.strpath) - md = pf.metadata.metadata + md = pq.read_metadata(filename.strpath).metadata 
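The HDFS tests above now go through pa.hdfs.connect and exercise the newly tested calls (cat, info, df, disk_usage, mv). Connecting and using a few of them looks roughly like this; host, port and user are placeholders for a real cluster:

    import pyarrow as pa

    hdfs = pa.hdfs.connect('namenode-host', 8020, 'hadoop-user')   # placeholders
    with hdfs.open('/tmp/example.dat', 'wb') as f:
        f.write(b'foobarbaz')

    assert hdfs.cat('/tmp/example.dat') == b'foobarbaz'
    info = hdfs.info('/tmp/example.dat')       # dict with 'kind', 'size', ...
    free_bytes = hdfs.df()                     # capacity minus space used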
assert b'pandas' in md js = json.loads(md[b'pandas'].decode('utf8')) @@ -265,6 +264,18 @@ def test_read_pandas_column_subset(tmpdir): tm.assert_frame_equal(df[['strings', 'uint8']], df_read) +@parquet +def test_pandas_parquet_empty_roundtrip(tmpdir): + df = _test_dataframe(0) + arrow_table = pa.Table.from_pandas(df) + imos = pa.BufferOutputStream() + _write_table(arrow_table, imos, version="2.0") + buf = imos.get_result() + reader = pa.BufferReader(buf) + df_read = _read_table(reader).to_pandas() + tm.assert_frame_equal(df, df_read) + + @parquet def test_pandas_parquet_pyfile_roundtrip(tmpdir): filename = tmpdir.join('pandas_pyfile_roundtrip.parquet').strpath @@ -580,7 +591,7 @@ def test_pass_separate_metadata(): _write_table(a_table, buf, compression='snappy', version='2.0') buf.seek(0) - metadata = pq.ParquetFile(buf).metadata + metadata = pq.read_metadata(buf) buf.seek(0) @@ -689,6 +700,45 @@ def test_partition_set_dictionary_type(): @parquet def test_read_partitioned_directory(tmpdir): + fs = LocalFileSystem.get_instance() + base_path = str(tmpdir) + + _partition_test_for_filesystem(fs, base_path) + + +@pytest.yield_fixture +def s3_example(): + access_key = os.environ['PYARROW_TEST_S3_ACCESS_KEY'] + secret_key = os.environ['PYARROW_TEST_S3_SECRET_KEY'] + bucket_name = os.environ['PYARROW_TEST_S3_BUCKET'] + + import s3fs + fs = s3fs.S3FileSystem(key=access_key, secret=secret_key) + + test_dir = guid() + + bucket_uri = 's3://{0}/{1}'.format(bucket_name, test_dir) + fs.mkdir(bucket_uri) + yield fs, bucket_uri + fs.rm(bucket_uri, recursive=True) + + +@pytest.mark.s3 +@parquet +def test_read_partitioned_directory_s3fs(s3_example): + from pyarrow.filesystem import S3FSWrapper + import pyarrow.parquet as pq + + fs, bucket_uri = s3_example + wrapper = S3FSWrapper(fs) + _partition_test_for_filesystem(wrapper, bucket_uri) + + # Check that we can auto-wrap + dataset = pq.ParquetDataset(bucket_uri, filesystem=fs) + dataset.read() + + +def _partition_test_for_filesystem(fs, base_path): import pyarrow.parquet as pq foo_keys = [0, 1] @@ -706,10 +756,9 @@ def test_read_partitioned_directory(tmpdir): 'values': np.random.randn(N) }, columns=['index', 'foo', 'bar', 'values']) - base_path = str(tmpdir) - _generate_partition_directories(base_path, partition_spec, df) + _generate_partition_directories(fs, base_path, partition_spec, df) - dataset = pq.ParquetDataset(base_path) + dataset = pq.ParquetDataset(base_path, filesystem=fs) table = dataset.read() result_df = (table.to_pandas() .sort_values(by='index') @@ -726,12 +775,11 @@ def test_read_partitioned_directory(tmpdir): tm.assert_frame_equal(result_df, expected_df) -def _generate_partition_directories(base_dir, partition_spec, df): +def _generate_partition_directories(fs, base_dir, partition_spec, df): # partition_spec : list of lists, e.g. 
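Since _ensure_filesystem (earlier in this patch) wraps an s3fs filesystem automatically, the partitioned-dataset machinery can read straight from S3 by passing the raw s3fs object. A sketch; bucket name and credentials are placeholders:

    import s3fs
    import pyarrow.parquet as pq

    fs = s3fs.S3FileSystem(key='<access-key>', secret='<secret-key>')   # placeholders
    dataset = pq.ParquetDataset('s3://my-bucket/dataset-dir', filesystem=fs)
    table = dataset.read()
    df = table.to_pandas()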
[['foo', [0, 1, 2], # ['bar', ['a', 'b', 'c']] # part_table : a pyarrow.Table to write to each partition DEPTH = len(partition_spec) - fs = LocalFilesystem.get_instance() def _visit_level(base_dir, level, part_keys): name, values = partition_spec[level] @@ -747,7 +795,8 @@ def _visit_level(base_dir, level, part_keys): filtered_df = _filter_partition(df, this_part_keys) part_table = pa.Table.from_pandas(filtered_df) - _write_table(part_table, file_path) + with fs.open(file_path, 'wb') as f: + _write_table(part_table, f) else: _visit_level(level_dir, level + 1, this_part_keys) @@ -776,14 +825,32 @@ def test_read_common_metadata_files(tmpdir): dataset = pq.ParquetDataset(base_path) assert dataset.metadata_path == metadata_path - pf = pq.ParquetFile(data_path) - assert dataset.schema.equals(pf.schema) + common_schema = pq.read_metadata(data_path).schema + assert dataset.schema.equals(common_schema) # handle list of one directory dataset2 = pq.ParquetDataset([base_path]) assert dataset2.schema.equals(dataset.schema) +@parquet +def test_read_schema(tmpdir): + import pyarrow.parquet as pq + + N = 100 + df = pd.DataFrame({ + 'index': np.arange(N), + 'values': np.random.randn(N) + }, columns=['index', 'values']) + + data_path = pjoin(str(tmpdir), 'test.parquet') + + table = pa.Table.from_pandas(df) + _write_table(table, data_path) + + assert table.schema.equals(pq.read_schema(data_path)) + + def _filter_partition(df, part_keys): predicate = np.ones(len(df), dtype=bool) @@ -835,7 +902,7 @@ def read_multiple_files(paths, columns=None, nthreads=None, **kwargs): assert result.equals(expected) # Read with provided metadata - metadata = pq.ParquetFile(paths[0]).metadata + metadata = pq.read_metadata(paths[0]) result2 = read_multiple_files(paths, metadata=metadata) assert result2.equals(expected) @@ -861,7 +928,7 @@ def read_multiple_files(paths, columns=None, nthreads=None, **kwargs): t = pa.Table.from_pandas(bad_apple) _write_table(t, bad_apple_path) - bad_meta = pq.ParquetFile(bad_apple_path).metadata + bad_meta = pq.read_metadata(bad_apple_path) with pytest.raises(ValueError): read_multiple_files(paths + [bad_apple_path]) @@ -1006,3 +1073,28 @@ def test_multiindex_duplicate_values(tmpdir): result_df = result_table.to_pandas() tm.assert_frame_equal(result_df, df) + + +@parquet +def test_write_error_deletes_incomplete_file(tmpdir): + # ARROW-1285 + df = pd.DataFrame({'a': list('abc'), + 'b': list(range(1, 4)), + 'c': np.arange(3, 6).astype('u1'), + 'd': np.arange(4.0, 7.0, dtype='float64'), + 'e': [True, False, True], + 'f': pd.Categorical(list('abc')), + 'g': pd.date_range('20130101', periods=3), + 'h': pd.date_range('20130101', periods=3, + tz='US/Eastern'), + 'i': pd.date_range('20130101', periods=3, freq='ns')}) + + pdf = pa.Table.from_pandas(df) + + filename = tmpdir.join('tmp_file').strpath + try: + _write_table(pdf, filename) + except pa.ArrowException: + pass + + assert not os.path.exists(filename) diff --git a/python/pyarrow/tests/test_plasma.py b/python/pyarrow/tests/test_plasma.py new file mode 100644 index 00000000000..04162bbbbad --- /dev/null +++ b/python/pyarrow/tests/test_plasma.py @@ -0,0 +1,686 @@ +# Licensed to the Apache Software Foundation (ASF) under one +# or more contributor license agreements. See the NOTICE file +# distributed with this work for additional information +# regarding copyright ownership. The ASF licenses this file +# to you under the Apache License, Version 2.0 (the +# "License"); you may not use this file except in compliance +# with the License. 
You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, +# software distributed under the License is distributed on an +# "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY +# KIND, either express or implied. See the License for the +# specific language governing permissions and limitations +# under the License. + +from __future__ import absolute_import +from __future__ import division +from __future__ import print_function + +import numpy as np +import os +import pytest +import random +import signal +import subprocess +import time + +import pyarrow as pa +import pandas as pd + +DEFAULT_PLASMA_STORE_MEMORY = 10 ** 9 + + +def random_name(): + return str(random.randint(0, 99999999)) + + +def random_object_id(): + import pyarrow.plasma as plasma + return plasma.ObjectID(np.random.bytes(20)) + + +def generate_metadata(length): + metadata = bytearray(length) + if length > 0: + metadata[0] = random.randint(0, 255) + metadata[-1] = random.randint(0, 255) + for _ in range(100): + metadata[random.randint(0, length - 1)] = random.randint(0, 255) + return metadata + + +def write_to_data_buffer(buff, length): + array = np.frombuffer(buff, dtype="uint8") + if length > 0: + array[0] = random.randint(0, 255) + array[-1] = random.randint(0, 255) + for _ in range(100): + array[random.randint(0, length - 1)] = random.randint(0, 255) + + +def create_object_with_id(client, object_id, data_size, metadata_size, + seal=True): + metadata = generate_metadata(metadata_size) + memory_buffer = client.create(object_id, data_size, metadata) + write_to_data_buffer(memory_buffer, data_size) + if seal: + client.seal(object_id) + return memory_buffer, metadata + + +def create_object(client, data_size, metadata_size, seal=True): + object_id = random_object_id() + memory_buffer, metadata = create_object_with_id(client, object_id, + data_size, metadata_size, + seal=seal) + return object_id, memory_buffer, metadata + + +def assert_get_object_equal(unit_test, client1, client2, object_id, + memory_buffer=None, metadata=None): + import pyarrow.plasma as plasma + client1_buff = client1.get([object_id])[0] + client2_buff = client2.get([object_id])[0] + client1_metadata = client1.get_metadata([object_id])[0] + client2_metadata = client2.get_metadata([object_id])[0] + assert len(client1_buff) == len(client2_buff) + assert len(client1_metadata) == len(client2_metadata) + # Check that the buffers from the two clients are the same. + assert plasma.buffers_equal(client1_buff, client2_buff) + # Check that the metadata buffers from the two clients are the same. + assert plasma.buffers_equal(client1_metadata, client2_metadata) + # If a reference buffer was provided, check that it is the same as well. + if memory_buffer is not None: + assert plasma.buffers_equal(memory_buffer, client1_buff) + # If reference metadata was provided, check that it is the same as well. + if metadata is not None: + assert plasma.buffers_equal(metadata, client1_metadata) + + +def start_plasma_store(plasma_store_memory=DEFAULT_PLASMA_STORE_MEMORY, + use_valgrind=False, use_profiler=False, + stdout_file=None, stderr_file=None): + """Start a plasma store process. + Args: + use_valgrind (bool): True if the plasma store should be started inside + of valgrind. If this is True, use_profiler must be False. + use_profiler (bool): True if the plasma store should be started inside + a profiler. If this is True, use_valgrind must be False. 
+ stdout_file: A file handle opened for writing to redirect stdout to. If + no redirection should happen, then this should be None. + stderr_file: A file handle opened for writing to redirect stderr to. If + no redirection should happen, then this should be None. + Return: + A tuple of the name of the plasma store socket and the process ID of + the plasma store process. + """ + if use_valgrind and use_profiler: + raise Exception("Cannot use valgrind and profiler at the same time.") + plasma_store_executable = os.path.join(pa.__path__[0], "plasma_store") + plasma_store_name = "/tmp/plasma_store{}".format(random_name()) + command = [plasma_store_executable, + "-s", plasma_store_name, + "-m", str(plasma_store_memory)] + if use_valgrind: + pid = subprocess.Popen(["valgrind", + "--track-origins=yes", + "--leak-check=full", + "--show-leak-kinds=all", + "--leak-check-heuristics=stdstring", + "--error-exitcode=1"] + command, + stdout=stdout_file, stderr=stderr_file) + time.sleep(1.0) + elif use_profiler: + pid = subprocess.Popen(["valgrind", "--tool=callgrind"] + command, + stdout=stdout_file, stderr=stderr_file) + time.sleep(1.0) + else: + pid = subprocess.Popen(command, stdout=stdout_file, stderr=stderr_file) + time.sleep(0.1) + return plasma_store_name, pid + + +@pytest.mark.plasma +class TestPlasmaClient(object): + + def setup_method(self, test_method): + import pyarrow.plasma as plasma + # Start Plasma store. + plasma_store_name, self.p = start_plasma_store( + use_valgrind=os.getenv("PLASMA_VALGRIND") == "1") + # Connect to Plasma. + self.plasma_client = plasma.connect(plasma_store_name, "", 64) + # For the eviction test + self.plasma_client2 = plasma.connect(plasma_store_name, "", 0) + + def teardown_method(self, test_method): + # Check that the Plasma store is still alive. + assert self.p.poll() is None + # Kill the plasma store process. + if os.getenv("PLASMA_VALGRIND") == "1": + self.p.send_signal(signal.SIGTERM) + self.p.wait() + if self.p.returncode != 0: + assert False + else: + self.p.kill() + + def test_connection_failure_raises_exception(self): + import pyarrow.plasma as plasma + # ARROW-1264 + with pytest.raises(IOError): + plasma.connect('unknown-store-name', '', 0, 1) + + def test_create(self): + # Create an object id string. + object_id = random_object_id() + # Create a new buffer and write to it. + length = 50 + memory_buffer = np.frombuffer(self.plasma_client.create(object_id, + length), + dtype="uint8") + for i in range(length): + memory_buffer[i] = i % 256 + # Seal the object. + self.plasma_client.seal(object_id) + # Get the object. + memory_buffer = np.frombuffer(self.plasma_client.get([object_id])[0], + dtype="uint8") + for i in range(length): + assert memory_buffer[i] == i % 256 + + def test_create_with_metadata(self): + for length in range(1000): + # Create an object id string. + object_id = random_object_id() + # Create a random metadata string. + metadata = generate_metadata(length) + # Create a new buffer and write to it. + memory_buffer = np.frombuffer(self.plasma_client.create(object_id, + length, + metadata), + dtype="uint8") + for i in range(length): + memory_buffer[i] = i % 256 + # Seal the object. + self.plasma_client.seal(object_id) + # Get the object. + memory_buffer = np.frombuffer( + self.plasma_client.get([object_id])[0], dtype="uint8") + for i in range(length): + assert memory_buffer[i] == i % 256 + # Get the metadata. 
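Outside the test harness, the same recipe as start_plasma_store() works for standing up a throwaway store by hand: locate the plasma_store executable shipped with the package, start it with a socket path and a memory limit, then connect. A sketch with an arbitrary socket path and size:

    import os
    import subprocess
    import time

    import pyarrow as pa
    import pyarrow.plasma as plasma

    executable = os.path.join(pa.__path__[0], 'plasma_store')
    socket_path = '/tmp/plasma-demo'                      # arbitrary
    proc = subprocess.Popen([executable, '-s', socket_path, '-m', str(10 ** 8)])
    time.sleep(0.5)                                       # give the store time to come up

    client = plasma.connect(socket_path, '', 64)
    # ... use the client ...
    client.disconnect()
    proc.kill()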
+ metadata_buffer = np.frombuffer( + self.plasma_client.get_metadata([object_id])[0], dtype="uint8") + assert len(metadata) == len(metadata_buffer) + for i in range(len(metadata)): + assert metadata[i] == metadata_buffer[i] + + def test_create_existing(self): + # This test is partially used to test the code path in which we create + # an object with an ID that already exists + length = 100 + for _ in range(1000): + object_id = random_object_id() + self.plasma_client.create(object_id, length, + generate_metadata(length)) + try: + self.plasma_client.create(object_id, length, + generate_metadata(length)) + # TODO(pcm): Introduce a more specific error type here. + except pa.lib.ArrowException: + pass + else: + assert False + + def test_get(self): + num_object_ids = 100 + # Test timing out of get with various timeouts. + for timeout in [0, 10, 100, 1000]: + object_ids = [random_object_id() for _ in range(num_object_ids)] + results = self.plasma_client.get(object_ids, timeout_ms=timeout) + assert results == num_object_ids * [None] + + data_buffers = [] + metadata_buffers = [] + for i in range(num_object_ids): + if i % 2 == 0: + data_buffer, metadata_buffer = create_object_with_id( + self.plasma_client, object_ids[i], 2000, 2000) + data_buffers.append(data_buffer) + metadata_buffers.append(metadata_buffer) + + # Test timing out from some but not all get calls with various + # timeouts. + for timeout in [0, 10, 100, 1000]: + data_results = self.plasma_client.get(object_ids, + timeout_ms=timeout) + # metadata_results = self.plasma_client.get_metadata( + # object_ids, timeout_ms=timeout) + for i in range(num_object_ids): + if i % 2 == 0: + array1 = np.frombuffer(data_buffers[i // 2], dtype="uint8") + array2 = np.frombuffer(data_results[i], dtype="uint8") + np.testing.assert_equal(array1, array2) + # TODO(rkn): We should compare the metadata as well. But + # currently the types are different (e.g., memoryview + # versus bytearray). + # assert plasma.buffers_equal( + # metadata_buffers[i // 2], metadata_results[i]) + else: + assert results[i] is None + + def test_store_arrow_objects(self): + data = np.random.randn(10, 4) + # Write an arrow object. + object_id = random_object_id() + tensor = pa.Tensor.from_numpy(data) + data_size = pa.get_tensor_size(tensor) + buf = self.plasma_client.create(object_id, data_size) + stream = pa.FixedSizeBufferOutputStream(buf) + pa.write_tensor(tensor, stream) + self.plasma_client.seal(object_id) + # Read the arrow object. + [tensor] = self.plasma_client.get([object_id]) + reader = pa.BufferReader(tensor) + array = pa.read_tensor(reader).to_numpy() + # Assert that they are equal. + np.testing.assert_equal(data, array) + + def test_store_pandas_dataframe(self): + import pyarrow.plasma as plasma + d = {'one': pd.Series([1., 2., 3.], index=['a', 'b', 'c']), + 'two': pd.Series([1., 2., 3., 4.], index=['a', 'b', 'c', 'd'])} + df = pd.DataFrame(d) + + # Write the DataFrame. + record_batch = pa.RecordBatch.from_pandas(df) + # Determine the size. + s = pa.MockOutputStream() + stream_writer = pa.RecordBatchStreamWriter(s, record_batch.schema) + stream_writer.write_batch(record_batch) + data_size = s.size() + object_id = plasma.ObjectID(np.random.bytes(20)) + + buf = self.plasma_client.create(object_id, data_size) + stream = pa.FixedSizeBufferOutputStream(buf) + stream_writer = pa.RecordBatchStreamWriter(stream, record_batch.schema) + stream_writer.write_batch(record_batch) + + self.plasma_client.seal(object_id) + + # Read the DataFrame. 
+ [data] = self.plasma_client.get([object_id]) + reader = pa.RecordBatchStreamReader(pa.BufferReader(data)) + result = reader.get_next_batch().to_pandas() + + pd.util.testing.assert_frame_equal(df, result) + + def test_pickle_object_ids(self): + # This can be used for sharing object IDs between processes. + import pickle + object_id = random_object_id() + data = pickle.dumps(object_id) + object_id2 = pickle.loads(data) + assert object_id == object_id2 + + def test_store_full(self): + # The store is started with 1GB, so make sure that create throws an + # exception when it is full. + def assert_create_raises_plasma_full(unit_test, size): + partial_size = np.random.randint(size) + try: + _, memory_buffer, _ = create_object(unit_test.plasma_client, + partial_size, + size - partial_size) + # TODO(pcm): More specific error here. + except pa.lib.ArrowException: + pass + else: + # For some reason the above didn't throw an exception, so fail. + assert False + + # Create a list to keep some of the buffers in scope. + memory_buffers = [] + _, memory_buffer, _ = create_object(self.plasma_client, 5 * 10 ** 8, 0) + memory_buffers.append(memory_buffer) + # Remaining space is 5 * 10 ** 8. Make sure that we can't create an + # object of size 5 * 10 ** 8 + 1, but we can create one of size + # 2 * 10 ** 8. + assert_create_raises_plasma_full(self, 5 * 10 ** 8 + 1) + _, memory_buffer, _ = create_object(self.plasma_client, 2 * 10 ** 8, 0) + del memory_buffer + _, memory_buffer, _ = create_object(self.plasma_client, 2 * 10 ** 8, 0) + del memory_buffer + assert_create_raises_plasma_full(self, 5 * 10 ** 8 + 1) + + _, memory_buffer, _ = create_object(self.plasma_client, 2 * 10 ** 8, 0) + memory_buffers.append(memory_buffer) + # Remaining space is 3 * 10 ** 8. + assert_create_raises_plasma_full(self, 3 * 10 ** 8 + 1) + + _, memory_buffer, _ = create_object(self.plasma_client, 10 ** 8, 0) + memory_buffers.append(memory_buffer) + # Remaining space is 2 * 10 ** 8. + assert_create_raises_plasma_full(self, 2 * 10 ** 8 + 1) + + def test_contains(self): + fake_object_ids = [random_object_id() for _ in range(100)] + real_object_ids = [random_object_id() for _ in range(100)] + for object_id in real_object_ids: + assert self.plasma_client.contains(object_id) is False + self.plasma_client.create(object_id, 100) + self.plasma_client.seal(object_id) + assert self.plasma_client.contains(object_id) + for object_id in fake_object_ids: + assert not self.plasma_client.contains(object_id) + for object_id in real_object_ids: + assert self.plasma_client.contains(object_id) + + def test_hash(self): + # Check the hash of an object that doesn't exist. + object_id1 = random_object_id() + try: + self.plasma_client.hash(object_id1) + # TODO(pcm): Introduce a more specific error type here + except pa.lib.ArrowException: + pass + else: + assert False + + length = 1000 + # Create a random object, and check that the hash function always + # returns the same value. + metadata = generate_metadata(length) + memory_buffer = np.frombuffer(self.plasma_client.create(object_id1, + length, + metadata), + dtype="uint8") + for i in range(length): + memory_buffer[i] = i % 256 + self.plasma_client.seal(object_id1) + assert (self.plasma_client.hash(object_id1) == + self.plasma_client.hash(object_id1)) + + # Create a second object with the same value as the first, and check + # that their hashes are equal. 
+ object_id2 = random_object_id() + memory_buffer = np.frombuffer(self.plasma_client.create(object_id2, + length, + metadata), + dtype="uint8") + for i in range(length): + memory_buffer[i] = i % 256 + self.plasma_client.seal(object_id2) + assert (self.plasma_client.hash(object_id1) == + self.plasma_client.hash(object_id2)) + + # Create a third object with a different value from the first two, and + # check that its hash is different. + object_id3 = random_object_id() + metadata = generate_metadata(length) + memory_buffer = np.frombuffer(self.plasma_client.create(object_id3, + length, + metadata), + dtype="uint8") + for i in range(length): + memory_buffer[i] = (i + 1) % 256 + self.plasma_client.seal(object_id3) + assert (self.plasma_client.hash(object_id1) != + self.plasma_client.hash(object_id3)) + + # Create a fourth object with the same value as the third, but + # different metadata. Check that its hash is different from any of the + # previous three. + object_id4 = random_object_id() + metadata4 = generate_metadata(length) + memory_buffer = np.frombuffer(self.plasma_client.create(object_id4, + length, + metadata4), + dtype="uint8") + for i in range(length): + memory_buffer[i] = (i + 1) % 256 + self.plasma_client.seal(object_id4) + assert (self.plasma_client.hash(object_id1) != + self.plasma_client.hash(object_id4)) + assert (self.plasma_client.hash(object_id3) != + self.plasma_client.hash(object_id4)) + + def test_many_hashes(self): + hashes = [] + length = 2 ** 10 + + for i in range(256): + object_id = random_object_id() + memory_buffer = np.frombuffer(self.plasma_client.create(object_id, + length), + dtype="uint8") + for j in range(length): + memory_buffer[j] = i + self.plasma_client.seal(object_id) + hashes.append(self.plasma_client.hash(object_id)) + + # Create objects of varying length. Each pair has two bits different. + for i in range(length): + object_id = random_object_id() + memory_buffer = np.frombuffer(self.plasma_client.create(object_id, + length), + dtype="uint8") + for j in range(length): + memory_buffer[j] = 0 + memory_buffer[i] = 1 + self.plasma_client.seal(object_id) + hashes.append(self.plasma_client.hash(object_id)) + + # Create objects of varying length, all with value 0. + for i in range(length): + object_id = random_object_id() + memory_buffer = np.frombuffer(self.plasma_client.create(object_id, + i), + dtype="uint8") + for j in range(i): + memory_buffer[j] = 0 + self.plasma_client.seal(object_id) + hashes.append(self.plasma_client.hash(object_id)) + + # Check that all hashes were unique. + assert len(set(hashes)) == 256 + length + length + + # def test_individual_delete(self): + # length = 100 + # # Create an object id string. + # object_id = random_object_id() + # # Create a random metadata string. + # metadata = generate_metadata(100) + # # Create a new buffer and write to it. + # memory_buffer = self.plasma_client.create(object_id, length, + # metadata) + # for i in range(length): + # memory_buffer[i] = chr(i % 256) + # # Seal the object. + # self.plasma_client.seal(object_id) + # # Check that the object is present. + # assert self.plasma_client.contains(object_id) + # # Delete the object. + # self.plasma_client.delete(object_id) + # # Make sure the object is no longer present. + # self.assertFalse(self.plasma_client.contains(object_id)) + # + # def test_delete(self): + # # Create some objects. + # object_ids = [random_object_id() for _ in range(100)] + # for object_id in object_ids: + # length = 100 + # # Create a random metadata string. 
+ # metadata = generate_metadata(100) + # # Create a new buffer and write to it. + # memory_buffer = self.plasma_client.create(object_id, length, + # metadata) + # for i in range(length): + # memory_buffer[i] = chr(i % 256) + # # Seal the object. + # self.plasma_client.seal(object_id) + # # Check that the object is present. + # assert self.plasma_client.contains(object_id) + # + # # Delete the objects and make sure they are no longer present. + # for object_id in object_ids: + # # Delete the object. + # self.plasma_client.delete(object_id) + # # Make sure the object is no longer present. + # self.assertFalse(self.plasma_client.contains(object_id)) + + def test_illegal_functionality(self): + # Create an object id string. + object_id = random_object_id() + # Create a new buffer and write to it. + length = 1000 + memory_buffer = self.plasma_client.create(object_id, length) + # Make sure we cannot access memory out of bounds. + with pytest.raises(Exception): + memory_buffer[length] + # Seal the object. + self.plasma_client.seal(object_id) + # This test is commented out because it currently fails. + # # Make sure the object is ready only now. + # def illegal_assignment(): + # memory_buffer[0] = chr(0) + # with pytest.raises(Exception): + # illegal_assignment() + # Get the object. + memory_buffer = self.plasma_client.get([object_id])[0] + + # Make sure the object is read only. + def illegal_assignment(): + memory_buffer[0] = chr(0) + with pytest.raises(Exception): + illegal_assignment() + + def test_evict(self): + client = self.plasma_client2 + object_id1 = random_object_id() + b1 = client.create(object_id1, 1000) + client.seal(object_id1) + del b1 + assert client.evict(1) == 1000 + + object_id2 = random_object_id() + object_id3 = random_object_id() + b2 = client.create(object_id2, 999) + b3 = client.create(object_id3, 998) + client.seal(object_id3) + del b3 + assert client.evict(1000) == 998 + + object_id4 = random_object_id() + b4 = client.create(object_id4, 997) + client.seal(object_id4) + del b4 + client.seal(object_id2) + del b2 + assert client.evict(1) == 997 + assert client.evict(1) == 999 + + object_id5 = random_object_id() + object_id6 = random_object_id() + object_id7 = random_object_id() + b5 = client.create(object_id5, 996) + b6 = client.create(object_id6, 995) + b7 = client.create(object_id7, 994) + client.seal(object_id5) + client.seal(object_id6) + client.seal(object_id7) + del b5 + del b6 + del b7 + assert client.evict(2000) == 996 + 995 + 994 + + def test_subscribe(self): + # Subscribe to notifications from the Plasma Store. + self.plasma_client.subscribe() + for i in [1, 10, 100, 1000, 10000]: + object_ids = [random_object_id() for _ in range(i)] + metadata_sizes = [np.random.randint(1000) for _ in range(i)] + data_sizes = [np.random.randint(1000) for _ in range(i)] + for j in range(i): + self.plasma_client.create( + object_ids[j], data_sizes[j], + metadata=bytearray(np.random.bytes(metadata_sizes[j]))) + self.plasma_client.seal(object_ids[j]) + # Check that we received notifications for all of the objects. + for j in range(i): + notification_info = self.plasma_client.get_next_notification() + recv_objid, recv_dsize, recv_msize = notification_info + assert object_ids[j] == recv_objid + assert data_sizes[j] == recv_dsize + assert metadata_sizes[j] == recv_msize + + def test_subscribe_deletions(self): + # Subscribe to notifications from the Plasma Store. We use + # plasma_client2 to make sure that all used objects will get evicted + # properly. 
+ self.plasma_client2.subscribe() + for i in [1, 10, 100, 1000, 10000]: + object_ids = [random_object_id() for _ in range(i)] + # Add 1 to the sizes to make sure we have nonzero object sizes. + metadata_sizes = [np.random.randint(1000) + 1 for _ in range(i)] + data_sizes = [np.random.randint(1000) + 1 for _ in range(i)] + for j in range(i): + x = self.plasma_client2.create( + object_ids[j], data_sizes[j], + metadata=bytearray(np.random.bytes(metadata_sizes[j]))) + self.plasma_client2.seal(object_ids[j]) + del x + # Check that we received notifications for creating all of the + # objects. + for j in range(i): + notification_info = self.plasma_client2.get_next_notification() + recv_objid, recv_dsize, recv_msize = notification_info + assert object_ids[j] == recv_objid + assert data_sizes[j] == recv_dsize + assert metadata_sizes[j] == recv_msize + + # Check that we receive notifications for deleting all objects, as + # we evict them. + for j in range(i): + assert (self.plasma_client2.evict(1) == + data_sizes[j] + metadata_sizes[j]) + notification_info = self.plasma_client2.get_next_notification() + recv_objid, recv_dsize, recv_msize = notification_info + assert object_ids[j] == recv_objid + assert -1 == recv_dsize + assert -1 == recv_msize + + # Test multiple deletion notifications. The first 9 object IDs have + # size 0, and the last has a nonzero size. When Plasma evicts 1 byte, + # it will evict all objects, so we should receive deletion + # notifications for each. + num_object_ids = 10 + object_ids = [random_object_id() for _ in range(num_object_ids)] + metadata_sizes = [0] * (num_object_ids - 1) + data_sizes = [0] * (num_object_ids - 1) + metadata_sizes.append(np.random.randint(1000)) + data_sizes.append(np.random.randint(1000)) + for i in range(num_object_ids): + x = self.plasma_client2.create( + object_ids[i], data_sizes[i], + metadata=bytearray(np.random.bytes(metadata_sizes[i]))) + self.plasma_client2.seal(object_ids[i]) + del x + for i in range(num_object_ids): + notification_info = self.plasma_client2.get_next_notification() + recv_objid, recv_dsize, recv_msize = notification_info + assert object_ids[i] == recv_objid + assert data_sizes[i] == recv_dsize + assert metadata_sizes[i] == recv_msize + assert (self.plasma_client2.evict(1) == + data_sizes[-1] + metadata_sizes[-1]) + for i in range(num_object_ids): + notification_info = self.plasma_client2.get_next_notification() + recv_objid, recv_dsize, recv_msize = notification_info + assert object_ids[i] == recv_objid + assert -1 == recv_dsize + assert -1 == recv_msize diff --git a/python/pyarrow/tests/test_table.py b/python/pyarrow/tests/test_table.py index c2aeda9b2df..28b98f0952a 100644 --- a/python/pyarrow/tests/test_table.py +++ b/python/pyarrow/tests/test_table.py @@ -80,7 +80,7 @@ def test_recordbatch_basics(): batch[2] -def test_recordbatch_slice(): +def test_recordbatch_slice_getitem(): data = [ pa.array(range(5)), pa.array([-10, -5, 0, 5, 10]) @@ -90,7 +90,6 @@ def test_recordbatch_slice(): batch = pa.RecordBatch.from_arrays(data, names) sliced = batch.slice(2) - assert sliced.num_rows == 3 expected = pa.RecordBatch.from_arrays( @@ -111,6 +110,14 @@ def test_recordbatch_slice(): with pytest.raises(IndexError): batch.slice(-1) + # Check __getitem__-based slicing + assert batch.slice(0, 0).equals(batch[:0]) + assert batch.slice(0, 2).equals(batch[:2]) + assert batch.slice(2, 2).equals(batch[2:4]) + assert batch.slice(2, len(batch) - 2).equals(batch[2:]) + assert batch.slice(len(batch) - 2, 2).equals(batch[-2:]) + assert 
batch.slice(len(batch) - 4, 2).equals(batch[-4:-2]) + def test_recordbatch_from_to_pandas(): data = pd.DataFrame({ diff --git a/python/pyarrow/types.pxi b/python/pyarrow/types.pxi index a8d7aa0ee81..fefde55bc2f 100644 --- a/python/pyarrow/types.pxi +++ b/python/pyarrow/types.pxi @@ -281,12 +281,12 @@ cdef class Schema: def __len__(self): return self.schema.num_fields() - def __getitem__(self, int64_t i): + def __getitem__(self, int i): cdef: Field result = Field() - int64_t num_fields = self.schema.num_fields() - int64_t index + int num_fields = self.schema.num_fields() + int index if not -num_fields <= i < num_fields: raise IndexError( @@ -456,7 +456,7 @@ def field(name, DataType type, bint nullable=True, dict metadata=None): convert_metadata(metadata, &c_meta) result.sp_field.reset(new CField(tobytes(name), type.sp_type, - nullable, c_meta)) + nullable == 1, c_meta)) result.field = result.sp_field.get() result.type = type return result diff --git a/python/pyarrow/util.py b/python/pyarrow/util.py index 4b6a8356330..d984e19215b 100644 --- a/python/pyarrow/util.py +++ b/python/pyarrow/util.py @@ -15,6 +15,8 @@ # specific language governing permissions and limitations # under the License. +import warnings + # Miscellaneous utility code @@ -23,3 +25,13 @@ def decorator(g): g.__doc__ = f.__doc__ return g return decorator + + +def _deprecate_class(old_name, new_name, klass, next_version='0.5.0'): + msg = ('pyarrow.{0} is deprecated as of {1}, please use {2} instead' + .format(old_name, next_version, new_name)) + + def deprecated_factory(*args, **kwargs): + warnings.warn(msg, FutureWarning) + return klass(*args) + return deprecated_factory diff --git a/python/setup.py b/python/setup.py index 1ea57ae2d85..ebf28cc64e9 100644 --- a/python/setup.py +++ b/python/setup.py @@ -82,6 +82,7 @@ def run(self): user_options = ([('extra-cmake-args=', None, 'extra arguments for CMake'), ('build-type=', None, 'build type (debug or release)'), ('with-parquet', None, 'build the Parquet extension'), + ('with-plasma', None, 'build the Plasma extension'), ('bundle-arrow-cpp', None, 'bundle the Arrow C++ libraries')] + _build_ext.user_options) @@ -91,6 +92,8 @@ def initialize_options(self): self.extra_cmake_args = os.environ.get('PYARROW_CMAKE_OPTIONS', '') self.build_type = os.environ.get('PYARROW_BUILD_TYPE', 'debug').lower() + self.cmake_cxxflags = os.environ.get('PYARROW_CXXFLAGS', '') + if sys.platform == 'win32': # Cannot do debug builds in Windows unless Python itself is a debug # build @@ -99,12 +102,15 @@ def initialize_options(self): self.with_parquet = strtobool( os.environ.get('PYARROW_WITH_PARQUET', '0')) + self.with_plasma = strtobool( + os.environ.get('PYARROW_WITH_PLASMA', '0')) self.bundle_arrow_cpp = strtobool( os.environ.get('PYARROW_BUNDLE_ARROW_CPP', '0')) CYTHON_MODULE_NAMES = [ 'lib', - '_parquet'] + '_parquet', + 'plasma'] def _run_cmake(self): # The directory containing this setup.py @@ -139,14 +145,22 @@ def _run_cmake(self): if self.with_parquet: cmake_options.append('-DPYARROW_BUILD_PARQUET=on') + if self.with_plasma: + cmake_options.append('-DPYARROW_BUILD_PLASMA=on') + + if len(self.cmake_cxxflags) > 0: + cmake_options.append('-DPYARROW_CXXFLAGS="{0}"' + .format(self.cmake_cxxflags)) + if self.bundle_arrow_cpp: cmake_options.append('-DPYARROW_BUNDLE_ARROW_CPP=ON') # ARROW-1090: work around CMake rough edges if 'ARROW_HOME' in os.environ and sys.platform != 'win32': - os.environ['PKG_CONFIG_PATH'] = pjoin(os.environ['ARROW_HOME'], 'lib', 'pkgconfig') + pkg_config = 
pjoin(os.environ['ARROW_HOME'], 'lib', + 'pkgconfig') + os.environ['PKG_CONFIG_PATH'] = pkg_config del os.environ['ARROW_HOME'] - cmake_options.append('-DCMAKE_BUILD_TYPE={0}' .format(self.build_type.lower())) @@ -239,9 +253,12 @@ def move_lib(lib_name): print(pjoin(build_prefix, 'include'), pjoin(build_lib, 'pyarrow')) if os.path.exists(pjoin(build_lib, 'pyarrow', 'include')): shutil.rmtree(pjoin(build_lib, 'pyarrow', 'include')) - shutil.move(pjoin(build_prefix, 'include'), pjoin(build_lib, 'pyarrow')) + shutil.move(pjoin(build_prefix, 'include'), + pjoin(build_lib, 'pyarrow')) move_lib("arrow") move_lib("arrow_python") + if self.with_plasma: + move_lib("plasma") if self.with_parquet: move_lib("parquet") @@ -270,11 +287,22 @@ def move_lib(lib_name): shutil.move(self.get_ext_built_api_header(name), pjoin(os.path.dirname(ext_path), name + '_api.h')) + # Move the plasma store + if self.with_plasma: + build_py = self.get_finalized_command('build_py') + source = os.path.join(self.build_type, "plasma_store") + target = os.path.join(build_lib, + build_py.get_package_dir('pyarrow'), + "plasma_store") + shutil.move(source, target) + os.chdir(saved_cwd) def _failure_permitted(self, name): if name == '_parquet' and not self.with_parquet: return True + if name == 'plasma' and not self.with_plasma: + return True return False def _get_inplace_dir(self): @@ -335,6 +363,7 @@ def get_outputs(self): language-bindings for structure manipulation. It also provides IPC and common algorithm implementations.""" + class BinaryDistribution(Distribution): def has_ext_modules(foo): return True diff --git a/python/testing/README.md b/python/testing/README.md new file mode 100644 index 00000000000..07970a231b5 --- /dev/null +++ b/python/testing/README.md @@ -0,0 +1,26 @@ + + +# Testing tools for odds and ends + +## Testing HDFS file interface + +```shell +./test_hdfs.sh +``` \ No newline at end of file diff --git a/python/testing/functions.sh b/python/testing/functions.sh new file mode 100644 index 00000000000..6bc342bd794 --- /dev/null +++ b/python/testing/functions.sh @@ -0,0 +1,100 @@ +#!/usr/bin/env bash + +# Licensed to the Apache Software Foundation (ASF) under one +# or more contributor license agreements. See the NOTICE file +# distributed with this work for additional information +# regarding copyright ownership. The ASF licenses this file +# to you under the Apache License, Version 2.0 (the +# "License"); you may not use this file except in compliance +# with the License. You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, +# software distributed under the License is distributed on an +# "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY +# KIND, either express or implied. See the License for the +# specific language governing permissions and limitations +# under the License. 
+ +use_gcc() { + export CC=gcc-4.9 + export CXX=g++-4.9 +} + +use_clang() { + export CC=clang-4.0 + export CXX=clang++-4.0 +} + +bootstrap_python_env() { + PYTHON_VERSION=$1 + CONDA_ENV_DIR=$BUILD_DIR/pyarrow-test-$PYTHON_VERSION + + conda create -y -q -p $CONDA_ENV_DIR python=$PYTHON_VERSION cmake curl + source activate $CONDA_ENV_DIR + + python --version + which python + + # faster builds, please + conda install -y -q nomkl pip numpy pandas cython +} + +build_pyarrow() { + # Other stuff pip install + pushd $ARROW_PYTHON_DIR + pip install -r requirements.txt + python setup.py build_ext --with-parquet --with-plasma \ + install --single-version-externally-managed --record=record.text + popd + + python -c "import pyarrow.parquet" + python -c "import pyarrow.plasma" + + export PYARROW_PATH=$CONDA_PREFIX/lib/python$PYTHON_VERSION/site-packages/pyarrow +} + +build_arrow() { + mkdir -p $ARROW_CPP_BUILD_DIR + pushd $ARROW_CPP_BUILD_DIR + + cmake -GNinja \ + -DCMAKE_BUILD_TYPE=$BUILD_TYPE \ + -DCMAKE_INSTALL_PREFIX=$ARROW_HOME \ + -DARROW_NO_DEPRECATED_API=ON \ + -DARROW_PYTHON=ON \ + -DARROW_PLASMA=ON \ + -DARROW_BOOST_USE_SHARED=off \ + $ARROW_CPP_DIR + + ninja + ninja install + popd +} + +build_parquet() { + PARQUET_DIR=$BUILD_DIR/parquet + mkdir -p $PARQUET_DIR + + git clone https://github.com/apache/parquet-cpp.git $PARQUET_DIR + + pushd $PARQUET_DIR + mkdir build-dir + cd build-dir + + cmake \ + -GNinja \ + -DCMAKE_BUILD_TYPE=$BUILD_TYPE \ + -DCMAKE_INSTALL_PREFIX=$PARQUET_HOME \ + -DPARQUET_BOOST_USE_SHARED=off \ + -DPARQUET_BUILD_BENCHMARKS=off \ + -DPARQUET_BUILD_EXECUTABLES=off \ + -DPARQUET_BUILD_TESTS=off \ + .. + + ninja + ninja install + + popd +} diff --git a/python/testing/hdfs/Dockerfile b/python/testing/hdfs/Dockerfile new file mode 100644 index 00000000000..97355137ff3 --- /dev/null +++ b/python/testing/hdfs/Dockerfile @@ -0,0 +1,50 @@ +# Licensed to the Apache Software Foundation (ASF) under one +# or more contributor license agreements. See the NOTICE file +# distributed with this work for additional information +# regarding copyright ownership. The ASF licenses this file +# to you under the Apache License, Version 2.0 (the +# "License"); you may not use this file except in compliance +# with the License. You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, +# software distributed under the License is distributed on an +# "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY +# KIND, either express or implied. See the License for the +# specific language governing permissions and limitations +# under the License. 
+
+# TODO Replace this with a complete clean image build
+FROM cpcloud86/impala:metastore
+
+USER root
+
+RUN apt-add-repository -y ppa:ubuntu-toolchain-r/test && \
+    apt-get update && \
+    apt-get install -y \
+        gcc-4.9 \
+        g++-4.9 \
+        build-essential \
+        autotools-dev \
+        autoconf \
+        gtk-doc-tools \
+        autoconf-archive \
+        libgirepository1.0-dev \
+        libtool \
+        libjemalloc-dev \
+        ccache \
+        valgrind \
+        gdb
+
+RUN wget -O - http://llvm.org/apt/llvm-snapshot.gpg.key|sudo apt-key add - && \
+    apt-add-repository -y \
+    "deb http://llvm.org/apt/trusty/ llvm-toolchain-trusty-4.0 main" && \
+    apt-get update && \
+    apt-get install -y clang-4.0 clang-format-4.0 clang-tidy-4.0
+
+USER ubuntu
+
+RUN wget -O /tmp/miniconda.sh https://repo.continuum.io/miniconda/Miniconda3-latest-Linux-x86_64.sh && \
+    bash /tmp/miniconda.sh -b -p /home/ubuntu/miniconda && \
+    rm /tmp/miniconda.sh
diff --git a/python/testing/hdfs/libhdfs3-hdfs-client.xml b/python/testing/hdfs/libhdfs3-hdfs-client.xml
new file mode 100644
index 00000000000..f929929b386
--- /dev/null
+++ b/python/testing/hdfs/libhdfs3-hdfs-client.xml
@@ -0,0 +1,332 @@
+<configuration>
+
+  <property>
+    <name>rpc.client.timeout</name>
+    <value>3600000</value>
+    <description>timeout interval of a RPC invocation in millisecond. default is 3600000.</description>
+  </property>
+
+  <property>
+    <name>rpc.client.connect.tcpnodelay</name>
+    <value>true</value>
+    <description>whether set socket TCP_NODELAY to true when connect to RPC server. default is true.</description>
+  </property>
+
+  <property>
+    <name>rpc.client.max.idle</name>
+    <value>10000</value>
+    <description>the max idle time of a RPC connection in millisecond. default is 10000.</description>
+  </property>
+
+  <property>
+    <name>rpc.client.ping.interval</name>
+    <value>10000</value>
+    <description>the interval which the RPC client send a heart beat to server. 0 means disable, default is 10000.</description>
+  </property>
+
+  <property>
+    <name>rpc.client.connect.timeout</name>
+    <value>600000</value>
+    <description>the timeout interval in millisecond when the RPC client is trying to setup the connection. default is 600000.</description>
+  </property>
+
+  <property>
+    <name>rpc.client.connect.retry</name>
+    <value>10</value>
+    <description>the max retry times if the RPC client fail to setup the connection to server. default is 10.</description>
+  </property>
+
+  <property>
+    <name>rpc.client.read.timeout</name>
+    <value>3600000</value>
+    <description>the timeout interval in millisecond when the RPC client is trying to read from server. default is 3600000.</description>
+  </property>
+
+  <property>
+    <name>rpc.client.write.timeout</name>
+    <value>3600000</value>
+    <description>the timeout interval in millisecond when the RPC client is trying to write to server. default is 3600000.</description>
+  </property>
+
+  <property>
+    <name>rpc.client.socket.linger.timeout</name>
+    <value>-1</value>
+    <description>set value to socket SO_LINGER when connect to RPC server. -1 means default OS value. default is -1.</description>
+  </property>
+
+  <property>
+    <name>dfs.client.read.shortcircuit</name>
+    <value>false</value>
+    <description>whether reading block file bypass datanode if the block and the client are on the same node. default is true.</description>
+  </property>
+
+  <property>
+    <name>dfs.default.replica</name>
+    <value>1</value>
+    <description>the default number of replica. default is 3.</description>
+  </property>
+
+  <property>
+    <name>dfs.prefetchsize</name>
+    <value>10</value>
+    <description>the default number of blocks which information will be prefetched. default is 10.</description>
+  </property>
+
+  <property>
+    <name>dfs.client.failover.max.attempts</name>
+    <value>15</value>
+    <description>if multiply namenodes are configured, it is the max retry times when the dfs client try to issue a RPC call. default is 15.</description>
+  </property>
+
+  <property>
+    <name>dfs.default.blocksize</name>
+    <value>134217728</value>
+    <description>default block size. default is 134217728.</description>
+  </property>
+
+  <property>
+    <name>dfs.client.log.severity</name>
+    <value>INFO</value>
+    <description>the minimal log severity level, valid values include FATAL, ERROR, INFO, DEBUG1, DEBUG2, DEBUG3. default is INFO.</description>
+  </property>
+
+  <property>
+    <name>input.connect.timeout</name>
+    <value>600000</value>
+    <description>the timeout interval in millisecond when the input stream is trying to setup the connection to datanode. default is 600000.</description>
+  </property>
+
+  <property>
+    <name>input.read.timeout</name>
+    <value>3600000</value>
+    <description>the timeout interval in millisecond when the input stream is trying to read from datanode. default is 3600000.</description>
+  </property>
+
+  <property>
+    <name>input.write.timeout</name>
+    <value>3600000</value>
+    <description>the timeout interval in millisecond when the input stream is trying to write to datanode. default is 3600000.</description>
+  </property>
+
+  <property>
+    <name>input.localread.default.buffersize</name>
+    <value>2097152</value>
+    <description>
+      number of bytes of the buffer which is used to hold the data from block file and verify checksum.
+      it is only used when "dfs.client.read.shortcircuit" is set to true. default is 1048576.
+    </description>
+  </property>
+
+  <property>
+    <name>input.localread.blockinfo.cachesize</name>
+    <value>1000</value>
+    <description>the size of block file path information cache. default is 1000.</description>
+  </property>
+
+  <property>
+    <name>input.read.getblockinfo.retry</name>
+    <value>3</value>
+    <description>the max retry times when the client fail to get block information from namenode. default is 3.</description>
+  </property>
+
+  <property>
+    <name>output.replace-datanode-on-failure</name>
+    <value>false</value>
+    <description>whether the client add new datanode into pipeline if the number of nodes in pipeline is less the specified number of replicas. default is false.</description>
+  </property>
+
+  <property>
+    <name>output.default.chunksize</name>
+    <value>512</value>
+    <description>the number of bytes of a chunk in pipeline. default is 512.</description>
+  </property>
+
+  <property>
+    <name>output.default.packetsize</name>
+    <value>65536</value>
+    <description>the number of bytes of a packet in pipeline. default is 65536.</description>
+  </property>
+
+  <property>
+    <name>output.default.write.retry</name>
+    <value>10</value>
+    <description>the max retry times when the client fail to setup the pipeline. default is 10.</description>
+  </property>
+
+  <property>
+    <name>output.connect.timeout</name>
+    <value>600000</value>
+    <description>the timeout interval in millisecond when the output stream is trying to setup the connection to datanode. default is 600000.</description>
+  </property>
+
+  <property>
+    <name>output.read.timeout</name>
+    <value>3600000</value>
+    <description>the timeout interval in millisecond when the output stream is trying to read from datanode. default is 3600000.</description>
+  </property>
+
+  <property>
+    <name>output.write.timeout</name>
+    <value>3600000</value>
+    <description>the timeout interval in millisecond when the output stream is trying to write to datanode. default is 3600000.</description>
+  </property>
+
+  <property>
+    <name>output.packetpool.size</name>
+    <value>1024</value>
+    <description>the max number of packets in a file's packet pool. default is 1024.</description>
+  </property>
+
+  <property>
+    <name>output.close.timeout</name>
+    <value>900000</value>
+    <description>the timeout interval in millisecond when close an output stream. default is 900000.</description>
+  </property>
+
+  <property>
+    <name>dfs.domain.socket.path</name>
+    <value>/var/lib/hadoop-hdfs/dn_socket</value>
+    <description>
+      Optional. This is a path to a UNIX domain socket that will be used for
+      communication between the DataNode and local HDFS clients.
+      If the string "_PORT" is present in this path, it will be replaced by the
+      TCP port of the DataNode.
+    </description>
+  </property>
+
+  <property>
+    <name>dfs.client.use.legacy.blockreader.local</name>
+    <value>false</value>
+    <description>
+      Legacy short-circuit reader implementation based on HDFS-2246 is used
+      if this configuration parameter is true.
+      This is for the platforms other than Linux
+      where the new implementation based on HDFS-347 is not available.
+    </description>
+  </property>
+
+</configuration>
diff --git a/python/testing/hdfs/restart_docker_container.sh b/python/testing/hdfs/restart_docker_container.sh
new file mode 100644
index 00000000000..15076cc2873
--- /dev/null
+++ b/python/testing/hdfs/restart_docker_container.sh
@@ -0,0 +1,38 @@
+#!/usr/bin/env bash
+
+# Licensed to the Apache Software Foundation (ASF) under one
+# or more contributor license agreements. See the NOTICE file
+# distributed with this work for additional information
+# regarding copyright ownership. The ASF licenses this file
+# to you under the Apache License, Version 2.0 (the
+# "License"); you may not use this file except in compliance
+# with the License. You may obtain a copy of the License at
+#
+# http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing,
+# software distributed under the License is distributed on an
+# "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+# KIND, either express or implied.
See the License for the +# specific language governing permissions and limitations +# under the License. + +export ARROW_TEST_NN_HOST=arrow-hdfs +export ARROW_TEST_IMPALA_HOST=$ARROW_TEST_NN_HOST +export ARROW_TEST_IMPALA_PORT=21050 +export ARROW_TEST_WEBHDFS_PORT=50070 +export ARROW_TEST_WEBHDFS_USER=ubuntu + +docker stop $ARROW_TEST_NN_HOST +docker rm $ARROW_TEST_NN_HOST + +docker run -d -it --name $ARROW_TEST_NN_HOST \ + -v $PWD:/io \ + --hostname $ARROW_TEST_NN_HOST \ + --shm-size=2gb \ + -p $ARROW_TEST_WEBHDFS_PORT -p $ARROW_TEST_IMPALA_PORT \ + arrow-hdfs-test + +while ! docker exec $ARROW_TEST_NN_HOST impala-shell -q 'SELECT VERSION()'; do + sleep 1 +done diff --git a/python/testing/hdfs/run_tests.sh b/python/testing/hdfs/run_tests.sh new file mode 100755 index 00000000000..e0d36df58a3 --- /dev/null +++ b/python/testing/hdfs/run_tests.sh @@ -0,0 +1,41 @@ +#!/usr/bin/env bash +# +# Licensed to the Apache Software Foundation (ASF) under one +# or more contributor license agreements. See the NOTICE file +# distributed with this work for additional information +# regarding copyright ownership. The ASF licenses this file +# to you under the Apache License, Version 2.0 (the +# "License"); you may not use this file except in compliance +# with the License. You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, +# software distributed under the License is distributed on an +# "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY +# KIND, either express or implied. See the License for the +# specific language governing permissions and limitations +# under the License. + +set -ex + +HERE=$(cd `dirname "${BASH_SOURCE[0]:-$0}"` && pwd) + +source $HERE/../set_env_common.sh +source $HERE/../setup_toolchain.sh +source $HERE/../functions.sh + +git clone https://github.com/apache/arrow.git $ARROW_CHECKOUT + +use_clang + +bootstrap_python_env 3.6 + +build_arrow +build_parquet + +build_pyarrow + +$ARROW_CPP_BUILD_DIR/debug/io-hdfs-test + +python -m pytest -vv -r sxX -s $PYARROW_PATH --parquet --hdfs diff --git a/python/testing/parquet_interop.py b/python/testing/parquet_interop.py new file mode 100644 index 00000000000..ba2eb6fa416 --- /dev/null +++ b/python/testing/parquet_interop.py @@ -0,0 +1,53 @@ +# Licensed to the Apache Software Foundation (ASF) under one +# or more contributor license agreements. See the NOTICE file +# distributed with this work for additional information +# regarding copyright ownership. The ASF licenses this file +# to you under the Apache License, Version 2.0 (the +# "License"); you may not use this file except in compliance +# with the License. You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, +# software distributed under the License is distributed on an +# "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY +# KIND, either express or implied. See the License for the +# specific language governing permissions and limitations +# under the License. 
+ +import os +import pytest + +import fastparquet +import pandas as pd +import pyarrow as pa +import pyarrow.parquet as pq +import pandas.util.testing as tm + + +def hdfs_test_client(driver='libhdfs'): + host = os.environ.get('ARROW_HDFS_TEST_HOST', 'localhost') + user = os.environ['ARROW_HDFS_TEST_USER'] + try: + port = int(os.environ.get('ARROW_HDFS_TEST_PORT', 20500)) + except ValueError: + raise ValueError('Env variable ARROW_HDFS_TEST_PORT was not ' + 'an integer') + + return pa.HdfsClient(host, port, user, driver=driver) + + +def test_fastparquet_read_with_hdfs(): + fs = hdfs_test_client() + + df = tm.makeDataFrame() + table = pa.Table.from_pandas(df) + + path = '/tmp/testing.parquet' + with fs.open(path, 'wb') as f: + pq.write_table(table, f) + + parquet_file = fastparquet.ParquetFile(path, open_with=fs.open) + + result = parquet_file.to_pandas() + tm.assert_frame_equal(result, df) diff --git a/python/testing/set_env_common.sh b/python/testing/set_env_common.sh new file mode 100644 index 00000000000..00251f92be4 --- /dev/null +++ b/python/testing/set_env_common.sh @@ -0,0 +1,70 @@ +#!/usr/bin/env bash +# +# Licensed to the Apache Software Foundation (ASF) under one +# or more contributor license agreements. See the NOTICE file +# distributed with this work for additional information +# regarding copyright ownership. The ASF licenses this file +# to you under the Apache License, Version 2.0 (the +# "License"); you may not use this file except in compliance +# with the License. You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, +# software distributed under the License is distributed on an +# "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY +# KIND, either express or implied. See the License for the +# specific language governing permissions and limitations +# under the License. 
+ +export MINICONDA=$HOME/miniconda +export CPP_TOOLCHAIN=$HOME/cpp-toolchain + +export PATH="$MINICONDA/bin:$PATH" +export CONDA_PKGS_DIRS=$HOME/.conda_packages + +export ARROW_CHECKOUT=$HOME/arrow +export BUILD_DIR=$ARROW_CHECKOUT + +export BUILD_OS_NAME=linux +export BUILD_TYPE=debug + +export ARROW_CPP_DIR=$BUILD_DIR/cpp +export ARROW_PYTHON_DIR=$BUILD_DIR/python +export ARROW_C_GLIB_DIR=$BUILD_DIR/c_glib +export ARROW_JAVA_DIR=${BUILD_DIR}/java +export ARROW_JS_DIR=${BUILD_DIR}/js +export ARROW_INTEGRATION_DIR=$BUILD_DIR/integration + +export CPP_BUILD_DIR=$BUILD_DIR/cpp-build + +export ARROW_CPP_INSTALL=$BUILD_DIR/cpp-install +export ARROW_CPP_BUILD_DIR=$BUILD_DIR/cpp-build +export ARROW_C_GLIB_INSTALL=$BUILD_DIR/c-glib-install + +export ARROW_BUILD_TOOLCHAIN=$CPP_TOOLCHAIN +export PARQUET_BUILD_TOOLCHAIN=$CPP_TOOLCHAIN + +export BOOST_ROOT=$CPP_TOOLCHAIN +export PATH=$CPP_TOOLCHAIN/bin:$PATH +export LD_LIBRARY_PATH=$CPP_TOOLCHAIN/lib:$LD_LIBRARY_PATH + +export VALGRIND="valgrind --tool=memcheck" + +export ARROW_HOME=$CPP_TOOLCHAIN +export PARQUET_HOME=$CPP_TOOLCHAIN + +# Arrow test variables + +export JAVA_HOME=/usr/lib/jvm/java-7-oracle +export HADOOP_HOME=/usr/lib/hadoop +export CLASSPATH=`$HADOOP_HOME/bin/hadoop classpath --glob` +export HADOOP_OPTS="$HADOOP_OPTS -Djava.library.path=$HADOOP_HOME/lib/native" +export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:$HADOOP_HOME/lib/native/ + +export ARROW_HDFS_TEST_HOST=arrow-hdfs +export ARROW_HDFS_TEST_PORT=9000 +export ARROW_HDFS_TEST_USER=ubuntu +export ARROW_LIBHDFS_DIR=/usr/lib + +export LIBHDFS3_CONF=/io/hdfs/libhdfs3-hdfs-client.xml diff --git a/python/testing/setup_toolchain.sh b/python/testing/setup_toolchain.sh new file mode 100644 index 00000000000..c3837b45cbc --- /dev/null +++ b/python/testing/setup_toolchain.sh @@ -0,0 +1,65 @@ +#!/usr/bin/env bash +# +# Licensed to the Apache Software Foundation (ASF) under one +# or more contributor license agreements. See the NOTICE file +# distributed with this work for additional information +# regarding copyright ownership. The ASF licenses this file +# to you under the Apache License, Version 2.0 (the +# "License"); you may not use this file except in compliance +# with the License. You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, +# software distributed under the License is distributed on an +# "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY +# KIND, either express or implied. See the License for the +# specific language governing permissions and limitations +# under the License. 
+ +set -e + +export PATH="$MINICONDA/bin:$PATH" +conda update -y -q conda +conda config --set auto_update_conda false +conda info -a + +conda config --set show_channel_urls True + +# Help with SSL timeouts to S3 +conda config --set remote_connect_timeout_secs 12 + +conda config --add channels https://repo.continuum.io/pkgs/free +conda config --add channels conda-forge +conda info -a + +# faster builds, please +conda install -y nomkl + +conda install --y conda-build jinja2 anaconda-client cmake curl + +# Set up C++ toolchain +conda create -y -q -p $CPP_TOOLCHAIN python=3.6 \ + jemalloc=4.4.0 \ + nomkl \ + boost-cpp \ + rapidjson \ + flatbuffers \ + gflags \ + lz4-c \ + snappy \ + zstd \ + brotli \ + zlib \ + git \ + cmake \ + curl \ + thrift-cpp \ + libhdfs3 \ + ninja + +if [ $BUILD_OS_NAME == "osx" ]; then + brew update > /dev/null + brew install jemalloc + brew install ccache +fi diff --git a/python/testing/test_hdfs.sh b/python/testing/test_hdfs.sh new file mode 100755 index 00000000000..016e54a66a6 --- /dev/null +++ b/python/testing/test_hdfs.sh @@ -0,0 +1,25 @@ +#!/usr/bin/env bash +# +# Licensed to the Apache Software Foundation (ASF) under one +# or more contributor license agreements. See the NOTICE file +# distributed with this work for additional information +# regarding copyright ownership. The ASF licenses this file +# to you under the Apache License, Version 2.0 (the +# "License"); you may not use this file except in compliance +# with the License. You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, +# software distributed under the License is distributed on an +# "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY +# KIND, either express or implied. See the License for the +# specific language governing permissions and limitations +# under the License. + +set -ex + +docker build -t arrow-hdfs-test -f hdfs/Dockerfile . +bash hdfs/restart_docker_container.sh +docker exec -it arrow-hdfs /io/hdfs/run_tests.sh +docker stop arrow-hdfs diff --git a/site/README.md b/site/README.md index 0e052c84aeb..1b0a82e03db 100644 --- a/site/README.md +++ b/site/README.md @@ -1,15 +1,20 @@ ## Apache Arrow Website diff --git a/site/_posts/2017-07-24-0.5.0-release.md b/site/_posts/2017-07-24-0.5.0-release.md new file mode 100644 index 00000000000..5c156bfec78 --- /dev/null +++ b/site/_posts/2017-07-24-0.5.0-release.md @@ -0,0 +1,114 @@ +--- +layout: post +title: "Apache Arrow 0.5.0 Release" +date: "2017-07-25 00:00:00 -0400" +author: wesm +categories: [release] +--- + + +The Apache Arrow team is pleased to announce the 0.5.0 release. It includes +[**130 resolved JIRAs**][1] with some new features, expanded integration +testing between implementations, and bug fixes. The Arrow memory format remains +stable since the 0.3.x and 0.4.x releases. + +See the [Install Page][2] to learn how to get the libraries for your +platform. The [complete changelog][5] is also available. + +## Expanded Integration Testing + +In this release, we added compatibility tests for dictionary-encoded data +between Java and C++. This enables the distinct values (the *dictionary*) in a +vector to be transmitted as part of an Arrow schema while the record batches +contain integers which correspond to the dictionary. 
+
+So we might have:
+
+```
+data (string): ['foo', 'bar', 'foo', 'bar']
+```
+
+In dictionary-encoded form, this could be represented as:
+
+```
+indices (int8): [0, 1, 0, 1]
+dictionary (string): ['foo', 'bar']
+```
+
+In upcoming releases, we plan to complete integration testing for the remaining
+data types (including some more complicated types like unions and decimals) on
+the road to a future 1.0.0 release.
+
+## C++ Activity
+
+We completed a number of significant pieces of work in the C++ part of Apache
+Arrow.
+
+### Using jemalloc as default memory allocator
+
+We decided to use [jemalloc][4] as the default memory allocator unless it is
+explicitly disabled. This memory allocator has significant performance
+advantages in Arrow workloads over the default `malloc` implementation. We will
+publish a blog post going into more detail about this and why you might care.
+
+### Sharing more C++ code with Apache Parquet
+
+We imported the compression library interfaces and dictionary encoding
+algorithms from the [Apache Parquet C++ library][3]. The Parquet library now
+depends on this code in Arrow, and we will be able to use it more easily for
+data compression in Arrow use cases.
+
+As part of incorporating Parquet's dictionary encoding utilities, we have
+developed an `arrow::DictionaryBuilder` class to enable building
+dictionary-encoded arrays iteratively. This can help save memory and yield
+better performance when interacting with databases, Parquet files, or other
+sources whose columns contain many duplicate values.
+
+### Support for LZ4 and ZSTD compressors
+
+We added LZ4 and ZSTD compression library support. In ARROW-300 and other
+planned work, we intend to add some compression features for data sent via RPC.
+
+## Python Activity
+
+We fixed many bugs affecting Parquet and Feather users and smoothed several
+other rough edges in everyday Arrow use. We also added some additional
+Arrow type conversions: structs, lists embedded in pandas objects, and Arrow
+time types (which deserialize to the `datetime.time` type).
+
+In upcoming releases we plan to continue to improve [Dask][7] support and
+performance for distributed processing of Apache Parquet files with pyarrow.
+
+## The Road Ahead
+
+We have much work ahead of us to build out Arrow integrations in other data
+systems and to improve their processing performance and interoperability.
+
+We are discussing the roadmap to a future 1.0.0 release on the [developer
+mailing list][6]. Please join the discussion there.
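As a rough illustration of the dictionary-encoded representation above, the sketch below rebuilds the same `['foo', 'bar', 'foo', 'bar']` example with pyarrow. It relies on `Array.dictionary_encode()` and `DictionaryArray.from_arrays()`, convenience APIs from later pyarrow releases rather than the 0.5.0 API described in this post, so treat it as a sketch of the concept, not of what shipped in this release.

```python
import pyarrow as pa

# Dictionary-encode a string array: the distinct values become the dictionary
# and the data itself becomes small integer indices into that dictionary.
data = pa.array(['foo', 'bar', 'foo', 'bar'])
encoded = data.dictionary_encode()
print(encoded.indices)     # index values 0, 1, 0, 1
print(encoded.dictionary)  # dictionary values 'foo', 'bar'

# The same array can also be assembled directly from an index array plus a
# dictionary array, which is what a record batch carries once the dictionary
# has been transmitted as part of the schema.
indices = pa.array([0, 1, 0, 1], type=pa.int8())
dictionary = pa.array(['foo', 'bar'])
print(pa.DictionaryArray.from_arrays(indices, dictionary))
```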
+
+[1]: https://issues.apache.org/jira/issues/?jql=project%20%3D%20ARROW%20AND%20status%20in%20(Resolved%2C%20Closed)%20AND%20fixVersion%20%3D%200.5.0
+[2]: http://arrow.apache.org/install
+[3]: http://github.com/apache/parquet-cpp
+[4]: https://github.com/jemalloc/jemalloc
+[5]: http://arrow.apache.org/release/0.5.0.html
+[6]: http://mail-archives.apache.org/mod_mbox/arrow-dev/
+[7]: http://github.com/dask/dask
\ No newline at end of file
diff --git a/site/_posts/2017-07-26-spark-arrow.md b/site/_posts/2017-07-26-spark-arrow.md
new file mode 100644
index 00000000000..c4b16c0738c
--- /dev/null
+++ b/site/_posts/2017-07-26-spark-arrow.md
@@ -0,0 +1,158 @@
+---
+layout: post
+title: "Speeding up PySpark with Apache Arrow"
+date: "2017-07-26 08:00:00 -0800"
+author: BryanCutler
+categories: [application]
+---
+
+
+*[Bryan Cutler][11] is a software engineer at IBM's Spark Technology Center [STC][12]*
+
+Beginning with [Apache Spark][1] version 2.3, [Apache Arrow][2] will be a supported
+dependency and begin to offer increased performance with columnar data transfer.
+If you are a Spark user who prefers to work in Python and Pandas, this is cause
+for excitement! The initial work is limited to collecting a Spark DataFrame
+with `toPandas()`, which I will discuss below; however, there are many additional
+improvements that are currently [underway][3].
+
+# Optimizing Spark Conversion to Pandas
+
+The previous way of converting a Spark DataFrame to Pandas with `DataFrame.toPandas()`
+in PySpark was painfully inefficient. Basically, it worked by first collecting all
+rows to the Spark driver. Next, each row would get serialized into Python's pickle
+format and sent to a Python worker process. This child process unpickles each row into
+a huge list of tuples. Finally, a Pandas DataFrame is created from the list using
+`pandas.DataFrame.from_records()`.
+
+This all might seem like standard procedure, but it suffers from two glaring issues:
+1) even using cPickle, Python serialization is a slow process, and 2) creating
+a `pandas.DataFrame` using `from_records` must slowly iterate over the list of pure
+Python data and convert each value to Pandas format. See [here][4] for a detailed
+analysis.
+
+Here is where Arrow really shines to help optimize these steps: 1) once the data is
+in Arrow memory format, there is no need to serialize/pickle anymore, as Arrow data
+can be sent directly to the Python process, and 2) when the Arrow data is received in
+Python, pyarrow can utilize zero-copy methods to create a `pandas.DataFrame` from
+entire chunks of data at once instead of processing individual scalar values.
+Additionally, the conversion to Arrow data can be done on the JVM and pushed back for
+the Spark executors to perform in parallel, drastically reducing the load on the driver.
+
+As of the merging of [SPARK-13534][5], the use of Arrow when calling `toPandas()`
+needs to be enabled by setting the SQLConf "spark.sql.execution.arrow.enable" to
+"true". Let's look at a simple usage example.
+
+```
+Welcome to
+      ____              __
+     / __/__  ___ _____/ /__
+    _\ \/ _ \/ _ `/ __/  '_/
+   /__ / .__/\_,_/_/ /_/\_\   version 2.3.0-SNAPSHOT
+      /_/
+
+Using Python version 2.7.13 (default, Dec 20 2016 23:09:15)
+SparkSession available as 'spark'.
+
+In [1]: from pyspark.sql.functions import rand
+   ...: df = spark.range(1 << 22).toDF("id").withColumn("x", rand())
+   ...: df.printSchema()
+   ...:
+root
+ |-- id: long (nullable = false)
+ |-- x: double (nullable = false)
+
+
+In [2]: %time pdf = df.toPandas()
+CPU times: user 17.4 s, sys: 792 ms, total: 18.1 s
+Wall time: 20.7 s
+
+In [3]: spark.conf.set("spark.sql.execution.arrow.enable", "true")
+
+In [4]: %time pdf = df.toPandas()
+CPU times: user 40 ms, sys: 32 ms, total: 72 ms
+Wall time: 737 ms
+
+In [5]: pdf.describe()
+Out[5]:
+                 id             x
+count  4.194304e+06  4.194304e+06
+mean   2.097152e+06  4.998996e-01
+std    1.210791e+06  2.887247e-01
+min    0.000000e+00  8.291929e-07
+25%    1.048576e+06  2.498116e-01
+50%    2.097152e+06  4.999210e-01
+75%    3.145727e+06  7.498380e-01
+max    4.194303e+06  9.999996e-01
+```
+
+This example was run locally on my laptop using Spark defaults, so the times
+shown should not be taken as precise benchmarks. Even so, it is clear there is
+a huge performance boost: Arrow turned something that was excruciatingly slow
+into an operation that is barely noticeable.
+
+# Notes on Usage
+
+Here are some things to keep in mind before making use of this new feature. At
+the time of writing this, pyarrow will not be installed automatically with
+pyspark and needs to be installed manually; see the installation
+[instructions][6]. It is planned to add pyarrow as a pyspark dependency so that
+`> pip install pyspark` will also install pyarrow.
+
+Currently, the controlling SQLConf is disabled by default. It can be enabled
+programmatically as in the example above or by adding the line
+"spark.sql.execution.arrow.enable=true" to `SPARK_HOME/conf/spark-defaults.conf`.
+
+Also, not all Spark data types are currently supported; conversion is limited
+to primitive types. Expanded type support is in the works and is also expected
+to land in the Spark 2.3 release.
+
+# Future Improvements
+
+As mentioned, this was just a first step in using Arrow to make life easier for
+Spark Python users. A few exciting initiatives in the works are to allow for
+vectorized UDF evaluation ([SPARK-21190][7], [SPARK-21404][8]), and the ability
+to apply a function on grouped data using a Pandas DataFrame ([SPARK-20396][9]).
+Just as Arrow helped in converting a Spark DataFrame to Pandas, it can also work
+in the other direction when creating a Spark DataFrame from an existing Pandas
+DataFrame ([SPARK-20791][10]). Stay tuned for more!
+
+# Collaborators
+
+Reaching this first milestone was a group effort from both the Apache Arrow and
+Spark communities. Thanks to the hard work of [Wes McKinney][13], [Li Jin][14],
+[Holden Karau][15], Reynold Xin, Wenchen Fan, Shane Knapp, and many others who
+helped push this effort forward.
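To recap the usage notes above in one self-contained place, here is a minimal sketch of toggling the Arrow code path and timing `toPandas()`. It assumes a local `SparkSession` and uses the `spark.sql.execution.arrow.enable` key exactly as quoted in this post; released Spark versions may spell the key differently, so treat it as illustrative rather than definitive.

```python
import time

from pyspark.sql import SparkSession
from pyspark.sql.functions import rand

spark = SparkSession.builder.master("local[*]").getOrCreate()

# Roughly the same 4-million-row DataFrame used in the example above.
df = spark.range(1 << 22).toDF("id").withColumn("x", rand())


def timed_to_pandas(frame):
    start = time.time()
    pdf = frame.toPandas()
    return pdf, time.time() - start


# Default path: rows are pickled and converted to Pandas one value at a time.
_, plain_secs = timed_to_pandas(df)

# Arrow path: enable the SQLConf discussed in the post (key name as written
# here; it may differ in released Spark versions).
spark.conf.set("spark.sql.execution.arrow.enable", "true")
_, arrow_secs = timed_to_pandas(df)

print("toPandas() without Arrow: {:.1f}s, with Arrow: {:.1f}s"
      .format(plain_secs, arrow_secs))
```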
+ +[1]: https://spark.apache.org/ +[2]: https://arrow.apache.org/ +[3]: https://issues.apache.org/jira/issues/?filter=12335725&jql=project%20%3D%20SPARK%20AND%20status%20in%20(Open%2C%20%22In%20Progress%22%2C%20Reopened)%20AND%20text%20~%20%22arrow%22%20ORDER%20BY%20createdDate%20DESC +[4]: https://gist.github.com/wesm/0cb5531b1c2e346a0007 +[5]: https://issues.apache.org/jira/browse/SPARK-13534 +[6]: https://github.com/apache/arrow/blob/master/site/install.md +[7]: https://issues.apache.org/jira/browse/SPARK-21190 +[8]: https://issues.apache.org/jira/browse/SPARK-21404 +[9]: https://issues.apache.org/jira/browse/SPARK-20396 +[10]: https://issues.apache.org/jira/browse/SPARK-20791 +[11]: https://github.com/BryanCutler +[12]: http://www.spark.tc/ +[13]: https://github.com/wesm +[14]: https://github.com/icexelloss +[15]: https://github.com/holdenk diff --git a/site/_release/0.5.0.md b/site/_release/0.5.0.md new file mode 100644 index 00000000000..f28d86690f3 --- /dev/null +++ b/site/_release/0.5.0.md @@ -0,0 +1,203 @@ +--- +layout: default +title: Apache Arrow 0.5.0 Release +permalink: /release/0.5.0.html +--- + + +# Apache Arrow 0.5.0 (23 July 2017) + +This is a major release, with expanded features in the supported languages and +additional integration test coverage between Java and C++. + +Read more in the [release blog post][8]. + +## Download + +* [**Source Artifacts**][6] +* [Git tag][2] + +## Contributors + +```shell +$ git shortlog -sn apache-arrow-0.4.1..apache-arrow-0.5.0 + 42 Wes McKinney + 22 Uwe L. Korn + 12 Kouhei Sutou + 9 Max Risuhin + 9 Phillip Cloud + 6 Philipp Moritz + 5 Steven Phillips + 3 Julien Le Dem + 2 Bryan Cutler + 2 Kengo Seki + 2 Max Risukhin + 2 fjetter + 1 Antony Mayi + 1 Deepak Majeti + 1 Fang Zheng + 1 Hideo Hattori + 1 Holden Karau + 1 Itai Incze + 1 Jeff Knupp + 1 LynnYuan + 1 Mark Lavrynenko + 1 Michael König + 1 Robert Nishihara + 1 Sudheesh Katkam + 1 Zahari + 1 vkorukanti +``` + +# Changelog + +## New Features and Improvements + +* [ARROW-1041](https://issues.apache.org/jira/browse/ARROW-1041) - [Python] Support read_pandas on a directory of Parquet files +* [ARROW-1048](https://issues.apache.org/jira/browse/ARROW-1048) - Allow user LD_LIBRARY_PATH to be used with source release script +* [ARROW-1052](https://issues.apache.org/jira/browse/ARROW-1052) - Arrow 0.5.0 release +* [ARROW-1073](https://issues.apache.org/jira/browse/ARROW-1073) - C++: Adapative integer builder +* [ARROW-1095](https://issues.apache.org/jira/browse/ARROW-1095) - [Website] Add Arrow icon asset +* [ARROW-1100](https://issues.apache.org/jira/browse/ARROW-1100) - [Python] Add "mode" property to NativeFile instances +* [ARROW-1102](https://issues.apache.org/jira/browse/ARROW-1102) - Make MessageSerializer.serializeMessage() public +* [ARROW-111](https://issues.apache.org/jira/browse/ARROW-111) - [C++] Add static analyzer to tool chain to verify checking of Status returns +* [ARROW-1120](https://issues.apache.org/jira/browse/ARROW-1120) - [Python] Write support for int96 +* [ARROW-1122](https://issues.apache.org/jira/browse/ARROW-1122) - [Website] Guest blog post on Arrow + ODBC from turbodbc +* [ARROW-1123](https://issues.apache.org/jira/browse/ARROW-1123) - C++: Make jemalloc the default allocator +* [ARROW-1135](https://issues.apache.org/jira/browse/ARROW-1135) - Upgrade Travis CI clang builds to use LLVM 4.0 +* [ARROW-1137](https://issues.apache.org/jira/browse/ARROW-1137) - Python: Ensure Pandas roundtrip of all-None column +* 
[ARROW-1142](https://issues.apache.org/jira/browse/ARROW-1142) - [C++] Move over compression library toolchain from parquet-cpp +* [ARROW-1145](https://issues.apache.org/jira/browse/ARROW-1145) - [GLib] Add get_values() +* [ARROW-1146](https://issues.apache.org/jira/browse/ARROW-1146) - Add .gitignore for *_generated.h files in src/plasma/format +* [ARROW-1148](https://issues.apache.org/jira/browse/ARROW-1148) - [C++] Raise minimum CMake version to 3.2 +* [ARROW-1151](https://issues.apache.org/jira/browse/ARROW-1151) - [C++] Add gcc branch prediction to status check macro +* [ARROW-1154](https://issues.apache.org/jira/browse/ARROW-1154) - [C++] Migrate more computational utility code from parquet-cpp +* [ARROW-1160](https://issues.apache.org/jira/browse/ARROW-1160) - C++: Implement DictionaryBuilder +* [ARROW-1165](https://issues.apache.org/jira/browse/ARROW-1165) - [C++] Refactor PythonDecimalToArrowDecimal to not use templates +* [ARROW-1172](https://issues.apache.org/jira/browse/ARROW-1172) - [C++] Use unique_ptr with array builder classes +* [ARROW-1183](https://issues.apache.org/jira/browse/ARROW-1183) - [Python] Implement time type conversions in to_pandas +* [ARROW-1185](https://issues.apache.org/jira/browse/ARROW-1185) - [C++] Clean up arrow::Status implementation, add warn_unused_result attribute for clang +* [ARROW-1187](https://issues.apache.org/jira/browse/ARROW-1187) - Serialize a DataFrame with None column +* [ARROW-1193](https://issues.apache.org/jira/browse/ARROW-1193) - [C++] Support pkg-config forarrow_python.so +* [ARROW-1196](https://issues.apache.org/jira/browse/ARROW-1196) - [C++] Appveyor separate jobs for Debug/Release builds from sources; Build with conda toolchain; Build with NMake Makefiles Generator +* [ARROW-1198](https://issues.apache.org/jira/browse/ARROW-1198) - Python: Add public C++ API to unwrap PyArrow object +* [ARROW-1199](https://issues.apache.org/jira/browse/ARROW-1199) - [C++] Introduce mutable POD struct for generic array data +* [ARROW-1202](https://issues.apache.org/jira/browse/ARROW-1202) - Remove semicolons from status macros +* [ARROW-1212](https://issues.apache.org/jira/browse/ARROW-1212) - [GLib] Add garrow_binary_array_get_offsets_buffer() +* [ARROW-1214](https://issues.apache.org/jira/browse/ARROW-1214) - [Python] Add classes / functions to enable stream message components to be handled outside of the stream reader class +* [ARROW-1217](https://issues.apache.org/jira/browse/ARROW-1217) - [GLib] Add GInputStream based arrow::io::RandomAccessFile +* [ARROW-1220](https://issues.apache.org/jira/browse/ARROW-1220) - [C++] Standartize usage of *_HOME cmake script variables for 3rd party libs +* [ARROW-1221](https://issues.apache.org/jira/browse/ARROW-1221) - [C++] Pin clang-format version +* [ARROW-1227](https://issues.apache.org/jira/browse/ARROW-1227) - [GLib] Support GOutputStream +* [ARROW-1228](https://issues.apache.org/jira/browse/ARROW-1228) - [GLib] Test file name should be the same name as target class +* [ARROW-1229](https://issues.apache.org/jira/browse/ARROW-1229) - [GLib] Follow Reader API change (get -> read) +* [ARROW-1233](https://issues.apache.org/jira/browse/ARROW-1233) - [C++] Validate cmake script resolving of 3rd party linked libs from correct location in toolchain build +* [ARROW-460](https://issues.apache.org/jira/browse/ARROW-460) - [C++] Implement JSON round trip for DictionaryArray +* [ARROW-462](https://issues.apache.org/jira/browse/ARROW-462) - [C++] Implement in-memory conversions between non-nested primitive types 
and DictionaryArray equivalent +* [ARROW-575](https://issues.apache.org/jira/browse/ARROW-575) - Python: Auto-detect nested lists and nested numpy arrays in Pandas +* [ARROW-597](https://issues.apache.org/jira/browse/ARROW-597) - [Python] Add convenience function to yield DataFrame from any object that a StreamReader or FileReader can read from +* [ARROW-599](https://issues.apache.org/jira/browse/ARROW-599) - [C++] Add LZ4 codec to 3rd-party toolchain +* [ARROW-600](https://issues.apache.org/jira/browse/ARROW-600) - [C++] Add ZSTD codec to 3rd-party toolchain +* [ARROW-692](https://issues.apache.org/jira/browse/ARROW-692) - Java<->C++ Integration tests for dictionary-encoded vectors +* [ARROW-693](https://issues.apache.org/jira/browse/ARROW-693) - [Java] Add JSON support for dictionary vectors +* [ARROW-742](https://issues.apache.org/jira/browse/ARROW-742) - Handling exceptions during execution of std::wstring_convert +* [ARROW-834](https://issues.apache.org/jira/browse/ARROW-834) - [Python] Support creating Arrow arrays from Python iterables +* [ARROW-915](https://issues.apache.org/jira/browse/ARROW-915) - Struct Array reads limited support +* [ARROW-935](https://issues.apache.org/jira/browse/ARROW-935) - [Java] Build Javadoc in Travis CI +* [ARROW-960](https://issues.apache.org/jira/browse/ARROW-960) - [Python] Add source build guide for macOS + Homebrew +* [ARROW-962](https://issues.apache.org/jira/browse/ARROW-962) - [Python] Add schema attribute to FileReader +* [ARROW-966](https://issues.apache.org/jira/browse/ARROW-966) - [Python] pyarrow.list_ should also accept Field instance +* [ARROW-978](https://issues.apache.org/jira/browse/ARROW-978) - [Python] Use sphinx-bootstrap-theme for Sphinx documentation + +## Bug Fixes + +* [ARROW-1074](https://issues.apache.org/jira/browse/ARROW-1074) - from_pandas doesnt convert ndarray to list +* [ARROW-1079](https://issues.apache.org/jira/browse/ARROW-1079) - [Python] Empty "private" directories should be ignored by Parquet interface +* [ARROW-1081](https://issues.apache.org/jira/browse/ARROW-1081) - C++: arrow::test::TestBase::MakePrimitive doesn't fill null_bitmap +* [ARROW-1096](https://issues.apache.org/jira/browse/ARROW-1096) - [C++] Memory mapping file over 4GB fails on Windows +* [ARROW-1097](https://issues.apache.org/jira/browse/ARROW-1097) - Reading tensor needs file to be opened in writeable mode +* [ARROW-1098](https://issues.apache.org/jira/browse/ARROW-1098) - Document Error? 
+* [ARROW-1101](https://issues.apache.org/jira/browse/ARROW-1101) - UnionListWriter is not implementing all methods on interface ScalarWriter +* [ARROW-1103](https://issues.apache.org/jira/browse/ARROW-1103) - [Python] Utilize pandas metadata from common _metadata Parquet file if it exists +* [ARROW-1107](https://issues.apache.org/jira/browse/ARROW-1107) - [JAVA] NullableMapVector getField() should return nullable type +* [ARROW-1108](https://issues.apache.org/jira/browse/ARROW-1108) - Check if ArrowBuf is empty buffer in getActualConsumedMemory() and getPossibleConsumedMemory() +* [ARROW-1109](https://issues.apache.org/jira/browse/ARROW-1109) - [JAVA] transferOwnership fails when readerIndex is not 0 +* [ARROW-1110](https://issues.apache.org/jira/browse/ARROW-1110) - [JAVA] make union vector naming consistent +* [ARROW-1111](https://issues.apache.org/jira/browse/ARROW-1111) - [JAVA] Make aligning buffers optional, and allow -1 for unknown null count +* [ARROW-1112](https://issues.apache.org/jira/browse/ARROW-1112) - [JAVA] Set lastSet for VarLength and List vectors when loading +* [ARROW-1113](https://issues.apache.org/jira/browse/ARROW-1113) - [C++] gflags EP build gets triggered (as a no-op) on subsequent calls to make or ninja build +* [ARROW-1115](https://issues.apache.org/jira/browse/ARROW-1115) - [C++] Use absolute path for ccache +* [ARROW-1117](https://issues.apache.org/jira/browse/ARROW-1117) - [Docs] Minor issues in GLib README +* [ARROW-1124](https://issues.apache.org/jira/browse/ARROW-1124) - [Python] pyarrow needs to depend on numpy>=1.10 (not 1.9) +* [ARROW-1125](https://issues.apache.org/jira/browse/ARROW-1125) - Python: Table.from_pandas doesn't work anymore on partial schemas +* [ARROW-1128](https://issues.apache.org/jira/browse/ARROW-1128) - [Docs] command to build a wheel is not properly rendered +* [ARROW-1129](https://issues.apache.org/jira/browse/ARROW-1129) - [C++] Fix Linux toolchain build regression from ARROW-742 +* [ARROW-1131](https://issues.apache.org/jira/browse/ARROW-1131) - Python: Parquet unit tests are always skipped +* [ARROW-1132](https://issues.apache.org/jira/browse/ARROW-1132) - [Python] Unable to write pandas DataFrame w/MultiIndex containing duplicate values to parquet +* [ARROW-1136](https://issues.apache.org/jira/browse/ARROW-1136) - [C++/Python] Segfault on empty stream +* [ARROW-1138](https://issues.apache.org/jira/browse/ARROW-1138) - Travis: Use OpenJDK7 instead of OracleJDK7 +* [ARROW-1139](https://issues.apache.org/jira/browse/ARROW-1139) - [C++] dlmalloc doesn't allow arrow to be built with clang 4 or gcc 7.1.1 +* [ARROW-1141](https://issues.apache.org/jira/browse/ARROW-1141) - on import get libjemalloc.so.2: cannot allocate memory in static TLS block +* [ARROW-1143](https://issues.apache.org/jira/browse/ARROW-1143) - C++: Fix comparison of NullArray +* [ARROW-1144](https://issues.apache.org/jira/browse/ARROW-1144) - [C++] Remove unused variable +* [ARROW-1147](https://issues.apache.org/jira/browse/ARROW-1147) - [C++] Allow optional vendoring of flatbuffers in plasma +* [ARROW-1150](https://issues.apache.org/jira/browse/ARROW-1150) - [C++] AdaptiveIntBuilder compiler warning on MSVC +* [ARROW-1152](https://issues.apache.org/jira/browse/ARROW-1152) - [Cython] read_tensor should work with a readable file +* [ARROW-1155](https://issues.apache.org/jira/browse/ARROW-1155) - segmentation fault when run pa.Int16Value() +* [ARROW-1157](https://issues.apache.org/jira/browse/ARROW-1157) - C++/Python: Decimal templates are not correctly exported on 
OSX +* [ARROW-1159](https://issues.apache.org/jira/browse/ARROW-1159) - [C++] Static data members cannot be accessed from inline functions in Arrow headers by thirdparty users +* [ARROW-1162](https://issues.apache.org/jira/browse/ARROW-1162) - Transfer Between Empty Lists Should Not Invoke Callback +* [ARROW-1166](https://issues.apache.org/jira/browse/ARROW-1166) - Errors in Struct type's example and missing reference in Layout.md +* [ARROW-1167](https://issues.apache.org/jira/browse/ARROW-1167) - [Python] Create chunked BinaryArray in Table.from_pandas when a column's data exceeds 2GB +* [ARROW-1168](https://issues.apache.org/jira/browse/ARROW-1168) - [Python] pandas metadata may contain "mixed" data types +* [ARROW-1169](https://issues.apache.org/jira/browse/ARROW-1169) - C++: jemalloc externalproject doesn't build with CMake's ninja generator +* [ARROW-1170](https://issues.apache.org/jira/browse/ARROW-1170) - C++: ARROW_JEMALLOC=OFF breaks linking on unittest +* [ARROW-1174](https://issues.apache.org/jira/browse/ARROW-1174) - [GLib] Investigate root cause of ListArray glib test failure +* [ARROW-1177](https://issues.apache.org/jira/browse/ARROW-1177) - [C++] Detect int32 overflow in ListBuilder::Append +* [ARROW-1179](https://issues.apache.org/jira/browse/ARROW-1179) - C++: Add missing virtual destructors +* [ARROW-1180](https://issues.apache.org/jira/browse/ARROW-1180) - [GLib] garrow_tensor_get_dimension_name() returns invalid address +* [ARROW-1181](https://issues.apache.org/jira/browse/ARROW-1181) - [Python] Parquet test fail if not enabled +* [ARROW-1182](https://issues.apache.org/jira/browse/ARROW-1182) - C++: Specify BUILD_BYPRODUCTS for zlib and zstd +* [ARROW-1186](https://issues.apache.org/jira/browse/ARROW-1186) - [C++] Enable option to build arrow with minimal dependencies needed to build Parquet library +* [ARROW-1188](https://issues.apache.org/jira/browse/ARROW-1188) - Segfault when trying to serialize a DataFrame with Null-only Categorical Column +* [ARROW-1190](https://issues.apache.org/jira/browse/ARROW-1190) - VectorLoader corrupts vectors with duplicate names +* [ARROW-1191](https://issues.apache.org/jira/browse/ARROW-1191) - [JAVA] Implement getField() method for the complex readers +* [ARROW-1194](https://issues.apache.org/jira/browse/ARROW-1194) - Getting record batch size with pa.get_record_batch_size returns a size that is too small for pandas DataFrame. +* [ARROW-1197](https://issues.apache.org/jira/browse/ARROW-1197) - [GLib] record_batch.hpp Inclusion is missing +* [ARROW-1200](https://issues.apache.org/jira/browse/ARROW-1200) - [C++] DictionaryBuilder should use signed integers for indices +* [ARROW-1201](https://issues.apache.org/jira/browse/ARROW-1201) - [Python] Incomplete Python types cause a core dump when repr-ing +* [ARROW-1203](https://issues.apache.org/jira/browse/ARROW-1203) - [C++] Disallow BinaryBuilder to append byte strings larger than the maximum value of int32_t +* [ARROW-1205](https://issues.apache.org/jira/browse/ARROW-1205) - C++: Reference to type objects in ArrayLoader may cause segmentation faults. 
+* [ARROW-1206](https://issues.apache.org/jira/browse/ARROW-1206) - [C++] Enable MSVC builds to work with some compression library support disabled +* [ARROW-1208](https://issues.apache.org/jira/browse/ARROW-1208) - [C++] Toolchain build with ZSTD library from conda-forge failure +* [ARROW-1215](https://issues.apache.org/jira/browse/ARROW-1215) - [Python] Class methods in API reference +* [ARROW-1216](https://issues.apache.org/jira/browse/ARROW-1216) - Numpy arrays cannot be created from Arrow Buffers on Python 2 +* [ARROW-1218](https://issues.apache.org/jira/browse/ARROW-1218) - Arrow doesn't compile if all compression libraries are deactivated +* [ARROW-1222](https://issues.apache.org/jira/browse/ARROW-1222) - [Python] pyarrow.array returns NullArray for array of unsupported Python objects +* [ARROW-1223](https://issues.apache.org/jira/browse/ARROW-1223) - [GLib] Fix function name that returns wrapped object +* [ARROW-1235](https://issues.apache.org/jira/browse/ARROW-1235) - [C++] macOS linker failure with operator<< and std::ostream +* [ARROW-1236](https://issues.apache.org/jira/browse/ARROW-1236) - Library paths in exported pkg-config file are incorrect +* [ARROW-601](https://issues.apache.org/jira/browse/ARROW-601) - Some logical types not supported when loading Parquet +* [ARROW-784](https://issues.apache.org/jira/browse/ARROW-784) - Cleaning up thirdparty toolchain support in Arrow on Windows +* [ARROW-992](https://issues.apache.org/jira/browse/ARROW-992) - [Python] In place development builds do not have a __version__ + +[2]: https://github.com/apache/arrow/releases/tag/apache-arrow-0.5.0 +[6]: https://www.apache.org/dyn/closer.cgi/arrow/arrow-0.5.0/ +[8]: http://arrow.apache.org/blog/2017/07/25/0.5.0-release/ diff --git a/site/_release/index.md b/site/_release/index.md index 2dd65797622..f18cff3b649 100644 --- a/site/_release/index.md +++ b/site/_release/index.md @@ -26,6 +26,7 @@ limitations under the License. Navigate to the release page for downloads and the changelog. +* [0.5.0 (23 July 2017)][6] * [0.4.1 (9 June 2017)][5] * [0.4.0 (22 May 2017)][4] * [0.3.0 (5 May 2017)][1] @@ -37,3 +38,4 @@ Navigate to the release page for downloads and the changelog. [3]: {{ site.baseurl }}/release/0.1.0.html [4]: {{ site.baseurl }}/release/0.4.0.html [5]: {{ site.baseurl }}/release/0.4.1.html +[6]: {{ site.baseurl }}/release/0.5.0.html diff --git a/site/index.html b/site/index.html index 5b60a5fc3e2..8a06c6acec5 100644 --- a/site/index.html +++ b/site/index.html @@ -7,14 +7,18 @@

    Apache Arrow

    Powering Columnar In-Memory Analytics

    Join Mailing List
-   Install (0.4.1 Release - June 9, 2017)
+   Install (0.5.0 Release - July 23, 2017)

-   Latest News: Apache Arrow 0.4.1 release
+   Latest News: Apache Arrow 0.5.0 release

    Fast

    Apache Arrow™ enables execution engines to take advantage of the latest SIMD (single instruction, multiple data) operations included in modern processors, for native vectorized optimization of analytical data processing. Columnar layout of data also allows for a better use of CPU caches by placing all data relevant to a column operation in as compact a format as possible.

+   The Arrow memory format supports zero-copy reads for lightning-fast data access without serialization overhead.
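To make the zero-copy claim concrete, here is a minimal sketch in Python. It assumes a recent pyarrow API (names differed slightly around the 0.5.0 release) and a hypothetical `example_batches.arrow` file written in the Arrow IPC file format; the point is that the resulting table's column buffers are views onto the memory-mapped file rather than copies.

```python
import pyarrow as pa

# "example_batches.arrow" is a hypothetical file written earlier in the
# Arrow IPC file format.
with pa.memory_map("example_batches.arrow", "r") as source:
    reader = pa.ipc.open_file(source)    # reads only footer metadata
    table = reader.read_all()            # column buffers are views onto the map
    print(table.num_rows, table.schema)  # no copying or deserialization happened
```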

    Flexible

@@ -26,12 +30,6 @@

    Standard

-   Zero-Copy IPC and Streaming Messaging
-
-   Apache Arrow supports zero-copy shared memory IPC and a streaming wire format that fully avoids traditional data serialization costs
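The blurb removed above refers to Arrow's streaming wire format. As context, a minimal sketch of a round trip through that format, assuming a recent pyarrow API and using an in-memory buffer as a stand-in for a socket or shared memory segment:

```python
import pyarrow as pa

# Build a small record batch and round-trip it through the streaming format.
batch = pa.record_batch({"id": [1, 2, 3], "name": ["a", "b", "c"]})

sink = pa.BufferOutputStream()                   # stand-in for a socket or file
with pa.ipc.new_stream(sink, batch.schema) as writer:
    writer.write_batch(batch)

buf = sink.getvalue()                            # the serialized stream
reader = pa.ipc.open_stream(buf)
for received in reader:                          # yields record batches in order
    print(received.num_rows, received.schema)
```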

    Performance Advantage of Columnar In-Memory

    SIMD

diff --git a/site/install.md b/site/install.md
index 4252e7f4bf9..bd45642fe20 100644
--- a/site/install.md
+++ b/site/install.md
@@ -20,36 +20,40 @@ limitations under the License.
{% endcomment %}
-->

-## Current Version: 0.4.1
+## Current Version: 0.5.0

-### Released: 9 June 2017
+### Released: 23 July 2017

See the [release notes][10] and [blog post][11] for more about what's new.

### Source release

-* **Source Release**: [apache-arrow-0.4.1.tar.gz][6]
+* **Source Release**: [apache-arrow-0.5.0.tar.gz][6]
* **Verification**: [md5][3], [asc][7]
-* [Git tag 46315431][2]
+* [Git tag e9f76e1][2]

### Java Packages

[Java Artifacts on Maven Central][4]

+## Binary Installers for C, C++, Python
+
+It may take a little time for the binary packages to get updated.
+
### C++ and Python Conda Packages (Unofficial)

We have provided binary conda packages on [conda-forge][5] for the following platforms:

-* Linux and OS X (Python 2.7, 3.5, and 3.6)
+* Linux and macOS (Python 2.7, 3.5, and 3.6)
* Windows (Python 3.5 and 3.6)

Install them with:

```shell
-conda install arrow-cpp -c conda-forge
-conda install pyarrow -c conda-forge
+conda install arrow-cpp=0.5.0 -c conda-forge
+conda install pyarrow=0.5.0 -c conda-forge
```

### Python Wheels on PyPI (Unofficial)

@@ -57,7 +61,7 @@ conda install pyarrow -c conda-forge
We have provided binary wheels on PyPI for Linux, macOS, and Windows:

```shell
-pip install pyarrow
+pip install pyarrow==0.5.0
```

These include the Apache Arrow and Apache Parquet C++ binary libraries bundled

@@ -129,14 +133,14 @@ These repositories are managed at [red-data-tools/arrow-packages][9].
If you have any feedback, please send it to the project instead of Apache Arrow project.

-[1]: https://www.apache.org/dyn/closer.cgi/arrow/arrow-0.4.1/
-[2]: https://github.com/apache/arrow/releases/tag/apache-arrow-0.4.1
-[3]: https://www.apache.org/dyn/closer.cgi/arrow/arrow-0.4.1/apache-arrow-0.4.1.tar.gz.md5
-[4]: http://search.maven.org/#search%7Cga%7C1%7Cg%3A%22org.apache.arrow%22%20AND%20v%3A%220.4.1%22
+[1]: https://www.apache.org/dyn/closer.cgi/arrow/arrow-0.5.0/
+[2]: https://github.com/apache/arrow/releases/tag/apache-arrow-0.5.0
+[3]: https://www.apache.org/dyn/closer.cgi/arrow/arrow-0.5.0/apache-arrow-0.5.0.tar.gz.md5
+[4]: http://search.maven.org/#search%7Cga%7C1%7Cg%3A%22org.apache.arrow%22%20AND%20v%3A%220.5.0%22
[5]: http://conda-forge.github.io
-[6]: https://www.apache.org/dyn/closer.cgi/arrow/arrow-0.4.1/apache-arrow-0.4.1.tar.gz
-[7]: https://www.apache.org/dyn/closer.cgi/arrow/arrow-0.4.1/apache-arrow-0.4.1.tar.gz.asc
+[6]: https://www.apache.org/dyn/closer.cgi/arrow/arrow-0.5.0/apache-arrow-0.5.0.tar.gz
+[7]: https://www.apache.org/dyn/closer.cgi/arrow/arrow-0.5.0/apache-arrow-0.5.0.tar.gz.asc
[8]: https://github.com/red-data-tools/parquet-glib
[9]: https://github.com/red-data-tools/arrow-packages
-[10]: http://arrow.apache.org/release/0.4.1.html
-[11]: http://arrow.apache.org/blog/2017/06/14/0.4.1-release/
\ No newline at end of file
+[10]: http://arrow.apache.org/release/0.5.0.html
+[11]: http://arrow.apache.org/blog/2017/07/25/0.5.0-release/
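As a quick sanity check after either install path above, the version reported by the library should match the release. A small sketch, assuming the import picks up the package just installed:

```python
# Post-install sanity check: the reported version should match the
# version pins used in the commands above.
import pyarrow as pa

print(pa.__version__)   # expected: 0.5.0
```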