
Conversation

@kszucs (Member) commented Aug 30, 2019

No description provided.

@emkornfield (Contributor)

@kszucs is this still WIP (I assume so, based on the CI builds)?

@kszucs (Member Author) commented Oct 24, 2019

@emkornfield yes, I'll continue working on it after the 0.15.1 release. Theoretically the Dataset API should now be ready for bindings.

@kszucs force-pushed the ARROW-6341 branch 4 times, most recently from 694087f to 76abfe1 on November 21, 2019
@kszucs marked this pull request as ready for review on November 28, 2019
@jorisvandenbossche (Member)

Something I ran into yesterday: trying to access the partition_scheme attribute of the discovery segfaults:

from pyarrow.dataset import FileSystemDataSourceDiscovery, ParquetFileFormat
from pyarrow.fs import Selector, LocalFileSystem

fs = LocalFileSystem()
selector = Selector('test_dataset/', recursive=True)
parquet_format = ParquetFileFormat()
discovery = FileSystemDataSourceDiscovery(fs, selector, parquet_format)
discovery.partition_scheme  # <- this attribute access segfaults

where "test_dataset" is a simple directory with a single small parquet file in it.
(the actual tests you wrote do pass though)
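(For a self-contained repro, such a directory can be created with a few lines; the file name here is arbitrary.)

import os
import pyarrow as pa
import pyarrow.parquet as pq

# Create the one-file directory the snippet above points at.
os.makedirs("test_dataset", exist_ok=True)
pq.write_table(pa.table({"a": [1, 2, 3]}), "test_dataset/data.parquet")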

@kszucs (Member Author) commented Nov 29, 2019

@jorisvandenbossche I've fixed that.

Contributor

You need to add an implicit cast like in R for this to be bearable. Otherwise you get annoying errors:

In [41]: cond = ds.ComparisonExpression(ds.CompareOperator.Greater, ds.FieldExpression("total_amount"), ds.ScalarExpression(1000.0))                           

In [42]: scanner_builder.filter(cond)                                          
---------------------------------------------------------------------------
ArrowTypeError                            Traceback (most recent call last)
<ipython-input-42-bb6fba558cf8> in <module>
----> 1 scanner_builder.filter(cond)

~/src/db/arrow/python/pyarrow/_dataset.pyx in pyarrow._dataset.ScannerBuilder.filter()
    951         self : ScannerBuilder
    952         """
--> 953         check_status(self.builder.Filter(filter_expression.unwrap()))
    954         return self
    955 

~/src/db/arrow/python/pyarrow/error.pxi in pyarrow.lib.check_status()
     86             raise ArrowNotImplementedError(message)
     87         elif status.IsTypeError():
---> 88             raise ArrowTypeError(message)
     89         elif status.IsCapacityError():
     90             raise ArrowCapacityError(message)

ArrowTypeError: cannot compare expressions of differing type, float vs double

Member Author

Well, I didn't want to use any implicit behaviour in the first iteration. Perhaps this should be done by the C++ filter method?

Member Author

We should handle it nicely in a follow-up PR.
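(For illustration, a minimal Python-side sketch of such an implicit cast; coerce_scalar is a hypothetical helper, not part of this PR, and as noted above the real fix would more likely live in the C++ Filter method.)

import pyarrow as pa

def coerce_scalar(schema, field_name, value):
    # Round-trip the raw Python value through a one-element Arrow array
    # to apply Arrow's cast rules, so a float32 column can be compared
    # against 1000.0 (which Python otherwise infers as double).
    typ = schema.field(field_name).type
    return pa.array([value]).cast(typ)[0]

schema = pa.schema([("total_amount", pa.float32())])
coerce_scalar(schema, "total_amount", 1000.0)  # -> a float32 scalar, not double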

Member

It's not fully clear to me what this last sentence means exactly.
Does it mean this schema will be used to filter/project the output of the data sources, or that the other projections/filters specified when scanning should result in something that matches this schema?

Member Author

Well, it was not clear to me either; it is copied from the C++ API docs. In the case of projections this schema seems to be omitted? cc @fsaintjacques

@kszucs (Member Author) commented Dec 7, 2019

Hmm, GitHub doesn't let me request @jorisvandenbossche's review, so here it is :)

Member Author

We should have a JIRA about introducing InMemoryDataSource; if I recall correctly, we already have one?

@kszucs (Member Author) commented Dec 7, 2019

I'm not planning to implement any new features here, although we should discuss the possible follow-up PRs. A couple of candidates:

  • We need to refactor the scalar handling on the Python side; I have a WIP patch for it.
  • We need to define a more Pythonic API for the dataset bindings, because the current one is pretty low-level.
  • We need to improve the tests.
  • We need to improve the API docs.
  • We should have hypothesis strategies and tests for the datasets.
  • We should exercise the unit tests from the current parquet dataset implementation against the new one.
  • We should develop a shim over the datasets API to make the transition from the previous parquet dataset implementation smoother.

@jorisvandenbossche (Member) left a comment

Quickly tested some of the code snippets I wrote last week again, and that's all still working nicely.
One thing I noticed is that due to the removal of DataFragment and FileSource, you can no longer check the file paths in a Dataset/FileSystemDataSource. Maybe something to consider later, if / how we want to expose something like that.

There are still several classes and methods/properties that need docstrings, but it would be fine for me to do that as a follow-up if that means we can merge this faster.

@jorisvandenbossche (Member)

I'm not planning to implement any new features here, although we should discuss the possible follow-up PRs. A couple of candidates:

That sounds good to me.

We need to define a more Pythonic API for the dataset bindings, because the current one is pretty low-level.

Yes, something like the open_dataset from my notebook (https://nbviewer.jupyter.org/gist/jorisvandenbossche/73f7c8d0921a79b461c0a4928fbdc7fa) can probably be part of this (which was modelled after the existing R methods).

We should develop a shim over the datasets API to make the transition from the previous parquet dataset implementation smoother.

And we also still need to discuss to what extent we want to keep the existing parquet dataset implementation, deprecate it, or implement it partly using the new machinery (e.g. if we want to support dask's usage, we need to keep it).
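(Roughly the shape the follow-up in #6022 ended up taking, sketched from the notebook; the to_table call on the result is illustrative.)

import pyarrow.dataset as ds

# One call that discovers the files, infers the schema, and applies
# hive-style partition discovery:
dataset = ds.open_dataset("test_dataset/", partitioning="hive")
table = dataset.to_table()  # materialize everything into a pyarrow.Table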

@bkietz (Member) left a comment

This is looking good, thanks for doing this!

I think most of your follow-up candidates can wait for a different PR, but the unit tests need some work before this can be merged.

@pitrou (Member) left a comment

Just some comments from skimming over this.

Member

It's not obvious to me why partition_expression should be equal to the source_partition constructor argument. Can you explain?

Member

That argument is used to set the partition_expression property, so they should be equal.

Member

Need to add docstrings for all the public classes here.

Member

Should also add docstrings for public methods.

@wesm (Member) left a comment

In my quick read I didn't see anything too unreasonable. I left a handful of comments.

Member

Do you want to check that the memory pool you pass in is passed on correctly to the ScanContext? I think you can use logging_memory_pool to check.
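(A minimal sketch of that check; the scanner wiring is elided since it depends on this PR's constructor shape.)

import pyarrow as pa

# Wrap the default pool so every allocation through it is logged to
# stderr; if the pool is threaded through to the ScanContext correctly,
# a scan should produce log lines and grow bytes_allocated() here.
pool = pa.logging_memory_pool(pa.default_memory_pool())
# ... build the scanner with `pool` as its memory pool, run a scan, then:
pool.bytes_allocated()  # nonzero once the scan allocated through this pool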

Member

You might consider doing dispatch with a dict instead (a sketch follows the snippet below).

self.init(shared_ptr[CFileFormat](new CParquetFileFormat()))
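(In plain Python, the suggested dict dispatch could look like this; the commented-out extra entry is hypothetical.)

from pyarrow.dataset import ParquetFileFormat

# Map format names to constructors once, instead of an if/elif chain.
_FORMATS = {
    "parquet": ParquetFileFormat,
    # "ipc": IpcFileFormat,  # further formats would slot in here
}

def make_file_format(name):
    try:
        return _FORMATS[name]()
    except KeyError:
        raise ValueError("unsupported file format: {!r}".format(name))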


cdef class PartitionScheme:
Member

Is it necessary to expose the partition scheme as a class at all? I think it would suffice to have factories like make_hive_partition_scheme().

Member

Is it necessary to expose the partition scheme as a class at all? I think it would suffice to have factories like make_hive_partition_scheme().

What would a make_hive_partition_scheme() then return if the partition scheme object itself is not exposed?

Member Author

ds.partition_scheme(pa.Schema schema, string flavor='hive') -> ds.PartitionScheme

Although I'd prefer not to have opaque return types, so I can distinguish between:

ds.partition_scheme(schema)
ds.partition_scheme(schema, flavor='hive')

If we return the same class, then I can't inspect how they differ.
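(A toy sketch of that shape in plain Python, with stand-ins for the actual Cython classes, showing how distinct return types stay inspectable.)

import pyarrow as pa

class PartitionScheme:
    def __init__(self, schema):
        self.schema = schema

class SchemaPartitionScheme(PartitionScheme):
    """Directory names map to schema fields positionally."""

class HivePartitionScheme(PartitionScheme):
    """Directory names are key=value pairs."""

def partition_scheme(schema, flavor=None):
    # The flavor argument selects which concrete subclass to construct.
    if flavor is None:
        return SchemaPartitionScheme(schema)
    if flavor == 'hive':
        return HivePartitionScheme(schema)
    raise ValueError("unknown flavor: {!r}".format(flavor))

scheme = partition_scheme(pa.schema([('year', pa.int32())]), flavor='hive')
isinstance(scheme, HivePartitionScheme)  # True; the flavors stay distinguishable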

@bkietz (Member) left a comment

LGTM.

Follow-up JIRA for winnowing the public classes: https://issues.apache.org/jira/browse/ARROW-7391

@bkietz closed this in 9cb49f3 on Dec 13, 2019
nealrichardson pushed a commit that referenced this pull request Jan 18, 2020
Follow-up on #5237 adding a higher-level API for datasets

Closes #6022 from jorisvandenbossche/dataset-python and squashes the following commits:

745c218 <Joris Van den Bossche> rename keyword to partitioning + refactor tests + more coverage
8e03282 <Joris Van den Bossche> update for big renaming + doc updates
9c95938 <Joris Van den Bossche> Use FileSystem.from_uri
ac0d83d <Joris Van den Bossche> split into source / dataset functions
866f72c <Joris Van den Bossche> Add single partitioning() function from kszucs + tests
7481fb6 <Joris Van den Bossche> fix import for python 2
d59595d <Joris Van den Bossche> add partition scheme creation functions
260b737 <Joris Van den Bossche> add support for Pathlib
5e00c87 <Joris Van den Bossche> fix with new partition discovery option
757fe80 <Joris Van den Bossche>  Add higher level open_dataset function

Authored-by: Joris Van den Bossche <[email protected]>
Signed-off-by: Neal Richardson <[email protected]>