
Conversation

@kszucs (Member) commented Aug 30, 2019

No description provided.

@emkornfield (Contributor)

@kszucs is this still WIP (I assume so, based on the CI builds)?

@kszucs (Member Author) commented Oct 24, 2019

@emkornfield yes, I'll continue working on it after the 0.15.1 release. Theoretically the Dataset API should now be ready for bindings.

@kszucs force-pushed the ARROW-6341 branch 4 times, most recently from 694087f to 76abfe1 on November 21, 2019
@kszucs marked this pull request as ready for review on November 28, 2019
@jorisvandenbossche (Member)

Something I ran into yesterday: trying to access the partition_scheme attribute of the discovery segfaults:

from pyarrow.dataset import FileSystemDataSourceDiscovery, ParquetFileFormat
from pyarrow.fs import Selector, LocalFileSystem

fs = LocalFileSystem()
selector = Selector('test_dataset/', recursive=True)
parquet_format = ParquetFileFormat()
discovery = FileSystemDataSourceDiscovery(fs, selector, parquet_format)
discovery.partition_scheme  # <- this attribute access segfaults

where "test_dataset" is a simple directory with a single small parquet file in it.
(the actual tests you wrote do pass though)
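(For a self-contained repro, such a directory can be created with a few lines; the file name here is arbitrary.)

import os
import pyarrow as pa
import pyarrow.parquet as pq

# Create the one-file directory the snippet above points at.
os.makedirs("test_dataset", exist_ok=True)
pq.write_table(pa.table({"a": [1, 2, 3]}), "test_dataset/data.parquet")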

@kszucs (Member Author) commented Nov 29, 2019

@jorisvandenbossche I've fixed that.

Contributor

You need to add an implicit cast like in R for this to be bearable. Otherwise you get annoying errors:

In [41]: cond = ds.ComparisonExpression(ds.CompareOperator.Greater, ds.FieldExpression("total_amount"), ds.ScalarExpression(1000.0))                           

In [42]: scanner_builder.filter(cond)                                          
---------------------------------------------------------------------------
ArrowTypeError                            Traceback (most recent call last)
<ipython-input-42-bb6fba558cf8> in <module>
----> 1 scanner_builder.filter(cond)

~/src/db/arrow/python/pyarrow/_dataset.pyx in pyarrow._dataset.ScannerBuilder.filter()
    951         self : ScannerBuilder
    952         """
--> 953         check_status(self.builder.Filter(filter_expression.unwrap()))
    954         return self
    955 

~/src/db/arrow/python/pyarrow/error.pxi in pyarrow.lib.check_status()
     86             raise ArrowNotImplementedError(message)
     87         elif status.IsTypeError():
---> 88             raise ArrowTypeError(message)
     89         elif status.IsCapacityError():
     90             raise ArrowCapacityError(message)

ArrowTypeError: cannot compare expressions of differing type, float vs double

Member Author

Well, I didn't want to use any implicit behaviour in the first iteration. Perhaps this should be done by the C++ filter method?

Member Author

We should handle it nicely in a follow-up PR.
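(For illustration, a minimal Python-side sketch of such an implicit cast; coerce_scalar is a hypothetical helper, not part of this PR, and as noted above the real fix would more likely live in the C++ Filter method.)

import pyarrow as pa

def coerce_scalar(schema, field_name, value):
    # Round-trip the raw Python value through a one-element Arrow array
    # to apply Arrow's cast rules, so a float32 column can be compared
    # against 1000.0 (which Python otherwise infers as double).
    typ = schema.field(field_name).type
    return pa.array([value]).cast(typ)[0]

schema = pa.schema([("total_amount", pa.float32())])
coerce_scalar(schema, "total_amount", 1000.0)  # -> a float32 scalar, not double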

Member

It's not fully clear to me what this last sentence means exactly.
Does it mean this schema will be used to filter/project the output of the data sources, or that the other projections/filters specified when scanning should result in something that matches this schema?

Member Author

Well, it was not clear to me either; it is copied from the C++ API docs. In the case of projections this schema seems to be omitted? cc @fsaintjacques

@kszucs (Member Author) commented Dec 7, 2019

Hmm, GitHub doesn't let me request @jorisvandenbossche's review, so here it is :)

Member Author

We should have a JIRA about introducing InMemoryDataSource; if I recall correctly, we already have one?

@kszucs (Member Author) commented Dec 7, 2019

I'm not planning to implement any new features here, although we should discuss the possible follow-up PRs. A couple of candidates:

  • We need to refactor the scalar handling on the Python side; I have a WIP patch for it.
  • We need to define a more Pythonic API for the dataset bindings, because the current one is pretty low-level.
  • We need to improve the tests.
  • We need to improve the API docs.
  • We should have hypothesis strategies and tests for the datasets.
  • We should exercise the unit tests from the current parquet dataset implementation against the new one.
  • We should develop a shim over the datasets API to make the transition from the previous parquet dataset implementation smoother.

@jorisvandenbossche (Member) left a comment

Quickly tested some of the code snippets I wrote last week again, and that's all still working nicely.
One thing I noticed is that due to the removal of DataFragment and FileSource, you can no longer check the file paths in a Dataset/FileSystemDataSource. Maybe something to consider later, if / how we want to expose something like that.

There are still several classes and methods/properties that need docstrings, but it would be fine for me to do that as a follow-up if that means we can merge this faster.

@jorisvandenbossche (Member)

I'm not planning to implement any new features here, although we should discuss the possible follow-up PRs. A couple of candidates:

That sounds good to me.

We need to define a more Pythonic API for the dataset bindings, because the current one is pretty low-level.

Yes, something like the open_dataset from my notebook (https://nbviewer.jupyter.org/gist/jorisvandenbossche/73f7c8d0921a79b461c0a4928fbdc7fa) can probably be part of this (which was modelled after the existing R methods).

We should develop a shim over the datasets API to make the transition from the previous parquet dataset implementation smoother.

And we also still need to discuss to what extent we want to keep the existing parquet dataset implementation, deprecate it, or implement it partly using the new machinery (e.g. if we want to support dask's usage, we need to keep it).
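(Roughly the shape the follow-up in #6022 ended up taking, sketched from the notebook; the to_table call on the result is illustrative.)

import pyarrow.dataset as ds

# One call that discovers the files, infers the schema, and applies
# hive-style partition discovery:
dataset = ds.open_dataset("test_dataset/", partitioning="hive")
table = dataset.to_table()  # materialize everything into a pyarrow.Table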

@bkietz (Member) left a comment

This is looking good, thanks for doing this!

I think most of your follow-up candidates can wait for a different PR, but the unit tests need some work before this can be merged.

@pitrou (Member) left a comment

Just some comments from skimming over this.

Member

It's not obvious to me why partition_expression should be equal to the source_partition constructor argument. Can you explain?

Member

That argument is used to set the partition_expression property, so they should be equal.

Member

Need to add docstrings for all the public classes here.

Member

Should also add docstrings for public methods.

@wesm (Member) left a comment

In my quick read I didn't see anything too unreasonable. I left a handful of comments.

Member

Do you want to check that the memory pool you pass in is passed on correctly to the ScanContext? I think you can use logging_memory_pool to check.
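(A minimal sketch of that check; the scanner wiring is elided since it depends on this PR's constructor shape.)

import pyarrow as pa

# Wrap the default pool so every allocation through it is logged to
# stderr; if the pool is threaded through to the ScanContext correctly,
# a scan should produce log lines and grow bytes_allocated() here.
pool = pa.logging_memory_pool(pa.default_memory_pool())
# ... build the scanner with `pool` as its memory pool, run a scan, then:
pool.bytes_allocated()  # nonzero once the scan allocated through this pool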

Member

You might consider doing dispatch with a dict instead (a sketch follows the snippet below).

self.init(shared_ptr[CFileFormat](new CParquetFileFormat()))
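(In plain Python, the suggested dict dispatch could look like this; the commented-out extra entry is hypothetical.)

from pyarrow.dataset import ParquetFileFormat

# Map format names to constructors once, instead of an if/elif chain.
_FORMATS = {
    "parquet": ParquetFileFormat,
    # "ipc": IpcFileFormat,  # further formats would slot in here
}

def make_file_format(name):
    try:
        return _FORMATS[name]()
    except KeyError:
        raise ValueError("unsupported file format: {!r}".format(name))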


cdef class PartitionScheme:
Member

Is it necessary to expose the partition scheme as a class at all? I think it would suffice to have factories like make_hive_partition_scheme().

Member

Is it necessary to expose the partition scheme as a class at all? I think it would suffice to have factories like make_hive_partition_scheme().

What would a make_hive_partition_scheme() then return if the partition scheme object itself is not exposed?

Member Author

ds.partition_scheme(pa.Schema schema, string flavor='hive') -> ds.PartitionScheme

Although I'd prefer not to have opaque return types, so I can distinguish between:

ds.partition_scheme(schema)
ds.partition_scheme(schema, flavor='hive')

If we return the same class, then I can't inspect how they differ.
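(A toy sketch of that shape in plain Python, with stand-ins for the actual Cython classes, showing how distinct return types stay inspectable.)

import pyarrow as pa

class PartitionScheme:
    def __init__(self, schema):
        self.schema = schema

class SchemaPartitionScheme(PartitionScheme):
    """Directory names map to schema fields positionally."""

class HivePartitionScheme(PartitionScheme):
    """Directory names are key=value pairs."""

def partition_scheme(schema, flavor=None):
    # The flavor argument selects which concrete subclass to construct.
    if flavor is None:
        return SchemaPartitionScheme(schema)
    if flavor == 'hive':
        return HivePartitionScheme(schema)
    raise ValueError("unknown flavor: {!r}".format(flavor))

scheme = partition_scheme(pa.schema([('year', pa.int32())]), flavor='hive')
isinstance(scheme, HivePartitionScheme)  # True; the flavors stay distinguishable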

@bkietz (Member) left a comment

LGTM.

Follow-up JIRA for winnowing the public classes: https://issues.apache.org/jira/browse/ARROW-7391

@bkietz closed this in 9cb49f3 on Dec 13, 2019
nealrichardson pushed a commit that referenced this pull request Jan 18, 2020
Follow-up on #5237 adding a higher-level API for datasets

Closes #6022 from jorisvandenbossche/dataset-python and squashes the following commits:

745c218 <Joris Van den Bossche> rename keyword to partitioning + refactor tests + more coverage
8e03282 <Joris Van den Bossche> update for big renaming + doc updates
9c95938 <Joris Van den Bossche> Use FileSystem.from_uri
ac0d83d <Joris Van den Bossche> split into source / dataset functions
866f72c <Joris Van den Bossche> Add single partitioning() function from kszucs + tests
7481fb6 <Joris Van den Bossche> fix import for python 2
d59595d <Joris Van den Bossche> add partition scheme creation functions
260b737 <Joris Van den Bossche> add support for Pathlib
5e00c87 <Joris Van den Bossche> fix with new partition discovery option
757fe80 <Joris Van den Bossche>  Add higher level open_dataset function

Authored-by: Joris Van den Bossche <[email protected]>
Signed-off-by: Neal Richardson <[email protected]>