ARROW-7432: [Python] Add higher level open_dataset function #6022
Conversation
Thanks for opening a pull request! Could you open an issue for this pull request on JIRA? Then could you also rename the pull request title in the following format? See also:

@jorisvandenbossche it can be rebased now
python/pyarrow/dataset.py
How about adding this functionality to FileSystemDataSourceDiscovery's paths_or_selector argument handling?
Sounds good
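Illustrative only: assuming "this functionality" means accepting a single file or directory path, the kind of normalization being discussed could look like the sketch below, written against today's pyarrow.fs API (the helper name is hypothetical and the discovery class has since been renamed):

```python
from pyarrow import fs

def _paths_or_selector(filesystem, path):
    # hypothetical helper: normalize a single path into what discovery expects
    info = filesystem.get_file_info(path)
    if info.type == fs.FileType.Directory:
        # a directory is expanded into a recursive file selector
        return fs.FileSelector(path, recursive=True)
    # a single file is passed through as a one-element list of paths
    return [path]
```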
python/pyarrow/dataset.py
I wonder how we should express support for not just one, but multiple data sources. For example, opening a dataset with a source from S3 and a local one:

open_dataset(['s3://bucket/base/path', 'local://base/path'])

We'll probably need more functions, but I cannot think of a meaningful name right now.
Or, in this case, will we simply require constructing the data sources and the dataset manually?
I personally think this advanced use case should not necessarily be handled by such a higher-level function (not sure how the dataset discovery in C++ will handle this), so it would indeed be up to the user to construct the data sources manually.
I agree with Joris.
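For reference, "constructing the data sources manually" can be sketched as follows with today's pyarrow.dataset API, which accepts a list of already-built datasets (paths, bucket, and region are placeholders; at the time of this PR the building blocks were still the DataSource/Discovery classes):

```python
import pyarrow.dataset as ds
from pyarrow import fs

# build each source explicitly, each with its own filesystem
local_part = ds.dataset("/data/base/path", format="parquet")
s3_part = ds.dataset(
    "bucket/base/path",
    filesystem=fs.S3FileSystem(region="us-east-1"),
    format="parquet",
)

# combine them manually instead of overloading the high-level helper
combined = ds.dataset([local_part, s3_part])
table = combined.to_table()
```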
python/pyarrow/tests/test_dataset.py
@kszucs in the other dataset tests you have been using a MockFilesystem, while here I used the pytest tempdir fixture, as we do in other tests. Was there a specific reason to use the mock filesystem (or should I also use it here)?
Less IO, but it's fine to use the local filesystem.
python/pyarrow/dataset.py
FWIW, this is a little different from what R does: https://github.com/apache/arrow/blob/master/r/R/dataset.R#L23-L29
Also, I think this can/should change when #6008 lands. IIUC discovery will automatically identify hive partitions. You may also want to accept a list of strings as the names of the (unnamed, non-hive) partition segments (or perhaps this is to be done on https://issues.apache.org/jira/browse/ARROW-7413).
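Both suggestions correspond to behavior the high-level function eventually supports; a hedged sketch with today's pyarrow.dataset API (directory layouts and field names are made up for illustration):

```python
import pyarrow.dataset as ds

# hive-style layout such as base/year=2019/month=11/part-0.parquet:
# discovery infers the partition fields from the paths
hive_ds = ds.dataset("base", format="parquet", partitioning="hive")

# unnamed directory segments such as base/2019/11/part-0.parquet:
# a list of strings names the partition fields
dir_ds = ds.dataset("base", format="parquet", partitioning=["year", "month"])
```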
OK, I updated the PR to add helper functions for the partitioning as well, and to use this in the higher-level function. I created a single helper function per partitioning scheme that can return either a factory or an actual partitioning (as being discussed in #6151 (comment), and similar to what is done in R). The comparison is then between the current low-level API (in case of a hive partitioning), the higher-level API with what I just pushed, and a variant with distinct functions for explicit creation vs factory/discovery (or we could also decide to be fine with the class-based API, of course).
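As a rough stand-in for the snippets being compared (exact class and function names at the time of this PR differed), the contrast looks roughly like this with today's pyarrow.dataset API; the schema and path are placeholders:

```python
import pyarrow as pa
import pyarrow.dataset as ds

part_schema = pa.schema([("year", pa.int16()), ("month", pa.int8())])

# low-level: construct the partitioning object explicitly, then pass it along
explicit = ds.partitioning(part_schema, flavor="hive")
dataset = ds.dataset("base/path", format="parquet", partitioning=explicit)

# higher level: let the helper discover the hive partitioning by itself
dataset = ds.dataset("base/path", format="parquet", partitioning="hive")
```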
From the discussion on Zulip, it's also an option to have a single partitioning() function:

partitioning(schema)                          # explicit schema scheme
partitioning(['date', 'client'])              # error: positional argument must be schema
partitioning(field_names=['date', 'client'])  # discover schema scheme
partitioning(schema, flavor='hive')           # explicit hive
partitioning(flavor='hive')                   # discover everything
partitioning(field_names=['date', 'client'], flavor='hive')  # warning: hive partitioning ignores field_names

or, without requiring the positional argument to be a Schema (combining it with field_names):

partitioning(['date', 'client'])              # schema scheme
partitioning(['date', 'client'], flavor='hive')
partitioning(flavor='hive')                   # discover everything
partitioning(schema)                          # explicit schema scheme
partitioning(schema, flavor='hive')           # explicit hive
partitioning(lambda ...)                      # function partitioning
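One way such a single entry point could dispatch on its arguments, sketched with today's pyarrow.dataset classes (illustrative only: it omits the error and warning cases listed above and is not the actual implementation):

```python
import pyarrow.dataset as ds

def partitioning(schema=None, field_names=None, flavor=None):
    # sketch of the argument dispatch discussed above
    if flavor == "hive":
        if schema is not None:
            return ds.HivePartitioning(schema)       # explicit hive
        return ds.HivePartitioning.discover()        # discover everything
    if schema is not None:
        return ds.DirectoryPartitioning(schema)      # explicit schema scheme
    if field_names is not None:
        # infer the types of the named fields from the file paths
        return ds.DirectoryPartitioning.discover(field_names)
    raise ValueError("pass a schema, field_names, or flavor='hive'")
```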
Combining
I've created a high-level API prototype including support for multiple data sources. @jorisvandenbossche PTAL.
Because the C++ API uses
If we want to unify the
Thanks @kszucs! Further, the main differences / discussion points are (focusing on the single source use case):

That's something we want to add anyhow, no? (Also for easily creating a source without needing to specify the filesystem as an object.)
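On the last point, a small sketch of what "a source without needing to specify the filesystem as an object" looks like with today's pyarrow.dataset API (paths and bucket name are placeholders; an S3-enabled build and credentials are assumed):

```python
import pyarrow.dataset as ds

# plain local path: the local filesystem is used implicitly
local_ds = ds.dataset("/data/events", format="parquet")

# URI: the matching filesystem (here S3) is constructed from the scheme
remote_ds = ds.dataset("s3://bucket/events", format="parquet")
```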
@bkietz the partition fields and values are included in the record batches coming from the scan. I guess those can be ignored if we pass a whitelist to the scanner projection, excluding the unnecessary partition fields. Either way, it rather seems handy to me.
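In today's terms, the whitelist idea corresponds to passing an explicit column projection so the partition fields are simply not materialized; a small sketch (path, column names, and the partitioning flavor are placeholders):

```python
import pyarrow.dataset as ds

dataset = ds.dataset("base/path", format="parquet", partitioning="hive")

# project only the physical columns; the partition fields are left out
table = dataset.to_table(columns=["a", "b"])
```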
Agree, and good question :)

I'm fine with both. If we can express both cases with a single function signature, we can go with that.

Technically we're not opening, rather describing and discovering a dataset, although the latter could be better because of the

Definitely, but it could help with expressing the single source shortcut.
@jorisvandenbossche please incorporate the changes you agree with from my branch

I agree that the "open" is not really needed, but indeed having an identical name for the function and the module is what holds me back as well.
Let's defer the naming discussion until later then, and use open_dataset for consistency with R for now.
Already added the single
@jorisvandenbossche Thanks! Do you want to add support for multiple sources, or shall we handle it in a follow-up?
I was not planning to do that here (we don't yet have support for that in Python, so we can't test this, I think?).
@jorisvandenbossche the source/dataset distinction would be nice. I can add the multi-source support later, based on my branch, as a follow-up.
OK, updated this with separate functions for creating a source and a dataset.
In the end, I went with allowing to use
The
@kszucs please take a look! (I still need to add some additional tests to cover new aspects of the API)
bkietz left a comment
This is looking great, thanks for doing this!
python/pyarrow/dataset.py
I don't think an argument named field_names should accept a schema. Please move it to a separate argument or maybe rename to schema_or_field_names
python/pyarrow/tests/test_dataset.py
There's a lot of repeated code in these tests. Please parameterize them or write helpers
Can you specify a bit more? What aspect do you find repetitive?
The creation of the data? (But this is different for each test.) Or the way that the resulting dataset is checked? That could indeed be written with a helper function, but it might also point to an overly verbose API (e.g. @kszucs was adding a Dataset.to_table method so that dataset.new_scan().finish().to_table() could be shortened to dataset.to_table()).
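For illustration, the shortcut being described amounts to a thin wrapper like the following sketch (method names follow the comment above; the actual implementation lives on the Cython Dataset class and is not reproduced here):

```python
class Dataset:
    # sketch only: convenience method wrapping the existing scan-based API
    def to_table(self, **scan_options):
        # equivalent to writing dataset.new_scan(...).finish().to_table() by hand
        return self.new_scan(**scan_options).finish().to_table()
```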
Here's a sketch of some helpers and a rewrite of test_open_dataset_directory which should clarify what I mean. I won't block on this; the test suite isn't super long yet.

Imports assumed by the helpers:

import pathlib
import pytest
import pyarrow as pa
import pyarrow.dataset as ds

For data creation:

def write_parquet_table(columns, path):
    assert isinstance(path, pathlib.Path)
    path.parent.mkdir(exist_ok=True)
    table = pa.table(columns)
    import pyarrow.parquet as pq
    pq.write_table(table, path)
    return table

For dataset extraction:

def dataset_table(path, **kwargs):
    assert isinstance(path, pathlib.Path)
    dataset = ds.dataset(path, **kwargs)
    table = dataset.to_table()
    # verify equivalently creatable from string path
    assert ds.dataset(str(path)).to_table().equals(table)
    return dataset, table

@pytest.mark.parquet
def test_open_dataset(tempdir):
    columns = {'a': range(9), 'b': [0.] * 4 + [1.] * 5}
    expected = write_parquet_table(columns, tempdir / 'test.parquet')
    dataset, actual = dataset_table(tempdir / 'test.parquet')
    assert actual.replace_schema_metadata().equals(expected)
jorisvandenbossche left a comment
@bkietz thanks for the review!
OK, I updated this after the renaming. @kszucs can you have a look?
nealrichardson left a comment
I'm merging this even though there are a couple of pending discussions. I suggest that any further revision can happen in the context of https://issues.apache.org/jira/browse/ARROW-7431 (adding docs), which may lead us to other ergonomic changes anyway. @jorisvandenbossche can you pick that issue up next?
Follow-up on #5237 adding a higher-level API for datasets