Skip to content

Conversation

@amol-
Copy link
Member

@amol- amol- commented Jun 21, 2022

Expose a Dataset.filter method that applies a filter to the dataset without actually loading it in memory.

Addresses what was discussed in #13155 (comment)

  • Update documentation
  • Ensure the filtered dataset preserves the filter when writing it back
  • Ensure the filtered dataset preserves the filter when joining
  • Ensure the filtered dataset preserves the filter when applying standard Dataset.something methods.
  • Allow to extend the filter by adding more conditions subsequently dataset(filter=X).filter(filter=Y).scanner(filter=Z) (related to ARROW-16616: [Python] Add lazy Dataset.filter() method #13409 (comment))
  • Refactor to use only Dataset class instead of FilteredDataset as discussed with @jorisvandenbossche
  • Add support in replace_schema
  • Error in get_fragments in case a filter is set.
  • Verify support in UnionDataset

@amol- amol- requested a review from pitrou June 21, 2022 16:55
@github-actions
Copy link

@github-actions
Copy link

⚠️ Ticket has not been started in JIRA, please click 'Start Progress'.

@pitrou
Copy link
Member

pitrou commented Jun 28, 2022

I have no idea whether we want to expose such lazy construction APIs on Dataset.

cc @jorisvandenbossche

@amol-
Copy link
Member Author

amol- commented Jun 28, 2022

I have no idea whether we want to expose such lazy construction APIs on Dataset.

FYI, this task has spawned from #13155 (comment) discussion

@amol- amol- marked this pull request as ready for review July 5, 2022 09:59
@amol-
Copy link
Member Author

amol- commented Jul 5, 2022

@pitrou Addressed all comments, should be ready for re-review

Copy link
Member

@westonpace westonpace left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not completely against this but having FilteredDataset instead of something like Query might be a bit short-sighted. What happens if a user wants to add a dynamic column (project)?

If you had both a projection expression and a filter expression that might be more close to what scanner / datasets provides today.

@amol- amol- closed this Jul 6, 2022
@amol- amol- reopened this Jul 6, 2022
@jorisvandenbossche
Copy link
Member

I am personally also a bit wary of adding a new public class like FilteredDataset (at least until we have had the broader discussion about how we want to provide a dataframe-like / query object interface, as similar discussions will keep coming up for other methods).
If we want to provide this filter() method on the short term, I would also prefer doing it just on Dataset, as Weston suggested (that was also my original idea for this issue). Although that also creates its backwards compatibility issues of course, if we later let this method return an object backed by a query plan, as that then might not keep all methods/attributes that are currently available on Dataset.

@jorisvandenbossche
Copy link
Member

get_fragments is also another method where the user could expect this filter to be applied (giving the same result as specifying the filter keyword)

@amol-
Copy link
Member Author

amol- commented Jul 11, 2022

I'm not completely against this but having FilteredDataset instead of something like Query might be a bit short-sighted. What happens if a user wants to add a dynamic column (project)?

If you had both a projection expression and a filter expression that might be more close to what scanner / datasets provides today.

@westonpace I'm not particularly attached to the FilteredDataset name, I just want to avoid using the Query name explicitly to ensure we avoid hinting users that it's ever supposed to become a fully fledged query system at the moment. They can use IBIS if they are looking for that.

I also dislike the idea of reusing the Scanner class as it smells like hijacking its data read responsibility. I wanted a name that conveys correctly the idea of something that represents a dataset with an applied transformation and to which additional transformations can be applied. Maybe QueriedDataset could do? Smells a lot like the query already happened, so not exactly what I was looking for. I'm open to suggestions. Naming things correctly seems hard, I could try invalidating some caches instead ;)

@jorisvandenbossche
Copy link
Member

From #13409 (comment)

If self._filter can be None then what is the advantage of creating a separate FilteredDataset instead of just adding _filter to the existing Dataset?

That was the original implementation, and I was asked to explicitly move it in a dedicated class. Which I think in the end makes sense, better have a single responsibility per class.

Could you expand on this a bit? (I don't know where or why it was asked to move to a dedicated class, the only reference I find in the other PR is the question if this shouldn't live on the Scanner)

It seems to me that if we want to expose a helper filter() method (although it doesn't give that much of value compared to passing the filter to the method that actually will do the scanning, i.e. to_table(..), to_batches(..), etc), adding it just to the main Dataset class will expose the least amount of new API that we "lock in" (it avoids deciding now if we want some "Query" like class)

@amol- amol- added this to the 11.0.0 milestone Nov 28, 2022
Copy link
Member

@jorisvandenbossche jorisvandenbossche left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the update!

There are a few more complexities:

There are some other Dataset methods that might also need to take into account this filter: replace_schema, get_fragments.
And what with a UnionDataset unioning filtered datasets ?

We should probably also note somewhere that Fragments don't inherit this filter.
The ParquetFileFragment has some methods that work with filters (split_by_row_group, subset).


cdef cppclass CSourceNodeOptions "arrow::compute::SourceNodeOptions"(CExecNodeOptions):
@staticmethod
CResult[shared_ptr[CSourceNodeOptions]] FromRecordBatchReader(
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this still used in the current version of the PR?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No anymore, but it still had a value on its own, so I didn't remove it. If it's a concern I can easily get rid of it. (Even though I would leave the C++ implementation around).

@jorisvandenbossche jorisvandenbossche changed the title ARROW-16616: [Python] Lazy datasets filtering ARROW-16616: [Python] Add Dataset.filter() method Dec 6, 2022
@jorisvandenbossche jorisvandenbossche changed the title ARROW-16616: [Python] Add Dataset.filter() method ARROW-16616: [Python] Add lazy Dataset.filter() method Dec 6, 2022
@amol-
Copy link
Member Author

amol- commented Dec 7, 2022

@jorisvandenbossche I checked for ParquetDataset and the experience is fairly confusing from the end user point of view.

If the dataset is created using ds.parquet_dataset it will have the filter capabilities, but if it's created using pyarrow.parquet.ParquetDataset it won't have filtering capabilities. But ParquetDataset in its V2 implementation is just a proxy to ds.Dataset, so it could in theory gain filtering support.

It seems that ParquetDataset is mostly a duplicate of what ds.parquet_dataset can do when use_legacy_dataset=False, so is there a reason why we keep it around? Is there a plan to deprecate it in the future?

Asking because if the plan is to deprecate it some day, then it probably doesn't make much sense to invest the effort to work toward feature parity with ds.Dataset and we can consider this task done.

@jorisvandenbossche
Copy link
Member

jorisvandenbossche commented Dec 7, 2022

I think you certainly don't have to care about ParquetDataset in this PR (we generally didn't add any of the additional methods from pyarrow.dataset.Dataset to it, so this PR just follows that). And let's have the discussion about what to do with ParquetDataset in a dedicated place (eg we already have https://issues.apache.org/jira/browse/ARROW-9720, although the description there is a bit outdated)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

5 participants