
Add load_dataset implementation #192

Merged
merged 11 commits into from
Dec 18, 2024
Conversation

@goeffthomas (Contributor) commented Dec 11, 2024

This adds the load_dataset implementation without including it in the exported kagglehub module yet. Before adding it to the kagglehub export (and making the feature more easily discoverable), I want to add more testing around all the different file types and settings that are possible to load. That will come in a follow-up PR, along with documentation on how to use it.

Notes:

  • This PR also includes adding the hf-datasets/pandas-datasets optional dependencies and hatch environments to install them
  • The implementation for this feature has changed a few times. The other iterations should be visible in the commit history, but as a recap:
    • apache_beam doesn't play nicely with datasets due to an underlying dependency issue
    • mlcroissant (and Croissant itself) has some major file type limitations
    • Curated/validated arguments for datasets and pandas
    • Current: simple passthrough kwargs for the underlying dependencies
  • As implemented, memory is the main constraint. dask looks like a great way to unlock more/larger datasets, but I'm leaning toward leaving that as an improvement for if/when users run into in-memory problems. Open to feedback on this decision (or any others, big or small, re: the implementation).

I went a little wide with reviews, but this is feeling like a direction I could see us going, and I want to make sure I'm not missing any blind spots. This is my first major Python contribution, so don't be shy with feedback 🙏

http://b/379756505

This adds the `load_dataset` implementation without including it to the exported `kagglehub` module. This PR also includes:
- Examples and documentation for how to run saved and temporary python scripts via hatch
- Adding the `hf-datasets` optional dependencies and a hatch environment that installs them
- Support for native hatch flags to pass through via `docker-hatch`; this also required renaming some of the flags defined in `docker-hatch` to avoid collisions

http://b/379756505
While testing, I discovered some major limitations to a dependency on `mlcroissant`. Namely:
- The spec does not currently support sqlite or Excel-like files, which means we can't load data from those file types: mlcommons/croissant#247
- Our current implementation of Croissant doesn't support parquet files because we don't analyze the schema of parquet files (yet). Without that, we can't generate the `RecordSet` information, so parquet is also unusable purely with Croissant.
- Our implementation of Croissant directs users to download the archive first, before pathing into the various tables within it. This means that interacting with any single table in a dataset requires downloading the entire dataset.

The real benefit of `mlcroissant` was that it handled all the file type parsing and reading for us. But since it's incomplete, we can do that ourselves with pandas.
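Since pandas ships readers for all of these formats, the dispatch can be as simple as mapping file extensions to pandas reader functions. A minimal sketch of that idea (the extension map and the `reader_for` helper below are illustrative assumptions, not the PR's actual dispatch table):

```python
from pathlib import Path

# Illustrative extension -> pandas reader mapping (assumed; the PR's actual
# table may differ). Each value names a pandas function, e.g. "read_csv"
# refers to pandas.read_csv.
READERS = {
    ".csv": "read_csv",
    ".tsv": "read_csv",
    ".json": "read_json",
    ".jsonl": "read_json",
    ".parquet": "read_parquet",
    ".feather": "read_feather",
    ".xls": "read_excel",
    ".xlsx": "read_excel",
    ".sqlite": "read_sql",
}

def reader_for(path: str) -> str:
    """Return the pandas reader name for a file, raising for unsupported types."""
    suffix = Path(path).suffix.lower()
    if suffix not in READERS:
        raise ValueError(f"Unsupported file type: {suffix}")
    return READERS[suffix]
```

This keeps the supported-format list in one place, which also makes the sqlite/Excel/feather coverage (that Croissant lacks) easy to see at a glance.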
@neshdev (Member) left a comment:

LGTM



def load_dataset(
    adapter: KaggleDatasetAdapter,
A reviewer (Contributor) commented:

Do we want a default value or do we prefer always having the user explicitly set what they want?

Maybe explicit is better; then the user will know what they will get.

@goeffthomas (Author) replied:

Yeah, because we now start with two adapters (instead of just the Hugging Face one), I preferred explicit.
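The explicit-adapter shape being discussed can be sketched with a stdlib enum. The member names below mirror the two adapters this PR starts with, but they are assumptions about kagglehub's actual definitions, and the function body is a stand-in, not the real implementation:

```python
from enum import Enum

class KaggleDatasetAdapter(Enum):
    # Assumed member names; the real enum in kagglehub may differ.
    HUGGING_FACE = "hugging_face"
    PANDAS = "pandas"

def load_dataset(adapter: KaggleDatasetAdapter, handle: str, path: str, **kwargs):
    # No default adapter: the caller must state which return type they want,
    # which keeps the API unambiguous now that there is more than one option.
    if adapter is KaggleDatasetAdapter.PANDAS:
        return f"pandas DataFrame for {handle}/{path}"
    if adapter is KaggleDatasetAdapter.HUGGING_FACE:
        return f"Hugging Face Dataset for {handle}/{path}"
    raise ValueError(f"Unknown adapter: {adapter!r}")
```

Making the first positional argument the adapter also means any future adapter is an additive change: new enum member, new branch, no change to existing call sites.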

@goeffthomas goeffthomas requested review from rosbo and neshdev December 17, 2024 18:31
# NOTE: Order matters here as we're letting users override our specified defaults, namely preserve_index=False.
# This may be valuable in the edge case that a user does actually want the index persisted as a column.
merged_kwargs = {**DEFAULT_PANDAS_KWARGS, **hf_kwargs}
return Dataset.from_pandas(result, **merged_kwargs)
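The override behavior in the snippet above follows from Python's dict-unpacking order: in `{**a, **b}`, keys from `b` win. A small self-contained check of that semantics (the `DEFAULT_PANDAS_KWARGS` value is assumed from the `preserve_index=False` note in the comment):

```python
# Assumed default, per the preserve_index note in the PR's comment.
DEFAULT_PANDAS_KWARGS = {"preserve_index": False}

def merge_hf_kwargs(hf_kwargs: dict) -> dict:
    # Later entries win in a {**a, **b} merge, so user-supplied hf_kwargs
    # override the library defaults while unrelated defaults are preserved.
    return {**DEFAULT_PANDAS_KWARGS, **hf_kwargs}
```

So a user who does want the index persisted as a column can pass `preserve_index=True` and it takes precedence over the default.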
@rosbo (Contributor) commented Dec 17, 2024:
@goeffthomas (Author) replied:

If we do that, we'd lose support for sqlite, Excel-like files, and feather.

@goeffthomas (Author) commented:

@jplotts @neshdev I'm going to merge this now because I have a follow-up PR ready to open. I can apply any feedback you leave here to that follow-up PR.

@goeffthomas goeffthomas merged commit f010e11 into main Dec 18, 2024
6 checks passed
@goeffthomas goeffthomas deleted the add-load-dataset branch December 18, 2024 07:29