
Add load_dataset implementation #192

Merged
merged 11 commits into from
Dec 18, 2024
Conversation

@goeffthomas (Contributor) commented Dec 11, 2024

This adds the load_dataset implementation without including it in the exported kagglehub module yet. Before adding it to the kagglehub export (and making the feature more easily discoverable), I want to add more testing around all the different file types and settings that are possible to load. That will come in a follow-up PR, along with documentation on how to use it.

Notes:

  • This PR also includes adding the hf-datasets/pandas-datasets optional dependencies and hatch environments to install them
  • The implementation for this feature has changed a few times. The other iterations should be visible in the commit history, but as a recap:
    • apache_beam doesn't play nicely with datasets due to an underlying dependency issue
    • mlcroissant (and Croissant itself) has some major file type limitations
    • Curated/validated arguments for datasets and pandas
    • Current: simple passthrough kwargs for the underlying dependencies
  • As implemented, memory is the main constraint. dask looks like a great way to unlock more/larger datasets, but I'm leaning toward leaving that as an improvement for if/when users run into in-memory problems. Open to feedback on this decision (or any others, big or small, re: the implementation).

I went a little wide with reviews, but this is feeling like a direction I could see us going, and I want to make sure I'm not missing any blind spots. This is my first major Python contribution, so don't be shy with feedback 🙏

http://b/379756505

This adds the `load_dataset` implementation without including it to the exported `kagglehub` module. This PR also includes:
- Examples and documentation for how to run saved and temporary python scripts via hatch
- Adding the `hf-datasets` optional dependencies and a hatch environment that installs them
- Support for native hatch flags to pass through via `docker-hatch`; this also required renaming some of the flags defined in `docker-hatch` to avoid collisions

http://b/379756505
While testing, I discovered some major limitations to a dependency on `mlcroissant`. Namely:
- The spec does not currently support sqlite or Excel-like files, which means we can't load data from those file types: mlcommons/croissant#247
- Our current implementation of Croissant doesn't support parquet files because we don't analyze the schema of parquet files (yet). Without that, we can't generate the `RecordSet` information, so parquet is also unusable purely with Croissant.
- Our implementation of Croissant directs users to download the archive first, before pathing into the various tables within it. This means that interacting with any single table in a dataset requires downloading the entire dataset.

The real benefit of `mlcroissant` was that it handled all the file type parsing and reading for us. But since it's incomplete, we can do that ourselves with pandas.
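Since pandas ships readers for all of these formats, the dispatch can be as simple as mapping file extensions to pandas reader functions. A minimal sketch of that idea (the extension map and the `reader_for` helper below are illustrative assumptions, not the PR's actual dispatch table):

```python
from pathlib import Path

# Illustrative extension -> pandas reader mapping (assumed; the PR's actual
# table may differ). Each value names a pandas function, e.g. "read_csv"
# refers to pandas.read_csv.
READERS = {
    ".csv": "read_csv",
    ".tsv": "read_csv",
    ".json": "read_json",
    ".jsonl": "read_json",
    ".parquet": "read_parquet",
    ".feather": "read_feather",
    ".xls": "read_excel",
    ".xlsx": "read_excel",
    ".sqlite": "read_sql",
}

def reader_for(path: str) -> str:
    """Return the pandas reader name for a file, raising for unsupported types."""
    suffix = Path(path).suffix.lower()
    if suffix not in READERS:
        raise ValueError(f"Unsupported file type: {suffix}")
    return READERS[suffix]
```

This keeps the supported-format list in one place, which also makes the sqlite/Excel/feather coverage (that Croissant lacks) easy to see at a glance.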
@neshdev (Member) left a comment:

LGTM



def load_dataset(
    adapter: KaggleDatasetAdapter,
A reviewer (Contributor) commented:

Do we want a default value or do we prefer always having the user explicitly set what they want?

Maybe explicit is better; then the user will know what they will get.

@goeffthomas (Author) replied:

Yeah, because we now start with two adapters (instead of just the Hugging Face one), I preferred explicit.
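The explicit-adapter shape being discussed can be sketched with a stdlib enum. The member names below mirror the two adapters this PR starts with, but they are assumptions about kagglehub's actual definitions, and the function body is a stand-in, not the real implementation:

```python
from enum import Enum

class KaggleDatasetAdapter(Enum):
    # Assumed member names; the real enum in kagglehub may differ.
    HUGGING_FACE = "hugging_face"
    PANDAS = "pandas"

def load_dataset(adapter: KaggleDatasetAdapter, handle: str, path: str, **kwargs):
    # No default adapter: the caller must state which return type they want,
    # which keeps the API unambiguous now that there is more than one option.
    if adapter is KaggleDatasetAdapter.PANDAS:
        return f"pandas DataFrame for {handle}/{path}"
    if adapter is KaggleDatasetAdapter.HUGGING_FACE:
        return f"Hugging Face Dataset for {handle}/{path}"
    raise ValueError(f"Unknown adapter: {adapter!r}")
```

Making the first positional argument the adapter also means any future adapter is an additive change: new enum member, new branch, no change to existing call sites.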

@goeffthomas goeffthomas requested review from rosbo and neshdev December 17, 2024 18:31
# NOTE: Order matters here as we're letting users override our specified defaults, namely preserve_index=False.
# This may be valuable in the edge case that a user does actually want the index persisted as a column.
merged_kwargs = {**DEFAULT_PANDAS_KWARGS, **hf_kwargs}
return Dataset.from_pandas(result, **merged_kwargs)
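The override behavior in the snippet above follows from Python's dict-unpacking order: in `{**a, **b}`, keys from `b` win. A small self-contained check of that semantics (the `DEFAULT_PANDAS_KWARGS` value is assumed from the `preserve_index=False` note in the comment):

```python
# Assumed default, per the preserve_index note in the PR's comment.
DEFAULT_PANDAS_KWARGS = {"preserve_index": False}

def merge_hf_kwargs(hf_kwargs: dict) -> dict:
    # Later entries win in a {**a, **b} merge, so user-supplied hf_kwargs
    # override the library defaults while unrelated defaults are preserved.
    return {**DEFAULT_PANDAS_KWARGS, **hf_kwargs}
```

So a user who does want the index persisted as a column can pass `preserve_index=True` and it takes precedence over the default.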
@rosbo (Contributor) commented Dec 17, 2024:
@goeffthomas (Author) replied:

If we do that, we'd lose support for sqlite, Excel-like files, and feather.

@goeffthomas (Author) commented:

@jplotts @neshdev I'm going to merge this now because I have a follow-up PR ready to open. I can apply any feedback you leave here to that follow-up PR.

@goeffthomas goeffthomas merged commit f010e11 into main Dec 18, 2024
6 checks passed
@goeffthomas goeffthomas deleted the add-load-dataset branch December 18, 2024 07:29