Add `load_dataset` implementation #192
Conversation
This adds the `load_dataset` implementation without including it in the exported `kagglehub` module. This PR also includes:

- Examples and documentation for how to run saved and temporary Python scripts via hatch
- Adding the `hf-datasets` optional dependencies and a hatch environment that installs them
- Support for native hatch flags to pass through via `docker-hatch`; this also required renaming some of the flags defined in `docker-hatch` to avoid collisions

http://b/379756505
While testing, I discovered some major limitations of depending on `mlcroissant`. Namely:

- The spec does not currently support sqlite or Excel-like files, which means we can't load data from those file types: mlcommons/croissant#247
- Our current implementation of Croissant doesn't support parquet files because we don't analyze the schema of parquet files (yet). Without that, we can't generate the `RecordSet` information, so parquet is also unusable purely with Croissant.
- Our implementation of Croissant directs users to download the archive first, before pathing into the various tables within. This means that interacting with any table in a dataset requires downloading the entire dataset.

The real benefit of `mlcroissant` was that it handled all the file type parsing and reading for us. But since it's incomplete, we can do that ourselves with pandas.
LGTM
```python
def load_dataset(
    adapter: KaggleDatasetAdapter,
```
Do we want a default value or do we prefer always having the user explicitly set what they want?
Maybe explicit is better; the user will know what they will get.
Yeah, because we now have 2 to start (instead of just the one HF one), I preferred explicit.
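For context, a required, no-default adapter argument could look like the sketch below. The enum member names and the `load_dataset` body here are illustrative assumptions, not the PR's actual code:

```python
from enum import Enum


class KaggleDatasetAdapter(Enum):
    # Two adapters to start, per the discussion above; names are assumed.
    HUGGING_FACE = "hugging_face"
    PANDAS = "pandas"


def load_dataset(adapter: KaggleDatasetAdapter, handle: str):
    # No default for `adapter`: the caller must say explicitly whether
    # they want a Hugging Face Dataset or a pandas DataFrame back.
    if not isinstance(adapter, KaggleDatasetAdapter):
        raise ValueError(f"Unknown adapter: {adapter!r}")
    ...  # dispatch to the adapter-specific loading path
```

Because `adapter` is positional and has no default, forgetting it is a `TypeError` at the call site rather than a silently chosen format.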
```python
# NOTE: Order matters here as we're letting users override our specified defaults, namely preserve_index=False.
# This may be valuable in the edge case that a user does actually want the index persisted as a column.
merged_kwargs = {**DEFAULT_PANDAS_KWARGS, **hf_kwargs}
return Dataset.from_pandas(result, **merged_kwargs)
```
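The merge-order note can be demonstrated with plain dicts (the kwarg values here are made up for illustration): in a `{**a, **b}` literal, keys unpacked later win, so user-supplied kwargs override the defaults.

```python
DEFAULT_PANDAS_KWARGS = {"preserve_index": False}  # the library's default
user_kwargs = {"preserve_index": True}  # user explicitly wants the index kept

# Later unpacking wins, so the user's value overrides the default.
merged = {**DEFAULT_PANDAS_KWARGS, **user_kwargs}
print(merged)  # {'preserve_index': True}
```

Reversing the order would silently discard the user's choice, which is why the NOTE flags it.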
Can you call `load_dataset` directly instead, and not split kwargs between pandas & hf? https://screenshot.googleplex.com/552iQEq9vcFF2V9
If we do that, we'd lose support for sqlite, Excel-like files, and feather.
This adds the `load_dataset` implementation without including it in the exported `kagglehub` module yet. Before adding it to the `kagglehub` export (and making the feature more easily discoverable), I want to add more testing around all the different file types and settings that are possible to load. This will come in a follow-up PR along with added documentation on how to use it.

Notes:

- `hf-datasets`/`pandas-datasets` optional dependencies and hatch environments to install them
- `apache_beam` doesn't play nicely with `datasets` due to an underlying dependency issue
- `mlcroissant` (and Croissant itself) has some major file type limitations
- `datasets` and `pandas` `kwargs` for the underlying dependencies

I went a little wide with reviews, but this is feeling more like a direction I could see us going and want to make sure I'm not missing any blind spots. This is my first major Python contribution, so don't be shy about feedback 🙏
http://b/379756505