Add Huggingface support for data preparation#419

Closed
fsiino-nvidia wants to merge 23 commits into main from fsiino/416-hf-download-utils
@fsiino-nvidia
Contributor

@fsiino-nvidia fsiino-nvidia commented Dec 2, 2025

Integrates Hugging Face dataset download into ng_prepare_data
Refines ng_download_data_from_hf to handle non-JSONL files (such as Parquet)
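
For readers unfamiliar with what handling non-JSONL files involves, here is a minimal sketch of normalizing a downloaded file to JSONL based on its extension. It is not the code from this PR: the function names `iter_records` and `write_jsonl` are hypothetical, the `.json` branch assumes the file holds a list of records, and the Parquet branch assumes pandas with a Parquet engine (such as pyarrow) is installed.

```python
import json
from pathlib import Path
from typing import Iterable, Iterator


def iter_records(path: Path) -> Iterator[dict]:
    """Yield one dict per record from a .jsonl, .json, or .parquet file."""
    suffix = path.suffix.lower()
    if suffix == ".jsonl":
        with path.open(encoding="utf-8") as f:
            for line in f:
                if line.strip():
                    yield json.loads(line)
    elif suffix == ".json":
        # Assumption: the file contains a JSON list of records.
        yield from json.loads(path.read_text(encoding="utf-8"))
    elif suffix == ".parquet":
        # Imported lazily so JSON-only use does not require pandas.
        import pandas as pd

        # Round-trip through pandas' JSON writer so NumPy and timestamp
        # values become plain JSON types.
        text = pd.read_parquet(path).to_json(orient="records", lines=True)
        for line in text.split("\n"):
            if line.strip():
                yield json.loads(line)
    else:
        raise ValueError(f"unsupported file type: {path.name}")


def write_jsonl(records: Iterable[dict], out_path: Path) -> int:
    """Write records as JSONL and return how many were written."""
    out_path.parent.mkdir(parents=True, exist_ok=True)
    count = 0
    with out_path.open("w", encoding="utf-8") as f:
        for record in records:
            f.write(json.dumps(record) + "\n")
            count += 1
    return count
```

A caller would run `write_jsonl(iter_records(downloaded_path), target_jsonl_path)` after the download step, where both paths are supplied by the caller.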

Signed-off-by: Frankie Siino <fsiino@nvidia.com>
@copy-pr-bot

copy-pr-bot bot commented Dec 2, 2025

Auto-sync is disabled for draft pull requests in this repository. Workflows must be run manually.

Contributors can view more details about this message here.

@fsiino-nvidia fsiino-nvidia linked an issue Dec 2, 2025 that may be closed by this pull request
Signed-off-by: Frankie Siino <fsiino@nvidia.com>
@fsiino-nvidia fsiino-nvidia changed the title Fs/Hf download utils Fsiino/Hf download utils Dec 3, 2025
Signed-off-by: Frankie Siino <fsiino@nvidia.com>
Signed-off-by: Frankie Siino <fsiino@nvidia.com>
Signed-off-by: Frankie Siino <fsiino@nvidia.com>
Signed-off-by: Frankie Siino <fsiino@nvidia.com>
Signed-off-by: Frankie Siino <fsiino@nvidia.com>
@fsiino-nvidia fsiino-nvidia marked this pull request as ready for review December 4, 2025 03:47
@fsiino-nvidia fsiino-nvidia requested a review from a team as a code owner December 4, 2025 03:47
@fsiino-nvidia fsiino-nvidia changed the title Fsiino/Hf download utils Add Huggingface support for data preparation Dec 4, 2025
Signed-off-by: Frankie Siino <fsiino@nvidia.com>
Signed-off-by: Frankie Siino <fsiino@nvidia.com>
Signed-off-by: Frankie Siino <fsiino@nvidia.com>
Signed-off-by: Frankie Siino <fsiino@nvidia.com>
Signed-off-by: Frankie Siino <fsiino@nvidia.com>
jsonl_fpath: resources_servers/mini_swe_agent/data/train.jsonl
dataset_url: https://huggingface.co/datasets/princeton-nlp/SWE-bench_Verified
huggingface_identifier:
  repo_id: princeton-nlp/SWE-bench_Verified
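
Since `dataset_url` and `repo_id` in this fragment describe the same dataset, a small check can catch the two drifting apart when one is edited. The sketch below is only an illustration: the helper names are hypothetical, and it assumes dataset URLs of the form `https://huggingface.co/datasets/<namespace>/<name>`, so dataset IDs without a namespace are not handled.

```python
from typing import Optional

HF_DATASETS_PREFIX = "https://huggingface.co/datasets/"


def repo_id_from_url(url: str) -> Optional[str]:
    """Extract '<namespace>/<name>' from a Hugging Face dataset URL."""
    if not url.startswith(HF_DATASETS_PREFIX):
        return None
    parts = [p for p in url[len(HF_DATASETS_PREFIX):].split("/") if p]
    if len(parts) < 2:
        # Assumption: only namespaced dataset IDs are supported here.
        return None
    return f"{parts[0]}/{parts[1]}"


def url_matches_repo(dataset_url: str, repo_id: str) -> bool:
    """Return True when the URL points at the given repo_id."""
    return repo_id_from_url(dataset_url) == repo_id
```

For the values above, `url_matches_repo(dataset_url, repo_id)` holds; it would fail if only one of the two fields were switched to a different dataset.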
Contributor

Is this right? Can we double-check with @sdevare-nv? If someone downloads the HF data from here, is it directly usable in Gym?

Contributor Author

@sdevare-nv I don't recall where I got that link. Should this be https://huggingface.co/datasets/SWE-Gym/SWE-Gym instead?

Contributor

@fsiino-nvidia @bxyu-nvidia Yes, it should be SWE-Gym, but I have filtered out the samples that we don't have Singularity images for. So will we need to upload a new version of the dataset?
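
To make the filtering step concrete, here is a rough sketch of dropping samples that have no matching image. It is not the script actually used: the function name is hypothetical, the `instance_id` key is an assumption about the record schema, and how the set of available IDs is built (for example, from image filenames) is left to the caller.

```python
from typing import Iterable, List, Set


def filter_by_available_images(
    records: Iterable[dict],
    available_ids: Set[str],
    id_key: str = "instance_id",  # assumed field name
) -> List[dict]:
    """Keep only records whose ID has a corresponding image."""
    return [r for r in records if r.get(id_key) in available_ids]
```

Because the result is a strict subset of the upstream dataset, downloading the upstream data directly would not reproduce it, which is the concern behind the question about uploading a new version.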

Contributor Author

@fsiino-nvidia fsiino-nvidia Dec 5, 2025

@sdevare-nv Okay, I've fixed the link for the 'train' dataset. Also, per the agent's README, I set the former link as the 'validation' dataset: e7f4478#diff-5c310e3413ad995f1bbb76a622a51225ac1bffa5b36b1bdf65b5cfad0a562e98R27

I'm not sure about your question, so I'd defer to @bxyu-nvidia here.

@bxyu-nvidia
Contributor

@shashank3959 Can you double-check whether this new flow works for training end to end, please?

Signed-off-by: Frankie Siino <fsiino@nvidia.com>
Signed-off-by: Frankie Siino <fsiino@nvidia.com>
Signed-off-by: Frankie Siino <fsiino@nvidia.com>
Signed-off-by: Frankie Siino <fsiino@nvidia.com>
Signed-off-by: Frankie Siino <fsiino@nvidia.com>
Signed-off-by: Frankie Siino <fsiino@nvidia.com>
…d-utils

Signed-off-by: Frankie Siino <fsiino@nvidia.com>
Signed-off-by: Frankie Siino <fsiino@nvidia.com>
Signed-off-by: Frankie Siino <fsiino@nvidia.com>
…d-utils

Signed-off-by: Frankie Siino <fsiino@nvidia.com>

# Conflicts:
#	nemo_gym/config_types.py
…d-utils

Signed-off-by: Frankie Siino <fsiino@nvidia.com>
@bxyu-nvidia
Contributor

Changes merged via #481.

@fsiino-nvidia fsiino-nvidia deleted the fsiino/416-hf-download-utils branch February 23, 2026 16:45
Development

Successfully merging this pull request may close these issues.

Dataset Download Utils requires Gitlab integration

3 participants