Add Huggingface support for data preparation #419
fsiino-nvidia wants to merge 23 commits into main
Signed-off-by: Frankie Siino <fsiino@nvidia.com>
    jsonl_fpath: resources_servers/mini_swe_agent/data/train.jsonl
    dataset_url: https://huggingface.co/datasets/princeton-nlp/SWE-bench_Verified
    huggingface_identifier:
      repo_id: princeton-nlp/SWE-bench_Verified
Is this right? Can we double-check with @sdevare-nv? If someone downloads the HF data from here, is it directly usable in Gym?
@sdevare-nv I don't recall where I got that link. Should this be https://huggingface.co/datasets/SWE-Gym/SWE-Gym instead?
@fsiino-nvidia @bxyu-nvidia Yes, it should be SWE-Gym, but I have filtered out the samples that we don't have Singularity images for. So will we need to upload a new version of the dataset?
@sdevare-nv Okay, I've fixed the link for the 'train' dataset. Also, per the README for the agent, I populated the former link as 'validation': e7f4478#diff-5c310e3413ad995f1bbb76a622a51225ac1bffa5b36b1bdf65b5cfad0a562e98R27
Not sure about your question; I'd defer to @bxyu-nvidia here.
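For reference, the mismatch discussed in this thread (a `dataset_url` and a `repo_id` that may not point at the same dataset) could be caught with a small check. This is only a sketch: the helper names `repo_id_from_url` and `is_consistent` are hypothetical and not part of this PR, and it assumes repo IDs of the form `namespace/name` hosted under `https://huggingface.co/datasets/`.

```python
from urllib.parse import urlparse

HF_HOST = "huggingface.co"
HF_DATASETS_PREFIX = "/datasets/"


def repo_id_from_url(dataset_url: str) -> str:
    """Extract 'namespace/name' from a Hugging Face dataset URL (hypothetical helper)."""
    parsed = urlparse(dataset_url)
    if parsed.netloc != HF_HOST or not parsed.path.startswith(HF_DATASETS_PREFIX):
        raise ValueError(f"Not a Hugging Face dataset URL: {dataset_url}")
    # Keep only the first two path segments after /datasets/ (assumed namespace/name).
    parts = parsed.path[len(HF_DATASETS_PREFIX):].strip("/").split("/")
    return "/".join(parts[:2])


def is_consistent(entry: dict) -> bool:
    """Return True when dataset_url and huggingface_identifier.repo_id agree."""
    repo_id = entry["huggingface_identifier"]["repo_id"]
    return repo_id_from_url(entry["dataset_url"]) == repo_id


# The entry as it appears in the diff under review.
entry = {
    "jsonl_fpath": "resources_servers/mini_swe_agent/data/train.jsonl",
    "dataset_url": "https://huggingface.co/datasets/princeton-nlp/SWE-bench_Verified",
    "huggingface_identifier": {"repo_id": "princeton-nlp/SWE-bench_Verified"},
}
print(is_consistent(entry))
```

A check like this only confirms that the two fields agree with each other; whether the dataset itself is the right one for the 'train' split is still a question for review.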
@shashank3959 Can you double-check that this new flow works for training e2e, please?
…d-utils Signed-off-by: Frankie Siino <fsiino@nvidia.com>
…d-utils Signed-off-by: Frankie Siino <fsiino@nvidia.com>
# Conflicts:
#   nemo_gym/config_types.py
Changes merged via #481.
Integrates Hugging Face dataset download for ng_prepare_data
Refine ng_download_data_from_hf to handle non-JSONL files (like parquet)
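As a rough illustration of what handling non-JSONL files can involve, the sketch below normalizes a downloaded `.jsonl` or `.parquet` file into JSONL. It is not the implementation from this PR: `iter_rows` and `write_jsonl` are hypothetical names, and the parquet branch assumes `pandas` with a parquet engine such as `pyarrow` is installed.

```python
import json
from pathlib import Path
from typing import Any, Dict, Iterator


def _iter_jsonl(path: Path) -> Iterator[Dict[str, Any]]:
    with path.open("r", encoding="utf-8") as f:
        for line in f:
            if line.strip():  # skip blank lines
                yield json.loads(line)


def _iter_parquet(path: Path) -> Iterator[Dict[str, Any]]:
    # Imported lazily so JSONL-only use does not require pandas/pyarrow.
    import pandas as pd

    df = pd.read_parquet(path)
    # Round-trip through pandas' JSON writer so numpy values become plain JSON types.
    for line in df.to_json(orient="records", lines=True).split("\n"):
        if line:
            yield json.loads(line)


def iter_rows(path: Path) -> Iterator[Dict[str, Any]]:
    """Return an iterator of records, dispatching on the file extension."""
    suffix = path.suffix.lower()
    if suffix == ".jsonl":
        return _iter_jsonl(path)
    if suffix == ".parquet":
        return _iter_parquet(path)
    raise ValueError(f"Unsupported dataset file type: {path.name}")


def write_jsonl(src: Path, dst: Path) -> int:
    """Write every record from src to dst as JSONL; return the record count."""
    rows = iter_rows(src)  # raises before dst is created if src is unsupported
    dst.parent.mkdir(parents=True, exist_ok=True)
    count = 0
    with dst.open("w", encoding="utf-8") as out:
        for row in rows:
            out.write(json.dumps(row, ensure_ascii=False) + "\n")
            count += 1
    return count
```

Dispatching on the extension keeps the JSONL path dependency-free, and validating the source before opening the destination avoids leaving an empty output file behind when the input type is unsupported.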