
Conversation

kindaQ (Contributor) commented Apr 23, 2023

add load downloaded huggingface dataset, in case of machines cannot connect to huggingface.co

kindaQ (Contributor, Author) commented Apr 23, 2023

@microsoft-github-policy-service agree

@conglongli conglongli self-assigned this Apr 23, 2023
conglongli (Contributor) commented
@kindaQ HF datasets has caching (https://huggingface.co/docs/datasets/cache) so that if you copy the downloaded data into a machine and properly set HF_DATASETS_CACHE, that machine can directly load the datasets without downloading. If you agree caching will solve your issue, we would prefer not to merge this PR and add another layer of complexity. If caching does not solve your case, please explain. Thank you.
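For reference, the caching workflow conglongli describes could be sketched like this. The cache path is a hypothetical example, and the environment variable must be set before `datasets` is imported, since the library reads it at import time:

```python
import os

# Hypothetical path to a cache directory that was populated on a machine
# with internet access and then copied over. Set BEFORE importing `datasets`.
os.environ["HF_DATASETS_CACHE"] = "/data/hf_cache"

# With the cache in place, loading would not need to download (left commented
# here because it requires the `datasets` package and a populated cache):
# from datasets import load_dataset
# ds = load_dataset("Dahoas/rm-static")

print(os.environ["HF_DATASETS_CACHE"])
```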

kindaQ (Contributor, Author) commented Apr 24, 2023

> @kindaQ HF datasets has caching (https://huggingface.co/docs/datasets/cache) so that if you copy the downloaded data into a machine and properly set HF_DATASETS_CACHE, that machine can directly load the datasets without downloading. If you agree caching will solve your issue, we would prefer not to merge this PR and add another layer of complexity. If caching does not solve your case, please explain. Thank you.

I have tried `export HF_DATASETS_CACHE="/path/to/another/directory"` and `dataset = load_dataset('LOADING_SCRIPT', cache_dir="PATH/TO/MY/CACHE/DIR")`,
but I still got `ConnectionError: Couldn't reach 'Dahoas/rm-static' on the Hub (ConnectionError)`.
Here is my environment:
[environment screenshots omitted]

Would you please try it on a machine that cannot connect to huggingface.co?

The `cache_dir` option only changes the directory where the downloaded cache files are saved; it does not skip the downloading step.
Below are the Hugging Face cached files and the user-downloaded files; they are not in the same format.
[file listing screenshots omitted]

conglongli (Contributor) commented
Based on HF's doc https://huggingface.co/docs/datasets/loading#offline, could you try to set HF_DATASETS_OFFLINE to 1 to enable full offline mode?
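Following the linked doc, full offline mode could be sketched as below. The cache path is hypothetical, and both variables must be set before `datasets` is imported; the actual `load_dataset` call is left commented because it needs the package and a populated cache:

```python
import os

# Enable full offline mode per the HF docs linked above; in offline mode,
# `datasets` raises immediately instead of attempting a network request.
os.environ["HF_DATASETS_OFFLINE"] = "1"
os.environ["HF_DATASETS_CACHE"] = "/data/hf_cache"  # hypothetical pre-copied cache

# from datasets import load_dataset
# ds = load_dataset("Dahoas/rm-static")  # served from the local cache, if present

print(os.environ["HF_DATASETS_OFFLINE"])
```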

kindaQ (Contributor, Author) commented Apr 24, 2023

> Based on HF's doc https://huggingface.co/docs/datasets/loading#offline, could you try to set HF_DATASETS_OFFLINE to 1 to enable full offline mode?

It does not work.
[error screenshot omitted]

I have reviewed the code of `load_dataset`; its first parameter only supports the packaged module names below, an `xxx.py` script path, or a directory path. Otherwise it connects to the Hub, so `load_dataset("Dahoas/rm-static")` goes into that branch:

```python
_PACKAGED_DATASETS_MODULES = {
    "csv": (csv.__name__, _hash_python_lines(inspect.getsource(csv).splitlines())),
    "json": (json.__name__, _hash_python_lines(inspect.getsource(json).splitlines())),
    "pandas": (pandas.__name__, _hash_python_lines(inspect.getsource(pandas).splitlines())),
    "parquet": (parquet.__name__, _hash_python_lines(inspect.getsource(parquet).splitlines())),
    "text": (text.__name__, _hash_python_lines(inspect.getsource(text).splitlines())),
    "imagefolder": (imagefolder.__name__, _hash_python_lines(inspect.getsource(imagefolder).splitlines())),
    "audiofolder": (audiofolder.__name__, _hash_python_lines(inspect.getsource(audiofolder).splitlines())),
}
```

So changing `cache_dir` cannot solve my issue.
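Since `"parquet"` is one of the packaged builders above, a machine without Hub access can in principle point `load_dataset` at local parquet files directly, which is essentially what this PR wires up. A sketch, where the directory layout and file names are hypothetical:

```python
from pathlib import Path

# Hypothetical directory holding parquet files fetched on a connected machine.
data_dir = Path("/data/Dahoas/rm-static/data")
data_files = {
    "train": str(data_dir / "train-00000-of-00001.parquet"),
    "test": str(data_dir / "test-00000-of-00001.parquet"),
}

# "parquet" hits the packaged-builder branch shown above, so no Hub connection
# is attempted (left commented because it requires the `datasets` package):
# from datasets import load_dataset
# ds = load_dataset("parquet", data_files=data_files)

print(sorted(data_files))
```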

```python
        return raw_datasets.DahoasFullhhrlhfDataset(output_path, seed,
                                                    local_rank)
    elif dataset_name == "Dahoas/synthetic-instruct-gptj-pairwise":
        dn = None
```
conglongli (Contributor) commented on this hunk:

Please change `dn` to a longer, meaningful name, such as `local_data_dir`.

```diff
 class PromptRawDataset(object):

-    def __init__(self, output_path, seed, local_rank):
+    def __init__(self, output_path, seed, local_rank, dataset_name=None):
```
conglongli (Contributor) commented on this hunk:

`dataset_name` is a confusing variable name; I recommend changing it to `local_data_dir`. The same comment applies to all other classes.
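A minimal sketch of the suggested rename; the class body and attribute handling here are illustrative, not the PR's actual implementation:

```python
class PromptRawDataset:
    # `local_data_dir` replaces the confusing `dataset_name` parameter:
    # it names a local directory of pre-downloaded data, not a Hub dataset id.
    def __init__(self, output_path, seed, local_rank, local_data_dir=None):
        self.output_path = output_path
        self.seed = seed
        self.local_rank = local_rank
        self.local_data_dir = local_data_dir


d = PromptRawDataset("./output", 1234, 0, local_data_dir="/data/rm-static")
print(d.local_data_dir)
```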

```python
        return None


class LocalParquetDataset(PromptRawDataset):
```
conglongli (Contributor) commented on this hunk:

Why do we need this class? `LocalParquetDataset` is never used in your PR; please remove it if it's not necessary.

conglongli (Contributor) commented Apr 24, 2023

@kindaQ Thanks for the clarifications. Now I agree that this PR is needed. I finished my review and left some comments that need your fixes. Please also write a short paragraph of documentation about this feature at https://github.com/microsoft/DeepSpeedExamples/blob/master/applications/DeepSpeed-Chat/README.md#-adding-and-using-your-own-datasets-in-deepspeed-chat, so that other users can actually understand how to use your contribution. One other thing: the formatting test failed; please follow https://www.deepspeed.ai/contributing/ to fix it with pre-commit.

@conglongli conglongli merged commit f4ad1d5 into deepspeedai:master Apr 25, 2023
conglongli (Contributor) commented
@kindaQ this PR had formatting issues. I helped to fix them this time, but next time please make sure to use pre-commit to resolve them: `pre-commit install`, then `pre-commit run --files files_you_modified`.

kindaQ (Contributor, Author) commented Apr 26, 2023

> @kindaQ this PR was having formatting issues. I helped to fix it this time, but next time please make sure to use pre-commit to resolve them: `pre-commit install` then `pre-commit run --files files_you_modified`

Thanks a lot.
I have tried `pre-commit run` in the terminal, but it failed to connect to GitHub.

@conglongli conglongli added the deespeed chat DeepSpeed Chat label Apr 29, 2023
leocnj pushed a commit to leocnj/DeepSpeedExamples that referenced this pull request May 27, 2023
* change load local ./ hf parquet dataset

* change get_raw_dataset load local dir of downloaded huggingface datasets

* pre-commit formatting

---------

Co-authored-by: Conglong Li <[email protected]>
hwchen2017 pushed a commit that referenced this pull request Jun 8, 2025
* change load local ./ hf parquet dataset

* change get_raw_dataset load local dir of downloaded huggingface datasets

* pre-commit formatting

---------

Co-authored-by: Conglong Li <[email protected]>