Datasets crashing runs due to KeyError #6124

conceptofmind · 2023-08-05T17:48:56Z

Describe the bug

Hi all,

I have been running into a pretty persistent issue recently when trying to load datasets.

    from datasets import load_dataset

    train_dataset = load_dataset(
        'llama-2-7b-tokenized',
        split='train'
    )

I receive a KeyError which crashes the runs.

Traceback (most recent call last):
    main()

    train_dataset = load_dataset(
                    ^^^^^^^^^^^^^
    builder_instance = load_dataset_builder(
                       ^^^^^^^^^^^^^^^^^^^^^
    dataset_module = dataset_module_factory(
                     ^^^^^^^^^^^^^^^^^^^^^^^
    raise e1 from None

    ).get_module()
      ^^^^^^^^^^^^
    else get_data_patterns(base_path, download_config=self.download_config)
         ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
    return _get_data_files_patterns(resolver)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
    data_files = pattern_resolver(pattern)
                 ^^^^^^^^^^^^^^^^^^^^^^^^^
    fs, _, _ = get_fs_token_paths(pattern, storage_options=storage_options)
               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
    paths = [f for f in sorted(fs.glob(paths)) if not fs.isdir(f)]
                               ^^^^^^^^^^^^^^

    allpaths = self.find(root, maxdepth=depth, withdirs=True, detail=True, **kwargs)
               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
    for _, dirs, files in self.walk(path, maxdepth, detail=True, **kwargs):

    listing = self.ls(path, detail=True, **kwargs)
              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
    "last_modified": parse_datetime(tree_item["lastCommit"]["date"]),
                                    ~~~~~~~~~^^^^^^^^^^^^^^
KeyError: 'lastCommit'

Any help would be greatly appreciated.

Thank you,

Enrico
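
The last frame of the traceback above indexes `tree_item["lastCommit"]["date"]` directly, so any listing entry that lacks a `lastCommit` field raises the `KeyError`. The following is only a minimal sketch of a tolerant lookup, assuming tree items are plain dicts as the traceback suggests. The helper name `safe_last_modified` and the sample items are hypothetical and are not part of `datasets` or `huggingface_hub`.

```python
from datetime import datetime
from typing import Optional


def safe_last_modified(tree_item: dict) -> Optional[datetime]:
    """Return the last-commit date of a tree item, or None if it is absent."""
    last_commit = tree_item.get("lastCommit")
    if not last_commit or "date" not in last_commit:
        # Hypothetical fallback: report "unknown" instead of raising KeyError.
        return None
    # Normalise a trailing "Z" so fromisoformat also works before Python 3.11.
    return datetime.fromisoformat(last_commit["date"].replace("Z", "+00:00"))


# Made-up sample items, shaped like the dict accessed in the traceback.
with_commit = {"path": "data/train.parquet", "lastCommit": {"date": "2023-08-05T17:48:56Z"}}
without_commit = {"path": "data/train.parquet"}

print(safe_last_modified(with_commit))
print(safe_last_modified(without_commit))
```

This only illustrates why the lookup fails. It does not address why the Hub response omits the field in the first place.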

Steps to reproduce the bug

Load the dataset from the Hugging Face Hub.

    from datasets import load_dataset

    train_dataset = load_dataset(
        'llama-2-7b-tokenized',
        split='train'
    )

Expected behavior

Loads the dataset.

Environment info

datasets-2.14.3
CUDA 11.8
Python 3.11

erfanzar · 2023-08-15T18:34:08Z

I once had the same error and was able to fix it by pushing a dummy commit to my Hugging Face dataset repo.

mariosasko · 2023-08-17T16:32:57Z

Hi! We need a reproducer to fix this. Can you provide a link to the dataset (if it's public)?

conceptofmind · 2023-08-20T17:33:03Z

> Hi! We need a reproducer to fix this. Can you provide a link to the dataset (if it's public)?

Hi Mario,

Unfortunately, the dataset in question is currently private until the model is trained and released.

This is happening not with just one dataset but with numerous private datasets hosted on the Hub.

I am only loading the dataset and doing nothing else currently. It seems to happen completely sporadically.

Thank you,

Enrico

rs9000 · 2023-10-12T09:03:37Z

Hi,

I have the same error in the dataset viewer with my dataset
https://huggingface.co/datasets/elsaEU/ELSA10M_track1

Has anyone solved this issue?

Edit: After a dummy commit, the error changed to ConfigNamesError.

mariosasko · 2023-10-12T14:03:10Z

@rs9000 The problem seems to be the (large) number of commits, as explained in https://huggingface.co/docs/hub/repositories-recommendations. This can be fixed by running:

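    import huggingface_hub

    # The target is a dataset repo, so repo_type must be set (the default is "model").
    huggingface_hub.super_squash_history(repo_id="elsaEU/ELSA10M_track1", repo_type="dataset")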

The issue stems from push_to_hub creating one commit per shard - #6269 should fix this issue (will create one commit per 50 uploaded shards by default). The linked PR will be included in the next datasets release.

cc @lhoestq @severo for visibility
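
To illustrate the batching idea described above (one commit per 50 uploaded shards instead of one commit per shard), here is a minimal sketch. It is not the implementation from #6269: the function name `batch_shards`, the parameter `max_per_commit`, and the shard file names are assumptions made for illustration.

```python
from typing import Iterator, List, Sequence


def batch_shards(shards: Sequence[str], max_per_commit: int = 50) -> Iterator[List[str]]:
    """Yield groups of shard names, each group meant to go into a single commit."""
    if max_per_commit < 1:
        raise ValueError("max_per_commit must be at least 1")
    for start in range(0, len(shards), max_per_commit):
        yield list(shards[start:start + max_per_commit])


# Hypothetical shard names: 120 shards would need 3 commits rather than 120.
shard_names = [f"train-{i:05d}.parquet" for i in range(120)]
commits = list(batch_shards(shard_names))

for index, group in enumerate(commits):
    print(f"commit {index}: {len(group)} shards")
```

Grouping uploads this way keeps the commit count growing with the number of batches rather than the number of shards, which is the property the comment above relies on.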

rs9000 · 2023-10-12T14:44:28Z

Thank you @mariosasko, it works.

mariosasko · 2023-11-30T16:28:57Z

#6269 has been merged, so I'm closing this issue

mariosasko closed this as completed Nov 30, 2023
