Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[WIP] Return local paths to Common Voice #3664

Closed
wants to merge 19 commits into from

Conversation

anton-l
Copy link
Member

@anton-l anton-l commented Feb 1, 2022

Fixes #3663

This is a proposed way of returning the old local file-based generator while keeping the new streaming generator intact.

TODO:

Comment on lines 717 to 718
if archive_iterator is not None:
yield from self._generate_examples_streaming(archive_iterator, filepath, path_to_clips)
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The key part: if we get an archive files iterator, then use the new streaming logic, otherwise use the old pre-streaming logic.

@patrickvonplaten
Copy link
Contributor

Cool thanks for giving it a try @anton-l !

Would be very much in favor of having "real" paths to the audio files again for non-streaming use cases. At the same time it would be nice to make the audio data loading script as understandable as possible so that the community can easily add audio datasets in the future by looking at this one as an example. Think if it's clear for a contributor how to add an audio datasets script that works for the standard non-streaming case while it is easy to extend it afterwards to a streaming dataset script, then this would be perfect

@polinaeterna
Copy link
Contributor

polinaeterna commented Feb 7, 2022

@anton-l @patrickvonplaten @lhoestq Is it possible somehow to provide this logic inside the library instead of a loading script so that we don't need to completely rewrite all the scripts for audio datasets and users don't have to care about two different loading approaches in the same script? 🤔

@patrickvonplaten
Copy link
Contributor

@anton-l @patrickvonplaten @lhoestq Is it possible somehow to provide this logic inside the library instead of a loading script so that we don't need to completely rewrite all the scripts for audio datasets and users don't have to care about two different loading approaches in the same script? thinking

Not sure @lhoestq - what do you think?

Now that we've corrected the previous resampling bug, think this one here is of high importance. @lhoestq - what do you think how we should proceed here?

@lhoestq
Copy link
Member

lhoestq commented Feb 7, 2022

@anton-l @patrickvonplaten @lhoestq Is it possible somehow to provide this logic inside the library instead of a loading script so that we don't need to completely rewrite all the scripts for audio datasets and users don't have to care about two different loading approaches in the same script? 🤔

Yes let's do this :)

Maybe we can change the behavior of DownloadManager.iter_archive back to extracting the TAR archive locally, and return an iterable of (local path, file obj). And the StreamingDownloadManager.iter_archive can return an iterable of (relative path inside the archive, file obj) ?

In this case, a dataset would need to have something like this:

for path, f in files:
    yield id_, {"audio": {"path": path, "bytes": f.read() if not is_local_file(path) else None}}

Alternatively, we can allow this if we consider that Audio.encode_example sets the "bytes" field to None automatically if path is a local path:

for path, f in files:
    yield id_, {"audio": {"path": path, "bytes": f.read()}}

Note that in this case the file is read for nothing though (maybe it's not a big deal ?)

Let me know if it sounds good to you and what you'd prefer !

@anton-l
Copy link
Member Author

anton-l commented Feb 7, 2022

@lhoestq I'm very much in favor of your first aproach! With the full paths returned I think we won't even need to mess with os.path.join vs "/".join()" and other local/streaming differences 👍

@mariosasko
Copy link
Collaborator

@lhoestq I also like the idea and favor your first approach to avoid an unnecessary read and make yielding faster.

@patrickvonplaten
Copy link
Contributor

Looks cool - thanks for working on this. I just feel strongly about path being an absolute path that exist and can be inspected in the non-streaming case :-) For streaming=True IMO it's absolutely fine if we only have access to the bytes

@lhoestq
Copy link
Member

lhoestq commented Feb 8, 2022

Hi ! I started implementing this but I noticed that returning an absolute path is breaking for many datasets that do things like

for path, f in files:
    if path.startswith(data_dir):
        ...

so I think I will have to add a parameter to iter_archive like extract_locally=True to avoid the breaking change, does that sound good to you ?

This makes me also think that in streaming mode it could also return a local path too, if we think that writing and deleting temporary files on-the-fly while iterating over the streaming dataset is ok.

@albertvillanova
Copy link
Member

albertvillanova commented Feb 9, 2022

@lhoestq I think it is a good idea to rollback to extracting the archives locally in non-streaming mode, as far as (as you mentioned) we do not store the bytes in the Arrow file for those cases to avoid "doubling" the disk space usage.

On the other hand, I don't like:

  • neither the possibility to avoid extracting locally in non-streaming: the behavior should be consistent; thus we always extract in non-streaming
    • which could be the criterium to decide whether an archive should or should not be extracted? Just because I want to make a condition on path.startswith?
  • nor the option to download/delete temporary files in streaming (see discussion here: Fix streaming for servers not supporting HTTP range requests #3689 (comment))

Unfortunately, in order to fix the datasets that are breaking after the rollback, I would suggest fixing their scripts so that the paths are handled more robustly (considering that they can be absolute or relative).

@anton-l
Copy link
Member Author

anton-l commented Feb 9, 2022

I agree with Albert, fixing all of the audio datasets isn't too big of a deal (yet). I can help with those if needed :)

@lhoestq
Copy link
Member

lhoestq commented Feb 9, 2022

Ok cool ! I'm completely rolling it back then

@lhoestq
Copy link
Member

lhoestq commented Feb 9, 2022

Alright I did the rollback and now you can get local paths :)
Feel free to try it out and let me know if it's good for you

@lhoestq
Copy link
Member

lhoestq commented Feb 10, 2022

I'll fix the CI tomorrow x)

@lhoestq
Copy link
Member

lhoestq commented Feb 10, 2022

Ok according to the CI there around 60+ datasets to fix

@polinaeterna
Copy link
Contributor

polinaeterna commented Feb 11, 2022

fixing all of the audio datasets isn't too big of a deal (yet). I can help with those if needed :)

I can help with them too :)

@lhoestq
Copy link
Member

lhoestq commented Feb 11, 2022

Here is the full list to keep track of things:

  • air_dialogue
  • id_nergrit_corpus
  • id_newspapers_2018
  • imdb
  • indic_glue
  • inquisitive_qg
  • klue
  • lama
  • lex_glue
  • lm1b
  • amazon_polarity
  • mac_morpho
  • math_dataset
  • md_gender_bias
  • mdd
  • assin
  • atomic
  • babi_qa
  • mlqa
  • mocha
  • blended_skill_talk
  • capes
  • cbt
  • newsgroup
  • cifar10
  • cifar100
  • norec
  • ohsumed
  • code_x_glue_cc_clone_detection_poj104
  • openslr
  • orange_sum
  • paws
  • paws-x
  • cppe-5
  • polyglot_ner
  • dbrd
  • empathetic_dialogues
  • eraser_multi_rc
  • flores
  • flue
  • food101
  • py_ast
  • qasc
  • qasper
  • race
  • reuters21578
  • ropes
  • rotten_tomatoes
  • vivos
  • wi_locness
  • wiki_movies
  • wikiann
  • wmt20_mlqe_task1
  • wmt20_mlqe_task2
  • wmt20_mlqe_task3
  • scicite
  • xsum
  • scielo
  • scifact
  • setimes
  • social_bias_frames
  • sogou_news
  • speech_commands
  • ted_hrlr
  • ted_multi
  • tlc
  • turku_ner_corpus

@lhoestq
Copy link
Member

lhoestq commented Feb 14, 2022

I'll do my best to fix as many as possible tomorrow :)

@polinaeterna
Copy link
Contributor

the audio datasets are fixed if I didn't forget anything :)

btw what are we gonna do with the community ones that would be broken after the fix?

@lhoestq
Copy link
Member

lhoestq commented Feb 22, 2022

Closing in favor of #3736

@lhoestq lhoestq closed this Feb 22, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

[Audio] Path of Common Voice cannot be used for audio loading anymore
6 participants