Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[Audio] Path of Common Voice cannot be used for audio loading anymore #3663

Closed
patrickvonplaten opened this issue Feb 1, 2022 · 19 comments
Closed
Assignees
Labels
bug Something isn't working

Comments

@patrickvonplaten
Copy link
Contributor

Describe the bug

Steps to reproduce the bug

from datasets import load_dataset
from torchaudio import load

ds = load_dataset("common_voice", "ab", split="train")

# both of the following commands fail at the moment
load(ds[0]["audio"]["path"])
load(ds[0]["path"])

Expected results

The path should be the complete absolute path to the downloaded audio file not some relative path.

Actual results

~/hugging_face/venv_3.9/lib/python3.9/site-packages/torchaudio/backend/sox_io_backend.py in load(filepath, frame_offset, num_frames, normalize, channels_first, format)
    150                 filepath, frame_offset, num_frames, normalize, channels_first, format)
    151         filepath = os.fspath(filepath)
--> 152     return torch.ops.torchaudio.sox_io_load_audio_file(
    153         filepath, frame_offset, num_frames, normalize, channels_first, format)
    154

RuntimeError: Error loading audio file: failed to open file cv-corpus-6.1-2020-12-11/ab/clips/common_voice_ab_19904194.mp3

Environment info

  • datasets version: 1.18.3.dev0
  • Platform: Linux-5.4.0-96-generic-x86_64-with-glibc2.27
  • Python version: 3.9.1
  • PyArrow version: 3.0.0
@patrickvonplaten patrickvonplaten added the bug Something isn't working label Feb 1, 2022
@patrickvonplaten
Copy link
Contributor Author

patrickvonplaten commented Feb 1, 2022

Having talked to @lhoestq, I see that this feature is no longer supported.

I really don't think this was a good idea. It is a major breaking change and one for which we don't even have a working solution at the moment, which is bad for PyTorch as we don't want to force people to have datasets decode audio files automatically, but really bad for Tensorflow and Flax where we currently cannot even use datasets to load .mp3 files - e.g. common_voice doesn't work anymore in a TF training script. Note this worked perfectly fine before making the change (think it was done here no?)

IMO, it's really important to think about a solution here and I strongly favor to make a difference here between loading a dataset in streaming mode and in non-streaming mode, so that in non-streaming mode the actual downloaded file is displayed. It's really crucial for people to be able to analyse the original files IMO when the dataset is not downloaded in streaming mode.

There are the following reasons why it is paramount to have access to the original audio file in my opinion (in non-streaming mode):

  • There are a wide variety of different libraries to load audio data with varying support on different platforms. For me it was quite clear that there is simply to single good library to load audio files for all platforms - so we have to leave the option to the user to decide which loading to use.
  • We had support for audio datasets a long time before streaming audio was possible. There were quite some versions where we advertised everywhere to load the audio from the path name (and there are many places where we still do even though it's not possible anymore). To give some examples:
  • Similar to this we also shouldn't assume that there is only one resampling method for Audio. I think it's good to have one offered automatically by datasets, but we have to leave the user the freedom to choose her/his own resampling as well. Resampling can take very different filtering windows and other parameters which are currently somewhat hardcoded in datasets, which users might very well want to change.

=> IMO, it's a very big priority to again have the correct absolute path in non-streaming mode. The other solution of providing a path-like object derived from the bytes stocked in the .array file is not nearly as user-friendly, but better than nothing.

@patrickvonplaten patrickvonplaten changed the title [Audio] Path of Common Voice cannot be used for audio loading [Audio] Path of Common Voice cannot be used for audio loading anymore Feb 1, 2022
@cahya-wirawan
Copy link
Contributor

Agree that we need to have access to the original sound files. Few days ago I was looking for these original files because I suspected there is bug in the audio resampling (confirmed in #3662) and I want to do my own resampling to workaround the bug, which is now not possible anymore due to the unavailability of the original files.

@mariosasko
Copy link
Collaborator

@patrickvonplaten

The other solution of providing a path-like object derived from the bytes stocked in the .array file is not nearly as user-friendly, but better than nothing

Just to clarify, here you describe the approach that uses the Audio.decode attribute to access the underlying bytes?

The official example currently doesn't work and we don't even have a workaround for it for MP3 files at the moment

I'd assume this is because we use sox_io as a backend for decoding. However, soon we should be able to use soundfile, which supports path-like objects, for MP3 (#3667 (comment)).

Your concern is reasonable, but there are situations where we can only serve bytes (see #3685 for instance). IMO it makes sense to fix the affected datasets for now, but I don't think we should care too much whether we rely on local paths or bytes after soundfile adds support for MP3 as long as our examples work (shouldn't be too hard to update the map_to_array functions) and we properly document how to access the underlying path/bytes for custom decoding (via ds.cast_column("audio", Audio(decode=False))).

@lhoestq
Copy link
Member

lhoestq commented Feb 7, 2022

Related to this discussion: in #3664 (comment) I propose how we could change iter_archive to work for streaming and also return local paths (as it used too !). I'd love your opinions on this

@patrickvonplaten
Copy link
Contributor Author

patrickvonplaten commented Feb 8, 2022

@patrickvonplaten

The other solution of providing a path-like object derived from the bytes stocked in the .array file is not nearly as user-friendly, but better than nothing

Just to clarify, here you describe the approach that uses the Audio.decode attribute to access the underlying bytes?

Yes!

The official example currently doesn't work and we don't even have a workaround for it for MP3 files at the moment

I'd assume this is because we use sox_io as a backend for decoding. However, soon we should be able to use soundfile, which supports path-like objects, for MP3 (#3667 (comment)).
Your concern is reasonable, but there are situations where we can only serve bytes (see #3685 for instance). IMO it makes sense to fix the affected datasets for now, but I don't think we should care too much whether we rely on local paths or bytes after soundfile adds support for MP3 as long as our examples work (shouldn't be too hard to update the map_to_array functions) and we properly document how to access the underlying path/bytes for custom decoding (via ds.cast_column("audio", Audio(decode=False))).

Yes this might be, but I highly doubt that soundfile is the go-to library for audio then. @anton-l and I have tried out a bunch of different audio loading libraries (soundfile, librosa, torchaudio, pure ffmpeg, audioread, ...). One thing that was pretty clear to me is that there is just no "de-facto standard" library and they all have pros and cons. None of the libraries really supports "batch"-ed audio loading. Some depend on PyTorch. torchaudio is 100x faster (really!) than librosa's fallback on MP3. torchaudio often has problems with multi-proessing, ... Also we should keep in mind that resampling is similarly not as simple as reading a text file. It's a pretty complex signal processing transform and people very well might want to use special filters, etc...at the moment we just hard-code torchaudio's or librosa's default filter when doing resampling.

=> All this to say that we should definitely care about whether we rely on local paths or bytes IMO. We don't want to loose all users that are forced to use datasets decoding or resampling or have to built a very much not intuitive way of loading bytes into a numpy array. It's much more intuitive to be able to inspect a local file. I feel pretty strongly about this and am happy to also jump on a call. Keeping libraries flexible and lean as well as exposing internals is very important IMO (this philosophy has worked quite well so far with Transformers).

@mariosasko
Copy link
Collaborator

Thanks a lot for the very detailed explanation. Now everything makes much more sense.

@lhoestq
Copy link
Member

lhoestq commented Feb 22, 2022

From #3736 the Common Voice dataset now gives access to the local audio files as before

@albertz
Copy link

albertz commented Apr 21, 2022

I understand the argument that it is bad to have a breaking change. How to deal with the introduction of breaking changes is a topic of its own and not sure how you want to deal with that (or is the policy this is never allowed, and there must be a load_dataset_v2 or so if you really want to introduce a breaking change?).

Regardless of whether it is a breaking change, however, I don't see the other arguments.

but really bad for Tensorflow and Flax where we currently cannot even use datasets to load .mp3 files

I don't exactly understand this. Why not?

Why does the HF dataset on-the-fly decoding mechanism not work? Why is it anyway specific to PyTorch or TensorFlow? Isn't this independent?

But even if you just provide the raw bytes to TF, on TF you could just use sth like tfio.audio.decode_mp3 or tf.audio.decode_ogg or tfio.audio.decode_flac?

There are the following reasons why it is paramount to have access to the original audio file in my opinion ...

I don't really understand the arguments (despite that it maybe breaks existing code). You anyway have the original audio files but it is just embedded in the dataset? I don't really know about any library which cannot also load the audio from memory (i.e. from the dataset).

Btw, on librosa being slow for decoding audio files, I saw that as well, so we have this comment RETURNN:

Don't use librosa.load which internally uses audioread which would use Gstreamer as a backend which has multiple issues:
beetbox/audioread#62
beetbox/audioread#63
Instead, use PySoundFile (soundfile), which is also faster. See here for discussions:
beetbox/audioread#64
librosa/librosa#681

Resampling is also a separate aspect, which is also less straightforward and with different compromises between speed and quality. So there the different tradeoffs and different implementations can make a difference.

However, I don't see how this is related to the question whether there should be the raw bytes inside the dataset or as separate local files.

@patrickvonplaten
Copy link
Contributor Author

Thanks for your comments here @albertz - cool to get your input!

Answering a bit here between the lines:

I understand the argument that it is bad to have a breaking change. How to deal with the introduction of breaking changes is a topic of its own and not sure how you want to deal with that (or is the policy this is never allowed, and there must be a load_dataset_v2 or so if you really want to introduce a breaking change?).

Regardless of whether it is a breaking change, however, I don't see the other arguments.

but really bad for Tensorflow and Flax where we currently cannot even use datasets to load .mp3 files

I don't exactly understand this. Why not?

Why does the HF dataset on-the-fly decoding mechanism not work? Why is it anyway specific to PyTorch or TensorFlow? Isn't this independent?

The problem with decoding on the fly is that we currently rely on torchaudio for this now which relies on torch which is not necessarily something people would like to install when using tensorflow or flax. Therefore we cannot just rely on people using the decoding on the fly method. We just didn't find a library that is ML framework independent and fast enough for all formats. torchaudio is currently in our opinion by far the best here.

So for TF and Flax it's important that users can load audio files or bytes they way the want to - this might become less important if we find (or make) a good library with few dependencies that is fast for all kinds of platforms / use cases.

Now the question is whether it's better to store audio data as a path to a file or as raw bytes I guess.
My main arguments for storing the audio data as a path to a file is pretty much all about users experience - I don't really expect our users to understand the inner workings of datasets:

    1. It's not straightforward to know which function to use to decode it - not all load_audio(...) or read_audio(...) work on raw bytes. E.g. Looking at https://pytorch.org/audio/stable/torchaudio.html?highlight=load#torchaudio.load one would not see directly how to load raw bytes . There are also some functions of other libraries which only work on files which would require the user to save the bytes as a file first before being able to load it.
    1. It's difficult to see which format the bytes are coming from (mp3, ogg, ...) - guess this could be remedied by adding the format to each sample though
    1. It is a bit scary IMO to see raw bytes for users. Overall, I think it's better to leave the data in it's raw form as this way it's much easier for people to play around with the audio files, less need to read docs because people don't worry about what happened to the audio files (are the bytes already resampled?)

But the argument that the audio should be loadable directly from memory is good - haven't thought about this too much.
I guess it's still very much possible for the user to do this:

def save_as_bytes:
      batch["bytes"] = read_in_bytes_from_file(batch["file"])\
      os.remove(batch["file"])

ds = ds.map(save_as_bytes)

ds.save_to_disk(...)

Guess the question is more a bit about what should be the default case?

@albertz
Copy link

albertz commented Apr 21, 2022

The problem with decoding on the fly is that we currently rely on torchaudio for this now which relies on torch which is not necessarily something people would like to install when using tensorflow or flax. Therefore we cannot just rely on people using the decoding on the fly method. We just didn't find a library that is ML framework independent and fast enough for all formats. torchaudio is currently in our opinion by far the best here.

But how is this relevant for this issue here? I thought this issue here is about having the (correct) path in the dataset or having raw bytes in the dataset.

How did TF users use it at all then? Or they just do not use on-the-fly decoding? I did not even notice this problem (maybe because I had torchaudio installed). But what do they use instead?

But as I outlined before, they could just use tfio.audio.decode_flac and co, where it would be more natural if you already provide the raw bytes.

Looking at https://pytorch.org/audio/stable/torchaudio.html?highlight=load#torchaudio.load one would not see directly how to load raw bytes

I was not really familiar with torchaudio. It seems that they really don't provide an easy/direct API to operate on raw bytes. Which is very strange and unfortunate because as far as I can see, all the underlying backend libraries (e.g. soundfile) easily allow that. So I would say that this is the fault of torchaudio then. But despite, if you anyway use torchaudio with soundfile backend, why not just use soundfile directly. It's very simple to use and crossplatform.

But ok, now we are just discussing how to handle the on-the-fly decoding. I still think this is a separate issue and having raw bytes in the dataset instead of local files should just be fine as well.

It is a bit scary IMO to see raw bytes for users.

I think nobody who writes code is scared by seeing the raw bytes content of a binary file. :)

I guess it's still very much possible for the user to do this:

def save_as_bytes:
      batch["bytes"] = read_in_bytes_from_file(batch["file"])\
      os.remove(batch["file"])

ds = ds.map(save_as_bytes)

ds.save_to_disk(...)

In #4184 (comment), you said/proposed that this map is not needed anymore and save_to_disk could do it automatically (maybe via some option)?

Guess the question is more a bit about what should be the default case?

Yea this is up to you. I'm happy as long as we can get it the way we want easily and this is a well supported use case. :)

@patrickvonplaten
Copy link
Contributor Author

In #4184 (comment), you said/proposed that this map is not needed anymore and save_to_disk could do it automatically (maybe via some option)?

Yes! Should be super easy now see discussion here: rwth-i6/i6_core#257 (comment)

Thanks for the super useful input :-)

@DCNemesis
Copy link

Despite the comments that this has been fixed, I am finding the exact same problem is occurring again (with datasets version 2.3.2)

@DCNemesis
Copy link

Despite the comments that this has been fixed, I am finding the exact same problem is occurring again (with datasets version 2.3.2)

It appears downgrading to torchaudio 0.11.0 fixed this problem.

@patrickvonplaten
Copy link
Contributor Author

@DCNemesis, sorry which problem exactly is occuring again? Also cc @lhoestq @polinaeterna here

@DCNemesis
Copy link

DCNemesis commented Aug 23, 2022

@patrickvonplaten @lhoestq @polinaeterna I was unable to load audio from Common Voice using 🤗 with the current version of torchaudio, but downgrading to torchaudio 0.11.0 fixed it. This is probably more of a torch problem than a Hugging Face problem.

@polinaeterna
Copy link
Contributor

polinaeterna commented Aug 24, 2022

@DCNemesis that's interesting, could you please share the error message if you still can access it?

@DCNemesis
Copy link

@polinaeterna I believe it is the same exact error as above. It occurs on other .mp3 sources as well, but the problem is with torchaudio > 0.11.0. I've created a short colab notebook that reproduces the error, and the fix here: https://colab.research.google.com/drive/18wsuwdHwBPN3JkcnhEtk8MUYqF9swuWZ?usp=sharing

@albertvillanova
Copy link
Member

albertvillanova commented Sep 21, 2022

Hi @DCNemesis,

Your issue was slightly different from the original one in this issue page. Yours seems related to a change in the backend used by torchaudio (ffmpeg instead of sox). Refer to the issue page here:

Normally, it should be circumvented with the patch made by @polinaeterna in:

@albertvillanova
Copy link
Member

I think the original issue reported here was already fixed by:

Otherwise, feel free to reopen.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working
Projects
None yet
Development

Successfully merging a pull request may close this issue.

9 participants