New ffmpeg backend changes samples when saving WAVE #3281

Open
pzelasko opened this issue Apr 27, 2023 · 2 comments

@pzelasko

🐛 Describe the bug

The snippet below reproduces the error. Passing backend="sox" or backend="soundfile" to torchaudio.save makes the issue go away.

import os
from tempfile import NamedTemporaryFile

os.environ["TORCHAUDIO_USE_BACKEND_DISPATCHER"] = "1"

import torch
import torchaudio

torch.manual_seed(0)

noise = torch.rand(1, 32000, dtype=torch.float32)

with NamedTemporaryFile(suffix=".wav") as f:
    torchaudio.save(f.name, noise, sample_rate=16000)
    f.flush()
    f.seek(0)
    noise_load, _ = torchaudio.load(f)
    
torch.testing.assert_close(noise_load, noise)

Output:

Traceback (most recent call last):
  File "/Users/pzelasko/Library/Application Support/JetBrains/PyCharm2023.1/scratches/scratch_12.py", line 19, in <module>
    torch.testing.assert_close(noise_load, noise)
  File "/Users/pzelasko/miniconda3/envs/lhotse-py3.10/lib/python3.10/site-packages/torch/testing/_comparison.py", line 1511, in assert_close
    raise error_metas[0].to_error(msg)
AssertionError: Tensor-likes are not close!

Mismatched elements: 9760 / 32000 (30.5%)
Greatest absolute difference: 1.52587890625e-05 at index (0, 134) (up to 1e-05 allowed)
Greatest relative difference: 1.0 at index (0, 24308) (up to 1.3e-06 allowed)

Versions

Collecting environment information...
PyTorch version: 2.0.0
Is debug build: False
CUDA used to build PyTorch: None
ROCM used to build PyTorch: N/A

OS: macOS 13.3.1 (arm64)
GCC version: Could not collect
Clang version: 14.0.3 (clang-1403.0.22.14.1)
CMake version: version 3.25.0
Libc version: N/A

Python version: 3.10.4 (main, Mar 31 2022, 03:37:37) [Clang 12.0.0 ] (64-bit runtime)
Python platform: macOS-13.3.1-arm64-arm-64bit
Is CUDA available: False
CUDA runtime version: No CUDA
CUDA_MODULE_LOADING set to: N/A
GPU models and configuration: No CUDA
Nvidia driver version: No CUDA
cuDNN version: No CUDA
HIP runtime version: N/A
MIOpen runtime version: N/A
Is XNNPACK available: True

CPU:
Apple M1 Max

Versions of relevant libraries:
[pip3] flake8==5.0.4
[pip3] k2==1.23.4.dev20230412+cpu.torch2.0.0
[pip3] mypy-extensions==0.4.3
[pip3] numpy==1.23.5
[pip3] torch==2.0.0
[pip3] torchaudio==2.0.0
[pip3] torchvision==0.15.0
[conda] k2 1.23.4.dev20230412+cpu.torch2.0.0 pypi_0 pypi
[conda] numpy 1.23.5 py310hb93e574_0
[conda] numpy-base 1.23.5 py310haf87e8b_0
[conda] pytorch 2.0.0 py3.10_0 pytorch
[conda] torch 1.12.1 pypi_0 pypi
[conda] torchaudio 2.0.0 py310_cpu pytorch
[conda] torchvision 0.15.0 py310_cpu pytorch

@mthrok
Collaborator

mthrok commented Apr 27, 2023

I think what is happening is that for WAV, FFmpeg defaults to int16, so the save/load round trip in the test introduces a quantization discrepancy, but that discrepancy is at most on the order of 1e-5.
This is due to how the underlying implementation, StreamReader, works. It picks the default precision for the format, and that default is governed by FFmpeg's own mechanism.
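To make the size of that discrepancy concrete, here is a minimal pure-Python sketch of an int16 round trip. It assumes the encoder scales by 2**15, rounds to the nearest integer, and clips to the int16 range; that is an assumption about the conversion, not a copy of FFmpeg's code, and `int16_roundtrip` is a hypothetical helper name. Under that assumption the error for unclipped samples is at most half a quantization step, 2**-16 = 1.52587890625e-05, which matches the greatest absolute difference in the report. A sample smaller than half a step would round to zero, which would be consistent with the reported relative difference of 1.0.

```python
import random

INT16_SCALE = 32768  # 2**15, assumed scale between float [-1, 1) and int16


def int16_roundtrip(x: float) -> float:
    """Quantize a float sample to int16 and convert it back (hypothetical helper)."""
    q = round(x * INT16_SCALE)                      # round to nearest integer
    q = max(-INT16_SCALE, min(INT16_SCALE - 1, q))  # clip to the int16 range
    return q / INT16_SCALE


random.seed(0)
noise = [random.random() for _ in range(32000)]     # same shape of data as the report
restored = [int16_roundtrip(x) for x in noise]

max_abs_err = max(abs(a - b) for a, b in zip(noise, restored))
half_step = 2.0 ** -16                              # half of one int16 step

print(f"max abs error: {max_abs_err:.6e}")
print(f"half step:     {half_step:.6e}")
print(f"exceeds the 1e-05 tolerance: {max_abs_err > 1e-05}")
```

The default absolute tolerance of 1e-05 in torch.testing.assert_close for float32 is smaller than half an int16 step, so a round-trip test on arbitrary float data fails even though the file was written correctly as 16-bit PCM.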

It is possible to make the behavior match the previous backends, but I think there was user feedback that int16 is better, as that is what the vast majority of audio systems expect, and many do not understand other precisions.

One reason why the existing backends picked the matching precision is to preserve the data exactly as precisely as the model returned it, for the sake of scientific computation.

What do you think? @pzelasko @hwangjeff

@pzelasko
Author

Good insight! I was able to validate that you're right by replacing the noise generation with this:

INT16MAX = 32768
noise = torch.randint(-INT16MAX, INT16MAX - 1, (1, 32000))
noise = noise / INT16MAX
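The same check can be sketched without torchaudio, using only the standard-library `wave` module to write and read 16-bit PCM. This is an illustration under the assumption that samples are scaled by 2**15 in both directions; it is not how torchaudio or FFmpeg is implemented. Values that already sit on the int16 grid survive the round trip exactly.

```python
import io
import random
import struct
import wave

INT16_SCALE = 32768  # 2**15, assumed scale between float and int16

# Generate samples that lie exactly on the int16 grid, as in the snippet above.
random.seed(0)
ints = [random.randrange(-INT16_SCALE, INT16_SCALE - 1) for _ in range(32000)]
samples = [i / INT16_SCALE for i in ints]

# Write the samples as mono, 16 kHz, 16-bit PCM into an in-memory WAV file.
buf = io.BytesIO()
with wave.open(buf, "wb") as writer:
    writer.setnchannels(1)
    writer.setsampwidth(2)  # 2 bytes per sample, i.e. int16
    writer.setframerate(16000)
    pcm = [round(x * INT16_SCALE) for x in samples]
    writer.writeframes(struct.pack(f"<{len(pcm)}h", *pcm))

# Read the file back and convert to floats with the same scale.
buf.seek(0)
with wave.open(buf, "rb") as reader:
    raw = reader.readframes(reader.getnframes())
loaded = [v / INT16_SCALE for v in struct.unpack(f"<{len(raw) // 2}h", raw)]

print("exact round trip:", loaded == samples)
```

Because every sample is an integer multiple of 2**-15, no rounding happens on the way in, so the comparison can be exact rather than tolerance-based.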

I think it makes sense: it's the most common format, and people rarely need the actual float32 precision when saving files. I only found out because some of Lhotse's unit tests for correct save->load behavior failed when moving to ffmpeg, but they used artificial data anyway.

In that case you might want to update the documentation here:

Supported formats/encodings/bit depth/compression are:

``"wav"``
    - 32-bit floating-point PCM
    - 32-bit signed integer PCM
    - 24-bit signed integer PCM
    - 16-bit signed integer PCM
    - 8-bit unsigned integer PCM
    - 8-bit mu-law
    - 8-bit a-law

    Note:
        Default encoding/bit depth is determined by the dtype of
        the input Tensor.

``"flac"``
    - 16-bit (default)
    - 24-bit
