New `ffmpeg` backend changes samples when saving WAVE #3281

pzelasko · 2023-04-27T00:22:53Z

🐛 Describe the bug

A snippet to reproduce the error is provided below. Passing backend="sox" or backend="soundfile" to torchaudio.save avoids the issue.

import os
from tempfile import NamedTemporaryFile

os.environ["TORCHAUDIO_USE_BACKEND_DISPATCHER"] = "1"

import torch
import torchaudio

torch.manual_seed(0)

noise = torch.rand(1, 32000, dtype=torch.float32)

with NamedTemporaryFile(suffix=".wav") as f:
    torchaudio.save(f.name, noise, sample_rate=16000)
    f.flush()
    f.seek(0)
    noise_load, _ = torchaudio.load(f)
    
torch.testing.assert_close(noise_load, noise)

Output:

Traceback (most recent call last):
  File "/Users/pzelasko/Library/Application Support/JetBrains/PyCharm2023.1/scratches/scratch_12.py", line 19, in <module>
    torch.testing.assert_close(noise_load, noise)
  File "/Users/pzelasko/miniconda3/envs/lhotse-py3.10/lib/python3.10/site-packages/torch/testing/_comparison.py", line 1511, in assert_close
    raise error_metas[0].to_error(msg)
AssertionError: Tensor-likes are not close!

Mismatched elements: 9760 / 32000 (30.5%)
Greatest absolute difference: 1.52587890625e-05 at index (0, 134) (up to 1e-05 allowed)
Greatest relative difference: 1.0 at index (0, 24308) (up to 1.3e-06 allowed)

Versions

Collecting environment information...
PyTorch version: 2.0.0
Is debug build: False
CUDA used to build PyTorch: None
ROCM used to build PyTorch: N/A

OS: macOS 13.3.1 (arm64)
GCC version: Could not collect
Clang version: 14.0.3 (clang-1403.0.22.14.1)
CMake version: version 3.25.0
Libc version: N/A

Python version: 3.10.4 (main, Mar 31 2022, 03:37:37) [Clang 12.0.0 ] (64-bit runtime)
Python platform: macOS-13.3.1-arm64-arm-64bit
Is CUDA available: False
CUDA runtime version: No CUDA
CUDA_MODULE_LOADING set to: N/A
GPU models and configuration: No CUDA
Nvidia driver version: No CUDA
cuDNN version: No CUDA
HIP runtime version: N/A
MIOpen runtime version: N/A
Is XNNPACK available: True

CPU:
Apple M1 Max

Versions of relevant libraries:
[pip3] flake8==5.0.4
[pip3] k2==1.23.4.dev20230412+cpu.torch2.0.0
[pip3] mypy-extensions==0.4.3
[pip3] numpy==1.23.5
[pip3] torch==2.0.0
[pip3] torchaudio==2.0.0
[pip3] torchvision==0.15.0
[conda] k2 1.23.4.dev20230412+cpu.torch2.0.0 pypi_0 pypi
[conda] numpy 1.23.5 py310hb93e574_0
[conda] numpy-base 1.23.5 py310haf87e8b_0
[conda] pytorch 2.0.0 py3.10_0 pytorch
[conda] torch 1.12.1 pypi_0 pypi
[conda] torchaudio 2.0.0 py310_cpu pytorch
[conda] torchvision 0.15.0 py310_cpu pytorch

mthrok · 2023-04-27T14:56:16Z

I think what is happening is that, for WAV, ffmpeg defaults to int16, so the save/load round trip in the test introduces a discrepancy, but the discrepancy is at most on the order of 1e-5.
This is due to how the underlying implementation, StreamReader, works. It picks the default precision of the format, and that default is governed by FFmpeg's own mechanism.
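
To see where a discrepancy of this size can come from, here is a minimal pure-Python sketch of a 16-bit round trip. It assumes round-to-nearest scaling by 32768 with clipping; the exact conversion FFmpeg applies is not shown in this thread, and the helper names `quantize_to_int16` and `dequantize_from_int16` are hypothetical, used only to illustrate the arithmetic.

```python
import random

INT16_SCALE = 32768  # 2**15, assumed scale between float [-1, 1) and int16


def quantize_to_int16(x: float) -> int:
    """Map a float sample to a signed 16-bit integer.

    Assumption: round to nearest, then clip to [-32768, 32767].
    """
    return max(-INT16_SCALE, min(INT16_SCALE - 1, round(x * INT16_SCALE)))


def dequantize_from_int16(q: int) -> float:
    """Map a signed 16-bit integer back to a float sample."""
    return q / INT16_SCALE


random.seed(0)
# Stay away from +/-1.0 so that clipping does not affect the error bound.
samples = [random.uniform(-0.999, 0.999) for _ in range(32000)]
roundtrip = [dequantize_from_int16(quantize_to_int16(s)) for s in samples]

# Largest difference between the original and the round-tripped samples.
max_abs_err = max(abs(a - b) for a, b in zip(samples, roundtrip))
half_step = 2 ** -16  # half of one int16 quantization step (2**-15)
print(max_abs_err, half_step)
```

Under these assumptions the error for in-range samples never exceeds half a quantization step, 2**-16 = 1.52587890625e-05, which is the greatest absolute difference shown in the report above.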

It is possible to make the behavior match the previous backends, but I think there was user feedback that int16 is better, as that's what the vast majority of audio systems expect and many do not understand other precisions.

One reason the existing backends pick the matching precision is to preserve the data exactly as precise as it was when returned by the model, for the sake of scientific computation.

What do you think? @pzelasko @hwangjeff

pzelasko · 2023-04-27T16:08:19Z

Good insight! I was able to validate that you're right by replacing noise generation like this:

INT16MAX = 32768
noise = torch.randint(-INT16MAX, INT16MAX - 1, (1, 32000))
noise = noise / INT16MAX
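
A pure-Python sketch of why this substitution makes the round trip exact: every value of the form k / 32768, with integer k in the int16 range, already sits on the 16-bit grid, so quantizing and rescaling returns it unchanged. The rounding rule and the helper name `roundtrip_int16` are assumptions for illustration, not torchaudio or FFmpeg API.

```python
import random

SCALE = 32768  # 2**15, assumed scale between float samples and int16


def roundtrip_int16(x: float) -> float:
    """Quantize to int16 (assumed round-to-nearest with clipping) and rescale."""
    q = max(-SCALE, min(SCALE - 1, round(x * SCALE)))
    return q / SCALE


random.seed(0)

# Samples that are exact multiples of 1/32768, as in the snippet above.
on_grid = [random.randint(-SCALE, SCALE - 1) / SCALE for _ in range(32000)]
# Arbitrary float samples, as in the original reproduction.
off_grid = [random.uniform(-0.999, 0.999) for _ in range(32000)]

# Count how many samples are altered by the round trip in each case.
on_grid_changed = sum(roundtrip_int16(x) != x for x in on_grid)
off_grid_changed = sum(roundtrip_int16(x) != x for x in off_grid)
print(on_grid_changed, off_grid_changed)
```

On-grid samples come back bit-for-bit identical, while arbitrary floats generally do not, which is consistent with the 16-bit default being the cause.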

I think it makes sense: it's the most common format, and people rarely need the actual float32 precision when saving files. I only found out because some of Lhotse's unit tests for correct save->load behavior failed when moving to ffmpeg, but they used artificial data anyway.

In that case you might want to update the documentation here:

audio/torchaudio/_backend/utils.py

Lines 533 to 550 in 151ac4d

    
    Supported formats/encodings/bit depth/compression are:

    ``"wav"``
        - 32-bit floating-point PCM
        - 32-bit signed integer PCM
        - 24-bit signed integer PCM
        - 16-bit signed integer PCM
        - 8-bit unsigned integer PCM
        - 8-bit mu-law
        - 8-bit a-law

        Note:
            Default encoding/bit depth is determined by the dtype of
            the input Tensor.

    ``"flac"``
        - 16-bit (default)
        - 24-bit

pzelasko mentioned this issue Apr 27, 2023

Integrate torchaudio's 2.0 ffmpeg backend for audio loading + add some optimizations lhotse-speech/lhotse#1043

Merged

xiaohui-zhang added the triaged label May 2, 2023
