
AudioCache: caching for "command" type of audio files #1050

Merged: 5 commits into lhotse-speech:master, May 12, 2023

Conversation

@KarelVesely84 (Contributor) commented May 3, 2023

  • reduces overhead when reading many segments from a single audio file
  • cache size is limited to 500 MB and at most 100 files
  • enabled by default
  • thread safety is ensured via threading.Lock

Tested by running ./pruned_transducer_stateless7_streaming/streaming_decode.py with a dataset imported from Kaldi format (locally).

Question to @pzelasko: should the cache also be used for the 'url' type of audio files?
(Locally I see 3x XFAIL on pytest test/test_audio_reads.py -v, but this also happens on the current master branch.)
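To illustrate the kind of structure described above, here is a minimal, hypothetical sketch of a size-bounded, thread-safe LRU byte cache. The names and limits mirror the PR description, but this is not the PR's actual implementation:

```python
import threading
from collections import OrderedDict


class AudioCache:
    """Illustrative LRU byte cache bounded by total size and item count."""

    max_bytes = 500 * 1024 * 1024  # 500 MB budget for cached audio bytes
    max_items = 100                # no more than 100 files

    _lock = threading.Lock()
    _cache: "OrderedDict[str, bytes]" = OrderedDict()
    _total_bytes = 0

    @classmethod
    def try_cache(cls, key: str):
        """Return cached bytes for `key`, or None on a cache miss."""
        with cls._lock:
            if key in cls._cache:
                cls._cache.move_to_end(key)  # mark as most recently used
                return cls._cache[key]
            return None

    @classmethod
    def add_to_cache(cls, key: str, value: bytes) -> None:
        """Store `value`, evicting least-recently-used entries over budget."""
        if len(value) > cls.max_bytes:
            return  # a single file larger than the whole budget is not cached
        with cls._lock:
            cls._total_bytes -= len(cls._cache.pop(key, b""))  # replace duplicates
            cls._cache[key] = value
            cls._total_bytes += len(value)
            while len(cls._cache) > cls.max_items or cls._total_bytes > cls.max_bytes:
                _, evicted = cls._cache.popitem(last=False)  # drop oldest entry
                cls._total_bytes -= len(evicted)
```

A single lock around both lookup and insertion keeps the bookkeeping (`_total_bytes`, LRU order) consistent when multiple dataloader threads read audio concurrently.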

@pzelasko (Collaborator) commented May 4, 2023

Thanks! The code looks good to me, but I'm wondering if we should really merge it. We already have two mechanisms that can be used instead: the @dynamic_lru_cache decorator and Recording.move_to_memory() / Cut.move_to_memory().

Is it possible to achieve your goal by using either of those? If not, can you elaborate?

@KarelVesely84 (Contributor, Author) commented May 4, 2023

Hi Piotr,
I prepared a Kaldi data folder with many segments pointing to one recording, with pipe commands in wav.scp.
The Kaldi data dir is imported into lhotse, and then I run streaming_decode.py with k2/icefall.
(Having long audio files should be beneficial: it avoids loading lots of small audio files, which causes I/O overhead.)

I tried to use the dynamic_lru_cache; the process takes some 25 GB of memory, but apparently it calls the same audio-loading "command" again and again for all of its segments, and the progress is slow. What could work is to put lru_cache on the run() function, before the audio file is segmented. That would be an alternative implementation of what I am doing in this PR.

And the recording.move_to_memory() and cut.move_to_memory() methods do not seem to be a solution: streaming_decode.py uses a CutSet for directly loading and segmenting audio.

I import the data folder with these steps:

```
lhotse kaldi import ...
lhotse fix ...
./local/compute_fbank_imported.py ... [ICEFALL]
lhotse cut trim-to-supervisions ...
```

An example record from the feature manifest:

{"id": "Czech-PDTSC20_pdtsc_142_spk1-0013229_0013942", "start": 132.296, "duration": 7.13, "channel": 0, "supervisions": [{"id": "Czech-PDTSC20_pdtsc_142_spk1-0013229_0013942", "recording_id": "Czech-PDTSC20_pdtsc_142", "start": 0.0, "duration": 7.13, "channel": 0, "text": "no už nevim , co bych o tom povídala , už tam <unk> pamatujete si z tohoto výletu něco zajímavého ?", "speaker": "Czech-PDTSC20_pdtsc_142_spk1"}], "features": {"type": "kaldi-fbank", "num_frames": 363125, "num_features": 80, "frame_shift": 0.01, "sampling_rate": 16000, "start": 0, "duration": 3631.25, "storage_type": "lilcom_chunky", "storage_path": "data/fbank/imported_feats_pdtsc-test/feats-0.lca", "storage_key": "${STORAGE_KEY}", "channels": 0}, "recording": {"id": "Czech-PDTSC20_pdtsc_142", "sources": [{"type": "command", "channels": [0], "source": "sox /mnt/matylda2/data/CZECH_PDTSC_DIALOGS/PDTSC-2.0/data/pdtsc_142.ogg -t wav -r16000 -c1 -b16 -e signed-integer - remix - "}], "sampling_rate": 16000, "num_samples": 58100000, "duration": 3631.25, "channel_ids": [0]}, "type": "MonoCut"}
{"id": "Czech-PDTSC20_pdtsc_142_spk1-0033351_0033651", "start": 333.51, "duration": 3.0, "channel": 0, "supervisions": [{"id": "Czech-PDTSC20_pdtsc_142_spk1-0033351_0033651", "recording_id": "Czech-PDTSC20_pdtsc_142", "start": 0.0, "duration": 3.0, "channel": 0, "text": "máte v klubu nějaké přátele ?", "speaker": "Czech-PDTSC20_pdtsc_142_spk1"}], "features": {"type": "kaldi-fbank", "num_frames": 363125, "num_features": 80, "frame_shift": 0.01, "sampling_rate": 16000, "start": 0, "duration": 3631.25, "storage_type": "lilcom_chunky", "storage_path": "data/fbank/imported_feats_pdtsc-test/feats-0.lca", "storage_key": "${STORAGE_KEY}", "channels": 0}, "recording": {"id": "Czech-PDTSC20_pdtsc_142", "sources": [{"type": "command", "channels": [0], "source": "sox /mnt/matylda2/data/CZECH_PDTSC_DIALOGS/PDTSC-2.0/data/pdtsc_142.ogg -t wav -r16000 -c1 -b16 -e signed-integer - remix - "}], "sampling_rate": 16000, "num_samples": 58100000, "duration": 3631.25, "channel_ids": [0]}, "type": "MonoCut"}
...

My PR made the runtime of streaming_decode.py much faster (the I/O overhead in "sys" time was reduced ~10x on a 200-cut sample: 28 s -> 2.1 s).

What would you suggest ?
Best, Karel

@KarelVesely84 (Contributor, Author) commented May 4, 2023

I played with it a bit more and tried the lru_cache, but the caching was not working; it was still reading the same data over and over again.

This is what I tried (I created run_cached() as a wrapper of the subprocess.run() command):

```python
if self.type == "command":
    if offset != 0.0 or duration is not None:
        # TODO(pzelasko): How should we support chunking for commands?
        #                 We risk being very inefficient when reading many chunks from the same file
        #                 without some caching scheme, because we'll be re-running commands.
        warnings.warn(
            "You requested a subset of a recording that is read from disk via a bash command. "
            "Expect large I/O overhead if you are going to read many chunks like these, "
            "since every time we will read the whole file rather than its subset."
        )

    @lru_cache(maxsize=10)
    def run_cached(command: str) -> bytes:
        """Adapter for 'run()' to allow caching of audio files in stdout."""
        return run(command, shell=True, stdout=PIPE).stdout

    source = BytesIO(run_cached(self.source))
    samples, sampling_rate = read_audio(
        source, offset=offset, duration=duration
    )
```

@pzelasko (Collaborator) commented May 4, 2023

I checked that script. In general it has another issue: it's not using a dataloader with background workers to load the audio, so it's bound to be slow. But I understand how the cache could help here. I'm OK to merge, but I'd like to change the following:

  • we disable it by default (we can use an env var to enable it)
  • register it in lhotse.set_caching_enabled
  • let's purge the @dynamic_lru_cache decorators everywhere (I think your implementation is better than mine and more predictable in memory size)
  • expose some method/function for purging the cache (currently, I understand it will only be effective for the first 500MB of data?)

WDYT?

> I played with it a bit more and tried the lru_cache, but the caching was not working, still reading the same over and over again.
>
> This is what I tried (created a run_cached() as a wrapper of the subprocess.run() command):

It wouldn't work because every time you enter this function, you are creating a new closure with a new cache. You'd need a global cached function instead.

@KarelVesely84 (Contributor, Author) commented:
Okay, I'll continue working on it. I have a long weekend ahead, so I'll continue on Wednesday.
The plan LGTM. Thanks!

@KarelVesely84 (Contributor, Author) commented:
Hi Piotr,
I incorporated your feedback:

  • disabled by default (I decided not to introduce an env variable)
  • inter-connected AudioCache with caching.py, so set_caching_enabled() activates/deactivates AudioCache
  • the cache is freed upon disabling
  • removed @dynamic_lru_cache from audio.py (it is kept in features/io.py)

The AudioCache code was moved to caching.py (which seems a more logical place).

New unit tests were created.

Should I also remove the old unit tests?
test_audio_caching_disabled_works
test_audio_caching_enabled_works

Cheers,
Karel
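The inter-connection described above could look roughly like this (a simplified, hypothetical sketch; the PR's actual wiring in caching.py may differ):

```python
import threading

_caching_enabled = False


class AudioCache:
    """Stand-in for the audio byte cache; only the parts relevant here."""

    _lock = threading.Lock()
    _cache: dict = {}

    @classmethod
    def enabled(cls) -> bool:
        return _caching_enabled

    @classmethod
    def clear(cls) -> None:
        with cls._lock:
            cls._cache.clear()


def set_caching_enabled(enabled: bool) -> None:
    """Global caching switch; disabling also frees the audio cache."""
    global _caching_enabled
    _caching_enabled = enabled
    if not enabled:
        AudioCache.clear()
```

The point of clearing on disable is that toggling caching off immediately releases up to 500 MB of held audio bytes, rather than leaving them resident until process exit.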

Review comment on lhotse/audio.py (outdated):

```
@@ -210,6 +216,7 @@ def load_audio(
    "Expect large I/O overhead if you are going to read many chunks like these, "
    "since every time we will download the whole file rather than its subset."
)
# Should AudioCache be used also for 'url' type ? (url never contains a live-stream ?)
```

@pzelasko (Collaborator) replied:

Yes, I think it makes sense to cache URL as well

Review comment:

```
@@ -121,6 +121,7 @@ def test_opus_torchaudio_vs_ffmpeg_with_resampling(path, force_opus_sampling_rat
np.testing.assert_almost_equal(audio_ta, audio_ff, decimal=1)


@pytest.mark.skip(reason="The audio caching by @lru_cache_optional was removed.")
```

@pzelasko (Collaborator) replied:

Let's just remove the test

@pzelasko (Collaborator) commented:

Thanks Karel! I left two final comments; if you can resolve those, then LGTM.

Follow-up commits:

- apply AudioCache also to URL reads
- remove the unit test of the no-longer-existing @lru_cache_optional audio caching
@KarelVesely84 (Contributor, Author) commented May 12, 2023

Okay, I did the two changes. Thanks Piotr! K.

@pzelasko (Collaborator) previously approved these changes May 12, 2023 and left a comment:

Thanks Karel! LGTM, will just commit some cosmetics

@pzelasko pzelasko merged commit fcca8a7 into lhotse-speech:master May 12, 2023
@pzelasko pzelasko added this to the v1.15 milestone May 12, 2023