
AudioCache: caching for "command" type of audio files #1050

Merged: 5 commits into lhotse-speech:master, May 12, 2023

Conversation

@KarelVesely84 (Contributor) commented May 3, 2023

  • reduces overhead when reading many segments from a single audio file
  • cache size is limited to 500 MB and at most 100 files
  • enabled by default
  • thread safety is ensured via threading.Lock

Tested by running ./pruned_transducer_stateless7_streaming/streaming_decode.py with a dataset imported from Kaldi format (locally).

Question to @pzelasko: should the cache also be used for the 'url' type of audio files?
(Locally I see 3x XFAIL on pytest test/test_audio_reads.py -v, but this also happens on the current master branch.)
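To illustrate the kind of structure described above, here is a minimal, hypothetical sketch of a size-bounded, thread-safe LRU byte cache. The names and limits mirror the PR description, but this is not the PR's actual implementation:

```python
import threading
from collections import OrderedDict


class AudioCache:
    """Illustrative LRU byte cache bounded by total size and item count."""

    max_bytes = 500 * 1024 * 1024  # 500 MB budget for cached audio bytes
    max_items = 100                # no more than 100 files

    _lock = threading.Lock()
    _cache: "OrderedDict[str, bytes]" = OrderedDict()
    _total_bytes = 0

    @classmethod
    def try_cache(cls, key: str):
        """Return cached bytes for `key`, or None on a cache miss."""
        with cls._lock:
            if key in cls._cache:
                cls._cache.move_to_end(key)  # mark as most recently used
                return cls._cache[key]
            return None

    @classmethod
    def add_to_cache(cls, key: str, value: bytes) -> None:
        """Store `value`, evicting least-recently-used entries over budget."""
        if len(value) > cls.max_bytes:
            return  # a single file larger than the whole budget is not cached
        with cls._lock:
            cls._total_bytes -= len(cls._cache.pop(key, b""))  # replace duplicates
            cls._cache[key] = value
            cls._total_bytes += len(value)
            while len(cls._cache) > cls.max_items or cls._total_bytes > cls.max_bytes:
                _, evicted = cls._cache.popitem(last=False)  # drop oldest entry
                cls._total_bytes -= len(evicted)
```

A single lock around both lookup and insertion keeps the bookkeeping (`_total_bytes`, LRU order) consistent when multiple dataloader threads read audio concurrently.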

@pzelasko (Collaborator) commented May 4, 2023

Thanks! The code looks good to me, but I'm wondering if we should really merge it. We already have two mechanisms that can be used instead: the @dynamic_lru_cache decorator and Recording.move_to_memory() / Cut.move_to_memory().

Is it possible to achieve your goal by using either of those? If not, can you elaborate?

@KarelVesely84 (Contributor, Author) commented May 4, 2023

Hi Piotr,
I prepared a Kaldi data folder with many segments pointing to one recording, with pipe commands in wav.scp.
The Kaldi data dir is imported into lhotse, and then I run streaming_decode.py with k2/icefall.
(Having long audio files should be beneficial: it avoids loading lots of small audio files, which causes I/O overhead.)

I tried to use the dynamic_lru_cache; the process takes some 25 GB of memory, but apparently it calls the same audio-loading "command" again and again for all of its segments, and the progress is slow. What could work is to put lru_cache on the run() function, before the audio file is segmented. That would be an alternative implementation of what I am doing in this PR.

And the recording.move_to_memory() and cut.move_to_memory() methods do not seem to be a solution: streaming_decode.py uses a CutSet for directly loading and segmenting audio.

I import the data folder with these steps:

```
lhotse kaldi import ...
lhotse fix ...
./local/compute_fbank_imported.py ... [ICEFALL]
lhotse cut trim-to-supervisions ...
```

An example record from the feature manifest:

{"id": "Czech-PDTSC20_pdtsc_142_spk1-0013229_0013942", "start": 132.296, "duration": 7.13, "channel": 0, "supervisions": [{"id": "Czech-PDTSC20_pdtsc_142_spk1-0013229_0013942", "recording_id": "Czech-PDTSC20_pdtsc_142", "start": 0.0, "duration": 7.13, "channel": 0, "text": "no už nevim , co bych o tom povídala , už tam <unk> pamatujete si z tohoto výletu něco zajímavého ?", "speaker": "Czech-PDTSC20_pdtsc_142_spk1"}], "features": {"type": "kaldi-fbank", "num_frames": 363125, "num_features": 80, "frame_shift": 0.01, "sampling_rate": 16000, "start": 0, "duration": 3631.25, "storage_type": "lilcom_chunky", "storage_path": "data/fbank/imported_feats_pdtsc-test/feats-0.lca", "storage_key": "${STORAGE_KEY}", "channels": 0}, "recording": {"id": "Czech-PDTSC20_pdtsc_142", "sources": [{"type": "command", "channels": [0], "source": "sox /mnt/matylda2/data/CZECH_PDTSC_DIALOGS/PDTSC-2.0/data/pdtsc_142.ogg -t wav -r16000 -c1 -b16 -e signed-integer - remix - "}], "sampling_rate": 16000, "num_samples": 58100000, "duration": 3631.25, "channel_ids": [0]}, "type": "MonoCut"}
{"id": "Czech-PDTSC20_pdtsc_142_spk1-0033351_0033651", "start": 333.51, "duration": 3.0, "channel": 0, "supervisions": [{"id": "Czech-PDTSC20_pdtsc_142_spk1-0033351_0033651", "recording_id": "Czech-PDTSC20_pdtsc_142", "start": 0.0, "duration": 3.0, "channel": 0, "text": "máte v klubu nějaké přátele ?", "speaker": "Czech-PDTSC20_pdtsc_142_spk1"}], "features": {"type": "kaldi-fbank", "num_frames": 363125, "num_features": 80, "frame_shift": 0.01, "sampling_rate": 16000, "start": 0, "duration": 3631.25, "storage_type": "lilcom_chunky", "storage_path": "data/fbank/imported_feats_pdtsc-test/feats-0.lca", "storage_key": "${STORAGE_KEY}", "channels": 0}, "recording": {"id": "Czech-PDTSC20_pdtsc_142", "sources": [{"type": "command", "channels": [0], "source": "sox /mnt/matylda2/data/CZECH_PDTSC_DIALOGS/PDTSC-2.0/data/pdtsc_142.ogg -t wav -r16000 -c1 -b16 -e signed-integer - remix - "}], "sampling_rate": 16000, "num_samples": 58100000, "duration": 3631.25, "channel_ids": [0]}, "type": "MonoCut"}
...

My PR made the runtime of streaming_decode.py much faster (the I/O overhead in "sys" time was reduced ~10x on a 200-cut sample: 28 s -> 2.1 s).

What would you suggest ?
Best, Karel

@KarelVesely84 (Contributor, Author) commented May 4, 2023

I played with it a bit more and tried the lru_cache, but the caching was not working; it was still reading the same data over and over again.

This is what I tried (I created run_cached() as a wrapper of the subprocess.run() command):

```python
if self.type == "command":
    if offset != 0.0 or duration is not None:
        # TODO(pzelasko): How should we support chunking for commands?
        #                 We risk being very inefficient when reading many chunks from the same file
        #                 without some caching scheme, because we'll be re-running commands.
        warnings.warn(
            "You requested a subset of a recording that is read from disk via a bash command. "
            "Expect large I/O overhead if you are going to read many chunks like these, "
            "since every time we will read the whole file rather than its subset."
        )

    @lru_cache(maxsize=10)
    def run_cached(command: str) -> bytes:
        """Adapter for 'run()' to allow caching of audio files in stdout."""
        return run(command, shell=True, stdout=PIPE).stdout

    source = BytesIO(run_cached(self.source))
    samples, sampling_rate = read_audio(
        source, offset=offset, duration=duration
    )
```

@pzelasko (Collaborator) commented May 4, 2023

I checked that script. In general it has another issue: it's not using a dataloader with background workers to load the audio, so it's bound to be slow. But I understand how the cache could help here. I'm OK to merge, but I'd like to change the following:

  • we disable it by default (we can use an env var to enable it)
  • register it in lhotse.set_caching_enabled
  • let's purge the @dynamic_lru_cache decorators everywhere (I think your implementation is better than mine and more predictable in memory size)
  • expose some method/function for purging the cache (currently, I understand it will only be effective for the first 500MB of data?)

WDYT?

> I played with it a bit more and tried the lru_cache, but the caching was not working, still reading the same over and over again.
>
> This is what I tried (created a run_cached() as a wrapper of the subprocess.run() command):

It wouldn't work because every time you enter this function, you are creating a new closure with a new cache. You'd need a global cached function instead.

@KarelVesely84 (Contributor, Author) commented:
Okay, I'll continue working on it. I have a long weekend ahead, so I'll continue on Wednesday.
The plan LGTM. Thanks!

@KarelVesely84 (Contributor, Author) commented:
Hi Piotr,
I incorporated your feedback:

  • disabled by default (I decided not to introduce an env variable)
  • inter-connected AudioCache with caching.py, so set_caching_enabled() activates/deactivates AudioCache
  • the cache is freed upon disabling
  • removed @dynamic_lru_cache from audio.py (it is kept in features/io.py)

The AudioCache code was moved to caching.py (which seems a more logical place).

New unit tests were created.

Should I also remove the old unit tests?
test_audio_caching_disabled_works
test_audio_caching_enabled_works

Cheers,
Karel
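The inter-connection described above could look roughly like this (a simplified, hypothetical sketch; the PR's actual wiring in caching.py may differ):

```python
import threading

_caching_enabled = False


class AudioCache:
    """Stand-in for the audio byte cache; only the parts relevant here."""

    _lock = threading.Lock()
    _cache: dict = {}

    @classmethod
    def enabled(cls) -> bool:
        return _caching_enabled

    @classmethod
    def clear(cls) -> None:
        with cls._lock:
            cls._cache.clear()


def set_caching_enabled(enabled: bool) -> None:
    """Global caching switch; disabling also frees the audio cache."""
    global _caching_enabled
    _caching_enabled = enabled
    if not enabled:
        AudioCache.clear()
```

The point of clearing on disable is that toggling caching off immediately releases up to 500 MB of held audio bytes, rather than leaving them resident until process exit.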

Review comment on lhotse/audio.py (outdated):

```
@@ -210,6 +216,7 @@ def load_audio(
    "Expect large I/O overhead if you are going to read many chunks like these, "
    "since every time we will download the whole file rather than its subset."
)
# Should AudioCache be used also for 'url' type ? (url never contains a live-stream ?)
```

@pzelasko (Collaborator) replied:

Yes, I think it makes sense to cache URL as well

Review comment:

```
@@ -121,6 +121,7 @@ def test_opus_torchaudio_vs_ffmpeg_with_resampling(path, force_opus_sampling_rat
np.testing.assert_almost_equal(audio_ta, audio_ff, decimal=1)


@pytest.mark.skip(reason="The audio caching by @lru_cache_optional was removed.")
```

@pzelasko (Collaborator) replied:

Let's just remove the test

@pzelasko (Collaborator) commented:

Thanks Karel! I left two final comments; if you can resolve those, then LGTM.

Follow-up commits:

- apply AudioCache also to URL reads
- remove the unit test of the no-longer-existing @lru_cache_optional audio caching
@KarelVesely84 (Contributor, Author) commented May 12, 2023

Okay, I did the two changes. Thanks Piotr! K.

@pzelasko (Collaborator) previously approved these changes May 12, 2023 and left a comment:

Thanks Karel! LGTM, will just commit some cosmetics

@pzelasko pzelasko merged commit fcca8a7 into lhotse-speech:master May 12, 2023
@pzelasko pzelasko added this to the v1.15 milestone May 12, 2023