Add cache for KaldiReader #1004

Merged
merged 4 commits into from
Mar 22, 2023
Changes from 1 commit
32 changes: 23 additions & 9 deletions lhotse/features/io.py
@@ -942,6 +942,26 @@ def write(self, key: str, value: np.ndarray) -> str:
 Kaldi-compatible feature reader
 """

+def check_kaldi_native_io_installed():
+    if not is_module_available("kaldi_native_io"):
+        raise ValueError(
+            "To read Kaldi feats.scp, please 'pip install kaldi_native_io' first."
+        )
+
+
+@lru_cache(maxsize=None)
Collaborator

Can you also update

def close_cached_file_handles() -> None:

to clear the objects cached by this function?

Also, maybe I'm wrong, but I think we once supported caching for kaldi_io/kaldi_native_io, and it took a lot of memory for very large Kaldi data dirs.

Author

Good suggestion on clearing the cache. I updated accordingly.

I've tested on some fairly large data dirs and have not observed any memory issues.
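
For context, clearing an `lru_cache`-decorated opener can be sketched as follows. This is a minimal illustration, not the lhotse implementation: `FakeReader` and `open_reader` are hypothetical stand-ins for the Kaldi reader and `lookup_reader_cache_or_open`, and the body of `close_cached_file_handles` is only an assumption about what such an update might look like. `cache_clear()` and `cache_info()` are the standard methods that `functools.lru_cache` attaches to the wrapped function.

```python
from functools import lru_cache


class FakeReader:
    """Hypothetical stand-in for an open scp reader handle."""

    def __init__(self, path: str):
        self.path = path


@lru_cache(maxsize=None)
def open_reader(storage_path: str) -> FakeReader:
    # Hypothetical cached opener: one handle per distinct path.
    return FakeReader(storage_path)


def close_cached_file_handles() -> None:
    # Assumed shape of the update: drop every cached handle so the
    # objects can be garbage-collected.
    open_reader.cache_clear()


open_reader("a.scp")
open_reader("b.scp")
held_before = open_reader.cache_info().currsize  # number of cached handles
close_cached_file_handles()
held_after = open_reader.cache_info().currsize
```

With `maxsize=None` the cache is unbounded, so `currsize` grows with the number of distinct paths opened, which is where the memory question above comes from.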

Collaborator

Thanks for confirming that! LGTM, please just fix the formatting.

+def lookup_reader_cache_or_open(storage_path: str):
+    """
+    Helper internal function used in KaldiReader.
+    It opens kaldi scp files and keeps their handles open in a global program cache
+    to avoid excessive amount of syscalls when the Reader class is instantiated
+    and destroyed in a loop repeatedly (frequent use-case).
+    """
+    check_kaldi_native_io_installed()
+    import kaldi_native_io
+
+    return kaldi_native_io.RandomAccessFloatMatrixReader(f"scp:{storage_path}")
+
+

 @register_reader
 class KaldiReader(FeaturesReader):
@@ -957,19 +977,13 @@ class KaldiReader(FeaturesReader):
     name = "kaldiio"

     def __init__(self, storage_path: Pathlike, *args, **kwargs):
-        if not is_module_available("kaldi_native_io"):
-            raise ValueError(
-                "To read Kaldi feats.scp, please 'pip install kaldi_native_io' first."
-            )
-        import kaldi_native_io
-
         super().__init__()
         self.storage_path = storage_path
         if storage_path.endswith(".scp"):
-            self.storage = kaldi_native_io.RandomAccessFloatMatrixReader(
-                f"scp:{self.storage_path}"
-            )
+            self.storage = lookup_reader_cache_or_open(self.storage_path)
         else:
+            check_kaldi_native_io_installed()
+            import kaldi_native_io
             self.storage = None
             self.reader = kaldi_native_io.FloatMatrix
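
The effect of routing the `.scp` branch through a cached opener can be sketched as below. This is an illustration under stated assumptions rather than the code from this PR: `lookup_or_open` and `Reader` are hypothetical names, and a plain `object()` stands in for the `kaldi_native_io` reader so the snippet runs without that package installed.

```python
from functools import lru_cache

open_calls = []  # records every real "open" so they can be counted


@lru_cache(maxsize=None)
def lookup_or_open(storage_path: str) -> object:
    # Hypothetical stand-in for opening an scp file; the real code would
    # create a kaldi_native_io reader here.
    open_calls.append(storage_path)
    return object()


class Reader:
    """Hypothetical reader that takes its handle from the shared cache."""

    def __init__(self, storage_path: str):
        self.storage = lookup_or_open(storage_path)


# Instantiate the reader repeatedly, as in the "created and destroyed in a
# loop" use case described in the docstring above.
readers = [Reader("feats.scp") for _ in range(1000)]
same_handle = all(r.storage is readers[0].storage for r in readers)
```

Because `lru_cache` keys on the argument value, the path should be passed in a consistent form (for example always as `str`); a `pathlib.Path` and a `str` naming the same file would be cached under two separate keys.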
