making the kaldi import more robust #1129
Hi Piotr, what would you think about this change? (I found this issue while importing per-utterance flac files for the Chime challenge.)
Looks good to me, I think I found one bug though (see the other comment) -- can you test it?
lhotse/kaldi.py
Outdated
durations = dict(zip(recordings.keys(), dur_vals))

# remove recordings with 'None' duration (i.e. there was a read error)
for recording_id, duration in durations.items():
    if durations == None:
-     if durations == None:
+     if duration is None:
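The original condition can never fire: `durations` is the whole dict, so `durations == None` is always `False` and no recording would be dropped. Below is a minimal sketch of the corrected filtering. The recording IDs, paths, and the way entries are deleted are illustrative assumptions, since the excerpt does not show how the PR removes them.

```python
import logging

# Hypothetical stand-ins for the variables in the diff excerpt.
recordings = {"utt1": "utt1.flac", "utt2": "utt2.flac", "utt3": "utt3.flac"}
dur_vals = [1.5, None, 2.0]  # None marks a recording that failed to load
durations = dict(zip(recordings.keys(), dur_vals))

# The buggy check compares the dict itself with None, which is never true.
always_false = durations == None  # noqa: E711

# Iterate over a snapshot so entries can be deleted safely inside the loop.
for recording_id, duration in list(durations.items()):
    if duration is None:
        logging.warning("[%s] Could not get duration, dropping.", recording_id)
        del recordings[recording_id]
        del durations[recording_id]
```

Iterating over `list(durations.items())` matters here: deleting keys from a dict while iterating over it directly raises a `RuntimeError`.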
lhotse/kaldi.py
Outdated
logging.warning(
    f"[{recording_id}] Could not get duration. "
    f"Failed to read audio from `{recordings[recording_id]}`. "
    f"Dropping the recording from manifest."
-     f"Dropping the recording from manifest."
+     "Dropping the recording from manifest."
lhotse/kaldi.py
Outdated
durations = dict(zip(recordings.keys(), dur_vals))

# remove recordings with 'None' duration (i.e. there was a read error)
for recording_id, dur_value in durations.items():
    if dur_value == None:
To fix the style issue, we can use
if dur_value is None:
to replace
if dur_value == None:
get_duration():
- recover if audio file cannot be loaded for get_duration(), drop such recordings...
- use chunksize for ProcessPoolExecutor::map (avoid hanging of ProcessPoolExecutor for large RecordingSets)
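The two points in this commit message can be sketched as follows. This is not the code from `lhotse/kaldi.py`: the helper names `read_duration` and `compute_durations`, the use of the standard-library `wave` module instead of lhotse's audio loading, and the chunk-size heuristic are all assumptions made for illustration.

```python
import wave
from concurrent.futures import Executor, ProcessPoolExecutor
from typing import Dict, Optional, Type


def read_duration(path: str) -> Optional[float]:
    """Return the duration in seconds, or None if the file cannot be read."""
    try:
        with wave.open(path, "rb") as f:
            return f.getnframes() / f.getframerate()
    except (OSError, EOFError, wave.Error):
        # Recover instead of crashing the whole import.
        return None


def compute_durations(
    recordings: Dict[str, str],
    num_jobs: int = 4,
    executor_cls: Type[Executor] = ProcessPoolExecutor,
) -> Dict[str, float]:
    """Map recording IDs to durations, dropping unreadable recordings."""
    # Hypothetical heuristic: send work in batches (a few per worker)
    # rather than one task per file.
    chunksize = max(1, len(recordings) // (num_jobs * 4))
    with executor_cls(max_workers=num_jobs) as ex:
        dur_vals = list(
            ex.map(read_duration, recordings.values(), chunksize=chunksize)
        )
    return {rid: d for rid, d in zip(recordings, dur_vals) if d is not None}
```

With `ProcessPoolExecutor`, `chunksize` groups the inputs into batches that are submitted to the workers together, which reduces per-item inter-process overhead on large inputs. `ThreadPoolExecutor` accepts the same argument but ignores it. When running with processes as a script, call `compute_durations` under an `if __name__ == "__main__":` guard.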
087c5ef to 4c1f3cf
Ok, both suggested changes are done. I also added a new sanity check...
- not more than 20% utterances can be dropped on `kaldi import`
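A sanity check of this kind might look like the sketch below. The function name `check_dropped_fraction`, its parameters, and the choice to raise `ValueError` are hypothetical; the excerpt does not show whether the PR raises, asserts, or only logs.

```python
def check_dropped_fraction(
    num_total: int, num_kept: int, max_dropped: float = 0.2
) -> None:
    """Raise if more than `max_dropped` of the utterances were dropped."""
    if num_total == 0:
        return  # nothing to check on an empty import
    dropped = (num_total - num_kept) / num_total
    if dropped > max_dropped:
        raise ValueError(
            f"Dropped {num_total - num_kept} of {num_total} utterances "
            f"({dropped:.0%}), which exceeds the limit of {max_dropped:.0%}."
        )
```

A hard limit like this turns a systematic problem (for example a wrong path prefix that makes every file unreadable) into an immediate failure instead of a silently tiny manifest.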
4c1f3cf to c70b20b
Thanks, LGTM
get_duration():