[Fix] Relative Audio Paths #4470

stevehuang52 · 2022-06-29T20:40:48Z

This PR fixes the second problem described in issue #4455.

Previously, the code failed when file paths in tarred datasets contained "/". Now the manifest directory is prepended to the audio file path only if the resulting file exists, which fixes the problem.

Signed-off-by: stevehuang52 <[email protected]>

titu1994

Overall looks good. Minor comments below.

titu1994 · 2022-06-30T03:36:42Z

nemo/collections/common/parts/preprocessing/manifest.py

    manifest_dir = Path(manifest_file).parent
    audio_file = Path(item['audio_file'])
-    if not audio_file.is_file() and not audio_file.is_absolute() and audio_file.parent != Path("."):
-        # assume the wavs/ dir and manifest are under the same parent dir
+    if not audio_file.is_file() and not audio_file.is_absolute():


Note that I think even this is_file() check will fail for file names that are too long. Any ideas?
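One way to make the concern above concrete is a guarded check. This is a minimal sketch, not code from the PR: `safe_is_file` is a hypothetical helper, and it assumes that on some platforms and Python versions `Path.is_file()` can raise `OSError` (for example with `ENAMETOOLONG`) rather than returning `False`.

```python
from pathlib import Path


def safe_is_file(path: Path) -> bool:
    """Return True only if `path` is an existing regular file.

    Hypothetical helper: any OSError raised by the check (such as a
    "file name too long" error) is treated as "not a file".
    """
    try:
        return path.is_file()
    except OSError:
        return False


# A single path component of more than 255 characters.
too_long = Path("x" * 300 + ".wav")
print(safe_is_file(too_long))
```

Whether swallowing the error is acceptable depends on the caller; here it only means "do not prepend the manifest directory".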

I think we can try to avoid overly long file names. When creating tarred datasets, we can trim the common path of all audio files and use the trimmed paths as the new file names.

For example,

/a/b/c/d/e/xxxxxx.wav
/a/b/c/d/f/yyyyyy.wav

can be trimmed to

e/xxxxxx.wav
f/yyyyyy.wav
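The trimming proposed above could be sketched as follows. This only illustrates the suggestion in this comment; `trim_common_prefix` is a hypothetical name, and POSIX-style paths are assumed.

```python
import posixpath


def trim_common_prefix(paths):
    """Strip the longest common directory prefix from every path."""
    common = posixpath.commonpath(paths)
    return [posixpath.relpath(p, common) for p in paths]


paths = ["/a/b/c/d/e/xxxxxx.wav", "/a/b/c/d/f/yyyyyy.wav"]
print(trim_common_prefix(paths))  # ['e/xxxxxx.wav', 'f/yyyyyy.wav']
```

The trimmed names stay unique within one call, but the same trimmed name can appear again when a second dataset with a different prefix is trimmed separately.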

No, let's not do that. Many ASR datasets in mcc and MLS have a lot of common folders.

How about changing the code to the following? We just keep the original path if the input file path is 255 characters or longer.

if (len(str(audio_file)) < 255) and not audio_file.is_file() and not audio_file.is_absolute():

No, this will not work with ASR data loaders.

They need an exact match to the file name, which must be globally unique.

There might be some confusion here (or maybe I am the confused one!). The 255 character limit refers to individual "nodes" in the file system tree. You can certainly have a path to a file with a length greater than 255, assuming there are directories leading to it. It's just that each file name and directory name must individually not exceed 255 characters.
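The distinction between total path length and per-component length can be checked directly. This is a rough sketch: the value 255 is the figure quoted in this thread (real limits depend on the filesystem and are usually counted in bytes, not characters), and `longest_component` is a hypothetical helper.

```python
from pathlib import PurePosixPath

NAME_MAX = 255  # per-component limit quoted in the thread; an assumption here


def longest_component(path: str) -> int:
    """Length of the longest single directory or file name in `path`."""
    return max(len(part) for part in PurePosixPath(path).parts)


# Four 100-character directories followed by a short file name.
deep_path = "/" + "/".join(["d" * 100] * 4) + "/utt.wav"
print(len(deep_path))                           # 412
print(longest_component(deep_path))             # 100
print(longest_component(deep_path) <= NAME_MAX)  # True
```

A check on `len(str(audio_file))` is therefore stricter than the per-component limit: it also skips long but valid paths.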

Yes, and ASR requires an exact full-path match for every file in the manifest, and the paths must be globally unique. The problem with slicing off part of the path is that there are cases where the result is no longer globally unique.

Yes, I understand.

    if (len(str(audio_file)) < 255) and not audio_file.is_file() and not audio_file.is_absolute():
        audio_file = manifest_dir / audio_file
-        item['audio_file'] = str(audio_file.absolute())
+        if audio_file.is_file():
+            item['audio_file'] = str(audio_file.absolute())
+        else:
+            item['audio_file'] = expanduser(item['audio_file'])
    else:
        item['audio_file'] = expanduser(item['audio_file'])

> No, this will not work with ASR data loaders.

@titu1994 Could you please help me understand why this won't work? If a file name is 255 characters or longer, we just use the given file name and don't add the prefix. Would this break something?
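For reference, the behavior described in this comment can be written as a small standalone function. This is a sketch under assumptions, not the code in `manifest.py`: `resolve_audio_path` and `max_len` are hypothetical names, and the function returns the path instead of mutating `item`.

```python
import tempfile
from os.path import expanduser
from pathlib import Path


def resolve_audio_path(audio_path: str, manifest_file: str, max_len: int = 255) -> str:
    """Prepend the manifest directory only when the resulting file exists."""
    manifest_dir = Path(manifest_file).parent
    audio_file = Path(audio_path)
    # Long paths are returned unchanged before any filesystem check runs.
    if len(str(audio_file)) < max_len and not audio_file.is_file() and not audio_file.is_absolute():
        candidate = manifest_dir / audio_file
        if candidate.is_file():
            return str(candidate.absolute())
    # Tar keys, absolute paths, and missing files keep their original form.
    return expanduser(audio_path)


with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    (root / "wavs").mkdir()
    (root / "wavs" / "a.wav").touch()
    manifest = str(root / "manifest.json")

    resolved = resolve_audio_path("wavs/a.wav", manifest)
    tar_key = resolve_audio_path("speaker1/chapter2/utt3.wav", manifest)
    long_key = resolve_audio_path("k" * 300 + ".wav", manifest)

print(resolved)  # absolute path ending in wavs/a.wav
print(tar_key)   # speaker1/chapter2/utt3.wav
```

In this sketch a key containing "/" that does not exist next to the manifest is passed through untouched, which is the case the PR description is about.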

* update
  Signed-off-by: stevehuang52 <[email protected]>
* update
  Signed-off-by: stevehuang52 <[email protected]>
* fix typo
  Signed-off-by: stevehuang52 <[email protected]>

Signed-off-by: arendu <[email protected]>

* update
  Signed-off-by: stevehuang52 <[email protected]>
* update
  Signed-off-by: stevehuang52 <[email protected]>
* fix typo
  Signed-off-by: stevehuang52 <[email protected]>

Signed-off-by: David Mosallanezhad <[email protected]>

* update
  Signed-off-by: stevehuang52 <[email protected]>
* update
  Signed-off-by: stevehuang52 <[email protected]>
* fix typo
  Signed-off-by: stevehuang52 <[email protected]>

Signed-off-by: Hainan Xu <[email protected]>

update

2b54293

Signed-off-by: stevehuang52 <[email protected]>

stevehuang52 requested a review from titu1994 June 29, 2022 20:40

update

0edb39b

Signed-off-by: stevehuang52 <[email protected]>

stevehuang52 added the fix label Jun 29, 2022

stevehuang52 self-assigned this Jun 29, 2022

fix typo

10ed011

Signed-off-by: stevehuang52 <[email protected]>

stevehuang52 mentioned this pull request Jun 29, 2022

Tar file datasets don't allow for "/"in key names for ASR training. #4455

Closed

titu1994 approved these changes Jun 30, 2022

titu1994 merged commit 550e468 into NVIDIA:main Jun 30, 2022

stevehuang52 mentioned this pull request Aug 25, 2022

WIP Fix for #4455 #4456

Closed

8 tasks

stevehuang52 commented Jun 29, 2022

titu1994 left a comment

titu1994 Jun 30, 2022

stevehuang52 Jun 30, 2022

titu1994 Jun 30, 2022

stevehuang52 Jun 30, 2022

titu1994 Jun 30, 2022

titu1994 Jun 30, 2022

galv Jun 30, 2022

titu1994 Jun 30, 2022

galv Jul 1, 2022

stevehuang52 Jul 1, 2022 (edited)
