[Fix] Relative Audio Paths #4470

Merged
merged 3 commits into from
Jun 30, 2022

Conversation

stevehuang52
Collaborator

Signed-off-by: stevehuang52 [email protected]

This PR fixes the second problem reported in issue #4455.

Previously, the code failed when the file paths in tarred datasets contained "/". Now the manifest directory is prepended to the audio file path only if the resulting file exists, which fixes the problem.
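
As a rough illustration of the behavior described above, here is a minimal, self-contained sketch. The function name `resolve_audio_path` and its signature are hypothetical and not part of the NeMo code base; only the variable names `manifest_dir` and `audio_file` and the use of `expanduser` follow the snippets in this PR.

```python
from os.path import expanduser
from pathlib import Path


def resolve_audio_path(audio_path: str, manifest_file: str) -> str:
    """Hypothetical helper: resolve a manifest entry to a usable audio path.

    A relative path is joined with the manifest directory only when the
    joined path points to an existing file. Otherwise the entry is kept
    (with "~" expanded), which is what entries inside a tarred dataset need.
    """
    manifest_dir = Path(manifest_file).parent
    audio_file = Path(audio_path)

    if not audio_file.is_file() and not audio_file.is_absolute():
        candidate = manifest_dir / audio_file
        if candidate.is_file():
            # The audio lives next to the manifest: use its absolute path.
            return str(candidate.absolute())

    # Absolute paths, existing files, and tar member names are left as given.
    return expanduser(audio_path)
```

With this sketch, an entry such as `wavs/a.wav` is rewritten only if `wavs/a.wav` exists beside the manifest, while a tar member name containing "/" is returned unchanged.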

Signed-off-by: stevehuang52 <[email protected]>
Signed-off-by: stevehuang52 <[email protected]>
@stevehuang52 stevehuang52 self-assigned this Jun 29, 2022
Signed-off-by: stevehuang52 <[email protected]>
Collaborator

@titu1994 titu1994 left a comment

Overall looks good. Minor comments below.

manifest_dir = Path(manifest_file).parent
audio_file = Path(item['audio_file'])
if not audio_file.is_file() and not audio_file.is_absolute() and audio_file.parent != Path("."):
# assume the wavs/ dir and manifest are under the same parent dir
if not audio_file.is_file() and not audio_file.is_absolute():
Collaborator

Note that I think even this is_file() check will fail for file names that are too long. Any ideas?
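
One possible way to guard against this concern is sketched below. Depending on the Python version and platform, `Path.is_file()` may raise `OSError` (for example "File name too long") instead of returning `False` when a path component exceeds the file system's limit. The helper name `safe_is_file` is hypothetical and is not part of this PR.

```python
from pathlib import Path


def safe_is_file(path) -> bool:
    """Hypothetical helper: like Path.is_file(), but never raises OSError.

    Any OS-level failure while checking the path (for example a component
    that is too long for the file system) is treated as "not a file".
    """
    try:
        return Path(path).is_file()
    except OSError:
        return False
```

Treating such errors as "not a file" means an overly long entry simply falls through to whatever fallback the caller uses, rather than aborting manifest parsing.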

Collaborator Author

I think we can try to avoid file names that are too long. When creating tarred datasets, we can trim the common path prefix shared by all audio files and use the trimmed paths as the new file names.

For example,

/a/b/c/d/e/xxxxxx.wav
/a/b/c/d/f/yyyyyy.wav

can be trimmed as

e/xxxxxx.wav
f/yyyyyy.wav
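
The trimming proposed in this comment could be sketched as follows. This is only an illustration of the idea, not code from the PR; the function name `trim_common_prefix` is hypothetical, and POSIX-style paths are assumed.

```python
import posixpath


def trim_common_prefix(paths):
    """Hypothetical helper: strip the directory prefix shared by all paths.

    Assumes POSIX-style paths and at least two distinct entries; with a
    single path the common prefix is the path itself and the result is ".".
    """
    common = posixpath.commonpath(paths)
    return [posixpath.relpath(p, common) for p in paths]


trimmed = trim_common_prefix([
    "/a/b/c/d/e/xxxxxx.wav",
    "/a/b/c/d/f/yyyyyy.wav",
])
```

For the two paths in the example above, the shared prefix is `/a/b/c/d`, so the trimmed names are `e/xxxxxx.wav` and `f/yyyyyy.wav`.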

Collaborator

No, let's not do that. There are many ASR datasets in mcc and MLS that share a lot of common folders.

Collaborator Author

How about changing the code to the following? We just keep the original path if the input file is 255 chars or longer.

if len(str(audio_file)) < 255 and not audio_file.is_file() and not audio_file.is_absolute():

Collaborator

No, this will not work with ASR data loaders.

Collaborator

They need an exact match to the file name, which is globally unique.

Collaborator

There might be some confusion here (or maybe I am the confused one!). The 255 character limit refers to individual "nodes" in the file system tree. You can certainly have a path to a file with a length greater than 255, assuming there are directories leading to it. It's just that each file name and directory name must individually not exceed 255 characters.
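
The distinction drawn in this comment can be made concrete with a small sketch that checks each path component separately instead of the whole path. The 255 limit is an assumption here (it is the usual per-name limit, in bytes, on common Linux file systems); the name `oversized_components` is hypothetical.

```python
from pathlib import PurePosixPath

# Assumption: a per-component limit of 255 bytes, as on common Linux file systems.
NAME_MAX = 255


def oversized_components(path, name_max=NAME_MAX):
    """Hypothetical helper: list the path components longer than name_max bytes."""
    return [
        part
        for part in PurePosixPath(path).parts
        if len(part.encode("utf-8")) > name_max
    ]


# A long path made of short components: well over 255 characters in total.
deep_path = "/" + "/".join(["d" * 100] * 5) + "/clip.wav"
# A short path with one overly long file name.
long_name_path = "/data/" + "x" * 300 + ".wav"
```

Here `deep_path` yields no oversized components even though the full string is longer than 255 characters, while `long_name_path` is flagged because a single name exceeds the limit. A check on `len(str(audio_file))` would therefore reject some paths that are valid.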

Collaborator

Yes, and ASR requires an exact full-path match for every file in the manifest, and the paths must be globally unique. The issue with slicing off the common prefix is that in some cases the paths are no longer globally unique.
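
One way the uniqueness concern could arise is sketched below, under the assumption that each dataset is trimmed independently and the results are later combined. The dataset paths and helper names are made up for illustration.

```python
import posixpath
from collections import Counter


def trim_common_prefix(paths):
    """Hypothetical helper: strip the directory prefix shared by all paths."""
    common = posixpath.commonpath(paths)
    return [posixpath.relpath(p, common) for p in paths]


def find_duplicates(names):
    """Return the names that occur more than once, sorted."""
    return sorted(name for name, count in Counter(names).items() if count > 1)


# Two made-up datasets whose full paths are globally unique.
dataset_a = ["/data/A/spk1/0001.wav", "/data/A/spk2/0001.wav"]
dataset_b = ["/data/B/spk1/0001.wav", "/data/B/spk2/0002.wav"]

# Trimming each dataset on its own drops the part that told them apart.
merged = trim_common_prefix(dataset_a) + trim_common_prefix(dataset_b)
duplicates = find_duplicates(merged)
```

After trimming, both datasets contain `spk1/0001.wav`, so the combined list has a duplicate even though the original full paths had none.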

Collaborator

Yes, I understand.

Collaborator Author

@stevehuang52 stevehuang52 Jul 1, 2022

# Only try the manifest-relative path for short, relative paths that do not already exist.
if len(str(audio_file)) < 255 and not audio_file.is_file() and not audio_file.is_absolute():
    audio_file = manifest_dir / audio_file
    if audio_file.is_file():
        item['audio_file'] = str(audio_file.absolute())
    else:
        # Keep the original path, e.g. for files stored inside a tarred dataset.
        item['audio_file'] = expanduser(item['audio_file'])
else:
    item['audio_file'] = expanduser(item['audio_file'])

> No, this will not work with ASR data loaders.

@titu1994 Could you please help me understand why this won't work? If a file name is 255 chars or longer, we just use the given file name and don't add the prefix. Would this break something?

@titu1994 titu1994 merged commit 550e468 into NVIDIA:main Jun 30, 2022
arendu pushed a commit that referenced this pull request Jul 21, 2022
* update

Signed-off-by: stevehuang52 <[email protected]>

* update

Signed-off-by: stevehuang52 <[email protected]>

* fix typo

Signed-off-by: stevehuang52 <[email protected]>
Signed-off-by: arendu <[email protected]>
Davood-M pushed a commit to Davood-M/NeMo that referenced this pull request Aug 9, 2022
Signed-off-by: David Mosallanezhad <[email protected]>
@stevehuang52 stevehuang52 mentioned this pull request Aug 25, 2022
8 tasks
hainan-xv pushed a commit to hainan-xv/NeMo that referenced this pull request Nov 29, 2022
Signed-off-by: Hainan Xu <[email protected]>