.ctm in data simulator annotator compliant with RT-09 specification #8004

Merged
merged 47 commits into from
Jan 8, 2024

Conversation

popcornell
Contributor

Redid this PR from #7999

An attempt to fix #7445 so that the data simulator .ctm are compliant with RT-09 specification (see https://web.archive.org/web/20170119114252/http://www.itl.nist.gov/iad/mig/tests/rt/2009/docs/rt09-meeting-eval-plan-v2.pdf):

<SOURCE><SP><CHANNEL><SP><BEG-TIME><SP><DURATION><SP><TOKEN><SP><CONF><SP><TYPE><SP><SPEAKER><NEWLINE>

I have put NA for the unknown fields, e.g. <CONF> and <TYPE>.
This also makes it easy to use the generated sessions with https://github.com/lhotse-speech/lhotse
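
For illustration, here is a minimal sketch of writing one line in the eight-field layout above. The function name `format_ctm_line`, its parameters, and the exact spelling of the `NA` placeholder are assumptions made for this example; this is not NeMo's actual helper.

```python
from typing import Optional

# Placeholder for fields that are unknown (assumed spelling).
NA = "NA"


def format_ctm_line(
    source: str,
    channel: int,
    beg_time: float,
    duration: float,
    token: str,
    conf: Optional[float] = None,
    type_of_token: Optional[str] = None,
    speaker: Optional[str] = None,
    precision: int = 2,
) -> str:
    """Build one CTM line with all eight RT-09 fields (hypothetical helper)."""
    conf_str = NA if conf is None else f"{conf:.{precision}f}"
    fields = [
        source,
        str(channel),
        f"{beg_time:.{precision}f}",
        f"{duration:.{precision}f}",
        token,
        conf_str,
        type_of_token or NA,
        speaker or NA,
    ]
    # Fields are separated by single spaces and the line ends with a newline.
    return " ".join(fields) + "\n"


line = format_ctm_line("session_0", 1, 1.5, 0.25, "hello", speaker="spk1")
print(line, end="")  # session_0 1 1.50 0.25 hello NA NA spk1
```

Channel is written as 1 here, following the "channel should be 1 not 0" commit in this PR.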

@github-actions github-actions bot added the ASR label Dec 9, 2023
@tango4j
Collaborator

tango4j commented Dec 11, 2023

NeMo contributors, please do not merge this PR until I make sure all the CTM in NeMo is following the official CTM format.

@popcornell
Contributor Author

It seems that this is also not compliant (speaker ID instead of channel):

https://github.com/NVIDIA/NeMo/blob/fa8d416793d850f4ce56bea65e1fe28cc0d092c0/scripts/speaker_tasks/create_alignment_manifest.py#L74C79-L74C79

This one, by contrast, is roughly compliant, but it omits the trailing fields (confidence, type, and speaker):

f_ctm.write(f"{utt_obj.utt_id} 1 {start_time:.2f} {end_time - start_time:.2f} {text}\n")
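
A small checker makes the gap concrete: the line written above has five fields, while the RT-09 layout has eight. The sketch below is hypothetical (`CtmEntry` and `parse_ctm_line` are names invented for this example) and assumes that tokens contain no whitespace.

```python
from typing import NamedTuple


class CtmEntry(NamedTuple):
    """One parsed CTM line (hypothetical container for this example)."""

    source: str
    channel: str
    beg_time: float
    duration: float
    token: str
    conf: str
    type_of_token: str
    speaker: str


def parse_ctm_line(line: str) -> CtmEntry:
    """Split a CTM line on whitespace and require exactly eight fields."""
    parts = line.split()
    if len(parts) != 8:
        raise ValueError(f"expected 8 CTM fields, got {len(parts)}: {line!r}")
    source, channel, beg, dur, token, conf, type_of_token, speaker = parts
    # Times are converted to float; conf stays a string because it may be a placeholder.
    return CtmEntry(source, channel, float(beg), float(dur), token, conf, type_of_token, speaker)


entry = parse_ctm_line("utt_1 1 0.50 0.25 hello NA NA spk1\n")
print(entry.token, entry.speaker)  # hello spk1
```

A five-field line such as `utt_1 1 0.50 0.25 hello` raises `ValueError` with this check.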

@tango4j
Collaborator

tango4j commented Dec 12, 2023

@erastorgueva-nv Elena, I found a line that writes CTM output and replaced it with get_ctm_line, to remove the duplication when users create CTM files with NeMo. Please check that nothing is wrong and that this does not cause any errors. (There are no unit tests for make_ctm_files.py.)

@stevehuang52 We are also making slight changes to data simulator. Please review and approve.

Comment on lines 77 to 80
if type(beg_time) != float:
    beg_time = round(float(beg_time), output_precision)
if type(duration) != float:
    duration = round(float(duration), output_precision)
Collaborator

Currently, beg_time and duration do not get rounded if they are floats already. Please remove the if-statements, I don't think they are necessary.

Collaborator

Updated to always round the number. It now also checks that beg_time is either a float or a string containing a floating-point number.
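
The behaviour described in this exchange can be sketched as follows. This is not the code that was merged: `to_rounded_seconds` is a hypothetical name, and the default precision of 2 is an assumption taken from the "decimal rounding of 2" commit message.

```python
from typing import Union


def to_rounded_seconds(value: Union[float, int, str], output_precision: int = 2) -> float:
    """Accept a number or a numeric string and always round it (hypothetical helper)."""
    # bool is a subclass of int, so it is rejected explicitly.
    if isinstance(value, bool) or not isinstance(value, (float, int, str)):
        raise TypeError(f"expected a number or numeric string, got {type(value).__name__}")
    try:
        number = float(value)
    except ValueError as err:
        raise ValueError(f"not a floating point number: {value!r}") from err
    # Rounding is unconditional, so values that are already floats are rounded too.
    return round(number, output_precision)


print(to_rounded_seconds(1.23456))   # 1.23
print(to_rounded_seconds("0.5678"))  # 0.57
```

With the original if-statements, a float such as 1.23456 would have been passed through unrounded, which is the problem the reviewer pointed out.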

tango4j
tango4j previously approved these changes Dec 29, 2023
Collaborator

@tango4j tango4j left a comment

Approving this, since this PR went through several rounds of review and feedback.

@titu1994
Collaborator

jenkins

@titu1994
Collaborator

@tango4j / reviewers, merge when ready.

Also reminder, NeMo devs need to explicitly write "jenkins" in order to execute the CI

@tango4j
Collaborator

tango4j commented Dec 29, 2023

@tango4j / reviewers, merge when ready.

Also reminder, NeMo devs need to explicitly write "jenkins" in order to execute the CI

Oh, when did the protocol change?
Thanks for the heads-up. From now on, I will write "jenkins".

@stevehuang52
Collaborator

jenkins

@tango4j
Collaborator

tango4j commented Jan 2, 2024

jenkins

Signed-off-by: Taejin Park <[email protected]>
@tango4j
Collaborator

tango4j commented Jan 5, 2024

jenkins

@tango4j
Collaborator

tango4j commented Jan 5, 2024

jenkins

@tango4j
Collaborator

tango4j commented Jan 6, 2024

jenkins

Collaborator

@tango4j tango4j left a comment

Approving changes

@tango4j tango4j merged commit ef6ed61 into NVIDIA:main Jan 8, 2024
11 checks passed
minitu pushed a commit to minitu/NeMo that referenced this pull request Jan 19, 2024
…VIDIA#8004)

* .ctm fix for data simulation

Signed-off-by: popcornell <[email protected]>

* .ctm fix, channel should be 1 not 0

Signed-off-by: popcornell <[email protected]>

* .ctm fix, only two na, type and confidence

Signed-off-by: popcornell <[email protected]>

* Revised all the parts in NeMo touching CTM files

Signed-off-by: Taejin Park <[email protected]>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* Updated tutorial, nemo-docs and tests for CTM formats

Signed-off-by: Taejin Park <[email protected]>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* Fixed the docstrings in create_alignment_manifest.py

Signed-off-by: Taejin Park <[email protected]>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* Some missing refactored variables for type_of_token

Signed-off-by: Taejin Park <[email protected]>

* Another un-fixed part in data_simulation_utils.py

Signed-off-by: Taejin Park <[email protected]>

* Reflected comments from PR

Signed-off-by: Taejin Park <[email protected]>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* Reflected another precision related comments from PR

Signed-off-by: Taejin Park <[email protected]>

* Updated tests to use decimal rounding of 2

Signed-off-by: Taejin Park <[email protected]>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* Changed beg_time to start_time and fixed unit tests

Signed-off-by: Taejin Park <[email protected]>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* Fixed typos and errors in manifest_utils.py

Signed-off-by: Taejin Park <[email protected]>

* Resolved another merge conflict

Signed-off-by: Taejin Park <[email protected]>

* Fixed the test errors

Signed-off-by: Taejin Park <[email protected]>

* Fixed the missed commented lines

Signed-off-by: Taejin Park <[email protected]>

---------

Signed-off-by: popcornell <[email protected]>
Signed-off-by: Taejin Park <[email protected]>
Co-authored-by: Taejin Park <[email protected]>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
Co-authored-by: He Huang (Steve) <[email protected]>
ssh-meister pushed a commit to ssh-meister/NeMo that referenced this pull request Feb 15, 2024
…VIDIA#8004)

Signed-off-by: Sasha Meister <[email protected]>
rohitrango pushed a commit to rohitrango/NeMo that referenced this pull request Jun 25, 2024
…VIDIA#8004)

Development

Successfully merging this pull request may close these issues.

Output .ctm of Speech Data Simulator has channel and spk_id swapped
5 participants