Function to drop alignments from cut #1019

Merged
merged 3 commits into from
Apr 6, 2023


desh2608
Collaborator

@desh2608 desh2608 commented Apr 6, 2023

This is just a helper function to drop alignments if they are not required. I have been using it in my multi-talker dataset class to reduce the size of the returned batch. Might be useful until such time as we decide to store alignments in a separate archive, like features.
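As a rough illustration of what dropping alignments amounts to, here is a minimal sketch. It uses stand-in dataclasses (`Supervision`, `Cut`) and a hypothetical `drop_alignments` function rather than Lhotse's actual classes or the code in this PR; the only assumption is that each supervision carries an optional `alignment` dict.

```python
from dataclasses import dataclass, field, replace
from typing import Dict, List, Optional, Tuple

# Stand-in for an alignment item: (symbol, start, duration).
AliItem = Tuple[str, float, float]


@dataclass(frozen=True)
class Supervision:
    # Hypothetical stand-in, not lhotse.SupervisionSegment.
    id: str
    text: str
    alignment: Optional[Dict[str, List[AliItem]]] = None


@dataclass(frozen=True)
class Cut:
    # Hypothetical stand-in, not a Lhotse cut.
    id: str
    supervisions: List[Supervision] = field(default_factory=list)


def drop_alignments(cut: Cut) -> Cut:
    """Return a copy of the cut whose supervisions carry no alignments."""
    return replace(
        cut,
        supervisions=[replace(s, alignment=None) for s in cut.supervisions],
    )


cut = Cut(
    id="cut-1",
    supervisions=[
        Supervision(
            id="sup-1",
            text="hello world",
            alignment={"word": [("hello", 0.0, 0.4), ("world", 0.5, 0.4)]},
        )
    ],
)
light = drop_alignments(cut)  # the original cut is left untouched
```

Returning a copy instead of mutating in place means the full cut stays available elsewhere, while the batch only carries the lighter version.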

@pzelasko pzelasko added this to the v1.14 milestone Apr 6, 2023
@pzelasko
Collaborator

pzelasko commented Apr 6, 2023

Might be useful until such time as we decide to store alignments in a separate archive, like features.

This should be possible to do with Shar using something like:

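# Writing
from lhotse import CutSet
from lhotse.shar import JsonlShardWriter

cuts = CutSet.from_file(...)
with JsonlShardWriter("alis.jsonl.gz", shard_size=None) as w:
    for cut in cuts:
        # get_ali is a placeholder for however you extract the alignment dict from a cut
        ali = get_ali(cut)
        if ali is not None:
            w.write({"cut_id": cut.id, "word_alignment": [item.serialize() for item in ali["word"]]})
        else:
            # No alignment for this cut: write a placeholder so entries stay in step with the cuts
            w.write_placeholder(cut.id)

# Reading (each field maps to a list of shard paths)
cuts = CutSet.from_shar({"cuts": ["path/to/cuts.jsonl.gz"], "word_alignment": ["alis.jsonl.gz"]})
for cut in cuts:
    ali = cut.word_alignment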

It's a bit raw, because currently we can't attach the alignment to the supervision with Shar; it's attached to the cut instead. To fix that, either:

  • we'd support attaching Shar fields to supervisions, or
  • we'd rewrite Lhotse a bit to support alignments being a property of the cut rather than the supervision (which would probably cause problems for users, so naturally I don't like it; I'm also not sure it makes sense in the overall design)
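
Until one of these is supported, a possible workaround on the reading side is to copy the cut-level field back onto the supervision. The sketch below uses plain dicts as stand-ins for cuts and supervisions and assumes each alignment item was serialized as a `[symbol, start, duration]` list; `attach_word_alignment` is a hypothetical helper, not part of Lhotse, and it only handles cuts with a single supervision.

```python
from typing import Any, Dict, List


def attach_word_alignment(cut: Dict[str, Any]) -> Dict[str, Any]:
    """Move a cut-level "word_alignment" field onto the cut's only supervision.

    Cuts without alignment data (e.g. placeholders) are returned unchanged.
    """
    items: List[list] = cut.get("word_alignment") or []
    if not items:
        return cut
    if len(cut["supervisions"]) != 1:
        raise ValueError("This sketch only handles single-supervision cuts.")
    # Copy the supervision so the input cut is not modified.
    sup = dict(cut["supervisions"][0])
    sup["alignment"] = {"word": [tuple(item) for item in items]}
    # Build a new cut dict without the cut-level field.
    new_cut = {k: v for k, v in cut.items() if k != "word_alignment"}
    new_cut["supervisions"] = [sup]
    return new_cut


cut = {
    "id": "cut-1",
    "supervisions": [{"id": "sup-1", "text": "hello world"}],
    "word_alignment": [["hello", 0.0, 0.4], ["world", 0.5, 0.4]],
}
fixed = attach_word_alignment(cut)
```

For multi-supervision cuts, the items would need to be assigned to supervisions some other way (for example by time overlap), which this sketch leaves out.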

@pzelasko
Collaborator

pzelasko commented Apr 6, 2023

Anyway merging, this is useful :)

@pzelasko pzelasko merged commit 516551c into lhotse-speech:master Apr 6, 2023
@desh2608 desh2608 deleted the drop_alignments branch November 2, 2023 19:17