Research existing datasets #2

klh5 · 2024-11-08T15:24:22Z

Are there any existing datasets derived from human speech we could use?

jack89roberts · 2024-11-08T15:29:43Z

Particularly interested in conversational style, and noisy/messy data (lots of filler words, differences in transcription etc.).

klh5 · 2024-11-12T08:38:31Z

Also worth investigating the datasets used as part of the MTQE project.

klh5 · 2024-11-14T14:01:45Z

Other datasets that could be useful include those used for disfluency detection, in particular Disfl-QA, which includes both an original question (derived from SQuADv2) and a human-altered version of that question containing added disfluencies.

klh5 · 2024-11-19T10:49:25Z

I've tried to summarize the existing datasets from the literature here.

None of these fulfil all of our requirements. The WMT dataset provides a "gold standard" human score but no reference translation.

DISCO could provide an easy way to show the effect of different disfluency types. The dataset as distributed only provides fluent English translations, not translations of the original imperfect speech, so we would need to pick a translation model to produce translations of the disfluent text ourselves.
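A rough sketch of what that could look like, assuming each record gives us the disfluent source text plus the fluent English translation. The field names, the `translate` callable, and the character-level similarity are all placeholders for illustration, not the real DISCO schema or a proper MT metric:

```python
from dataclasses import dataclass
from difflib import SequenceMatcher
from typing import Callable, Iterable, List


@dataclass
class Pair:
    # Hypothetical record layout; the real DISCO field names may differ.
    disfluent_source: str
    fluent_english: str


def score_disfluent_translations(
    pairs: Iterable[Pair], translate: Callable[[str], str]
) -> List[float]:
    """Translate each disfluent source and compare it with the fluent English text.

    Returns one similarity value in [0, 1] per pair.
    """
    scores = []
    for pair in pairs:
        hypothesis = translate(pair.disfluent_source)
        # Character-level similarity is a crude stand-in for a real MT metric.
        ratio = SequenceMatcher(None, hypothesis, pair.fluent_english).ratio()
        scores.append(ratio)
    return scores


def toy_translate(text: str) -> str:
    # Placeholder "model" that returns its input unchanged.
    return text
```

Swapping `toy_translate` for whichever translation model we pick, and the similarity for whichever metric we settle on, would let us group scores by disfluency type.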

klh5 self-assigned this Nov 8, 2024

klh5 added research datasets labels Nov 12, 2024
