Reduce monolingual data for da-en to investigate distillation performance #771
Comments
I wrote a hacky truncation script to try this out and reuse the cached artifacts. https://github.com/mozilla/firefox-translations-training/tree/da-en-experiment
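The actual script lives in the branch linked above; the following is only a minimal sketch of the idea, assuming a hypothetical gzipped corpus file and a simple head-style truncation to a fraction of its lines:

```python
import gzip

def truncate_corpus(src_path: str, dst_path: str, keep_fraction: float) -> None:
    """Keep only the first `keep_fraction` of lines from a gzipped corpus file.

    Hypothetical illustration only, not the script used in the branch above.
    """
    with gzip.open(src_path, "rt", encoding="utf-8") as f:
        lines = f.readlines()
    keep = int(len(lines) * keep_fraction)
    with gzip.open(dst_path, "wt", encoding="utf-8") as f:
        f.writelines(lines[:keep])

# e.g. produce a 25% split (file names here are made up for the example)
truncate_corpus("mono.da.gz", "mono.25pct.da.gz", 0.25)
```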
Status: All 3 students are training |
So I screwed up the truncation script, and everything trained the same. I'm re-running things. 25% train-student
I'm starting another attempt on the latest main, and using
I'm running another experiment with 1 million and 10,000 sentences as a confidence check on my experiment. The 10,000 run looks like it's correctly training badly, so it appears my truncation script was working. I'll let them finish to have more data points.
Here is a W&B view for the runs
These results seem very interesting to me. I believe the fact that NLLB and Paracrawl are full of redundant and repetitive data has something to do with this. If there is interest in finding a better way to sample, I think n-gram saturation (ranking lower the sentences that have a significant portion of their 2-grams or 3-grams already present in the corpus) could be worth exploring.
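A minimal sketch of what such n-gram saturation scoring could look like (a hypothetical helper, not part of the pipeline): sentences whose 3-grams are mostly already covered by previously seen sentences get pushed toward the end of the sampling order, so truncation removes redundant data first.

```python
from typing import Iterable

def ngrams(tokens: list[str], n: int) -> set[tuple[str, ...]]:
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}

def saturation_order(sentences: Iterable[str], n: int = 3) -> list[str]:
    """Greedy pass: sentences that add new n-grams come first, saturated ones last.

    Hypothetical illustration of the n-gram saturation idea; a real
    implementation would need proper tokenization, normalization, and batching.
    """
    seen: set[tuple[str, ...]] = set()
    fresh, saturated = [], []
    for sentence in sentences:
        grams = ngrams(sentence.split(), n)
        if not grams:
            saturated.append(sentence)
            continue
        novelty = len(grams - seen) / len(grams)
        (fresh if novelty > 0.5 else saturated).append(sentence)
        seen |= grams
    # Fresh sentences first; redundant ones can be truncated away.
    return fresh + saturated
```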
An experiment for #231
da-en is one of our best models from the spring-2024 run. The teacher ensemble had a COMET score of 0.9013, and the student's COMET was 0.8950, a tiny -0.0063 gap. In order to understand the role of the quantity of monolingual data, this experiment artificially reduces the amount of data to observe the effects on distillation.
Experiment splits:
(Edit: I changed my technique to limit the distillation corpus size rather than strictly the monolingual size, so my percentages are off from my original 100%, 75%, 50%, and 25% splits. The distillation corpus is formed from the `da` side of the parallel corpus and the `da` monolingual data.)

| Distillation sentences | % of full corpus | Student COMET | Δ vs teacher | Δ vs 100% student |
| --- | --- | --- | --- | --- |
| 177,346,837 | 100% | 89.50 | -0.63 | |
| 92,218,925 | 52% | 89.44 | -0.69 | -0.06 |
| 61,479,283 | 35% | 89.31 | -0.82 | -0.19 |
| 30,739,641 | 17% | 89.44 | -0.69 | -0.06 |
| 1,000,000 | 0.5% | 85.79 | -3.71 | -3.71 |
| 10,000 | | 39.57 | -50.56 | -49.93 |
| Distillation sentences | % of full corpus | Score | Δ vs 100% student |
| --- | --- | --- | --- |
| 177,346,837 | 100% | 70.74 | |
| 92,218,925 | 52% | 70.67 | -0.07 |
| 61,479,283 | 35% | 70.75 | +0.01 |
| 30,739,641 | 17% | 70.63 | -0.11 |
| 1,000,000 | 0.5% | 66.73 | -4.01 |
| 10,000 | | 26.94 | -43.8 |
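For reference, the COMET numbers above are model scores scaled by 100. A minimal sketch of producing a system-level COMET score with the unbabel-comet package (the checkpoint name here is an assumption; the pipeline's evaluation setup may differ):

```python
# pip install unbabel-comet
from comet import download_model, load_from_checkpoint

# Assumed checkpoint; the training pipeline may evaluate with a different one.
model = load_from_checkpoint(download_model("Unbabel/wmt22-comet-da"))

data = [
    {
        "src": "Hvor er biblioteket?",   # Danish source
        "mt": "Where is the library?",    # student (or teacher) output
        "ref": "Where is the library?",   # reference translation
    },
]

output = model.predict(data, batch_size=8, gpus=0)
print(output.system_score * 100)  # same scale as the tables above, e.g. 89.50
```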
Hypothesis:
Reducing the monolingual data will decrease the COMET score, but there is a benefit: it would decrease cost, since less training data would need to be synthesized for distillation.
Results
The number of sentences used for distillation did not measurably affect COMET scores above a certain threshold. Here, 50 million sentences seems like a safe threshold with no measurable loss. We would want to adjust this threshold carefully, as it could be that COMET on our limited evaluation set does not capture the breadth of information diversity that is on the web.
It's also unclear whether this result will scale to other languages, as this experiment translates into English, which is morphologically quite simple compared to other languages. I filed #915 to investigate this. It's also unclear what happens when we use a bigger model, and whether it is able to distill more information and thus requires more data.
Links: