
Reduce monolingual data for da-en to investigate distillation performance #771

Closed
gregtatum opened this issue Jul 29, 2024 · 8 comments

@gregtatum commented Jul 29, 2024

An experiment for #231

da-en is one of our best models from the spring-2024 run. The teacher ensemble had a COMET score of 0.9013, and the student scored 0.8950, a tiny -0.0063 gap. In order to understand the role of the quantity of monolingual data, this experiment artificially reduces the amount of data and observes the effects on distillation.

Experiment splits:

| Metric | Value |
| --- | --- |
| Teacher COMET | 90.13 |
| Original parallel corpus | 77,673,571 sentences |
| Original distillation corpus | 177,346,837 sentences |
| Monolingual corpus | 122,958,567 sentences |

(Edit: I changed my technique to limit the total distillation corpus size rather than strictly the monolingual size, so the percentages are off from my original 100%, 75%, 50%, and 25% splits. The distillation corpus is formed from the da side of the parallel corpus plus the da monolingual data.)

| Sentences | Percent | Student COMET | Teacher Gap | Vs Baseline |
| --- | --- | --- | --- | --- |
| 177,346,837 | 100% | 89.50 | -0.63 | - |
| 92,218,925 | 52% | 89.44 | -0.69 | -0.06 |
| 61,479,283 | 35% | 89.31 | -0.82 | -0.19 |
| 30,739,641 | 17% | 89.44 | -0.69 | -0.06 |
| 1,000,000 | 0.5% | 85.79 | -3.71 | -3.71 |
| 10,000 | - | 39.57 | -50.56 | -49.93 |

| Sentences | Percent | Student chrF | Vs Baseline |
| --- | --- | --- | --- |
| 177,346,837 | 100% | 70.74 | - |
| 92,218,925 | 52% | 70.67 | -0.07 |
| 61,479,283 | 35% | 70.75 | +0.01 |
| 30,739,641 | 17% | 70.63 | -0.11 |
| 1,000,000 | 0.5% | 66.73 | -4.01 |
| 10,000 | - | 26.94 | -43.80 |
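
For reference, the Percent and Vs Baseline columns are simple arithmetic over the raw counts and scores. A minimal sketch of that derivation (values copied from the tables above; rounding may differ slightly from the issue):

```python
# Derive "Percent" and "Vs Baseline" from the raw sentence counts and COMET scores.
FULL_DISTILLATION_CORPUS = 177_346_837
BASELINE_STUDENT_COMET = 89.50  # student trained on 100% of the corpus

runs = [
    # (sentences, student COMET)
    (177_346_837, 89.50),
    (92_218_925, 89.44),
    (61_479_283, 89.31),
    (30_739_641, 89.44),
    (1_000_000, 85.79),
    (10_000, 39.57),
]

for sentences, comet in runs:
    percent = 100 * sentences / FULL_DISTILLATION_CORPUS
    vs_baseline = comet - BASELINE_STUDENT_COMET
    print(f"{sentences:>11,}  {percent:6.2f}%  {vs_baseline:+.2f}")
```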

Hypothesis:

Reducing the monolingual data will decrease the COMET score, but there is a benefit: it decreases cost, since less training data has to be synthesized for distillation.

Results

The number of sentences used for distillation did not measurably affect COMET scores above a certain threshold. Here, 50 million seems like a safe threshold with no measurable loss. We would want to adjust this threshold carefully, as COMET on our limited evaluation set may not capture the breadth of information diversity that is on the web.

It's also unclear whether this result will scale to other languages, as this model translates into English, which is morphologically quite simple compared to other languages. I filed #915 to investigate this. It's also unclear what happens with a bigger model: whether it can distill more information and thus needs more data to do so.

Links:

gregtatum added the experiment label Jul 29, 2024
gregtatum self-assigned this Jul 29, 2024
@gregtatum

I wrote a hacky truncation script to try this out and reuse the cached artifacts.

https://github.com/mozilla/firefox-translations-training/tree/da-en-experiment
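
The actual script lives on that branch; the gist is just line-level truncation of the distillation corpus. A minimal sketch, assuming a one-sentence-per-line, optionally gzip-compressed corpus file (the paths and CLI shape here are illustrative, not the real task setup):

```python
import gzip
import sys

def truncate_corpus(in_path: str, out_path: str, max_sentences: int) -> None:
    """Keep only the first `max_sentences` lines of a one-sentence-per-line corpus."""
    in_open = gzip.open if in_path.endswith(".gz") else open
    out_open = gzip.open if out_path.endswith(".gz") else open
    with in_open(in_path, "rt", encoding="utf-8") as src, \
         out_open(out_path, "wt", encoding="utf-8") as dst:
        for i, line in enumerate(src):
            if i >= max_sentences:
                break
            dst.write(line)

if __name__ == "__main__":
    # e.g. python truncate.py mono.da.gz mono.truncated.da.gz 30739641
    truncate_corpus(sys.argv[1], sys.argv[2], int(sys.argv[3]))
```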

@gregtatum commented Aug 27, 2024

W&B Reports

Training Dashboard

Status: All 3 students are training

@gregtatum commented Oct 18, 2024

So I screwed up the truncation script, and everything trained on the same data. I'm re-running things.

New dashboard link

25% train-student
50% train action
75% train action

@gregtatum commented Oct 18, 2024

I'm starting another attempt on the latest main, and using previous_group_ids.

config: configs/experiments-H2-2024/da-en.yml
name: mono_75_percent
langpair: da-en
time: 2024-10-18 16:08:24.167164
train action: https://firefox-ci-tc.services.mozilla.com/tasks/W_feK0IfSNiJ7PyY0cg5rg
branch: dev-da-en-mono-reduction
hash: 52f8874c

config: configs/experiments-H2-2024/da-en.yml
name: mono_50_percent
langpair: da-en
time: 2024-10-18 16:13:49.783889
train action: https://firefox-ci-tc.services.mozilla.com/tasks/GphLOywHSAGulKhTk2SjVA
branch: dev-da-en-mono-reduction
hash: c888519a

config: configs/experiments-H2-2024/da-en.yml
name: mono_25_percent
langpair: da-en
time: 2024-10-18 16:16:56.063919
train action: https://firefox-ci-tc.services.mozilla.com/tasks/E4vDtMWqRCiAytxpFcy70w
branch: dev-da-en-mono-reduction
hash: bab35ab2

@gregtatum

I'm running another experiment with 1 million and with 10,000 sentences as a confidence check. The 10,000-sentence run looks like it's training badly, as expected, so it seems my truncation script was working. I'll let them finish to have more data points.

@gregtatum

Here is a W&B view for the runs

@ZJaume commented Nov 6, 2024

These results seem very interesting to me. I believe the fact that NLLB and ParaCrawl are full of redundant and repetitive data has something to do with this. If there is interest in finding a better way to sample, I think n-gram saturation (ranking lower the sentences that have a significant portion of their 2-grams or 3-grams already present in the corpus) could be worth exploring.
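
For illustration, here is a minimal greedy sketch of that kind of n-gram saturation filter (not part of the training pipeline; tokenization is naive whitespace splitting, `max_seen_ratio` is an arbitrary threshold, and a real implementation might score and re-rank sentences rather than hard-filter):

```python
from collections import Counter
from typing import Iterable, List, Tuple

def ngrams(tokens: List[str], n: int) -> Iterable[Tuple[str, ...]]:
    """Yield all contiguous n-grams of a token list."""
    return (tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def saturation_filter(sentences: Iterable[str], n: int = 2,
                      max_seen_ratio: float = 0.8) -> List[str]:
    """Single greedy pass: keep a sentence only if a large enough share of its
    n-grams has not already appeared in the sentences kept so far."""
    seen: Counter = Counter()
    kept: List[str] = []
    for sentence in sentences:
        tokens = sentence.split()  # naive whitespace tokenization
        grams = list(ngrams(tokens, n))
        if not grams:
            continue
        already = sum(1 for g in grams if g in seen)
        if already / len(grams) <= max_seen_ratio:
            kept.append(sentence)
            seen.update(grams)
    return kept
```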
