Reduce monolingual data for da-en to investigate distillation performance #771
Comments
I wrote a hacky truncation script to try this out and reuse the cached artifacts. https://github.com/mozilla/firefox-translations-training/tree/da-en-experiment
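The actual script lives in the branch linked above; the following is only a minimal sketch of the idea, assuming a hypothetical gzipped corpus file and a simple head-style truncation to a fraction of its lines:

```python
import gzip

def truncate_corpus(src_path: str, dst_path: str, keep_fraction: float) -> None:
    """Keep only the first `keep_fraction` of lines from a gzipped corpus file.

    Hypothetical illustration only, not the script used in the branch above.
    """
    with gzip.open(src_path, "rt", encoding="utf-8") as f:
        lines = f.readlines()
    keep = int(len(lines) * keep_fraction)
    with gzip.open(dst_path, "wt", encoding="utf-8") as f:
        f.writelines(lines[:keep])

# e.g. produce a 25% split (file names here are made up for the example)
truncate_corpus("mono.da.gz", "mono.25pct.da.gz", 0.25)
```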
Status: All 3 students are training |
So I screwed up the truncation script, and everything trained the same. I'm re-running things. 25% train-student
I'm starting another attempt on the latest main, and using
I'm running another experiment with 1 million and 10,000 sentences as a confidence check on my experiment. The 10,000 run looks like it's correctly training badly, so it appears my truncation script was working. I'll let them finish to have more data points.
Here is a W&B view for the runs
These results seem very interesting to me. I believe the fact that NLLB and Paracrawl are full of redundant and repetitive data has something to do with this. If there is interest in finding a better way to sample, I think n-gram saturation (ranking lower the sentences that have a significant portion of their 2-grams or 3-grams already present in the corpus) could be worth exploring.
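A minimal sketch of what such n-gram saturation scoring could look like (a hypothetical helper, not part of the pipeline): sentences whose 3-grams are mostly already covered by previously seen sentences get pushed toward the end of the sampling order, so truncation removes redundant data first.

```python
from typing import Iterable

def ngrams(tokens: list[str], n: int) -> set[tuple[str, ...]]:
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}

def saturation_order(sentences: Iterable[str], n: int = 3) -> list[str]:
    """Greedy pass: sentences that add new n-grams come first, saturated ones last.

    Hypothetical illustration of the n-gram saturation idea; a real
    implementation would need proper tokenization, normalization, and batching.
    """
    seen: set[tuple[str, ...]] = set()
    fresh, saturated = [], []
    for sentence in sentences:
        grams = ngrams(sentence.split(), n)
        if not grams:
            saturated.append(sentence)
            continue
        novelty = len(grams - seen) / len(grams)
        (fresh if novelty > 0.5 else saturated).append(sentence)
        seen |= grams
    # Fresh sentences first; redundant ones can be truncated away.
    return fresh + saturated
```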
An experiment for #231
da-en is one of our best models from the spring-2024 run. The teacher ensemble had a COMET score of 0.9013, and the student's COMET was 0.8950, a tiny -0.0063 gap. In order to understand the role of the quantity of monolingual data, this experiment artificially reduces the amount of data to observe the effects on distillation.
Experiment splits:
(Edit: I changed my technique to limit the distillation corpus size rather than strictly the monolingual size, so my percentages are off from my original 100%, 75%, 50%, and 25% splits. The distillation corpus is formed from the `da` side of the parallel corpus and the `da` monolingual data.)

| Distillation sentences | % of full corpus | Student COMET | Δ vs teacher | Δ vs 100% student |
| --- | --- | --- | --- | --- |
| 177,346,837 | 100% | 89.50 | -0.63 | |
| 92,218,925 | 52% | 89.44 | -0.69 | -0.06 |
| 61,479,283 | 35% | 89.31 | -0.82 | -0.19 |
| 30,739,641 | 17% | 89.44 | -0.69 | -0.06 |
| 1,000,000 | 0.5% | 85.79 | -3.71 | -3.71 |
| 10,000 | | 39.57 | -50.56 | -49.93 |
| Distillation sentences | % of full corpus | Score | Δ vs 100% student |
| --- | --- | --- | --- |
| 177,346,837 | 100% | 70.74 | |
| 92,218,925 | 52% | 70.67 | -0.07 |
| 61,479,283 | 35% | 70.75 | +0.01 |
| 30,739,641 | 17% | 70.63 | -0.11 |
| 1,000,000 | 0.5% | 66.73 | -4.01 |
| 10,000 | | 26.94 | -43.8 |
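For reference, the COMET numbers above are model scores scaled by 100. A minimal sketch of producing a system-level COMET score with the unbabel-comet package (the checkpoint name here is an assumption; the pipeline's evaluation setup may differ):

```python
# pip install unbabel-comet
from comet import download_model, load_from_checkpoint

# Assumed checkpoint; the training pipeline may evaluate with a different one.
model = load_from_checkpoint(download_model("Unbabel/wmt22-comet-da"))

data = [
    {
        "src": "Hvor er biblioteket?",   # Danish source
        "mt": "Where is the library?",    # student (or teacher) output
        "ref": "Where is the library?",   # reference translation
    },
]

output = model.predict(data, batch_size=8, gpus=0)
print(output.system_score * 100)  # same scale as the tables above, e.g. 89.50
```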
Hypothesis:
Reducing the monolingual data will decrease the COMET score, but there is a benefit: it would decrease cost, since less training data would need to be synthesized for distillation.
Results
The number of sentences used for distillation did not measurably affect COMET scores above a certain threshold. Here, 50 million sentences seems like a safe threshold with no measurable loss. We would want to adjust this threshold carefully, as it could be that COMET on our limited evaluation set does not capture the breadth of information diversity that is on the web.
It's also unclear whether this result will scale to other languages, as this experiment translates into English, which is morphologically quite simple compared to other languages. I filed #915 to investigate this. It's also unclear what happens when we use a bigger model, and whether it is able to distill more information and thus requires more data.
Links: