Publish evaluation metrics #598

vrigal · 2024-05-15T12:53:03Z

Based on #589 (required for valid evaluation task naming).

Closes #519

Based on Bastien's work (#558)

[skip ci]

vrigal · 2024-05-16T08:52:01Z

I tried testing by setting {"training_config": { "wandb-publication": True }} in the training configuration (taskcluster.translations_taskgraph.parameters) and changing parameters to avoid caching, all in a temporary commit (TRASHME).

However, neither the train nor the evaluation tasks are started in Taskcluster (they are marked unscheduled). @eu9ene do you have the rights to start this task, for example https://firefox-ci-tc.services.mozilla.com/tasks/ce0zK1kgSxC_cwZesDXPLQ ? (If it works, it should publish that eval metric to a new group ci_b_lxggt6TXCnpi1qsjeIUQ in the ru-en project of the moz-translations workspace.)

eu9ene · 2024-05-16T17:55:27Z

> I tried testing by setting {"training_config": { "wandb-publication": True }} in the training configuration (taskcluster.translations_taskgraph.parameters) and changing parameters to avoid caching, all in a temporary commit (TRASHME).
>
> However, neither the train nor the evaluation tasks are started in Taskcluster (they are marked unscheduled). @eu9ene do you have the rights to start this task, for example https://firefox-ci-tc.services.mozilla.com/tasks/ce0zK1kgSxC_cwZesDXPLQ ? (If it works, it should publish that eval metric to a new group ci_b_lxggt6TXCnpi1qsjeIUQ in the ru-en project of the moz-translations workspace.)

The train-backwards task failed, which is why the other tasks don't start. The error is: Publication failed: argument --wandb-project: conflicting option string: --wandb-project
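For context, this is the message Python's argparse reports when the same option string is registered twice on one parser, which would be consistent with the W&B arguments being added to a second script. A minimal sketch of that failure mode follows; the helper names are hypothetical and not taken from the pipeline code.

```python
import argparse


def add_wandb_arguments(parser):
    """Hypothetical helper that registers a W&B option on a parser."""
    parser.add_argument("--wandb-project", default=None)


def reproduce_conflict():
    """Register the same option twice and return the resulting error text."""
    parser = argparse.ArgumentParser()
    add_wandb_arguments(parser)
    try:
        # Second registration of the same option string on the same parser.
        add_wandb_arguments(parser)
    except argparse.ArgumentError as error:
        return str(error)
    return None


if __name__ == "__main__":
    print(reproduce_conflict())
```

Registering the shared arguments in exactly one place (or building the second parser from scratch) avoids the conflict.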

eu9ene · 2024-05-17T23:09:33Z

Depends on #611 to run CI

eu9ene · 2024-05-17T23:09:51Z

We should not forget to check that the newly added COMET metric is being published.

eu9ene

This will likely work after the bug fix, but we should refactor so that the code is reused without adding arguments to another script. Also, please double-check that the recently added COMET metric is working. We'll need green CI and everything published on W&B to verify that this works. Thanks for pushing on this one, we really need it for the release!

taskcluster/kinds/evaluate-quantized/kind.yml

taskcluster/kinds/evaluate-teacher-ensemble/kind.yml

pipeline/eval/eval.py

vrigal · 2024-05-21T09:01:51Z

> Also please double check that the recently added COMET metric is working.

This metric was not specified. I will open a new issue/PR for this.

It is actually ignored by the online publication. The current support for .metrics files (from old experiments and offline Taskcluster uploads) will not work with it: parsing will raise AssertionError: file must contain exactly 2 float values.
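To illustrate why a third metric breaks this, here is a rough sketch of a two-value parser. It assumes a .metrics file holds one float per line (BLEU, then chrF); the function name and sample values are made up for illustration, and only the assertion message comes from the comment above.

```python
from typing import Tuple


def parse_metrics(text: str) -> Tuple[float, float]:
    """Parse a hypothetical two-value .metrics file (assumed order: BLEU, chrF)."""
    values = [float(line) for line in text.splitlines() if line.strip()]
    # A file with an extra COMET line would hold 3 values and fail here.
    assert len(values) == 2, "file must contain exactly 2 float values"
    bleu, chrf = values
    return bleu, chrf


if __name__ == "__main__":
    print(parse_metrics("35.2\n60.1\n"))
```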

vrigal · 2024-05-21T09:16:37Z

pipeline/eval/eval.py

+        )
+        logger.info("Initializing Weight & Biases client")
+        # Allow publishing metrics as a table on existing runs (i.e. previous trainings)
+        wandb.open(resume=True)


This would also benefit from #610.

vrigal · 2024-05-21T12:00:23Z

Evaluation data has been published to W&B: https://wandb.ai/moz-translations/ru-en/groups/ci_bVp4cnGJSgCeY2pK5aKK0A/workspace

I found two other things:

- Teacher-1 is not correctly parsed by the regex (label evaluate-teacher-flores-devtest-ru-en-1), because it detects ru-en-1 as part of the dataset (instead of detecting the language at the end of the label). If that extra language part is always present, the fix will be a one-character patch.
- I noticed that simple logs can be displayed as bar charts on W&B once the grouping option is disabled (see our custom workspace). That would be a good way to remove Tables/custom charts for evaluation metrics and homogenize all charts in the future.
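The parsing problem in the first point can be reproduced with a simplified pattern. Both regexes below are hypothetical stand-ins, not the project's actual parser: the first makes the language pair optional at the end of the label, so a trailing -1 pushes everything into the dataset group; the second additionally accepts an optional numeric suffix. Model names containing hyphens are deliberately not handled in this sketch.

```python
import re

# Hypothetical pattern: the language pair is optional and must end the label,
# so a trailing "-1" makes the lazy dataset group swallow "ru-en-1".
BUGGY = re.compile(
    r"^evaluate-(?P<model>[a-z]+)-(?P<importer>[a-z]+)-(?P<dataset>.+?)"
    r"(-(?P<lang>[a-z]{2}-[a-z]{2}))?$"
)

# Hypothetical fix: allow an optional numeric suffix after the language pair.
FIXED = re.compile(
    r"^evaluate-(?P<model>[a-z]+)-(?P<importer>[a-z]+)-(?P<dataset>.+?)"
    r"(-(?P<lang>[a-z]{2}-[a-z]{2}))?(-(?P<suffix>\d+))?$"
)


def parse_label(pattern, label):
    """Return the named groups for a task label, or None if it does not match."""
    match = pattern.match(label)
    return match.groupdict() if match else None


LABEL = "evaluate-teacher-flores-devtest-ru-en-1"
buggy = parse_label(BUGGY, LABEL)
fixed = parse_label(FIXED, LABEL)

if __name__ == "__main__":
    print(buggy["dataset"], buggy["lang"])
    print(fixed["dataset"], fixed["lang"], fixed["suffix"])
```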

⚠️ note: Commit 9d6cfa2 must be dropped before merging

eu9ene · 2024-05-21T22:55:13Z

@vrigal things don't look right here: https://wandb.ai/moz-translations/ru-en/groups/ci_bVp4cnGJSgCeY2pK5aKK0A/workspace. Evals use different run names than training.

> Teacher-1 is not correctly parsed by the regex (label evaluate-teacher-flores-devtest-ru-en-1), because it detects ru-en-1 as part of the dataset (instead of the language at the end of the label). In case that extra language part is always present, it will be a 1 character patch.

Yes, I see, this is a must-fix before merging. Please also add "sacrebleu_aug-upper_wmt19" to https://github.com/mozilla/firefox-translations-training/blob/f5a6af83849f08cce3e7b8b9d06cc2a0456f2c68/taskcluster/translations_taskgraph/parameters.py#L91 so that we can test that the augmentation part of the dataset is parsed correctly.

vrigal · 2024-05-22T15:57:46Z

Publication should be fine now (flores & sacrebleu): https://wandb.ai/moz-translations/ru-en/groups/ci_Z45R3e22TWCzyqQ5UFdWdQ/workspace

I updated the regex so it supports labels ending with -1, merged run names in the label parser (backward -> backwards and finetuned -> finetune) and dropped the temporary commit used to trigger the complete CI.
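The run-name merge mentioned above could look roughly like the following. This is only a sketch: the alias table mirrors the two renames listed in this comment, while the function name and the part-by-part replacement strategy are assumptions rather than the parser's real implementation.

```python
# Renames taken from the comment above: names used by evaluation labels on the
# left, names used by training runs on the right.
RUN_NAME_ALIASES = {"backward": "backwards", "finetuned": "finetune"}


def normalize_run_name(name: str) -> str:
    """Map an evaluation-style run name onto the matching training run name."""
    parts = name.split("-")
    return "-".join(RUN_NAME_ALIASES.get(part, part) for part in parts)


if __name__ == "__main__":
    for example in ("backward", "finetuned-student", "teacher-1"):
        print(example, "->", normalize_run_name(example))
```

Names without an alias pass through unchanged, so evaluation metrics land on the same W&B run as the corresponding training.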

eu9ene · 2024-05-22T17:25:01Z

The issue in CI is #549

vrigal force-pushed the publish-eval branch from ba7cbd7 to 9b3efa9 Compare May 15, 2024 12:56

vrigal mentioned this pull request May 15, 2024

Publish evaluation metrics (rebase) #593

Closed

vrigal force-pushed the publish-eval branch 2 times, most recently from 3e1ba35 to cf387a3 Compare May 16, 2024 08:08

vrigal marked this pull request as ready for review May 16, 2024 08:52

vrigal requested a review from a team as a code owner May 16, 2024 08:52

vrigal requested a review from gbrownmozilla May 16, 2024 08:52

bhearsum requested review from eu9ene and removed request for a team and gbrownmozilla May 16, 2024 14:20

vrigal force-pushed the publish-eval branch 3 times, most recently from 1660452 to 8543142 Compare May 17, 2024 12:15

vrigal mentioned this pull request May 17, 2024

Publish experiment config from taskcluster training task (group_logs) #602

Merged

vrigal force-pushed the publish-eval branch 2 times, most recently from d630a0e to ba1b9cd Compare May 17, 2024 14:34

eu9ene requested changes May 17, 2024

taskcluster/kinds/evaluate-quantized/kind.yml (outdated)

taskcluster/kinds/evaluate-teacher-ensemble/kind.yml (outdated)

pipeline/eval/eval.py

vrigal force-pushed the publish-eval branch from 7d0f251 to 2ca0ebe Compare May 20, 2024 18:54

Bastien Abadie and others added 8 commits May 21, 2024 08:30

Configure evaluation tasks

cba7112

Extract w&b code into module

08b7175

Do not check taskcluster when publication is disabled

bfc4e0b

Publish evaluation metrics to W&B

f9c163e

Fix running eval tracking on CI

9009544

Use args.wandb_run_name instead of default teacher

eb7d6db

Remove duplicated arguments

0819fcc

Retrieve dataset from Taskcluster directly

990943d

vrigal force-pushed the publish-eval branch from 2ca0ebe to 5f4a454 Compare May 21, 2024 06:30

vrigal added 2 commits May 21, 2024 11:07

Add missing calls to publisher and logging

7c86b89

Allow publishing metrics as a table on existing runs (i.e. previous trainings)

d3d5ec3

vrigal force-pushed the publish-eval branch from 5f4a454 to 929afa6 Compare May 21, 2024 09:15

vrigal commented May 21, 2024

vrigal mentioned this pull request May 21, 2024

Support COMET metric in the parser #615

Closed

vrigal force-pushed the publish-eval branch 2 times, most recently from 9e92d78 to 9d6cfa2 Compare May 21, 2024 09:36

Update regex to parse labels ending with '-1'

782a8d7

vrigal force-pushed the publish-eval branch from 9d6cfa2 to 09aeec4 Compare May 22, 2024 11:02

Generic support for train/eval different naming

468f6fe

vrigal force-pushed the publish-eval branch from 09aeec4 to ccdc233 Compare May 22, 2024 13:09

Update tests

044d93b

vrigal force-pushed the publish-eval branch from ccdc233 to 044d93b Compare May 22, 2024 15:14

Support disabled publication

53a012c

eu9ene approved these changes May 22, 2024

vrigal mentioned this pull request May 22, 2024

Publish comet metrics #621

Merged

Merge branch 'main' into publish-eval

f8c8a8b

eu9ene merged commit 8a1d8ef into mozilla:main May 22, 2024
