DEMETR results by language #22

Open · klh5 wants to merge 17 commits into main
Conversation

@klh5 (Collaborator) commented Feb 4, 2025

  • Changes scripts to output individual results to JSON files, rather than saving an aggregate result (roughly the pattern sketched below).
  • Adds notebook used to plot results.
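A minimal sketch of the per-result output described above; the directory layout, file naming, and field names here are illustrative assumptions, not the PR's actual schema:

```python
import json
from pathlib import Path

# Write one result per JSON file instead of one aggregate file.
# Field names ("lang", "perturbation_id") are assumed for illustration.
def save_result(result: dict, out_dir: Path) -> None:
    out_dir.mkdir(parents=True, exist_ok=True)
    fname = f"demetr_{result['lang']}_id{result['perturbation_id']}.json"
    with (out_dir / fname).open("w") as f:
        json.dump(result, f, indent=2)
```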

Closes #17

@klh5 linked an issue Feb 4, 2025 that may be closed by this pull request.
Review comments on the notebook below were left via ReviewNB.

@jack89roberts (Collaborator) commented:

I left a couple of comments about this, but this file assumes we're processing DEMETR, so for this PR it should be made clear in docstrings/filenames etc. that that's the case. Then later, as part of running on WMT/CallHome maybe, we can see if some of this can be abstracted away so the classes etc. can more easily be used on different datasets; one hypothetical shape is sketched below.
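For example (class and method names here are placeholders, not the PR's actual code):

```python
from abc import ABC, abstractmethod
from collections.abc import Iterator

# Dataset-specific loading lives in subclasses; the metric-scoring loop
# can then be written once against the shared interface.
class EvalDataset(ABC):
    @abstractmethod
    def examples(self) -> Iterator[tuple[str, str, str]]:
        """Yield (source, translation, reference) triples."""

class DemetrDataset(EvalDataset):
    def __init__(self, json_dir: str) -> None:
        self.json_dir = json_dir

    def examples(self) -> Iterator[tuple[str, str, str]]:
        ...  # parse the DEMETR JSON files from json_dir here
```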

@jack89roberts (Collaborator) commented Feb 6, 2025:

It would be nice to add some markdown/prose around the notebook to help explain it, if we're using it as the main point of reference for the DEMETR analysis. Some of the more debugging/checking prints etc. could also be removed.

In terms of ideas for plots/tables I've left a few comments/suggestions, but for the DEMETR section of the report I guess what we're looking for is to effectively + succinctly show:

  • How does BLASER compare to COMET (and the other metrics) overall?
  • Is BLASER, or any of the other metrics, more sensitive to critical/semantic errors than minor errors?
  • Does metric performance vary by language?

and ideally limiting it to one plot/table that best represents the answers to those.


@jack89roberts (Collaborator) commented Feb 6, 2025:

Line #1: demetr_results = "../data/demetr_paper_results_tidy.csv"

This file should be added to the repo (or it should be documented how to obtain it).


@klh5 (Collaborator, Author) replied:

This is lifted from the table in the paper; would there be any issues with adding it to our repo since it isn't published separately anywhere else?

@jack89roberts (Collaborator) replied:

Ah right, probably safer not to include it then, but if e.g. you have instructions/a script for how to obtain it, or at minimum can point to the table the numbers came from, that would be good. And maybe structure the notebook so most of it can be run without it, e.g. with a guard like the sketch below?
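A possible guard near the top of the notebook (the path and variable name match the current cell; the pattern itself is just a sketch):

```python
from pathlib import Path

import pandas as pd

# The paper's numbers aren't redistributed in the repo, so load them only
# if the CSV has been created locally; downstream comparison cells can
# check for None and skip themselves.
demetr_results = Path("../data/demetr_paper_results_tidy.csv")
paper_df = pd.read_csv(demetr_results) if demetr_results.exists() else None
if paper_df is None:
    print("Paper results CSV not found; skipping comparison with the paper.")
```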

@jack89roberts (Collaborator) commented Feb 6, 2025:

It should be documented in the notebook (or elsewhere) where to get all the necessary input files, which scripts to run first, etc.


@jack89roberts (Collaborator) commented Feb 6, 2025:

Line #3: id = next(c for c in res_files[0].split('_') if 'id' in c)

Best not to use id as a variable name, since it shadows the Python built-in. Also, what this line is doing is quite hard to follow, so it could be worth a comment.
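For example (the file-naming scheme here is assumed for illustration):

```python
res_files = ["demetr_id3_results.json"]  # example; the real list comes from disk

# Pull the perturbation-ID chunk (e.g. "id3") out of the first file name,
# avoiding the built-in name `id`.
perturbation_id = next(part for part in res_files[0].split("_") if "id" in part)
```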


@jack89roberts (Collaborator) commented Feb 6, 2025:

Unless we can pick out an interesting difference in the base/critical/major/minor trends across the languages, probably one to skip for the report? It's a lot of bars, so it's hard to digest.


@jack89roberts (Collaborator) commented Feb 6, 2025:

Possibly easier to digest as a table (including a row with the average across languages)? Something like the sketch below.
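A pandas sketch for building that table; the column names are assumptions about the tidy results, and the values are placeholders purely so the snippet runs:

```python
import pandas as pd

# Tidy per-language results; "language", "metric" and "accuracy" are
# assumed column names, with made-up values.
results = pd.DataFrame({
    "language": ["de", "de", "fr", "fr"],
    "metric": ["metric_a", "metric_b", "metric_a", "metric_b"],
    "accuracy": [0.1, 0.2, 0.3, 0.4],
})
table = results.pivot_table(index="language", columns="metric", values="accuracy")
table.loc["average"] = table.mean()  # row with the average across languages
print(table)
```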


@jack89roberts (Collaborator) commented Feb 6, 2025:

Apply to everything:

  • BLEU should be capitalised.
  • There's probably a case for only including whichever of ChrF1 and ChrF2 performs best (on average).
  • Consider whether any other bar plots are easier to digest as tables.

Apply to this plot:

  • This is more along the lines of the DEMETR sensitivity analysis, but there's a discussion point here about what trend would be desirable. I think that's critical > major > minor in terms of accuracy/sensitivity, and none of the metrics show it.

@jack89roberts (Collaborator) commented Feb 6, 2025:

In terms of the report, I think anything about differences between our implementation and the original DEMETR results can be demoted to an appendix, or left out completely with a mention/link back to the repo for more details (i.e. just keep the plot on the right?).


@jack89roberts (Collaborator) commented Feb 6, 2025:

I think the score difference aggregated across all categories, by severity type and for multiple metrics, would be interesting. Possibly as a table again, e.g.:

| Metric      | Minor | Major | Critical |
| ----------- | ----- | ----- | -------- |
| BLASER      |       |       |          |
| COMET       |       |       |          |
| ChrF        |       |       |          |

and maybe separately vs. language, if it adds anything interesting. A pandas sketch for the aggregation is below.
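One way to build it ("metric", "severity" and "score_diff" are assumed column names for the tidy per-example results, with placeholder values so the snippet runs):

```python
import pandas as pd

# Per-example score differences; column names and values are made up
# for illustration.
df = pd.DataFrame({
    "metric": ["BLASER", "BLASER", "COMET", "COMET"],
    "severity": ["minor", "critical", "minor", "critical"],
    "score_diff": [0.1, 0.2, 0.3, 0.4],
})
summary = (
    df.groupby(["metric", "severity"])["score_diff"]
    .mean()
    .unstack("severity")
    .reindex(columns=["minor", "major", "critical"])  # enforce column order
)
print(summary)
```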


@jack89roberts (Collaborator) commented Feb 6, 2025

Also, if we need to limit this to Python 3.11, we can change the version the Actions use in .github/workflows/ci.yml so they don't fail, e.g.:
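A sketch of the relevant workflow lines (the job and step layout here is an assumption about the existing ci.yml, not a copy of it):

```yaml
jobs:
  test:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-python@v5
        with:
          python-version: "3.11"
```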

Successfully merging this pull request may close issue #17: DEMETR results by language.

2 participants