DEMETR results by language #22

Open · klh5 wants to merge 17 commits into main
Conversation

@klh5 (Collaborator) commented Feb 4, 2025

  • Changes scripts to output individual results to JSON files, rather than saving an aggregate result (roughly the pattern sketched below).
  • Adds notebook used to plot results.
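A minimal sketch of the per-result output described above; the directory layout, file naming, and field names here are illustrative assumptions, not the PR's actual schema:

```python
import json
from pathlib import Path

# Write one result per JSON file instead of one aggregate file.
# Field names ("lang", "perturbation_id") are assumed for illustration.
def save_result(result: dict, out_dir: Path) -> None:
    out_dir.mkdir(parents=True, exist_ok=True)
    fname = f"demetr_{result['lang']}_id{result['perturbation_id']}.json"
    with (out_dir / fname).open("w") as f:
        json.dump(result, f, indent=2)
```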

Closes #17

@klh5 linked an issue Feb 4, 2025 that may be closed by this pull request.
Review comments on the notebook below were left via ReviewNB.

@jack89roberts (Collaborator) commented:

I left a couple of comments about this, but this file assumes we're processing DEMETR, so for this PR it should be made clear in docstrings/filenames etc. that that's the case. Then later, as part of running on WMT/CallHome maybe, we can see if some of this can be abstracted away so the classes etc. can more easily be used on different datasets; one hypothetical shape is sketched below.
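For example (class and method names here are placeholders, not the PR's actual code):

```python
from abc import ABC, abstractmethod
from collections.abc import Iterator

# Dataset-specific loading lives in subclasses; the metric-scoring loop
# can then be written once against the shared interface.
class EvalDataset(ABC):
    @abstractmethod
    def examples(self) -> Iterator[tuple[str, str, str]]:
        """Yield (source, translation, reference) triples."""

class DemetrDataset(EvalDataset):
    def __init__(self, json_dir: str) -> None:
        self.json_dir = json_dir

    def examples(self) -> Iterator[tuple[str, str, str]]:
        ...  # parse the DEMETR JSON files from json_dir here
```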

@jack89roberts (Collaborator) commented Feb 6, 2025:

It would be nice to add some markdown/prose around the notebook to help explain it, if we're using it as the main point of reference for the DEMETR analysis. Some of the more debugging/checking prints etc. could also be removed.

In terms of ideas for plots/tables I've left a few comments/suggestions, but for the DEMETR section of the report I guess what we're looking for is to effectively + succinctly show:

  • How does BLASER compare to COMET (and the other metrics) overall?
  • Is BLASER, or any of the other metrics, more sensitive to critical/semantic errors than minor errors?
  • Does metric performance vary by language?

and ideally limiting it to one plot/table that best represents the answers to those.


@jack89roberts (Collaborator) commented Feb 6, 2025:

Line #1: demetr_results = "../data/demetr_paper_results_tidy.csv"

This file should be added to the repo (or it should be documented how to obtain it).


@klh5 (Collaborator, Author) replied:

This is lifted from the table in the paper; would there be any issues with adding it to our repo since it isn't published separately anywhere else?

@jack89roberts (Collaborator) replied:

Ah right, probably safer not to include it then, but if e.g. you have instructions/a script for how to obtain it, or at minimum can point to the table the numbers came from, that would be good. And maybe structure the notebook so most of it can be run without it, e.g. with a guard like the sketch below?
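A possible guard near the top of the notebook (the path and variable name match the current cell; the pattern itself is just a sketch):

```python
from pathlib import Path

import pandas as pd

# The paper's numbers aren't redistributed in the repo, so load them only
# if the CSV has been created locally; downstream comparison cells can
# check for None and skip themselves.
demetr_results = Path("../data/demetr_paper_results_tidy.csv")
paper_df = pd.read_csv(demetr_results) if demetr_results.exists() else None
if paper_df is None:
    print("Paper results CSV not found; skipping comparison with the paper.")
```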

@jack89roberts (Collaborator) commented Feb 6, 2025:

It should be documented in the notebook (or elsewhere) where to get all the necessary input files, which scripts to run first, etc.


@jack89roberts (Collaborator) commented Feb 6, 2025:

Line #3: id = next(c for c in res_files[0].split('_') if 'id' in c)

Best not to use id as a variable name, since it shadows the Python built-in. Also, what this line is doing is quite hard to follow, so it could be worth a comment.
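For example (the file-naming scheme here is assumed for illustration):

```python
res_files = ["demetr_id3_results.json"]  # example; the real list comes from disk

# Pull the perturbation-ID chunk (e.g. "id3") out of the first file name,
# avoiding the built-in name `id`.
perturbation_id = next(part for part in res_files[0].split("_") if "id" in part)
```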


@jack89roberts (Collaborator) commented Feb 6, 2025:

Unless we can pick out an interesting difference in the base/critical/major/minor trends across the languages, probably one to skip for the report? It's a lot of bars, so it's hard to digest.


@jack89roberts (Collaborator) commented Feb 6, 2025:

Possibly easier to digest as a table (including a row with the average across languages)? Something like the sketch below.
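A pandas sketch for building that table; the column names are assumptions about the tidy results, and the values are placeholders purely so the snippet runs:

```python
import pandas as pd

# Tidy per-language results; "language", "metric" and "accuracy" are
# assumed column names, with made-up values.
results = pd.DataFrame({
    "language": ["de", "de", "fr", "fr"],
    "metric": ["metric_a", "metric_b", "metric_a", "metric_b"],
    "accuracy": [0.1, 0.2, 0.3, 0.4],
})
table = results.pivot_table(index="language", columns="metric", values="accuracy")
table.loc["average"] = table.mean()  # row with the average across languages
print(table)
```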


@jack89roberts (Collaborator) commented Feb 6, 2025:

Apply to everything:

  • BLEU should be capitalised.
  • There's probably a case for only including whichever of ChrF1 and ChrF2 performs best (on average).
  • Consider whether any other bar plots are easier to digest as tables.

Apply to this plot:

  • This is more along the lines of the DEMETR sensitivity analysis, but there's a discussion point here about what trend would be desirable. I think that's critical > major > minor in terms of accuracy/sensitivity, and none of the metrics show it.

@jack89roberts (Collaborator) commented Feb 6, 2025:

In terms of the report, I think anything about differences between our implementation and the original DEMETR results can be demoted to an appendix, or left out completely with a mention/link back to the repo for more details (i.e. just keep the plot on the right?).


@jack89roberts (Collaborator) commented Feb 6, 2025:

I think the score difference aggregated across all categories, by severity type and for multiple metrics, would be interesting. Possibly as a table again, e.g.:

| Metric      | Minor | Major | Critical |
| ----------- | ----- | ----- | -------- |
| BLASER      |       |       |          |
| COMET       |       |       |          |
| ChrF        |       |       |          |

and maybe separately vs. language, if it adds anything interesting. A pandas sketch for the aggregation is below.
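One way to build it ("metric", "severity" and "score_diff" are assumed column names for the tidy per-example results, with placeholder values so the snippet runs):

```python
import pandas as pd

# Per-example score differences; column names and values are made up
# for illustration.
df = pd.DataFrame({
    "metric": ["BLASER", "BLASER", "COMET", "COMET"],
    "severity": ["minor", "critical", "minor", "critical"],
    "score_diff": [0.1, 0.2, 0.3, 0.4],
})
summary = (
    df.groupby(["metric", "severity"])["score_diff"]
    .mean()
    .unstack("severity")
    .reindex(columns=["minor", "major", "critical"])  # enforce column order
)
print(summary)
```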


@jack89roberts (Collaborator) commented Feb 6, 2025

Also, if we need to limit this to Python 3.11, we can change the version the Actions use in .github/workflows/ci.yml so they don't fail, e.g.:
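A sketch of the relevant workflow lines (the job and step layout here is an assumption about the existing ci.yml, not a copy of it):

```yaml
jobs:
  test:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-python@v5
        with:
          python-version: "3.11"
```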

Successfully merging this pull request may close issue #17: DEMETR results by language.

2 participants