Fix MS MARCO v1 doc segmented uniCOIL encoding issues #1853

Closed
lintool opened this issue Apr 23, 2022 · 0 comments · Fixed by #1854

lintool commented Apr 23, 2022

Background, ref #1850

On MS MARCO v2, in addition to these previous corpora:

889db095113cc4fe152382ccff73304a  msmarco_v2_doc_segmented_unicoil_0shot.tar
28261587d6afde56efd8df4f950e7fb4  msmarco_v2_doc_segmented_unicoil_noexp_0shot.tar

@MXueguang also created these:

c5639748c2cbad0152e10b0ebde3b804  msmarco_v2_doc_segmented_unicoil_0shot_v2.tar
97ba262c497164de1054f357caea0c63  msmarco_v2_doc_segmented_unicoil_noexp_0shot_v2.tar

The first set encodes only the segment text; the second prepends the document title to each segment before encoding.

We wanted to do the same with v1. These are the existing corpora:

d7536219c03cc24c60654bea18204e93  msmarco-doc-segmented-unicoil-noexp.tar
6a00e2c0c375cb1e52c83ae5ac377ebb  msmarco-doc-segmented-unicoil.tar

And these are the new ones @MXueguang prepared:

bc71c588c9a2ed175cf2fbe52bd0b86e  msmarco-doc-segmented-unicoil-noexp-v2.tar
3618700294e0edc59b2b3b0820e3acf9  msmarco-doc-segmented-unicoil-v2.tar
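
A downloaded tarball can be checked against the checksums listed above before use. The following is a minimal sketch, not part of the project's tooling: the helper names `md5_of_file` and `verify` are hypothetical, and it assumes the tarball keeps the file name shown in the listing.

```python
import hashlib
from pathlib import Path

# Checksums copied from the listing above (the new v1 corpora).
EXPECTED_MD5 = {
    "msmarco-doc-segmented-unicoil-noexp-v2.tar": "bc71c588c9a2ed175cf2fbe52bd0b86e",
    "msmarco-doc-segmented-unicoil-v2.tar": "3618700294e0edc59b2b3b0820e3acf9",
}


def md5_of_file(path, chunk_size=1 << 20):
    """Return the hex MD5 digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        # Read fixed-size chunks so large tarballs are never fully in memory.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify(path, expected=EXPECTED_MD5):
    """Return True if the file's MD5 matches the entry for its file name.

    Hypothetical helper: an unknown file name yields False.
    """
    path = Path(path)
    return md5_of_file(path) == expected.get(path.name)
```

Calling `verify("msmarco-doc-segmented-unicoil-v2.tar")` on a local copy returns `True` only when the digest matches the listed value.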

However, we later discovered:

  1. msmarco-doc-segmented-unicoil-noexp.tar - appears to be segment-only encoded
  2. msmarco-doc-segmented-unicoil.tar - actually appears to be title/segment encoded
  3. msmarco-doc-segmented-unicoil-noexp-v2.tar - title/segment encoded
  4. msmarco-doc-segmented-unicoil-v2.tar - title/segment encoded

So, it appears that (2) and (4) are the same condition.
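
One way to check whether two encoded corpora such as (2) and (4) represent the same condition is to compare the term-weight vectors for the documents they share. The sketch below is illustrative only. It assumes each corpus is stored as JSON lines with an `id` field and a `vector` field mapping tokens to weights. That layout is an assumption, not something stated in this issue, and the function names are hypothetical.

```python
import json


def load_vectors(lines):
    """Map each document id to its term-weight vector.

    Assumes one JSON object per line with "id" and "vector" fields.
    """
    vectors = {}
    for line in lines:
        line = line.strip()
        if not line:
            continue
        record = json.loads(line)
        vectors[record["id"]] = record["vector"]
    return vectors


def fraction_identical(lines_a, lines_b):
    """Fraction of shared document ids whose vectors are exactly equal."""
    a = load_vectors(lines_a)
    b = load_vectors(lines_b)
    shared = a.keys() & b.keys()
    if not shared:
        return 0.0
    same = sum(1 for doc_id in shared if a[doc_id] == b[doc_id])
    return same / len(shared)
```

A value near 1.0 on a sample of segments would suggest the two corpora were encoded from the same input. A lower value would not by itself rule that out, since encoding runs can differ slightly.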

Experimental results check out: (3) is more effective than (1), while (4) ≈ (2), with the differences attributable to noise.

Here's my plan: replace (1) with (3), i.e., just overwrite it, and retain (2), so we'll discard (4). Thus, we will not be creating new YAML files for regressions.

  • msmarco-doc-segmented-unicoil: remains exactly the same
  • msmarco-doc-segmented-unicoil-noexp: updated to title/segment encoding