Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add BGE-M3 Encoder #22

Merged
merged 21 commits into from
Nov 22, 2024
Merged

Add BGE-M3 Encoder #22

merged 21 commits into from
Nov 22, 2024

Conversation

andreaschari
Copy link
Contributor

Adds BGE-M3 Encoder support and documentation.

Copy link
Collaborator

@seanmacavaney seanmacavaney left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Great, thanks! I think it's mostly there, but it needs a few changes. Mostly stuff to bring it in line with the existing conventions in the package so that it doesn't need to be treated differently.

Also, can you add unit tests?

README.md Show resolved Hide resolved
pyterrier_dr/bge_m3.py Outdated Show resolved Hide resolved
pyterrier_dr/bge_m3.py Outdated Show resolved Hide resolved
pyterrier_dr/bge_m3.py Outdated Show resolved Hide resolved
pyterrier_dr/bge_m3.py Outdated Show resolved Hide resolved
README.md Outdated Show resolved Hide resolved
pyterrier_dr/bge_m3.py Outdated Show resolved Hide resolved
pyterrier_dr/bge_m3.py Outdated Show resolved Hide resolved
pyterrier_dr/bge_m3.py Outdated Show resolved Hide resolved
@cmacdonald
Copy link
Collaborator

realised my review wasnt submitted (and its substantially the same as Seans)

@cmacdonald
Copy link
Collaborator

This looks better now Andreas. You need some unit tests though.

@seanmacavaney should we align the column names for the multi-vec with the pyt_colbert?:

  • query_toks [list of token ids]
  • query_embs [tensor of shape padded query len [32] x dimension size]
  • doc_toks [list of token ids]
  • doc_embs [tensor of shape doc len x dimension size]

@seanmacavaney
Copy link
Collaborator

@cmacdonald I largely agree the above. But it looks like query_toks is an unfortunate name conflict between LSR stuff (type: Dict[str, float|int]) and the ColBERT stuff (type: List[int]). And to make matters worse, these directly conflict in this setting :/.

pyterrier_dr/bge_m3.py Outdated Show resolved Hide resolved
pyterrier_dr/bge_m3.py Outdated Show resolved Hide resolved
pyterrier_dr/bge_m3.py Outdated Show resolved Hide resolved
pyterrier_dr/bge_m3.py Show resolved Hide resolved
pyterrier_dr/bge_m3.py Outdated Show resolved Hide resolved
@seanmacavaney
Copy link
Collaborator

@andreaschari chatting with @cmacdonald, I think we can use *_embs_toks for the tokens from the multivec model, to avoid conflicting the names with the LSR model.

README.md Outdated Show resolved Hide resolved
pyterrier_dr/bge_m3.py Outdated Show resolved Hide resolved
pyterrier_dr/bge_m3.py Outdated Show resolved Hide resolved
pyterrier_dr/bge_m3.py Outdated Show resolved Hide resolved
setup.py Outdated Show resolved Hide resolved
Copy link
Collaborator

@seanmacavaney seanmacavaney left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks great, thanks @andreaschari!

@seanmacavaney seanmacavaney merged commit 621ec9d into terrierteam:master Nov 22, 2024
1 check passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants