
Conversation

@sam-hey
Contributor

@sam-hey sam-hey commented Mar 30, 2025

@NohTow
Collaborator

NohTow commented Mar 31, 2025

add cache to ci to reduce errors when loading files from hf
Thank god, I had to rerun a bunch of tests just because of this lately...
fix: cached_contrastive.py #106
I have two questions regarding this fix:

  1. Did you test whether training was sane on MPS devices? The affected line only exists to make the different forward passes on the same elements equivalent, by fixing the randomness state (e.g., for dropout), but I wonder how it behaves when this is ignored. If it does not work properly, I would rather we find a solution or raise an error when using MPS devices.
  2. I am pretty bad at handling MPS right now; is there any setup where torch.backends.mps.is_available() could return true while other kinds of devices are available/are actually being used for training?
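For context, the randomness fix under discussion can be illustrated on CPU: saving the RNG state before the first forward pass and restoring it before the second makes stochastic layers such as dropout draw the same mask both times, so the two passes are numerically identical. This is only a minimal sketch of the general technique (CPU RNG only), not the actual cached_contrastive.py code:

```python
import torch

torch.manual_seed(0)
model = torch.nn.Sequential(torch.nn.Linear(8, 8), torch.nn.Dropout(p=0.5))
model.train()  # keep dropout active
x = torch.randn(2, 8)

# Save the CPU RNG state before the first (no-grad) pass...
state = torch.get_rng_state()
with torch.no_grad():
    out1 = model(x)

# ...and restore it before the second pass, so dropout samples
# the exact same mask and both outputs match.
torch.set_rng_state(state)
out2 = model(x)

assert torch.allclose(out1, out2)
```

On CUDA or MPS the per-device RNG state would need the same save/restore treatment, which is where device-specific behavior comes in.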

update model_card to use huggingface/sentence-transformers#3253

Thank you for handling the PR on ST and also adding it here! FWIU what you did should work (did you test it?). The only nitpick I have: we should now remove the generate_model_card function overriding. Even though you already removed its usage, so the behavior is already what we want, it is cleaner to remove it entirely so people don't think we override it!

chor: st to 4.0.1

I suppose this is because the PR has been included in this version, which is why you want to do the upgrade in this PR.
That being said, although the major version change is mostly due to the reranker part, and I believe Tom told me there shouldn't be any breaking changes, maybe I should run some sanity checks on the trainings to make sure everything is correct before merging. Do you need this to be merged quickly?

Edit: also, obviously thanks for the contributions, really appreciate it!

@sam-hey
Contributor Author

sam-hey commented Mar 31, 2025

Always glad to help!

I'm not too sure about the MPS implementation: the values looked OK, but I can't guarantee their correctness. Therefore, I switched to raising an error for MPS.

I am pretty bad at handling MPS right now; is there any setup where torch.backends.mps.is_available() could return true while other kinds of devices are available/are actually being used for training?

Devices with MPS never have any other graphics card available.
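A minimal sketch of the kind of guard described above (the function name and message are hypothetical, not the actual PyLate code): fail fast when MPS is the backend rather than train with unverified numerics.

```python
import torch


def assert_supported_device() -> None:
    # Hypothetical guard: cached contrastive training relies on
    # replaying RNG state across forward passes, which is untested
    # on Apple MPS, so raise instead of training silently.
    if torch.backends.mps.is_available():
        raise ValueError(
            "Cached contrastive loss is not supported on MPS devices; "
            "use a CPU or CUDA device instead."
        )
```

Since Apple Silicon machines expose no other GPU backend, checking torch.backends.mps.is_available() alone is enough to know MPS would be picked for training.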

Thank you for handling the PR on ST and also adding it here! FWIU what you did should work (did you test it?)

Yes, I did:

model = models.ColBERT(
    model_name_or_path="colbert-ir/colbertv2.0",
    model_card_data=PylateModelCardData(language="de", model_name="testing"),
)
model.push_to_hub(
    "samheym/pylate-test", private=True, train_datasets=[], exist_ok=True
)

Do you need this to be merged quickly?

No, I have time to wait for this. Just out of curiosity, what are you checking manually? Is there any option to include these tests in the CI?

@NohTow
Collaborator

NohTow commented Mar 31, 2025

No, I have time to wait for this. Just out of curiosity, what are you checking manually? Is there any option to include these tests in the CI?

Not much, just launching some larger-scale training (nothing fancy, just MS MARCO) and checking the results, to make sure there isn't any huge regression.
I think a proper training run added to the tests would slow everything down even more, and there is very little chance the training is degraded without crashing, but I am a bit paranoid!
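For illustration, the cheapest kind of CI training check would be a smoke test along these lines (a hypothetical sketch, unrelated to PyLate's actual test suite); as noted above, it can only catch crashes and non-finite losses, not silent quality regressions:

```python
import torch


def test_training_smoke() -> None:
    # Hypothetical smoke test: one optimizer step on a tiny model
    # with random data. Asserting that the loss decreases would be
    # flaky, so only assert that it stays finite.
    model = torch.nn.Linear(4, 2)
    opt = torch.optim.SGD(model.parameters(), lr=0.1)
    x, y = torch.randn(8, 4), torch.randint(0, 2, (8,))
    loss = torch.nn.functional.cross_entropy(model(x), y)
    loss.backward()
    opt.step()
    assert torch.isfinite(loss)
```

Catching an actual quality regression still requires the kind of manual larger-scale run described above.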

@NohTow
Collaborator

NohTow commented Apr 7, 2025

Hello,
I did some tests and it seems to work fine!
Could you just change the ST version directly to 4.0.2? It introduces a fix for multi-GPU training.

@tomaarsen
Collaborator

Hello!

I didn't investigate this PR with much care, but I do want to share that v4.0 was designed not to introduce anything breaking, especially if you don't use CrossEncoder (training). Nice work here @sam-hey!

  • Tom Aarsen

@NohTow
Collaborator

NohTow commented Apr 7, 2025

Yeah, no, it's just me being paranoid, but I did some unrelated training and took the opportunity to test the memory fix/sanity-check the training, and it's fine!
(but thanks for being cautious with releases and pinging us when there might be breaking changes, really appreciated!)

Co-authored-by: Antoine Chaffin <[email protected]>
@sam-hey
Contributor Author

sam-hey commented Apr 8, 2025

Thanks a lot to you @tomaarsen @NohTow — it’s always a pleasure working with you!

Best
Sam

@NohTow
Collaborator

NohTow commented Apr 8, 2025

LGTM thanks!

@NohTow NohTow merged commit 5940bc7 into lightonai:main Apr 8, 2025
9 checks passed
@sam-hey sam-hey deleted the ref/model_card branch April 8, 2025 12:47