Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Integrity check for tokenizer downloads uses the wrong checksum field #865

Open
PeanutButterRat opened this issue Nov 30, 2024 · 0 comments · May be fixed by #866
Open

Integrity check for tokenizer downloads uses the wrong checksum field #865

PeanutButterRat opened this issue Nov 30, 2024 · 0 comments · May be fixed by #866
Labels
bug Something isn't working

Comments

@PeanutButterRat
Copy link
Contributor

PeanutButterRat commented Nov 30, 2024

Describe the bug:
Recently, #840 was merged that allows the download manager to validate models, datasets, and tokenizers with an MD5 hash after completion. Because tokenizers and models share the same assert card for metadata, they require 2 separate checksum fields to distinguish the two assets but currently both default to the same checksum field.

Describe how to reproduce:

import fairseq2
from fairseq2.assets import default_asset_store, InProcAssetDownloadManager

fairseq2.setup_fairseq2()
card = default_asset_store.retrieve_card("mistral_7b")
uri = card.field("tokenizer").as_uri()  # swap with "checkpoint"
checksum = card.field("checksum").get_as_(str)
download_manager = InProcAssetDownloadManager()

download_manager.download_tokenizer(uri, model_name="mistral", checksum=checksum)  # swap with download_checkpoint

The above code sample more or less behaves like how the download manager is used to download the tokenizer. The real issue lies with the fact that text_tokenizer.py uses the same checksum field as loader.py when it really should use something different like tokenizer_checksum.

Describe the expected behavior:
Both assets should be validated with separate hashes because they are different assets. This should be easily resolved by changing text_tokenizer.py to look for a separate field like tokenizer_checksum.

Environment:
fairseq2: 0.3.0.dev0
PyTorch: 2.4.0+cu121
Python: 3.10.12
OS: Windows 10 (WSL)

Additional Context:
None

@PeanutButterRat PeanutButterRat added the bug Something isn't working label Nov 30, 2024
@PeanutButterRat PeanutButterRat changed the title Integrity check for tokenizer downloads uses the wrong checksum Integrity check for tokenizer downloads uses the wrong checksum field Nov 30, 2024
@PeanutButterRat PeanutButterRat linked a pull request Nov 30, 2024 that will close this issue
7 tasks
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working
Projects
None yet
Development

Successfully merging a pull request may close this issue.

1 participant