Describe the bug:
Recently, #840 was merged, which allows the download manager to validate models, datasets, and tokenizers with an MD5 hash after the download completes. Because tokenizers and models share the same asset card for metadata, they need two separate checksum fields to distinguish the two assets, but currently both default to the same checksum field.
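To make the failure mode concrete, here is a minimal, self-contained sketch that does not use fairseq2 at all. It assumes the card is a plain dict, uses a hypothetical `md5_matches` helper, and substitutes short byte strings for the downloaded files. With a single shared `checksum` field, at most one of the two assets can pass validation.

```python
import hashlib


def md5_matches(data: bytes, expected: str) -> bool:
    """Return True if the MD5 hex digest of `data` equals `expected`."""
    return hashlib.md5(data).hexdigest() == expected


# Stand-ins for the two downloaded assets (hypothetical contents).
checkpoint_bytes = b"model weights"
tokenizer_bytes = b"tokenizer vocabulary"

# A card with one shared field, here holding the checkpoint's hash.
card = {"checksum": hashlib.md5(checkpoint_bytes).hexdigest()}

# The checkpoint validates, but the tokenizer is compared against the
# checkpoint's hash and therefore fails.
checkpoint_ok = md5_matches(checkpoint_bytes, card["checksum"])
tokenizer_ok = md5_matches(tokenizer_bytes, card["checksum"])
print(checkpoint_ok, tokenizer_ok)
```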
Describe how to reproduce:
import fairseq2
from fairseq2.assets import default_asset_store, InProcAssetDownloadManager

fairseq2.setup_fairseq2()

card = default_asset_store.retrieve_card("mistral_7b")
uri = card.field("tokenizer").as_uri()  # swap with "checkpoint"
checksum = card.field("checksum").get_as_(str)

download_manager = InProcAssetDownloadManager()
download_manager.download_tokenizer(uri, model_name="mistral", checksum=checksum)  # swap with download_checkpoint
The code sample above approximates how the download manager is used to download the tokenizer. The underlying problem is that text_tokenizer.py reads the same checksum field as loader.py, when it should read a separate field such as tokenizer_checksum.
Describe the expected behavior:
Both assets should be validated with separate hashes because they are different assets. This should be easily resolved by changing text_tokenizer.py to look for a separate field like tokenizer_checksum.
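One way the proposed lookup could work, sketched outside fairseq2: the card is modeled as a plain dict, and `CHECKSUM_FIELDS` and `expected_checksum` are hypothetical names that are not part of the library. Only the field name `tokenizer_checksum` comes from the suggestion above, and the hash values are placeholders.

```python
from typing import Optional

# Hypothetical mapping from asset kind to the card field holding its hash.
CHECKSUM_FIELDS = {
    "checkpoint": "checksum",
    "tokenizer": "tokenizer_checksum",
}


def expected_checksum(card: dict, asset_kind: str) -> Optional[str]:
    """Look up the checksum for one asset kind.

    Returns None when the card has no entry for that kind, which a caller
    could treat as "skip validation".
    """
    return card.get(CHECKSUM_FIELDS[asset_kind])


# Placeholder hashes; a real card would hold MD5 hex digests.
card = {
    "checksum": "aaaa",
    "tokenizer_checksum": "bbbb",
}

checkpoint_hash = expected_checksum(card, "checkpoint")
tokenizer_hash = expected_checksum(card, "tokenizer")
```

Returning None for a missing field keeps older cards that only define `checksum` usable for tokenizer downloads, at the cost of skipping validation for them.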
PeanutButterRat changed the title from "Integrity check for tokenizer downloads uses the wrong checksum" to "Integrity check for tokenizer downloads uses the wrong checksum field" on Nov 30, 2024
Environment:
fairseq2: 0.3.0.dev0
PyTorch: 2.4.0+cu121
Python: 3.10.12
OS: Windows 10 (WSL)
Additional Context:
None