Integrity check for tokenizer downloads uses the wrong checksum field #865

PeanutButterRat · 2024-11-30T22:07:47Z

Describe the bug:
Recently, #840 was merged that allows the download manager to validate models, datasets, and tokenizers with an MD5 hash after completion. Because tokenizers and models share the same assert card for metadata, they require 2 separate checksum fields to distinguish the two assets but currently both default to the same checksum field.

Describe how to reproduce:

import fairseq2
from fairseq2.assets import default_asset_store, InProcAssetDownloadManager

fairseq2.setup_fairseq2()
card = default_asset_store.retrieve_card("mistral_7b")
uri = card.field("tokenizer").as_uri()  # swap with "checkpoint"
checksum = card.field("checksum").get_as_(str)
download_manager = InProcAssetDownloadManager()

download_manager.download_tokenizer(uri, model_name="mistral", checksum=checksum)  # swap with download_checkpoint

The above code sample more or less behaves like how the download manager is used to download the tokenizer. The real issue lies with the fact that text_tokenizer.py uses the same checksum field as loader.py when it really should use something different like tokenizer_checksum.

Describe the expected behavior:
Both assets should be validated with separate hashes because they are different assets. This should be easily resolved by changing text_tokenizer.py to look for a separate field like tokenizer_checksum.

Environment:
fairseq2: 0.3.0.dev0
PyTorch: 2.4.0+cu121
Python: 3.10.12
OS: Windows 10 (WSL)

Additional Context:
None

The text was updated successfully, but these errors were encountered:

PeanutButterRat added the bug Something isn't working label Nov 30, 2024

PeanutButterRat changed the title ~~Integrity check for tokenizer downloads uses the wrong checksum~~ Integrity check for tokenizer downloads uses the wrong checksum field Nov 30, 2024

PeanutButterRat linked a pull request Nov 30, 2024 that will close this issue

Change checksum field to tokenizer_checksum #866

Open

7 tasks

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Integrity check for tokenizer downloads uses the wrong checksum field #865

Integrity check for tokenizer downloads uses the wrong checksum field #865

PeanutButterRat commented Nov 30, 2024 •

edited

Loading

Integrity check for tokenizer downloads uses the wrong checksum field #865

Integrity check for tokenizer downloads uses the wrong checksum field #865

Comments

PeanutButterRat commented Nov 30, 2024 • edited Loading

PeanutButterRat commented Nov 30, 2024 •

edited

Loading