Fix BPE bonus materials #561

rasbt · 2025-03-08T17:01:36Z

Fixes issues with the BPE tokenizer to correctly handle edge cases.

Fixes #558

review-notebook-app · 2025-03-08T17:01:41Z

Check out this pull request on ReviewNB

See visual diffs & provide feedback on Jupyter Notebooks.

Powered by ReviewNB

d-kleine · 2025-03-09T10:34:22Z

@rasbt About the sentinel for the lowest merge rank, defined as min_rank = 1_000_000_000: I am not sure if inf would be a better choice (just like in the original BPE implementation) to keep the code resilient. The value 1_000_000_000 might look very specific to users, although it is arbitrary. Also, with multi-modality and growing vocabulary sizes, the number of adjacent token combinations might reach 1_000_000_000 at some point in the future, possibly breaking the code then.

LLMs-from-scratch/ch02/05_bpe-from-scratch/bpe-from-scratch.ipynb

Lines 607 to 613 in f63f04d

    
    "            min_rank = 1_000_000_000\n",
    "            bigram = None\n",
    "            for p in pairs:\n",
    "                r = self.bpe_ranks.get(p, 1_000_000_000)\n",
    "                if r < min_rank:\n",
    "                    min_rank = r\n",
    "                    bigram = p\n",
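For illustration, the suggested change could look roughly like the sketch below. It is not the notebook's actual code: the standalone function `find_best_pair` and the toy `bpe_ranks` dictionary are hypothetical, introduced only to make the loop above runnable on its own, with `float("inf")` replacing the fixed constant.

```python
def find_best_pair(pairs, bpe_ranks):
    """Return the adjacent pair with the lowest merge rank, or None if no pair is ranked."""
    min_rank = float("inf")  # sentinel that compares greater than any real rank
    bigram = None
    for p in pairs:
        # Pairs without a learned merge get an infinite rank and are never selected
        r = bpe_ranks.get(p, float("inf"))
        if r < min_rank:
            min_rank = r
            bigram = p
    return bigram


# Toy merge ranks, made up for this example (lower rank = merged earlier)
bpe_ranks = {("l", "o"): 0, ("lo", "w"): 1}
pairs = [("h", "e"), ("l", "o"), ("lo", "w")]

print(find_best_pair(pairs, bpe_ranks))         # ('l', 'o')
print(find_best_pair([("x", "y")], bpe_ranks))  # None
```

Because `float("inf")` is never smaller than itself, an unranked pair can never win the comparison, regardless of how large the merge table grows.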

rasbt · 2025-03-09T15:18:07Z

yes, I agree!

Fix BPE bonus materials

be9c5e5

rasbt marked this pull request as draft March 8, 2025 17:01

fix bpe implementation

92f473e

rasbt marked this pull request as ready for review March 8, 2025 20:44

update

9fb2bf0

rasbt mentioned this pull request Mar 8, 2025

Encoding issue with "Hello," #558

Closed

rasbt added 5 commits March 8, 2025 14:55

Add 'Hello, world. Is this-- a test?' test case

ca09357

update link to test file

7c7e60e

update path handling

3bfad4c

update path handling

cf081c0

fix pytest paths

53c2fec

rasbt merged commit f63f04d into main Mar 8, 2025
13 checks passed

rasbt deleted the fix-bpe branch March 8, 2025 23:21

rasbt mentioned this pull request Mar 9, 2025

Cosmetic improvements to the BPE code #562

Merged
