A step-by-step Jupyter Notebook demonstrating how to build and train a compact small language model (“SLM”) from scratch using the TinyStories dataset. Covers data preparation, BPE tokenization, efficient binary storage, GPU memory locking, Transformer architecture, training configuration, and sample text generation.
- End-to-end pipeline: from raw text to a fully trained model, all within one notebook.
- Efficient tokenization: uses OpenAI's `tiktoken` (BPE) for subword encoding.
- Disk-backed dataset: saves token IDs in `.bin` files for fast reloads.
- Memory locking: demonstrates reserving GPU memory to avoid fragmentation.
- Custom Transformer: minimal PyTorch model with multi-head attention and feed-forward blocks.
- Training loop: configurable optimizer, learning-rate schedules, gradient clipping, and logging.
- Sample outputs: generates TinyStories-style text to verify model behavior.
- Introduction
- Dataset
- Prerequisites
- Setup & Installation
- Notebook Walk-through
- Training Configuration
- Sample Generation
- Results & Next Steps
- Contributing
- License
Building Large Language Models (LLMs) from scratch can be resource-intensive. This notebook shows how to create a Small Language Model (SLM) using a lightweight dataset, minimalist code, and standard hardware (e.g., a single GPU).
- TinyStories: roughly 2 million short stories for training and about 20,000 for validation
- Hosted on Hugging Face Datasets
- Each “story” is a short, self-contained text ideal for low-compute experimentation
- Python 3.8+
- GPU with CUDA (optional, but highly recommended)
- PyTorch
- Hugging Face `datasets`
- `tiktoken`
- `numpy`, `matplotlib`, `tqdm`
```bash
# Create a virtual environment (recommended)
python3 -m venv venv
source venv/bin/activate

# Install core dependencies
pip install torch torchvision \
    datasets \
    tiktoken \
    numpy \
    matplotlib \
    tqdm
```
```python
from datasets import load_dataset

ds = load_dataset("roneneldan/TinyStories")
```
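As a quick sanity check, you can peek at a sample and the split sizes (this assumes the dataset's default `train`/`validation` splits and `text` column):

```python
# Inspect one story and the split sizes before tokenizing.
print(ds["train"][0]["text"][:200])
print(f'{len(ds["train"]):,} training stories, {len(ds["validation"]):,} validation stories')
```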
Implements Byte Pair Encoding via `tiktoken`. Converts text → token IDs → binary `.bin` files, saving the training/validation tokens on disk for fast reloads.
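A minimal sketch of this step, assuming the GPT-2 BPE vocabulary and illustrative `train.bin` / `validation.bin` file names (the notebook's exact choices may differ):

```python
import numpy as np
import tiktoken

enc = tiktoken.get_encoding("gpt2")  # GPT-2 BPE; any tiktoken encoding works

def tokenize_split(split_ds, out_path):
    """Encode every story, append an end-of-text token, and dump to a flat .bin file."""
    ids = []
    for example in split_ds:
        ids.extend(enc.encode_ordinary(example["text"]))
        ids.append(enc.eot_token)          # mark story boundaries
    arr = np.array(ids, dtype=np.uint16)   # GPT-2 vocab (50,257) fits in uint16
    arr.tofile(out_path)

tokenize_split(ds["train"], "train.bin")
tokenize_split(ds["validation"], "validation.bin")

# Later, reload without re-tokenizing:
train_ids = np.memmap("train.bin", dtype=np.uint16, mode="r")
```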
Reserves GPU memory (`torch.cuda.set_per_process_memory_fraction` or similar) to prevent fragmentation.
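For example, capping this process at 80% of the GPU's memory could look like the sketch below (the fraction itself is an arbitrary illustration):

```python
import torch

if torch.cuda.is_available():
    device = torch.device("cuda")
    # Cap this process at ~80% of GPU memory to leave headroom and
    # reduce allocator fragmentation during long training runs.
    torch.cuda.set_per_process_memory_fraction(0.8, device=device)
else:
    device = torch.device("cpu")
```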
Lightweight Transformer with configurable layers, heads, and embedding size.
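The block structure is the standard attention-plus-MLP pattern; a compact sketch is below (the use of `nn.MultiheadAttention` and the pre-norm layout are assumptions, not necessarily the notebook's exact implementation):

```python
import torch
import torch.nn as nn

class TransformerBlock(nn.Module):
    """Pre-norm Transformer block: multi-head self-attention + feed-forward MLP."""

    def __init__(self, n_embd, n_head, dropout=0.0, bias=True):
        super().__init__()
        self.ln1 = nn.LayerNorm(n_embd)
        self.attn = nn.MultiheadAttention(n_embd, n_head, dropout=dropout,
                                          bias=bias, batch_first=True)
        self.ln2 = nn.LayerNorm(n_embd)
        self.mlp = nn.Sequential(
            nn.Linear(n_embd, 4 * n_embd, bias=bias),
            nn.GELU(),
            nn.Linear(4 * n_embd, n_embd, bias=bias),
            nn.Dropout(dropout),
        )

    def forward(self, x):
        # Causal mask: each position may only attend to earlier tokens.
        T = x.size(1)
        mask = torch.triu(torch.ones(T, T, dtype=torch.bool, device=x.device), diagonal=1)
        h = self.ln1(x)
        attn_out, _ = self.attn(h, h, h, attn_mask=mask, need_weights=False)
        x = x + attn_out
        x = x + self.mlp(self.ln2(x))
        return x
```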
- Optimizer: AdamW
- LR Schedulers: LinearLR, CosineAnnealingLR, SequentialLR
- Gradient clipping, periodic evaluation, and loss logging.
- Samples new stories to verify qualitative performance.
- Plots loss curves with Matplotlib.
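Putting those pieces together, a stripped-down version of the training loop might look like this (the `get_batch` helper, step counts, and hyperparameter values are placeholders, and `model` is assumed to return logits of shape `(B, T, vocab_size)`):

```python
import torch

optimizer = torch.optim.AdamW(model.parameters(), lr=3e-4, weight_decay=0.1)

# Linear warmup followed by cosine decay, chained with SequentialLR.
warmup = torch.optim.lr_scheduler.LinearLR(optimizer, start_factor=0.1, total_iters=200)
decay = torch.optim.lr_scheduler.CosineAnnealingLR(optimizer, T_max=4800, eta_min=3e-5)
scheduler = torch.optim.lr_scheduler.SequentialLR(optimizer, [warmup, decay], milestones=[200])

losses = []
for step in range(5000):
    xb, yb = get_batch("train")          # placeholder: yields (inputs, targets) on `device`
    logits = model(xb)
    loss = torch.nn.functional.cross_entropy(
        logits.view(-1, logits.size(-1)), yb.view(-1))

    optimizer.zero_grad(set_to_none=True)
    loss.backward()
    torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)  # gradient clipping
    optimizer.step()
    scheduler.step()

    if step % 250 == 0:                  # periodic logging for the loss curve
        losses.append(loss.item())
        print(f"step {step}: train loss {loss.item():.4f}")
```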
```python
from dataclasses import dataclass

@dataclass
class GPTConfig:
    block_size: int
    vocab_size: int
    n_layer: int
    n_head: int
    n_embd: int
    dropout: float = 0.0
    bias: bool = True
```
Example hyperparameters:
| Parameter | Value |
|---|---|
| `block_size` | 128 |
| `vocab_size` | 50,000 |
| `n_layer` | 4 |
| `n_head` | 8 |
| `n_embd` | 256 |
| `dropout` | 0.1 |
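For reference, those values map onto the dataclass like so:

```python
config = GPTConfig(
    block_size=128,
    vocab_size=50_000,
    n_layer=4,
    n_head=8,
    n_embd=256,
    dropout=0.1,
)
```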
After training, run:
```python
prompt = "Once upon a time"
generated = model.generate(prompt, max_new_tokens=100)
print(generated)
```
Expect TinyStories-style outputs (short, coherent sentences).
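For orientation, a typical autoregressive `generate` loop looks roughly like the sketch below; this is an assumption about how such a method is commonly written, not a transcript of the notebook's code:

```python
import torch

@torch.no_grad()
def generate(model, idx, max_new_tokens, block_size, temperature=1.0):
    """Sample token IDs autoregressively; `idx` is a (B, T) tensor of context tokens."""
    model.eval()
    for _ in range(max_new_tokens):
        idx_cond = idx[:, -block_size:]              # crop context to the block size
        logits = model(idx_cond)                     # (B, T, vocab_size)
        logits = logits[:, -1, :] / temperature      # keep only the last position
        probs = torch.nn.functional.softmax(logits, dim=-1)
        next_id = torch.multinomial(probs, num_samples=1)
        idx = torch.cat([idx, next_id], dim=1)
    return idx
```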
- Validation loss curve (see notebook plot).
- Qualitative samples demonstrating grammar and coherence.
- Scale up dataset (longer stories).
- Experiment with deeper/wider architectures.
- Integrate more sophisticated tokenizers (e.g., SentencePiece).
- Fork this repository
- Create a new branch (`git checkout -b feature/xyz`)
- Commit your changes (`git commit -m 'Add xyz feature'`)
- Push to your branch (`git push origin feature/xyz`)
- Open a Pull Request
All contributions—bug reports, documentation fixes, new features—are welcome!
I have taken inspiration from the following resources:
- Karpathy, A. (2023). nanoGPT [GitHub repository].
- Eldan, R., & Li, Y. (2023). TinyStories: How Small Can Language Models Be and Still Speak Coherent English? arXiv preprint arXiv:2305.07759.
Released under the MIT License.