A step-by-step Jupyter Notebook demonstrating how to build and train a compact small language model (“SLM”) from scratch using the TinyStories dataset. Covers data preparation, BPE tokenization, efficient binary storage, GPU memory locking, Transformer architecture, training configuration, and sample text generation.
- End-to-end pipeline: from raw text to a fully trained model, all within one notebook.
- Efficient tokenization: uses OpenAI's `tiktoken` (BPE) for subword encoding.
- Disk-backed dataset: saves token IDs in `.bin` files for fast reloads.
- Memory locking: demonstrates reserving GPU memory to avoid fragmentation.
- Custom Transformer: minimal PyTorch model with multi-head attention and feed-forward blocks.
- Training loop: configurable optimizer, learning-rate schedules, gradient clipping, and logging.
- Sample outputs: generates TinyStories-style text to verify model behavior.
- Introduction
- Dataset
- Prerequisites
- Setup & Installation
- Notebook Walk-through
- Training Configuration
- Sample Generation
- Results & Next Steps
- Contributing
- License
Building Large Language Models (LLMs) from scratch can be resource-intensive. This notebook shows how to create a Small Language Model (SLM) using a lightweight dataset, minimalist code, and standard hardware (e.g., a single GPU).
- TinyStories: roughly 2 million short stories for training and about 20,000 for validation
- Hosted on Hugging Face Datasets
- Each “story” is a short, self-contained text ideal for low-compute experimentation
- Python 3.8+
- GPU with CUDA (optional, but highly recommended)
- PyTorch
- Hugging Face `datasets`
- `tiktoken`
- `numpy`, `matplotlib`, `tqdm`
```bash
# Create a virtual environment (recommended)
python3 -m venv venv
source venv/bin/activate

# Install core dependencies
pip install torch torchvision \
    datasets \
    tiktoken \
    numpy \
    matplotlib \
    tqdm
```
```python
from datasets import load_dataset

ds = load_dataset("roneneldan/TinyStories")
```
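As a quick sanity check, you can peek at a sample and the split sizes (this assumes the dataset's default `train`/`validation` splits and `text` column):

```python
# Inspect one story and the split sizes before tokenizing.
print(ds["train"][0]["text"][:200])
print(f'{len(ds["train"]):,} training stories, {len(ds["validation"]):,} validation stories')
```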
Implements Byte Pair Encoding via `tiktoken`. Converts text → token IDs → binary `.bin` files, saving the training/validation tokens on disk for fast reloads.
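A minimal sketch of this step, assuming the GPT-2 BPE vocabulary and illustrative `train.bin` / `validation.bin` file names (the notebook's exact choices may differ):

```python
import numpy as np
import tiktoken

enc = tiktoken.get_encoding("gpt2")  # GPT-2 BPE; any tiktoken encoding works

def tokenize_split(split_ds, out_path):
    """Encode every story, append an end-of-text token, and dump to a flat .bin file."""
    ids = []
    for example in split_ds:
        ids.extend(enc.encode_ordinary(example["text"]))
        ids.append(enc.eot_token)          # mark story boundaries
    arr = np.array(ids, dtype=np.uint16)   # GPT-2 vocab (50,257) fits in uint16
    arr.tofile(out_path)

tokenize_split(ds["train"], "train.bin")
tokenize_split(ds["validation"], "validation.bin")

# Later, reload without re-tokenizing:
train_ids = np.memmap("train.bin", dtype=np.uint16, mode="r")
```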
Reserves GPU memory (`torch.cuda.set_per_process_memory_fraction` or similar) to prevent fragmentation.
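For example, capping this process at 80% of the GPU's memory could look like the sketch below (the fraction itself is an arbitrary illustration):

```python
import torch

if torch.cuda.is_available():
    device = torch.device("cuda")
    # Cap this process at ~80% of GPU memory to leave headroom and
    # reduce allocator fragmentation during long training runs.
    torch.cuda.set_per_process_memory_fraction(0.8, device=device)
else:
    device = torch.device("cpu")
```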
Lightweight Transformer with configurable layers, heads, and embedding size.
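The block structure is the standard attention-plus-MLP pattern; a compact sketch is below (the use of `nn.MultiheadAttention` and the pre-norm layout are assumptions, not necessarily the notebook's exact implementation):

```python
import torch
import torch.nn as nn

class TransformerBlock(nn.Module):
    """Pre-norm Transformer block: multi-head self-attention + feed-forward MLP."""

    def __init__(self, n_embd, n_head, dropout=0.0, bias=True):
        super().__init__()
        self.ln1 = nn.LayerNorm(n_embd)
        self.attn = nn.MultiheadAttention(n_embd, n_head, dropout=dropout,
                                          bias=bias, batch_first=True)
        self.ln2 = nn.LayerNorm(n_embd)
        self.mlp = nn.Sequential(
            nn.Linear(n_embd, 4 * n_embd, bias=bias),
            nn.GELU(),
            nn.Linear(4 * n_embd, n_embd, bias=bias),
            nn.Dropout(dropout),
        )

    def forward(self, x):
        # Causal mask: each position may only attend to earlier tokens.
        T = x.size(1)
        mask = torch.triu(torch.ones(T, T, dtype=torch.bool, device=x.device), diagonal=1)
        h = self.ln1(x)
        attn_out, _ = self.attn(h, h, h, attn_mask=mask, need_weights=False)
        x = x + attn_out
        x = x + self.mlp(self.ln2(x))
        return x
```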
- Optimizer: AdamW
- LR Schedulers: LinearLR, CosineAnnealingLR, SequentialLR
- Gradient clipping, periodic evaluation, and loss logging.
- Samples new stories to verify qualitative performance.
- Plots loss curves with Matplotlib.
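Putting those pieces together, a stripped-down version of the training loop might look like this (the `get_batch` helper, step counts, and hyperparameter values are placeholders, and `model` is assumed to return logits of shape `(B, T, vocab_size)`):

```python
import torch

optimizer = torch.optim.AdamW(model.parameters(), lr=3e-4, weight_decay=0.1)

# Linear warmup followed by cosine decay, chained with SequentialLR.
warmup = torch.optim.lr_scheduler.LinearLR(optimizer, start_factor=0.1, total_iters=200)
decay = torch.optim.lr_scheduler.CosineAnnealingLR(optimizer, T_max=4800, eta_min=3e-5)
scheduler = torch.optim.lr_scheduler.SequentialLR(optimizer, [warmup, decay], milestones=[200])

losses = []
for step in range(5000):
    xb, yb = get_batch("train")          # placeholder: yields (inputs, targets) on `device`
    logits = model(xb)
    loss = torch.nn.functional.cross_entropy(
        logits.view(-1, logits.size(-1)), yb.view(-1))

    optimizer.zero_grad(set_to_none=True)
    loss.backward()
    torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)  # gradient clipping
    optimizer.step()
    scheduler.step()

    if step % 250 == 0:                  # periodic logging for the loss curve
        losses.append(loss.item())
        print(f"step {step}: train loss {loss.item():.4f}")
```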
```python
from dataclasses import dataclass

@dataclass
class GPTConfig:
    block_size: int
    vocab_size: int
    n_layer: int
    n_head: int
    n_embd: int
    dropout: float = 0.0
    bias: bool = True
```
Example hyperparameters:
| Parameter | Value |
|---|---|
| `block_size` | 128 |
| `vocab_size` | 50,000 |
| `n_layer` | 4 |
| `n_head` | 8 |
| `n_embd` | 256 |
| `dropout` | 0.1 |
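For reference, those values map onto the dataclass like so:

```python
config = GPTConfig(
    block_size=128,
    vocab_size=50_000,
    n_layer=4,
    n_head=8,
    n_embd=256,
    dropout=0.1,
)
```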
After training, run:
```python
prompt = "Once upon a time"
generated = model.generate(prompt, max_new_tokens=100)
print(generated)
```
Expect TinyStories-style outputs (short, coherent sentences).
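For orientation, a typical autoregressive `generate` loop looks roughly like the sketch below; this is an assumption about how such a method is commonly written, not a transcript of the notebook's code:

```python
import torch

@torch.no_grad()
def generate(model, idx, max_new_tokens, block_size, temperature=1.0):
    """Sample token IDs autoregressively; `idx` is a (B, T) tensor of context tokens."""
    model.eval()
    for _ in range(max_new_tokens):
        idx_cond = idx[:, -block_size:]              # crop context to the block size
        logits = model(idx_cond)                     # (B, T, vocab_size)
        logits = logits[:, -1, :] / temperature      # keep only the last position
        probs = torch.nn.functional.softmax(logits, dim=-1)
        next_id = torch.multinomial(probs, num_samples=1)
        idx = torch.cat([idx, next_id], dim=1)
    return idx
```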
- Validation loss curve (see notebook plot).
- Qualitative samples demonstrating grammar and coherence.
- Scale up dataset (longer stories).
- Experiment with deeper/wider architectures.
- Integrate more sophisticated tokenizers (e.g., SentencePiece).
- Fork this repository
- Create a new branch (`git checkout -b feature/xyz`)
- Commit your changes (`git commit -m 'Add xyz feature'`)
- Push to your branch (`git push origin feature/xyz`)
- Open a Pull Request
All contributions—bug reports, documentation fixes, new features—are welcome!
I have taken inspiration from the following resources:
- Karpathy, A. (2023). nanoGPT [GitHub repository].
- Eldan, R., & Li, Y. (2023). TinyStories: How Small Can Language Models Be and Still Speak Coherent English? arXiv preprint arXiv:2305.07759.
Released under the MIT License.