A simplified yet functional implementation of the GPT-2 language model written entirely in C.
This project demonstrates the core concepts of transformer-based language models with a focus on educational clarity and minimal dependencies.
- Pure C Implementation: No external dependencies beyond standard C libraries
- Complete Transformer Architecture: Includes multi-head attention, feed-forward networks, and layer normalization
- Training Capabilities: Basic training loop with cross-entropy loss and gradient updates (see the sketch after this list)
- Text Generation: Interactive text completion with temperature-controlled sampling
- Memory Management: Proper allocation and deallocation of matrix structures
- Educational Focus: Well-commented code with debug output for learning purposes
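As a rough sketch of the training item above, cross-entropy for a single predicted token and a plain gradient-descent update could be written as follows; the function names are illustrative and not taken from `gpt2.c`:

```c
#include <math.h>
#include <stddef.h>

/* Sketch: cross-entropy loss of the correct next token given raw logits.
 * Uses the usual max-subtraction trick so expf() does not overflow. */
static float cross_entropy(const float *logits, size_t vocab_size, size_t target)
{
    float max_logit = logits[0];
    for (size_t i = 1; i < vocab_size; i++)
        if (logits[i] > max_logit) max_logit = logits[i];

    float sum_exp = 0.0f;
    for (size_t i = 0; i < vocab_size; i++)
        sum_exp += expf(logits[i] - max_logit);

    /* -log softmax(logits)[target] */
    return -(logits[target] - max_logit - logf(sum_exp));
}

/* Sketch: plain gradient-descent update, w -= lr * grad. */
static void sgd_update(float *w, const float *grad, size_t n, float lr)
{
    for (size_t i = 0; i < n; i++)
        w[i] -= lr * grad[i];
}
```

In a full training loop this per-token loss would typically be averaged over all positions of a sequence before the update is applied.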
The implementation includes:
- Token and Positional Embeddings: Convert input tokens to dense vector representations
- Multi-Head Self-Attention: Core attention mechanism for modeling token relationships
- Feed-Forward Networks: Position-wise fully connected layers with GELU activation
- Layer Normalization: Stabilizes training and improves convergence (a minimal sketch follows this list)
- Language Modeling Head: Projects hidden states to vocabulary logits
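As an example of the layer-normalization step listed above, a minimal version over one embedding vector might look like this in C (illustrative names and signature, not the exact code in `gpt2.c`):

```c
#include <math.h>
#include <stddef.h>

/* Sketch: layer normalization over one embedding vector of length dim.
 * gamma/beta are the learned scale and shift parameters. */
static void layer_norm(float *x, const float *gamma, const float *beta,
                       size_t dim, float eps)
{
    float mean = 0.0f;
    for (size_t i = 0; i < dim; i++)
        mean += x[i];
    mean /= (float)dim;

    float var = 0.0f;
    for (size_t i = 0; i < dim; i++) {
        float d = x[i] - mean;
        var += d * d;
    }
    var /= (float)dim;

    float inv_std = 1.0f / sqrtf(var + eps);
    for (size_t i = 0; i < dim; i++)
        x[i] = gamma[i] * (x[i] - mean) * inv_std + beta[i];
}
```

In the full model this is applied at every sequence position, typically with `eps` on the order of 1e-5.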
Default hyperparameters (defined in `gpt2.h`):
- Vocabulary Size: 50,257 (standard GPT-2 vocabulary)
- Embedding Dimension: 768
- Number of Layers: 12
- Number of Attention Heads: 12
- Feed-Forward Dimension: 3072
- Maximum Sequence Length: 1024
- Learning Rate: 0.0001
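Assuming these values are exposed as preprocessor constants, the relevant part of the header might look roughly like the sketch below; the macro names are an assumption, so check `gpt2.h` for the actual identifiers:

```c
/* Hypothetical constant names -- see gpt2.h for the real ones. */
#define VOCAB_SIZE     50257   /* standard GPT-2 vocabulary          */
#define EMBED_DIM      768     /* embedding dimension                */
#define NUM_LAYERS     12      /* transformer blocks                 */
#define NUM_HEADS      12      /* attention heads per block          */
#define FFN_DIM        3072    /* feed-forward hidden dimension      */
#define MAX_SEQ_LEN    1024    /* maximum sequence length            */
#define LEARNING_RATE  0.0001f /* base learning rate                 */
```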
```
├── gpt2.h      # Header file with structure definitions and constants
├── gpt2.c      # Core implementation (matrix operations, model architecture)
├── main.c      # Training loop and text generation interface
├── train.txt   # Training data file (user-provided)
└── README.md   # This file
```
- C compiler (GCC, Clang, or MSVC)
- Standard C libraries (`stdio.h`, `stdlib.h`, `string.h`, `math.h`, `time.h`)
```sh
gcc -o gpt2 main.c gpt2.c -lm
```
Or using a Makefile:
```makefile
CC = gcc
CFLAGS = -Wall -O2
LIBS = -lm

gpt2: main.c gpt2.c gpt2.h
	$(CC) $(CFLAGS) -o gpt2 main.c gpt2.c $(LIBS)

clean:
	rm -f gpt2
```
Create a `train.txt` file with your training text (or simply use the one provided). Each line is treated as a separate training example:
```
Hello world, this is a sample sentence.
Machine learning is fascinating.
Transformers have revolutionized NLP.
```
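One way such a file could be loaded, one example per line, is sketched below; the function name, buffer size, and the commented-out training hook are assumptions rather than the actual code in `main.c`:

```c
#include <stdio.h>
#include <string.h>

/* Sketch: read training examples line by line from train.txt.
 * In the real program each line would be tokenized and fed to the model;
 * here we just strip the newline and count non-empty lines. */
static int load_training_lines(const char *path)
{
    FILE *fp = fopen(path, "r");
    if (!fp) {
        perror("fopen");
        return -1;
    }

    char line[4096];
    int count = 0;
    while (fgets(line, sizeof line, fp)) {
        line[strcspn(line, "\r\n")] = '\0';  /* strip trailing newline */
        if (line[0] == '\0')
            continue;                        /* skip empty lines */
        /* tokenize_and_train(line); -- hypothetical hook into the model */
        count++;
    }

    fclose(fp);
    return count;
}
```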
```sh
./gpt2
```
The program will:
- Load training data from `train.txt`
- Initialize the GPT-2 model with random weights
- Train for a specified number of epochs
- Generate text completions for test prompts
- Enter interactive mode for custom text generation
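The interactive mode in the last step boils down to a prompt loop. The sketch below uses a stub in place of the real generation call, so the actual `generate_text` in this project will differ:

```c
#include <stdio.h>
#include <string.h>

/* Placeholder standing in for the real model call; the project's
 * generate_text runs the forward pass and sampling instead. */
static void generate_text_stub(const char *prompt, char *out, size_t out_size)
{
    snprintf(out, out_size, "%s ...", prompt);   /* echo only, no model */
}

/* Sketch of the interactive loop: read a prompt, print a completion. */
static void interactive_mode(void)
{
    char prompt[1024];
    char output[2048];
    for (;;) {
        printf("Enter prompt: ");
        fflush(stdout);
        if (!fgets(prompt, sizeof prompt, stdin))
            break;                                /* EOF ends the loop */
        prompt[strcspn(prompt, "\r\n")] = '\0';
        if (prompt[0] == '\0')
            continue;                             /* ignore empty input */
        generate_text_stub(prompt, output, sizeof output);
        printf("Generated: \"%s\"\n", output);
    }
}
```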
```
Enter prompt: Altherya
(...)
[forward_pass] ▶ exiting
[generate_text] Forward pass completed
[generate_text] Sampled token: 32 (' ')
[generate_text] Generation complete, final seq_len=104
[generate_text] Final text: 'Altherya nel '
Generated: "Altherya nel "
```
Clearly it still has many issues, but this project is intended for exercise purposes only.
This implementation makes several simplifications for educational clarity:
- Character-level Tokenization: Uses simple character mapping instead of BPE (sketched after this list)
- Simplified Attention: Multi-head attention is partially implemented
- Basic Optimizer: Uses simple gradient updates instead of Adam
- Limited Gradient Computation: Backward pass is simplified
- No Regularization: Missing dropout and weight decay
- Fixed Hyperparameters: No dynamic learning rate scheduling
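For the character-level tokenization mentioned in the first bullet above, the mapping can be as simple as treating each byte value as a token id; the sketch below is illustrative and may not match the exact scheme in `gpt2.c`:

```c
#include <stddef.h>

/* Sketch: character-level "tokenization" -- each byte becomes a token id. */
static size_t tokenize_chars(const char *text, int *tokens, size_t max_tokens)
{
    size_t n = 0;
    for (const unsigned char *p = (const unsigned char *)text;
         *p != '\0' && n < max_tokens; p++)
        tokens[n++] = (int)*p;   /* token id = raw byte value (0..255) */
    return n;                    /* number of tokens produced */
}
```

The sample output earlier, where token 32 decodes to a space, is consistent with this kind of byte-value mapping.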
This implementation is designed to help understand:
- Transformer Architecture: How attention, feed-forward layers, and normalization work together
- Matrix Operations: Low-level implementation of neural network computations
- Memory Management: Proper handling of dynamic memory in C
- Training Loop: How language models learn from sequential data
- Text Generation: Autoregressive sampling strategies
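The autoregressive, temperature-controlled sampling mentioned above can be sketched as follows, assuming the logits for the next token have already been computed (names are illustrative):

```c
#include <math.h>
#include <stdlib.h>
#include <stddef.h>

/* Sketch: sample one token id from raw logits with temperature scaling.
 * Call srand() once at startup to seed the generator. */
static size_t sample_token(const float *logits, size_t vocab_size, float temperature)
{
    /* Softmax over temperature-scaled logits, max-subtracted for stability. */
    float max_logit = logits[0];
    for (size_t i = 1; i < vocab_size; i++)
        if (logits[i] > max_logit) max_logit = logits[i];

    float sum = 0.0f;
    for (size_t i = 0; i < vocab_size; i++)
        sum += expf((logits[i] - max_logit) / temperature);

    /* Draw u in [0, 1) and walk the cumulative distribution. */
    float u = (float)rand() / ((float)RAND_MAX + 1.0f);
    float cum = 0.0f;
    for (size_t i = 0; i < vocab_size; i++) {
        cum += expf((logits[i] - max_logit) / temperature) / sum;
        if (u < cum)
            return i;
    }
    return vocab_size - 1;  /* numerical fallback */
}
```

Lower temperatures make the distribution sharper and the output more deterministic; `temperature = 1.0` samples from the unmodified softmax.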
- Low Memory Usage
- Training Speed: Significantly slower than optimized implementations
- Numerical Stability: Uses basic floating-point arithmetic
- Scalability: Not optimized for large datasets or models
Potential enhancements:
- Better Tokenization: Implement BPE or WordPiece tokenization
- Optimized Attention: Full multi-head attention with proper reshaping
- Advanced Optimizers: Adam, AdamW, or other modern optimizers
- Regularization: Dropout, layer dropout, attention dropout
- Mixed Precision: Half-precision training for memory efficiency
- Parallel Processing: Multi-threading or GPU acceleration
- Model Checkpointing: Save/load trained models
- Evaluation Metrics: Perplexity calculation and validation
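For the evaluation-metrics item above, perplexity follows directly from the mean per-token cross-entropy already needed for training (a sketch):

```c
#include <math.h>
#include <stddef.h>

/* Sketch: perplexity is the exponential of the mean per-token
 * cross-entropy (in nats) over a held-out set. */
static float perplexity(const float *token_losses, size_t n_tokens)
{
    double sum = 0.0;
    for (size_t i = 0; i < n_tokens; i++)
        sum += token_losses[i];
    return (float)exp(sum / (double)n_tokens);
}
```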
To better understand the concepts implemented here:
- Attention Is All You Need - Original Transformer paper
- Language Models are Unsupervised Multitask Learners - GPT-2 paper
- The Illustrated Transformer - Visual explanation
- GPT-2 Architecture - GPT-2 specific details
This is an educational implementation. Contributions that improve clarity, add comments, or fix bugs are welcome. Please maintain the focus on readability and educational value.
This is a simplified implementation for educational purposes and should not be used for production applications. The model architecture and training procedures are significantly simplified compared to the original GPT-2.