A unified, extensible framework for text classification with categorical variables built on PyTorch and PyTorch Lightning.
- Complex input support: Handle text data alongside categorical variables seamlessly.
- Unified yet highly customizable:
  - Use any tokenizer from HuggingFace or the original fastText's ngram tokenizer.
  - Manipulate the components (`TextEmbedder`, `CategoricalVariableNet`, `ClassificationHead`) to easily create custom architectures, including self-attention. All of them are `torch.nn.Module`!
  - The `TextClassificationModel` class combines these components and can be extended for custom behavior.
- Multiclass / multilabel classification support: Handles both multiclass (exactly one label is true) and multilabel (several labels can be true) classification tasks.
- PyTorch Lightning: Automated training with callbacks, early stopping, and logging
- Easy experimentation: Simple API for training, evaluating, and predicting with minimal code:
  - The `torchTextClassifiers` wrapper class orchestrates the tokenizer and the model for you.
- Additional features: explainability using Captum
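
To make the component idea above concrete, here is a minimal plain-PyTorch sketch of how a text embedder, a categorical-variable network, and a classification head can be composed. It is not the library's implementation: the `Toy*` classes, their constructor arguments, and the tensor shapes are all hypothetical stand-ins chosen for illustration. The last lines show the usual loss choice for each task type, assuming the model returns raw logits.

```python
import torch
from torch import nn


class ToyTextEmbedder(nn.Module):
    """Hypothetical stand-in: mean of token embeddings, ignoring padding."""

    def __init__(self, vocab_size, embed_dim, padding_idx=0):
        super().__init__()
        self.padding_idx = padding_idx
        self.embedding = nn.Embedding(vocab_size, embed_dim, padding_idx=padding_idx)

    def forward(self, token_ids):
        # token_ids: (batch, seq_len) integer ids
        mask = (token_ids != self.padding_idx).unsqueeze(-1).float()
        summed = (self.embedding(token_ids) * mask).sum(dim=1)
        return summed / mask.sum(dim=1).clamp(min=1.0)  # (batch, embed_dim)


class ToyCategoricalNet(nn.Module):
    """Hypothetical stand-in: one embedding table per categorical variable."""

    def __init__(self, cardinalities, embed_dim):
        super().__init__()
        self.tables = nn.ModuleList([nn.Embedding(n, embed_dim) for n in cardinalities])
        self.output_dim = embed_dim * len(cardinalities)

    def forward(self, categories):
        # categories: (batch, n_variables) integer codes
        parts = [table(categories[:, i]) for i, table in enumerate(self.tables)]
        return torch.cat(parts, dim=-1)  # (batch, output_dim)


class ToyClassifier(nn.Module):
    """Hypothetical stand-in: concatenates both feature blocks, then a linear head."""

    def __init__(self, text_embedder, categorical_net, text_dim, num_classes):
        super().__init__()
        self.text_embedder = text_embedder
        self.categorical_net = categorical_net
        self.head = nn.Linear(text_dim + categorical_net.output_dim, num_classes)

    def forward(self, token_ids, categories):
        features = torch.cat(
            [self.text_embedder(token_ids), self.categorical_net(categories)], dim=-1
        )
        return self.head(features)  # raw logits, shape (batch, num_classes)


torch.manual_seed(0)
model = ToyClassifier(
    ToyTextEmbedder(vocab_size=100, embed_dim=16),
    ToyCategoricalNet(cardinalities=[5, 3], embed_dim=4),
    text_dim=16,
    num_classes=4,
)

token_ids = torch.tensor([[5, 8, 2, 0], [7, 1, 0, 0]])  # 0 is the padding id
categories = torch.tensor([[1, 2], [4, 0]])
logits = model(token_ids, categories)

# Multiclass: one integer class index per sample.
multiclass_loss = nn.CrossEntropyLoss()(logits, torch.tensor([3, 0]))

# Multilabel: one multi-hot float vector per sample.
multilabel_targets = torch.tensor([[1.0, 0.0, 1.0, 0.0], [0.0, 0.0, 0.0, 1.0]])
multilabel_loss = nn.BCEWithLogitsLoss()(logits, multilabel_targets)
```

Because every piece is an `nn.Module`, any of them can be swapped (for example, replacing the mean pooling with a self-attention block) without touching the others.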
```bash
# Clone the repository
git clone https://github.com/InseeFrLab/torchTextClassifiers.git
cd torchTextClassifiers

# Install with uv (recommended)
uv sync

# Or install with pip
pip install -e .
```

Full documentation is available at: https://inseefrlab.github.io/torchTextClassifiers/

The documentation includes:
- Getting Started: Installation and quick start guide
- Architecture: Understanding the 3-layer design
- Tutorials: Step-by-step guides for different use cases
- API Reference: Complete API documentation
Check out the notebook for a quick start.
See the `examples/` directory for:
- Basic text classification
- Multi-class classification
- Mixed features (text + categorical)
- Advanced training configurations
- Prediction and explainability
This project is licensed under the MIT License - see the LICENSE file for details.