FoNE: Precise Single-Token Number Embeddings via Fourier Features
Efficient and accurate numerical representation for Large Language Models (LLMs).
Traditional LLMs tokenize numbers inefficiently, leading to:
- Multiple tokens per number (e.g., "12345.6789" → 5 tokens in GPT-4, 10 in LLaMA2; see the quick check below).
- Loss of precision, impacting arithmetic and numerical reasoning.
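As a quick, hedged check of the token counts cited above, the snippet below uses the third-party `tiktoken` library (not part of this project) to see how a BPE tokenizer splits a number; the exact split depends on the encoding used.

```python
# Count how many tokens a BPE tokenizer uses for a single number.
# Requires: pip install tiktoken
import tiktoken

enc = tiktoken.encoding_for_model("gpt-4")
tokens = enc.encode("12345.6789")

print(len(tokens))                        # number of tokens used for one number
print([enc.decode([t]) for t in tokens])  # the individual token strings
```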
FoNE directly maps each number to its Fourier representation (see the sketch below), yielding:
- ✅ More efficient running time
- ✅ Precise number embeddings
- ✅ Improved data efficiency
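To illustrate the idea, here is a minimal sketch, not the repository's implementation: assuming periods that are powers of 10, each cos/sin pair is a periodic function of roughly one digit position, so a single vector can represent the whole number. The function name and defaults below are illustrative only.

```python
import math
import torch

def fourier_number_embedding(x: float, num_digits: int = 5) -> torch.Tensor:
    """Sketch: represent x with [cos(2*pi*x/T), sin(2*pi*x/T)] at periods T = 10, 100, ...

    A period of 10^k makes the cos/sin pair a function of x mod 10^k, so each period
    captures roughly one digit position; decimals can be handled analogously with
    periods below 1. The actual FoNE method maps such features into the model's
    hidden dimension to form a single-token embedding (see the paper for details).
    """
    feats = []
    for k in range(1, num_digits + 1):
        period = 10.0 ** k
        angle = 2.0 * math.pi * x / period
        feats.append(math.cos(angle))
        feats.append(math.sin(angle))
    return torch.tensor(feats)

# One vector per number, regardless of how many tokens a BPE tokenizer would need.
emb = fourier_number_embedding(12345.0)
print(emb.shape)  # torch.Size([10]) -- 2 features per digit position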
Read the full details on our website.
- ✅ Single-token number embeddings
- ✅ Improves accuracy on arithmetic tasks
- ✅ Reduces training data needs by up to 64×
- ✅ Works for any numeric data, including decimals and large numbers
| Tokenizer | Tokenized Representation | Tokens Used |
|---|---|---|
| GPT-4, LLaMA3.2 (BPE) | 123 45 . 678 9 | 5 |
| LLaMA2 (Digitwise Tokenization) | 1 2 3 4 5 . 6 7 8 9 | 10 |
| FoNE (Ours) | 12345.6789 | 1 ✅ |
FoNE achieves 99%+ accuracy with 64× less data compared to baseline models.
Performance Highlights:
- ✅ 100% accuracy on 6-digit integer addition
- ✅ 98.4% accuracy on 50-digit integer addition
- ✅ Significant gains on subtraction and multiplication tasks
If you find this project useful, please cite our work:
@article{zhou2025fone,
title={FoNE: Precise Single-Token Number Embeddings via Fourier Features},
author={Zhou, Tianyi and Fu, Deqing and Soltanolkotabi, Mahdi and Jia, Robin and Sharan, Vatsal},
journal={arXiv preprint arXiv:2502.09741},
year={2025}
}
If you would like to discuss applying Fourier Number Embedding (FoNE) to quantization, data analysis, time series, or other areas, or to explore adding new features to FoNE, feel free to connect! Email: tzhou029@usc.edu