Finetuning an LLM on STEM datasets.
- Compile and preprocess multiple datasets
- Finetune SOLAR-10.7B-Instruct-v1.0 on this data using QLoRA (a sketch of the setup follows this list)
- Release on Huggingface so that anybody can use it!
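Below is a minimal sketch of the QLoRA setup using `transformers`, `bitsandbytes`, and `peft`: the base model is loaded in 4-bit NF4 and small LoRA adapters are trained on top. The rank, alpha, dropout, and target modules here are illustrative assumptions, not necessarily the exact hyperparameters I used.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training

# 4-bit NF4 quantization of the frozen base model (the "Q" in QLoRA)
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
    bnb_4bit_use_double_quant=True,
)

tokenizer = AutoTokenizer.from_pretrained("upstage/SOLAR-10.7B-Instruct-v1.0")
model = AutoModelForCausalLM.from_pretrained(
    "upstage/SOLAR-10.7B-Instruct-v1.0",
    quantization_config=bnb_config,
    device_map="auto",
)
model = prepare_model_for_kbit_training(model)

# Trainable low-rank adapters on the attention projections;
# r, alpha, and target_modules are illustrative, not the exact run config
lora_config = LoraConfig(
    r=16,
    lora_alpha=32,
    lora_dropout=0.05,
    bias="none",
    task_type="CAUSAL_LM",
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()  # only the adapters are trainable
```

From here the model trains like any causal LM (for example with `trl`'s `SFTTrainer`); afterwards the adapters can be merged into the base weights with `merge_and_unload()` and published with `push_to_hub`.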
I train the model on roughly 100k samples. Combined, the source datasets contain about 1 million samples, but I subsample down to ~100k to keep training costs manageable (a sketch of the mixing step follows the list). The samples are drawn from the following datasets:
- MetaMath
- Camel AI Math
- ArXiv Math
- Camel AI Chemistry
- Camel AI Physics
- Camel AI Biology
- ArXiv Physics
- GSM8K
- MMLU
- Evol Instruct Code
- GlaiveAI Code Assistant v2
- ArXiv Computer Science and ML
- ScienceQA
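Roughly how the mixing and subsampling works, sketched with the Hugging Face `datasets` library. Only three of the sources are shown, and the Hub IDs, column names, prompt template, and even per-source split are assumptions; in practice each dataset needs its own column mapping.

```python
from datasets import load_dataset, concatenate_datasets

# Hub IDs and column names below are illustrative assumptions;
# each of the thirteen sources needs its own column mapping.
SOURCES = {
    "meta-math/MetaMathQA": ("query", "response"),
    "camel-ai/physics": ("message_1", "message_2"),
    "gsm8k": ("question", "answer"),
}

per_source = 100_000 // 13  # an even split across sources is an assumption

parts = []
for repo, (q_col, a_col) in SOURCES.items():
    config = "main" if repo == "gsm8k" else None
    ds = load_dataset(repo, config, split="train")
    ds = ds.shuffle(seed=42).select(range(min(per_source, len(ds))))
    # Normalize every source onto a single text field (template is illustrative)
    ds = ds.map(
        lambda ex, q=q_col, a=a_col: {
            "text": f"### Instruction:\n{ex[q]}\n\n### Response:\n{ex[a]}"
        },
        remove_columns=ds.column_names,
    )
    parts.append(ds)

mixed = concatenate_datasets(parts).shuffle(seed=42)
mixed.to_json("stem_mix.jsonl")  # ready for the QLoRA trainer above
```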
The dataset can be found here. During training, I used the `100k-text` subset.
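If you want to pull that same subset yourself, it is a named configuration of the dataset; the repo id below is a placeholder for the dataset linked above.

```python
from datasets import load_dataset

# Placeholder repo id; substitute the dataset linked above
train_ds = load_dataset("<dataset-repo>", "100k-text", split="train")
```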
This LLM is named after Codegebra, a program I made that solves equations, performs Fourier transforms, and more. This model is intended as Codegebra's successor, with a more natural interface and expanded abilities.