Finetuning an LLM on STEM datasets.
- Compile and preprocess multiple datasets
- Finetune SOLAR-10.7B-Instruct-v1.0 on this data using QLoRA (a sketch of the setup follows this list)
- Release on Huggingface so that anybody can use it!
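Below is a minimal sketch of the QLoRA setup using `transformers`, `bitsandbytes`, and `peft`: the base model is loaded in 4-bit NF4 and small LoRA adapters are trained on top. The rank, alpha, dropout, and target modules here are illustrative assumptions, not necessarily the exact hyperparameters I used.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training

# 4-bit NF4 quantization of the frozen base model (the "Q" in QLoRA)
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
    bnb_4bit_use_double_quant=True,
)

tokenizer = AutoTokenizer.from_pretrained("upstage/SOLAR-10.7B-Instruct-v1.0")
model = AutoModelForCausalLM.from_pretrained(
    "upstage/SOLAR-10.7B-Instruct-v1.0",
    quantization_config=bnb_config,
    device_map="auto",
)
model = prepare_model_for_kbit_training(model)

# Trainable low-rank adapters on the attention projections;
# r, alpha, and target_modules are illustrative, not the exact run config
lora_config = LoraConfig(
    r=16,
    lora_alpha=32,
    lora_dropout=0.05,
    bias="none",
    task_type="CAUSAL_LM",
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()  # only the adapters are trainable
```

From here the model trains like any causal LM (for example with `trl`'s `SFTTrainer`); afterwards the adapters can be merged into the base weights with `merge_and_unload()` and published with `push_to_hub`.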
I train the model on roughly 100k samples. Combined, the source datasets contain about 1 million samples, but I subsample down to ~100k to keep training costs manageable (a sketch of the mixing step follows the list). The samples are drawn from the following datasets:
- MetaMath
- Camel AI Math
- ArXiv Math
- Camel AI Chemistry
- Camel AI Physics
- Camel AI Biology
- ArXiv Physics
- GSM8K
- MMLU
- Evol Instruct Code
- GlaiveAI Code Assistant v2
- ArXiv Computer Science and ML
- ScienceQA
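Roughly how the mixing and subsampling works, sketched with the Hugging Face `datasets` library. Only three of the sources are shown, and the Hub IDs, column names, prompt template, and even per-source split are assumptions; in practice each dataset needs its own column mapping.

```python
from datasets import load_dataset, concatenate_datasets

# Hub IDs and column names below are illustrative assumptions;
# each of the thirteen sources needs its own column mapping.
SOURCES = {
    "meta-math/MetaMathQA": ("query", "response"),
    "camel-ai/physics": ("message_1", "message_2"),
    "gsm8k": ("question", "answer"),
}

per_source = 100_000 // 13  # an even split across sources is an assumption

parts = []
for repo, (q_col, a_col) in SOURCES.items():
    config = "main" if repo == "gsm8k" else None
    ds = load_dataset(repo, config, split="train")
    ds = ds.shuffle(seed=42).select(range(min(per_source, len(ds))))
    # Normalize every source onto a single text field (template is illustrative)
    ds = ds.map(
        lambda ex, q=q_col, a=a_col: {
            "text": f"### Instruction:\n{ex[q]}\n\n### Response:\n{ex[a]}"
        },
        remove_columns=ds.column_names,
    )
    parts.append(ds)

mixed = concatenate_datasets(parts).shuffle(seed=42)
mixed.to_json("stem_mix.jsonl")  # ready for the QLoRA trainer above
```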
The dataset can be found here. During training, I used the `100k-text` subset.
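If you want to pull that same subset yourself, it is a named configuration of the dataset; the repo id below is a placeholder for the dataset linked above.

```python
from datasets import load_dataset

# Placeholder repo id; substitute the dataset linked above
train_ds = load_dataset("<dataset-repo>", "100k-text", split="train")
```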
This LLM is named after Codegebra, a program I made that solves equations, performs Fourier transforms, and more. This model is intended as Codegebra's successor, with a more natural interface and expanded abilities.