Want to fine-tune llama3.2-1b on MMLU, ARC-Challenge, and GSM8K (math) #2132
Hey @sorobedio - happy to help in this journey! Seems like you've got a good idea on what you'd like to accomplish! To get started, I'd recommend the following workflow:
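The recommended workflow steps themselves did not survive in this copy of the comment. A reasonable reconstruction of torchtune's standard setup is sketched below; the package install line and the `/tmp` output directory are assumptions (the output dir matches the tokenizer path used in the eval config later in this reply), and downloading Llama weights requires accepting the license and a Hugging Face access token:

```shell
# 1. Install torchtune (assumes a working PyTorch install)
pip install torchtune

# 2. Download the Llama 3.2 1B Instruct weights from the Hugging Face Hub
tune download meta-llama/Llama-3.2-1B-Instruct \
  --output-dir /tmp/Llama-3.2-1B-Instruct \
  --ignore-patterns "original/consolidated.00.pth"

# 3. Browse the built-in recipes and configs to pick a finetuning setup
tune ls
```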
Now that you've got the initial setup, let's talk about what you're actually trying to accomplish. Are you looking to just see the best overall model you can make on your own? Are you wanting to deploy this model for a specific use-case? Are you just curious how the finetuning process works for local models? Your answers to these questions impact the direction you might want to go with finetuning and the data you'll want to use.

You mentioned wanting to start with MMLU, ARC-Challenge, and GSM8K. These are interesting ones b/c they're often used more for evaluation than training. As such, when you look on the Hugging Face Datasets Hub for MMLU, you'll see that the splits provided are for test or validation, not train. You're still welcome to train on the test set, but when you go to evaluate the model, it'll unsurprisingly do very well :)

I might suggest taking a look at the subsections of the MMLU benchmark, which include abstract algebra, astronomy, chemistry, etc., and training a model on some interesting data in each section. For example, I found this dataset (camel-ai/chemistry) with lots of questions about chemistry. Let's see how we can train a model on that. It takes on a very simple Question/Answer style data structure, so in your config you can specify the dataset like the following:
```yaml
dataset:
  _component_: torchtune.datasets.instruct_dataset
  source: camel-ai/chemistry
  split: train
  column_map:
    input: message_1
    output: message_2
```
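To make the `column_map` behavior concrete, here is a small standalone sketch of the renaming it performs; the sample row below is hypothetical, just shaped like a camel-ai/chemistry record, and the dict comprehension stands in for what torchtune does internally:

```python
# column_map renames dataset columns to the "input"/"output" keys
# that the instruct template expects.
row = {
    "message_1": "What is the molar mass of water?",  # hypothetical sample
    "message_2": "Approximately 18.02 g/mol.",
}
column_map = {"input": "message_1", "output": "message_2"}

# Map each expected key to the value stored under the original column name.
mapped = {new_key: row[old_key] for new_key, old_key in column_map.items()}
print(mapped["input"])   # the question text
print(mapped["output"])  # the answer text
```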
This uses a torchtune `instruct_dataset` that pulls the data directly from the Hugging Face Datasets Hub and maps the input/output columns to the correct ones. See our docs for more information. Then, training is as easy as:
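The actual command did not survive in this copy of the comment; a typical torchtune invocation for full finetuning of the 1B model on a single device would look like the following. The built-in config name is an assumption based on torchtune's packaged recipes, and the dataset overrides mirror the YAML above:

```shell
# Full finetune of Llama 3.2 1B on a single device, overriding the packaged
# config's dataset with the camel-ai/chemistry instruct dataset from above.
tune run full_finetune_single_device --config llama3_2/1B_full_single_device \
  dataset._component_=torchtune.datasets.instruct_dataset \
  dataset.source=camel-ai/chemistry \
  dataset.split=train \
  dataset.column_map.input=message_1 \
  dataset.column_map.output=message_2
```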
Now that you've done some training, you also want to evaluate the model! You can point our example Eleuther evaluation config at your finetuned checkpoint:

```yaml
# Model Arguments
model:
  _component_: torchtune.models.llama3_2.llama3_2_1b

# Tokenizer
tokenizer:
  _component_: torchtune.models.llama3.llama3_tokenizer
  path: /tmp/Llama-3.2-1B-Instruct/original/tokenizer.model
  max_seq_len: null

# Load in the trained weights
checkpointer:
  _component_: torchtune.training.FullModelHFCheckpointer
  checkpoint_dir: PATH/TO/YOUR/FINETUNED/MODEL
  checkpoint_files: [model.safetensors]
  output_dir: ./
  model_type: LLAMA3_2

# Environment
device: cuda
dtype: bf16
seed: 1234  # It is not recommended to change this seed, b/c it matches EleutherAI's default seed
log_level: INFO

# EleutherAI specific eval args
tasks: ["mmlu_val_science"]  # Defaulting to science as a good subset
limit: null
batch_size: 1
enable_kv_cache: True
max_seq_length: 8192

# Quantization specific args
quantizer: null
```

Then launch it with:
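The launch command itself did not survive in this copy of the comment; with torchtune's CLI, the Eleuther eval recipe is typically invoked as shown below. The config filename is hypothetical, so point it at wherever you saved the YAML above:

```shell
# Run torchtune's EleutherAI evaluation recipe against the custom config.
# "custom_eval_config.yaml" is a placeholder filename for the YAML above.
tune run eleuther_eval --config ./custom_eval_config.yaml
```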
Follow-ups
Hope this helped and feel free to reach out with any more questions!
Hello, everyone,
I’m new to fine-tuning large language models (LLMs), but I have experience with PyTorch. I’m planning to fine-tune the LLaMA 3.2-1B (base and instruction models) on the MMLU, ARC-Challenge, and GSM8K (math) datasets, using full fine-tuning instead of LoRA. After fine-tuning, I aim to evaluate the models.
Could you please guide me on managing these datasets and share any working examples or resources to get started? Any initial push would be greatly appreciated.
Thank you!