This code is built on HuggingFace Transformers, Datasets, and Models, and allows you to do any combination of:
- training/finetuning any causal LM from the HF Model Hub on the HackerCup dataset
- evaluating on the same
- sampling n Python code generations from the model checkpoint
- running all generations on the sample input/output pairs, and choosing the highest-scoring one to run on the full input/output pairs
WARNING: This code will run code generated by an LLM on your machine without checking it for safety. DO THIS AT YOUR OWN RISK. By default, this script will terminate after program generation, to give you time to read the LLM generations. We recommend that you manually check all LLM-generated code before running it. Options to disable this behavior are described below.
Install requirements with
pip install -r requirements.txt
An environment manager (e.g. conda) is recommended. Depending on your exact CUDA version and other factors, you may need to install some requirements manually.
python train_and_eval.py --output_dir ~/temp --num_gens 10 --max_code_gen_examples -1 ## Does not run generated code
## We highly recommend you read the generated code here before proceeding.
python train_and_eval.py --output_dir ~/temp --only_run_generated_code
`--output_dir` is required, but it is only used when finetuning.
This command generates code but does not run it immediately. You will need to run the second command above to run and evaluate the generated code on the test set inputs. You can disable this behavior and trigger the entire process with a single command as follows, but again, we recommend against this unless you know what you're doing:
python train_and_eval.py --output_dir ~/temp --num_gens 10 --max_code_gen_examples -1 --run_generated_code
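For orientation, the "choose the highest-scoring generation" step works roughly as follows: each candidate program is run on the sample input, its output is scored against the sample output, and only the best candidate is run on the full input. The sketch below illustrates that pattern; the helper names and the scoring rule are hypothetical and do not mirror `train_and_eval.py`, and, per the warning above, it executes generated code, so only use it on code you have inspected.

```python
# Hypothetical sketch of selecting the best of n generated programs.
# Function names and the scoring rule are illustrative, not the kit's actual code.
import subprocess

def run_candidate(code_path: str, input_text: str, timeout_s: int = 30) -> str:
    """Run a generated Python program on the given input and capture stdout."""
    result = subprocess.run(
        ["python", code_path],
        input=input_text,
        capture_output=True,
        text=True,
        timeout=timeout_s,
    )
    return result.stdout

def score_output(predicted: str, expected: str) -> float:
    """Fraction of expected output lines reproduced exactly."""
    pred = predicted.strip().splitlines()
    gold = expected.strip().splitlines()
    if not gold:
        return 0.0
    return sum(p.strip() == g.strip() for p, g in zip(pred, gold)) / len(gold)

def select_best(candidate_paths, sample_input, sample_output):
    """Score every candidate on the sample I/O pair and return the best path."""
    best_score, best_path = -1.0, None
    for path in candidate_paths:
        try:
            score = score_output(run_candidate(path, sample_input), sample_output)
        except (subprocess.TimeoutExpired, OSError):
            score = 0.0
        if score > best_score:
            best_score, best_path = score, path
    return best_path
```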
We currently filter to only include HackerCup problems for which each test case occupies exactly one input line and one output line, as we are only providing a starting point. This significantly reduces the number of available examples, from 284 to 17. However, models can handle input and output parsing and formatting themselves; this starter kit does the parsing programmatically only because it is meant to serve as a starting point.
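To make the criterion concrete, a filter along these lines could be used. The field names (`sample_input`, `sample_output`), the dataset id, and the exact predicate are assumptions about the data schema, not the kit's actual implementation:

```python
# Hypothetical filter: keep problems whose sample tests use exactly one input
# line and one output line per test case. Field names and dataset id are assumed.
from datasets import load_dataset

def has_one_line_per_case(example) -> bool:
    in_lines = example["sample_input"].strip().splitlines()
    out_lines = example["sample_output"].strip().splitlines()
    if not in_lines:
        return False
    try:
        num_cases = int(in_lines[0])  # HackerCup inputs start with the test-case count
    except ValueError:
        return False
    return len(in_lines) == num_cases + 1 and len(out_lines) == num_cases

dataset = load_dataset("hackercupai/hackercup", split="train")  # placeholder id/split
single_line_subset = dataset.filter(has_one_line_per_case)
print(len(dataset), "->", len(single_line_subset))
```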
Command-line arguments controlling some data and model hyperparameters are defined at the top of `train_and_eval.py`. The most relevant include:
- `--text_columns`, the set of columns included in the input to the model.
- `--num_gens`, the number of generations to sample for each problem.
- `--max_code_gen_examples`, the number of HackerCup problems to sample code generations for.
- `--model_name_or_path`, the HuggingFace model to use.
Beware that the problem statements in the HackerCup dataset are quite long. For many models using fixed-max-length positional embeddings (this excludes, e.g., RoPE), including the problem statement as-is will significantly exceed the context length.
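If you are unsure whether a statement fits, a quick check against the model's advertised context size looks like this (the checkpoint name and statement text are placeholders):

```python
# Sketch: compare a tokenized problem statement against the model's context size.
from transformers import AutoConfig, AutoTokenizer

model_name = "gpt2"  # placeholder; GPT-2 uses fixed 1024-token positional embeddings
tokenizer = AutoTokenizer.from_pretrained(model_name)
config = AutoConfig.from_pretrained(model_name)

problem_statement = "..."  # paste the full problem statement here
num_tokens = len(tokenizer(problem_statement)["input_ids"])
max_len = getattr(config, "max_position_embeddings", tokenizer.model_max_length)

print(f"{num_tokens} tokens vs. a context window of {max_len}")
if num_tokens > max_len:
    print("Statement will not fit; truncate it or use a longer-context model.")
```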
python train_and_eval.py --do_train --do_eval --output_dir <OUTPUT_DIR> --num_gens 10
This command trains on the HackerCup dataset using a causal language modeling objective, by which we mean that the model is trained to maximize the probability of the next token, and that the loss is applied to all tokens. See the next section if you want a loss only on the code solution.
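In HuggingFace terms, "loss on all tokens" is the standard causal-LM setup in which the labels are a copy of the input ids. Below is a minimal, self-contained sketch of that setup; the checkpoint and the toy example are placeholders, and this is not a verbatim excerpt of `train_and_eval.py`.

```python
# Minimal causal-LM fine-tuning sketch: the collator copies input_ids into
# labels, so every token (prompt and solution alike) contributes to the loss.
from datasets import Dataset
from transformers import (
    AutoModelForCausalLM,
    AutoTokenizer,
    DataCollatorForLanguageModeling,
    Trainer,
    TrainingArguments,
)

model_name = "gpt2"  # placeholder checkpoint
tokenizer = AutoTokenizer.from_pretrained(model_name)
tokenizer.pad_token = tokenizer.eos_token
model = AutoModelForCausalLM.from_pretrained(model_name)

# Toy training example in a prompt-plus-solution format with the "### Code:" marker.
raw = Dataset.from_dict({"text": ["### Problem: echo the input\n### Code:\nprint(input())"]})
tokenized = raw.map(
    lambda batch: tokenizer(batch["text"], truncation=True, max_length=512),
    batched=True,
    remove_columns=["text"],
)

collator = DataCollatorForLanguageModeling(tokenizer, mlm=False)  # next-token objective

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="out", num_train_epochs=1, per_device_train_batch_size=1),
    train_dataset=tokenized,
    data_collator=collator,
)
trainer.train()
```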
`--do_eval` also triggers evaluation on the training dataset, as a causal LM task. The HackerCup dataset does not come with a precomputed train/val/test split, so one is created in the code. This may be non-deterministic.
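If you need the split to be reproducible, one option is to create it yourself with a fixed seed before training; a sketch (the dataset id and split name are placeholders):

```python
# Sketch: build a deterministic train/val split by pinning a seed.
from datasets import load_dataset

dataset = load_dataset("hackercupai/hackercup", split="train")  # placeholder id/split
splits = dataset.train_test_split(test_size=0.2, seed=42)
train_ds, val_ds = splits["train"], splits["test"]
```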
Additionally, because we are using the same HackerCup dataset for both training and the final code generation eval, we are effectively training on the test set in this demo, albeit indirectly, since the metrics are not computed directly on the code generations. To adapt this code to tune on other tasks instead, see below.
Command-line arguments controlling some data and model hyperparameters are defined at the top of `train_and_eval.py`. We also use the HuggingFace `transformers.TrainingArguments` hyperparameters. A list of options, including train epochs, batch size, learning rate, and precision, can be found in the transformers documentation.
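For example, a typical set of overrides looks like this; the values are placeholders, not tuned recommendations:

```python
# Commonly adjusted TrainingArguments; all values are placeholders.
from transformers import TrainingArguments

training_args = TrainingArguments(
    output_dir="out",
    num_train_epochs=3,
    per_device_train_batch_size=2,
    gradient_accumulation_steps=8,
    learning_rate=2e-5,
    bf16=True,            # or fp16=True, depending on your hardware
    logging_steps=10,
    save_strategy="epoch",
)
```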
It is possible to apply the LM loss only to tokens in the code solutions by using a `trl.SFTTrainer`. This behavior is triggered with `--train_obj seq2seq`. It is currently implemented in `train_hf_model()` and depends on the presence of the string `### Code:`, which we manually add to each example (see `format_example()`). If you make changes to the code, e.g. to allow for code generation in other languages as described below, you may break this behavior.
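The general pattern for restricting the loss to the completion in `trl` pairs `SFTTrainer` with a completion-only collator keyed on the `### Code:` marker. The sketch below shows that pattern only; it is not the implementation in `train_hf_model()`, and the exact `SFTTrainer` signature varies across `trl` versions.

```python
# Sketch: mask everything before "### Code:" out of the loss with trl's
# completion-only collator. Treat this as the general pattern; the kit's own
# implementation and your trl version's API may differ.
from datasets import Dataset
from transformers import AutoModelForCausalLM, AutoTokenizer, TrainingArguments
from trl import DataCollatorForCompletionOnlyLM, SFTTrainer

model_name = "gpt2"  # placeholder checkpoint
tokenizer = AutoTokenizer.from_pretrained(model_name)
tokenizer.pad_token = tokenizer.eos_token
model = AutoModelForCausalLM.from_pretrained(model_name)

train_ds = Dataset.from_dict(
    {"text": ["### Problem: echo the input\n### Code:\nprint(input())"]}
)

# Tokens up to and including the response template receive label -100,
# so only the code solution contributes to the loss.
collator = DataCollatorForCompletionOnlyLM("### Code:", tokenizer=tokenizer)

trainer = SFTTrainer(
    model=model,
    train_dataset=train_ds,
    data_collator=collator,
    args=TrainingArguments(output_dir="out", num_train_epochs=1),
    # Some trl versions also need dataset_text_field="text" and/or tokenizer=tokenizer here.
)
trainer.train()
```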
The example code solutions for the HackerCup dataset are primarily written in C++, but we opt to generate Python code in this kit. To generate code in other languages, changes to `format_datasets()` are required, including adding a "C" or "C++" case to the if-then statements that check the `language` argument. The `template` string also needs to be rewritten for the new language, though your model's generations should be capable of handling input and output parsing/formatting.
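To give a sense of the kind of change involved, a language-aware template might look like the following. The structure of `format_datasets()` and the exact template wording here are assumptions, not the kit's actual code:

```python
# Hypothetical language-aware prompt template; the real format_datasets()
# may be organized differently.
def build_template(language: str) -> str:
    if language == "Python":
        return (
            "Write a Python 3 program that reads from standard input and "
            "writes to standard output.\n### Code:\n"
        )
    elif language in ("C", "C++"):
        return (
            f"Write a {language} program that reads from stdin and writes "
            "to stdout.\n### Code:\n"
        )
    raise ValueError(f"Unsupported language: {language}")
```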
The dataset to train on is determined by `--dataset_name`. Modifications need to be made to `format_datasets()`, which is currently written specifically to ingest examples in the HackerCup format. For some datasets, this can simply return the `text` field of an example. Some command-line arguments, such as `--text_columns`, need to be adjusted for the columns in your dataset.
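For example, for a dataset whose examples already contain a single text column, the formatting step can be close to a no-op. The column name and dataset id below are placeholders:

```python
# Hypothetical formatter for a dataset that already stores complete training
# strings in a "text" column; no HackerCup-specific parsing is needed.
from datasets import load_dataset

def format_for_custom_dataset(example):
    return {"text": example["text"]}

dataset = load_dataset("your_dataset_name", split="train")  # placeholder id
formatted = dataset.map(format_for_custom_dataset)
```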