[📜 Paper] • [🤗 HF Models] • [🐱 GitHub]
[🐦 Twitter] • [💬 Reddit] • [📖 Unofficial Blog]
Repo for "WaveCoder: Widespread And Versatile Enhanced Instruction Tuning with Refined Data Generation" [ACL 2024 Main]
Figure 1: The WaveCoder model pipeline.
- [2024/05/16] The WaveCoder paper is accepted to the ACL 2024 main conference.
- [2024/04/10] 🔥🔥🔥 WaveCoder repo and models released at 🤗 HuggingFace!
- [2023/12/26] WaveCoder paper released.
WaveCoder 🌊 is a series of large language models (LLMs) for the coding domain, trained to solve code-related problems through instruction following. Its training data was generated from a subset of CodeSearchNet using our proposed LLM-based generator-discriminator framework, and covers four general code-related tasks: code generation, code summarization, code translation, and code repair.
| Model | HumanEval | MBPP (500) | HumanEvalFix (Avg.) | HumanEvalExplain (Avg.) |
|---|---|---|---|---|
| GPT-4 | 85.4 | - | 47.8 | 52.1 |
| 🌊 WaveCoder-DS-6.7B | 65.8 | 63.0 | 49.5 | 40.8 |
| 🌊 WaveCoder-Pro-6.7B | 74.4 | 63.4 | 52.1 | 43.0 |
| 🌊 WaveCoder-Ultra-6.7B | 79.9 | 64.6 | 52.3 | 45.7 |
Figure 2: Main framework of the LLM-based Generator-Discriminator.
Figure 3: An Example of Our Data Generation.
We combine our dataset with the decontaminated evol-codealpaca-v1 dataset (WaveCoder-evol-instruct) to train WaveCoder-Ultra-6.7B.
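Figure 3 illustrates what the pipeline produces for a single raw code snippet. For intuition only, one generated training record might look like the sketch below; the field names and content here are hypothetical stand-ins, not the exact schema of the released data:

```python
import json

# Hypothetical layout of one generated instruction-tuning record; the
# released dataset may use different field names.
sample = {
    "task": "code repair",  # one of the four task types
    "instruction": "Fix the bug in the function below so that it "
                   "returns the sum of the list.",
    "input": "def total(xs):\n    s = 0\n    for x in xs:\n"
             "        s -= x\n    return s",
    "output": "def total(xs):\n    s = 0\n    for x in xs:\n"
              "        s += x\n    return s",
}
print(json.dumps(sample, indent=2))
```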
We recommend using Conda to manage your environment. Run the following commands to set it up:

```bash
conda create -n wavecoder python=3.9
conda activate wavecoder
cd src
pip install -r requirements.txt
pip install transformers==4.34.1
pip install flash-attn==2.5.5
```
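After installing, you can sanity-check the setup by loading a released checkpoint. This is a minimal sketch assuming the standard Hugging Face transformers causal-LM interface; the prompt format is illustrative, so see the model card for the template the model was trained with:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Smoke test: load a released WaveCoder checkpoint and greedily decode.
model_id = "microsoft/wavecoder-ultra-6.7b"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id, torch_dtype=torch.bfloat16, device_map="auto"
)

prompt = "Write a Python function that checks whether a number is prime."
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=256, do_sample=False)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```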
We also open-source our complete training scripts for the community, and you can construct your own dataset for training. Our training scripts are based on FastChat.
To train a model, run the following command:

```bash
cd src
bash script/train.sh
```
- For the HumanEval benchmark, we use the codebase from EvalPlus. We recommend using the codebase from Magicoder together with the following command to reproduce WaveCoder's HumanEval results.
```bash
# Model under evaluation; MODEL_KEY identifies the base model for the
# evaluation harness.
MODEL_KEY=deepseek-ai/deepseek-coder-6.7b-base
MODEL=microsoft/wavecoder-ultra-6.7b
DATASET=humaneval
# Output paths for the raw and sanitized generations.
SAVE_PATH=evalplus-$(basename $MODEL)-$DATASET.jsonl
SANITIZED_PATH=humaneval_result/evalplus-$(basename $MODEL)-$DATASET-sanitized.jsonl

# Greedy decoding (temperature 0.0, one sample per problem) for pass@1.
python -m experiments.text2code \
    --model_key $MODEL_KEY \
    --model_name_or_path $MODEL \
    --save_path $SAVE_PATH \
    --dataset $DATASET \
    --temperature 0.0 \
    --top_p 1.0 \
    --max_new_tokens 512 \
    --n_problems_per_batch 28 \
    --n_samples_per_problem 1 \
    --n_batches 1

echo "$MODEL"
evalplus.evaluate --dataset $DATASET --samples $SAVE_PATH
```
- For MBPP (500), generate completions by running:

```bash
cd src
bash script/generate.sh
```

  and then compute the pass@k score and the error-type analysis (the pass@k estimator is sketched after this list) by running:

```bash
bash script/evaluate.sh
```
- For the HumanEvalFix and HumanEvalExplain benchmarks, we use the codebase from bigcode-evaluation-harness.
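For reference, pass@k is conventionally computed with the unbiased estimator from the HumanEval paper (Chen et al., 2021). `script/evaluate.sh` reports this score for you; a minimal sketch of the estimator itself is:

```python
import numpy as np

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimator (Chen et al., 2021).

    n: total completions sampled per problem
    c: number of those completions that pass the tests
    k: evaluation budget
    """
    if n - c < k:
        return 1.0  # every size-k subset contains a correct sample
    # 1 - C(n-c, k) / C(n, k), computed stably as a running product
    return 1.0 - float(np.prod(1.0 - k / np.arange(n - c + 1, n + 1)))

# Example: 200 samples per problem, 37 correct
print([round(pass_at_k(200, 37, k), 4) for k in (1, 10, 100)])
```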
First, prepare your raw code data and save it as a .jsonl file. Then run:

```bash
cd src
bash script/coreset.sh
```

to get the coreset of your raw data. Once you have the coreset, run:

```bash
cd src
bash script/data_generate.sh
```

to launch the LLM-based generator-discriminator framework (sketched below). You can customize your data by adjusting the prompts and configurations in the above .sh scripts.
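To make the idea concrete, here is a heavily simplified, illustrative sketch of one generator-discriminator pass. Everything in it is a stand-in: `llm` represents any prompt-to-completion function, and the prompts and record layout approximate the configurable ones in `script/data_generate.sh` rather than reproducing them:

```python
from typing import Callable, Iterable

def generator_discriminator(
    raw_code: Iterable[str],
    llm: Callable[[str], str],  # hypothetical: maps a prompt to a completion
    task: str = "code summarization",
) -> list:
    """Illustrative pass: the generator LLM drafts an (instruction, answer)
    pair from raw code, and the discriminator LLM keeps only the pairs it
    judges faithful to the code."""
    kept = []
    for snippet in raw_code:
        # Generation step: draft one instruction-answer pair for the task.
        draft = llm(
            f"Task: {task}\n"
            "From the code below, write one instruction and its correct "
            "answer, separated by '###'.\n"
            f"Code:\n{snippet}"
        )
        # Discrimination step: an LLM judge filters low-quality pairs.
        verdict = llm(
            "Does the following instruction-answer pair correctly reflect "
            "the code? Reply YES or NO.\n"
            f"Code:\n{snippet}\nPair:\n{draft}"
        )
        if verdict.strip().upper().startswith("YES"):
            instruction, _, answer = draft.partition("###")
            kept.append({"task": task,
                         "instruction": instruction.strip(),
                         "answer": answer.strip()})
    return kept
```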
This code repository is licensed under the MIT License. The use of the DeepSeek Coder models is subject to their license.
If you find this repository helpful, please consider citing our paper:
```bibtex
@article{yu2023wavecoder,
  title={WaveCoder: Widespread and versatile enhanced instruction tuning with refined data generation},
  author={Yu, Zhaojian and Zhang, Xin and Shang, Ning and Huang, Yangyu and Xu, Can and Zhao, Yishujie and Hu, Wenxiang and Yin, Qiufeng},
  journal={arXiv preprint arXiv:2312.14187},
  year={2023}
}
```
This project welcomes contributions and suggestions. Most contributions require you to agree to a Contributor License Agreement (CLA) declaring that you have the right to, and actually do, grant us the rights to use your contribution. For details, visit https://cla.opensource.microsoft.com.
Resources:
- Microsoft Open Source Code of Conduct
- Microsoft Code of Conduct FAQ
- Contact [email protected] with questions or concerns