Merge aquila #324

Merged: 225 commits, merged on Jun 9, 2023
Commits
38dcf52
add llama
jongjyh Mar 16, 2023
4b460cf
fix bugs for llama initialization
ftgreat Mar 17, 2023
4439623
add schedulers
ftgreat Mar 18, 2023
9be5542
change default llama max length to 2048
ftgreat Mar 20, 2023
7aa85a4
add llama into gpm_dev branch
ftgreat Mar 21, 2023
b08765a
Merge pull request #282 from FlagAI-Open/master
ftgreat Mar 21, 2023
ea9a2b8
Merge pull request #284 from FlagAI-Open/master
ftgreat Mar 21, 2023
905b1a0
Revert "update master into gpm_dev"
ftgreat Mar 21, 2023
07e3ace
Merge pull request #285 from FlagAI-Open/revert-284-master
ftgreat Mar 21, 2023
4c2dcf3
updated
ftgreat Mar 21, 2023
2380cba
Merge branch 'gpm_dev' into opt_tokenizer
ftgreat Mar 21, 2023
3f0322a
Merge pull request #286 from Anhforth/opt_tokenizer
ftgreat Mar 21, 2023
908a416
Merge branch 'master' into opt_tokenizer
ftgreat Mar 21, 2023
fc4e482
fixed
ftgreat Mar 21, 2023
5c32934
merged
ftgreat Mar 21, 2023
31f0b6a
Merge pull request #287 from Anhforth/opt_tokenizer
ftgreat Mar 21, 2023
3d712d2
up gpt2_new_100k
ftgreat Mar 21, 2023
041b32a
add training codes
jongjyh Mar 21, 2023
47f0d42
merge gpm_dev
jongjyh Mar 21, 2023
eb05dd9
Merge pull request #288 from ftgreat/gpm_dev
ftgreat Mar 21, 2023
4b8ebdd
empty commit
Mar 21, 2023
5382063
conflicts fixed
Mar 21, 2023
4a96da5
fix llama bugs
920232796 Mar 22, 2023
84ce1fd
update
920232796 Mar 22, 2023
ad6b9f2
update
920232796 Mar 22, 2023
b2adb82
fix bug in utils.py and tokenizer
jongjyh Mar 22, 2023
efcbdf3
Merge branch 'add_llama' of gitee.com:baai-opensp/flagai-internal int…
marscrazy Mar 22, 2023
7af30db
!2 fix llama inference
marscrazy Mar 22, 2023
cdb229e
up llama embeddings
ftgreat Mar 22, 2023
91c40e0
add llama init range config
ftgreat Mar 22, 2023
ddfe03b
fix some problems in llama infer
jongjyh Mar 22, 2023
5a1f42a
fix checkpoint bug
jongjyh Mar 22, 2023
d4282f7
add llama conflicts resolved
Mar 22, 2023
14088ee
add init
ftgreat Mar 22, 2023
dcf02ad
fix bug bool in checkpoint fn
jongjyh Mar 22, 2023
10b1d7a
fix bug in llamamodel
jongjyh Mar 22, 2023
1738477
fix self.total_iter bug
jongjyh Mar 22, 2023
0f391b6
update add_llama conflicts resolved
Mar 23, 2023
6e9216b
add ds_mpu vs bmt 7B model pretrain
ftgreat Mar 23, 2023
f8dc36c
up configs and initializer_range
ftgreat Mar 23, 2023
493cff0
add llama_ds_mpu.sh
ftgreat Mar 23, 2023
7663b02
add make_indexed_dataset.sh tools
ftgreat Mar 23, 2023
a7f94f1
fix llama attentions and up trigger tools
ftgreat Mar 24, 2023
adff9ca
up build_weighted_datasets use new tok cls&sep
ftgreat Mar 24, 2023
f878897
add shuffle_dataset config
ftgreat Mar 25, 2023
0e5bab1
add build_index_mappings check_tool
ftgreat Mar 26, 2023
c576d5b
update build_weighted_datasets
ftgreat Mar 27, 2023
0dc7d10
update llama_model
ftgreat Mar 27, 2023
f118acb
update build_weighted_datasets 400B
ftgreat Mar 27, 2023
c26b9f6
limit weighted_datasets 400B
ftgreat Mar 28, 2023
24a23dd
fix llama config
ftgreat Mar 28, 2023
3b5675a
pick patches from add_llama branch
Mar 28, 2023
55c3ef1
reorg indexed_dataset
ftgreat Mar 28, 2023
18bf065
Merge branch 'gpm_dev2' of github.com:ftgreat/FlagAI into gpm_dev2
ftgreat Mar 28, 2023
e0aaea2
add indexed_dataset example
ftgreat Mar 28, 2023
136787b
add indexed_dataset example
ftgreat Mar 28, 2023
6b6d8b8
local conflicts resolved
ftgreat Mar 29, 2023
f18214e
up indexed_dataset example
ftgreat Mar 29, 2023
1b46628
local
ftgreat Mar 29, 2023
3492f8c
add bmtrain consine10PP
ftgreat Mar 29, 2023
fe163f5
Merge branch 'gpm_dev2' of https://gitee.com/baai-opensp/flagai-inter…
ftgreat Mar 29, 2023
0667a1b
up llama model
ftgreat Mar 30, 2023
675bc13
cp llama into debug gpt3_pretrain
ftgreat Mar 30, 2023
87fda92
Merge branch 'gpm_dev2' of https://gitee.com/baai-opensp/flagai-inter…
ftgreat Mar 30, 2023
2a25469
add train_tokenizer
ftgreat Mar 30, 2023
856bb17
add generate_bminf
ftgreat Mar 30, 2023
281010c
add gpt2_new_80k
ftgreat Mar 30, 2023
67b93ec
Merge branch 'gpm_dev2' of https://gitee.com/baai-opensp/flagai-inter…
ftgreat Mar 30, 2023
c93d22e
add binary dataset
ftgreat Mar 30, 2023
781dd49
add llama local scripts
Mar 30, 2023
6de0b33
local
ftgreat Mar 30, 2023
e05e282
Merge branch 'gpm_dev2' of https://gitee.com/baai-opensp/flagai-inter…
ftgreat Mar 30, 2023
7c4c5eb
add flagai demo
Mar 31, 2023
ea03b9c
Merge branches 'gpm_dev2' and 'gpm_dev2' of gitee.com:baai-opensp/fla…
Mar 31, 2023
f8fcdee
add extra_args for trainer
ftgreat Mar 31, 2023
ab2dd66
fix extra_args for trainer
ftgreat Mar 31, 2023
99a057d
Merge branch 'gpm_dev2' of https://gitee.com/baai-opensp/flagai-inter…
Mar 31, 2023
a573524
up gpt3_pretrain llama train_llama
ftgreat Apr 1, 2023
c20daf1
Merge branch 'gpm_dev2' of https://gitee.com/baai-opensp/flagai-inter…
ftgreat Apr 1, 2023
9bdfa5a
add bmtrain_mgpu trigger_docker
ftgreat Apr 4, 2023
f127819
up bmtrain_mgpu trigger_docker
ftgreat Apr 4, 2023
c269025
up bmtrain_mgpu trigger_docker
ftgreat Apr 4, 2023
d76dabb
add model_name arg
ftgreat Apr 6, 2023
fe42b03
add grad_norm report
ftgreat Apr 7, 2023
41b2f05
add build_weighted_frozen_datasets.py
ftgreat Apr 7, 2023
29b747a
up bmtrain_mgpu.sh
ftgreat Apr 12, 2023
3d94961
Merge branch 'gpm_dev2' of https://gitee.com/baai-opensp/flagai-inter…
ftgreat Apr 12, 2023
ed487b4
add llama-7b-en
ftgreat Apr 13, 2023
91ab9dc
add mp_tools
ftgreat Apr 13, 2023
f9e018e
add llama-30b-en
ftgreat Apr 13, 2023
5fb7346
add avoid sync loading models
ftgreat Apr 14, 2023
40528ec
up *trainer_v1
ftgreat Apr 14, 2023
64ea3e2
update trainer_v1
ftgreat Apr 14, 2023
8a7c8c6
add hostfile.bmt_16n8g
ftgreat Apr 15, 2023
ecf2374
up bmtrain_mgpu.sh
ftgreat Apr 15, 2023
7c43cd2
rich updates
ftgreat Apr 15, 2023
3569e34
llama enable flash_atten
ftgreat Apr 15, 2023
c600157
up bmtrain_mgpu.sh save configs
ftgreat Apr 15, 2023
4834c0d
add mp_tools.py
ftgreat Apr 15, 2023
7a99c24
fix bmtrain optim manager lr scheduler when grad accum
ftgreat Apr 16, 2023
e7950c5
add indexed_dataset tool
ftgreat Apr 16, 2023
2f7d7a7
rich updates & sft
ftgreat Apr 17, 2023
f62b9b1
Merge branch 'gpm_dev2' of https://gitee.com/baai-opensp/flagai-inter…
ftgreat Apr 17, 2023
a3ba228
update check_datasets
ftgreat Apr 17, 2023
3f2038c
Merge branch 'gpm_dev2' of github.com:ftgreat/FlagAI into gpm_dev2
ftgreat Apr 17, 2023
ae8b7fa
update yaml
ftgreat Apr 17, 2023
af93fe4
Merge branch 'gpm_dev2' of https://gitee.com/baai-opensp/flagai-inter…
ftgreat Apr 17, 2023
c49adc8
save ckpt at end of each epoch
ftgreat Apr 18, 2023
cd72c3e
final datasets weights
ftgreat Apr 18, 2023
51778ca
update bmtrain_mgpu.sh
ftgreat Apr 18, 2023
bd343ed
update Aquila-7b-sft-1n8g
ftgreat Apr 18, 2023
e19fd77
update llama predictor token_end_id early stop
ftgreat Apr 18, 2023
2b7616a
add build_weighted_datasets_v2.py
ftgreat Apr 18, 2023
2c0ea1d
add bmt_loss_scale arg
ftgreat Apr 19, 2023
16ea217
add bmt_loss_scale_steps arg
ftgreat Apr 19, 2023
ddd36be
add llama flash_atten_pdrop config
ftgreat Apr 19, 2023
23e6fd0
rich updates
ftgreat Apr 19, 2023
8ebf5b9
gpt2 model config
ftgreat Apr 20, 2023
84e3583
add Aquila-7b-16n8g.yaml
ftgreat Apr 20, 2023
55fda0e
update configs
ftgreat Apr 20, 2023
c86b271
add sft text dataset
ftgreat Apr 21, 2023
6dbe455
add sft data
ftgreat Apr 23, 2023
3d241b2
add Aquila configs
ftgreat Apr 24, 2023
3701686
update bmtrain save & load
ftgreat Apr 26, 2023
49ebdb2
add Aquila-7b-sft
ftgreat Apr 26, 2023
fc2aaae
commit 162
ftgreat Apr 27, 2023
5bea37f
add enable_sft_dataset_text
ftgreat Apr 27, 2023
60ed408
Merge branch 'gpm_dev2' of https://gitee.com/baai-opensp/flagai-inter…
ftgreat Apr 27, 2023
14abc8e
add sft dataset jsonl
ftgreat Apr 27, 2023
c164ff1
add sft dataset jsonl
ftgreat Apr 28, 2023
02c979b
add tokens B num in report
ftgreat May 2, 2023
c324a79
update sft no lower
ftgreat May 4, 2023
150f261
add llama_generate_bminf
ftgreat May 5, 2023
cb74a89
up bmtrain_mgpu.sh
ftgreat May 5, 2023
a2fce43
up train_llama_bmtrain_datasets
ftgreat May 5, 2023
61ab0d5
feat(llama): overlap communication with computation when using bmtrain
May 5, 2023
825eb3a
!12 feat(llama): overlap communication with computation when using bm…
ftgreat May 5, 2023
57016bd
llama support bmt_comm_overlap
ftgreat May 5, 2023
12d7186
add bmt_fused_ce
ftgreat May 6, 2023
713b6c3
add bmt_fused_ce inplace
ftgreat May 6, 2023
11714ee
support conversations_dataset
ftgreat May 7, 2023
639e534
add conversations_dataset demo
ftgreat May 8, 2023
ae4d8cf
update data
ftgreat May 8, 2023
ee398f1
add llama_generate_bminf_convo
ftgreat May 9, 2023
2a388be
update llama predictor llama_generate_bminf_convo
ftgreat May 9, 2023
faa19cc
update data
ftgreat May 10, 2023
8351d19
add examples/aquila llama/llama-infer
ftgreat May 11, 2023
8d53267
fix sft
ftgreat May 12, 2023
50f7350
add code pretraining
ftgreat May 12, 2023
9f5a6ae
add gpt2_new_100k_newline for sft
May 15, 2023
56ee4f5
add enable_sft_conversations_dataset_v2
ftgreat May 15, 2023
91ed218
add llama_generate_bminf_convo_v2.py
ftgreat May 15, 2023
13fa515
add fix_large_bsz CUDA Illegal memory access on CrossEntropyLoss with…
ftgreat May 15, 2023
62b75b0
update tokens_B
ftgreat May 16, 2023
f660fdd
update examples/gpt3_pretrain/llama/data/script/llama_generate_bminf_…
ftgreat May 16, 2023
a9164a9
add call_infer_server
ftgreat May 16, 2023
4c62b7e
add aquila2llama_hf.py
May 17, 2023
f1f6b5e
rename flash_aquila2llama_hf
ftgreat May 17, 2023
d16810e
add tools
ftgreat May 19, 2023
1145586
add aquila flashattn sft
ftgreat May 19, 2023
ec5bbe9
up convo_dataset.py
ftgreat May 19, 2023
7a3e16f
temporary save ckpt
ftgreat May 19, 2023
cfca9b0
Merge branch 'gpm_dev2' of https://gitee.com/baai-opensp/flagai-inter…
ftgreat May 19, 2023
3c014d6
add ignore_index
ftgreat May 20, 2023
d4c7d96
update ignore_index
ftgreat May 20, 2023
5615280
up detect_model_params
ftgreat May 21, 2023
5a425f0
Merge branch 'gpm_dev2' of https://gitee.com/baai-opensp/flagai-inter…
ftgreat May 21, 2023
88449c6
add enable_flash_attn_models
ftgreat May 22, 2023
b7f22e9
delete bos & eos
ftgreat May 22, 2023
93a8dca
add eps args
ftgreat May 22, 2023
95dc29a
add bos & eos
ftgreat May 23, 2023
a715b8f
add eps
ftgreat May 23, 2023
36372a5
Merge branch 'gpm_dev2' of https://gitee.com/baai-opensp/flagai-inter…
ftgreat May 23, 2023
d3fb4c7
update train_llama_bmtrain_datasets
ftgreat May 23, 2023
48cfb1e
add loss infer
ftgreat May 23, 2023
2fd277f
fix infer
ftgreat May 24, 2023
d931619
add demo data
ftgreat May 25, 2023
0206658
fix flagai/model/llama_model.py.
ftgreat May 26, 2023
257dfff
aquila-7b-server update
ftgreat May 27, 2023
e767882
add enable_sft_conversations_dataset_v3
ftgreat May 30, 2023
196f8e0
fix train_llama_bmtrain_datasets.py.
ftgreat May 30, 2023
b962988
add flash_atten_llama_style for exact convert
ftgreat May 31, 2023
63f27bb
Merge branch 'gpm_dev2' of https://gitee.com/baai-opensp/flagai-inter…
ftgreat May 31, 2023
e69a987
add enable_sft_dataset_val_file
ftgreat Jun 2, 2023
d962b95
tmp_model.config fix
ftgreat Jun 2, 2023
4e720bb
update gpt.py add acc metric
ftgreat Jun 3, 2023
fc49f61
add infer upload to wandb
ftgreat Jun 3, 2023
7a5c97d
mv data into tools
ftgreat Jun 3, 2023
08c837f
Merge branch 'gpm_dev2' of https://gitee.com/baai-opensp/flagai-inter…
ftgreat Jun 3, 2023
1df2cc3
add make_conversations_v3
ftgreat Jun 3, 2023
c0704a7
update tools/script/generate_valid_loss_sft.py
ftgreat Jun 3, 2023
befa604
up tolls
ftgreat Jun 3, 2023
d48fa78
fix llama
ftgreat Jun 4, 2023
65551f9
raw add for testing
ftgreat Jun 6, 2023
3c98fe3
added aquila
ftgreat Jun 6, 2023
9132ee7
renamed and removed some files
ftgreat Jun 7, 2023
65bbe0c
updated docs
ftgreat Jun 8, 2023
2c39b78
Add files via upload
Anhforth Jun 8, 2023
fbf5fce
image added
ftgreat Jun 8, 2023
5c77b27
docs updated
ftgreat Jun 8, 2023
bacd223
dir changed
ftgreat Jun 8, 2023
a224efe
Add files via upload
Anhforth Jun 8, 2023
f7256aa
updated docs and reorg dirs
ftgreat Jun 8, 2023
71177f3
upadted
ftgreat Jun 8, 2023
cc2fc7d
removed some files
ftgreat Jun 8, 2023
c2f99c5
updated docs
ftgreat Jun 8, 2023
9a5553e
updated docs
ftgreat Jun 8, 2023
92784ec
fixed some typos
ftgreat Jun 8, 2023
1c1cf89
Update README_AquilaChat-7B.md
Anhforth Jun 8, 2023
d1c3fe3
Update README_Aquila-7B.md
Anhforth Jun 8, 2023
d82c5dc
Update README_AquilaCode-7B-nv.md
Anhforth Jun 8, 2023
ed85296
Update README.md
Anhforth Jun 8, 2023
e4b9830
Add files via upload
Anhforth Jun 8, 2023
5af62c6
modified docs and files
ftgreat Jun 8, 2023
5fe0b59
removed some paths
ftgreat Jun 8, 2023
b9961b0
modified docs
ftgreat Jun 8, 2023
9ee7d38
modified docs
ftgreat Jun 8, 2023
82e72b9
add license link
ftgreat Jun 8, 2023
c18c296
updated docs
ftgreat Jun 9, 2023
0d7d046
removed temp dir
ftgreat Jun 9, 2023
26202a2
removed some parts
ftgreat Jun 9, 2023
4e1f3f1
updated docs
ftgreat Jun 9, 2023
cb05c23
Add files via upload
Anhforth Jun 9, 2023
3e9d5e7
updated
ftgreat Jun 9, 2023
d2a2e46
updated
ftgreat Jun 9, 2023
7 changes: 7 additions & 0 deletions .gitignore
100644 → 100755
@@ -19,6 +19,8 @@ test_report
/data/
/tests/*/data
checkpoints
checkpoints_in
checkpoints_out
state_dict
checkpoints*
vocabs
@@ -28,3 +30,8 @@ qqp
glm_large_qqp_pytorch
wandb
clip_benchmark_datasets
examples/AltCLIP/clip_benchmark_datasets
examples/glm_pretrain/data.lazy
examples/glm_pretrain/examples/glm_pretrain/data.lazy
examples/vit_cifar100/cifar100
examples/vit_cifar100/data
Empty file modified CHANGELOG.md
100644 → 100755
Empty file.
Empty file modified CLA.md
100644 → 100755
Empty file.
Empty file modified CODE_OF_CONDUCT.md
100644 → 100755
Empty file.
Empty file modified COMMITTERS.csv
100644 → 100755
Empty file.
Empty file modified CONTRIBUTING.md
100644 → 100755
Empty file.
Empty file modified Dockerfile
100644 → 100755
Empty file.
Empty file modified GOVERNANCE.md
100644 → 100755
Empty file.
Empty file modified LICENSE
100644 → 100755
Empty file.
9 changes: 8 additions & 1 deletion README.md
100644 → 100755
@@ -8,7 +8,8 @@
--------------------------------------------------------------------------------


FlagAI (Fast LArge-scale General AI models) is a fast, easy-to-use and extensible toolkit for large-scale models. Our goal is to support training, fine-tuning, and deployment of large-scale models on various downstream tasks with multi-modality.



## Why should I use FlagAI?
@@ -292,6 +293,12 @@ The majority of FlagAI is licensed under the [Apache 2.0 license](LICENSE), howe
- [29 Jun 2022] release v1.1.0, support OPTs downloading and inference/fine-tuning [#63](https://github.com/FlagAI-Open/FlagAI/pull/63)
- [17 May 2022] made our first contribution in [#1](https://github.com/FlagAI-Open/FlagAI/pull/1)

## Platforms supported

<div align="center">
<img src="./examples/aquila/img/merged_platform.jpg" height = "100" align=center />
</div>



## Misc
11 changes: 11 additions & 0 deletions README_zh.md
100644 → 100755
@@ -11,7 +11,13 @@
**FlagAI飞智**是一个快速、易于使用和可扩展的大模型工具包。 我们的目标是支持在多模态的各种下游任务上训练、微调和部署大规模模型。
<br><br>

<p align="center">
已支持平台
</p>

****
天数智芯 Nvidia
****

## 为什么你需要 FlagAI?

@@ -283,6 +289,11 @@ FlagAI飞智大部分项目基于 [Apache 2.0 license](LICENSE),但是请注
* GLM 是基于协议 [MIT license](https://github.com/THUDM/GLM/blob/main/LICENSE)
* AltDiffusion 是基于协议 [CreativeML Open RAIL-M license](https://huggingface.co/spaces/CompVis/stable-diffusion-license)

## 平台支持

<div align="center">
<img src="./examples/aquila/img/merged_platform.jpg" height = "100" align=center />
</div>


## Misc
Empty file modified SUPPORT.md
100644 → 100755
Empty file.
16 changes: 16 additions & 0 deletions examples/Aquila/Aquila-code/Aquila-code.yaml
@@ -0,0 +1,16 @@
batch_size: 10
gradient_accumulation_steps: 1
lr: 2.0e-5
warm_up: 0.01
save_interval: 1000

bmt_cpu_offload: False
bmt_pre_load: False
bmt_async_load: False
bmt_loss_scale: 524288

save_optim: True
save_rng: True

load_optim: False
resume_dataset: False
160 changes: 160 additions & 0 deletions examples/Aquila/Aquila-code/README_AquilaCode-7B.md
@@ -0,0 +1,160 @@
license: [Apache License 2.0](https://model.baai.ac.cn/use-agreement)


# AquilaCode-7B

## 简介/Overview
Aquila语言大模型在技术上继承了GPT-3、LLaMA等的架构设计优点,替换了一批更高效的底层算子实现、重新设计实现了中英双语的tokenizer,升级了BMTrain并行训练方法,在Aquila的训练过程中实现了比Megatron+DeepSpeed ZeRO-2将近8倍的训练效率。Aquila语言大模型是在中英文高质量语料基础上从0开始训练的,通过数据质量的控制、多种训练的优化方法,实现在更小的数据集、更短的训练时间,获得比其它开源模型更优的性能。也是首个支持中英双语知识、支持商用许可协议、符合国内数据合规需要的大规模开源语言模型。

The Aquila language model inherits the architectural design strengths of GPT-3 and LLaMA, swaps in a set of more efficient low-level operator implementations, redesigns the tokenizer for Chinese-English bilingual support, and upgrades the BMTrain parallel training method, achieving nearly 8x the training efficiency of Megatron+DeepSpeed ZeRO-2 during Aquila's training. The Aquila language model is trained from scratch on high-quality Chinese and English corpora. Through data-quality control and a range of training optimizations, it achieves better performance than other open-source models on smaller datasets and with shorter training time. It is also the first large-scale open-source language model that supports Chinese-English bilingual knowledge, allows commercial licensing, and complies with domestic (Chinese) data regulations.

<!-- AquilaCode-7B-NV是在Aquila-7B模型的基础上,经过代码数据的继续预训练得到的基础代码模型。此模型由智源研究院研发。在主流评测数据集上的评测结果如下

AquilaCode-7B-nv is a foundational code model obtained by further pretraining on code data based on the Aquila-7B model. It was developed by Beijing Academy of Artificial Intelligence. The evaluation results on mainstream benchmark datasets are as follows:

| 名称/Name | MMLU_Chinese_EM | CLUE-EM |MMLU-EM| BoolQ-EM| TruthfulQA-EM |IMDB-EM| RAFT-EM|
| ----- | ---- | ----- | ---- | ----- | ---- | ----- | ----- |
| [AquilaCode-7B-nv](https://model.baai.ac.cn/model-detail/xxxxx) | 0.xxx | 0.xxx|0.xxx | 0.xxx|0.xxx | -->


<!-- 您可以在[FlagEval基础模型评测平台](https://flageval.baai.ac.cn/#/home) 查看更多评测指标

You can view [FlagEval Model Evaluation Platform](https://flageval.baai.ac.cn/#/home) for more details -->



我们的模型也同时支持[Huggingface平台](hflink)

We also support [Huggingface](hflink)

## 模型细节/Model details

我们使用了一系列更高效的底层算子来辅助模型训练,其中包括参考[flash-attention](https://github.com/HazyResearch/flash-attention)的方法并替换了一些中间计算,同时还使用了RMSNorm。在此基础上,我们应用了[BMtrain](https://github.com/OpenBMB/BMTrain)技术进行轻量化的并行训练,该技术采用了数据并行、ZeRO(零冗余优化器)、优化器卸载、检查点和操作融合、通信-计算重叠等方法来优化模型训练过程。

Aquila模型所采用的tokenizer是由我们从头开始训练的,支持中英双语。与其他tokenizer的参数对比见下表:

我们在处理英文、中文以及代码数据时,采用了不同的分词器对一万个样本进行了抽取。随后,我们统计了每个样本的token数量,并将其记录在表格中。


We used a series of more efficient low-level operators to assist with model training, including an attention implementation adapted from [flash-attention](https://github.com/HazyResearch/flash-attention) with some intermediate computations replaced, as well as RMSNorm. Building on this foundation, we applied [BMTrain](https://github.com/OpenBMB/BMTrain) for lightweight parallel training, which uses data parallelism, ZeRO (zero redundancy optimizer), optimizer offloading, checkpointing and operation fusion, and communication-computation overlap to optimize the training process.
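
For reference, RMSNorm itself is only a few lines; the sketch below is a generic PyTorch version of the technique, not the exact implementation used in FlagAI.

```python
# Generic RMSNorm sketch (illustrative; not the exact FlagAI implementation).
import torch
import torch.nn as nn

class RMSNorm(nn.Module):
    def __init__(self, dim: int, eps: float = 1e-6):
        super().__init__()
        self.eps = eps
        self.weight = nn.Parameter(torch.ones(dim))  # learnable gain, no bias

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Scale by the inverse root-mean-square over the last dimension;
        # unlike LayerNorm, no mean is subtracted and no bias is added.
        rms = torch.rsqrt(x.pow(2).mean(dim=-1, keepdim=True) + self.eps)
        return self.weight * (x * rms)
```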

The tokenizer used in the Aquila model was trained from scratch by us and supports both English and Chinese. The parameters of this tokenizer are compared to those of other tokenizers in the table below:

We sampled ten thousand examples each from the English, Chinese, and code data, tokenized them with each tokenizer, and report the resulting average token count per sample in the table.

| 模型/Model | 词表大小/Vocab size | 说明/Note |英文平均tokens量/Avg tokens(English)| 中文平均tokens量/Avg tokens(Chinese)|代码平均tokens量/Avg tokens(code) |
| ----- | ---- | ----- | ---- | ----- | ---- |
| gpt2 | 50527 | bpe|1717 | 1764|2323 |
| llama | 32000 | sp(bpe)|1805| 1257|1970 |
| gpt2_new_100k | 100000 | bpe|1575 | 477|1679 |
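
The averages above can be reproduced with any tokenizer that exposes an `encode` method. The sketch below is illustrative only: it assumes Hugging Face `transformers` is installed, and the sample file names are placeholders rather than the corpora actually used for this table.

```python
# Illustrative sketch: average tokens per sample for one tokenizer.
# Assumes Hugging Face `transformers`; file paths are placeholders.
from transformers import AutoTokenizer

def avg_tokens(tokenizer, path: str, limit: int = 10000) -> float:
    total, count = 0, 0
    with open(path, encoding="utf-8") as f:
        for i, line in enumerate(f):
            if i >= limit:  # only look at the first `limit` samples
                break
            total += len(tokenizer.encode(line))
            count += 1
    return total / max(count, 1)

tok = AutoTokenizer.from_pretrained("gpt2")
for name, path in [("english", "samples_en.txt"),
                   ("chinese", "samples_zh.txt"),
                   ("code", "samples_code.txt")]:
    print(name, round(avg_tokens(tok, path)))
```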


## 训练数据集/Training data
`AquilaCode-7B-NV`和`AquilaCode-7B-TS`训练使用了[starcoderdata](https://huggingface.co/datasets/bigcode/starcoderdata)中的shell, sql,C, C++, Java, Javascript, Python, git-commits, github-issues, jupyter-scripts, jupyter-structured-text数据

The `AquilaCode-7B-NV` and `AquilaCode-7B-TS` models were obtained by continued pretraining on the shell, sql, C, C++, Java, Javascript, Python, git-commits, github-issues, jupyter-scripts, and jupyter-structured-text subsets of [starcoderdata](https://huggingface.co/datasets/bigcode/starcoderdata).

## 使用方式/How to use

### 1. 推断/Inference

```python
import torch
from flagai.auto_model.auto_loader import AutoLoader
from flagai.model.predictor.predictor import Predictor

model_dir = "./checkpoints_in"
device = "cuda"

print("building model...")
loader = AutoLoader("lm",
                    model_name="aquilacode-7b-nv",
                    only_download_config=True,
                    use_cache=True,
                    fp16=True,
                    model_dir=model_dir)

model = loader.get_model()
tokenizer = loader.get_tokenizer()

model.eval()
model.to(device)

predictor = Predictor(model, tokenizer)

max_new_tokens = 256

# Read the prompt from a plain-text file.
test_file = "./datasets/code_test.txt"
with open(test_file) as fin:
    prompt = '\n' + fin.read() + '\n'

# Tokenize the prompt to determine how long the full generation may be.
input_ids = tokenizer.encode_plus_non_glm(prompt)["input_ids"][:-1]
input_length = len(input_ids)
max_length = input_length + max_new_tokens

with torch.no_grad():
    res = predictor.predict_generate_randomsample(prompt,
                                                  out_max_length=max_length,
                                                  top_p=0.95,
                                                  temperature=0.7)
print(res)
```

### 2. 可监督微调/Supervised Fine-tuning(SFT)
#### Step 1: 配置模型/ Setup Checkpoints
在`./checkpoints_in`里新建`aquilacode-7b-nv`(或`aquilacode-7b-ts`)目录。将微调后的checkpoint,以及原始`aquilacode-7b-nv`模型里的其余文件,包括`config.json`, `merges.txt`, `vocab.json`, `special_tokens_map.json`放进去

Create a new directory named `aquilacode-7b-nv` (or `aquilacode-7b-ts`) inside `./checkpoints_in`. Place the fine-tuned checkpoint and all other files from the original `aquilacode-7b-nv` model, including `config.json`, `merges.txt`, `vocab.json`, and `special_tokens_map.json`, into this directory. A quick way to verify the layout is shown in the sketch below.
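
This check is only a convenience under the assumptions above (directory name and file list as described in this step); it is not part of the FlagAI tooling.

```python
# Sanity-check sketch: confirm the checkpoint directory described above is
# complete before launching SFT. Directory name and file list follow the text.
from pathlib import Path

ckpt_dir = Path("./checkpoints_in/aquilacode-7b-nv")
expected = ["config.json", "merges.txt", "vocab.json", "special_tokens_map.json"]

missing = [name for name in expected if not (ckpt_dir / name).exists()]
if missing:
    raise FileNotFoundError(f"missing files in {ckpt_dir}: {missing}")
print("checkpoint directory looks complete")
```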

#### Step 2: 修改参数/Modify Parameters
* `cd /examples/Aquila/Aquila-code`
* 配置`hostfile`文件, 参考[这里](../../../doc_zh/TUTORIAL_8_ENVIRONMENT_SETUP.md#a配置hostfilehostfile-中的v100-1-与sshconfig-对应) ; Configure the `hostfile` file, refer to [here](../../../docs/TUTORIAL_8_ENVIRONMENT_SETUP.md)
* 配置`bmtrain_mgpu.sh`文件, 将`SCRIPT_FILE`改成`aquila_sft_code.py`; configure the `bmtrain_mgpu.sh` file, change `SCRIPT_FILE` to `aquila_sft_code.py`
* (可选) 在`Aquila-sft.yaml`文件里更改参数 ; (optional) change parameters in `Aquila-sft-code.yaml`

| 参数名 Parameter | 类型 Type | 描述 Description |
|--------------------------------|------------|-------------------------------------------------------|
| batch_size | int | 每次迭代训练时,从数据集中抽取的样本数。一般来说,它越大,处理速度越快,但会占用更多的内存; The number of samples drawn from the dataset in each training iteration. Generally, a larger batch size speeds up processing but consumes more memory |
| gradient_accumulation_steps | int | 在更新模型权重之前,要对多个小批次进行梯度计算的次数。主要应用于GPU显存较小的情况下,可以使用小的batch_size,通过梯度累积达到与大batch_size相同的效果; The number of mini-batches over which gradients are accumulated before the model weights are updated. Mainly useful when GPU memory is limited: a small batch_size combined with gradient accumulation matches the effect of a larger batch_size |
| lr | float | 指控制模型更新参数时的步长或速率。学习率过高可能导致模型不收敛,而学习率过低则可能导致训练时间过长或者陷入局部最优解; The step size or rate at which the model updates its parameters during training. A learning rate that is too high may prevent convergence, while one that is too low may lead to long training times or a local optimum |
| warm_up | float | 初始学习率与原始学习率的比例; The ratio between the initial learning rate and the original learning rate |
| save_interval | int | 模型保存的间隔,即每训练多少个iteration保存一次模型。当训练时间较长时,保存间隔可以避免因突然中断或出现错误导致训练成果全部丢失; The interval at which the model is saved, i.e., a checkpoint is written every N training iterations. For long runs, periodic saving prevents all progress from being lost to an unexpected interruption or error |
| enable_sft_conversations_dataset_v3 | bool | 数据处理方式; Data preprocessing method |
| enable_sft_dataset_dir | str | 可监督微调的数据集目录; Dataset directory of the SFT dataset |
| enable_sft_dataset_file | str | 可监督微调的数据集文件名; Filename of the SFT dataset |
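
The relationship between `batch_size` and `gradient_accumulation_steps` is easy to sanity-check by hand; the sketch below uses the values from `Aquila-code.yaml`, with the GPU count being an assumption for illustration only.

```python
# Effective batch size sanity check. The GPU count is an assumption
# (e.g. one 8-GPU node); adjust it to match your hostfile.
batch_size = 10                   # per-GPU micro-batch (Aquila-code.yaml)
gradient_accumulation_steps = 1   # from Aquila-code.yaml
num_gpus = 8                      # illustrative assumption

effective_batch = batch_size * gradient_accumulation_steps * num_gpus
print(f"samples per optimizer step: {effective_batch}")  # -> 80
```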

#### Step 3: 启动可监督微调/Start SFT
```
bash dist_trigger_docker.sh hostfile Aquila-sft.yaml [aquilacode-7b-nv/aquilacode-7b-ts] [实验名]
```
接下来会输出下列信息,注意`NODES_NUM`应该与节点数相等,`LOGFILE`是模型运行的日志文件;The following information will be output. Note that `NODES_NUM` should be equal to the number of nodes, and `LOGFILE` is the log file for the model run.

![Screenshot](../img/info.jpg)

成功训练之前能看到如下信息(具体参数可能不同); Before training runs successfully, you should see information like the following (exact parameters may differ):

![Screenshot](../img/info2.jpg)

## 证书/License

AquilaCode-7B-NV开源模型使用 [智源Aquila系列模型许可协议](linkhere), 原始代码基于[Apache License 2.0](https://www.apache.org/licenses/LICENSE-2.0)


The AquilaCode-7B-NV open-source model is licensed under the [BAAI Aquila Model License Agreement](linkhere). The source code is under the [Apache License 2.0](https://www.apache.org/licenses/LICENSE-2.0).