GPT extrapolatable position embedding (xpos/sandwich/alibi/kerple) and Flash Attention #6666
When will the tokenizer not have the `model`, `vocab_file`, or `merge_file` property?
I'm just wondering if this may introduce silent failures.
@trias702, could you please comment on that?
@MaximumEntropy It happens if you use an HF tokenizer. To use a pretrained HF tokenizer like `EleutherAI/gpt-neox-20b` or `bert-base-uncased`, you don't need to pass any `model`, `vocab_file`, or `merge_file` in the config; you just need to pass `tokenizer.library=huggingface` and `tokenizer.type=EleutherAI/gpt-neox-20b`. But because the old code would call `register_artifact` for `cfg.tokenizer.model`, which would be None, it would throw an error. That's why I changed this block: to ensure it works correctly with pure HF tokenizers at model instantiation.
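For context, a minimal sketch of the kind of guard being described, assuming a hypothetical helper name and the config keys mentioned above (this is not the actual diff in this PR):

```python
from typing import Any

# Hypothetical helper (not the PR's actual code): register only the tokenizer
# artifacts that are actually set, so a pure HF tokenizer config with
# tokenizer.library=huggingface and tokenizer.type=EleutherAI/gpt-neox-20b
# no longer trips over a None cfg.tokenizer.model.
def register_tokenizer_artifacts(model: Any, cfg: Any) -> None:
    for key in ("model", "vocab_file", "merge_file"):
        src = cfg.tokenizer.get(key, None)
        if src is not None:
            # register_artifact is NeMo's mechanism for bundling files into .nemo archives.
            model.register_artifact(f"tokenizer.{key}", src)
```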
What is `override_vocab_size` and why is it needed? I don't see it in the `gpt_config.yaml` file either.
This is needed for converting models from another platform (Mosaic, HF, etc.) into Megatron: you need to instantiate the model with an Embedding layer of exactly the dimensionality required to import the weights, otherwise you get a shape mismatch, or you need complex weight surgery to cut out the weights you need and pad with zeros. It was easier to add a special override flag that forces the model to use whatever vocab_size makes downstream conversion work. If `override_vocab_size` is None, which it will be 99.8% of the time, it uses `padded_vocab_size` the same as before, so it felt like a harmless bypass that makes model conversion much easier for all future conversion work.
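As a rough illustration of the behaviour described above (the function name and exact padding formula here are assumptions for the sketch, not the PR's code):

```python
from typing import Optional

def resolve_vocab_size(tokenizer_vocab_size: int,
                       make_vocab_size_divisible_by: int = 128,
                       tensor_model_parallel_size: int = 1,
                       override_vocab_size: Optional[int] = None) -> int:
    # If an override is given, use exactly that size so the Embedding layer
    # matches the external checkpoint being imported.
    if override_vocab_size is not None:
        return override_vocab_size
    # Otherwise fall back to the usual Megatron-style padding: round the
    # tokenizer vocab size up to a multiple of
    # make_vocab_size_divisible_by * tensor_model_parallel_size.
    multiple = make_vocab_size_divisible_by * tensor_model_parallel_size
    return ((tokenizer_vocab_size + multiple - 1) // multiple) * multiple
```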
So, for MPT-7B, the `override_vocab_size` is not equal to the padded vocab size computed from the tokenizer vocab size plus divisibility-by-128 upscaling?
I have concerns about this for two reasons: 1) it is a "dangling" property with no reference in the yaml file, so the only way for people to use it is to override from the CLI via something like `+model.override_vocab_size=32000`; 2) there are divisibility constraints on the vocab size based on the tensor parallel size and mod-128 divisibility for tensor cores, and this sidesteps both of them.
It is divisible, but the actual number comes out wrong if you try to divine the Embedding size from the tokeniser vocab size. For MPT-7B, the tokeniser they trained with has a vocab size of 50254, but their Embedding weights are size 50432, which is divisible by 128 and works just fine with TP. However, if you just pass the MPT-7B tokeniser to MegatronGPTModel at instantiation, MegatronGPTModel applies its upsampling-divisibility logic to create the padded vocab_size, which in this case gives 50304. That is wrong, because the actual state_dict weights in MPT-7B are sized for 50432, so you get a mismatch when trying to load the weights. The alternative was to introduce complex weight-surgery logic that pads with zeros, but even then you have to pass some target number to pad up to, so it was easier to just pass that number as the actual matrix size for the Embedding layer, bypassing the upsampling-divisibility logic.
This override parameter is extremely useful for any work where we want to import external weights from a third-party Transformer model into Megatron, because we need to tell Megatron exactly what size we want the Embedding layer to be, and currently there is no way of doing that. I agree that the parameter is "dangling", but it is designed to only ever be used inside a script that converts a third-party model into NeMo, for example this one: https://github.com/NVIDIA/NeMo/blob/2a526adf9dd5d4ef22c1d6c1d807c6a8bbea14ec/scripts/nlp_language_modeling/convert_mpt_7b_hf_to_nemo.py
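To make the numbers in that explanation concrete, here is the arithmetic as a small self-contained snippet (assuming `make_vocab_size_divisible_by=128` and TP size 1, as implied above):

```python
tokenizer_vocab = 50254        # MPT-7B tokenizer vocab size
multiple = 128                 # make_vocab_size_divisible_by * tensor_model_parallel_size (TP=1 assumed)

padded = ((tokenizer_vocab + multiple - 1) // multiple) * multiple
print(padded)                  # 50304 -- what the default padding logic produces
print(50432 % multiple == 0)   # True -- the checkpoint's embedding size is also 128-divisible
# 50304 != 50432, hence the weight-shape mismatch unless the model is built
# with override_vocab_size=50432 to match the MPT-7B state_dict.
```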
@MaximumEntropy for the final call, but @trias702, if it's only for external conversion, how about renaming it to make that clear, or adding a comment?
Since this is really a case-by-case thing, why don't we create a model type class and set this vocab size according to the model type instead of relying on the user to set it? I would rather set "MPT-7B" than some magic number like 50432.
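Something along these lines, sketched loosely (the class name and mapping are illustrative only; the 50432 figure is the MPT-7B size discussed above):

```python
from enum import Enum

class ExternalModelType(str, Enum):
    # Illustrative only; additional entries would be added as new converters land.
    MPT_7B = "mpt-7b"

# Embedding sizes keyed by model type, so a conversion script selects
# "mpt-7b" rather than hard-coding the magic number 50432.
EMBEDDING_VOCAB_SIZES = {
    ExternalModelType.MPT_7B: 50432,
}

def vocab_size_for(model_type: str) -> int:
    return EMBEDDING_VOCAB_SIZES[ExternalModelType(model_type)]

print(vocab_size_for("mpt-7b"))  # 50432
```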