
Conversation

@RaymondLi0 RaymondLi0 commented Nov 22, 2022


eod = self.tokenizer.eod
pad = self.tokenizer.tokenizer.special_tokens[FIM_PAD]
pad = self.tokenizer.special_tokens[FIM_PAD]
Collaborator


Why?

Collaborator Author


The local GPT2Tokenizer implementation has this special_tokens attribute, but the PreTrainedTokenizerFast from the transformers library does not.
So the code here instead relies on the wrappers around these tokenizers:

self.special_tokens = {

self.special_tokens = self.tokenizer.special_tokens
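
For context, a rough sketch of the wrapper idea, with illustrative names only (the _HFTokenizerWrapper class and fim_tokens argument below are not the actual Megatron code):

# Hypothetical sketch of a wrapper exposing a uniform special_tokens mapping,
# regardless of the underlying tokenizer implementation.
from transformers import PreTrainedTokenizerFast

class _HFTokenizerWrapper:
    def __init__(self, hf_tokenizer: PreTrainedTokenizerFast, fim_tokens):
        self.tokenizer = hf_tokenizer
        # PreTrainedTokenizerFast has no special_tokens dict of its own,
        # so build one here from the added FIM tokens.
        self.special_tokens = {
            tok: hf_tokenizer.convert_tokens_to_ids(tok) for tok in fim_tokens
        }

    @property
    def eod(self):
        return self.tokenizer.eos_token_id

With both wrappers exposing special_tokens, the call site can stay as pad = self.tokenizer.special_tokens[FIM_PAD] for either backend.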

scaled_softmax_warp_forward<input_t, output_t, acc_t, 12>
<<<blocks, threads, 0, at::cuda::getCurrentCUDAStream()>>>(dst, src, scale, batch_count, key_seq_len);
break;
case 13: // 8192
Collaborator


Interesting. Did you double-check that it works as intended? Things can get tricky when kernels grow too big (registers, shared memory, etc.). Don't know if it's relevant here.

Collaborator Author


I admit that I did not double-check this. I relied on the code from https://github.com/NVIDIA/Megatron-LM/pull/243/files.
I tried to train a model with seq-length 4092 and it seemed to work fine.
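
For what it's worth, one way to sanity-check the new branch would be to compare the fused kernel against a plain PyTorch softmax at the larger sequence length. This is only a sketch: the module name scaled_softmax_cuda and its forward(input, scale) signature are assumptions standing in for however the compiled extension is actually exposed, which this PR does not show.

# Hypothetical correctness check for the extended kernel; the binding name
# scaled_softmax_cuda and its forward(input, scale) signature are assumptions.
import torch
import scaled_softmax_cuda

batch, heads, q_len, k_len = 1, 2, 8192, 8192
scale = 0.125

x = torch.randn(batch, heads, q_len, k_len, dtype=torch.float16, device="cuda")

fused = scaled_softmax_cuda.forward(x, scale)
reference = torch.softmax(x.float() * scale, dim=-1).half()

# fp16 kernels will not match the fp32 reference bit-for-bit, so use a loose tolerance.
torch.testing.assert_close(fused, reference, rtol=1e-3, atol=1e-3)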

@@ -0,0 +1,201 @@
# coding=utf-8
Collaborator


OK for now, but we'll want to switch to the common implementation soon.

@@ -0,0 +1,323 @@
import argparse
Collaborator


Is this an exact copy from transformers? Same for push_checkpoints.py

Collaborator Author


convert_checkpoints.py is largely copied from transformers, but includes support for the MQA models.
push_checkpoints.py is new code.
This folder is named hf_transformers because these are tools to facilitate the usage of Megatron models in transformers.
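
As a point of reference, pushing a converted checkpoint folder to the Hub usually boils down to something like the following sketch using the public huggingface_hub API; the repo name and folder path are placeholders, and this is not the actual push_checkpoints.py:

# Illustrative sketch only, not the push_checkpoints.py added in this PR.
from huggingface_hub import HfApi

api = HfApi()
repo_id = "my-org/megatron-mqa-model"  # placeholder repo name

# Create the repo if needed, then upload the converted checkpoint files.
api.create_repo(repo_id, exist_ok=True)
api.upload_folder(
    folder_path="converted_checkpoint",  # e.g. the output of convert_checkpoints.py
    repo_id=repo_id,
    commit_message="Upload converted Megatron MQA checkpoint",
)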

@RaymondLi0 RaymondLi0 merged commit 58884e0 into multi-query-attention Feb 7, 2023
@RaymondLi0 RaymondLi0 deleted the preprocess-hf branch February 7, 2023 21:22