
Conversation

@RaymondLi0 RaymondLi0 commented Nov 22, 2022


eod = self.tokenizer.eod
pad = self.tokenizer.tokenizer.special_tokens[FIM_PAD]
pad = self.tokenizer.special_tokens[FIM_PAD]
Collaborator


Why?

Collaborator Author


The local GPT2Tokenizer implementation has this special_tokens attribute, but the PreTrainedTokenizerFast from the transformers library does not.
So the code here instead relies on the wrappers around these tokenizers:

self.special_tokens = {

self.special_tokens = self.tokenizer.special_tokens
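
For context, a rough sketch of the wrapper idea, with illustrative names only (the _HFTokenizerWrapper class and fim_tokens argument below are not the actual Megatron code):

# Hypothetical sketch of a wrapper exposing a uniform special_tokens mapping,
# regardless of the underlying tokenizer implementation.
from transformers import PreTrainedTokenizerFast

class _HFTokenizerWrapper:
    def __init__(self, hf_tokenizer: PreTrainedTokenizerFast, fim_tokens):
        self.tokenizer = hf_tokenizer
        # PreTrainedTokenizerFast has no special_tokens dict of its own,
        # so build one here from the added FIM tokens.
        self.special_tokens = {
            tok: hf_tokenizer.convert_tokens_to_ids(tok) for tok in fim_tokens
        }

    @property
    def eod(self):
        return self.tokenizer.eos_token_id

With both wrappers exposing special_tokens, the call site can stay as pad = self.tokenizer.special_tokens[FIM_PAD] for either backend.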

scaled_softmax_warp_forward<input_t, output_t, acc_t, 12>
<<<blocks, threads, 0, at::cuda::getCurrentCUDAStream()>>>(dst, src, scale, batch_count, key_seq_len);
break;
case 13: // 8192
Collaborator


Interesting. Did you double-check that it works as intended? Things can get tricky when kernels grow too big (registers, shared memory, etc.). Don't know if it's relevant here.

Collaborator Author


I admit that I did not double-check this. I relied on the code from https://github.com/NVIDIA/Megatron-LM/pull/243/files.
I tried to train a model with seq-length 4092 and it seemed to work fine.
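
For what it's worth, one way to sanity-check the new branch would be to compare the fused kernel against a plain PyTorch softmax at the larger sequence length. This is only a sketch: the module name scaled_softmax_cuda and its forward(input, scale) signature are assumptions standing in for however the compiled extension is actually exposed, which this PR does not show.

# Hypothetical correctness check for the extended kernel; the binding name
# scaled_softmax_cuda and its forward(input, scale) signature are assumptions.
import torch
import scaled_softmax_cuda

batch, heads, q_len, k_len = 1, 2, 8192, 8192
scale = 0.125

x = torch.randn(batch, heads, q_len, k_len, dtype=torch.float16, device="cuda")

fused = scaled_softmax_cuda.forward(x, scale)
reference = torch.softmax(x.float() * scale, dim=-1).half()

# fp16 kernels will not match the fp32 reference bit-for-bit, so use a loose tolerance.
torch.testing.assert_close(fused, reference, rtol=1e-3, atol=1e-3)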

@@ -0,0 +1,201 @@
# coding=utf-8
Collaborator


OK for now, but we'll want to switch to the common implementation soon.

@@ -0,0 +1,323 @@
import argparse
Collaborator


Is this an exact copy from transformers? Same for push_checkpoints.py

Collaborator Author


convert_checkpoints.py is largely copied from transformers, but includes support for the MQA models.
push_checkpoints.py is new code.
This folder is named hf_transformers because these are tools to facilitate the usage of Megatron models in transformers.
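
As a point of reference, pushing a converted checkpoint folder to the Hub usually boils down to something like the following sketch using the public huggingface_hub API; the repo name and folder path are placeholders, and this is not the actual push_checkpoints.py:

# Illustrative sketch only, not the push_checkpoints.py added in this PR.
from huggingface_hub import HfApi

api = HfApi()
repo_id = "my-org/megatron-mqa-model"  # placeholder repo name

# Create the repo if needed, then upload the converted checkpoint files.
api.create_repo(repo_id, exist_ok=True)
api.upload_folder(
    folder_path="converted_checkpoint",  # e.g. the output of convert_checkpoints.py
    repo_id=repo_id,
    commit_message="Upload converted Megatron MQA checkpoint",
)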

@RaymondLi0 RaymondLi0 merged commit 58884e0 into multi-query-attention Feb 7, 2023
@RaymondLi0 RaymondLi0 deleted the preprocess-hf branch February 7, 2023 21:22