
Finetuning on FLUE #32

Open
LisaBanana opened this issue Oct 5, 2020 · 12 comments

@LisaBanana

Hi!
I would like to fine-tune Flaubert on the FLUE tasks with the Hugging Face library. I downloaded the PAWS data and used the code from your GitHub repo, but I get this error message that I can't get past:

[screenshot of the error]

Any idea what to do?

Thanks for this project by the way, I'm looking forward to using it!

Have a good day,
Lisa

@LisaBanana

OK, I got past this by adding encoding='utf-8' on line 61 of extract_pawsx.py:
with open(os.path.join(outdir, splts[s][idx]), 'w') as f_out: (old)
with open(os.path.join(outdir, splts[s][idx]), 'w', encoding='utf-8') as f_out: (new)

Anyway, this could be useful for future users of the FLUE scripts with the Hugging Face library.
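
A quick way to confirm the fix took on all the processed files (a minimal sketch, not from the repo; the path and the *.tsv pattern are assumptions based on the commands later in this thread):

```python
import glob
import os

# Re-read every processed file as UTF-8; this raises UnicodeDecodeError
# if the bytes on disk are not valid UTF-8.
for path in glob.glob(os.path.expanduser("~/Data/FLUE/pawsx/processed/*.tsv")):
    with open(path, encoding="utf-8") as f:
        f.read()
    print("ok:", path)
```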


formiel commented Oct 6, 2020

Hi @LisaBanana,

Thanks a lot for your interest! I'm glad you got it working. I'm going to update the code with your fix.
Please do not hesitate to let me know if there are further issues.

Have a nice day!


LisaBanana commented Oct 7, 2020

Hi @formiel! Thanks for your message. Actually, I do have a new issue. I'm trying to fine-tune Flaubert using the MRPC GLUE task; my command line is:
python C:/Python/Python36/Lib/site-packages/transformers/examples/text-classification/run_glue.py \
--data_dir ~/Data/FLUE/pawsx/processed/ \
--model_name_or_path flaubert/flaubert_base_cased \
--task_name MRPC \
--output_dir D:/LisaBanana/MYPATH/output_flue \
--max_seq_length 512 \
--do_train \
--do_eval \
--learning_rate 2e-5 \
--num_train_epochs 3.0 \
--save_steps 10 \
--fp16 \
--fp16_opt_level O1 \
|& tee output.log
But at one point I encounter this message:

[screenshot of the error]

From there I opened glue.py and run_glue.py to see what could be done, but it's a bit out of my reach.
If you have any ideas I'd be more than happy (I'll keep looking on my side, of course; if I find something I'll post it here) :)

Update:
I also get an error if I change the task from MRPC to QQP:

[screenshot of the error]

And the same for the CoLA task:

[screenshot of the error]

It seems to come from train.tsv, but the only thing I changed in the data processing is the encoding line from my previous comment, so I don't understand what I'm doing wrong.

Have a nice day :)


formiel commented Oct 7, 2020

Hi @LisaBanana,

Could you please look at the file train.tsv in ~/Data/FLUE/pawsx/processed/ to see if it has the following format (5 columns)? And please make sure to use the command for fine-tuning with HuggingFace's transformers library (not the bash command for fine-tuning with XLM's library).

Label			Sent1	Sent2
0	0	0	"À Paris, en octobre 1560, il rencontra secrètement l'ambassadeur d'Angleterre, Nicolas Throckmorton, lui demandant un passeport pour retourner en Angleterre en passant par l'Écosse.	En octobre 1560, il rencontra secrètement l'ambassadeur d'Angleterre, Nicolas Throckmorton, à Paris, et lui demanda un passeport pour retourner en Écosse par l'Angleterre."
1	1	1	"La saison NBA 1975 - 76 était la 30e saison de la National Basketball Association.	La saison 1975-1976 de la National Basketball Association était la 30e saison de la NBA."
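
If eyeballing the raw file is awkward, a small sketch like this (path assumed from the command above) prints the first few rows split into fields so the 5-column layout is easy to verify:

```python
import itertools
import os

path = os.path.expanduser("~/Data/FLUE/pawsx/processed/train.tsv")
with open(path, encoding="utf-8") as f:
    for row in itertools.islice(f, 3):
        fields = row.rstrip("\n").split("\t")
        # expect 5 fields per data row (the header line may differ)
        print(len(fields), fields)
```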


LisaBanana commented Oct 7, 2020

My file is in this format, so it's good:
[screenshot of train.tsv]
It's also at the right path (the error doesn't say "file not found" or anything like that, and when I print the path from glue.py it gives me the right path). I use the command for Hugging Face:
[screenshot of the command]

My command line is:
config='flue/examples/pawsx_lr5e6_hf_base_cased.cfg'
source $config

python C:/Python/Python36/Lib/site-packages/transformers/examples/text-classification/run_glue.py \
--data_dir ~/Data/FLUE/pawsx/processed/ \
--model_name_or_path flaubert/flaubert_base_cased \
--task_name MRPC \
--output_dir D:/LisaBanana/MYPATH/output_flue \
--max_seq_length 512 \
--do_train \
--do_eval \
--learning_rate 5e-6 \
--num_train_epochs 3.0 \
--save_steps 10 \
--fp16 \
--fp16_opt_level O1 \
|& tee output.log
So yeah, I don't know what else could be wrong. Maybe the versions of some libraries? But it's strange...


formiel commented Oct 7, 2020

That’s strange… I've tried the pipeline again with the latest version of transformers, and it works.

I would suggest cloning the transformers repo and installing it in editable mode (pip install -e ., making sure to pip uninstall transformers first) so that you can debug more easily. For example, you can add some prints in run_glue.py to check the length of the variable line and see why it's list index out of range (something along the lines of the sketch below).
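
As a standalone sketch rather than an edit inside run_glue.py (path assumed; for MRPC the processor reads the label and the two sentences from fixed column positions, so any row with fewer fields triggers the list index out of range):

```python
import os

# Scan train.tsv for rows whose tab-split length is too short for
# MRPC-style indexing (label plus two sentence columns).
path = os.path.expanduser("~/Data/FLUE/pawsx/processed/train.tsv")
with open(path, encoding="utf-8") as f:
    for i, raw in enumerate(f):
        line = raw.rstrip("\n").split("\t")
        if len(line) < 5:
            print(f"row {i}: {len(line)} fields -> {line}")
```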


LisaBanana commented Oct 7, 2020

It still doesn't work; I tried reinstalling everything in another virtual env, but still nothing. Would you mind sharing a requirements.txt with the versions of your libraries/packages? I really don't understand what's wrong with what I'm doing...

@LisaBanana

OK, thanks to my genius co-worker, we've got it! The TSV file as it was processed (by the script you provide on the repo, which is strange, by the way, if you don't get the same issues I did) had some unexpected "\n" characters, and that was why everything was broken. Anyway, thanks for your help before :)
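
For anyone hitting the same thing, a minimal sketch of the kind of guard that avoids it when writing the processed TSV (this is not the repo's extract_pawsx.py code; `fields` is assumed to be the list of columns for one example):

```python
def clean(field: str) -> str:
    # collapse embedded newlines/tabs/extra spaces so each example
    # stays on a single TSV row
    return " ".join(field.split())

def write_row(f_out, fields):
    f_out.write("\t".join(clean(f) for f in fields) + "\n")
```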


AmauryLepicard commented Oct 9, 2020

Hi there, I'm a colleague of LisaBanana, and we managed to make the whole thing work!

Unfortunately, it seems that the training will take 140 hours. Is this expected?

I'm using transformers 3.3.1 and running on a Tesla K80 GPU.

[screenshot of the training progress]

Here is the command I'm using:

python examples\text-classification\run_glue.py --data_dir E:\flaubert\data\FLUE\pawsx\processed --model_name_or_path flaubert/flaubert_base_cased --task_name MRPC --output_dir e:\flaubert\Experiments\FLUE\Flaubert\pawsx_hf_flaubert_base_cased\lr_5e-6 --max_seq_length 512 --do_train --do_eval --learning_rate 5e-6 --num_train_epochs 30 --save_steps 50000 --fp16 --fp16_opt_level O1 --per_device_train_batch_size 8 --per_device_eval_batch_size 8

Have a good day,
Amaury


formiel commented Oct 9, 2020

Hi @LisaBanana,

> OK, thanks to my genius co-worker, we've got it! The TSV file as it was processed (by the script you provide on the repo, which is strange, by the way, if you don't get the same issues I did) had some unexpected "\n" characters, and that was why everything was broken. Anyway, thanks for your help before :)

Sorry, I was a bit overwhelmed these last few days and forgot to answer you. I'm glad it worked for you. That's strange, the line break is \n in my file. Could you please let me know which OS you are using?

Hi @AmauryLepicard,

Thanks for your help in getting the code to run earlier!

> Unfortunately, it seems that the training will take 140 hours. Is this expected?

Oh I think that's not expected. However, I think your training should take around 19.31 hours instead of 140 hours (184260 steps / 2.65s / 3600)?


AmauryLepicard commented Oct 9, 2020

> Thanks for your help in getting the code to run earlier!

Actually I'm not the "genius coder", he was another of our colleagues :-)

> Unfortunately, it seems that the training will take 140 hours. Is this expected?

> Oh I think that's not expected. However, I think your training should take around 19.31 hours instead of 140 hours (184260 steps / 2.65s / 3600)?

It would be 19 hours if it were 2.65 steps per second, but it's the opposite: 2.65 seconds per step!
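
Spelled out, the two readings of the same numbers give:

```python
steps = 184260
seconds_per_step = 2.65
print(steps * seconds_per_step / 3600)  # ~135.6 h at 2.65 s/step, i.e. the ~140 h estimate
print(steps / seconds_per_step / 3600)  # ~19.3 h if it were 2.65 steps/s
```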
I started the training for 13 epochs which would bring us to Monday morning, and we'll see what happens.
If you try it on your machine, how fast is it?


formiel commented Oct 9, 2020

Hi @AmauryLepicard,

> Actually I'm not the "genius coder", he was another of our colleagues :-)

Oh, sorry for my mistake. Thanks for your interest, then!

> It would be 19 hours if it were 2.65 steps per second, but it's the opposite: 2.65 seconds per step!

Oops, that's right. The code displays the number of steps per second on my side, so I overlooked that and assumed it was the same for you (maybe because I'm using transformers version 3.0.2). Training on 2 Quadro P6000 GPUs with the same per-GPU batch size as yours takes me around 22 hours. Assuming the training time scales linearly, it would take around 44 hours to train on 1 GPU.
