[Examples] Generalise Seq2Seq ASR to handle Whisper #19519

sanchit-gandhi · 2022-10-12T10:27:24Z

What does this PR do?

Generalises run_speech_recognition_seq2seq.py to handle Whisper.

To train the "tiny.en" model on LibriSpeech dummy:

Bash script

#!/usr/bin/env bash
CUDA_VISIBLE_DEVICES=0 python run_speech_recognition_seq2seq.py \
        --dataset_name="hf-internal-testing/librispeech_asr_dummy" \
        --model_name_or_path="openai/whisper-tiny.en" \
        --dataset_config_name="clean" \
        --train_split_name="validation" \
        --eval_split_name="validation" \
        --output_dir="./" \
        --preprocessing_num_workers="1" \
        --length_column_name="input_length" \
        --overwrite_output_dir \
        --num_train_epochs="1" \
        --per_device_train_batch_size="8" \
        --per_device_eval_batch_size="8" \
        --learning_rate="3e-4" \
        --warmup_steps="500" \
        --text_column_name="text" \
        --save_strategy="no" \
        --evaluation_strategy="epoch" \
        --logging_steps="10" \
        --save_total_limit="1" \
        --generation_max_length="40" \
        --generation_num_beams="1" \
        --fp16 \
        --gradient_checkpointing \
        --group_by_length \
        --predict_with_generate \
        --do_train --do_eval \
        --do_lower_case
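
One thing worth noting about the dummy run above: `--warmup_steps="500"` is far larger than the number of optimizer steps in a single epoch over such a small split, so the learning rate never gets close to `3e-4`. The sketch below works through the arithmetic. It assumes the Trainer's default linear warmup, no gradient accumulation, and a split of 73 examples; the example count is an assumption, not something stated in this PR, so adjust `num_examples` to match the real split. The helper name `warmup_lr` is hypothetical.

```python
import math


def warmup_lr(step: int, peak_lr: float, warmup_steps: int) -> float:
    """Learning rate during a linear warmup phase (valid for step < warmup_steps)."""
    return peak_lr * step / max(1, warmup_steps)


num_examples = 73    # ASSUMPTION: size of the dummy "validation" split
batch_size = 8       # --per_device_train_batch_size, single GPU
peak_lr = 3e-4       # --learning_rate
warmup_steps = 500   # --warmup_steps

# The last partial batch still counts as a step, hence the ceiling.
steps_per_epoch = math.ceil(num_examples / batch_size)
final_lr = warmup_lr(steps_per_epoch, peak_lr, warmup_steps)

print(f"optimizer steps in one epoch: {steps_per_epoch}")
print(f"learning rate reached: {final_lr:.1e} (peak {peak_lr:.1e})")
```

For a smoke test this is harmless, since the point of the run is only to check that the script executes end to end.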

To train the "medium.en" model on LibriSpeech 960h:

Bash script

#!/usr/bin/env bash
CUDA_VISIBLE_DEVICES=0 python run_speech_recognition_seq2seq.py \
    --model_name_or_path="openai/whisper-medium.en" \
    --dataset_name="librispeech_asr" \
    --dataset_config_name="all" \
    --train_split_name="train.clean.100+train.clean.360+train.other.500" \
    --eval_split_name="validation.clean" \
    --max_steps="5000" \
    --output_dir="./" \
    --run_name="whisper-librispeech" \
    --per_device_train_batch_size="64" \
    --per_device_eval_batch_size="16" \
    --logging_steps="25" \
    --learning_rate="1e-4" \
    --warmup_steps="500" \
    --report_to="wandb" \
    --preprocessing_num_workers="16" \
    --evaluation_strategy="steps" \
    --eval_steps="1000" \
    --save_strategy="steps" \
    --save_steps="1000" \
    --generation_max_length="224" \
    --generation_num_beams="1" \
    --length_column_name="input_length" \
    --gradient_checkpointing \
    --group_by_length \
    --freeze_encoder \
    --fp16 \
    --overwrite_output_dir \
    --do_train \
    --do_eval \
    --predict_with_generate \
    --use_auth_token
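
To get a feel for the training budget this command implies, a back-of-the-envelope calculation may help. It assumes a single GPU (as `CUDA_VISIBLE_DEVICES=0` suggests) and no gradient accumulation, which is the default when the flag is not passed. The helper `training_budget` and its returned keys are hypothetical and not part of the script.

```python
def training_budget(max_steps, batch_size, warmup_steps, eval_steps, save_steps,
                    num_devices=1, grad_accum=1):
    """Summarise what a step-based training configuration amounts to."""
    effective_batch = batch_size * num_devices * grad_accum
    return {
        "effective_batch_size": effective_batch,
        "samples_seen": max_steps * effective_batch,
        "warmup_fraction": warmup_steps / max_steps,
        # Evaluation and saving fire every N steps, including the final step
        # when max_steps is a multiple of N.
        "num_evaluations": max_steps // eval_steps,
        "num_checkpoints": max_steps // save_steps,
    }


# Values copied from the flags in the command above.
budget = training_budget(max_steps=5000, batch_size=64, warmup_steps=500,
                         eval_steps=1000, save_steps=1000)
for key, value in budget.items():
    print(f"{key}: {value}")
```

Under these assumptions the run processes 320,000 training samples, spends the first 10% of its steps in warmup, and evaluates and checkpoints five times.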

Before submitting

* This PR fixes a typo or improves the docs (you can dismiss the other checks if that's the case).
* Did you read the contributor guideline, Pull Request section?
* Was this discussed/approved via a GitHub issue or the forum? Please add a link to it if that's the case.
* Did you make sure to update the documentation with your changes? Here are the documentation guidelines, and here are tips on formatting docstrings.
* Did you write any new necessary tests?

Who can review?

Anyone in the community is free to review the PR once the tests have passed. Feel free to tag
members/contributors who may be interested in your PR.

HuggingFaceDocBuilderDev · 2022-10-12T10:41:43Z

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint.

examples/pytorch/speech-recognition/run_speech_recognition_seq2seq.py

patrickvonplaten

Thanks for iterating here!

sanchit-gandhi · 2022-11-14T16:59:27Z

@sgugger this one's ready to go! Just an FYI in case you wanted to take a look :)

sgugger

Thanks for working on this!

* merge conflicts
* bos and eos in datacollator
* (temp) hardcode removal of attention mask
* freeze encoder
* actually freeze encoder
* set max length / num beams according to gen kwargs
* (temp) fix tests
* don't pop attn mask
* override return attention mask config from Hub
* Hub configs updated 🤗
* final fixes
* update type annotations
* backward comp
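
One way to read the "bos and eos in datacollator" item: if the tokenizer already prepends a start token to every label sequence, and the model prepends its decoder start token again when it shifts the labels to build decoder inputs, the collator can drop the leading token so it is not duplicated. The following is a minimal pure-Python sketch under that assumption. It is not the code from this PR; the function name `collate_labels` and the toy token ids are hypothetical.

```python
from typing import List

IGNORE_INDEX = -100  # label value skipped by PyTorch's cross-entropy loss by default


def collate_labels(label_ids: List[List[int]],
                   decoder_start_token_id: int) -> List[List[int]]:
    """Pad label sequences to equal length and drop a shared leading start token."""
    max_len = max(len(ids) for ids in label_ids)
    # Pad with IGNORE_INDEX so that padding does not contribute to the loss.
    padded = [ids + [IGNORE_INDEX] * (max_len - len(ids)) for ids in label_ids]
    # If every sequence already starts with the decoder start token, remove it,
    # on the assumption that the model adds it back when shifting the labels.
    if all(row[0] == decoder_start_token_id for row in padded):
        padded = [row[1:] for row in padded]
    return padded


# Toy ids: 1 = start token, 2 = end-of-sequence token (hypothetical values).
batch = collate_labels([[1, 10, 11, 2], [1, 12, 2]], decoder_start_token_id=1)
print(batch)  # [[10, 11, 2], [12, 2, -100]]
```

The end-of-sequence token is kept in the labels so the model still learns when to stop generating.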

sanchit-gandhi force-pushed the whisper-fine-tuning branch from ae020a9 to ecd4e03 on October 13, 2022 14:07

sanchit-gandhi commented Oct 17, 2022

examples/pytorch/speech-recognition/run_speech_recognition_seq2seq.py (outdated, resolved)

sanchit-gandhi force-pushed the whisper-fine-tuning branch from 7172de6 to f78a08b on October 17, 2022 14:43

patrickvonplaten reviewed Oct 17, 2022

examples/pytorch/speech-recognition/run_speech_recognition_seq2seq.py (outdated, resolved)

patrickvonplaten approved these changes Oct 20, 2022

sanchit-gandhi added 11 commits November 14, 2022 11:37

merge conflicts

b6551ec

bos and eos in datacollator

4ea964e

(temp) hardcode removal of attention mask

8044e0d

freeze encoder

893d4d4

actually freeze encoder

3013a2b

set max length / num beams according to gen kwargs

9d0783b

(temp) fix tests

19fe9c1

don't pop attn mask

ba95a8e

override return attention mask config from Hub

6217d6c

Hub configs updated 🤗

6fa9c0e

final fixes

0f83fb7

sanchit-gandhi force-pushed the whisper-fine-tuning branch from 60e91b4 to 0f83fb7 on November 14, 2022 11:38

update type annotations

292d769

backward comp

3ac3806

sgugger approved these changes Nov 14, 2022

sanchit-gandhi merged commit af1a7c8 into huggingface:main Nov 14, 2022

sanchit-gandhi deleted the whisper-fine-tuning branch November 14, 2022 17:45

This was referenced Nov 15, 2022

[ASR Examples] Update README for Whisper #20230

Merged

Ability to fine-tune whisper large on a GPU with 24 gb of ram #20348

Closed
