Merged
Changes from all commits
35 commits
3ff7dc9
📝 add image/vision classification and asr
stevhliu Mar 31, 2022
4e8453a
🖍 minor formatting fixes
stevhliu Mar 31, 2022
4f80d31
Fixed a typo in legacy seq2seq_trainer.py (#16531)
Agoniii Apr 1, 2022
1f426af
Add ONNX export for BeiT (#16498)
akuma12 Apr 1, 2022
ef37dc4
call on_train_end when trial is pruned (#16536)
fschlatt Apr 1, 2022
91167b1
Type hints added (#16529)
Dahlbomii Apr 1, 2022
a9425ec
Fix Bart type hints (#16297)
gchhablani Apr 1, 2022
a1dfe00
Add VisualBert type hints (#16544)
gchhablani Apr 1, 2022
8976f05
Adding missing type hints for mBART model (PyTorch) (#16429)
reichenbch Apr 1, 2022
50ca1d1
Remove MBart subclass of XLMRoberta in tokenzier docs (#16546)
gchhablani Apr 1, 2022
4ba9b4d
Use random_attention_mask for TF tests (#16517)
ydshieh Apr 1, 2022
ecd9f35
Improve code example (#16450)
NielsRogge Apr 1, 2022
f05d235
Pin tokenizers version <0.13 (#16539)
LysandreJik Apr 1, 2022
085f0f7
Add code samples for TF speech models (#16494)
ydshieh Apr 1, 2022
cbc776a
[FlaxSpeechEncoderDecoder] Fix dtype bug (#16581)
patrickvonplaten Apr 4, 2022
b615e7c
Making the impossible to connect error actually report the right URL.…
Narsil Apr 4, 2022
0e1dc49
Fix flax import in __init__.py: modeling_xglm -> modeling_flax_xglm (…
stancld Apr 4, 2022
90ea204
Add utility to find model labels (#16526)
sgugger Apr 4, 2022
02d49ea
Enable doc in Spanish (#16518)
sgugger Apr 4, 2022
96494dc
Add use_auth to load_datasets for private datasets to PT and TF examp…
KMFODA Apr 4, 2022
cb50ff9
add a test checking the format of `convert_tokens_to_string`'s output…
SaulLu Apr 4, 2022
7d64881
TF: Finalize `unpack_inputs`-related changes (#16499)
gante Apr 4, 2022
b442b33
[SpeechEncoderDecoderModel] Correct Encoder Last Hidden State Output …
sanchit-gandhi Apr 4, 2022
29a3b42
initialize the default rank set on TrainerState (#16530)
andrescodas Apr 4, 2022
b96e629
Trigger doc build
sgugger Apr 4, 2022
82ad581
Fix CI: test_inference_for_pretraining in ViTMAEModelTest (#16591)
ydshieh Apr 5, 2022
cfb63da
add a template to add missing tokenization test (#16553)
SaulLu Apr 5, 2022
e98825c
made _load_pretrained_model_low_mem static + bug fix (#16548)
FrancescoSaverioZuppichini Apr 5, 2022
8336282
handle torch_dtype in low cpu mem usage (#16580)
patil-suraj Apr 5, 2022
85f2bd9
[Doctests] Correct filenaming (#16599)
patrickvonplaten Apr 5, 2022
d726e67
Adding new train_step logic to make things less confusing for users (…
Rocketknight1 Apr 5, 2022
7744ce7
Adding missing type hints for BigBird model (#16555)
reichenbch Apr 5, 2022
20ef1a0
[deepspeed] fix typo, adjust config name (#16597)
stas00 Apr 5, 2022
578abb1
🖍 apply feedback
stevhliu Apr 5, 2022
7ea3d54
Merge branch 'main' into update-tasks-summary
stevhliu Apr 5, 2022
155 changes: 155 additions & 0 deletions docs/source/en/task_summary.mdx
@@ -967,3 +967,158 @@ Here is an example of doing translation using a model and a tokenizer. The proce
</frameworkcontent>

We get the same translation as with the pipeline example.

## Audio classification

Audio classification assigns a class to an audio signal. The Keyword Spotting dataset from the [SUPERB](https://huggingface.co/datasets/superb) benchmark is an example dataset that can be used for audio classification fine-tuning. This dataset contains ten classes of keywords for classification. If you'd like to fine-tune a model for audio classification, take a look at the [run_audio_classification.py](https://github.com/huggingface/transformers/blob/main/examples/pytorch/audio-classification/run_audio_classification.py) script or this [how-to guide](./tasks/audio_classification).

The following examples demonstrate how to use a [`pipeline`] as well as a model and feature extractor for audio classification inference:

```py
>>> from transformers import pipeline

>>> audio_classifier = pipeline(
... task="audio-classification", model="ehcalabres/wav2vec2-lg-xlsr-en-speech-emotion-recognition"
... )
>>> audio_classifier("jfk_moon_speech.wav")
[{'label': 'calm', 'score': 0.13856211304664612},
{'label': 'disgust', 'score': 0.13148026168346405},
{'label': 'happy', 'score': 0.12635163962841034},
{'label': 'angry', 'score': 0.12439591437578201},
{'label': 'fearful', 'score': 0.12404385954141617}]
```
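
The `jfk_moon_speech.wav` file above is a local recording you would supply yourself. If you don't have an audio file at hand, the pipeline also accepts raw waveform arrays, so a minimal sketch (assuming 🤗 Datasets is installed) can reuse a sample from a public dataset:

```py
>>> from datasets import load_dataset
>>> from transformers import pipeline

>>> # Load a small speech dataset and grab the raw waveform of the first sample
>>> dataset = load_dataset("hf-internal-testing/librispeech_asr_demo", "clean", split="validation")

>>> audio_classifier = pipeline(
...     task="audio-classification", model="ehcalabres/wav2vec2-lg-xlsr-en-speech-emotion-recognition"
... )
>>> # Audio pipelines accept raw waveform arrays sampled at the model's expected rate
>>> audio_classifier(dataset[0]["audio"]["array"])
```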

The general process for using a model and feature extractor for audio classification is:

1. Instantiate a feature extractor and a model from the checkpoint name.
2. Process the audio signal to be classified with a feature extractor.
3. Pass the input through the model and take the `argmax` to retrieve the most likely class.
4. Convert the class id to a class name with `id2label` to return an interpretable result.

<frameworkcontent>
<pt>
```py
>>> from transformers import AutoFeatureExtractor, AutoModelForAudioClassification
>>> from datasets import load_dataset
>>> import torch

>>> dataset = load_dataset("hf-internal-testing/librispeech_asr_demo", "clean", split="validation")
>>> dataset = dataset.sort("id")
>>> sampling_rate = dataset.features["audio"].sampling_rate

>>> feature_extractor = AutoFeatureExtractor.from_pretrained("superb/wav2vec2-base-superb-ks")
>>> model = AutoModelForAudioClassification.from_pretrained("superb/wav2vec2-base-superb-ks")

>>> inputs = feature_extractor(dataset[0]["audio"]["array"], sampling_rate=sampling_rate, return_tensors="pt")

>>> with torch.no_grad():
... logits = model(**inputs).logits

>>> predicted_class_ids = torch.argmax(logits, dim=-1).item()
>>> predicted_label = model.config.id2label[predicted_class_ids]
>>> predicted_label
```
</pt>
</frameworkcontent>
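
If you want more than the single best class, a small sketch (not part of the official example) extends the PyTorch block above by converting the logits to probabilities with a softmax and printing the five most likely labels:

```py
>>> import torch

>>> # Reuse `logits` and `model` from the example above
>>> probabilities = torch.softmax(logits, dim=-1)[0]

>>> # Rank the five most likely classes with their scores
>>> top5 = torch.topk(probabilities, k=5)
>>> for score, class_id in zip(top5.values, top5.indices):
...     print(model.config.id2label[class_id.item()], round(score.item(), 4))
```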

## Automatic speech recognition

Automatic speech recognition transcribes an audio signal to text. The [Common Voice](https://huggingface.co/datasets/common_voice) dataset is an example dataset that can be used for automatic speech recognition fine-tuning. It contains an audio file of a speaker and the corresponding sentence. If you'd like to fine-tune a model for automatic speech recognition, take a look at the [run_speech_recognition_ctc.py](https://github.com/huggingface/transformers/blob/main/examples/pytorch/speech-recognition/run_speech_recognition_ctc.py) or [run_speech_recognition_seq2seq.py](https://github.com/huggingface/transformers/blob/main/examples/pytorch/speech-recognition/run_speech_recognition_seq2seq.py) scripts or this [how-to guide](./tasks/asr).

The following examples demonstrate how to use a [`pipeline`] as well as a model and processor for automatic speech recognition inference:

```py
>>> from transformers import pipeline

>>> speech_recognizer = pipeline(
... task="automatic-speech-recognition", model="facebook/wav2vec2-base-960h"
... )
>>> speech_recognizer("jfk_moon_speech.wav")
{'text': "PRESENTETE MISTER VICE PRESIDENT GOVERNOR CONGRESSMEN THOMAS SAN O TE WILAN CONGRESSMAN MILLA MISTER WEBB MSTBELL SCIENIS DISTINGUISHED GUESS AT LADIES AND GENTLEMAN I APPRECIATE TO YOUR PRESIDENT HAVING MADE ME AN HONORARY VISITING PROFESSOR AND I WILL ASSURE YOU THAT MY FIRST LECTURE WILL BE A VERY BRIEF I AM DELIGHTED TO BE HERE AND I'M PARTICULARLY DELIGHTED TO BE HERE ON THIS OCCASION WE MEED AT A COLLEGE NOTED FOR KNOWLEGE IN A CITY NOTED FOR PROGRESS IN A STATE NOTED FOR STRAINTH AN WE STAND IN NEED OF ALL THREE"}
```
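
The pipeline also accepts a list of inputs, so several recordings can be transcribed in one call. The filenames below are hypothetical placeholders for your own audio files; a sketch could look like this:

```py
>>> # Hypothetical local recordings: replace with your own files
>>> audio_files = ["interview_part1.wav", "interview_part2.wav"]

>>> transcriptions = speech_recognizer(audio_files)
>>> for result in transcriptions:
...     print(result["text"])
```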

The general process for using a model and processor for automatic speech recognition is:

1. Instantiate a processor (which combines a feature extractor for input processing and a tokenizer for decoding) and a model from the checkpoint name.
2. Process the audio signal with the processor.
3. Pass the input through the model and take the `argmax` to retrieve the predicted token ids.
4. Decode the token ids with the processor to obtain the transcription.

<frameworkcontent>
<pt>
```py
>>> from transformers import AutoProcessor, AutoModelForCTC
>>> from datasets import load_dataset
>>> import torch

>>> dataset = load_dataset("hf-internal-testing/librispeech_asr_demo", "clean", split="validation")
>>> dataset = dataset.sort("id")
>>> sampling_rate = dataset.features["audio"].sampling_rate

>>> processor = AutoProcessor.from_pretrained("facebook/wav2vec2-base-960h")
>>> model = AutoModelForCTC.from_pretrained("facebook/wav2vec2-base-960h")

>>> inputs = processor(dataset[0]["audio"]["array"], sampling_rate=sampling_rate, return_tensors="pt")
>>> with torch.no_grad():
... logits = model(**inputs).logits
>>> predicted_ids = torch.argmax(logits, dim=-1)

>>> transcription = processor.batch_decode(predicted_ids)
>>> transcription[0]
```
</pt>
</frameworkcontent>
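
The `facebook/wav2vec2-base-960h` checkpoint expects audio sampled at 16 kHz, which the demo dataset above already uses. If your own audio has a different sampling rate and you load it with 🤗 Datasets, a minimal sketch for resampling it before feature extraction is:

```py
>>> from datasets import Audio, load_dataset

>>> dataset = load_dataset("hf-internal-testing/librispeech_asr_demo", "clean", split="validation")

>>> # Casting the audio column decodes and resamples each sample to 16 kHz on the fly,
>>> # which matches what the Wav2Vec2 feature extractor expects
>>> dataset = dataset.cast_column("audio", Audio(sampling_rate=16_000))
```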

## Image classification

Like text and audio classification, image classification assigns a class to an image. The [CIFAR-100](https://huggingface.co/datasets/cifar100) dataset is an example dataset that can be used for image classification fine-tuning. It contains an image and the corresponding class. If you'd like to fine-tune a model for image classification, take a look at the [run_image_classification.py](https://github.com/huggingface/transformers/blob/main/examples/pytorch/image-classification/run_image_classification.py) script or this [how-to guide](./tasks/image_classification).

The following examples demonstrate how to use a [`pipeline`] as well as a model and feature extractor for image classification inference:

```py
>>> from transformers import pipeline

>>> vision_classifier = pipeline(task="image-classification")
>>> vision_classifier(
... images="https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/pipeline-cat-chonk.jpeg"
... )
[{'label': 'lynx, catamount', 'score': 0.4403027892112732},
{'label': 'cougar, puma, catamount, mountain lion, painter, panther, Felis concolor',
'score': 0.03433405980467796},
{'label': 'snow leopard, ounce, Panthera uncia',
'score': 0.032148055732250214},
{'label': 'Egyptian cat', 'score': 0.02353910356760025},
{'label': 'tiger cat', 'score': 0.023034192621707916}]
```
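
The pipeline above downloads the image from a URL, but it also accepts local file paths and `PIL.Image` objects. The path below is a hypothetical placeholder for your own image:

```py
>>> from PIL import Image

>>> # Hypothetical local file: replace with your own image
>>> image = Image.open("my_cat_photo.jpg")
>>> vision_classifier(images=image)
```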

The general process for using a model and feature extractor for image classification is:

1. Instantiate a feature extractor and a model from the checkpoint name.
2. Process the image to be classified with a feature extractor.
3. Pass the input through the model and take the `argmax` to retrieve the predicted class.
4. Convert the class id to a class name with `id2label` to return an interpretable result.

<frameworkcontent>
<pt>
```py
>>> from transformers import AutoFeatureExtractor, AutoModelForImageClassification
>>> import torch
>>> from datasets import load_dataset

>>> dataset = load_dataset("huggingface/cats-image")
>>> image = dataset["test"]["image"][0]

>>> feature_extractor = AutoFeatureExtractor.from_pretrained("google/vit-base-patch16-224")
>>> model = AutoModelForImageClassification.from_pretrained("google/vit-base-patch16-224")

>>> inputs = feature_extractor(image, return_tensors="pt")

>>> with torch.no_grad():
... logits = model(**inputs).logits

>>> predicted_label = logits.argmax(-1).item()
>>> print(model.config.id2label[predicted_label])
Egyptian cat
```
</pt>
</frameworkcontent>
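
The example above is PyTorch-only. If you work in TensorFlow and your installed version of 🤗 Transformers provides `TFAutoModelForImageClassification`, a roughly equivalent sketch would be:

```py
>>> from transformers import AutoFeatureExtractor, TFAutoModelForImageClassification
>>> import tensorflow as tf
>>> from datasets import load_dataset

>>> dataset = load_dataset("huggingface/cats-image")
>>> image = dataset["test"]["image"][0]

>>> feature_extractor = AutoFeatureExtractor.from_pretrained("google/vit-base-patch16-224")
>>> model = TFAutoModelForImageClassification.from_pretrained("google/vit-base-patch16-224")

>>> # Return TensorFlow tensors instead of PyTorch tensors
>>> inputs = feature_extractor(image, return_tensors="tf")

>>> logits = model(**inputs).logits

>>> predicted_label = int(tf.math.argmax(logits, axis=-1)[0])
>>> print(model.config.id2label[predicted_label])
```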