This is the official implementation for our ICML 2023 Oral paper:
Hiera: A Hierarchical Vision Transformer without the Bells-and-Whistles
Chaitanya Ryali*,
Yuan-Ting Hu*,
Daniel Bolya*,
Chen Wei,
Haoqi Fan,
Po-Yao Huang,
Vaibhav Aggarwal,
Arkabandhu Chowdhury,
Omid Poursaeed,
Judy Hoffman,
Jitendra Malik,
Yanghao Li*,
Christoph Feichtenhofer*
ICML '23 Oral | GitHub | arXiv | BibTeX
*: Equal contribution.
Hiera is a hierarchical vision transformer that is fast, powerful, and, above all, simple. It outperforms the state-of-the-art across a wide array of image and video tasks while being much faster.
Vision transformers like ViT use the same spatial resolution and number of features throughout the whole network. But this is inefficient: the early layers don't need that many features, and the later layers don't need that much spatial resolution. Prior hierarchical models like ResNet accounted for this by using fewer features at the start and less spatial resolution at the end.
Several domain-specific vision transformers have been introduced that employ this hierarchical design, such as Swin or MViT. But in the pursuit of state-of-the-art results using fully supervised training on ImageNet-1K, these models have become more and more complicated as they add specialized modules to make up for spatial biases that ViTs lack. While these changes produce effective models with attractive FLOP counts, under the hood the added complexity makes these models slower overall.
We show that a lot of this bulk is actually unnecessary. Instead of manually adding spatial biases through architectural changes, we opt to teach the model these biases. By training with MAE, we can simplify or remove all of these bulky modules in existing transformers and increase accuracy in the process. The result is Hiera, an extremely efficient and simple architecture that outperforms the state-of-the-art on several image and video recognition tasks.
- [2023.06.01] Initial release.
Hiera requires a reasonably recent version of torch. After that, you can install hiera through pip:
pip install hiera-transformer
This repo should support the latest timm version, but timm is a constantly updating package. Create an issue if you have problems with a newer version of timm.
If using torch hub, you don't need to install the hiera package. But if you'd like to develop using hiera, it could be a good idea to install it from source:
git clone https://github.com/facebookresearch/hiera.git
cd hiera
python setup.py build develop
Here we provide model checkpoints for Hiera. Each model listed is accessible on torch hub, e.g.:
model = torch.hub.load("facebookresearch/hiera", model="hiera_base_224")
The model name is the same as the checkpoint name.
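For example, a minimal sketch of loading a checkpoint through torch hub and putting it in evaluation mode (passing pretrained=True here is an assumption based on the package API shown further below):

```python
import torch

# Load the Hiera-B image model from torch hub with its released weights.
# (pretrained=True is assumed to behave like hiera.hiera_base_224 in the package API.)
model = torch.hub.load("facebookresearch/hiera", model="hiera_base_224", pretrained=True)
model.eval()  # switch to inference mode
```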
Note: the speeds listed here were benchmarked without PyTorch's optimized scaled dot product attention. If using PyTorch 2.0 or above, your inference speed will probably be faster than what's listed here.
As of now, base finetuned models are available. The rest are coming soon.
Model | Input Size | Pretrained Models (IN-1K MAE) | Finetuned Models (IN-1K Supervised) | IN-1K Top-1 (%) | A100 fp16 Speed (im/s) |
---|---|---|---|---|---|
Hiera-T | 224x224 | mae_hiera_tiny_224 | hiera_tiny_224 | 82.8 | 2758 |
Hiera-S | 224x224 | mae_hiera_small_224 | hiera_small_224 | 83.8 | 2211 |
Hiera-B | 224x224 | mae_hiera_base_224 | hiera_base_224 | 84.5 | 1556 |
Hiera-B+ | 224x224 | mae_hiera_base_plus_224 | hiera_base_plus_224 | 85.2 | 1247 |
Hiera-L | 224x224 | mae_hiera_large_224 | hiera_large_224 | 86.1 | 531 |
Hiera-H | 224x224 | mae_hiera_huge_224 | hiera_huge_224 | 86.9 | 274 |
Model | Input Size | Pretrained Models (K400 MAE) | Finetuned Models (K400) | K400 (3x5 views) Top-1 (%) | A100 fp16 Speed (clip/s) |
---|---|---|---|---|---|
Hiera-B | 16x224x224 | Coming Soon | hiera_base_16x224 | 84.0 | 133.6 |
Hiera-B+ | 16x224x224 | Coming Soon | hiera_base_plus_16x224 | 85.0 | 84.1 |
Hiera-L | 16x224x224 | Coming Soon | hiera_large_16x224 | 87.3 | 40.8 |
Hiera-H | 16x224x224 | Coming Soon | hiera_huge_16x224 | 87.8 | 20.9 |
This repo implements the code to run Hiera models for inference and is still a work in progress. Here's what we currently have available and what we have planned:
- Image Inference
  - MAE implementation
- Video Inference
  - MAE implementation
- Training scripts
- Full Model Zoo
See examples for demonstrations of how to use Hiera.
See examples/inference for an example of how to prepare the data for inference.
Instantiate a model either through torch hub or, after installing the hiera package, by running:
import hiera
model = hiera.hiera_base_224(pretrained=True)
Then you can run inference like any other model:
output = model(x)
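As a rough, self-contained sketch of image inference (the resize and normalization values below are the standard ImageNet ones and are an assumption here; see examples/inference for the preprocessing actually used, and note that the image path is just a placeholder):

```python
import torch
from PIL import Image
from torchvision import transforms

import hiera

# Standard ImageNet-style preprocessing (assumed; check examples/inference
# for the exact preprocessing used with the released checkpoints).
preprocess = transforms.Compose([
    transforms.Resize(256),
    transforms.CenterCrop(224),
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406],
                         std=[0.229, 0.224, 0.225]),
])

model = hiera.hiera_base_224(pretrained=True).eval()

img = Image.open("cat.jpg").convert("RGB")  # "cat.jpg" is a placeholder path
x = preprocess(img)[None, ...]              # add a batch dimension: (1, 3, 224, 224)

with torch.no_grad():
    output = model(x)                       # class logits

print(output.argmax(dim=-1))                # predicted IN-1K class index
```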
Video inference works the same way; just use a 16x224 model instead.
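A similarly hedged sketch for video, assuming the builder names match the checkpoint names and the model expects a clip tensor of shape (batch, channels, frames, height, width):

```python
import torch
import hiera

# K400-finetuned Hiera-B video model (name assumed to match the checkpoint).
model = hiera.hiera_base_16x224(pretrained=True).eval()

# Dummy clip: 1 video, 3 channels, 16 frames of 224x224.
# (The channels-first, frames-second layout is an assumption here.)
clip = torch.randn(1, 3, 16, 224, 224)

with torch.no_grad():
    output = model(clip)  # per-clip class logits
```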
Note: for efficiency, Hiera re-orders its tokens at the start of the network (see the Roll and Unroll modules in hiera_utils.py). Thus, tokens aren't in spatial order by default. If you'd like to use intermediate feature maps for a downstream task, pass the return_intermediates flag when running the model:
output, intermediates = model(img_norm[None, ...], return_intermediates=True)
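For instance, a quick way to inspect what comes back (the number of stages and their shapes depend on the model; x here is just a preprocessed, batched input as in the sketch above):

```python
# Inspect the intermediate feature maps returned alongside the output.
output, intermediates = model(x, return_intermediates=True)
for i, feat in enumerate(intermediates):
    print(f"stage {i}: {tuple(feat.shape)}")
```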
We provide a script for easy benchmarking. See examples/benchmark to see how to use it.
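examples/benchmark is the intended way to benchmark; as a rough stand-in, a minimal fp16 timing loop might look like the following (the batch size, warm-up count, and CUDA/half-precision setup are our assumptions, not the script's):

```python
import time
import torch
import hiera

model = hiera.hiera_base_224(pretrained=True).eval().cuda().half()
x = torch.randn(64, 3, 224, 224, device="cuda", dtype=torch.half)

with torch.no_grad():
    for _ in range(10):               # warm-up iterations
        model(x)
    torch.cuda.synchronize()

    iters = 50
    start = time.perf_counter()
    for _ in range(iters):
        model(x)
    torch.cuda.synchronize()
    elapsed = time.perf_counter() - start

print(f"{iters * x.shape[0] / elapsed:.1f} im/s")
```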
PyTorch 2.0 introduced optimized scaled dot product attention, which can speed up transformers quite a bit. We didn't use this in our original benchmarking, but since it's a free speed-up this repo will automatically use it if available. To get its benefits, make sure your torch version is 2.0 or above.
Coming soon.
If you use Hiera or this code in your work, please cite:
@article{ryali2023hiera,
title={Hiera: A Hierarchical Vision Transformer without the Bells-and-Whistles},
author={Ryali, Chaitanya and Hu, Yuan-Ting and Bolya, Daniel and Wei, Chen and Fan, Haoqi and Huang, Po-Yao and Aggarwal, Vaibhav and Chowdhury, Arkabandhu and Poursaeed, Omid and Hoffman, Judy and Malik, Jitendra and Li, Yanghao and Feichtenhofer, Christoph},
journal={ICML},
year={2023}
}
This work is licensed under a Creative Commons Attribution-NonCommercial 4.0 International License.
See contributing and the code of conduct.