
Commit b8587d7

Add segment-anything-fast perf/acc benchmarks to torchao (pytorch#457)
This PR adds segment-anything-fast evaluation to torchao, and also adds benchmarks for int8 quantization + 2:4 sparsity. With this we can run combined perf/accuracy benchmarks for segment-anything. This should give us a starting point for the relative perf vs. relative acc graph for PTC.

| Model Type | Technique | img/s | memory (MiB) | mIoU | relative speedup | relative accuracy |
|------------|-----------|-------|--------------|------|------------------|-------------------|
| ViT-h | baseline (bfloat16, max-autotune) | 22.75 | 15172 | 0.5811 | | |
| | int8 dynamic quant (attn + mlp) | 24.91 | 15154 | 0.5822 | **1.09x** | **100.19%** |
| | 2:4 sparsity (mlp only) | 24.81 | 15632 | 0.5672 | **1.10x** | **97.61%** |
| | 2:4 sparsity (attn + mlp) | 24.30 | 13429 | 0.5306 | **1.07x** | **91.31%** |
| | int8 dynamic quant (attn)<br>int8 dynamic quant + 2:4 sparsity (mlp lin1)<br>2:4 sparsity (mlp lin2) | 26.46 | 14865 | 0.5668 | **1.16x** | **97.54%** |

This PR only copies over the evaluation scripts. Eventually we should move over the modeling code too, but that is planned for a subsequent PR.
1 parent b0ee1ba commit b8587d7
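The relative speedup and relative accuracy columns above are plain ratios against the bfloat16 + max-autotune baseline row. A minimal sketch of that arithmetic, with the raw numbers copied from the table (small deviations, e.g. 1.09x vs. the table's 1.10x, are rounding):

```python
# Values copied from the commit table above; each technique is
# compared against the bfloat16 + max-autotune baseline.
baseline_img_s, baseline_miou = 22.75, 0.5811

runs = {
    "int8 dynamic quant (attn + mlp)": (24.91, 0.5822),
    "2:4 sparsity (mlp only)":         (24.81, 0.5672),
    "2:4 sparsity (attn + mlp)":       (24.30, 0.5306),
    "int8 (attn) + int8/2:4 (mlp)":    (26.46, 0.5668),
}

for technique, (img_s, miou) in runs.items():
    speedup = img_s / baseline_img_s   # relative speedup column
    rel_acc = miou / baseline_miou     # relative accuracy column
    print(f"{technique}: {speedup:.2f}x, {rel_acc:.2%}")
```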

14 files changed: +877 −159 lines

README.md

+14 −1
@@ -52,7 +52,20 @@ And a quick crash course on inference quantization to help parse the above table

 In some cases we rewrote popular GenAI models to be significantly faster in native PyTorch (as in, no C++/CUDA) to achieve what was at the time SOTA inference performance. These involve more intrusive code changes.

-* 8x speedups for Image segmentation models with [sam-fast](https://pytorch.org/blog/accelerating-generative-ai)
+* 9.5x speedups for Image segmentation models with [sam-fast](https://pytorch.org/blog/accelerating-generative-ai) compared to vanilla [sam](https://github.com/facebookresearch/segment-anything).
+* 1.16x speedup when composing int8 quantization with 2:4 sparsity against the accelerated baseline (`bfloat16` dtype and `torch.compile="max_autotune"`).
+
+| Model Type | Technique | img/s | memory (MiB) | mIoU (coco2017 val) | relative speedup | relative accuracy |
+|------------|-----------|-------|--------------|---------------------|------------------|-------------------|
+| ViT-h | sam (float32, eager) | 2.78 | 28806 | 0.58 | baseline | baseline |
+| | sam (bfloat16, eager) | 14.85 | 14424 | 0.58 | **5.34x** | **100%** |
+| | sam-fast (bfloat16, max-autotune) | 22.75 | 15172 | 0.58 | **8.18x** | **100%** |
+| | int8 dynamic quant (attn + mlp) | 24.91 | 15154 | 0.58 | **8.96x** | **100%** |
+| | 2:4 sparsity (mlp only) | 24.81 | 15632 | 0.57 | **8.92x** | **98%** |
+| | int8 dynamic quant (attn)<br>int8 dynamic quant + 2:4 sparsity (mlp lin1)<br>2:4 sparsity (mlp lin2) | 26.46 | 14865 | 0.57 | **9.52x** | **98%** |
+
+The relative speedup is measured purely across the image encoder (ViT) of the model, where we apply our model optimizations. Benchmarks were run on an NVIDIA A100 80GB GPU with batch_size=32.

 * 10x speedups for Language models with [gpt-fast](https://pytorch.org/blog/accelerating-generative-ai-2)
 * 3x speedup for Diffusion models with [sd-fast](https://pytorch.org/blog/accelerating-generative-ai-3)
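The 2:4 ("semi-structured") sparsity rows above rely on PyTorch's accelerated sparse kernels. As a rough illustration (not the exact code path in `eval_combo.py`, which wraps more careful logic behind the `--compress` flags), here is a minimal sketch of 2:4-sparsifying one linear layer with the public `torch.sparse` API; the magnitude-based pruning is illustrative:

```python
import torch
from torch.sparse import to_sparse_semi_structured

linear = torch.nn.Linear(1024, 1024, bias=False).half().cuda()

# Prune to the 2:4 pattern: keep the 2 largest-magnitude weights
# in every contiguous group of 4 (simple magnitude pruning).
w = linear.weight.detach().reshape(-1, 4)
mask = torch.zeros_like(w, dtype=torch.bool)
mask.scatter_(1, w.abs().topk(2, dim=1).indices, True)
pruned = (w * mask).reshape(linear.weight.shape)

# Swap in the accelerated semi-structured representation.
linear.weight = torch.nn.Parameter(to_sparse_semi_structured(pruned))

x = torch.rand(64, 1024, dtype=torch.half, device="cuda")
out = linear(x)  # dispatches to the 2:4 sparse kernels
```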

benchmarks/benchmark_sam.py

-137
This file was deleted.

scripts/sam/.gitignore

+3
@@ -0,0 +1,3 @@
+tmp
+checkpoints
+datasets

scripts/sam/README.md

+21
@@ -0,0 +1,21 @@
+# Benchmarking instructions:
+
+Set up your environment with:
+```
+conda create -n saf-ao python=3.10
+conda activate saf-ao
+pip3 install --pre torch torchvision torchaudio --index-url https://download.pytorch.org/whl/nightly/cu124
+pip3 install git+https://github.com/pytorch-labs/segment-anything-fast.git
+pip3 install tqdm fire pandas
+cd ../.. && python setup.py install
+```
+
+Then download data and models by running
+```
+sh setup.sh
+```
+
+Finally, you can run the benchmarks with
+```
+sh benchmark.sh
+```
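Before kicking off a full run, it can be worth sanity-checking the environment. A small optional snippet (not part of the repo), assuming a CUDA machine since the numbers above were collected on an A100:

```python
# Optional environment sanity check (not part of the repo).
import torch
import torchao  # installed above via `python setup.py install`

assert torch.cuda.is_available(), "these benchmarks expect a CUDA GPU"
print(torch.__version__, torch.cuda.get_device_name(0))

# 2:4 semi-structured sparsity ships with recent PyTorch builds;
# this import fails on versions that are too old.
from torch.sparse import to_sparse_semi_structured  # noqa: F401
```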

scripts/sam/benchmark.sh

+11
@@ -0,0 +1,11 @@
+# baseline
+python eval_combo.py --coco_root_dir datasets/coco2017 --coco_slice_name val2017 --sam_checkpoint_base_path checkpoints --sam_model_type vit_h --point_sampling_cache_dir tmp/sam_coco_mask_center_cache --mask_debug_out_dir tmp/sam_eval_masks_out --batch_size 32 --num_workers 32 --use_compile max-autotune --use_half bfloat16 --device cuda --print_header True
+# int8 dynamic quant (all)
+python eval_combo.py --coco_root_dir datasets/coco2017 --coco_slice_name val2017 --sam_checkpoint_base_path checkpoints --sam_model_type vit_h --point_sampling_cache_dir tmp/sam_coco_mask_center_cache --mask_debug_out_dir tmp/sam_eval_masks_out --batch_size 32 --num_workers 32 --use_compile max-autotune --use_half bfloat16 --device cuda --compress int8_dynamic_quant
+# 2:4 sparsity (mlp only)
+python eval_combo.py --coco_root_dir datasets/coco2017 --coco_slice_name val2017 --sam_checkpoint_base_path checkpoints --sam_model_type vit_h --point_sampling_cache_dir tmp/sam_coco_mask_center_cache --mask_debug_out_dir tmp/sam_eval_masks_out --batch_size 32 --num_workers 32 --use_compile max-autotune --use_half bfloat16 --device cuda --compress sparse_mlp_only
+# 2:4 sparsity (all)
+python eval_combo.py --coco_root_dir datasets/coco2017 --coco_slice_name val2017 --sam_checkpoint_base_path checkpoints --sam_model_type vit_h --point_sampling_cache_dir tmp/sam_coco_mask_center_cache --mask_debug_out_dir tmp/sam_eval_masks_out --batch_size 32 --num_workers 32 --use_compile max-autotune --use_half bfloat16 --device cuda --compress sparse
+# int8 dynamic quant + 2:4 sparsity (attn: int8, mlp lin1: int8 + 2:4 fused, mlp lin2: 2:4 sparse)
+python eval_combo.py --coco_root_dir datasets/coco2017 --coco_slice_name val2017 --sam_checkpoint_base_path checkpoints --sam_model_type vit_h --point_sampling_cache_dir tmp/sam_coco_mask_center_cache --mask_debug_out_dir tmp/sam_eval_masks_out --batch_size 32 --num_workers 32 --use_compile max-autotune --use_half bfloat16 --device cuda --compress int8_dynamic_quant_sparse
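The last entry composes the two techniques per layer type: int8 dynamic quant on attention projections, int8 quant fused with 2:4 sparsity on the first MLP linear, and plain 2:4 sparsity on the second. A hedged sketch of how that routing might look; `apply_int8`, `apply_int8_24`, and `apply_24` are hypothetical stand-ins for torchao's actual transforms, and the name matching assumes segment-anything's `attn` / `mlp.lin1` / `mlp.lin2` module naming:

```python
import torch

def apply_int8(linear):     # hypothetical stand-in: int8 dynamic quant
    ...

def apply_int8_24(linear):  # stand-in: int8 dynamic quant fused with 2:4 sparsity
    ...

def apply_24(linear):       # stand-in: 2:4 semi-structured sparsity
    ...

def compress(model: torch.nn.Module) -> None:
    # Route each linear layer to the transform benchmarked by
    # --compress int8_dynamic_quant_sparse above.
    for name, module in model.named_modules():
        if not isinstance(module, torch.nn.Linear):
            continue
        if "attn" in name:
            apply_int8(module)
        elif "lin1" in name:
            apply_int8_24(module)
        elif "lin2" in name:
            apply_24(module)
```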
