Add segment-anything-fast perf/acc benchmarks to torchao #457
Conversation
🔗 Helpful Links
🧪 See artifacts and rendered test results at hud.pytorch.org/pr/pytorch/ao/457
Note: Links to docs will display an error until the docs builds have been completed.
✅ No Failures as of commit ad8b42f with merge base 5d22ad2.
This comment was automatically generated by Dr. CI and updates every 15 minutes.
| qkv | proj | lin1 | lin2 | time (ms) | memory (GB) | img/s |
| ---- | ---- | ---- | ---- | --------- | ----------- | ----- |
| None | None | None | None | 1361.73   | 15.81       | 23.50 |
These numbers are a bit higher than those reported in the new benchmark script. I'm pretty sure this is because we reuse the same example for benchmarking, not because of any changes to the code.
I grabbed the output of TORCH_LOGS=output_code for both the new and old benchmark scripts and diffed them; they look pretty much identical:
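The comparison described above can be sketched with Python's standard `difflib`; the log contents and file names below are hypothetical placeholders for the two `TORCH_LOGS=output_code` dumps:

```python
# Minimal sketch: diff two captured TORCH_LOGS=output_code dumps.
# The log strings and file names here are hypothetical stand-ins.
import difflib

old_log = "triton_poi_fused_0 = ...\ntriton_red_fused_1 = ...\n"
new_log = "triton_poi_fused_0 = ...\ntriton_red_fused_1 = ...\n"

diff = list(difflib.unified_diff(
    old_log.splitlines(), new_log.splitlines(),
    fromfile="old_benchmark.log", tofile="new_benchmark.log", lineterm="",
))
# Identical generated code produces an empty diff.
print(len(diff))
```

An empty diff confirms the compiler emitted the same kernels for both scripts, so any timing gap comes from the benchmarking harness rather than the generated code.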
```diff
@@ -20,29 +20,32 @@ More concretely, we hope to provide tutorials and APIs for both sparse kernels (

 ## Success Stories

-#### segment-anything
+#### segment-anything-fast
```
Can you update the main README as well with some of these results?
benchmarks/benchmark_sam.py (outdated)

```python
    # run_once(qkv="quant", proj="quant", lin1="quant+sparse (cusparselt)", lin2="quant+sparse (cusparselt)"),
    # run_once(qkv="quant+sparse (cutlass)", proj="quant+sparse (cutlass)", lin1="quant+sparse (cutlass)", lin2="quant+sparse (cutlass)"),
]
ALL_RUNS = [run_once(qkv="quant", proj="quant", lin1="quant+sparse (cusparselt)", lin2="sparse (cusparselt)")]
```
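One cleaner option would be to generate the run list instead of hand-writing each combination. This is only a sketch; it assumes a `run_once(qkv=..., proj=..., lin1=..., lin2=...)` entry point like the script's, and builds the kwargs for every combination of per-layer techniques:

```python
# Hypothetical sketch: enumerate benchmark configurations with
# itertools.product instead of listing each run_once(...) call by hand.
import itertools

# Per-layer technique options (a subset of those used in the script).
OPTIONS = [None, "quant", "sparse (cusparselt)", "quant+sparse (cusparselt)"]

def make_runs(options):
    """Return one kwargs dict per (qkv, proj, lin1, lin2) combination."""
    return [
        dict(qkv=q, proj=p, lin1=l1, lin2=l2)
        for q, p, l1, l2 in itertools.product(options, repeat=4)
    ]

runs = make_runs(OPTIONS)
# 4 options across 4 layers -> 4**4 = 256 configurations,
# each of which could be passed as run_once(**kwargs).
```

In practice you would likely filter this grid down to the handful of interesting combinations rather than sweep all 256.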
could be cleaner?
Let me just remove this file; it's unnecessary now that we have the SAF eval code. I just left it in the PR so I could pull torch logs.
some minor nits
@pytorchbot merge

Merge started. Your change will be merged once all checks pass (ETA 0-4 hours). Learn more about merging in the wiki. Questions? Feedback? Please reach out to the PyTorch DevX Team.
This PR adds in segment-anything-fast evaluation to torchao, and also adds benchmarks for int8 quantization + 2:4 sparsity. With this we can run combined perf/accuracy benchmarks for segment-anything. This should give us a starting point for the relative perf vs relative acc graph for PTC.

| Model Type | Technique | img/s | memory (MiB) | mIoU | relative speedup | relative accuracy |
|------------|------------------------------------------------------------------------------------------------------|-------|--------------|--------|------------------|-------------------|
| ViT-h | baseline (bfloat16, max-autotune) | 22.75 | 15172 | 0.5811 | | |
| | int8 dynamic quant (attn + mlp) | 24.91 | 15154 | 0.5822 | **1.09x** | **100.19%** |
| | 2:4 sparsity (mlp only) | 24.81 | 15632 | 0.5672 | **1.10x** | **97.61%** |
| | 2:4 sparsity (attn + mlp) | 24.30 | 13429 | 0.5306 | **1.07x** | **91.31%** |
| | int8 dynamic quant (attn)<br>int8 dynamic quant + 2:4 sparsity (mlp lin1)<br>2:4 sparsity (mlp lin2) | 26.46 | 14865 | 0.5668 | **1.16x** | **97.54%** |

This just copies over the evaluation scripts. Eventually I think we should move over the modeling code too, but plan to do that in a subsequent PR.
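For reference, the relative columns in the table above can be reproduced from the raw throughput (img/s) and mIoU numbers, using the ViT-h baseline row as the reference point. The helper name below is hypothetical:

```python
# Sketch: derive "relative speedup" and "relative accuracy" from raw
# img/s and mIoU, relative to the baseline (bfloat16, max-autotune) row.
def relative_metrics(imgs_per_s, miou, base_imgs_per_s=22.75, base_miou=0.5811):
    """Return (speedup vs baseline, accuracy as % of baseline mIoU)."""
    speedup = imgs_per_s / base_imgs_per_s
    rel_acc = 100 * miou / base_miou
    return round(speedup, 2), round(rel_acc, 2)

# int8 dynamic quant (attn + mlp) row from the table:
print(relative_metrics(24.91, 0.5822))  # (1.09, 100.19)
```

Note that 2:4 sparsity on attn + mlp trades the largest accuracy drop (91.31%) for a modest speedup, while the mixed quant + sparsity configuration recovers most of the accuracy at the best throughput.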
* scrub & reformat code
* use full paths
* set tiktoken init to False, not None, to align with new tokenizer chatting logic