Use cutlass kernels by varun-sundar-rabindranath · Pull Request #202 · neuralmagic/nm-vllm

varun-sundar-rabindranath · 2024-04-22T21:13:11Z

Description:
Cutlass integration.

Use cutlass gemm with epilogue fusion for dequantization
Remove all existing dequant kernels and interface
Remove cublas i8gemm files
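
For readers unfamiliar with the term, here is a minimal pure-Python sketch of what "epilogue fusion for dequantization" means. It is not the code in this PR: the function names are hypothetical, and it assumes a single per-tensor scale for each operand, which may differ from the scheme the kernels actually use. The unfused path writes an intermediate integer result and rescales it in a second pass; the fused path rescales each accumulator just before it is stored.

```python
# Reference sketch only: int8 GEMM with dequantization fused into the epilogue.
# Assumption: one per-tensor scale per operand (scale_a, scale_b); hypothetical names.

def int8_gemm_int32(a, b):
    """Multiply integer matrices, accumulating in Python ints (stand-in for int32)."""
    rows, inner, cols = len(a), len(b), len(b[0])
    return [[sum(a[i][k] * b[k][j] for k in range(inner)) for j in range(cols)]
            for i in range(rows)]

def dequant_unfused(a, b, scale_a, scale_b):
    """Two passes: the GEMM stores an integer matrix, then a separate step rescales it."""
    acc = int8_gemm_int32(a, b)  # pass 1: intermediate integer result
    return [[v * scale_a * scale_b for v in row] for row in acc]  # pass 2: dequantize

def dequant_fused(a, b, scale_a, scale_b):
    """One pass: each accumulator is rescaled right before it is written out."""
    rows, inner, cols = len(a), len(b), len(b[0])
    out = [[0.0] * cols for _ in range(rows)]
    for i in range(rows):
        for j in range(cols):
            acc = 0
            for k in range(inner):
                acc += a[i][k] * b[k][j]
            out[i][j] = acc * scale_a * scale_b  # epilogue: dequantize on store
    return out
```

Both functions return the same values; the fused form simply avoids materializing the intermediate integer matrix and the extra pass over it, which is the motivation for removing the standalone dequant kernels.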

Test:
Run examples/offline_quantized_inference.py

(vllm-test) varun@floppy-fan:~/code/neuralmagic-vllm (vllm-quantization-cutlass) $ python3 ./examples/offline_quantized_inference.py 
...
Prompt: 'Hello, my name is', Generated text: ' John and I am a recovering workaholic.\nI used to work all the time'
Prompt: 'The president of the United States is', Generated text: ' the head of state and head of government of the United States of America. The president leads the executive'
Prompt: 'The capital of France is', Generated text: ' Paris.\nThe capital of France is Paris.'
Prompt: 'The future of AI is', Generated text: ' here, and it’s more accessible than ever.\nThe future of AI is here,'

Profiling results:
Prefill 512 tokens, branch: this branch, dtype: "torch.float", model: quantized model - results

Note that this branch performs better than the previous best, the w8a8 upstream PR with custom fused kernels.

mgoin

very nice - make sure to run format.sh though since some formatting bits seem to be off

vllm/model_executor/layers/quantization/smoothquant/cutlass_gemm.py

Varun Sundar Rabindranath added 5 commits April 22, 2024 14:17

cutlass fused dq

61b828f

move cutlass stuff to cutlass_gemm.py

cad6b63

Use only cutlass gemm and dq

fbbca78

Remove dequant and cublas i8gemm

7eaa73b

add cutlass to requirements

7c5c278

varun-sundar-rabindranath requested review from dsikka, robertgshaw2-redhat and tlrmchlsmth April 22, 2024 21:26

fix cutlass package naming

ce47438

robertgshaw2-redhat approved these changes Apr 22, 2024

mgoin approved these changes Apr 22, 2024

tlrmchlsmth approved these changes Apr 23, 2024

Remove log_str that triggers device to host transfer

3009dce

varun-sundar-rabindranath merged commit cc08dc4 into vllm-quantization Apr 24, 2024

varun-sundar-rabindranath deleted the vllm-quantization-cutlass branch April 24, 2024 19:19

varun-sundar-rabindranath merged 7 commits into vllm-quantization from vllm-quantization-cutlass
