This repository was archived by the owner on Oct 11, 2024. It is now read-only.

Use cutlass kernels #202

Merged
varun-sundar-rabindranath merged 7 commits into vllm-quantization from vllm-quantization-cutlass on Apr 24, 2024

varun-sundar-rabindranath commented Apr 22, 2024

Description:
CUTLASS integration.

  • Use the CUTLASS GEMM with epilogue fusion for dequantization
  • Remove all existing dequant kernels and their interface
  • Remove the cuBLAS i8gemm files
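
For readers unfamiliar with the term, here is a minimal pure-Python sketch of what "epilogue fusion for dequantization" means. It is reference semantics only, not the CUTLASS kernel or this PR's interface. The function names are hypothetical, and the per-tensor scales `scale_a` and `scale_b` are an assumption, since the description does not state the scale granularity.

```python
# Illustrative reference semantics only. Function names and per-tensor
# scales are assumptions for this sketch, not the PR's actual API.

def gemm_i8(a_q, b_q):
    """Integer GEMM: int8 inputs, wide integer accumulators, no scaling."""
    m, k, n = len(a_q), len(b_q), len(b_q[0])
    return [[sum(a_q[i][p] * b_q[p][j] for p in range(k)) for j in range(n)]
            for i in range(m)]


def dequant(acc, scale_a, scale_b):
    """Unfused path: a second pass over the accumulator matrix."""
    scale = scale_a * scale_b
    return [[v * scale for v in row] for row in acc]


def gemm_i8_fused_dequant(a_q, b_q, scale_a, scale_b):
    """Fused path: each accumulator is scaled in the GEMM epilogue,
    so the integer result is never written out as a separate matrix."""
    m, k, n = len(a_q), len(b_q), len(b_q[0])
    scale = scale_a * scale_b
    out = [[0.0] * n for _ in range(m)]
    for i in range(m):
        for j in range(n):
            acc = 0
            for p in range(k):
                acc += a_q[i][p] * b_q[p][j]
            out[i][j] = acc * scale  # epilogue: dequantize before the store
    return out


a_q = [[1, -2], [3, 4]]    # quantized activations (int8 range)
b_q = [[5, 6], [7, -8]]    # quantized weights (int8 range)
unfused = dequant(gemm_i8(a_q, b_q), 0.5, 0.25)
fused = gemm_i8_fused_dequant(a_q, b_q, 0.5, 0.25)
```

Both paths produce the same values. The fused version avoids the extra pass over the output, which is the motivation for removing the standalone dequant kernels.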

Test:
Run examples/offline_quantized_inference.py

(vllm-test) varun@floppy-fan:~/code/neuralmagic-vllm (vllm-quantization-cutlass) $ python3 ./examples/offline_quantized_inference.py 
...
Prompt: 'Hello, my name is', Generated text: ' John and I am a recovering workaholic.\nI used to work all the time'
Prompt: 'The president of the United States is', Generated text: ' the head of state and head of government of the United States of America. The president leads the executive'
Prompt: 'The capital of France is', Generated text: ' Paris.\nThe capital of France is Paris.'
Prompt: 'The future of AI is', Generated text: ' here, and it’s more accessible than ever.\nThe future of AI is here,'

Profiling results:
Prefill 512 tokens, Branch: This branch, dtype: "torch.float", model: Quantized model - results

Note that this branch performs better than the previous best (the w8a8 upstream PR with custom fused kernels).

mgoin (Member) left a comment

Very nice. Make sure to run format.sh, though, since some formatting bits seem to be off.

varun-sundar-rabindranath merged commit cc08dc4 into vllm-quantization Apr 24, 2024
varun-sundar-rabindranath deleted the vllm-quantization-cutlass branch April 24, 2024 19:19