Quantization, fp acceleration, and testing (pytorch#572)
* code beautification

* code beautification, move functions together

* make --device fast the default (pytorch#515)

* make --device fast the default

* Update iOS.md (pytorch#517)

* Update iOS.md

* Update iOS.md

* Pip to pip3 (pytorch#504)

* remove macos-12 test

* pip to pip3

* break aoti CI jobs separately (pytorch#500)

* init

* fixes

* more fixes

* fixes

* fix

* fix

* bug fix

* add objcopy update

* suppress int8

* undefined variable

---------

Co-authored-by: Michael Gschwind <[email protected]>

* Support llama3 in chat in run.cpp (pytorch#486)

* refactor chat runner in preparation for llama3

* add sketch for llama3 prompt template and move to returning tokens

* fix tiktoken

* fixes to chat

* add default llama_ver

* Add tests for quantize json, add cuda device specification and precision to cuda.json (pytorch#519)

* remove code for no KV Cache path (pytorch#527)

* Update ADVANCED-USERS.md (pytorch#529)

Update Advanced Users description to reflect changes in the repo since the description was initially created.

* runner-aoti on cuda (pytorch#531)

* runner-aoti on cuda

* transfer results back to CPU

* transfer results back to CPU

* runner-aoti on cuda

* Update runner_build.md (pytorch#530)

Update description of runner and build process in runner_build.md

* clean up runner code a little (pytorch#532)

* clean up runner code a little

* update

* update

* pull out generate loop in chat

* updates

* edit docs

* typo

* move int8 linear class and function into qops.py (pytorch#534)

* add dtype tests for runner-aoti + runner-et (pytorch#539)

* add dtype tests for runner-aoti + runner-et

* typo

* Quantized embedding (pytorch#536)

* move int8 linear class and function into qops.py

* move Quantized Embedding to qops.py

* Move Linear int4 to qops (pytorch#537)

* move int8 linear class and function into qops.py

* move Quantized Embedding to qops.py

* move int4 linear to qops

* Revert "add dtype tests for runner-aoti + runner-et (pytorch#539)" (pytorch#548)

This reverts commit a7a24577a65be67ac9ae4dc05452f35d9c49e5d1.

* fix generate for llama3 (pytorch#538)

* fix generate for llama3

* switch more things to C

* remove C++ header

* add delegation visualization instructions (pytorch#551)

* Add dtype runner aoti (pytorch#552)

* add dtype tests for runner-aoti + runner-et

* typo

* add dtype test runner-aoti

* test sdpa with fp16 (pytorch#553)

* test sdpa with fp16

* kv cache fp32

* typo

* update (pytorch#560)

* Only support newest versions of lm-eval (pytorch#556)

Summary:
remove support for lm-eval 0.3 to reduce the options we have

Test Plan:
CI

Reviewers:

Subscribers:

Tasks:

Tags:

* split cpu eval CI by dtype (pytorch#554)

* split cpu eval CI by dtype

* fix

* differentiate names with checks

* keep one name the same as old

* fix

* Removing duplicate HF issue message from README (pytorch#559)

Co-authored-by: Michael Gschwind <[email protected]>

* doc updates (pytorch#567)

* Add VM-safe MPS check

---------

Co-authored-by: Anthony Shoumikhin <[email protected]>
Co-authored-by: metascroy <[email protected]>
Co-authored-by: Nikita Shulga <[email protected]>
Co-authored-by: lucylq <[email protected]>
Co-authored-by: Jerry Zhang <[email protected]>
Co-authored-by: Jack-Khuu <[email protected]>

* add unpacking support (pytorch#525)

* add unpacking support

* fix typos and linter

* perform parallel prefill when possible (pytorch#568)

* perform parallel prefill when possible

* typo

* disable hack

* remove print

* remove debug messages which prevent export

* fixes

* stream results in generate.py (pytorch#571)

* remove logging interfering with export

---------

Co-authored-by: Anthony Shoumikhin <[email protected]>
Co-authored-by: metascroy <[email protected]>
Co-authored-by: Nikita Shulga <[email protected]>
Co-authored-by: lucylq <[email protected]>
Co-authored-by: Jerry Zhang <[email protected]>
Co-authored-by: Jack-Khuu <[email protected]>
7 people committed Jul 17, 2024
1 parent e677fe8 commit c980472
Showing 7 changed files with 134 additions and 23 deletions.
89 changes: 89 additions & 0 deletions .github/workflows/more-tests.yml
@@ -0,0 +1,89 @@
name: Run parallel prefill

on:
pull_request:
push:
branches:
- main
workflow_dispatch:

jobs:
test-cuda:
uses: pytorch/test-infra/.github/workflows/linux_job.yml@main
with:
runner: linux.g5.4xlarge.nvidia.gpu
gpu-arch-type: cuda
gpu-arch-version: "12.1"
script: |
echo "::group::Print machine info"
uname -a
echo "::endgroup::"
echo "::group::Install newer objcopy that supports --set-section-alignment"
yum install -y devtoolset-10-binutils
export PATH=/opt/rh/devtoolset-10/root/usr/bin/:$PATH
echo "::endgroup::"
echo "::group::Download checkpoints"
# Install requirements
pip3 install --pre torch --index-url https://download.pytorch.org/whl/nightly/cu121
pip3 install -r requirements.txt
pip3 list
python3 -c 'import torch;print(f"torch: {torch.__version__, torch.version.git_version}")'
echo "::endgroup::"
echo "::group::Download checkpoints"
mkdir -p checkpoints/stories15M
pushd checkpoints/stories15M
wget https://huggingface.co/karpathy/tinyllamas/resolve/main/stories15M.pt
wget https://github.com/karpathy/llama2.c/raw/master/tokenizer.model
popd
echo "::endgroup::"
echo "::group::Run inference"
export MODEL_PATH=checkpoints/stories15M/stories15M.pt
export MODEL_NAME=stories15M
export MODEL_DIR=/tmp
for DTYPE in bfloat16 float16 float32; do
###################################################################
# group with different temperatures
python generate.py --checkpoint-path ${MODEL_PATH} --device cpu --dtype ${DTYPE} --temperature 0
python generate.py --checkpoint-path ${MODEL_PATH} --device cpu --dtype ${DTYPE} --temperature 0.9
python generate.py --checkpoint-path ${MODEL_PATH} --device cpu --dtype ${DTYPE} --temperature 1.0
python generate.py --checkpoint-path ${MODEL_PATH} --device cpu --dtype ${DTYPE} --top-k 100
python generate.py --checkpoint-path ${MODEL_PATH} --device cpu --dtype ${DTYPE} --top-k 200
python generate.py --checkpoint-path ${MODEL_PATH} --device cpu --dtype ${DTYPE} --top-k 500
###################################################################
# group with different temperatures and prefill, and compile
# and prefill compile
python generate.py --checkpoint-path ${MODEL_PATH} --device cpu --dtype ${DTYPE} --temperature 0 --compile --compile-prefill
python generate.py --checkpoint-path ${MODEL_PATH} --device cpu --dtype ${DTYPE} --temperature 0.9 --compile --compile-prefill
python generate.py --checkpoint-path ${MODEL_PATH} --device cpu --dtype ${DTYPE} --temperature 1.0 --compile --compile-prefill
python generate.py --checkpoint-path ${MODEL_PATH} --device cpu --dtype ${DTYPE} --top-k 100 --compile --compile-prefill
python generate.py --checkpoint-path ${MODEL_PATH} --device cpu --dtype ${DTYPE} --top-k 200 --compile --compile-prefill
python generate.py --checkpoint-path ${MODEL_PATH} --device cpu --dtype ${DTYPE} --top-k 500 --compile --compile-prefill
###################################################################
# group with different temperatures and sequential prefill
python generate.py --checkpoint-path ${MODEL_PATH} --device cpu --dtype ${DTYPE} --temperature 0 --sequential-prefill
python generate.py --checkpoint-path ${MODEL_PATH} --device cpu --dtype ${DTYPE} --temperature 0.9 --sequential-prefill
python generate.py --checkpoint-path ${MODEL_PATH} --device cpu --dtype ${DTYPE} --temperature 1.0 --sequential-prefill
python generate.py --checkpoint-path ${MODEL_PATH} --device cpu --dtype ${DTYPE} --top-k 100 --sequential-prefill
python generate.py --checkpoint-path ${MODEL_PATH} --device cpu --dtype ${DTYPE} --top-k 200 --sequential-prefill
python generate.py --checkpoint-path ${MODEL_PATH} --device cpu --dtype ${DTYPE} --top-k 500 --sequential-prefill
###################################################################
# group with different temperatures and prefill, and compile
python generate.py --checkpoint-path ${MODEL_PATH} --device cpu --dtype ${DTYPE} --temperature 0 --sequential-prefill --compile
python generate.py --checkpoint-path ${MODEL_PATH} --device cpu --dtype ${DTYPE} --temperature 0.9 --sequential-prefill --compile
python generate.py --checkpoint-path ${MODEL_PATH} --device cpu --dtype ${DTYPE} --temperature 1.0 --sequential-prefill --compile
python generate.py --checkpoint-path ${MODEL_PATH} --device cpu --dtype ${DTYPE} --top-k 100 --sequential-prefill --compile
python generate.py --checkpoint-path ${MODEL_PATH} --device cpu --dtype ${DTYPE} --top-k 200 --sequential-prefill --compile
python generate.py --checkpoint-path ${MODEL_PATH} --device cpu --dtype ${DTYPE} --top-k 500 --sequential-prefill --compile
done
echo "tests complete"
echo "******************************************"
echo "::endgroup::"
10 changes: 9 additions & 1 deletion build/builder.py
@@ -117,6 +117,14 @@ def from_args(cls, args): # -> BuilderArgs:
if "chat" in path_basename or "instruct" in path_basename:
is_chat_model = True

if args.output_pte_path and args.dtype.startswith("fast"):
if args.dtype == "fast":
dtype = torch.float32
else:
dtype = torch.float16
else:
dtype = name_to_dtype(args.dtype)

return cls(
checkpoint_dir=checkpoint_dir,
checkpoint_path=checkpoint_path,
@@ -127,7 +135,7 @@ def from_args(cls, args): # -> BuilderArgs:
dso_path=args.dso_path,
pte_path=args.pte_path,
device=args.device,
precision=name_to_dtype(args.dtype),
precision=dtype,
setup_caches=(args.output_dso_path or args.output_pte_path),
use_tp=False,
is_chat_model=is_chat_model,
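
The new branch above gives ExecuTorch exports special handling: when an output .pte path is set and the requested dtype starts with "fast", precision is pinned to torch.float32 ("fast") or torch.float16 ("fast16") rather than resolved by name_to_dtype, which is platform dependent. A standalone sketch of that precedence follows — resolve_precision is a hypothetical name used only for illustration, and the import path is assumed from this repo's layout:

from typing import Optional

import torch

from build.utils import name_to_dtype  # assumed import path (build/utils.py)


def resolve_precision(dtype_name: str, output_pte_path: Optional[str]) -> torch.dtype:
    """Hypothetical helper mirroring the BuilderArgs.from_args logic above."""
    if output_pte_path and dtype_name.startswith("fast"):
        # ExecuTorch (.pte) export: avoid the platform-dependent "fast" resolution.
        return torch.float32 if dtype_name == "fast" else torch.float16
    return name_to_dtype(dtype_name)


# resolve_precision("fast", "model.pte") -> torch.float32
# resolve_precision("fast", None)        -> torch.float16 on Arm, torch.bfloat16 elsewhere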
15 changes: 14 additions & 1 deletion build/utils.py
@@ -130,7 +130,17 @@ def get_precision():

##########################################################################
### dtype name to torch.dtype mapping ###


def name_to_dtype(name):
if (name == "fast") or (name == "fast16"):
import platform

if platform.processor() == "arm":
return torch.float16
else:
return torch.bfloat16

if name in name_to_dtype_dict:
return name_to_dtype_dict[name]
else:
@@ -150,6 +160,8 @@ def allowable_dtype_names() -> List[str]:
"float32": torch.float,
"float16": torch.float16,
"bfloat16": torch.bfloat16,
"fast": None,
"fast16": None,
}


@@ -208,6 +220,7 @@ def state_dict_device(d, device="cpu") -> Dict:
#########################################################################
### move state dict to specified device ###


def is_mps_available() -> bool:
if not torch.backends.mps.is_available():
return False
@@ -219,7 +232,7 @@ def is_mps_available() -> bool:
except:
return False

# MPS, is that you?
# MPS, is that you?
return True


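
Two notes on this hunk. First, name_to_dtype("fast") now resolves to torch.float16 on Arm and torch.bfloat16 elsewhere, which is what makes --dtype fast (the new CLI default below) portable across machines. Second, the hunk shows only the edges of is_mps_available(); the collapsed middle is the "VM-safe MPS check" mentioned in the commit message. A plausible reconstruction follows — the body of the try block is an assumption (a tiny allocation on the mps device), not the repo's exact code:

import torch


def is_mps_available() -> bool:
    """Sketch of a VM-safe MPS probe; the try body is assumed, not copied from the repo."""
    if not torch.backends.mps.is_available():
        return False
    try:
        # Inside macOS VMs the backend can report "available" yet fail on first use,
        # so actually touch the device before trusting it.
        torch.zeros(1, device="mps")
    except Exception:
        return False

    # MPS, is that you?
    return True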
5 changes: 2 additions & 3 deletions cli.py
@@ -210,11 +210,10 @@ def _add_arguments_common(parser):
help="Use the specified ExecuTorch .pte model file",
)
parser.add_argument(
"-d",
"--dtype",
default="float32",
default="fast",
choices=allowable_dtype_names(),
help="Override the dtype of the model (default is the checkpoint dtype). Options: bf16, fp16, fp32",
help="Override the dtype of the model (default is the checkpoint dtype). Options: bf16, fp16, fp32, fast16, fast",
)
parser.add_argument(
"-v",
2 changes: 1 addition & 1 deletion generate.py
@@ -172,7 +172,7 @@ def prefill(
sequential_prefill=True,
**sampling_kwargs,
) -> torch.Tensor:
logging.debug(f"x: {x}, input_pos: {input_pos}")
# logging.debug(f"x: {x}, input_pos: {input_pos}")
width = x.size(1)
assert input_pos.size(0) == width

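
For context on the sequential_prefill flag visible in this hunk: sequential prefill feeds the prompt one token per forward pass, filling the KV cache position by position, while parallel prefill — what pytorch#568 enables when possible — runs a single forward pass over the whole prompt. The sketch below illustrates the difference in the abstract; model and its calling convention are stand-ins, not this repo's exact signatures:

import torch


def prefill_sketch(model, prompt: torch.Tensor, sequential_prefill: bool) -> torch.Tensor:
    """Conceptual sketch: return logits for the position following the prompt."""
    width = prompt.size(0)
    input_pos = torch.arange(width)
    if sequential_prefill:
        # One token per step: `width` forward passes, each writing one KV-cache entry.
        for i in range(width):
            logits = model(prompt[i].view(1, 1), input_pos[i].view(-1))
    else:
        # Parallel prefill: a single forward pass over all prompt positions.
        logits = model(prompt.view(1, -1), input_pos)
    return logits[:, -1]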
18 changes: 18 additions & 0 deletions qops.py
@@ -305,3 +305,21 @@ def forward(self, input: torch.Tensor) -> torch.Tensor:
@classmethod
def _check_k(cls, *, k, groupsize=1, inner_k_tiles=1):
return k % groupsize == 0 and k % (inner_k_tiles * 16) == 0

@classmethod
def _prepare_weight_and_scales_and_zeros(
cls, weight_bf16, groupsize, inner_k_tiles
):
from quantize import group_quantize_tensor

weight_int32, scales_and_zeros = group_quantize_tensor(
weight_bf16, n_bit=4, groupsize=groupsize
)
weight_int4pack = torch.ops.aten._convert_weight_to_int4pack(
weight_int32, inner_k_tiles
)
return weight_int4pack, scales_and_zeros

@classmethod
def _calc_padded_size(cls, *, k, groupsize=1, innner_k_tiles=1):
return find_multiple(k, 1024)
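
For readers unfamiliar with the packing step being moved here: group-wise 4-bit weight quantization splits each row of the weight matrix into groups of groupsize consecutive values and stores one scale and zero point per group, so each weight costs 4 bits plus a small per-group overhead. The sketch below shows the general asymmetric-quantization arithmetic; it is illustrative only and is not the repo's group_quantize_tensor implementation:

import torch


def groupwise_int4_quantize(weight: torch.Tensor, groupsize: int):
    """Illustrative group-wise asymmetric 4-bit quantization (not the repo's exact code)."""
    out_features, in_features = weight.shape
    assert in_features % groupsize == 0
    w = weight.reshape(out_features, in_features // groupsize, groupsize)

    w_min = w.amin(dim=-1, keepdim=True)
    w_max = w.amax(dim=-1, keepdim=True)
    scale = (w_max - w_min).clamp(min=1e-6) / 15.0       # 4 bits -> 16 levels (0..15)
    zero_point = (-w_min / scale).round().clamp(0, 15)

    q = (w / scale + zero_point).round().clamp(0, 15).to(torch.uint8)
    return q.reshape(out_features, in_features), scale, zero_point


# Dequantize a group with: w_hat = (q - zero_point) * scale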
18 changes: 1 addition & 17 deletions quantize.py
@@ -595,22 +595,6 @@ def quantized_model(self) -> nn.Module:
##### weight only int4 per channel groupwise quantized code ######


def _int4_prepare_int4_weight_and_scales_and_zeros(
weight_bf16, groupsize, inner_k_tiles
):
weight_int32, scales_and_zeros = group_quantize_tensor(
weight_bf16, n_bit=4, groupsize=groupsize
)
weight_int4pack = torch.ops.aten._convert_weight_to_int4pack(
weight_int32, inner_k_tiles
)
return weight_int4pack, scales_and_zeros


def _int4_calc_padded_size(k, groupsize=1, innner_k_tiles=1):
return find_multiple(k, 1024)


def replace_linear_int4(
module,
device,
@@ -705,7 +689,7 @@ def create_quantized_state_dict(self):
)
continue
weight_int4pack, scales_and_zeros = (
_int4_prepare_int4_weight_and_scales_and_zeros(
WeightOnlyInt4Linear._prepare_weight_and_scales_and_zeros(
weight.to(torch.float), self.groupsize, self.inner_k_tiles
)
)
