Mark scale as const and remove --fp8 flag usage #156
  set_seed(args.seed)
  get_repo_root(args.model_name_or_path, local_rank=args.local_rank, token=args.token)
  use_deepspeed = args.world_size > 0
- if use_deepspeed or args.bf16 or args.fp8:
This changes the behavior from what we had before. Was it tested on 7B? (It seems like the only case where this can be false.)
We always set --bf16, even in fp8 runs, so it should work.
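To make the point of this exchange concrete, a small truth table shows where the two conditions disagree. This is only a sketch: it assumes the new condition is the old one with `args.fp8` dropped (the new line is not shown in the hunk), and `old_cond` / `new_cond` are hypothetical helper names used for illustration.

```python
from itertools import product


def old_cond(use_deepspeed, bf16, fp8):
    # Condition as shown in the hunk above.
    return use_deepspeed or bf16 or fp8


def new_cond(use_deepspeed, bf16, fp8):
    # Assumed new condition: same check without the fp8 flag.
    return use_deepspeed or bf16


# Collect every flag combination where the two conditions give different results.
diffs = [
    combo
    for combo in product([False, True], repeat=3)
    if old_cond(*combo) != new_cond(*combo)
]
print(diffs)  # [(False, False, True)]
```

The only differing case is fp8 set without deepspeed and without bf16, which, per the reply above, does not occur because --bf16 is always passed.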
  import habana_frameworks.torch.core as htcore

- if args.fp8:
+ if args.quant_config:
This is also true when performing a measurement. Was that tested?
  if args.const_serialization_path:
      setup_const_serialization(args.const_serialization_path)
- if args.fp8:
+ if args.quant_config:
      const_marking = os.getenv("ENABLE_CONST_MARKING", "True")
      if const_marking == "True":
-         htcore.hpu_initialize(model)
+         htcore.hpu_initialize(model, mark_only_scales_as_const=True)
If we mark only scales as const, we may (in theory) lose some constant-folding optimizations.
For example, we may lose the transpose on constant weights; I think this happens in SDXL.
So did you test this for performance?
@ulivne It was tested on Llama 70B and 7B.
The transpose on weights now happens in the HQT patched module init, so it should be fine.
@@ -102,7 +102,7 @@ def setup_inference(args, model):
     print("Initializing inference mode")
     const_marking = os.getenv("ENABLE_CONST_MARKING", "True")
@Yantom1 You should remove it; we should not use this variable anymore. We should always call hpu_initialize in this function. Since QA sets this flag to false, did you test it correctly?
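A minimal sketch of what this suggestion could look like, assuming the `mark_only_scales_as_const=True` call from the earlier hunk is kept. `FakeHtcore` is a hypothetical stand-in so the snippet runs without Habana hardware (the real code imports `habana_frameworks.torch.core as htcore`), the extra `htcore` parameter is an assumption made only for testability, and the rest of the function body is omitted.

```python
class FakeHtcore:
    """Hypothetical stand-in for habana_frameworks.torch.core; it only records calls."""

    def __init__(self):
        self.calls = []

    def hpu_initialize(self, model, mark_only_scales_as_const=False):
        # The default value here is a placeholder, not the real library default.
        self.calls.append((model, mark_only_scales_as_const))


def setup_inference(args, model, htcore):
    # `htcore` is injected only to keep this sketch runnable (assumption);
    # the real function imports the module instead.
    print("Initializing inference mode")
    # No ENABLE_CONST_MARKING lookup: initialization is unconditional.
    htcore.hpu_initialize(model, mark_only_scales_as_const=True)


fake = FakeHtcore()
setup_inference(args=None, model="dummy-model", htcore=fake)
```

With the environment lookup gone, a test setup that exports ENABLE_CONST_MARKING=False no longer changes the code path, which is the concern raised about the QA runs.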
Change-Id: I6dba8691d842fc62d09da5202ea1e61a111f5f18
* Mark only scales as const
* remove --fp8 flag usage from llama
* removed usage of ENABLE_CONST_MARKING

Change-Id: I6dba8691d842fc62d09da5202ea1e61a111f5f18

---------

Co-authored-by: Eran Geva <egeva@habana.ai>
* Add memory, graph stats
* fix import formatting issues
* sort imports
* sort imports