Add rouge metric evalution for llama 70B with orca datasets#1068
Add rouge metric evalution for llama 70B with orca datasets#1068sushildubey171 wants to merge 2 commits into
Conversation
* Add rouge metric evalution for llama 70B with orca datasets use rouge metric to evaluate the corretness of the model, it uses openorca dataset
| def get_args(): | ||
| parser = argparse.ArgumentParser() | ||
| parser.add_argument("--checkpoint-path", default="/mnt/weka/data/pytorch/llama2/Llama-2-70b-chat-hf", | ||
| help="Path to Llama2-70b-hf-chat checkpoint") |
There was a problem hiding this comment.
please remove the default path as current. and put None
There was a problem hiding this comment.
This argument should'nt have a default. And it should probably have: required=True. Also we can call it --model_name_or_path (because it doesnt have to be a checkpoint on disk, it could be a hugging face model that might be downloaded)
like here: https://github.com/huggingface/optimum-habana/blob/main/examples/text-generation/run_generation.py#L48
There was a problem hiding this comment.
we have used this from llama mlperf submission, and this rouge eval works only with this checkpoint, it can't be any other checkpoint. Therefore it is put as default since we have tested with only this. also it helps user to run with the correct file.
| help="Path to Llama2-70b-hf-chat checkpoint") | ||
| parser.add_argument("--accuracy-file", default="output/accuracy.json", help="path to accuracy.json") | ||
| parser.add_argument("--dataset-file", default="/mnt/weka/data/mlperf_inference/llama2/processed-data.pkl", | ||
| help="path to processed openorca validation set") |
There was a problem hiding this comment.
remove the default as this
There was a problem hiding this comment.
Maybe should specify what the file format and data contents should be, maybe as help message or atleast as a comment somewhere
There was a problem hiding this comment.
default file name is to help running with correct dataset file since it is already preprocessed, no other dataset is used here. user may end up running with incorrect or may not find the correct dataset if removed.
| ) | ||
| parser.add_argument( | ||
| "--dataset", | ||
| default="/mnt/weka/data/mlperf_inference/llama2/processed-data.pkl", |
There was a problem hiding this comment.
remove default, and put None, but do a check later
There was a problem hiding this comment.
same as above
| def generate(input_tokens, size=None, reduce_recompile=False): | ||
| """Generates sequences from the input sentences and returns them.""" | ||
|
|
||
| t0 = time.perf_counter() |
There was a problem hiding this comment.
please check how the 1st token latency is done here, and do the same https://github.com/huggingface/optimum-habana/blob/main/examples/text-generation/run_generation.py#L381
There was a problem hiding this comment.
please note this for evaluation and not for latency test still existing behaviour is retained
There was a problem hiding this comment.
This change is better suited in a separate file, like eval_orca.py.
run_generation is meant for perf eval. Its apparatus for warmup and iterating multiple times over a same sentence is not needed for an accuracy eval usecase.
Also current PR has too much code duplications, and unused code paths (dynamic prompts branches)
| def get_args(): | ||
| parser = argparse.ArgumentParser() | ||
| parser.add_argument("--checkpoint-path", default="/mnt/weka/data/pytorch/llama2/Llama-2-70b-chat-hf", | ||
| help="Path to Llama2-70b-hf-chat checkpoint") |
There was a problem hiding this comment.
This argument should'nt have a default. And it should probably have: required=True. Also we can call it --model_name_or_path (because it doesnt have to be a checkpoint on disk, it could be a hugging face model that might be downloaded)
like here: https://github.com/huggingface/optimum-habana/blob/main/examples/text-generation/run_generation.py#L48
| help="Path to Llama2-70b-hf-chat checkpoint") | ||
| parser.add_argument("--accuracy-file", default="output/accuracy.json", help="path to accuracy.json") | ||
| parser.add_argument("--dataset-file", default="/mnt/weka/data/mlperf_inference/llama2/processed-data.pkl", | ||
| help="path to processed openorca validation set") |
There was a problem hiding this comment.
Maybe should specify what the file format and data contents should be, maybe as help message or atleast as a comment somewhere
| if args.dtype == "int32": | ||
| eval_dtype = np.int32 | ||
| elif args.dtype == "float": | ||
| eval_dtype = np.float32 |
There was a problem hiding this comment.
minor:
eval_dtype = {"int32": np.int32, "float": np.float32, "int64": np.int64}[args.dtype]
| outputs[i] = outputs[i][args.max_input_tokens:] | ||
| duration = time.perf_counter() - t0 | ||
| print(f"Total E2E time of this batch is {duration:.3f}s", flush=True) | ||
| return outputs |
There was a problem hiding this comment.
This is a LOT of code duplication, which can cause errors later due to failing to remember to propagate chaneges in both branches etc.
For example, you could reuse the older "generate" function, just by adding:
def generate(input_tokens=None, size=None, reduce_recompile=False):
.....
if input_tokens is None: # ADDING THIS
# Tokenization
if args.max_input_tokens > 0:
input_tokens = tokenizer.batch_encode_plus(
input_sentences,
return_tensors="pt",
padding="max_length",
max_length=args.max_input_tokens,
truncation=True,
)
else:
input_tokens = tokenizer.batch_encode_plus(input_sentences, return_tensors="pt", padding=True)
And you could perform the output runcation outside generate:
for i in range(len(outputs)):
outputs[i] = outputs[i][args.max_input_tokens:]
There was a problem hiding this comment.
please note we need to run the measurement also, we need other functionalities from this file therefore existing behaviours are retained, if any code duplication can be handled as part of later cleanup/code refactoring.
| dyn_prompt_lens = args.simulate_dyn_prompt | ||
| t0 = time.perf_counter() | ||
| # The first three iterations take longer because of graph compilation | ||
| if dyn_prompt_lens is None or len(set(dyn_prompt_lens)) == 1: |
There was a problem hiding this comment.
There is no difference between these code and the already existing one in the else branch (except for the call to generate? If not lets not duplicate. You can just call generate differently:
generate(input_sentences[0] if args.dataset_name == "openorca" else None, dyn_prompt_lens[0], args.reduce_recompile)
There was a problem hiding this comment.
there are difference related the data loading from different datasets and invoking the generate functions on openorca datasets in the for loop
| for i in range(args.n_iterations): | ||
| results = [] | ||
| b = 1 | ||
| for sentence in input_sentences: |
There was a problem hiding this comment.
Minor: More pythonic to write:
for b, sentence in enumerate(input_sentences)
then we can remove the b=1, b+=1 lines
| t0 = time.perf_counter() | ||
| # Benchmark over n_iterations iterations | ||
| N = len(input_sentences) | ||
| if dyn_prompt_lens is None: |
There was a problem hiding this comment.
What is the purpose of the dataset here? I suppose it is for accuracy eval. Then Why warmup? Once you have gone thru the dataset once and collected the sentences for accuracy, you dont need to go over the dataset n_iter times again, as far as I understand.
There was a problem hiding this comment.
Similarly teh whole apparatus of dynamic prompts warmup etc are also probably not used for orca eval? in which case we should delete all these extraneous if-elses that dynamic prompt gives rise to
There was a problem hiding this comment.
two different datasets one for generating measurement and one for running the quantization for evaluation. Please note here number of iteration is 1 for running this evaluation, we have already specified the command for running the evaluation, I have kept to retain the n_itr argument. dynamic path is yet no tested for rouge eval, need to check with validation team else it can be clean up as part of code refactoring.
We still need warm up to avoid the compilations, in accuracy eval we are not using the dummy sentences but dataset. on dynamic prompts, there was some work going on by validation team to support. |
| import numpy as np | ||
| import json | ||
|
|
||
| ###################### Habana internal code ################################## |
use rouge metric to evaluate the corretness of the model, it uses openorca dataset