Skip to content

Add rouge metric evalution for llama 70B with orca datasets#1068

Closed
sushildubey171 wants to merge 2 commits into
huggingface:mainfrom
sushildubey171:rouge_eval_oh
Closed

Add rouge metric evalution for llama 70B with orca datasets#1068
sushildubey171 wants to merge 2 commits into
huggingface:mainfrom
sushildubey171:rouge_eval_oh

Conversation

@sushildubey171
Copy link
Copy Markdown
Contributor

  • Add rouge metric evalution for llama 70B with orca datasets

use rouge metric to evaluate the corretness of the model, it uses openorca dataset

* Add rouge metric evalution for llama 70B with orca datasets

use rouge metric to evaluate the corretness of the model, it uses
openorca dataset
def get_args():
parser = argparse.ArgumentParser()
parser.add_argument("--checkpoint-path", default="/mnt/weka/data/pytorch/llama2/Llama-2-70b-chat-hf",
help="Path to Llama2-70b-hf-chat checkpoint")
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

please remove the default path as current. and put None

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This argument should'nt have a default. And it should probably have: required=True. Also we can call it --model_name_or_path (because it doesnt have to be a checkpoint on disk, it could be a hugging face model that might be downloaded)
like here: https://github.com/huggingface/optimum-habana/blob/main/examples/text-generation/run_generation.py#L48

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

we have used this from llama mlperf submission, and this rouge eval works only with this checkpoint, it can't be any other checkpoint. Therefore it is put as default since we have tested with only this. also it helps user to run with the correct file.

help="Path to Llama2-70b-hf-chat checkpoint")
parser.add_argument("--accuracy-file", default="output/accuracy.json", help="path to accuracy.json")
parser.add_argument("--dataset-file", default="/mnt/weka/data/mlperf_inference/llama2/processed-data.pkl",
help="path to processed openorca validation set")
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

remove the default as this

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe should specify what the file format and data contents should be, maybe as help message or atleast as a comment somewhere

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

default file name is to help running with correct dataset file since it is already preprocessed, no other dataset is used here. user may end up running with incorrect or may not find the correct dataset if removed.

)
parser.add_argument(
"--dataset",
default="/mnt/weka/data/mlperf_inference/llama2/processed-data.pkl",
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

remove default, and put None, but do a check later

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

same as above

def generate(input_tokens, size=None, reduce_recompile=False):
"""Generates sequences from the input sentences and returns them."""

t0 = time.perf_counter()
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

please note this for evaluation and not for latency test still existing behaviour is retained

Copy link
Copy Markdown
Contributor

@ssarkar2 ssarkar2 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This change is better suited in a separate file, like eval_orca.py.

run_generation is meant for perf eval. Its apparatus for warmup and iterating multiple times over a same sentence is not needed for an accuracy eval usecase.

Also current PR has too much code duplications, and unused code paths (dynamic prompts branches)

def get_args():
parser = argparse.ArgumentParser()
parser.add_argument("--checkpoint-path", default="/mnt/weka/data/pytorch/llama2/Llama-2-70b-chat-hf",
help="Path to Llama2-70b-hf-chat checkpoint")
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This argument should'nt have a default. And it should probably have: required=True. Also we can call it --model_name_or_path (because it doesnt have to be a checkpoint on disk, it could be a hugging face model that might be downloaded)
like here: https://github.com/huggingface/optimum-habana/blob/main/examples/text-generation/run_generation.py#L48

help="Path to Llama2-70b-hf-chat checkpoint")
parser.add_argument("--accuracy-file", default="output/accuracy.json", help="path to accuracy.json")
parser.add_argument("--dataset-file", default="/mnt/weka/data/mlperf_inference/llama2/processed-data.pkl",
help="path to processed openorca validation set")
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe should specify what the file format and data contents should be, maybe as help message or atleast as a comment somewhere

if args.dtype == "int32":
eval_dtype = np.int32
elif args.dtype == "float":
eval_dtype = np.float32
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

minor:
eval_dtype = {"int32": np.int32, "float": np.float32, "int64": np.int64}[args.dtype]

outputs[i] = outputs[i][args.max_input_tokens:]
duration = time.perf_counter() - t0
print(f"Total E2E time of this batch is {duration:.3f}s", flush=True)
return outputs
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is a LOT of code duplication, which can cause errors later due to failing to remember to propagate chaneges in both branches etc.

For example, you could reuse the older "generate" function, just by adding:

def generate(input_tokens=None, size=None, reduce_recompile=False):
.....
    if input_tokens is None:  # ADDING THIS
				# Tokenization
				if args.max_input_tokens > 0:
					input_tokens = tokenizer.batch_encode_plus(
						input_sentences,
						return_tensors="pt",
						padding="max_length",
						max_length=args.max_input_tokens,
						truncation=True,
					)
				else:
					input_tokens = tokenizer.batch_encode_plus(input_sentences, return_tensors="pt", padding=True)

And you could perform the output runcation outside generate:

for i in range(len(outputs)):
                outputs[i] = outputs[i][args.max_input_tokens:]

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

please note we need to run the measurement also, we need other functionalities from this file therefore existing behaviours are retained, if any code duplication can be handled as part of later cleanup/code refactoring.

dyn_prompt_lens = args.simulate_dyn_prompt
t0 = time.perf_counter()
# The first three iterations take longer because of graph compilation
if dyn_prompt_lens is None or len(set(dyn_prompt_lens)) == 1:
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There is no difference between these code and the already existing one in the else branch (except for the call to generate? If not lets not duplicate. You can just call generate differently:

generate(input_sentences[0] if args.dataset_name == "openorca" else None, dyn_prompt_lens[0], args.reduce_recompile)

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

there are difference related the data loading from different datasets and invoking the generate functions on openorca datasets in the for loop

for i in range(args.n_iterations):
results = []
b = 1
for sentence in input_sentences:
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Minor: More pythonic to write:
for b, sentence in enumerate(input_sentences)
then we can remove the b=1, b+=1 lines

t0 = time.perf_counter()
# Benchmark over n_iterations iterations
N = len(input_sentences)
if dyn_prompt_lens is None:
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What is the purpose of the dataset here? I suppose it is for accuracy eval. Then Why warmup? Once you have gone thru the dataset once and collected the sentences for accuracy, you dont need to go over the dataset n_iter times again, as far as I understand.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Similarly teh whole apparatus of dynamic prompts warmup etc are also probably not used for orca eval? in which case we should delete all these extraneous if-elses that dynamic prompt gives rise to

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

two different datasets one for generating measurement and one for running the quantization for evaluation. Please note here number of iteration is 1 for running this evaluation, we have already specified the command for running the evaluation, I have kept to retain the n_itr argument. dynamic path is yet no tested for rouge eval, need to check with validation team else it can be clean up as part of code refactoring.

@sushildubey171
Copy link
Copy Markdown
Contributor Author

This change is better suited in a separate file, like eval_orca.py.

run_generation is meant for perf eval. Its apparatus for warmup and iterating multiple times over a same sentence is not needed for an accuracy eval usecase.

Also current PR has too much code duplications, and unused code paths (dynamic prompts branches)

We still need warm up to avoid the compilations, in accuracy eval we are not using the dummy sentences but dataset. on dynamic prompts, there was some work going on by validation team to support.
I agree with you on code refactoring, but it can be done as part of separate change as this changes are pending to be merged from past 3 months, also we already have this running in OHF/gerrit and being used by QA.

import numpy as np
import json

###################### Habana internal code ##################################
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can you put proper head?

@sushildubey171 sushildubey171 deleted the rouge_eval_oh branch November 15, 2024 01:11
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants