Add rouge metric evalution for llama 70B with orca datasets by sushildubey171 · Pull Request #1068 · huggingface/optimum-habana

sushildubey171 · 2024-06-12T08:10:09Z

Add rouge metric evalution for llama 70B with orca datasets

Use the ROUGE metric to evaluate the correctness of the model. The evaluation uses the OpenOrca dataset.
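For readers unfamiliar with the metric, the following is a simplified sketch of ROUGE-1 (unigram overlap F1). It is not the scorer used in this PR, which is not shown in the conversation; real evaluations normally rely on a library and also report ROUGE-2 and ROUGE-L. The function name `rouge1_f1` and the whitespace tokenization are assumptions made for illustration.

```python
from collections import Counter


def rouge1_f1(reference, candidate):
    """Unigram-overlap F1 between two whitespace-tokenized strings (illustrative only)."""
    ref_counts = Counter(reference.lower().split())
    cand_counts = Counter(candidate.lower().split())
    # Clipped overlap: a word counts at most as often as it appears in both texts.
    overlap = sum((ref_counts & cand_counts).values())
    if overlap == 0:
        return 0.0
    precision = overlap / sum(cand_counts.values())
    recall = overlap / sum(ref_counts.values())
    return 2 * precision * recall / (precision + recall)


score = rouge1_f1("the cat sat on the mat", "the cat lay on the mat")
print(f"ROUGE-1 F1: {score:.3f}")
```

A higher score means the generated text shares more words with the reference answer, which is why the metric can serve as a coarse correctness check for a quantized model.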

libinta · 2024-07-02T05:30:58Z

+def get_args():
+    parser = argparse.ArgumentParser()
+    parser.add_argument("--checkpoint-path", default="/mnt/weka/data/pytorch/llama2/Llama-2-70b-chat-hf",
+                        help="Path to Llama2-70b-hf-chat checkpoint")


Please remove the current default path and use None instead.

This argument shouldn't have a default, and it should probably have required=True. We could also call it --model_name_or_path, because it doesn't have to be a checkpoint on disk; it could be a Hugging Face model that gets downloaded, like here: https://github.com/huggingface/optimum-habana/blob/main/examples/text-generation/run_generation.py#L48

We took this from the Llama MLPerf submission, and this ROUGE eval works only with this checkpoint; it can't be any other checkpoint. We have tested only with this one, so it is set as the default, which also helps the user run with the correct file.
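A minimal sketch of the reviewers' suggestion, assuming plain argparse: the flag is renamed to --model_name_or_path and marked required=True, so no site-specific path is baked into the script. The help text and the model ID passed to `get_args` below are illustrative, not taken from the PR.

```python
import argparse


def get_args(argv=None):
    parser = argparse.ArgumentParser()
    # No default: the caller must state which model to evaluate.
    parser.add_argument(
        "--model_name_or_path",
        type=str,
        required=True,
        help="Local checkpoint directory or Hugging Face Hub model ID",
    )
    return parser.parse_args(argv)


# Illustrative invocation; the model ID here is only an example value.
args = get_args(["--model_name_or_path", "meta-llama/Llama-2-70b-chat-hf"])
print(args.model_name_or_path)
```

If the eval is only validated against one checkpoint, that constraint could be stated in the help text rather than encoded as a default path.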

libinta · 2024-07-02T05:31:28Z

+                        help="Path to Llama2-70b-hf-chat checkpoint")
+    parser.add_argument("--accuracy-file", default="output/accuracy.json", help="path to accuracy.json")
+    parser.add_argument("--dataset-file", default="/mnt/weka/data/mlperf_inference/llama2/processed-data.pkl",
+                        help="path to processed openorca validation set")


Remove this default as well.

Maybe we should specify what the file format and data contents should be, either in the help message or at least in a comment somewhere.

The default file name helps users run with the correct dataset file: it is already preprocessed, and no other dataset is used here. If the default is removed, users may run with an incorrect dataset or fail to find the correct one.

libinta · 2024-07-02T05:32:22Z

    )
+    parser.add_argument(
+        "--dataset",
+        default="/mnt/weka/data/mlperf_inference/llama2/processed-data.pkl",


Remove the default and use None, but add a check later.

Same as above.
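One possible reading of "use None, but do a check later", sketched with argparse: the path defaults to None and is validated after parsing. The exact check the reviewer had in mind is not specified; the file-existence test and the error message below are assumptions.

```python
import argparse
import os


def get_args(argv=None):
    parser = argparse.ArgumentParser()
    parser.add_argument(
        "--dataset",
        default=None,
        help="Path to the preprocessed OpenOrca validation set (pickle file)",
    )
    args = parser.parse_args(argv)
    # Deferred check: fail early with a clear message instead of relying on a hardcoded path.
    if args.dataset is not None and not os.path.isfile(args.dataset):
        parser.error(f"--dataset file not found: {args.dataset}")
    return args


# Without the flag, the dataset stays None and the caller decides what to do.
print(get_args([]).dataset)
```

The help string doubles as the place to document the expected file format, which addresses the other review comment on this argument.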

libinta · 2024-07-02T05:34:35Z

+        def generate(input_tokens, size=None, reduce_recompile=False):
+            """Generates sequences from the input sentences and returns them."""
+
+            t0 = time.perf_counter()


Please check how first-token latency is measured here, and do the same: https://github.com/huggingface/optimum-habana/blob/main/examples/text-generation/run_generation.py#L381

Please note that this is for evaluation, not for latency testing; the existing behaviour is still retained.

ssarkar2

This change is better suited to a separate file, like eval_orca.py.

run_generation is meant for perf eval. Its apparatus for warmup and for iterating multiple times over the same sentence is not needed for an accuracy eval use case.

Also, the current PR has too much code duplication and unused code paths (the dynamic prompt branches).

ssarkar2 · 2024-07-02T17:27:51Z

+def get_args():
+    parser = argparse.ArgumentParser()
+    parser.add_argument("--checkpoint-path", default="/mnt/weka/data/pytorch/llama2/Llama-2-70b-chat-hf",
+                        help="Path to Llama2-70b-hf-chat checkpoint")


This argument shouldn't have a default, and it should probably have required=True. We could also call it --model_name_or_path, because it doesn't have to be a checkpoint on disk; it could be a Hugging Face model that gets downloaded, like here: https://github.com/huggingface/optimum-habana/blob/main/examples/text-generation/run_generation.py#L48

ssarkar2 · 2024-07-02T17:29:50Z

+                        help="Path to Llama2-70b-hf-chat checkpoint")
+    parser.add_argument("--accuracy-file", default="output/accuracy.json", help="path to accuracy.json")
+    parser.add_argument("--dataset-file", default="/mnt/weka/data/mlperf_inference/llama2/processed-data.pkl",
+                        help="path to processed openorca validation set")


Maybe we should specify what the file format and data contents should be, either in the help message or at least in a comment somewhere.

ssarkar2 · 2024-07-02T17:59:19Z

+    if args.dtype == "int32":
+        eval_dtype = np.int32
+    elif args.dtype == "float":
+        eval_dtype = np.float32


minor:
eval_dtype = {"int32": np.int32, "float": np.float32, "int64": np.int64}[args.dtype]
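A small sketch of the dictionary lookup suggested above, combined with argparse `choices` so that an unsupported name is rejected at parse time instead of raising a KeyError. The `--dtype` default of "int64" and the constant name `DTYPE_MAP` are assumptions for illustration.

```python
import argparse

import numpy as np

# Single source of truth for the supported dtype names.
DTYPE_MAP = {"int32": np.int32, "float": np.float32, "int64": np.int64}

parser = argparse.ArgumentParser()
# choices= makes argparse reject unknown names before the dictionary lookup runs.
parser.add_argument("--dtype", default="int64", choices=sorted(DTYPE_MAP))

args = parser.parse_args(["--dtype", "int32"])
eval_dtype = DTYPE_MAP[args.dtype]
print(eval_dtype)
```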

ssarkar2 · 2024-07-02T18:10:40Z

+                outputs[i] = outputs[i][args.max_input_tokens:]
+            duration = time.perf_counter() - t0
+            print(f"Total E2E time of this batch is {duration:.3f}s", flush=True)
+            return outputs


This is a LOT of code duplication, which can cause errors later, for example by forgetting to propagate changes to both branches.

For example, you could reuse the older "generate" function, just by adding:

    def generate(input_tokens=None, size=None, reduce_recompile=False):
        .....
        if input_tokens is None:  # ADDING THIS
            # Tokenization
            if args.max_input_tokens > 0:
                input_tokens = tokenizer.batch_encode_plus(
                    input_sentences,
                    return_tensors="pt",
                    padding="max_length",
                    max_length=args.max_input_tokens,
                    truncation=True,
                )
            else:
                input_tokens = tokenizer.batch_encode_plus(input_sentences, return_tensors="pt", padding=True)

And you could perform the output truncation outside generate:

    for i in range(len(outputs)):
        outputs[i] = outputs[i][args.max_input_tokens:]

Please note that we also need to run the measurement, and we need other functionality from this file, so the existing behaviour is retained. Any code duplication can be handled in a later cleanup/refactoring.
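The refactoring proposed in this thread can be sketched in plain Python. Everything here is a hypothetical stand-in: `tokenize` and `run_model` replace the real tokenizer and model, and the token values are made up. The point is only the shape of the control flow: one `generate` that tokenizes when no tokens are passed, with prompt truncation done by the caller.

```python
def tokenize(sentences):
    # Hypothetical stand-in for the tokenizer: one integer ID per word.
    return [[len(word) for word in s.split()] for s in sentences]


def run_model(batch, max_new_tokens=2):
    # Hypothetical stand-in for model.generate: echoes the prompt, then appends new tokens.
    return [prompt + [0] * max_new_tokens for prompt in batch]


DEFAULT_SENTENCES = ["hello world", "a b"]


def generate(input_tokens=None):
    # Only tokenize when the caller did not pass pre-tokenized input.
    if input_tokens is None:
        input_tokens = tokenize(DEFAULT_SENTENCES)
    return run_model(input_tokens)


# Perf path: no arguments, so the built-in sentences are tokenized.
perf_outputs = generate()

# Eval path: pre-tokenized dataset rows with a fixed prompt length.
max_input_tokens = 3
dataset_tokens = [[7, 8, 9], [4, 5, 6]]
outputs = generate(dataset_tokens)
# Truncation happens outside generate: drop the prompt, keep only the new tokens.
outputs = [out[max_input_tokens:] for out in outputs]
print(outputs)
```

With this shape, the perf and eval paths share one function body, so a later change to generation options only has to be made once.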

ssarkar2 · 2024-07-02T18:11:24Z

+        dyn_prompt_lens = args.simulate_dyn_prompt
+        t0 = time.perf_counter()
+        # The first three iterations take longer because of graph compilation
+        if dyn_prompt_lens is None or len(set(dyn_prompt_lens)) == 1:


There is no difference between this code and the already existing code in the else branch (except for the call to generate), is there? If not, let's not duplicate it. You can just call generate differently:

generate(input_sentences[0] if args.dataset_name == "openorca" else None, dyn_prompt_lens[0], args.reduce_recompile)

There are differences in how data is loaded from the different datasets and in how the generate function is invoked on the OpenOrca dataset in the for loop.

ssarkar2 · 2024-07-02T18:15:55Z

+            for i in range(args.n_iterations):
+                results = []
+                b = 1
+                for sentence in input_sentences:


Minor: it is more Pythonic to write:
for b, sentence in enumerate(input_sentences):
Then we can remove the b = 1 and b += 1 lines.
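Since the diff initializes the counter with `b = 1`, a drop-in replacement needs `start=1`; plain `enumerate` would begin at 0. A short comparison, using made-up sentences:

```python
input_sentences = ["first prompt", "second prompt", "third prompt"]

# Manual counter, as in the diff above.
manual = []
b = 1
for sentence in input_sentences:
    manual.append((b, sentence))
    b += 1

# Equivalent loop with enumerate; start=1 keeps the counter 1-based.
pythonic = []
for b, sentence in enumerate(input_sentences, start=1):
    pythonic.append((b, sentence))

print(manual == pythonic)
```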

ssarkar2 · 2024-07-02T18:20:41Z

+        t0 = time.perf_counter()
+        # Benchmark over n_iterations iterations
+        N = len(input_sentences)
+        if dyn_prompt_lens is None:


What is the purpose of the dataset here? I suppose it is for accuracy eval. Then why warm up? Once you have gone through the dataset once and collected the sentences for accuracy, you don't need to go over the dataset n_iter times again, as far as I understand.

Similarly, the whole apparatus of dynamic prompts, warmup, etc. is probably not used for the Orca eval either, in which case we should delete all the extraneous if-else branches that dynamic prompts give rise to.

There are two different datasets: one for generating the measurement and one for running the quantization for evaluation. Please note that the number of iterations is 1 when running this evaluation, and we have already specified the command for running it; I have kept the n_itr argument. The dynamic path is not yet tested for ROUGE eval; I need to check with the validation team, otherwise it can be cleaned up as part of code refactoring.

sushildubey171 · 2024-08-27T08:10:06Z

This change is better suited to a separate file, like eval_orca.py.

run_generation is meant for perf eval. Its apparatus for warmup and for iterating multiple times over the same sentence is not needed for an accuracy eval use case.

Also, the current PR has too much code duplication and unused code paths (the dynamic prompt branches).

We still need warmup to avoid the compilations; in accuracy eval we are not using the dummy sentences but the dataset. On dynamic prompts, the validation team was working on adding support.
I agree with you on code refactoring, but it can be done as a separate change, since these changes have been pending merge for the past 3 months. We also already have this running in OHF/gerrit, where it is being used by QA.

libinta · 2024-09-18T20:53:36Z

+import numpy as np
+import json
+
+###################### Habana internal code ##################################


Can you add a proper header?

Add rouge metric evalution for llama 70B with orca datasets (#169)

00ad981

sushildubey171 requested a review from regisss as a code owner June 12, 2024 08:10

sushildubey171 mentioned this pull request Jun 12, 2024

Add rouge metric evalution for llama 70B with orca datasets HabanaAI/optimum-habana-fork#169

Merged

libinta reviewed Jul 2, 2024

ssarkar2 suggested changes Jul 2, 2024

Merge branch 'main' into rouge_eval_oh

7e14569

sushildubey171 requested review from libinta and ssarkar2 August 27, 2024 08:45

libinta reviewed Sep 18, 2024

sushildubey171 closed this Nov 15, 2024

sushildubey171 deleted the rouge_eval_oh branch November 15, 2024 01:11
