[Docs] Enable Llama3 inference bkc. (#2725)
cboss6 committed Jun 18, 2024
1 parent bac815f commit 785a9c9
Showing 4 changed files with 222 additions and 0 deletions.
100 changes: 100 additions & 0 deletions examples/llama3_inference/README.md
# Llama 3 Inference Best Known Method for Intel® Extension for TensorFlow* on Intel GPU

## Introduction
Llama 3 is a collection of pretrained and fine-tuned generative text models ranging in scale from 8 billion to 70 billion parameters. For more detailed information, refer to [llama-3/keras](https://www.kaggle.com/models/metaresearch/llama-3/keras).

This example shows how to run Llama 3 8B inference with Intel® Extension for TensorFlow* on Intel GPU.

## Hardware Requirements

Verified Hardware Platforms:
- Intel® Data Center GPU Max Series
- Intel® Data Center GPU Flex Series 170

## Prerequisites
### Dataset
Follow [llama-3/keras/llama3_8b_en](https://www.kaggle.com/models/metaresearch/llama-3/keras/llama3_8b_en) to apply for access permission and then download the model archive.
```bash
mkdir -p llama3_8b_en
tar -xzvf llama3-keras-llama3_8b_en-v3.tar.gz -C ./llama3_8b_en
```
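
After extraction, the preset directory should contain the Keras model files. A quick sanity check (the exact file names are an assumption based on typical Keras NLP presets):
```bash
ls llama3_8b_en
# Expected (assumed) contents: config.json  metadata.json  model.weights.h5  tokenizer.json
```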

### Prepare for GPU

Refer to [Prepare](../common_guide_running.md#prepare)

### Setup Running Environment
* Setup for GPU
```bash
./pip_set_env.sh
```
Note: This Llama3 Keras 3 implementation requires TensorFlow >= 2.16.1 and Intel® Extension for TensorFlow* >= 2.16.0.0.
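
To confirm the installed versions meet this requirement, run a quick check inside the virtual environment (this assumes the extension's import name `intel_extension_for_tensorflow` exposes `__version__`):
```bash
python -c "import tensorflow as tf; print(tf.__version__)"
python -c "import intel_extension_for_tensorflow as itex; print(itex.__version__)"
```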

### Enable Running Environment

Enable the oneAPI running environment (GPU only) and the Python virtual environment.

* For GPU, refer to [Running](../common_guide_running.md#running)
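
A typical sequence looks like the following sketch, assuming the default oneAPI install path and the `env_itex` environment created by `pip_set_env.sh`:
```bash
# Enable oneAPI components (default installation path assumed)
source /opt/intel/oneapi/setvars.sh
# Activate the virtual environment created by pip_set_env.sh
source env_itex/bin/activate
```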


### Execute the Example with the Python API
#### Model Default Parameters
| **Parameter** | **Default Value** |
| :---: | :--- |
| **model** | llama3_8b_en |
| **dtype** | bfloat16 |
| **data-dir** | ./ |
| **input-tokens** | 32 |
| **max-new-tokens** | 32 |
| **num-beams** | 1 |
| **num-iter** | 10 |
| **num-warmup** | 3 |
| **batch-size** | 1 |

#### FP32 Inference
```bash
python run_generate.py \
--model llama3_8b_en \
--dtype float32 \
--data-dir ./ \
--input-tokens 32 \
--max-new-tokens 32
```

#### BF16 Inference
```bash
python run_generate.py \
--model llama3_8b_en \
--dtype bfloat16 \
--data-dir ./ \
--input-tokens 32 \
--max-new-tokens 32
```
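
#### Beam Search and Batching
`run_generate.py` also exposes beam search and batching via `--num-beams` and `--batch-size` (see the parameter table above). For example, a beam-search run, which is slower than the default greedy sampler:
```bash
python run_generate.py \
--model llama3_8b_en \
--dtype bfloat16 \
--data-dir ./ \
--input-tokens 32 \
--max-new-tokens 32 \
--num-beams 4
```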

## Example Output
On successful execution, it prints results similar to the following:

```
Prompt: Once upon a time, there existed a little girl, who liked to have adventures. She wanted to go to places and meet new people, and have fun.
Iteration: 0, Time: xxx sec
Iteration: 1, Time: xxx sec
Iteration: 2, Time: xxx sec
Iteration: 3, Time: xxx sec
Iteration: 4, Time: xxx sec
Iteration: 5, Time: xxx sec
Iteration: 6, Time: xxx sec
Iteration: 7, Time: xxx sec
Iteration: 8, Time: xxx sec
Iteration: 9, Time: xxx sec
---------- Summary: ----------
Inference latency: xxx sec.
Output: Once upon a time, there existed a little girl, who liked to have adventures. She wanted to go to places and meet new people, and have fun. She wanted to be a princess, and a pirate, and a fairy, and a mermaid, and a superhero, and a witch, and a queen.
```
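
The reported latency is the average over the timed iterations only, i.e. `total_time / (num_iter - num_warmup)` with the first `num-warmup` iterations excluded, as computed in `run_generate.py`.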

## FAQ

1. If you get the following error log, refer to [Enable Running Environment](#enable-running-environment) and enable the oneAPI running environment.
```
tensorflow.python.framework.errors_impl.NotFoundError: libmkl_sycl.so.2: cannot open shared object file: No such file or directory
```
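The missing library ships with the oneMKL component of oneAPI. Assuming the default install path, you can check that it is present and then source the environment script:
```bash
find /opt/intel/oneapi -name "libmkl_sycl.so*" 2>/dev/null
source /opt/intel/oneapi/setvars.sh
```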
28 changes: 28 additions & 0 deletions examples/llama3_inference/pip_set_env.sh
#!/bin/bash

#
# Copyright (c) 2024 Intel Corporation
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
#

ENV_NAME=env_itex
# Deactivate any active virtual environment; ignore the error if none is active.
deactivate 2>/dev/null || true
# Recreate the virtual environment from scratch.
rm -rf $ENV_NAME
python -m venv $ENV_NAME
source $ENV_NAME/bin/activate
pip install --upgrade pip
pip install tensorflow==2.16.1
pip install keras_nlp
pip install numpy
# Weekly Intel® Extension for TensorFlow* build with XPU (GPU) support
pip install --upgrade intel-extension-for-tensorflow-weekly[xpu] -f https://developer.intel.com/itex-whl-weekly
1 change: 1 addition & 0 deletions examples/llama3_inference/prompt.json

Large diffs are not rendered by default.

93 changes: 93 additions & 0 deletions examples/llama3_inference/run_generate.py
import os
import tensorflow as tf
import argparse
import json
import time
import keras
import keras_nlp
import numpy as np
# Optional: download the latest preset from Kaggle instead of using a local --data-dir.
# import kagglehub
# path = kagglehub.model_download("keras/llama3/keras/llama3_8b_en")
# print("Path to model files:", path)

parser = argparse.ArgumentParser()
parser.add_argument(
    "--model",
    type=str,
    choices=["llama3_8b_en", "llama3_instruct_8b_en"],
    default="llama3_8b_en",
    help="the model name",
)
parser.add_argument(
    "--data-dir",
    type=str,
    default="./",
    help="the dataset path",
)
parser.add_argument(
    "--dtype",
    type=str,
    choices=["float32", "bfloat16"],
    default="float32",
    help="float32 or bfloat16",
)
parser.add_argument(
    "--prompt", default=None, type=str, help="self-defined input prompt, if needed"
)
parser.add_argument(
    "--input-tokens",
    default="32",  # matches the README default; also keeps max_length valid when --prompt is used
    choices=["32", "64", "128", "256", "512", "1024", "2016", "2017", "2048", "4096", "8192"],
    type=str,
    help="input token length, selecting a prompt from prompt.json",
)
parser.add_argument(
    "--max-new-tokens", default=32, type=int, help="output max new tokens"
)
parser.add_argument("--num-beams", default=1, type=int, help="beam width")
parser.add_argument("--num-iter", default=10, type=int, help="num iter")
parser.add_argument("--num-warmup", default=3, type=int, help="num warmup")
parser.add_argument("--batch-size", default=1, type=int, help="batch size")

args = parser.parse_args()
# Preset directory, e.g. ./llama3_8b_en
path = os.path.join(args.data_dir, args.model)
print("Dataset dir is: %s" % path)

if args.dtype == "bfloat16":
    keras.config.set_floatx("bfloat16")
model = keras_nlp.models.Llama3CausalLM.from_preset(path, dtype=args.dtype)
if args.num_beams > 1:
    from keras_nlp.samplers import BeamSampler
    model.compile(sampler=BeamSampler(num_beams=args.num_beams))
else:
    model.compile(sampler="greedy")

if args.prompt is not None:
    prompt = args.prompt
elif args.input_tokens is not None:
    current_path = os.path.dirname(__file__)
    with open(os.path.join(current_path, "prompt.json")) as f:
        prompt_pool = json.load(f)
    prompt = prompt_pool[args.input_tokens]
print("Prompt: %s" % prompt)

total_time = 0.0
num_iter = args.num_iter
num_warmup = args.num_warmup
prompt = [prompt] * args.batch_size
for i in range(num_iter):
    tic = time.time()
    output = model.generate(
        prompt, max_length=int(args.max_new_tokens) + int(args.input_tokens)
    )
    toc = time.time()
    print("Iteration: %d, Time: %.6f sec" % (i, toc - tic), flush=True)
    # Only iterations after warmup count toward the reported latency.
    if i >= num_warmup:
        total_time += toc - tic

print("\n", "-" * 10, "Summary:", "-" * 10)
latency = total_time / (num_iter - num_warmup)
print("Inference latency: %.3f sec." % latency)
print("Output: %s." % output)
