[Docs] Enable Llama3 inference bkc. (#2725)
cboss6 committed Jun 18, 2024
1 parent bac815f commit 785a9c9
Showing 4 changed files with 222 additions and 0 deletions.
100 changes: 100 additions & 0 deletions examples/llama3_inference/README.md
# Llama 3 Inference Best Known Method for Intel® Extension for TensorFlow* on Intel GPU

## Introduction
Llama 3 is a collection of pretrained and fine-tuned generative text models ranging in scale from 8 billion to 70 billion parameters. For more detailed information, refer to [llama-3/keras](https://www.kaggle.com/models/metaresearch/llama-3/keras).

This example shows how to run Llama 3 8B inference with Intel® Extension for TensorFlow* on Intel GPU.

## Hardware Requirements

Verified Hardware Platforms:
- Intel® Data Center GPU Max Series
- Intel® Data Center GPU Flex Series 170

## Prerequisites
### Dataset
Follow [llama-3/keras/llama3_8b_en](https://www.kaggle.com/models/metaresearch/llama-3/keras/llama3_8b_en) to apply for access permission and then download the model archive.
```bash
mkdir -p llama3_8b_en
tar -xzvf llama3-keras-llama3_8b_en-v3.tar.gz -C ./llama3_8b_en
```
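
After extraction, the preset directory should contain the Keras model files. A quick sanity check (the exact file names are an assumption based on typical Keras NLP presets):
```bash
ls llama3_8b_en
# Expected (assumed) contents: config.json  metadata.json  model.weights.h5  tokenizer.json
```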

### Prepare for GPU

Refer to [Prepare](../common_guide_running.md#prepare)

### Setup Running Environment
* Setup for GPU
```bash
./pip_set_env.sh
```
Note: This Llama3 Keras 3 implementation requires TensorFlow >= 2.16.1 and Intel® Extension for TensorFlow* >= 2.16.0.0.
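
To confirm the installed versions meet this requirement, run a quick check inside the virtual environment (this assumes the extension's import name `intel_extension_for_tensorflow` exposes `__version__`):
```bash
python -c "import tensorflow as tf; print(tf.__version__)"
python -c "import intel_extension_for_tensorflow as itex; print(itex.__version__)"
```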

### Enable Running Environment

Enable the oneAPI running environment (GPU only) and the Python virtual environment.

* For GPU, refer to [Running](../common_guide_running.md#running)
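
A typical sequence looks like the following sketch, assuming the default oneAPI install path and the `env_itex` environment created by `pip_set_env.sh`:
```bash
# Enable oneAPI components (default installation path assumed)
source /opt/intel/oneapi/setvars.sh
# Activate the virtual environment created by pip_set_env.sh
source env_itex/bin/activate
```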


### Execute the Example with the Python API
#### Model Default Parameters
| **Parameter** | **Default Value** |
| :---: | :--- |
| **model** | llama3_8b_en |
| **dtype** | bfloat16 |
| **data-dir** | ./ |
| **input-tokens** | 32 |
| **max-new-tokens** | 32 |
| **num-beams** | 1 |
| **num-iter** | 10 |
| **num-warmup** | 3 |
| **batch-size** | 1 |

#### FP32 Inference
```bash
python run_generate.py \
--model llama3_8b_en \
--dtype float32 \
--data-dir ./ \
--input-tokens 32 \
--max-new-tokens 32
```

#### BF16 Inference
```bash
python run_generate.py \
--model llama3_8b_en \
--dtype bfloat16 \
--data-dir ./ \
--input-tokens 32 \
--max-new-tokens 32
```
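
#### Beam Search and Batching
`run_generate.py` also exposes beam search and batching via `--num-beams` and `--batch-size` (see the parameter table above). For example, a beam-search run, which is slower than the default greedy sampler:
```bash
python run_generate.py \
--model llama3_8b_en \
--dtype bfloat16 \
--data-dir ./ \
--input-tokens 32 \
--max-new-tokens 32 \
--num-beams 4
```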

## Example Output
On successful execution, it prints results similar to the following:

```
Prompt: Once upon a time, there existed a little girl, who liked to have adventures. She wanted to go to places and meet new people, and have fun.
Iteration: 0, Time: xxx sec
Iteration: 1, Time: xxx sec
Iteration: 2, Time: xxx sec
Iteration: 3, Time: xxx sec
Iteration: 4, Time: xxx sec
Iteration: 5, Time: xxx sec
Iteration: 6, Time: xxx sec
Iteration: 7, Time: xxx sec
Iteration: 8, Time: xxx sec
Iteration: 9, Time: xxx sec
---------- Summary: ----------
Inference latency: xxx sec.
Output: Once upon a time, there existed a little girl, who liked to have adventures. She wanted to go to places and meet new people, and have fun. She wanted to be a princess, and a pirate, and a fairy, and a mermaid, and a superhero, and a witch, and a queen.
```
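
The reported latency is the average over the timed iterations only, i.e. `total_time / (num_iter - num_warmup)` with the first `num-warmup` iterations excluded, as computed in `run_generate.py`.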

## FAQ

1. If you get the following error log, refer to [Enable Running Environment](#enable-running-environment) and enable the oneAPI running environment.
```
tensorflow.python.framework.errors_impl.NotFoundError: libmkl_sycl.so.2: cannot open shared object file: No such file or directory
```
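The missing library ships with the oneMKL component of oneAPI. Assuming the default install path, you can check that it is present and then source the environment script:
```bash
find /opt/intel/oneapi -name "libmkl_sycl.so*" 2>/dev/null
source /opt/intel/oneapi/setvars.sh
```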
28 changes: 28 additions & 0 deletions examples/llama3_inference/pip_set_env.sh
#!/bin/bash

#
# Copyright (c) 2024 Intel Corporation
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
#

ENV_NAME=env_itex
# Deactivate any active virtual environment; ignore the error if none is active.
deactivate 2>/dev/null || true
# Recreate the virtual environment from scratch.
rm -rf $ENV_NAME
python -m venv $ENV_NAME
source $ENV_NAME/bin/activate
pip install --upgrade pip
pip install tensorflow==2.16.1
pip install keras_nlp
pip install numpy
# Weekly Intel® Extension for TensorFlow* build with XPU (GPU) support
pip install --upgrade intel-extension-for-tensorflow-weekly[xpu] -f https://developer.intel.com/itex-whl-weekly
1 change: 1 addition & 0 deletions examples/llama3_inference/prompt.json

Large diffs are not rendered by default.

93 changes: 93 additions & 0 deletions examples/llama3_inference/run_generate.py
import os
import tensorflow as tf
import argparse
import json
import time
import keras
import keras_nlp
import numpy as np
# Optional: download the latest preset from Kaggle instead of using a local --data-dir.
# import kagglehub
# path = kagglehub.model_download("keras/llama3/keras/llama3_8b_en")
# print("Path to model files:", path)

parser = argparse.ArgumentParser()
parser.add_argument(
    "--model",
    type=str,
    choices=["llama3_8b_en", "llama3_instruct_8b_en"],
    default="llama3_8b_en",
    help="the model name",
)
parser.add_argument(
    "--data-dir",
    type=str,
    default="./",
    help="the dataset path",
)
parser.add_argument(
    "--dtype",
    type=str,
    choices=["float32", "bfloat16"],
    default="float32",
    help="float32 or bfloat16",
)
parser.add_argument(
    "--prompt", default=None, type=str, help="self-defined input prompt, if needed"
)
parser.add_argument(
    "--input-tokens",
    default="32",  # matches the README default; also keeps max_length valid when --prompt is used
    choices=["32", "64", "128", "256", "512", "1024", "2016", "2017", "2048", "4096", "8192"],
    type=str,
    help="input token length, selecting a prompt from prompt.json",
)
parser.add_argument(
    "--max-new-tokens", default=32, type=int, help="output max new tokens"
)
parser.add_argument("--num-beams", default=1, type=int, help="beam width")
parser.add_argument("--num-iter", default=10, type=int, help="num iter")
parser.add_argument("--num-warmup", default=3, type=int, help="num warmup")
parser.add_argument("--batch-size", default=1, type=int, help="batch size")

args = parser.parse_args()
# Preset directory, e.g. ./llama3_8b_en
path = os.path.join(args.data_dir, args.model)
print("Dataset dir is: %s" % path)

if args.dtype == "bfloat16":
    keras.config.set_floatx("bfloat16")
model = keras_nlp.models.Llama3CausalLM.from_preset(path, dtype=args.dtype)
if args.num_beams > 1:
    from keras_nlp.samplers import BeamSampler
    model.compile(sampler=BeamSampler(num_beams=args.num_beams))
else:
    model.compile(sampler="greedy")

if args.prompt is not None:
    prompt = args.prompt
elif args.input_tokens is not None:
    current_path = os.path.dirname(__file__)
    with open(os.path.join(current_path, "prompt.json")) as f:
        prompt_pool = json.load(f)
    prompt = prompt_pool[args.input_tokens]
print("Prompt: %s" % prompt)

total_time = 0.0
num_iter = args.num_iter
num_warmup = args.num_warmup
prompt = [prompt] * args.batch_size
for i in range(num_iter):
    tic = time.time()
    output = model.generate(
        prompt, max_length=int(args.max_new_tokens) + int(args.input_tokens)
    )
    toc = time.time()
    print("Iteration: %d, Time: %.6f sec" % (i, toc - tic), flush=True)
    # Only iterations after warmup count toward the reported latency.
    if i >= num_warmup:
        total_time += toc - tic

print("\n", "-" * 10, "Summary:", "-" * 10)
latency = total_time / (num_iter - num_warmup)
print("Inference latency: %.3f sec." % latency)
print("Output: %s." % output)
