The Computer Engineering Challenge focuses on optimizing the performance of LLMs, with an emphasis on reducing inference latency while maintaining high accuracy on the given datasets. We used the Phi-3-medium-4k-instruct model, a 14-billion-parameter LLM, to tackle a question-answering task based on a structured dataset. To reduce inference latency, we implemented several key optimizations:
- Model Conversion: We converted the Hugging Face (HF) model weights into the GGUF format, enabling more efficient loading and execution during inference.
- Speculative Decoding (Leviathan et al., 2023; Chen et al., 2023a): This technique leverages a draft model that predicts a few tokens ahead (in our case, 10 tokens), allowing those tokens to be verified in parallel and reducing the number of iterations required to generate the final output. Speculative decoding extends the Draft-then-Verify paradigm with various sampling techniques and relies on a pre-trained smaller draft model, eliminating the need for extra training and simplifying deployment (see the sketch after this list).
- Dataset Handling: We streamlined the process of loading the question-answering (QA) dataset, ensuring efficient data flow during inference.
- Prompt Optimization: By refining the prompts used during inference, we further improved the model's response accuracy and speed.
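For reference, here is a minimal sketch of how speculative decoding can be enabled in llama-cpp-python. The built-in LlamaPromptLookupDecoding draft (whose num_pred_tokens matches the --num_pred_tokens=10 setting below) is shown purely for illustration; the model path and the exact draft model wired into final.py are assumptions.

```python
# Minimal sketch (not the exact final.py code): speculative decoding with llama-cpp-python.
# The built-in prompt-lookup draft and the paths below are assumptions for illustration.
from llama_cpp import Llama
from llama_cpp.llama_speculative import LlamaPromptLookupDecoding

llm = Llama(
    model_path="./Phi-3-medium-4k-instruct-fp32-gguf/Phi-3-medium-4k-instruct-fp32.gguf",
    n_gpu_layers=-1,                                             # offload all layers to the GPU
    draft_model=LlamaPromptLookupDecoding(num_pred_tokens=10),   # draft 10 tokens per step
    seed=0,
    verbose=False,
)

out = llm("Question: What is speculative decoding?\nAnswer:", max_tokens=64, temperature=0.0)
print(out["choices"][0]["text"])
```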
sudo docker pull samsung
sudo docker run --runtime nvidia -it --name {container name} \
  -v {local data folder}:{container data folder} samsung:latest /bin/bash
sudo docker start {container name}
sudo docker attach {container name}
Prerequisites
CUDACXX=/usr/local/cuda-12.2/bin/nvcc CMAKE_ARGS="-DGGML_CUDA=on" FORCE_CMAKE=1 pip install llama-cpp-python --force-reinstall --upgrade --no-cache-dir --verbose
python -c "import llama_cpp"  # Check if 'llama-cpp-python' imports successfully
❓ How to Convert from HF to GGUF? Please refer to the GGUF-format repository for microsoft/Phi-3-medium-4k-instruct (not quantized):
- Phi-3-medium-4k-instruct-fp32.gguf: 14B parameters, FP32 (32-bit floating point) precision (*We recommend using this file.)
- Phi-3-medium-4k-instruct-fp16.gguf: 14B parameters, FP16 (16-bit floating point) precision
git lfs install
git lfs clone https://huggingface.co/watchstep/Phi-3-medium-4k-instruct-fp32-gguf
- ❗️(required) --model_dir (Type: str, Default: "Phi-3-medium-4k-instruct-fp32.gguf"): You should update this path to point to the location where your model is stored. If you want to use FP32, choose "Phi-3-medium-4k-instruct-fp32-gguf"; otherwise, choose "Phi-3-medium-4k-instruct-fp16.gguf".
- ❗️(required) --data_path (Type: str, Default: "new_test_dataset.jsonl"): Modify this to point to the correct dataset that you mounted into the local folder.
- ❗️--model_dir (string, required=True): The directory where the model files are stored. This should be the folder path where the model files are located.
- --model_name (string, default: "Phi-3-medium-4k-instruct-fp32.gguf"): The name of the model file you want to use for inference. The default is Phi-3-medium-4k-instruct-fp32.gguf. Make sure that the model file is present in the model_dir.
- ❗️--data_path (string, required=True): The path where the input dataset file is stored. The data file should contain the test data in JSON format for inference.
- --seed (integer, default: 0): The seed value for reproducibility. Change this if you want to control randomness across different runs.
- --temperature (float, default: 0.0): The temperature to use for sampling.
- --verbose (bool, default: False): Whether to enable verbose logging during execution.
- --n_gpu_layers (integer, default: -1): Number of layers to offload to the GPU for processing. If -1, all layers are offloaded.
- --num_pred_tokens (integer, default: 10): The number of tokens the draft model predicts per step during speculative decoding. (A minimal argparse sketch of this CLI follows the list.)
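As a quick reference, the sketch below mirrors the flags documented above with Python's argparse; it is an assumption about how final.py parses its arguments and may differ in details (for example, how --verbose is handled).

```python
# Hypothetical sketch of the CLI described above; the actual final.py may differ.
import argparse

parser = argparse.ArgumentParser(description="Phi-3-medium GGUF QA inference")
parser.add_argument("--model_dir", type=str, required=True,
                    help="Directory where the GGUF model files are stored")
parser.add_argument("--model_name", type=str, default="Phi-3-medium-4k-instruct-fp32.gguf",
                    help="Model file to load from --model_dir")
parser.add_argument("--data_path", type=str, required=True,
                    help="Path to the JSONL test dataset")
parser.add_argument("--seed", type=int, default=0, help="Seed for reproducibility")
parser.add_argument("--temperature", type=float, default=0.0, help="Sampling temperature")
parser.add_argument("--verbose", action="store_true", help="Enable verbose logging")
parser.add_argument("--n_gpu_layers", type=int, default=-1,
                    help="Number of layers to offload to the GPU (-1 = all)")
parser.add_argument("--num_pred_tokens", type=int, default=10,
                    help="Tokens the draft model predicts per speculative-decoding step")
args = parser.parse_args()
```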
cd /
python final.py --model_dir "/path/to/model" \
  --data_path "/path/to/data"

python final.py --model_dir "./Phi-3-medium-4k-instruct-fp32-gguf" \
  --data_path "./data/test_dataset.jsonl"
✅ NOTE
Be patient 😆. It takes about 2 hours to offload the model parameters for each layer onto the GPU on a Jetson Orin AGX 32GB. (You can view the progress logs by setting verbose=True.)
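For completeness, here is a minimal sketch of the kind of QA loop final.py runs over the JSONL dataset. The field name "question", the prompt wording, and the paths are assumptions for illustration; the actual dataset schema and the tuned prompt may differ.

```python
# Minimal QA-loop sketch (assumed field name "question", assumed paths and prompt).
import json
from llama_cpp import Llama

llm = Llama(
    model_path="./Phi-3-medium-4k-instruct-fp32-gguf/Phi-3-medium-4k-instruct-fp32.gguf",
    n_gpu_layers=-1,  # offload every layer to the GPU (the slow, ~2 h step on Jetson Orin AGX)
    verbose=True,     # show per-layer offloading progress
)

answers = []
with open("./data/test_dataset.jsonl", "r", encoding="utf-8") as f:
    for line in f:
        sample = json.loads(line)
        prompt = f"Answer the question concisely.\nQuestion: {sample['question']}\nAnswer:"
        out = llm(prompt, max_tokens=128, temperature=0.0)
        answers.append(out["choices"][0]["text"].strip())
```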