The Computer Engineering Challenge focuses on optimizing the performance of LLMs, with an emphasis on reducing inference latency while maintaining high accuracy on the given datasets. We used the Phi-3-medium-4k-instruct model, a 14-billion-parameter LLM, to tackle a question-answering task based on a structured dataset. To reduce inference latency, we implemented several key optimizations:
- Model Conversion: We converted the Hugging Face (HF) model weights into the GGUF format, enabling more efficient loading and execution during inference.
- Speculative Decoding (Leviathan et al., 2023; Chen et al., 2023a): This technique leverages a draft model that predicts a few tokens ahead (in our case, 10 tokens), allowing those tokens to be verified in parallel and reducing the number of iterations required to generate the final output. Speculative decoding extends the Draft-then-Verify paradigm with various sampling techniques and relies on a pre-trained smaller draft model, eliminating the need for extra training and simplifying deployment (see the sketch after this list).
- Dataset Handling: We streamlined the process of loading the question-answering (QA) dataset, ensuring efficient data flow during inference.
- Prompt Optimization: By refining the prompts used during inference, we further improved the model's response accuracy and speed.
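For reference, here is a minimal sketch of how speculative decoding can be enabled in llama-cpp-python. The built-in LlamaPromptLookupDecoding draft (whose num_pred_tokens matches the --num_pred_tokens=10 setting below) is shown purely for illustration; the model path and the exact draft model wired into final.py are assumptions.

```python
# Minimal sketch (not the exact final.py code): speculative decoding with llama-cpp-python.
# The built-in prompt-lookup draft and the paths below are assumptions for illustration.
from llama_cpp import Llama
from llama_cpp.llama_speculative import LlamaPromptLookupDecoding

llm = Llama(
    model_path="./Phi-3-medium-4k-instruct-fp32-gguf/Phi-3-medium-4k-instruct-fp32.gguf",
    n_gpu_layers=-1,                                             # offload all layers to the GPU
    draft_model=LlamaPromptLookupDecoding(num_pred_tokens=10),   # draft 10 tokens per step
    seed=0,
    verbose=False,
)

out = llm("Question: What is speculative decoding?\nAnswer:", max_tokens=64, temperature=0.0)
print(out["choices"][0]["text"])
```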
sudo docker pull samsung
sudo docker run --runtime nvidia -it --name {container name} \
  -v {local data folder}:{container data folder} samsung:latest /bin/bash
sudo docker start {container name}
sudo docker attach {container name}
Prerequisites
CUDACXX=/usr/local/cuda-12.2/bin/nvcc CMAKE_ARGS="-DGGML_CUDA=on" FORCE_CMAKE=1 pip install llama-cpp-python --force-reinstall --upgrade --no-cache-dir --verbose
python -c "import llama_cpp"  # Check if 'llama-cpp-python' imports successfully
❓ How to Convert from HF to GGUF? Please refer to the GGUF-format repository for microsoft/Phi-3-medium-4k-instruct (not quantized):
- Phi-3-medium-4k-instruct-fp32.gguf: 14B parameters, FP32 (32-bit floating point) precision (*We recommend using this file.)
- Phi-3-medium-4k-instruct-fp16.gguf: 14B parameters, FP16 (16-bit floating point) precision
git lfs install
git lfs clone https://huggingface.co/watchstep/Phi-3-medium-4k-instruct-fp32-gguf
- ❗️(required) --model_dir (Type: str, Default: "Phi-3-medium-4k-instruct-fp32.gguf"): You should update this path to point to the location where your model is stored. If you want to use FP32, choose "Phi-3-medium-4k-instruct-fp32-gguf"; otherwise, choose "Phi-3-medium-4k-instruct-fp16.gguf".
- ❗️(required) --data_path (Type: str, Default: "new_test_dataset.jsonl"): Modify this to point to the correct dataset that you mounted into the local folder.
- ❗️--model_dir (string, required=True): The directory where the model files are stored. This should be the folder path where the model files are located.
- --model_name (string, default: "Phi-3-medium-4k-instruct-fp32.gguf"): The name of the model file you want to use for inference. The default is Phi-3-medium-4k-instruct-fp32.gguf. Make sure that the model file is present in the model_dir.
- ❗️--data_path (string, required=True): The path where the input dataset file is stored. The data file should contain the test data in JSON format for inference.
- --seed (integer, default: 0): The seed value for reproducibility. Change this if you want to control randomness across different runs.
- --temperature (float, default: 0.0): The temperature to use for sampling.
- --verbose (bool, default: False): Whether to enable verbose logging during execution.
- --n_gpu_layers (integer, default: -1): Number of layers to offload to the GPU for processing. If -1, all layers are offloaded.
- --num_pred_tokens (integer, default: 10): The number of tokens the draft model predicts per step during speculative decoding. (A minimal argparse sketch of this CLI follows the list.)
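As a quick reference, the sketch below mirrors the flags documented above with Python's argparse; it is an assumption about how final.py parses its arguments and may differ in details (for example, how --verbose is handled).

```python
# Hypothetical sketch of the CLI described above; the actual final.py may differ.
import argparse

parser = argparse.ArgumentParser(description="Phi-3-medium GGUF QA inference")
parser.add_argument("--model_dir", type=str, required=True,
                    help="Directory where the GGUF model files are stored")
parser.add_argument("--model_name", type=str, default="Phi-3-medium-4k-instruct-fp32.gguf",
                    help="Model file to load from --model_dir")
parser.add_argument("--data_path", type=str, required=True,
                    help="Path to the JSONL test dataset")
parser.add_argument("--seed", type=int, default=0, help="Seed for reproducibility")
parser.add_argument("--temperature", type=float, default=0.0, help="Sampling temperature")
parser.add_argument("--verbose", action="store_true", help="Enable verbose logging")
parser.add_argument("--n_gpu_layers", type=int, default=-1,
                    help="Number of layers to offload to the GPU (-1 = all)")
parser.add_argument("--num_pred_tokens", type=int, default=10,
                    help="Tokens the draft model predicts per speculative-decoding step")
args = parser.parse_args()
```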
cd /
python final.py --model_dir "/path/to/model" \
  --data_path "/path/to/data"

python final.py --model_dir "./Phi-3-medium-4k-instruct-fp32-gguf" \
  --data_path "./data/test_dataset.jsonl"
✅ NOTE
Be patient 😆. It takes about 2 hours to offload the model parameters for each layer onto the GPU on a Jetson Orin AGX 32GB. (You can view the progress logs by setting verbose=True.)
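For completeness, here is a minimal sketch of the kind of QA loop final.py runs over the JSONL dataset. The field name "question", the prompt wording, and the paths are assumptions for illustration; the actual dataset schema and the tuned prompt may differ.

```python
# Minimal QA-loop sketch (assumed field name "question", assumed paths and prompt).
import json
from llama_cpp import Llama

llm = Llama(
    model_path="./Phi-3-medium-4k-instruct-fp32-gguf/Phi-3-medium-4k-instruct-fp32.gguf",
    n_gpu_layers=-1,  # offload every layer to the GPU (the slow, ~2 h step on Jetson Orin AGX)
    verbose=True,     # show per-layer offloading progress
)

answers = []
with open("./data/test_dataset.jsonl", "r", encoding="utf-8") as f:
    for line in f:
        sample = json.loads(line)
        prompt = f"Answer the question concisely.\nQuestion: {sample['question']}\nAnswer:"
        out = llm(prompt, max_tokens=128, temperature=0.0)
        answers.append(out["choices"][0]["text"].strip())
```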