```bash
chmod +x ./models/demos/t3000/llama3_70b/setup_llama.sh && ./models/demos/t3000/llama3_70b/setup_llama.sh <MODEL_TYPE> <TT_METAL_COMMIT_SHA_OR_TAG> <TT_VLLM_COMMIT_SHA_OR_TAG>
```
Here, `TT_METAL_COMMIT_SHA_OR_TAG` and `TT_VLLM_COMMIT_SHA_OR_TAG` are the tt-metal and vLLM "Release" versions listed in the root README, respectively.
Example:
```bash
./models/demos/t3000/llama3_70b/setup_llama.sh llama-3.1-70b-instruct v0.54.0-rc2 953161188c50f10da95a88ab305e23977ebd3750
```
Follow the prompts in the CLI to select the appropriate weights for Llama 3.1 70B Instruct.
Prerequisites:
- Submit a request to access the weights from Meta: Llama Downloads
- Request access on Hugging Face and have an HF personal access token: Llama 3.1 70B Instruct (see the login sketch below)
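If the download flow uses Hugging Face authentication, log in with your personal access token first. This is a minimal sketch, assuming the `huggingface_hub` package (which provides `huggingface-cli`) is installed in your environment:

```bash
# Install the Hugging Face CLI and authenticate with your personal access token.
pip install -U huggingface_hub
huggingface-cli login   # paste your HF personal access token when prompted
```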
Steps run:
- Setup environment
- Build tt-metal
- Download Llama 3.1 70B Instruct weights
- Install vLLM
- Deploy vLLM server
Note: This guide requires an installation/build of tt-metal. Please refer to the installation instructions for the release listed in the README.
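For orientation only, here is a minimal sketch of a typical source build, assuming the standard tt-metal flow; the installation instructions for the matching release are authoritative and may differ:

```bash
# Sketch of a typical tt-metal source build (assumed flow; follow the release's install docs).
git clone https://github.com/tenstorrent/tt-metal.git --recurse-submodules
cd tt-metal
git checkout <TT_METAL_COMMIT_SHA_OR_TAG>   # e.g. v0.54.0-rc2
git submodule update --init --recursive
./build_metal.sh
./create_venv.sh
source python_env/bin/activate
```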
- Download the Llama3/3.1-70B weights from Meta (https://llama.meta.com/):
- Repack the weights:

  ```bash
  # This concatenates the sharded checkpoints and makes it easier for us to load.
  python models/demos/t3000/llama2_70b/scripts/repack_weights.py <path_to_checkpoint_dir> <repacked_output_dir> <chunk_size>
  ```
Note: Use `5` for `chunk_size`. Once the weights are repacked, move the `params.json` file from the `checkpoint_dir` to the `repacked_output_dir`.
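For illustration, a complete repack with hypothetical paths (substitute your own checkpoint and output directories) might look like this:

```bash
# Hypothetical paths for illustration only; chunk_size is 5 as noted above.
python models/demos/t3000/llama2_70b/scripts/repack_weights.py /home/llama-data/Meta-Llama-3.1-70B /home/llama-data-repacked/llama-3-70b 5
# Move params.json from the original checkpoint directory into the repacked directory.
mv /home/llama-data/Meta-Llama-3.1-70B/params.json /home/llama-data-repacked/llama-3-70b/
```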
After setting up the repacked weights and tokenizer, you can run the demo using the commands below:
- Prepare the weight cache directory:

  ```bash
  # Make a directory for us to cache weights into. This speeds up subsequent runs.
  mkdir <weight_cache_dir>
  ```
- Set up environment variables:

  ```bash
  export LLAMA3_CKPT_DIR=<repacked_output_dir>
  export LLAMA3_TOKENIZER_PATH=<path_to_checkpoint_dir>/tokenizer.model  # Path needs to include the tokenizer.model file
  export LLAMA3_CACHE_PATH=<weight_cache_dir>
  export WH_ARCH_YAML=wormhole_b0_80_arch_eth_dispatch.yaml
  export TIKTOKEN_CACHE_DIR=""

  pip install -r models/demos/t3000/llama2_70b/reference/llama/requirements.txt

  # Example:
  # export LLAMA3_CKPT_DIR="/home/llama-data-repacked/llama-3-70b/"
  # export LLAMA3_TOKENIZER_PATH="/home/llama-data-repacked/tokenizer.model"
  # export LLAMA3_CACHE_PATH="/home/llama-data-cache/weights-cache"
  ```
- Run the demo:

  Note: Run the following command twice.
  - The first run will cache the weights. This will take some time.
  - The second run will use the cached weights, thereby running much faster.

  ```bash
  # Run the demo using sampling decode
  pytest -svv models/demos/t3000/llama3_70b/demo/demo.py::test_LlamaModel_demo[wormhole_b0-True-device_params0-short_context-check_disabled-sampling-tt-70b-T3000-80L-decode_only-trace_mode_on-text_completion-llama3]
  ```
- Run the performance test:

  The above demo does not achieve peak performance because we log outputs to the screen. The following perf test will print an accurate end-to-end throughput number. For best performance, ensure that tt-metal is built in release mode (the default) and that the host's CPU frequency governors are set to `performance`; instructions for setting the frequency governor vary by machine (one common approach is sketched after the command below). This performance test runs with sequence length 128 and batch size 32.

  ```bash
  pytest -svv models/demos/t3000/llama2_70b/tests/test_llama_perf_decode.py::test_Llama_perf_host[wormhole_b0-True-device_params0-gen128-llama3]
  ```
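As one way to set the governor, assuming a Linux host that exposes cpufreq via sysfs (tools such as `cpupower` or `tuned` may be preferred on your distribution):

```bash
# Set all cores to the "performance" governor via sysfs (requires root; not persistent across reboots).
echo performance | sudo tee /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor
# Verify the setting on one core.
cat /sys/devices/system/cpu/cpu0/cpufreq/scaling_governor
```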
Supported context lengths and batch sizes for the Llama3.1-70B demo are as follows:
| Context Length | Max Batch Size |
|---|---|
| 2k | 32 |
| 8k | 16 |
| 128k | 1 |
- Input File: Uses `./demo/data/multi_prompt.json`.
- Model Configuration: Utilizes a pretrained model.
- Hardware Requirements: Runs on an 8-chip T3000 machine using tensor parallelism. The host machine must have at least 512 GB of memory.
- Demo arguments (an example of changing the parametrization follows this list):
  - `context: [short_context, long_context, 128k_context]`: Select between short context (batch 32, sequence length 2k), long context (batch 16, sequence length 8k), and full context (batch 1, sequence length 128k)
  - `ground_truth: [check_disabled, check_enabled]`: Enable or disable ground-truth checking, used for testing
  - `sampling: [greedy, sampling]`: Select between greedy decoding and top-k/top-p sampling
  - `implementation: [tt-70b-T3000]`: Run the 70B model on the Tenstorrent backend
  - `num_layers: [1L, 2L, 10L, 80L]`: Select 80L to run the full model
  - `decode_only: [decode_only, prefill_decode]`: Use `prefill_decode`. Alternately, `decode_only` implements prefill via decode.
  - `chat: [text_completion, chat_completion]`: Run in `text_completion` mode for the pretrained model or `chat_completion` for the fine-tuned model
  - `llama_version: [llama3, llama2]`: Select the Llama3 model
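For example, swapping `short_context` for `long_context` in the demo command shown earlier selects the batch-16, 8k-sequence-length configuration. The exact parametrization string below is an assumption based on that command; confirm the available IDs with `pytest --collect-only`:

```bash
# Assumed parametrization for a long-context run; verify with `pytest --collect-only`.
pytest -svv models/demos/t3000/llama3_70b/demo/demo.py::test_LlamaModel_demo[wormhole_b0-True-device_params0-long_context-check_disabled-sampling-tt-70b-T3000-80L-decode_only-trace_mode_on-text_completion-llama3]
```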
Ensure you follow these guidelines to successfully run the Llama3-70B demo.
- Complete Step 1 and Step 2 of Running the Demo from TT-Metalium
- Install vLLM:

  ```bash
  # Installing from within `tt-metal`
  export VLLM_TARGET_DEVICE="tt"
  git clone https://github.com/tenstorrent/vllm.git
  cd vllm
  git checkout <TT_VLLM_COMMIT_SHA_OR_TAG>
  pip install -e .
  cd ..
  ```

  Note: `TT_VLLM_COMMIT_SHA_OR_TAG` is the vLLM "Release" version from the root README.
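For instance, using the vLLM commit from the setup example near the top of this guide (verify the current value in the root README):

```bash
git checkout 953161188c50f10da95a88ab305e23977ebd3750
```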
- Run the server:

  ```bash
  python vllm/examples/server_example_tt.py
  ```
- Interact with the server:

  In a separate terminal window, run:

  ```bash
  curl http://localhost:8000/v1/completions \
    -H "Content-Type: application/json" \
    -d '{
      "model": "meta-llama/Meta-Llama-3.1-70B",
      "prompt": "Write a poem about RISC-V",
      "max_tokens": 128,
      "temperature": 1,
      "top_p": 0.9,
      "top_k": 10,
      "stream": false
    }'
  ```
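To print only the generated text, you can pipe the response through `jq`. This assumes `jq` is installed and that the server returns the OpenAI-style completions schema:

```bash
# Same request as above, with the generated text extracted from choices[0].text.
curl -s http://localhost:8000/v1/completions \
  -H "Content-Type: application/json" \
  -d '{
        "model": "meta-llama/Meta-Llama-3.1-70B",
        "prompt": "Write a poem about RISC-V",
        "max_tokens": 128,
        "temperature": 1,
        "top_p": 0.9,
        "top_k": 10,
        "stream": false
      }' | jq -r '.choices[0].text'
```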