32 changes: 16 additions & 16 deletions docs/deploy/ios.rst
@@ -341,10 +341,24 @@ All these knobs are specified in ``mlc-chat-config.json`` generated by ``gen_config``
mlc_llm gen_config ./dist/models/phi-2/ \
--quantization q4f16_1 --conv-template phi-2 \
-o dist/phi-2-q4f16_1-MLC/
- # 2. compile: compile model library with specification in mlc-chat-config.json
+ # 2. mkdir: create a directory to store the compiled model library
+ mkdir -p dist/libs
+ # 3. compile: compile model library with specification in mlc-chat-config.json
mlc_llm compile ./dist/phi-2-q4f16_1-MLC/mlc-chat-config.json \
--device iphone -o dist/libs/phi-2-q4f16_1-iphone.tar

+ Given the compiled library, it is possible to calculate an upper bound for the VRAM
+ usage during runtime. This is useful for understanding whether a model can fit on
+ particular hardware.
+ That information will be displayed at the end of the console log when ``compile`` is executed.
+ It might look something like this:

+ .. code:: shell
+
+ [2024-04-25 03:19:56] INFO model_metadata.py:96: Total memory usage: 1625.73 MB (Parameters: 1492.45 MB. KVCache: 0.00 MB. Temporary buffer: 133.28 MB)
+ [2024-04-25 03:19:56] INFO model_metadata.py:105: To reduce memory usage, tweak `prefill_chunk_size`, `context_window_size` and `sliding_window_size`
+ [2024-04-25 03:19:56] INFO compile.py:198: Generated: dist/libs/phi-2-q4f16_1-iphone.tar
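
The reported total is the sum of the three components: ``1492.45 + 0.00 + 133.28 = 1625.73`` MB.
As a quick illustration (not part of MLC LLM; the device budget below is an assumed
figure, not an MLC LLM value), a few lines of Python show how one might compare such
an estimate against the memory available on a target device:

.. code:: python

   # Illustrative only: compare the compile-time estimate above against an
   # assumed device budget. Component values are taken from the log output.
   params_mb = 1492.45      # Parameters
   kv_cache_mb = 0.00       # KVCache
   temp_buffer_mb = 133.28  # Temporary buffer

   total_mb = params_mb + kv_cache_mb + temp_buffer_mb  # 1625.73 MB, as reported
   device_budget_mb = 3500.0  # assumed usable memory on the target device

   print(f"Estimated upper bound: {total_mb:.2f} MB")
   print(f"Fits within budget: {total_mb <= device_budget_mb}")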

.. note::
When compiling larger models like ``Llama-2-7B``, you may want to set a lower chunk size
for prefilling prompts, ``--prefill_chunk_size 128``, or an even lower ``context_window_size``\
@@ -388,21 +402,7 @@ This would result in something like `phi-2-q4f16_1-MLC
<https://huggingface.co/mlc-ai/phi-2-q4f16_1-MLC/tree/main>`_.
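
As a sketch of one way to publish the converted weights, the ``huggingface_hub``
Python API can upload the output directory. The repository name below is an
illustrative placeholder, not an official repository:

.. code:: python

   # Illustrative upload sketch; requires `pip install huggingface_hub` and a
   # valid access token (e.g. via `huggingface-cli login`). The repo id is an
   # example placeholder, not an official mlc-ai repository.
   from huggingface_hub import HfApi

   api = HfApi()
   api.create_repo("my-username/phi-2-q4f16_1-MLC", repo_type="model", exist_ok=True)
   api.upload_folder(
       folder_path="./dist/phi-2-q4f16_1-MLC",
       repo_id="my-username/phi-2-q4f16_1-MLC",
       repo_type="model",
   )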


- **Step 4. Calculate estimated VRAM usage**
-
- Given the compiled library, it is possible to calculate an upper bound for the VRAM
- usage during runtime. This useful to better understand if a model is able to fit particular
- hardware. We can calculate this estimate using the following command:
-
- .. code:: shell
-
- ~/mlc-llm > python -m mlc_llm.cli.model_metadata ./dist/libs/phi-2-q4f16_1-iphone.tar \
- > --memory-only --mlc-chat-config ./dist/phi-2-q4f16_1-MLC/mlc-chat-config.json
- INFO model_metadata.py:90: Total memory usage: 3042.96 MB (Parameters: 1492.45 MB. KVCache: 640.00 MB. Temporary buffer: 910.51 MB)
- INFO model_metadata.py:99: To reduce memory usage, tweak `prefill_chunk_size`, `context_window_size` and `sliding_window_size`
-
-
- **Step 5. Register as a ModelRecord**
+ **Step 4. Register as a ModelRecord**

Finally, we update the code snippet for
`app-config.json <https://github.com/mlc-ai/mlc-llm/blob/main/ios/MLCChat/app-config.json>`__
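
For reference, a ``ModelRecord`` entry might look like the sketch below, written in
Python so the structure is explicit. The field names mirror existing entries in
``app-config.json`` and should be treated as illustrative; ``estimated_vram_bytes``
can be derived from the compile-time estimate shown earlier:

.. code:: python

   # Illustrative ModelRecord sketch for app-config.json. Field names follow
   # existing entries in the repository's app-config.json and may change;
   # the values below are examples, not official entries.
   import json

   model_record = {
       "model_id": "phi-2-q4f16_1",
       "model_lib": "phi_msft_q4f16_1",  # assumed model library name
       "model_url": "https://huggingface.co/mlc-ai/phi-2-q4f16_1-MLC",
       "estimated_vram_bytes": 1704701460,  # ~1625.73 MB, from the compile log
   }
   print(json.dumps(model_record, indent=2))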