Commit 470a42a

[Docs] Update deploy/ios#bring-your-own-model-library (#2235)
remove model metadata step and make minor fixes (#1)
1 parent 3139fd7 commit 470a42a

File tree

1 file changed: +16 −16 lines


docs/deploy/ios.rst

Lines changed: 16 additions & 16 deletions
@@ -341,10 +341,24 @@ All these knobs are specified in ``mlc-chat-config.json`` generated by ``gen_config``
     mlc_llm gen_config ./dist/models/phi-2/ \
         --quantization q4f16_1 --conv-template phi-2 \
         -o dist/phi-2-q4f16_1-MLC/
-    # 2. compile: compile model library with specification in mlc-chat-config.json
+    # 2. mkdir: create a directory to store the compiled model library
+    mkdir -p dist/libs
+    # 3. compile: compile model library with specification in mlc-chat-config.json
     mlc_llm compile ./dist/phi-2-q4f16_1-MLC/mlc-chat-config.json \
         --device iphone -o dist/libs/phi-2-q4f16_1-iphone.tar
 
+Given the compiled library, it is possible to calculate an upper bound for the VRAM
+usage during runtime. This is useful to better understand whether a model is able to fit
+on particular hardware.
+That information is displayed at the end of the console log when ``compile`` is executed.
+It might look something like this:
+
+.. code:: shell
+
+    [2024-04-25 03:19:56] INFO model_metadata.py:96: Total memory usage: 1625.73 MB (Parameters: 1492.45 MB. KVCache: 0.00 MB. Temporary buffer: 133.28 MB)
+    [2024-04-25 03:19:56] INFO model_metadata.py:105: To reduce memory usage, tweak `prefill_chunk_size`, `context_window_size` and `sliding_window_size`
+    [2024-04-25 03:19:56] INFO compile.py:198: Generated: dist/libs/phi-2-q4f16_1-iphone.tar
+
 .. note::
     When compiling larger models like ``Llama-2-7B``, you may want to add a lower chunk size
     while prefilling prompts ``--prefill_chunk_size 128`` or even lower ``context_window_size``\
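
The note above suggests lowering ``prefill_chunk_size`` or ``context_window_size`` when memory is tight, and the compile log shows exactly where the savings land: the reported total is Parameters + KVCache + Temporary buffer (1492.45 + 0.00 + 133.28 = 1625.73 MB in the example). As a minimal sketch of how those knobs slot into the commands above, the overrides would be applied at the ``gen_config`` step and the library then recompiled. The flag spellings follow the note and the ``context_window_size`` value is an arbitrary example, so confirm both against ``mlc_llm gen_config --help`` on your install.

.. code:: shell

    # Illustrative sketch only: regenerate mlc-chat-config.json with smaller memory
    # knobs, then recompile the library. Flag spellings mirror the note above and the
    # value 1024 is a placeholder; verify with `mlc_llm gen_config --help`.
    mlc_llm gen_config ./dist/models/phi-2/ \
        --quantization q4f16_1 --conv-template phi-2 \
        --prefill_chunk_size 128 --context_window_size 1024 \
        -o dist/phi-2-q4f16_1-MLC/
    mlc_llm compile ./dist/phi-2-q4f16_1-MLC/mlc-chat-config.json \
        --device iphone -o dist/libs/phi-2-q4f16_1-iphone.tar

Re-running ``compile`` after such a change prints an updated "Total memory usage" line, which is the quickest way to check whether the adjustment brought the model within the target device's budget.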
@@ -388,21 +402,7 @@ This would result in something like `phi-2-q4f16_1-MLC
 <https://huggingface.co/mlc-ai/phi-2-q4f16_1-MLC/tree/main>`_.
 
 
-**Step 4. Calculate estimated VRAM usage**
-
-Given the compiled library, it is possible to calculate an upper bound for the VRAM
-usage during runtime. This useful to better understand if a model is able to fit particular
-hardware. We can calculate this estimate using the following command:
-
-.. code:: shell
-
-    ~/mlc-llm > python -m mlc_llm.cli.model_metadata ./dist/libs/phi-2-q4f16_1-iphone.tar \
-    >           --memory-only --mlc-chat-config ./dist/phi-2-q4f16_1-MLC/mlc-chat-config.json
-    INFO model_metadata.py:90: Total memory usage: 3042.96 MB (Parameters: 1492.45 MB. KVCache: 640.00 MB. Temporary buffer: 910.51 MB)
-    INFO model_metadata.py:99: To reduce memory usage, tweak `prefill_chunk_size`, `context_window_size` and `sliding_window_size`
-
-
-**Step 5. Register as a ModelRecord**
+**Step 4. Register as a ModelRecord**
 
 Finally, we update the code snippet for
 `app-config.json <https://github.com/mlc-ai/mlc-llm/blob/main/ios/MLCChat/app-config.json>`__
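
Before registering the model, it can help to confirm that the artifacts from the earlier steps are where the record will expect them and to use an existing entry in ``app-config.json`` as the template. A minimal sketch, assuming the directory layout used in the commands above and a working directory at the root of the ``mlc-llm`` checkout:

.. code:: shell

    # Illustrative checks only: paths follow the earlier commands on this page.
    ls -lh dist/libs/phi-2-q4f16_1-iphone.tar    # compiled model library from the compile step
    ls dist/phi-2-q4f16_1-MLC/                   # mlc-chat-config.json generated by gen_config
    cat ios/MLCChat/app-config.json              # copy the shape of an existing record here

The actual fields of a ``ModelRecord`` should be taken from an existing entry in that file rather than from this sketch.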
