diff --git a/docs/deploy/ios.rst b/docs/deploy/ios.rst
index c0217db9e9..75a5cdbdc7 100644
--- a/docs/deploy/ios.rst
+++ b/docs/deploy/ios.rst
@@ -341,10 +341,24 @@ All these knobs are specified in ``mlc-chat-config.json`` generated by ``gen_con
     mlc_llm gen_config ./dist/models/phi-2/ \
         --quantization q4f16_1 --conv-template phi-2 \
         -o dist/phi-2-q4f16_1-MLC/
-    # 2. compile: compile model library with specification in mlc-chat-config.json
+    # 2. mkdir: create a directory to store the compiled model library
+    mkdir -p dist/libs
+    # 3. compile: compile model library with specification in mlc-chat-config.json
     mlc_llm compile ./dist/phi-2-q4f16_1-MLC/mlc-chat-config.json \
         --device iphone -o dist/libs/phi-2-q4f16_1-iphone.tar
 
+Given the compiled library, it is possible to calculate an upper bound for the VRAM
+usage during runtime. This is useful to better understand whether a model is able to fit
+on particular hardware.
+That information is displayed at the end of the console log when the ``compile`` command is executed.
+It might look something like this:
+
+.. code:: shell
+
+    [2024-04-25 03:19:56] INFO model_metadata.py:96: Total memory usage: 1625.73 MB (Parameters: 1492.45 MB. KVCache: 0.00 MB. Temporary buffer: 133.28 MB)
+    [2024-04-25 03:19:56] INFO model_metadata.py:105: To reduce memory usage, tweak `prefill_chunk_size`, `context_window_size` and `sliding_window_size`
+    [2024-04-25 03:19:56] INFO compile.py:198: Generated: dist/libs/phi-2-q4f16_1-iphone.tar
+
 .. note::
     When compiling larger models like ``Llama-2-7B``, you may want to add a lower chunk size
     while prefilling prompts ``--prefill_chunk_size 128`` or even lower ``context_window_size``\
@@ -388,21 +402,7 @@ This would result in something like `phi-2-q4f16_1-MLC
 `_.
 
-**Step 4. Calculate estimated VRAM usage**
-
-Given the compiled library, it is possible to calculate an upper bound for the VRAM
-usage during runtime. This useful to better understand if a model is able to fit particular
-hardware. We can calculate this estimate using the following command:
-
-.. code:: shell
-
-    ~/mlc-llm > python -m mlc_llm.cli.model_metadata ./dist/libs/phi-2-q4f16_1-iphone.tar \
-    >     --memory-only --mlc-chat-config ./dist/phi-2-q4f16_1-MLC/mlc-chat-config.json
-    INFO model_metadata.py:90: Total memory usage: 3042.96 MB (Parameters: 1492.45 MB. KVCache: 640.00 MB. Temporary buffer: 910.51 MB)
-    INFO model_metadata.py:99: To reduce memory usage, tweak `prefill_chunk_size`, `context_window_size` and `sliding_window_size`
-
-
-**Step 5. Register as a ModelRecord**
+**Step 4. Register as a ModelRecord**
 
 Finally, we update the code snippet for `app-config.json
 `__
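
A note for readers sizing models to devices: the upper-bound estimate that now appears in
the ``compile`` log can also be reproduced offline, using the ``model_metadata`` command
that the removed Step 4 documented. This is a sketch reusing that command verbatim; it
assumes the ``mlc_llm.cli.model_metadata`` entry point remains available after this change
and uses the phi-2 paths from the example above.

.. code:: shell

    # Print only the memory estimate for an already-compiled model library.
    python -m mlc_llm.cli.model_metadata ./dist/libs/phi-2-q4f16_1-iphone.tar \
        --memory-only --mlc-chat-config ./dist/phi-2-q4f16_1-MLC/mlc-chat-config.json

The reported total is the sum of the three components shown in the log line
(Parameters + KVCache + Temporary buffer; for the phi-2 log above,
1492.45 + 0.00 + 133.28 = 1625.73 MB), which is why the log suggests tweaking
``prefill_chunk_size``, ``context_window_size`` and ``sliding_window_size`` to
reduce memory usage.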
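
For the final registration step, the ``app-config.json`` snippet the text refers to is a
``ModelRecord`` entry in the app's model list. The sketch below shows the general shape
only; the field names (``model_url``, ``model_id``, ``model_lib``,
``estimated_vram_bytes``) and all values are illustrative assumptions, so consult the
linked ``app-config.json`` for the authoritative schema.

.. code:: json

    {
      "model_list": [
        {
          "model_url": "https://huggingface.co/mlc-ai/phi-2-q4f16_1-MLC",
          "model_id": "phi-2-q4f16_1",
          "model_lib": "phi_msft_q4f16_1",
          "estimated_vram_bytes": 1705000000
        }
      ]
    }

If the schema includes a field like ``estimated_vram_bytes``, it is the natural place to
record the compile-log estimate discussed above, so the app can warn when a model is
unlikely to fit on the device.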