@@ -341,10 +341,24 @@ All these knobs are specified in ``mlc-chat-config.json`` generated by ``gen_config``.
    mlc_llm gen_config ./dist/models/phi-2/ \
        --quantization q4f16_1 --conv-template phi-2 \
        -o dist/phi-2-q4f16_1-MLC/
-   # 2. compile: compile model library with specification in mlc-chat-config.json
+   # 2. mkdir: create a directory to store the compiled model library
+   mkdir -p dist/libs
+   # 3. compile: compile model library with specification in mlc-chat-config.json
    mlc_llm compile ./dist/phi-2-q4f16_1-MLC/mlc-chat-config.json \
        --device iphone -o dist/libs/phi-2-q4f16_1-iphone.tar
 
+Given the compiled library, it is possible to calculate an upper bound for the VRAM
+usage during runtime. This is useful for checking whether a model is able to fit on
+particular hardware.
+That information is displayed at the end of the console log when ``compile`` is executed.
+It might look something like this:
+
+.. code:: shell
+
+   [2024-04-25 03:19:56] INFO model_metadata.py:96: Total memory usage: 1625.73 MB (Parameters: 1492.45 MB. KVCache: 0.00 MB. Temporary buffer: 133.28 MB)
+   [2024-04-25 03:19:56] INFO model_metadata.py:105: To reduce memory usage, tweak `prefill_chunk_size`, `context_window_size` and `sliding_window_size`
+   [2024-04-25 03:19:56] INFO compile.py:198: Generated: dist/libs/phi-2-q4f16_1-iphone.tar
+
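+If the reported bound exceeds your device's budget, one way to lower it is to
+regenerate the chat config with a smaller prefill chunk size and recompile. Below is
+a minimal sketch reusing the commands above; the flag spelling follows the note
+below, so check ``mlc_llm gen_config --help`` for the exact name in your version:
+
+.. code:: shell
+
+   mlc_llm gen_config ./dist/models/phi-2/ \
+       --quantization q4f16_1 --conv-template phi-2 \
+       --prefill_chunk_size 128 \
+       -o dist/phi-2-q4f16_1-MLC/
+   mlc_llm compile ./dist/phi-2-q4f16_1-MLC/mlc-chat-config.json \
+       --device iphone -o dist/libs/phi-2-q4f16_1-iphone.tar
+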
 .. note::
     When compiling larger models like ``Llama-2-7B``, you may want to use a lower chunk size
     when prefilling prompts, ``--prefill_chunk_size 128``, or an even lower ``context_window_size``\
@@ -388,21 +402,7 @@ This would result in something like `phi-2-q4f16_1-MLC
 <https://huggingface.co/mlc-ai/phi-2-q4f16_1-MLC/tree/main>`_.
 
 
-**Step 4. Calculate estimated VRAM usage**
-
-Given the compiled library, it is possible to calculate an upper bound for the VRAM
-usage during runtime. This useful to better understand if a model is able to fit particular
-hardware. We can calculate this estimate using the following command:
-
-.. code:: shell
-
-   ~/mlc-llm > python -m mlc_llm.cli.model_metadata ./dist/libs/phi-2-q4f16_1-iphone.tar \
-   >           --memory-only --mlc-chat-config ./dist/phi-2-q4f16_1-MLC/mlc-chat-config.json
-   INFO model_metadata.py:90: Total memory usage: 3042.96 MB (Parameters: 1492.45 MB. KVCache: 640.00 MB. Temporary buffer: 910.51 MB)
-   INFO model_metadata.py:99: To reduce memory usage, tweak `prefill_chunk_size`, `context_window_size` and `sliding_window_size`
-
-
-**Step 5. Register as a ModelRecord**
+**Step 4. Register as a ModelRecord**
 
 Finally, we update the code snippet for
 `app-config.json <https://github.com/mlc-ai/mlc-llm/blob/main/ios/MLCChat/app-config.json>`__
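+A hypothetical entry might look like the sketch below. The field names
+(``model_id``, ``model_lib``, ``model_url``, ``estimated_vram_bytes``) and values
+are illustrative assumptions, so consult the linked ``app-config.json`` for the
+authoritative schema; the memory bound reported by ``compile`` above is a natural
+source for the VRAM estimate.
+
+.. code:: json
+
+   {
+     "model_list": [
+       {
+         "model_id": "phi-2-q4f16_1",
+         "model_lib": "phi_msft_q4f16_1",
+         "model_url": "https://huggingface.co/mlc-ai/phi-2-q4f16_1-MLC",
+         "estimated_vram_bytes": 1705000000
+       }
+     ]
+   }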