@@ -341,10 +341,24 @@ All these knobs are specified in ``mlc-chat-config.json`` generated by ``gen_config``.
    mlc_llm gen_config ./dist/models/phi-2/ \
        --quantization q4f16_1 --conv-template phi-2 \
        -o dist/phi-2-q4f16_1-MLC/
-   # 2. compile: compile model library with specification in mlc-chat-config.json
+   # 2. mkdir: create a directory to store the compiled model library
+   mkdir -p dist/libs
+   # 3. compile: compile model library with specification in mlc-chat-config.json
    mlc_llm compile ./dist/phi-2-q4f16_1-MLC/mlc-chat-config.json \
        --device iphone -o dist/libs/phi-2-q4f16_1-iphone.tar
 
+Given the compiled library, it is possible to calculate an upper bound for the VRAM
+usage during runtime. This is useful for checking whether a model is able to fit on
+particular hardware.
+That information is displayed at the end of the console log when ``compile`` is executed.
+It might look something like this:
+
+.. code:: shell
+
+   [2024-04-25 03:19:56] INFO model_metadata.py:96: Total memory usage: 1625.73 MB (Parameters: 1492.45 MB. KVCache: 0.00 MB. Temporary buffer: 133.28 MB)
+   [2024-04-25 03:19:56] INFO model_metadata.py:105: To reduce memory usage, tweak `prefill_chunk_size`, `context_window_size` and `sliding_window_size`
+   [2024-04-25 03:19:56] INFO compile.py:198: Generated: dist/libs/phi-2-q4f16_1-iphone.tar
+
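+If the reported bound exceeds your device's budget, one way to lower it is to
+regenerate the chat config with a smaller prefill chunk size and recompile. Below is
+a minimal sketch reusing the commands above; the flag spelling follows the note
+below, so check ``mlc_llm gen_config --help`` for the exact name in your version:
+
+.. code:: shell
+
+   mlc_llm gen_config ./dist/models/phi-2/ \
+       --quantization q4f16_1 --conv-template phi-2 \
+       --prefill_chunk_size 128 \
+       -o dist/phi-2-q4f16_1-MLC/
+   mlc_llm compile ./dist/phi-2-q4f16_1-MLC/mlc-chat-config.json \
+       --device iphone -o dist/libs/phi-2-q4f16_1-iphone.tar
+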
 .. note::
     When compiling larger models like ``Llama-2-7B``, you may want to use a lower chunk size
     when prefilling prompts, ``--prefill_chunk_size 128``, or an even lower ``context_window_size``\
@@ -388,21 +402,7 @@ This would result in something like `phi-2-q4f16_1-MLC
 <https://huggingface.co/mlc-ai/phi-2-q4f16_1-MLC/tree/main>`_.
 
 
-**Step 4. Calculate estimated VRAM usage**
-
-Given the compiled library, it is possible to calculate an upper bound for the VRAM
-usage during runtime. This useful to better understand if a model is able to fit particular
-hardware. We can calculate this estimate using the following command:
-
-.. code:: shell
-
-   ~/mlc-llm > python -m mlc_llm.cli.model_metadata ./dist/libs/phi-2-q4f16_1-iphone.tar \
-   >           --memory-only --mlc-chat-config ./dist/phi-2-q4f16_1-MLC/mlc-chat-config.json
-   INFO model_metadata.py:90: Total memory usage: 3042.96 MB (Parameters: 1492.45 MB. KVCache: 640.00 MB. Temporary buffer: 910.51 MB)
-   INFO model_metadata.py:99: To reduce memory usage, tweak `prefill_chunk_size`, `context_window_size` and `sliding_window_size`
-
-
-**Step 5. Register as a ModelRecord**
+**Step 4. Register as a ModelRecord**
 
 Finally, we update the code snippet for
 `app-config.json <https://github.com/mlc-ai/mlc-llm/blob/main/ios/MLCChat/app-config.json>`__
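+A hypothetical entry might look like the sketch below. The field names
+(``model_id``, ``model_lib``, ``model_url``, ``estimated_vram_bytes``) and values
+are illustrative assumptions, so consult the linked ``app-config.json`` for the
+authoritative schema; the memory bound reported by ``compile`` above is a natural
+source for the VRAM estimate.
+
+.. code:: json
+
+   {
+     "model_list": [
+       {
+         "model_id": "phi-2-q4f16_1",
+         "model_lib": "phi_msft_q4f16_1",
+         "model_url": "https://huggingface.co/mlc-ai/phi-2-q4f16_1-MLC",
+         "estimated_vram_bytes": 1705000000
+       }
+     ]
+   }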