Update docs on how to interact with local LLM #1128

Merged (22 commits) on Jul 21, 2023

Commits:
a09fb16  Update docstring for oai.completion. (LeoLjl, Jul 10, 2023)
daa65b7  Merge branch 'main' into main (LeoLjl, Jul 10, 2023)
641da59  Merge branch 'microsoft:main' into main (LeoLjl, Jul 10, 2023)
34a973b  Merge branch 'microsoft:main' into main (LeoLjl, Jul 13, 2023)
a9f907d  Update docs about how to interact with local LLMs (LeoLjl, Jul 14, 2023)
1a6fe3f  Update docs about how to interact with local LLMs (LeoLjl, Jul 14, 2023)
effa544  Merge branch 'main' of https://github.com/LeoLjl/FLAML into main (LeoLjl, Jul 14, 2023)
79b3cdb  Merge branch 'microsoft:main' into main (LeoLjl, Jul 14, 2023)
f4bab94  Reformat file. (LeoLjl, Jul 14, 2023)
27df6ba  Merge branch 'main' of https://github.com/LeoLjl/FLAML into main (LeoLjl, Jul 14, 2023)
174c931  Fix issues. (LeoLjl, Jul 15, 2023)
3fdc09b  Update website/blog/2023-07-14-Local-LLMs/index.mdx (LeoLjl, Jul 15, 2023)
06e1ef5  Update website/blog/2023-07-14-Local-LLMs/index.mdx (LeoLjl, Jul 15, 2023)
6335790  Update website/docs/Use-Cases/Auto-Generation.md (LeoLjl, Jul 15, 2023)
a9e18a9  Add documents about multiple workers. (LeoLjl, Jul 15, 2023)
623d760  Merge branch 'main' of https://github.com/LeoLjl/FLAML (LeoLjl, Jul 15, 2023)
80b8459  Update user instructions. (LeoLjl, Jul 16, 2023)
3edc33d  Merge branch 'main' into main (LeoLjl, Jul 17, 2023)
98bd46d  Label big fix as optional (LeoLjl, Jul 18, 2023)
4d890bf  Merge branch 'main' of https://github.com/LeoLjl/FLAML into main (LeoLjl, Jul 18, 2023)
081e5ef  Merge branch 'main' into main (LeoLjl, Jul 18, 2023)
ea0f125  Update website/blog/2023-07-14-Local-LLMs/index.mdx (LeoLjl, Jul 18, 2023)
147 changes: 147 additions & 0 deletions website/blog/2023-07-14-Local-LLMs/index.mdx
@@ -0,0 +1,147 @@
---
title: Use flaml.autogen for local LLMs
authors: jialeliu
tags: [LLM, FLAMLv2]
---
**TL;DR:**
We demonstrate how to use flaml.autogen for local LLM applications. As an example, we will initiate an endpoint using [FastChat](https://github.com/lm-sys/FastChat) and perform inference on [ChatGLM2-6B](https://github.com/THUDM/ChatGLM2-6B).

## Preparations

### Clone FastChat

FastChat provides OpenAI-compatible APIs for its supported models, so you can use FastChat as a local drop-in replacement for OpenAI APIs. However, its code may need a minor modification in order to function properly.

```bash
git clone https://github.com/lm-sys/FastChat.git
cd FastChat
```

### Download checkpoint

ChatGLM-6B is an open bilingual language model based on the General Language Model (GLM) framework, with 6.2 billion parameters. ChatGLM2-6B is its second-generation version.

Before downloading from HuggingFace Hub, you need to have Git LFS [installed](https://docs.github.com/en/repositories/working-with-files/managing-large-files/installing-git-large-file-storage).
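
If Git LFS is not already set up, a minimal sketch on a Debian/Ubuntu machine (assuming the `git-lfs` apt package; see the link above for other platforms) is:

```bash
# Install Git LFS and enable it for the current user
sudo apt-get install git-lfs
git lfs install
```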

```bash
git clone https://huggingface.co/THUDM/chatglm2-6b
```

## Initiate server

First, launch the controller:

```bash
python -m fastchat.serve.controller
```

Then, launch the model worker(s):

```bash
python -m fastchat.serve.model_worker --model-path chatglm2-6b
```

Finally, launch the RESTful API server:

```bash
python -m fastchat.serve.openai_api_server --host localhost --port 8000
```

Normally this will work. However, if you encounter an error like [this](https://github.com/lm-sys/FastChat/issues/1641), commenting out all the lines containing `finish_reason` in `fastchat/protocol/api_protocal.py` and `fastchat/protocol/openai_api_protocol.py` will fix the problem. The modified code looks like:

```python
class CompletionResponseChoice(BaseModel):
    index: int
    text: str
    logprobs: Optional[int] = None
    # finish_reason: Optional[Literal["stop", "length"]]


class CompletionResponseStreamChoice(BaseModel):
    index: int
    text: str
    logprobs: Optional[float] = None
    # finish_reason: Optional[Literal["stop", "length"]] = None
```


## Interact with model using `oai.Completion`

Now the models can be accessed directly through the openai-python library, as well as through `flaml.oai.Completion` and `flaml.oai.ChatCompletion`.
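
For instance, with the legacy openai-python (v0.x) interface, a request might look like the following sketch; the endpoint URL and placeholder API key mirror the server configuration above:

```python
import openai

# Point the legacy openai-python (v0.x) client at the local FastChat endpoint
openai.api_base = "http://localhost:8000/v1"
openai.api_key = "NULL"  # placeholder; the local server does not validate it

response = openai.ChatCompletion.create(
    model="chatglm2-6b",
    messages=[{"role": "user", "content": "Hi"}],
)
print(response)
```

The same requests can be sent through `flaml.oai`: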


```python
from flaml import oai

# create a text completion request
response = oai.Completion.create(
    config_list=[
        {
            "model": "chatglm2-6b",
            "api_base": "http://localhost:8000/v1",
            "api_type": "open_ai",
            "api_key": "NULL",  # just a placeholder
        }
    ],
    prompt="Hi",
)
print(response)

# create a chat completion request
response = oai.ChatCompletion.create(
    config_list=[
        {
            "model": "chatglm2-6b",
            "api_base": "http://localhost:8000/v1",
            "api_type": "open_ai",
            "api_key": "NULL",
        }
    ],
    messages=[{"role": "user", "content": "Hi"}],
)
print(response)
```

If you would like to switch to different models, download their checkpoints and specify the model path when launching the model worker(s).
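
For example, to serve Vicuna instead, a sketch of the worker launch would be (assuming the `lmsys/vicuna-7b-v1.3` checkpoint, which FastChat can resolve from the HuggingFace Hub):

```bash
# Launch a model worker for a different checkpoint
python -m fastchat.serve.model_worker --model-path lmsys/vicuna-7b-v1.3
```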

## Interacting with multiple local LLMs

If you would like to interact with multiple LLMs on your local machine, replace the `model_worker` step above with the multi-model variant:

```bash
python -m fastchat.serve.multi_model_worker \
    --model-path lmsys/vicuna-7b-v1.3 \
    --model-names vicuna-7b-v1.3 \
    --model-path chatglm2-6b \
    --model-names chatglm2-6b
```

The inference code would be:

```python
from flaml import oai

# create a chat completion request
response = oai.ChatCompletion.create(
    config_list=[
        {
            "model": "chatglm2-6b",
            "api_base": "http://localhost:8000/v1",
            "api_type": "open_ai",
            "api_key": "NULL",
        },
        {
            "model": "vicuna-7b-v1.3",
            "api_base": "http://localhost:8000/v1",
            "api_type": "open_ai",
            "api_key": "NULL",
        },
    ],
    messages=[{"role": "user", "content": "Hi"}],
)
print(response)
```

## For Further Reading

* [Documentation](/docs/Use-Cases/Auto-Generation) about `flaml.autogen`
* [Documentation](https://github.com/lm-sys/FastChat) about FastChat
6 changes: 6 additions & 0 deletions website/blog/authors.yml
@@ -9,3 +9,9 @@ qingyunwu:
  title: Assistant Professor at the Pennsylvania State University
  url: https://qingyun-wu.github.io/
  image_url: https://github.com/qingyun-wu.png

jialeliu:
  name: Jiale Liu
  title: Undergraduate student at Xidian University
  url: https://leoljl.github.io
  image_url: https://github.com/LeoLjl/leoljl.github.io/blob/main/profile.jpg?raw=true
2 changes: 1 addition & 1 deletion website/docs/Use-Cases/Auto-Generation.md
@@ -249,7 +249,7 @@ The tuned config can be used to perform inference.
`flaml.oai.Completion.create` is compatible with both `openai.Completion.create` and `openai.ChatCompletion.create`, and both OpenAI API and Azure OpenAI API. So models such as "text-davinci-003", "gpt-3.5-turbo" and "gpt-4" can share a common API.
When chat models are used and `prompt` is given as the input to `flaml.oai.Completion.create`, the prompt will be automatically converted into `messages` to fit the chat completion API requirement. One advantage is that one can experiment with both chat and non-chat models for the same prompt in a unified API.

- For local LLMs, one can spin up an endpoint using a package like [simple_ai_server](https://github.com/lhenault/simpleAI), and then use the same API to send a request.
+ For local LLMs, one can spin up an endpoint using a package like [simple_ai_server](https://github.com/lhenault/simpleAI) or [FastChat](https://github.com/lm-sys/FastChat), and then use the same API to send a request. See [here](../../blog/2023/07/14/Local-LLMs) for examples of how to run inference with local LLMs.

When only working with the chat-based models, `flaml.oai.ChatCompletion` can be used. It also does automatic conversion from prompt to messages, if prompt is provided instead of messages.
