diff --git a/.gitattributes b/.gitattributes new file mode 100644 index 000000000000..e69de29bb2d1 diff --git a/notebook/agentchat_groupchat_customized.ipynb b/notebook/agentchat_groupchat_customized.ipynb index 08f03e0f5911..253a0ec939b9 100644 --- a/notebook/agentchat_groupchat_customized.ipynb +++ b/notebook/agentchat_groupchat_customized.ipynb @@ -37,9 +37,6 @@ "```\n", "The last speaker and the groupchat object are passed to the function. Commonly used variables from groupchat are `groupchat.messages` and `groupchat.agents`, which are the message history and the agents in the group chat, respectively. You can access other attributes of the groupchat, such as `groupchat.allowed_speaker_transitions_dict` for the pre-defined allowed speaker transitions. \n", "\n", - "\n", - "\n", - "\n", "````{=mdx}\n", ":::info Requirements\n", "Install `pyautogen`:\n", @@ -85,7 +82,7 @@ "source": [ "````{=mdx}\n", ":::tip\n", - "Learn more about configuring LLMs for agents [here](/docs/llm_configuration).\n", + "Learn more about configuring LLMs for agents [here](/docs/topics/llm_configuration).\n", ":::\n", "````\n", "\n", @@ -443,16 +440,16 @@ ")\n", "# type exit to terminate the chat" ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [] } ], "metadata": { + "front_matter": { + "description": "Introduce Group Chat with Customized Speaker Selection Method", + "tags": [ + "orchestration", + "group chat" + ] + }, "kernelspec": { "display_name": "flaml", "language": "python", diff --git a/notebook/agentchat_groupchat_stateflow.ipynb b/notebook/agentchat_groupchat_stateflow.ipynb new file mode 100644 index 000000000000..461687b9070d --- /dev/null +++ b/notebook/agentchat_groupchat_stateflow.ipynb @@ -0,0 +1,546 @@ +{ + "cells": [ + { + "attachments": {}, + "cell_type": "markdown", + "metadata": {}, + "source": [ + "# StateFlow: Build Workflows through State-Oriented Actions\n", + "\n", + "AutoGen offers conversable agents powered by LLMs, tools, or humans, which can be used to perform tasks collectively via automated chat. In this notebook, we show how to use group chat to build workflows with AutoGen agents from a state-oriented perspective.\n", + "\n", + "\n", + "````{=mdx}\n", + ":::info Requirements\n", + "Install `pyautogen`:\n", + "```bash\n", + "pip install pyautogen\n", + "```\n", + "\n", + "For more information, please refer to the [installation guide](/docs/installation/).\n", + ":::\n", + "````" ] + }, + { + "attachments": {}, + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Set your API Endpoint\n", + "\n", + "The [`config_list_from_json`](https://microsoft.github.io/autogen/docs/reference/oai/openai_utils#config_list_from_json) function loads a list of configurations from an environment variable or a JSON file." ] + }, + { + "cell_type": "code", + "execution_count": 1, + "metadata": {}, + "outputs": [], + "source": [ + "import autogen\n", + "\n", + "config_list = autogen.config_list_from_json(\n", + " \"OAI_CONFIG_LIST\",\n", + " filter_dict={\n", + " \"model\": [\"gpt-4\", \"gpt-4-1106-preview\"],\n", + " },\n", + ")" ] + }, + { + "attachments": {}, + "cell_type": "markdown", + "metadata": {}, + "source": [ + "````{=mdx}\n", + ":::tip\n", + "Learn more about configuring LLMs for agents [here](/docs/topics/llm_configuration).\n", + ":::\n", + "````\n", + "\n", + "## A workflow for research\n", + "\n", + "
\n", + " \"SF_Example_1\"\n", + " \n", + "
\n", + "\n", + "We define the following agents:\n", + "- Initializer: Start the workflow by sending a task.\n", + "- Coder: Retrieve papers from the internet by writing code.\n", + "- Executor: Execute the code.\n", + "- Scientist: Read the papers and write a summary.\n", + "\n", + "\n", + "In the Figure, we define a simple workflow for research with 4 states: Init, Retrieve, Reserach and End. Within each state, we will call different agents to perform the tasks.\n", + "- Init: We use the initializer to start the workflow.\n", + "- Retrieve: We will first call the coder to write code and then call the executor to execute the code.\n", + "- Research: We will call the scientist to read the papers and write a summary.\n", + "- End: We will end the workflow.\n", + "\n", + "Through customizing the speaker selection method, we can easily realize the state-oriented workflow by defining the transitions between different agents." + ] + }, + { + "cell_type": "code", + "execution_count": 2, + "metadata": {}, + "outputs": [ + { + "name": "stderr", + "output_type": "stream", + "text": [ + "/home/ykw5399/maintain/autogen/autogen/agentchat/user_proxy_agent.py:83: UserWarning: Using None to signal a default code_execution_config is deprecated. Use {} to use default or False to disable code execution.\n", + " super().__init__(\n" + ] + } + ], + "source": [ + "gpt4_config = {\n", + " \"cache_seed\": 42, # change the cache_seed for different trials\n", + " \"temperature\": 0,\n", + " \"config_list\": config_list,\n", + " \"timeout\": 120,\n", + "}\n", + "\n", + "initializer = autogen.UserProxyAgent(\n", + " name=\"Init\",\n", + ")\n", + "\n", + "coder = autogen.AssistantAgent(\n", + " name=\"Retrieve_Action_1\",\n", + " llm_config=gpt4_config,\n", + " system_message=\"\"\"You are the Coder. Given a topic, write code to retrieve related papers from the arXiv API, print their title, authors, abstract, and link.\n", + "You write python/shell code to solve tasks. Wrap the code in a code block that specifies the script type. The user can't modify your code. So do not suggest incomplete code which requires others to modify. Don't use a code block if it's not intended to be executed by the executor.\n", + "Don't include multiple code blocks in one response. Do not ask others to copy and paste the result. Check the execution result returned by the executor.\n", + "If the result indicates there is an error, fix the error and output the code again. Suggest the full code instead of partial code or code changes. If the error can't be fixed or if the task is not solved even after the code is executed successfully, analyze the problem, revisit your assumption, collect additional info you need, and think of a different approach to try.\n", + "\"\"\",\n", + ")\n", + "executor = autogen.UserProxyAgent(\n", + " name=\"Retrieve_Action_2\",\n", + " system_message=\"Executor. Execute the code written by the Coder and report the result.\",\n", + " human_input_mode=\"NEVER\",\n", + " code_execution_config={\n", + " \"last_n_messages\": 3,\n", + " \"work_dir\": \"paper\",\n", + " \"use_docker\": False,\n", + " }, # Please set use_docker=True if docker is available to run the generated code. Using docker is safer than running the generated code directly.\n", + ")\n", + "scientist = autogen.AssistantAgent(\n", + " name=\"Research_Action_1\",\n", + " llm_config=gpt4_config,\n", + " system_message=\"\"\"You are the Scientist. 
Please categorize papers after seeing their abstracts printed and create a markdown table with Domain, Title, Authors, Summary and Link\"\"\",\n", + ")\n", + "\n", + "\n", + "def state_transition(last_speaker, groupchat):\n", + " messages = groupchat.messages\n", + "\n", + " if last_speaker is initializer:\n", + " # init -> retrieve\n", + " return coder\n", + " elif last_speaker is coder:\n", + " # retrieve: action 1 -> action 2\n", + " return executor\n", + " elif last_speaker is executor:\n", + " if messages[-1][\"content\"].startswith(\"exitcode: 1\"):\n", + " # retrieve --(execution failed)--> retrieve\n", + " return coder\n", + " else:\n", + " # retrieve --(execution success)--> research\n", + " return scientist\n", + " elif last_speaker is scientist:\n", + " # research -> end\n", + " return None\n", + "\n", + "\n", + "groupchat = autogen.GroupChat(\n", + " agents=[initializer, coder, executor, scientist],\n", + " messages=[],\n", + " max_round=20,\n", + " speaker_selection_method=state_transition,\n", + ")\n", + "manager = autogen.GroupChatManager(groupchat=groupchat, llm_config=gpt4_config)" ] + }, + { + "cell_type": "code", + "execution_count": 3, + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "\u001b[33mInit\u001b[0m (to chat_manager):\n", + "\n", + "Topic: LLM applications papers from last week. Requirement: 5 - 10 papers from different domains.\n", + "\n", + "--------------------------------------------------------------------------------\n", + "\u001b[33mRetrieve_Action_1\u001b[0m (to chat_manager):\n", + "\n", + "To retrieve related papers from the arXiv API, we can use Python with the `requests` library to send a query to the API and parse the response. Below is a Python script that searches for papers related to \"LLM applications\" (Large Language Models applications) from the last week, across different domains, and prints out the required information for 5 to 10 papers.\n", + "\n", + "```python\n", + "import requests\n", + "from datetime import datetime, timedelta\n", + "import feedparser\n", + "\n", + "# Define the base URL for the arXiv API\n", + "ARXIV_API_URL = 'http://export.arxiv.org/api/query?'\n", + "\n", + "# Define the search parameters\n", + "search_query = 'all:\"LLM applications\"'\n", + "start_date = (datetime.now() - timedelta(days=7)).strftime('%Y%m%d%H%M%S')\n", + "end_date = datetime.now().strftime('%Y%m%d%H%M%S')\n", + "start = 0\n", + "max_results = 10\n", + "sort_by = 'submittedDate'\n", + "sort_order = 'descending'\n", + "\n", + "# Construct the query\n", + "query = f'search_query={search_query}&sortBy={sort_by}&sortOrder={sort_order}&start={start}&max_results={max_results}'\n", + "\n", + "# Send the request to the arXiv API\n", + "response = requests.get(ARXIV_API_URL + query)\n", + "\n", + "# Parse the response using feedparser\n", + "feed = feedparser.parse(response.content)\n", + "\n", + "# Print the title, authors, abstract, and link of each paper\n", + "for entry in feed.entries:\n", + " print(\"Title:\", entry.title)\n", + " print(\"Authors:\", ', '.join(author.name for author in entry.authors))\n", + " print(\"Abstract:\", entry.summary)\n", + " print(\"Link:\", entry.link)\n", + " print(\"\\n\")\n", + "\n", + "# Check if we have at least 5 papers, if not, adjust the search or notify\n", + "if len(feed.entries) < 5:\n", + " print(\"Less than 5 papers found. 
Consider adjusting the search parameters or timeframe.\")\n", + "```\n", + "\n", + "This script will print the title, authors, abstract, and link for each paper related to \"LLM applications\" from the last week, up to a maximum of 10 papers. If fewer than 5 papers are found, it will notify the user to consider adjusting the search parameters or timeframe.\n", + "\n", + "--------------------------------------------------------------------------------\n", + "\u001b[31m\n", + ">>>>>>>> EXECUTING CODE BLOCK 0 (inferred language is python)...\u001b[0m\n", + "\u001b[33mRetrieve_Action_2\u001b[0m (to chat_manager):\n", + "\n", + "exitcode: 0 (execution succeeded)\n", + "Code output: \n", + "Title: PRSA: Prompt Reverse Stealing Attacks against Large Language Models\n", + "Authors: Yong Yang, Xuhong Zhang, Yi Jiang, Xi Chen, Haoyu Wang, Shouling Ji, Zonghui Wang\n", + "Abstract: Prompt, recognized as crucial intellectual property, enables large language\n", + "models (LLMs) to perform specific tasks without the need of fine-tuning,\n", + "underscoring their escalating importance. With the rise of prompt-based\n", + "services, such as prompt marketplaces and LLM applications, providers often\n", + "display prompts' capabilities through input-output examples to attract users.\n", + "However, this paradigm raises a pivotal security concern: does the exposure of\n", + "input-output pairs pose the risk of potential prompt leakage, infringing on the\n", + "intellectual property rights of the developers? To our knowledge, this problem\n", + "still has not been comprehensively explored yet. To remedy this gap, in this\n", + "paper, we perform the first in depth exploration and propose a novel attack\n", + "framework for reverse-stealing prompts against commercial LLMs, namely PRSA.\n", + "The main idea of PRSA is that by analyzing the critical features of the\n", + "input-output pairs, we mimic and gradually infer (steal) the target prompts. In\n", + "detail, PRSA mainly consists of two key phases: prompt mutation and prompt\n", + "pruning. In the mutation phase, we propose a prompt attention algorithm based\n", + "on differential feedback to capture these critical features for effectively\n", + "inferring the target prompts. In the prompt pruning phase, we identify and mask\n", + "the words dependent on specific inputs, enabling the prompts to accommodate\n", + "diverse inputs for generalization. Through extensive evaluation, we verify that\n", + "PRSA poses a severe threat in real world scenarios. We have reported these\n", + "findings to prompt service providers and actively collaborate with them to take\n", + "protective measures for prompt copyright.\n", + "Link: http://arxiv.org/abs/2402.19200v1\n", + "\n", + "\n", + "Title: Political Compass or Spinning Arrow? Towards More Meaningful Evaluations\n", + " for Values and Opinions in Large Language Models\n", + "Authors: Paul Röttger, Valentin Hofmann, Valentina Pyatkin, Musashi Hinck, Hannah Rose Kirk, Hinrich Schütze, Dirk Hovy\n", + "Abstract: Much recent work seeks to evaluate values and opinions in large language\n", + "models (LLMs) using multiple-choice surveys and questionnaires. Most of this\n", + "work is motivated by concerns around real-world LLM applications. For example,\n", + "politically-biased LLMs may subtly influence society when they are used by\n", + "millions of people. 
Such real-world concerns, however, stand in stark contrast\n", + "to the artificiality of current evaluations: real users do not typically ask\n", + "LLMs survey questions. Motivated by this discrepancy, we challenge the\n", + "prevailing constrained evaluation paradigm for values and opinions in LLMs and\n", + "explore more realistic unconstrained evaluations. As a case study, we focus on\n", + "the popular Political Compass Test (PCT). In a systematic review, we find that\n", + "most prior work using the PCT forces models to comply with the PCT's\n", + "multiple-choice format. We show that models give substantively different\n", + "answers when not forced; that answers change depending on how models are\n", + "forced; and that answers lack paraphrase robustness. Then, we demonstrate that\n", + "models give different answers yet again in a more realistic open-ended answer\n", + "setting. We distill these findings into recommendations and open challenges in\n", + "evaluating values and opinions in LLMs.\n", + "Link: http://arxiv.org/abs/2402.16786v1\n", + "\n", + "\n", + "Title: Large Language Models as Urban Residents: An LLM Agent Framework for\n", + " Personal Mobility Generation\n", + "Authors: Jiawei Wang, Renhe Jiang, Chuang Yang, Zengqing Wu, Makoto Onizuka, Ryosuke Shibasaki, Chuan Xiao\n", + "Abstract: This paper introduces a novel approach using Large Language Models (LLMs)\n", + "integrated into an agent framework for flexible and efficient personal mobility\n", + "generation. LLMs overcome the limitations of previous models by efficiently\n", + "processing semantic data and offering versatility in modeling various tasks.\n", + "Our approach addresses the critical need to align LLMs with real-world urban\n", + "mobility data, focusing on three research questions: aligning LLMs with rich\n", + "activity data, developing reliable activity generation strategies, and\n", + "exploring LLM applications in urban mobility. The key technical contribution is\n", + "a novel LLM agent framework that accounts for individual activity patterns and\n", + "motivations, including a self-consistency approach to align LLMs with\n", + "real-world activity data and a retrieval-augmented strategy for interpretable\n", + "activity generation. In experimental studies, comprehensive validation is\n", + "performed using real-world data. This research marks the pioneering work of\n", + "designing an LLM agent framework for activity generation based on real-world\n", + "human activity data, offering a promising tool for urban mobility analysis.\n", + "Link: http://arxiv.org/abs/2402.14744v1\n", + "\n", + "\n", + "Title: An Evaluation of Large Language Models in Bioinformatics Research\n", + "Authors: Hengchuang Yin, Zhonghui Gu, Fanhao Wang, Yiparemu Abuduhaibaier, Yanqiao Zhu, Xinming Tu, Xian-Sheng Hua, Xiao Luo, Yizhou Sun\n", + "Abstract: Large language models (LLMs) such as ChatGPT have gained considerable\n", + "interest across diverse research communities. Their notable ability for text\n", + "completion and generation has inaugurated a novel paradigm for\n", + "language-interfaced problem solving. However, the potential and efficacy of\n", + "these models in bioinformatics remain incompletely explored. 
In this work, we\n", + "study the performance LLMs on a wide spectrum of crucial bioinformatics tasks.\n", + "These tasks include the identification of potential coding regions, extraction\n", + "of named entities for genes and proteins, detection of antimicrobial and\n", + "anti-cancer peptides, molecular optimization, and resolution of educational\n", + "bioinformatics problems. Our findings indicate that, given appropriate prompts,\n", + "LLMs like GPT variants can successfully handle most of these tasks. In\n", + "addition, we provide a thorough analysis of their limitations in the context of\n", + "complicated bioinformatics tasks. In conclusion, we believe that this work can\n", + "provide new perspectives and motivate future research in the field of LLMs\n", + "applications, AI for Science and bioinformatics.\n", + "Link: http://arxiv.org/abs/2402.13714v1\n", + "\n", + "\n", + "Title: Privacy-Preserving Instructions for Aligning Large Language Models\n", + "Authors: Da Yu, Peter Kairouz, Sewoong Oh, Zheng Xu\n", + "Abstract: Service providers of large language model (LLM) applications collect user\n", + "instructions in the wild and use them in further aligning LLMs with users'\n", + "intentions. These instructions, which potentially contain sensitive\n", + "information, are annotated by human workers in the process. This poses a new\n", + "privacy risk not addressed by the typical private optimization. To this end, we\n", + "propose using synthetic instructions to replace real instructions in data\n", + "annotation and model fine-tuning. Formal differential privacy is guaranteed by\n", + "generating those synthetic instructions using privately fine-tuned generators.\n", + "Crucial in achieving the desired utility is our novel filtering algorithm that\n", + "matches the distribution of the synthetic instructions to that of the real\n", + "ones. In both supervised fine-tuning and reinforcement learning from human\n", + "feedback, our extensive experiments demonstrate the high utility of the final\n", + "set of synthetic instructions by showing comparable results to real\n", + "instructions. In supervised fine-tuning, models trained with private synthetic\n", + "instructions outperform leading open-source models such as Vicuna.\n", + "Link: http://arxiv.org/abs/2402.13659v1\n", + "\n", + "\n", + "Title: Ain't Misbehavin' -- Using LLMs to Generate Expressive Robot Behavior in\n", + " Conversations with the Tabletop Robot Haru\n", + "Authors: Zining Wang, Paul Reisert, Eric Nichols, Randy Gomez\n", + "Abstract: Social robots aim to establish long-term bonds with humans through engaging\n", + "conversation. However, traditional conversational approaches, reliant on\n", + "scripted interactions, often fall short in maintaining engaging conversations.\n", + "This paper addresses this limitation by integrating large language models\n", + "(LLMs) into social robots to achieve more dynamic and expressive conversations.\n", + "We introduce a fully-automated conversation system that leverages LLMs to\n", + "generate robot responses with expressive behaviors, congruent with the robot's\n", + "personality. We incorporate robot behavior with two modalities: 1) a\n", + "text-to-speech (TTS) engine capable of various delivery styles, and 2) a\n", + "library of physical actions for the robot. We develop a custom,\n", + "state-of-the-art emotion recognition model to dynamically select the robot's\n", + "tone of voice and utilize emojis from LLM output as cues for generating robot\n", + "actions. 
A demo of our system is available here. To illuminate design and\n", + "implementation issues, we conduct a pilot study where volunteers chat with a\n", + "social robot using our proposed system, and we analyze their feedback,\n", + "conducting a rigorous error analysis of chat transcripts. Feedback was\n", + "overwhelmingly positive, with participants commenting on the robot's empathy,\n", + "helpfulness, naturalness, and entertainment. Most negative feedback was due to\n", + "automatic speech recognition (ASR) errors which had limited impact on\n", + "conversations. However, we observed a small class of errors, such as the LLM\n", + "repeating itself or hallucinating fictitious information and human responses,\n", + "that have the potential to derail conversations, raising important issues for\n", + "LLM application.\n", + "Link: http://arxiv.org/abs/2402.11571v1\n", + "\n", + "\n", + "Title: Fine-tuning Large Language Model (LLM) Artificial Intelligence Chatbots\n", + " in Ophthalmology and LLM-based evaluation using GPT-4\n", + "Authors: Ting Fang Tan, Kabilan Elangovan, Liyuan Jin, Yao Jie, Li Yong, Joshua Lim, Stanley Poh, Wei Yan Ng, Daniel Lim, Yuhe Ke, Nan Liu, Daniel Shu Wei Ting\n", + "Abstract: Purpose: To assess the alignment of GPT-4-based evaluation to human clinician\n", + "experts, for the evaluation of responses to ophthalmology-related patient\n", + "queries generated by fine-tuned LLM chatbots. Methods: 400 ophthalmology\n", + "questions and paired answers were created by ophthalmologists to represent\n", + "commonly asked patient questions, divided into fine-tuning (368; 92%), and\n", + "testing (40; 8%). We find-tuned 5 different LLMs, including LLAMA2-7b,\n", + "LLAMA2-7b-Chat, LLAMA2-13b, and LLAMA2-13b-Chat. For the testing dataset,\n", + "additional 8 glaucoma QnA pairs were included. 200 responses to the testing\n", + "dataset were generated by 5 fine-tuned LLMs for evaluation. A customized\n", + "clinical evaluation rubric was used to guide GPT-4 evaluation, grounded on\n", + "clinical accuracy, relevance, patient safety, and ease of understanding. GPT-4\n", + "evaluation was then compared against ranking by 5 clinicians for clinical\n", + "alignment. Results: Among all fine-tuned LLMs, GPT-3.5 scored the highest\n", + "(87.1%), followed by LLAMA2-13b (80.9%), LLAMA2-13b-chat (75.5%),\n", + "LLAMA2-7b-Chat (70%) and LLAMA2-7b (68.8%) based on the GPT-4 evaluation. GPT-4\n", + "evaluation demonstrated significant agreement with human clinician rankings,\n", + "with Spearman and Kendall Tau correlation coefficients of 0.90 and 0.80\n", + "respectively; while correlation based on Cohen Kappa was more modest at 0.50.\n", + "Notably, qualitative analysis and the glaucoma sub-analysis revealed clinical\n", + "inaccuracies in the LLM-generated responses, which were appropriately\n", + "identified by the GPT-4 evaluation. Conclusion: The notable clinical alignment\n", + "of GPT-4 evaluation highlighted its potential to streamline the clinical\n", + "evaluation of LLM chatbot responses to healthcare-related queries. 
By\n", + "complementing the existing clinician-dependent manual grading, this efficient\n", + "and automated evaluation could assist the validation of future developments in\n", + "LLM applications for healthcare.\n", + "Link: http://arxiv.org/abs/2402.10083v1\n", + "\n", + "\n", + "Title: Unmemorization in Large Language Models via Self-Distillation and\n", + " Deliberate Imagination\n", + "Authors: Yijiang River Dong, Hongzhou Lin, Mikhail Belkin, Ramon Huerta, Ivan Vulić\n", + "Abstract: While displaying impressive generation capabilities across many tasks, Large\n", + "Language Models (LLMs) still struggle with crucial issues of privacy violation\n", + "and unwanted exposure of sensitive data. This raises an essential question: how\n", + "should we prevent such undesired behavior of LLMs while maintaining their\n", + "strong generation and natural language understanding (NLU) capabilities? In\n", + "this work, we introduce a novel approach termed deliberate imagination in the\n", + "context of LLM unlearning. Instead of trying to forget memorized data, we\n", + "employ a self-distillation framework, guiding LLMs to deliberately imagine\n", + "alternative scenarios. As demonstrated in a wide range of experiments, the\n", + "proposed method not only effectively unlearns targeted text but also preserves\n", + "the LLMs' capabilities in open-ended generation tasks as well as in NLU tasks.\n", + "Our results demonstrate the usefulness of this approach across different models\n", + "and sizes, and also with parameter-efficient fine-tuning, offering a novel\n", + "pathway to addressing the challenges with private and sensitive data in LLM\n", + "applications.\n", + "Link: http://arxiv.org/abs/2402.10052v1\n", + "\n", + "\n", + "Title: Anchor-based Large Language Models\n", + "Authors: Jianhui Pang, Fanghua Ye, Derek F. Wong, Longyue Wang\n", + "Abstract: Large language models (LLMs) predominantly employ decoder-only transformer\n", + "architectures, necessitating the retention of keys/values information for\n", + "historical tokens to provide contextual information and avoid redundant\n", + "computation. However, the substantial size and parameter volume of these LLMs\n", + "require massive GPU memory. This memory demand increases with the length of the\n", + "input text, leading to an urgent need for more efficient methods of information\n", + "storage and processing. This study introduces Anchor-based LLMs (AnLLMs), which\n", + "utilize an innovative anchor-based self-attention network (AnSAN) and also an\n", + "anchor-based inference strategy. This approach enables LLMs to compress\n", + "sequence information into an anchor token, reducing the keys/values cache and\n", + "enhancing inference efficiency. Experiments on question-answering benchmarks\n", + "reveal that AnLLMs maintain similar accuracy levels while achieving up to 99%\n", + "keys/values cache reduction and up to 3.5 times faster inference. 
Despite a\n", + "minor compromise in accuracy, the substantial enhancements of AnLLMs employing\n", + "the AnSAN technique in resource utilization and computational efficiency\n", + "underscore their potential for practical LLM applications.\n", + "Link: http://arxiv.org/abs/2402.07616v2\n", + "\n", + "\n", + "Title: T-RAG: Lessons from the LLM Trenches\n", + "Authors: Masoomali Fatehkia, Ji Kim Lucas, Sanjay Chawla\n", + "Abstract: Large Language Models (LLM) have shown remarkable language capabilities\n", + "fueling attempts to integrate them into applications across a wide range of\n", + "domains. An important application area is question answering over private\n", + "enterprise documents where the main considerations are data security, which\n", + "necessitates applications that can be deployed on-prem, limited computational\n", + "resources and the need for a robust application that correctly responds to\n", + "queries. Retrieval-Augmented Generation (RAG) has emerged as the most prominent\n", + "framework for building LLM-based applications. While building a RAG is\n", + "relatively straightforward, making it robust and a reliable application\n", + "requires extensive customization and relatively deep knowledge of the\n", + "application domain. We share our experiences building and deploying an LLM\n", + "application for question answering over private organizational documents. Our\n", + "application combines the use of RAG with a finetuned open-source LLM.\n", + "Additionally, our system, which we call Tree-RAG (T-RAG), uses a tree structure\n", + "to represent entity hierarchies within the organization. This is used to\n", + "generate a textual description to augment the context when responding to user\n", + "queries pertaining to entities within the organization's hierarchy. Our\n", + "evaluations show that this combination performs better than a simple RAG or\n", + "finetuning implementation. Finally, we share some lessons learned based on our\n", + "experiences building an LLM application for real-world use.\n", + "Link: http://arxiv.org/abs/2402.07483v1\n", + "\n", + "\n", + "\n", + "\n", + "--------------------------------------------------------------------------------\n", + "\u001b[33mResearch_Action_1\u001b[0m (to chat_manager):\n", + "\n", + "Based on the retrieved abstracts, here is a markdown table categorizing the papers by domain, along with their titles, authors, summaries, and links:\n", + "\n", + "| Domain | Title | Authors | Summary | Link |\n", + "|--------|-------|---------|---------|------|\n", + "| Security | PRSA: Prompt Reverse Stealing Attacks against Large Language Models | Yong Yang, Xuhong Zhang, Yi Jiang, Xi Chen, Haoyu Wang, Shouling Ji, Zonghui Wang | The paper explores the security risks associated with exposing input-output pairs of prompts used in LLMs and proposes a novel attack framework, PRSA, to reverse-steal prompts, posing a threat to intellectual property rights. | [Link](http://arxiv.org/abs/2402.19200v1) |\n", + "| Ethics & Evaluation | Political Compass or Spinning Arrow? Towards More Meaningful Evaluations for Values and Opinions in Large Language Models | Paul Röttger, Valentin Hofmann, Valentina Pyatkin, Musashi Hinck, Hannah Rose Kirk, Hinrich Schütze, Dirk Hovy | This work challenges the constrained evaluation paradigm for values and opinions in LLMs and explores more realistic unconstrained evaluations, focusing on the Political Compass Test (PCT). 
| [Link](http://arxiv.org/abs/2402.16786v1) |\n", + "| Urban Mobility | Large Language Models as Urban Residents: An LLM Agent Framework for Personal Mobility Generation | Jiawei Wang, Renhe Jiang, Chuang Yang, Zengqing Wu, Makoto Onizuka, Ryosuke Shibasaki, Chuan Xiao | Introduces an LLM agent framework for personal mobility generation, aligning LLMs with real-world urban mobility data, and offering a tool for urban mobility analysis. | [Link](http://arxiv.org/abs/2402.14744v1) |\n", + "| Bioinformatics | An Evaluation of Large Language Models in Bioinformatics Research | Hengchuang Yin, Zhonghui Gu, Fanhao Wang, Yiparemu Abuduhaibaier, Yanqiao Zhu, Xinming Tu, Xian-Sheng Hua, Xiao Luo, Yizhou Sun | Evaluates the performance of LLMs on bioinformatics tasks, highlighting their potential and limitations, and motivating future research in LLM applications in bioinformatics. | [Link](http://arxiv.org/abs/2402.13714v1) |\n", + "| Privacy | Privacy-Preserving Instructions for Aligning Large Language Models | Da Yu, Peter Kairouz, Sewoong Oh, Zheng Xu | Proposes using synthetic instructions generated by privately fine-tuned generators to replace real instructions in data annotation and model fine-tuning, ensuring privacy while maintaining utility. | [Link](http://arxiv.org/abs/2402.13659v1) |\n", + "| Social Robotics | Ain't Misbehavin' -- Using LLMs to Generate Expressive Robot Behavior in Conversations with the Tabletop Robot Haru | Zining Wang, Paul Reisert, Eric Nichols, Randy Gomez | Integrates LLMs into social robots to generate dynamic and expressive conversations, using a text-to-speech engine and a library of physical actions for the robot. | [Link](http://arxiv.org/abs/2402.11571v1) |\n", + "| Ophthalmology | Fine-tuning Large Language Model (LLM) Artificial Intelligence Chatbots in Ophthalmology and LLM-based evaluation using GPT-4 | Ting Fang Tan, Kabilan Elangovan, Liyuan Jin, Yao Jie, Li Yong, Joshua Lim, Stanley Poh, Wei Yan Ng, Daniel Lim, Yuhe Ke, Nan Liu, Daniel Shu Wei Ting | Assesses the alignment of GPT-4-based evaluation to human clinician experts for evaluating responses to ophthalmology-related patient queries generated by fine-tuned LLM chatbots. | [Link](http://arxiv.org/abs/2402.10083v1) |\n", + "| Privacy & Data Security | Unmemorization in Large Language Models via Self-Distillation and Deliberate Imagination | Yijiang River Dong, Hongzhou Lin, Mikhail Belkin, Ramon Huerta, Ivan Vulić | Introduces a novel approach for LLM unlearning by guiding LLMs to imagine alternative scenarios, effectively unlearning targeted text while preserving generation and NLU capabilities. | [Link](http://arxiv.org/abs/2402.10052v1) |\n", + "| Computational Efficiency | Anchor-based Large Language Models | Jianhui Pang, Fanghua Ye, Derek F. Wong, Longyue Wang | Proposes Anchor-based LLMs (AnLLMs) with an innovative anchor-based self-attention network (AnSAN) to reduce memory demand and enhance inference efficiency. | [Link](http://arxiv.org/abs/2402.07616v2) |\n", + "| Enterprise Applications | T-RAG: Lessons from the LLM Trenches | Masoomali Fatehkia, Ji Kim Lucas, Sanjay Chawla | Shares experiences building and deploying an LLM application for question answering over private organizational documents, combining RAG with a finetuned LLM and a tree structure for entity hierarchies. 
| [Link](http://arxiv.org/abs/2402.07483v1) |\n", + "\n", + "These papers cover a range of domains including security, ethics, urban mobility, bioinformatics, privacy, social robotics, ophthalmology, data security, computational efficiency, and enterprise applications, showcasing the diverse applications of large language models.\n", + "\n", + "--------------------------------------------------------------------------------\n" + ] + }, + { + "data": { + "text/plain": [ + "ChatResult(chat_id=None, chat_history=[{'content': 'Topic: LLM applications papers from last week. Requirement: 5 - 10 papers from different domains.', 'role': 'assistant'}, {'content': 'To retrieve related papers from the arXiv API, we can use Python with the `requests` library to send a query to the API and parse the response. Below is a Python script that searches for papers related to \"LLM applications\" (Large Language Models applications) from the last week, across different domains, and prints out the required information for 5 to 10 papers.\\n\\n```python\\nimport requests\\nfrom datetime import datetime, timedelta\\nimport feedparser\\n\\n# Define the base URL for the arXiv API\\nARXIV_API_URL = \\'http://export.arxiv.org/api/query?\\'\\n\\n# Define the search parameters\\nsearch_query = \\'all:\"LLM applications\"\\'\\nstart_date = (datetime.now() - timedelta(days=7)).strftime(\\'%Y%m%d%H%M%S\\')\\nend_date = datetime.now().strftime(\\'%Y%m%d%H%M%S\\')\\nstart = 0\\nmax_results = 10\\nsort_by = \\'submittedDate\\'\\nsort_order = \\'descending\\'\\n\\n# Construct the query\\nquery = f\\'search_query={search_query}&sortBy={sort_by}&sortOrder={sort_order}&start={start}&max_results={max_results}\\'\\n\\n# Send the request to the arXiv API\\nresponse = requests.get(ARXIV_API_URL + query)\\n\\n# Parse the response using feedparser\\nfeed = feedparser.parse(response.content)\\n\\n# Print the title, authors, abstract, and link of each paper\\nfor entry in feed.entries:\\n print(\"Title:\", entry.title)\\n print(\"Authors:\", \\', \\'.join(author.name for author in entry.authors))\\n print(\"Abstract:\", entry.summary)\\n print(\"Link:\", entry.link)\\n print(\"\\\\n\")\\n\\n# Check if we have at least 5 papers, if not, adjust the search or notify\\nif len(feed.entries) < 5:\\n print(\"Less than 5 papers found. Consider adjusting the search parameters or timeframe.\")\\n```\\n\\nThis script will print the title, authors, abstract, and link for each paper related to \"LLM applications\" from the last week, up to a maximum of 10 papers. If fewer than 5 papers are found, it will notify the user to consider adjusting the search parameters or timeframe.', 'name': 'Retrieve_Action_1', 'role': 'user'}, {'content': \"exitcode: 0 (execution succeeded)\\nCode output: \\nTitle: PRSA: Prompt Reverse Stealing Attacks against Large Language Models\\nAuthors: Yong Yang, Xuhong Zhang, Yi Jiang, Xi Chen, Haoyu Wang, Shouling Ji, Zonghui Wang\\nAbstract: Prompt, recognized as crucial intellectual property, enables large language\\nmodels (LLMs) to perform specific tasks without the need of fine-tuning,\\nunderscoring their escalating importance. 
With the rise of prompt-based\\nservices, such as prompt marketplaces and LLM applications, providers often\\ndisplay prompts' capabilities through input-output examples to attract users.\\nHowever, this paradigm raises a pivotal security concern: does the exposure of\\ninput-output pairs pose the risk of potential prompt leakage, infringing on the\\nintellectual property rights of the developers? To our knowledge, this problem\\nstill has not been comprehensively explored yet. To remedy this gap, in this\\npaper, we perform the first in depth exploration and propose a novel attack\\nframework for reverse-stealing prompts against commercial LLMs, namely PRSA.\\nThe main idea of PRSA is that by analyzing the critical features of the\\ninput-output pairs, we mimic and gradually infer (steal) the target prompts. In\\ndetail, PRSA mainly consists of two key phases: prompt mutation and prompt\\npruning. In the mutation phase, we propose a prompt attention algorithm based\\non differential feedback to capture these critical features for effectively\\ninferring the target prompts. In the prompt pruning phase, we identify and mask\\nthe words dependent on specific inputs, enabling the prompts to accommodate\\ndiverse inputs for generalization. Through extensive evaluation, we verify that\\nPRSA poses a severe threat in real world scenarios. We have reported these\\nfindings to prompt service providers and actively collaborate with them to take\\nprotective measures for prompt copyright.\\nLink: http://arxiv.org/abs/2402.19200v1\\n\\n\\nTitle: Political Compass or Spinning Arrow? Towards More Meaningful Evaluations\\n for Values and Opinions in Large Language Models\\nAuthors: Paul Röttger, Valentin Hofmann, Valentina Pyatkin, Musashi Hinck, Hannah Rose Kirk, Hinrich Schütze, Dirk Hovy\\nAbstract: Much recent work seeks to evaluate values and opinions in large language\\nmodels (LLMs) using multiple-choice surveys and questionnaires. Most of this\\nwork is motivated by concerns around real-world LLM applications. For example,\\npolitically-biased LLMs may subtly influence society when they are used by\\nmillions of people. Such real-world concerns, however, stand in stark contrast\\nto the artificiality of current evaluations: real users do not typically ask\\nLLMs survey questions. Motivated by this discrepancy, we challenge the\\nprevailing constrained evaluation paradigm for values and opinions in LLMs and\\nexplore more realistic unconstrained evaluations. As a case study, we focus on\\nthe popular Political Compass Test (PCT). In a systematic review, we find that\\nmost prior work using the PCT forces models to comply with the PCT's\\nmultiple-choice format. We show that models give substantively different\\nanswers when not forced; that answers change depending on how models are\\nforced; and that answers lack paraphrase robustness. Then, we demonstrate that\\nmodels give different answers yet again in a more realistic open-ended answer\\nsetting. 
We distill these findings into recommendations and open challenges in\\nevaluating values and opinions in LLMs.\\nLink: http://arxiv.org/abs/2402.16786v1\\n\\n\\nTitle: Large Language Models as Urban Residents: An LLM Agent Framework for\\n Personal Mobility Generation\\nAuthors: Jiawei Wang, Renhe Jiang, Chuang Yang, Zengqing Wu, Makoto Onizuka, Ryosuke Shibasaki, Chuan Xiao\\nAbstract: This paper introduces a novel approach using Large Language Models (LLMs)\\nintegrated into an agent framework for flexible and efficient personal mobility\\ngeneration. LLMs overcome the limitations of previous models by efficiently\\nprocessing semantic data and offering versatility in modeling various tasks.\\nOur approach addresses the critical need to align LLMs with real-world urban\\nmobility data, focusing on three research questions: aligning LLMs with rich\\nactivity data, developing reliable activity generation strategies, and\\nexploring LLM applications in urban mobility. The key technical contribution is\\na novel LLM agent framework that accounts for individual activity patterns and\\nmotivations, including a self-consistency approach to align LLMs with\\nreal-world activity data and a retrieval-augmented strategy for interpretable\\nactivity generation. In experimental studies, comprehensive validation is\\nperformed using real-world data. This research marks the pioneering work of\\ndesigning an LLM agent framework for activity generation based on real-world\\nhuman activity data, offering a promising tool for urban mobility analysis.\\nLink: http://arxiv.org/abs/2402.14744v1\\n\\n\\nTitle: An Evaluation of Large Language Models in Bioinformatics Research\\nAuthors: Hengchuang Yin, Zhonghui Gu, Fanhao Wang, Yiparemu Abuduhaibaier, Yanqiao Zhu, Xinming Tu, Xian-Sheng Hua, Xiao Luo, Yizhou Sun\\nAbstract: Large language models (LLMs) such as ChatGPT have gained considerable\\ninterest across diverse research communities. Their notable ability for text\\ncompletion and generation has inaugurated a novel paradigm for\\nlanguage-interfaced problem solving. However, the potential and efficacy of\\nthese models in bioinformatics remain incompletely explored. In this work, we\\nstudy the performance LLMs on a wide spectrum of crucial bioinformatics tasks.\\nThese tasks include the identification of potential coding regions, extraction\\nof named entities for genes and proteins, detection of antimicrobial and\\nanti-cancer peptides, molecular optimization, and resolution of educational\\nbioinformatics problems. Our findings indicate that, given appropriate prompts,\\nLLMs like GPT variants can successfully handle most of these tasks. In\\naddition, we provide a thorough analysis of their limitations in the context of\\ncomplicated bioinformatics tasks. In conclusion, we believe that this work can\\nprovide new perspectives and motivate future research in the field of LLMs\\napplications, AI for Science and bioinformatics.\\nLink: http://arxiv.org/abs/2402.13714v1\\n\\n\\nTitle: Privacy-Preserving Instructions for Aligning Large Language Models\\nAuthors: Da Yu, Peter Kairouz, Sewoong Oh, Zheng Xu\\nAbstract: Service providers of large language model (LLM) applications collect user\\ninstructions in the wild and use them in further aligning LLMs with users'\\nintentions. These instructions, which potentially contain sensitive\\ninformation, are annotated by human workers in the process. This poses a new\\nprivacy risk not addressed by the typical private optimization. 
To this end, we\\npropose using synthetic instructions to replace real instructions in data\\nannotation and model fine-tuning. Formal differential privacy is guaranteed by\\ngenerating those synthetic instructions using privately fine-tuned generators.\\nCrucial in achieving the desired utility is our novel filtering algorithm that\\nmatches the distribution of the synthetic instructions to that of the real\\nones. In both supervised fine-tuning and reinforcement learning from human\\nfeedback, our extensive experiments demonstrate the high utility of the final\\nset of synthetic instructions by showing comparable results to real\\ninstructions. In supervised fine-tuning, models trained with private synthetic\\ninstructions outperform leading open-source models such as Vicuna.\\nLink: http://arxiv.org/abs/2402.13659v1\\n\\n\\nTitle: Ain't Misbehavin' -- Using LLMs to Generate Expressive Robot Behavior in\\n Conversations with the Tabletop Robot Haru\\nAuthors: Zining Wang, Paul Reisert, Eric Nichols, Randy Gomez\\nAbstract: Social robots aim to establish long-term bonds with humans through engaging\\nconversation. However, traditional conversational approaches, reliant on\\nscripted interactions, often fall short in maintaining engaging conversations.\\nThis paper addresses this limitation by integrating large language models\\n(LLMs) into social robots to achieve more dynamic and expressive conversations.\\nWe introduce a fully-automated conversation system that leverages LLMs to\\ngenerate robot responses with expressive behaviors, congruent with the robot's\\npersonality. We incorporate robot behavior with two modalities: 1) a\\ntext-to-speech (TTS) engine capable of various delivery styles, and 2) a\\nlibrary of physical actions for the robot. We develop a custom,\\nstate-of-the-art emotion recognition model to dynamically select the robot's\\ntone of voice and utilize emojis from LLM output as cues for generating robot\\nactions. A demo of our system is available here. To illuminate design and\\nimplementation issues, we conduct a pilot study where volunteers chat with a\\nsocial robot using our proposed system, and we analyze their feedback,\\nconducting a rigorous error analysis of chat transcripts. Feedback was\\noverwhelmingly positive, with participants commenting on the robot's empathy,\\nhelpfulness, naturalness, and entertainment. Most negative feedback was due to\\nautomatic speech recognition (ASR) errors which had limited impact on\\nconversations. However, we observed a small class of errors, such as the LLM\\nrepeating itself or hallucinating fictitious information and human responses,\\nthat have the potential to derail conversations, raising important issues for\\nLLM application.\\nLink: http://arxiv.org/abs/2402.11571v1\\n\\n\\nTitle: Fine-tuning Large Language Model (LLM) Artificial Intelligence Chatbots\\n in Ophthalmology and LLM-based evaluation using GPT-4\\nAuthors: Ting Fang Tan, Kabilan Elangovan, Liyuan Jin, Yao Jie, Li Yong, Joshua Lim, Stanley Poh, Wei Yan Ng, Daniel Lim, Yuhe Ke, Nan Liu, Daniel Shu Wei Ting\\nAbstract: Purpose: To assess the alignment of GPT-4-based evaluation to human clinician\\nexperts, for the evaluation of responses to ophthalmology-related patient\\nqueries generated by fine-tuned LLM chatbots. Methods: 400 ophthalmology\\nquestions and paired answers were created by ophthalmologists to represent\\ncommonly asked patient questions, divided into fine-tuning (368; 92%), and\\ntesting (40; 8%). 
We find-tuned 5 different LLMs, including LLAMA2-7b,\\nLLAMA2-7b-Chat, LLAMA2-13b, and LLAMA2-13b-Chat. For the testing dataset,\\nadditional 8 glaucoma QnA pairs were included. 200 responses to the testing\\ndataset were generated by 5 fine-tuned LLMs for evaluation. A customized\\nclinical evaluation rubric was used to guide GPT-4 evaluation, grounded on\\nclinical accuracy, relevance, patient safety, and ease of understanding. GPT-4\\nevaluation was then compared against ranking by 5 clinicians for clinical\\nalignment. Results: Among all fine-tuned LLMs, GPT-3.5 scored the highest\\n(87.1%), followed by LLAMA2-13b (80.9%), LLAMA2-13b-chat (75.5%),\\nLLAMA2-7b-Chat (70%) and LLAMA2-7b (68.8%) based on the GPT-4 evaluation. GPT-4\\nevaluation demonstrated significant agreement with human clinician rankings,\\nwith Spearman and Kendall Tau correlation coefficients of 0.90 and 0.80\\nrespectively; while correlation based on Cohen Kappa was more modest at 0.50.\\nNotably, qualitative analysis and the glaucoma sub-analysis revealed clinical\\ninaccuracies in the LLM-generated responses, which were appropriately\\nidentified by the GPT-4 evaluation. Conclusion: The notable clinical alignment\\nof GPT-4 evaluation highlighted its potential to streamline the clinical\\nevaluation of LLM chatbot responses to healthcare-related queries. By\\ncomplementing the existing clinician-dependent manual grading, this efficient\\nand automated evaluation could assist the validation of future developments in\\nLLM applications for healthcare.\\nLink: http://arxiv.org/abs/2402.10083v1\\n\\n\\nTitle: Unmemorization in Large Language Models via Self-Distillation and\\n Deliberate Imagination\\nAuthors: Yijiang River Dong, Hongzhou Lin, Mikhail Belkin, Ramon Huerta, Ivan Vulić\\nAbstract: While displaying impressive generation capabilities across many tasks, Large\\nLanguage Models (LLMs) still struggle with crucial issues of privacy violation\\nand unwanted exposure of sensitive data. This raises an essential question: how\\nshould we prevent such undesired behavior of LLMs while maintaining their\\nstrong generation and natural language understanding (NLU) capabilities? In\\nthis work, we introduce a novel approach termed deliberate imagination in the\\ncontext of LLM unlearning. Instead of trying to forget memorized data, we\\nemploy a self-distillation framework, guiding LLMs to deliberately imagine\\nalternative scenarios. As demonstrated in a wide range of experiments, the\\nproposed method not only effectively unlearns targeted text but also preserves\\nthe LLMs' capabilities in open-ended generation tasks as well as in NLU tasks.\\nOur results demonstrate the usefulness of this approach across different models\\nand sizes, and also with parameter-efficient fine-tuning, offering a novel\\npathway to addressing the challenges with private and sensitive data in LLM\\napplications.\\nLink: http://arxiv.org/abs/2402.10052v1\\n\\n\\nTitle: Anchor-based Large Language Models\\nAuthors: Jianhui Pang, Fanghua Ye, Derek F. Wong, Longyue Wang\\nAbstract: Large language models (LLMs) predominantly employ decoder-only transformer\\narchitectures, necessitating the retention of keys/values information for\\nhistorical tokens to provide contextual information and avoid redundant\\ncomputation. However, the substantial size and parameter volume of these LLMs\\nrequire massive GPU memory. 
This memory demand increases with the length of the\\ninput text, leading to an urgent need for more efficient methods of information\\nstorage and processing. This study introduces Anchor-based LLMs (AnLLMs), which\\nutilize an innovative anchor-based self-attention network (AnSAN) and also an\\nanchor-based inference strategy. This approach enables LLMs to compress\\nsequence information into an anchor token, reducing the keys/values cache and\\nenhancing inference efficiency. Experiments on question-answering benchmarks\\nreveal that AnLLMs maintain similar accuracy levels while achieving up to 99%\\nkeys/values cache reduction and up to 3.5 times faster inference. Despite a\\nminor compromise in accuracy, the substantial enhancements of AnLLMs employing\\nthe AnSAN technique in resource utilization and computational efficiency\\nunderscore their potential for practical LLM applications.\\nLink: http://arxiv.org/abs/2402.07616v2\\n\\n\\nTitle: T-RAG: Lessons from the LLM Trenches\\nAuthors: Masoomali Fatehkia, Ji Kim Lucas, Sanjay Chawla\\nAbstract: Large Language Models (LLM) have shown remarkable language capabilities\\nfueling attempts to integrate them into applications across a wide range of\\ndomains. An important application area is question answering over private\\nenterprise documents where the main considerations are data security, which\\nnecessitates applications that can be deployed on-prem, limited computational\\nresources and the need for a robust application that correctly responds to\\nqueries. Retrieval-Augmented Generation (RAG) has emerged as the most prominent\\nframework for building LLM-based applications. While building a RAG is\\nrelatively straightforward, making it robust and a reliable application\\nrequires extensive customization and relatively deep knowledge of the\\napplication domain. We share our experiences building and deploying an LLM\\napplication for question answering over private organizational documents. Our\\napplication combines the use of RAG with a finetuned open-source LLM.\\nAdditionally, our system, which we call Tree-RAG (T-RAG), uses a tree structure\\nto represent entity hierarchies within the organization. This is used to\\ngenerate a textual description to augment the context when responding to user\\nqueries pertaining to entities within the organization's hierarchy. Our\\nevaluations show that this combination performs better than a simple RAG or\\nfinetuning implementation. Finally, we share some lessons learned based on our\\nexperiences building an LLM application for real-world use.\\nLink: http://arxiv.org/abs/2402.07483v1\\n\\n\\n\", 'name': 'Retrieve_Action_2', 'role': 'user'}, {'content': \"Based on the retrieved abstracts, here is a markdown table categorizing the papers by domain, along with their titles, authors, summaries, and links:\\n\\n| Domain | Title | Authors | Summary | Link |\\n|--------|-------|---------|---------|------|\\n| Security | PRSA: Prompt Reverse Stealing Attacks against Large Language Models | Yong Yang, Xuhong Zhang, Yi Jiang, Xi Chen, Haoyu Wang, Shouling Ji, Zonghui Wang | The paper explores the security risks associated with exposing input-output pairs of prompts used in LLMs and proposes a novel attack framework, PRSA, to reverse-steal prompts, posing a threat to intellectual property rights. | [Link](http://arxiv.org/abs/2402.19200v1) |\\n| Ethics & Evaluation | Political Compass or Spinning Arrow? 
Towards More Meaningful Evaluations for Values and Opinions in Large Language Models | Paul Röttger, Valentin Hofmann, Valentina Pyatkin, Musashi Hinck, Hannah Rose Kirk, Hinrich Schütze, Dirk Hovy | This work challenges the constrained evaluation paradigm for values and opinions in LLMs and explores more realistic unconstrained evaluations, focusing on the Political Compass Test (PCT). | [Link](http://arxiv.org/abs/2402.16786v1) |\\n| Urban Mobility | Large Language Models as Urban Residents: An LLM Agent Framework for Personal Mobility Generation | Jiawei Wang, Renhe Jiang, Chuang Yang, Zengqing Wu, Makoto Onizuka, Ryosuke Shibasaki, Chuan Xiao | Introduces an LLM agent framework for personal mobility generation, aligning LLMs with real-world urban mobility data, and offering a tool for urban mobility analysis. | [Link](http://arxiv.org/abs/2402.14744v1) |\\n| Bioinformatics | An Evaluation of Large Language Models in Bioinformatics Research | Hengchuang Yin, Zhonghui Gu, Fanhao Wang, Yiparemu Abuduhaibaier, Yanqiao Zhu, Xinming Tu, Xian-Sheng Hua, Xiao Luo, Yizhou Sun | Evaluates the performance of LLMs on bioinformatics tasks, highlighting their potential and limitations, and motivating future research in LLM applications in bioinformatics. | [Link](http://arxiv.org/abs/2402.13714v1) |\\n| Privacy | Privacy-Preserving Instructions for Aligning Large Language Models | Da Yu, Peter Kairouz, Sewoong Oh, Zheng Xu | Proposes using synthetic instructions generated by privately fine-tuned generators to replace real instructions in data annotation and model fine-tuning, ensuring privacy while maintaining utility. | [Link](http://arxiv.org/abs/2402.13659v1) |\\n| Social Robotics | Ain't Misbehavin' -- Using LLMs to Generate Expressive Robot Behavior in Conversations with the Tabletop Robot Haru | Zining Wang, Paul Reisert, Eric Nichols, Randy Gomez | Integrates LLMs into social robots to generate dynamic and expressive conversations, using a text-to-speech engine and a library of physical actions for the robot. | [Link](http://arxiv.org/abs/2402.11571v1) |\\n| Ophthalmology | Fine-tuning Large Language Model (LLM) Artificial Intelligence Chatbots in Ophthalmology and LLM-based evaluation using GPT-4 | Ting Fang Tan, Kabilan Elangovan, Liyuan Jin, Yao Jie, Li Yong, Joshua Lim, Stanley Poh, Wei Yan Ng, Daniel Lim, Yuhe Ke, Nan Liu, Daniel Shu Wei Ting | Assesses the alignment of GPT-4-based evaluation to human clinician experts for evaluating responses to ophthalmology-related patient queries generated by fine-tuned LLM chatbots. | [Link](http://arxiv.org/abs/2402.10083v1) |\\n| Privacy & Data Security | Unmemorization in Large Language Models via Self-Distillation and Deliberate Imagination | Yijiang River Dong, Hongzhou Lin, Mikhail Belkin, Ramon Huerta, Ivan Vulić | Introduces a novel approach for LLM unlearning by guiding LLMs to imagine alternative scenarios, effectively unlearning targeted text while preserving generation and NLU capabilities. | [Link](http://arxiv.org/abs/2402.10052v1) |\\n| Computational Efficiency | Anchor-based Large Language Models | Jianhui Pang, Fanghua Ye, Derek F. Wong, Longyue Wang | Proposes Anchor-based LLMs (AnLLMs) with an innovative anchor-based self-attention network (AnSAN) to reduce memory demand and enhance inference efficiency. 
| [Link](http://arxiv.org/abs/2402.07616v2) |\\n| Enterprise Applications | T-RAG: Lessons from the LLM Trenches | Masoomali Fatehkia, Ji Kim Lucas, Sanjay Chawla | Shares experiences building and deploying an LLM application for question answering over private organizational documents, combining RAG with a finetuned LLM and a tree structure for entity hierarchies. | [Link](http://arxiv.org/abs/2402.07483v1) |\\n\\nThese papers cover a range of domains including security, ethics, urban mobility, bioinformatics, privacy, social robotics, ophthalmology, data security, computational efficiency, and enterprise applications, showcasing the diverse applications of large language models.\", 'name': 'Research_Action_1', 'role': 'user'}], summary=\"Based on the retrieved abstracts, here is a markdown table categorizing the papers by domain, along with their titles, authors, summaries, and links:\\n\\n| Domain | Title | Authors | Summary | Link |\\n|--------|-------|---------|---------|------|\\n| Security | PRSA: Prompt Reverse Stealing Attacks against Large Language Models | Yong Yang, Xuhong Zhang, Yi Jiang, Xi Chen, Haoyu Wang, Shouling Ji, Zonghui Wang | The paper explores the security risks associated with exposing input-output pairs of prompts used in LLMs and proposes a novel attack framework, PRSA, to reverse-steal prompts, posing a threat to intellectual property rights. | [Link](http://arxiv.org/abs/2402.19200v1) |\\n| Ethics & Evaluation | Political Compass or Spinning Arrow? Towards More Meaningful Evaluations for Values and Opinions in Large Language Models | Paul Röttger, Valentin Hofmann, Valentina Pyatkin, Musashi Hinck, Hannah Rose Kirk, Hinrich Schütze, Dirk Hovy | This work challenges the constrained evaluation paradigm for values and opinions in LLMs and explores more realistic unconstrained evaluations, focusing on the Political Compass Test (PCT). | [Link](http://arxiv.org/abs/2402.16786v1) |\\n| Urban Mobility | Large Language Models as Urban Residents: An LLM Agent Framework for Personal Mobility Generation | Jiawei Wang, Renhe Jiang, Chuang Yang, Zengqing Wu, Makoto Onizuka, Ryosuke Shibasaki, Chuan Xiao | Introduces an LLM agent framework for personal mobility generation, aligning LLMs with real-world urban mobility data, and offering a tool for urban mobility analysis. | [Link](http://arxiv.org/abs/2402.14744v1) |\\n| Bioinformatics | An Evaluation of Large Language Models in Bioinformatics Research | Hengchuang Yin, Zhonghui Gu, Fanhao Wang, Yiparemu Abuduhaibaier, Yanqiao Zhu, Xinming Tu, Xian-Sheng Hua, Xiao Luo, Yizhou Sun | Evaluates the performance of LLMs on bioinformatics tasks, highlighting their potential and limitations, and motivating future research in LLM applications in bioinformatics. | [Link](http://arxiv.org/abs/2402.13714v1) |\\n| Privacy | Privacy-Preserving Instructions for Aligning Large Language Models | Da Yu, Peter Kairouz, Sewoong Oh, Zheng Xu | Proposes using synthetic instructions generated by privately fine-tuned generators to replace real instructions in data annotation and model fine-tuning, ensuring privacy while maintaining utility. | [Link](http://arxiv.org/abs/2402.13659v1) |\\n| Social Robotics | Ain't Misbehavin' -- Using LLMs to Generate Expressive Robot Behavior in Conversations with the Tabletop Robot Haru | Zining Wang, Paul Reisert, Eric Nichols, Randy Gomez | Integrates LLMs into social robots to generate dynamic and expressive conversations, using a text-to-speech engine and a library of physical actions for the robot. 
| [Link](http://arxiv.org/abs/2402.11571v1) |\\n| Ophthalmology | Fine-tuning Large Language Model (LLM) Artificial Intelligence Chatbots in Ophthalmology and LLM-based evaluation using GPT-4 | Ting Fang Tan, Kabilan Elangovan, Liyuan Jin, Yao Jie, Li Yong, Joshua Lim, Stanley Poh, Wei Yan Ng, Daniel Lim, Yuhe Ke, Nan Liu, Daniel Shu Wei Ting | Assesses the alignment of GPT-4-based evaluation to human clinician experts for evaluating responses to ophthalmology-related patient queries generated by fine-tuned LLM chatbots. | [Link](http://arxiv.org/abs/2402.10083v1) |\\n| Privacy & Data Security | Unmemorization in Large Language Models via Self-Distillation and Deliberate Imagination | Yijiang River Dong, Hongzhou Lin, Mikhail Belkin, Ramon Huerta, Ivan Vulić | Introduces a novel approach for LLM unlearning by guiding LLMs to imagine alternative scenarios, effectively unlearning targeted text while preserving generation and NLU capabilities. | [Link](http://arxiv.org/abs/2402.10052v1) |\\n| Computational Efficiency | Anchor-based Large Language Models | Jianhui Pang, Fanghua Ye, Derek F. Wong, Longyue Wang | Proposes Anchor-based LLMs (AnLLMs) with an innovative anchor-based self-attention network (AnSAN) to reduce memory demand and enhance inference efficiency. | [Link](http://arxiv.org/abs/2402.07616v2) |\\n| Enterprise Applications | T-RAG: Lessons from the LLM Trenches | Masoomali Fatehkia, Ji Kim Lucas, Sanjay Chawla | Shares experiences building and deploying an LLM application for question answering over private organizational documents, combining RAG with a finetuned LLM and a tree structure for entity hierarchies. | [Link](http://arxiv.org/abs/2402.07483v1) |\\n\\nThese papers cover a range of domains including security, ethics, urban mobility, bioinformatics, privacy, social robotics, ophthalmology, data security, computational efficiency, and enterprise applications, showcasing the diverse applications of large language models.\", cost=({'total_cost': 0}, {'total_cost': 0}), human_input=[])" + ] + }, + "execution_count": 3, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "initializer.initiate_chat(\n", + " manager, message=\"Topic: LLM applications papers from last week. 
Requirement: 5 - 10 papers from different domains.\"\n",
+    ")"
+   ]
+  }
+ ],
+ "metadata": {
+  "front_matter": {
+   "description": "StateFlow: Build Workflows through State-Oriented Actions",
+   "tags": [
+    "orchestration",
+    "group chat",
+    "research"
+   ]
+  },
+  "kernelspec": {
+   "display_name": "flaml",
+   "language": "python",
+   "name": "python3"
+  },
+  "language_info": {
+   "codemirror_mode": {
+    "name": "ipython",
+    "version": 3
+   },
+   "file_extension": ".py",
+   "mimetype": "text/x-python",
+   "name": "python",
+   "nbconvert_exporter": "python",
+   "pygments_lexer": "ipython3",
+   "version": "3.9.18"
+  },
+  "orig_nbformat": 4
+ },
+ "nbformat": 4,
+ "nbformat_minor": 2
+}
diff --git a/test/test_notebook.py b/test/test_notebook.py
index 5399abf4a770..cc17c354067c 100755
--- a/test/test_notebook.py
+++ b/test/test_notebook.py
@@ -1,5 +1,5 @@
-#!/usr/bin/env python3 -m pytest
-
+#!/usr/bin/env python3 -m pytest
+
 import sys
 import os
 import pytest
@@ -119,6 +119,14 @@ def test_agentchat_cost_token_tracking(save=False):
     run_notebook("agentchat_cost_token_tracking.ipynb", save=save)
 
 
+@pytest.mark.skipif(
+    skip or not sys.version.startswith("3.11"),
+    reason="do not run if openai is not installed or py!=3.11",
+)
+def test_agentchat_groupchat_stateflow(save=False):
+    run_notebook("agentchat_groupchat_stateflow.ipynb", save=save)
+
+
 if __name__ == "__main__":
     # test_agentchat_auto_feedback_from_code(save=True)
     # test_oai_chatgpt_gpt4(save=True)
diff --git a/website/blog/2024-02-29-StateFlow/img/bash_result.png b/website/blog/2024-02-29-StateFlow/img/bash_result.png
new file mode 100644
index 000000000000..a31f34bada64
Binary files /dev/null and b/website/blog/2024-02-29-StateFlow/img/bash_result.png differ
diff --git a/website/blog/2024-02-29-StateFlow/img/intercode.png b/website/blog/2024-02-29-StateFlow/img/intercode.png
new file mode 100644
index 000000000000..7616d3e6f829
Binary files /dev/null and b/website/blog/2024-02-29-StateFlow/img/intercode.png differ
diff --git a/website/blog/2024-02-29-StateFlow/img/sf_example_1.png b/website/blog/2024-02-29-StateFlow/img/sf_example_1.png
new file mode 100644
index 000000000000..1dea2b2f45bc
Binary files /dev/null and b/website/blog/2024-02-29-StateFlow/img/sf_example_1.png differ
diff --git a/website/blog/2024-02-29-StateFlow/index.mdx b/website/blog/2024-02-29-StateFlow/index.mdx
new file mode 100644
index 000000000000..1989f9c24696
--- /dev/null
+++ b/website/blog/2024-02-29-StateFlow/index.mdx
@@ -0,0 +1,142 @@
+---
+title: StateFlow - Build LLM Workflows with Customized State-Oriented Transition Function in GroupChat
+authors: yiranwu
+tags: [LLM, research]
+---
+
+**TL;DR:** We introduce **StateFlow**, a task-solving paradigm that conceptualizes complex task-solving processes backed by LLMs as state machines.
+We also show how to use GroupChat to realize this idea with a customized speaker selection function.
+
+The paper is coming soon!
+
+## Introduction
+LLMs have increasingly been employed to solve complex, multi-step tasks, e.g., tasks that require a sequence of complex reasoning
+and interactions with external environments and tools. To facilitate the development of such applications, we introduce **StateFlow**, a new paradigm
+that conceptualizes complex task-solving processes backed by LLMs as state machines. With proper construction of states and definition of
+state transitions, we can ground the progress of task-solving, ensuring clear tracking and management of LLMs' responses throughout the task-solving process.
+
+## StateFlow
+Finite State Machines (FSMs) are used as control systems in practical applications such as traffic light control.
+A state machine is a model of behavior that decides what to do based on its current status. A state represents one situation the machine can be in, and together the states cover all possible situations of the FSM.
+Drawing from this concept, we want to use state machines to model the task-solving process of LLMs. When using an LLM to solve a task, each step of the task-solving process can be mapped to a state.
+For example, the process starts at the *Init* state when the task is given. When reaching a state, a sequence of output functions is called to add content to the context history, such as sending a specific instruction, using a tool, or calling the LLM itself.
+Based on the current state and the context history, the **StateFlow** model determines the next state to transit to. The task-solving process progresses by transitioning through different states and performing the corresponding actions, and ends when a final state is reached.
+Essentially, we send different instructions to an LLM, asking it to perform different actions based on its current status.
+
+In **StateFlow**, we construct a state machine to control a single LLM, sending different instructions to it at different states.
+We also provide an agent view of our framework, **SF_Agent**, which can use different LLM agents to perform actions at different states.
+In this case, we don't need to add instructions to the context history and call the LLM. Instead, we construct individual agents with pre-set instructions and (potentially) different LLMs.
+AutoGen is the perfect platform to implement **SF_Agent**.
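+
+To make the loop concrete, here is a minimal, framework-agnostic sketch of a **StateFlow** model. Every name in it is illustrative rather than part of any API: `call_llm` and `run_code` are hypothetical stand-ins for a real LLM call and a real code executor.
+
+```python
+# A toy StateFlow model: each state maps to a sequence of output functions,
+# and a transition function picks the next state from the current state and
+# the context history.
+def call_llm(history):
+    # Placeholder for an actual LLM call on the context history.
+    return "echo hello"
+
+def run_code(history):
+    # Placeholder for executing commands found in the last message.
+    return "exitcode: 0"
+
+actions = {
+    "Init": [lambda history: "Instruction: solve the task step by step."],
+    "Solve": [call_llm, run_code],
+}
+
+def transition(state, history):
+    if state == "Init":
+        return "Solve"
+    if state == "Solve":
+        # Retry on failure, otherwise finish.
+        return "Solve" if "exitcode: 1" in history[-1] else "End"
+    return "End"
+
+state, history = "Init", ["Task: print hello"]
+while state != "End":
+    for act in actions[state]:
+        history.append(act(history))  # each output function appends to the context
+    state = transition(state, history)
+
+print("\n".join(history))
+```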
+
+## Experiments
+We evaluate **StateFlow** on the SQL task and the Bash task from the InterCode benchmark, with both GPT-3.5-Turbo and GPT-4.
+We construct a different **StateFlow** model for each task (see the figure below); note that most of the states are shared between the two tasks.
+Within each state, we define a sequence of actions to be performed. The most common action sequence is P->M->E, meaning we first send a prompt, then call the LLM to generate a response, and finally execute commands from the response.
+![Intercode Example](./img/intercode.png)
+
+We record different metrics for a comprehensive comparison. 'SR' (success rate) and 'rewards' measure performance,
+'Turns' represents the number of interactions with the environment, and 'Error Rate' represents the percentage of executed commands that result in errors.
+We also record the cost and token counts to show the LLM usage.
+
+We compare with the following baselines:
+(1) ReAct: a few-shot prompting method that prompts the model to generate thoughts and actions.
+(2) Plan & Solve: a two-step prompting strategy that first asks the model to propose a plan and then to execute it.
+
+We show the results of the Bash task in the figure below:
+
+![Bash Result](./img/bash_result.png)
+
+Our evaluation demonstrates the advantages of **StateFlow** and **SF_Agent** over existing methods in terms of both effectiveness and efficiency.
+For example, in the Bash task with GPT-4, **SF_Agent** improves the success rate by 8% and incurs 4x lower cost than ReAct prompting.
+
+## Implement StateFlow With GroupChat
+We illustrate how to build **StateFlow** with GroupChat. The previous blog [FSM Group Chat](/blog/2024/02/11/FSM-GroupChat/)
+introduces a new feature of GroupChat that allows us to input a transition graph to constrain agent transitions.
+It requires us to describe the FSM's transition conditions in natural language in each agent's `description` parameter, and then relies on an LLM to read the descriptions and decide on the next agent.
+In this blog, we instead take advantage of a customized speaker selection function passed to the `speaker_selection_method` parameter of the `GroupChat` object.
+This function lets us customize the transition logic between agents and can be used together with the transition graph introduced in FSM Group Chat; the current StateFlow implementation also allows the user to override the transition graph.
+These transitions can be based on the current speaker and static checks of the context history (for example, checking whether 'Error' appears in the last message).
+
+We present an example of how to build a state-oriented workflow using GroupChat.
+We define a custom speaker selection function to be passed into the `speaker_selection_method` parameter of the GroupChat.
+Here, the task is to retrieve research papers related to a given topic and create a markdown table for these papers.
+
+![StateFlow Example](./img/sf_example_1.png)
+
+We define the following agents:
+- Initializer: Start the workflow by sending a task.
+- Coder: Retrieve papers from the internet by writing code.
+- Executor: Execute the code.
+- Scientist: Read the papers and write a summary.
+
+```python
+import autogen
+
+# Define the agents. The code below is for illustration and is not directly
+# executable: llm_config and code-execution settings are omitted for brevity
+# (see the StateFlow notebook for a runnable version).
+initializer = autogen.UserProxyAgent(
+    name="Init",
+)
+coder = autogen.AssistantAgent(
+    name="Coder",
+    system_message="""You are the Coder. Write Python code to retrieve papers from arxiv.""",
+)
+executor = autogen.UserProxyAgent(
+    name="Executor",
+    system_message="Executor. Execute the code written by the Coder and report the result.",
+)
+scientist = autogen.AssistantAgent(
+    name="Scientist",
+    system_message="""You are the Scientist. Please categorize papers after seeing their abstracts printed and create a markdown table with Domain, Title, Authors, Summary and Link. Return 'TERMINATE' in the end.""",
+)
+```
+
+In the figure, we define a simple workflow for research with four states: Init, Retrieve, Research, and End. Within each state, we call different agents to perform the tasks:
+- Init: We use the initializer to start the workflow.
+- Retrieve: We first call the coder to write code and then call the executor to execute the code.
+- Research: We call the scientist to read the papers and write a summary.
+- End: We end the workflow.
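+
+Before wiring up the transition function, it can help to write the workflow's allowed transitions down as a graph. The sketch below uses the `allowed_or_disallowed_speaker_transitions` and `speaker_transitions_type` parameters introduced in FSM Group Chat; treat it as an optional illustration under those assumptions rather than a required part of the StateFlow setup.
+
+```python
+# Sketch: the same workflow expressed as an allowed-transitions graph in the
+# style of FSM Group Chat. This constrains who may speak next; the customized
+# speaker selection function defined below then makes the choice deterministic.
+allowed_transitions = {
+    initializer: [coder],
+    coder: [executor],
+    executor: [coder, scientist],  # retry the coder on failure, else move on
+    scientist: [initializer],  # in practice the Scientist's TERMINATE ends the chat
+}
+# Pass via: GroupChat(..., allowed_or_disallowed_speaker_transitions=allowed_transitions,
+#                     speaker_transitions_type="allowed")
+```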
+
+Then we define a customized function to control the transition between states:
+```python
+def state_transition(last_speaker, groupchat):
+    messages = groupchat.messages
+
+    if last_speaker is initializer:
+        # init -> retrieve
+        return coder
+    elif last_speaker is coder:
+        # retrieve: action 1 -> action 2
+        return executor
+    elif last_speaker is executor:
+        # The executor's report starts with the exit code,
+        # e.g. "exitcode: 1 (execution failed)", so use a substring check.
+        if "exitcode: 1" in messages[-1]["content"]:
+            # retrieve --(execution failed)--> retrieve
+            return coder
+        else:
+            # retrieve --(execution success)--> research
+            return scientist
+    elif last_speaker is scientist:
+        # research -> end
+        return None
+
+
+groupchat = autogen.GroupChat(
+    agents=[initializer, coder, executor, scientist],
+    messages=[],
+    max_round=20,
+    speaker_selection_method=state_transition,
+)
+
+# Create the manager that runs the group chat; the workflow is then started
+# with initializer.initiate_chat(manager, message="...").
+manager = autogen.GroupChatManager(groupchat)
+```
+
+We recommend implementing the transition logic for each speaker in the customized function. In analogy to a state machine, a state transition function determines the next state based on the current state and input.
+Instead of returning an `Agent` object representing the next speaker, we can also return a string from `['auto', 'manual', 'random', 'round_robin']` to select a default method to use.
+For example, we can always fall back to the built-in `auto` method to employ an LLM-based group chat manager to select the next speaker.
+Returning `None` terminates the group chat. Note that some of the transitions, such as `initializer` -> `coder`, can also be defined with the transition graph.
+
+## For Further Reading
+* [StateFlow notebook](/docs/notebooks/agentchat_groupchat_stateflow)
+* [GroupChat with Customized Speaker Selection notebook](/docs/notebooks/agentchat_groupchat_customized)
+* [FSM Group Chat](/blog/2024/02/11/FSM-GroupChat/)
+* [Documentation about `autogen`](/docs/Getting-Started)