lsorber (Member) commented Dec 25, 2024

Changes:

  1. ✨ Enable streaming tool use for llama-cpp-python models (see below for details, and the consumption sketch after this list). The result is a simpler rag and async_rag implementation that opens the door to Agentic RAG and user-defined tools.
  2. ✨ Update the query parameter description to require that it be a single-faceted (i.e., non-compound) question, which encourages parallel function calling for compound questions.
  3. ✅ Add fairly thorough tests for both rag and the improved chatml-function-calling chat handler.
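
Below is a minimal sketch of what consuming streamed tool calls could look like with llama-cpp-python's OpenAI-compatible API. The model path and the `search_knowledge_base` tool are placeholders, and the chunk shape is assumed to follow OpenAI's streaming format, which llama-cpp-python mirrors:

```python
import json

from llama_cpp import Llama

# Placeholder model path; `chat_format` selects the improved handler.
llm = Llama(model_path="model.gguf", chat_format="chatml-function-calling")

# Hypothetical tool with a single-faceted `query` parameter (see change 2).
tools = [{
    "type": "function",
    "function": {
        "name": "search_knowledge_base",
        "description": "Search the knowledge base with a single-faceted query.",
        "parameters": {
            "type": "object",
            "properties": {"query": {"type": "string"}},
            "required": ["query"],
        },
    },
}]

stream = llm.create_chat_completion(
    messages=[{"role": "user", "content": "Compare the pros and cons of X and Y."}],
    tools=tools,
    tool_choice="auto",
    stream=True,
)

# Tool call fragments arrive incrementally; accumulate them by index and
# parse the argument JSON once the stream is exhausted.
tool_calls: dict[int, dict[str, str]] = {}
for chunk in stream:
    delta = chunk["choices"][0]["delta"]
    for tc in delta.get("tool_calls") or []:
        acc = tool_calls.setdefault(tc["index"], {"name": "", "arguments": ""})
        fn = tc.get("function") or {}
        acc["name"] += fn.get("name") or ""
        acc["arguments"] += fn.get("arguments") or ""

for call in tool_calls.values():
    print(call["name"], json.loads(call["arguments"]))
```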

Changes to llama-cpp-python's chatml-function-calling chat handler:

  1. General:
    a. ✨ If no system message is supplied, add an empty system message to hold the tool metadata (see the system-message sketch after this list).
    b. ✨ Add function descriptions to the system message so that tool use is better informed (fixes chatml-function-calling not adding tool description to the prompt abetlen/llama-cpp-python#1869).
    c. ✨ Replace print statements relating to JSON grammars with RuntimeWarning warnings.
    d. ✅ Add tests with fairly broad coverage of the different scenarios.
  2. Case "Tool choice by user":
    a. ✨ Add support for more than one function call by making this a special case of "Automatic tool choice" with a single tool (subsumes Support parallel function calls with tool_choice abetlen/llama-cpp-python#1503).
  3. Case "Automatic tool choice -> respond with a message":
    a. ✨ Use user-defined stop and max_tokens.
    b. 🐛 Replace incorrect use of follow-up grammar with user-defined grammar.
  4. Case "Automatic tool choice -> one or more function calls":
    a. ✨ Add support for streaming the function calls (fixes Feature request: add support for streaming tool use abetlen/llama-cpp-python#1883).
    b. ✨ Make tool calling more robust by giving the LLM an explicit way to terminate the tool calls by wrapping them in a <function_calls></function_calls> block (see the parsing sketch after this list).
    c. 🐛 Add missing ":" stop token to determine whether to continue with another tool call, which prevented parallel function calling (fixes chatml-function-calling chat format fails to generate multi calls to the same tool abetlen/llama-cpp-python#1756).
    d. ✨ Set temperature=0 to determine whether to continue with another tool call, similar to the initial decision on whether to call a tool.
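
To illustrate items 1a–1b, here is a hedged sketch of how the (possibly empty) system message could carry the tool metadata, including each function's description. The exact template wording in the handler may differ; this only conveys the idea:

```python
import json

def build_system_message(system_content: str, tools: list[dict]) -> str:
    """Append tool metadata (incl. descriptions) to the system message."""
    lines = [system_content, "", "You have access to the following functions:"]
    for tool in tools:
        fn = tool["function"]
        lines.append(f"functions.{fn['name']}:")
        if fn.get("description"):  # item 1b: descriptions inform tool use
            lines.append(fn["description"])
        lines.append(json.dumps(fn.get("parameters", {})))
    return "\n".join(lines).strip()

# Item 1a: an empty system message is added if the user supplied none.
print(build_system_message("", tools=[{
    "function": {
        "name": "search_knowledge_base",
        "description": "Search the knowledge base.",
        "parameters": {"type": "object", "properties": {"query": {"type": "string"}}},
    },
}]))
```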

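And a hedged sketch of parsing the terminated tool-call block from item 4b. It assumes the handler's existing `functions.<name>:` convention for individual calls, with the whole sequence wrapped in `<function_calls>` tags so the LLM can terminate explicitly; the exact delimiters in the final code may differ:

```python
import json
import re

def parse_function_calls(completion: str) -> list[tuple[str, dict]]:
    """Extract (tool_name, arguments) pairs from a <function_calls> block."""
    block = re.search(r"<function_calls>(.*?)</function_calls>", completion, re.DOTALL)
    if not block:
        return []
    # Split on the `functions.<name>:` headers; each body is a JSON object.
    parts = re.split(r"functions\.([\w.-]+):", block.group(1))
    names, bodies = parts[1::2], parts[2::2]
    return [(name, json.loads(body)) for name, body in zip(names, bodies)]

calls = parse_function_calls(
    '<function_calls>functions.search_knowledge_base:\n{"query": "pros of X"}\n'
    'functions.search_knowledge_base:\n{"query": "pros of Y"}</function_calls>'
)
assert calls == [
    ("search_knowledge_base", {"query": "pros of X"}),
    ("search_knowledge_base", {"query": "pros of Y"}),
]
```
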
lsorber requested a review from undo76 on December 25, 2024
lsorber self-assigned this on December 25, 2024
lsorber (Member, Author) commented Dec 26, 2024

Upstream PR to bring these improvements to llama-cpp-python: abetlen/llama-cpp-python#1884

lsorber merged commit c57aac1 into main on January 5, 2025 (2 checks passed)
lsorber deleted the ls-streaming-tools branch on January 5, 2025