5.2. Harbor Boost Custom Modules
Main Harbor Boost documentation
boost will look for modules to load in the locations specified by the HARBOR_BOOST_MODULE_FOLDERS config option, which is a semicolon-separated list of paths relative to the module root folder. By default, two folders are scanned: modules (built-in modules) and custom_modules (default location for custom modules).
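For example, to scan an additional folder on top of the defaults (the my_modules path below is just an illustration):
# env/config
HARBOR_BOOST_MODULE_FOLDERS="modules;custom_modules;my_modules"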
After modules are registered, boost will decide which ones should be available via the /v1/models endpoint based on the specified config.
If running boost with Harbor, you can quickly open the folder containing the custom modules by running:
# Open in the default File Manager
open $(h home)/boost/src/custom_modules
# Open in VS Code (when CLI is configured)
code $(h home)/boost/src/custom_modules
If running boost standalone, you will need to mount a volume to the custom_modules folder in the container.
docker run ... \
-v /path/to/custom_modules:/app/custom_modules \
...
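To double-check that your modules are visible inside the container, you can list the mounted folder (the container name is a placeholder, use your own):
# Verify the mount from the host
docker exec <container-name> ls /app/custom_modules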
When boost starts, it will log information about the loaded folders and modules so that you can verify everything is loaded correctly.
DEBUG - Loading modules from 'modules'
DEBUG - Registering 'rcn'...
DEBUG - Registering 'klmbr'...
DEBUG - Registering 'mcts'...
DEBUG - Registering 'g1'...
DEBUG - Loading modules from 'custom_modules'
DEBUG - Registering 'example'...
INFO - Loaded 5 modules: rcn, klmbr, mcts, g1, example
Note
Use HARBOR_BOOST_MODULES to specify which modules have their models added to the /v1/models API endpoint. You can still use registered modules even if they are not listed in the API.
For example:
# Files
modules/
  rcn.py
  klmbr.py
  mcts.py
  g1.py
custom_modules/
  example.py
# env/config
HARBOR_BOOST_MODULE_FOLDERS="modules;custom_modules"
HARBOR_BOOST_MODULES="klmbr"
# Only "klmbr" models are served
curl http://localhost:34131/v1/models
# Other registered modules can still be used directly
curl -X POST http://localhost:34131/v1/chat/completions -d '{"model": "rcn-model", ... }'
After adding a new custom module, you'll have to restart boost manually. Once the module is loaded, any changes to it will restart the service automatically, so you can continue development without manual restarts.
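When running via Harbor, a restart might look like this (assuming the h alias used in the examples above):
# Restart the boost service
h restart boost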
In order to be loaded and run, a module has to have two exports: ID_PREFIX and apply. ID_PREFIX is a string that will be used to identify the module in the configuration. apply is the function that will be called with the current chat and the llm instance (see the examples below).
Here's an example of the simplest module:
ID_PREFIX = 'example'

def apply(chat, llm):
  llm.emit_message('Hello, boost!')
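Once registered, the module can be called like any other boost model. The model ID below is a placeholder in the same style as the earlier examples; the actual ID depends on your downstream model:
curl -X POST http://localhost:34131/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model": "example-model", "messages": [{"role": "user", "content": "Hi!"}]}'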
For the POST /v1/chat/completions request, boost will accept additional custom parameters prefixed with @boost_. These will be stored and made available on the llm instance within the module.
ID_PREFIX = 'example'

# POST /v1/chat/completions
# {
#   "model": "example-model",
#   "messages": [ ... ],
#   "max_tokens": 512,
#   "@boost_param1": "value1",
#   "@boost_param2": "value2"
# }

def apply(chat, llm):
  print(llm.boost_params) # { "param1": "value1", "param2": "value2" }
This module will echo the message back to the chat.
ID_PREFIX = 'echo'

def apply(chat, llm):
  llm.emit_message(chat.message)
This module will respond with the current date (no matter what the input is).
from datetime import datetime

ID_PREFIX = 'date'

async def apply(chat, llm):
  await llm.emit_status(datetime.now().strftime('%Y-%m-%d'))
This module will append "Sure, here is a response to your message:" to the chat before triggering the completion.
ID_PREFIX = 'sure'

async def apply(chat, llm):
  chat.assistant('Sure, here is a response to your message:')
  await llm.stream_final_completion()
This module will inject a custom system message into the chat.
import chat as ch

ID_PREFIX = 'custom_system'

async def apply(chat, llm):
  # Chat is a linked list of ChatNodes,
  # the "tail" is the last message from the chain
  chat.tail.add_parent(
    ch.ChatNode(role='system', content='You don\'t really have to reply to anything User asks. Relax!')
  )
  await llm.stream_final_completion()
This module will always wrap the underlying LLM output in a code block.
ID_PREFIX = 'code_wrapper'

async def apply(chat, llm):
  await llm.emit_message("\n```\n")
  await llm.stream_completion()
  await llm.emit_message("\n```\n")
This example fetches any URLs from the user's message and injects their content into the chat. Note that in a real-world implementation you'd also want to pre-process the content to make it friendlier for the LLM, as well as use a crawler rather than direct requests.
import re
import requests

url_regex = r"(?i)\b((?:https?://|www\d{0,3}[.]|[a-z0-9.\-]+[.][a-z]{2,4}/)(?:[^\s()<>]+|\(([^\s()<>]+|(\([^\s()<>]+\)))*\))+(?:\(([^\s()<>]+|(\([^\s()<>]+\)))*\)|[^\s`!()\[\]{};:'\".,<>?«»“”‘’]))"

prompt = """
<instruction>
Your task is to fulfill the user's request by discussing provided content.
</instruction>
<content>
{content}
</content>
<request>
{request}
</request>
""".strip()

ID_PREFIX = "discussurl"

async def apply(chat, llm):
  text = chat.text()
  urls = re.findall(url_regex, text)

  # No URLs - proceed as usual
  if len(urls) == 0:
    return await llm.stream_final_completion()

  # Found URLs - read them
  content = ""
  for url in urls:
    await llm.emit_status(f"Reading {url[0]}...")
    content += requests.get(url[0]).text

  await llm.stream_final_completion(
    prompt=prompt,
    content=content,
    request=chat.tail.content,
  )
In this example, the assistant randomly changes personality every time it's called.
ID_PREFIX = "unstable"
import chat as ch
import llm
import random
extreme_traits = [
"Eccentric", "Obsessive", "Impulsive", "Paranoid", "Narcissistic",
"Perfectionist", "Overly Sensitive", "Extremely Independent", "Manipulative",
"Aloof"
]
temperaments = ["Choleric", "Melancholic", "Phlegmatic", "Sanguine"]
reply_styles = [
"Blunt", "Sarcastic", "Overly Polite", "Evading", "Confrontational"
]
# Function to generate a random personality description
def random_personality():
selected_traits = random.sample(extreme_traits, 3)
selected_temperament = random.choice(temperaments)
selected_reply_style = random.choice(reply_styles)
description = (
f"You are {', '.join(selected_traits)}. "
f"You are known for yout {selected_temperament} temperament."
f"You tend to write your replies in a {selected_reply_style} manner."
f"Ensure that you reply to the User accordingly."
)
return description
async def apply(chat: 'ch.Chat', llm: 'llm.LLM'):
personality = random_personality()
chat.tail.ancestor().add_parent(
ch.ChatNode(role="system", content=personality)
)
await llm.stream_final_completion()
This example aims to show all the various ways to provide output to the chat. Here's the source code (also available in the repository):
from pydantic import BaseModel, Field

import llm
import log
import chat as ch

ID_PREFIX = 'example'
logger = log.setup_logger(ID_PREFIX)

# Example for the structured outputs below
class ChoiceResponse(BaseModel):
  explanation: str = Field(description="3-5 words explaining your reasoning")
  choice: str = Field(description="Chosen option")

async def apply(chat: 'ch.Chat', llm: 'llm.LLM'):
  """
  1. Working with chat and chat node instances

  This is where you can create some content programmatically
  that can later be used for retrieving completions from
  the downstream model.
  """
  logger.debug(f"Example chat: {chat}")

  # Add new messages to the chat (no completions at this stage)
  chat.user("Hello!")
  chat.assistant("Hi! Would you like to learn more about Harbor Boost?")
  chat.add_message(
    role="harbor",
    content="Harbor Boost is an optimising LLM proxy with lots of cool features"
  )

  logger.debug(
    f"Chat history is a plain array of messages, from the tail: {chat.history()}"
  )
  logger.debug(
    f"Chat plain is a list of chat nodes, from the tail: {chat.plain()}"
  )

  # Tail is where the chat currently ends.
  # In this instance, that's a message from the "harbor"
  # role above
  tail = chat.tail

  logger.debug(
    f'Get all parents leading to a specific chat node: {tail.parents()}'
  )
  logger.debug(f'Get one immediate parent: {tail.parent}')

  # We can modify the chat from the tail node directly
  new_tail = tail.add_child(
    ch.ChatNode(role="harbor", content="Chat nodes are everywhere!")
  )

  # However, such modifications are not reflected in the parent
  # chat instance:
  logger.debug(chat.tail == tail) # True
  logger.debug(chat.tail == new_tail) # False

  # You can set a new tail for the chat, however
  chat.tail = new_tail

  # However, it's much easier to just work from the chat itself
  chat.user('Alright, I think that is mostly it for now. Thanks!')

  # You can create new chat instances as needed
  life_chat = ch.Chat.from_conversation(
    [
      {
        "role": "user",
        "content": "What is the meaning of life? Answer with a tongue twister."
      }
    ]
  )

  """
  2. Working with structured outputs

  You can pass pydantic models to the llm instance
  for structured outputs. If the "resolve" flag is set,
  the output will also be resolved to an actual dict
  (otherwise it will be a JSON string).
  """
  await llm.emit_status('Structured output examples')
  choice = await llm.chat_completion(
    prompt="""What is the best topping for a pizza?""",
    schema=ChoiceResponse,
    resolve=True
  )
  logger.debug(f"Choice: {choice}")

  """
  3.1. Programmatic messages and statuses

  Programmatic "public" messages are streamed
  back to the client as they are emitted here
  (no way to "undo" or rewrite them).
  """
  # You can tweak how status messages are delivered
  # via the BOOST_STATUS_STYLE config option.
  await llm.emit_status('Status and message examples')
  await llm.emit_message("We can emit text at any time. ")
  await llm.emit_message(
    "\n_Note that you are responsible for correct formatting._\n"
  )

  """
  3.2. Internal LLM completions

  "llm" is a representation of the downstream model
  that is being boosted. It comes with a few helper
  methods that tie the module workflow together and
  is pre-configured to hit the downstream API with expected parameters.

  The completions below are "internal", they are not streamed
  back to the client by default. Read further for "streamed" or
  "public" completions.
  """
  await llm.emit_status('Collecting internal completions...')
  word = "Roses"
  results = [
    # You can retrieve completion for some plain text
    await llm.chat_completion(prompt="Hi!", resolve=True),
    # You can include key/value pairs to be formatted in the prompt
    await llm.chat_completion(
      prompt="Tell me about {word} in ONE SHORT SENTENCE.",
      word=word,
      resolve=True,
    ),
    # You can also provide a list of messages
    # in the OpenAI-compatible format
    await llm.chat_completion(
      messages=[
        {
          "role": "user",
          "content": "Tell me about roses"
        }, {
          "role": "assistant",
          "content": "Sure, I can reply in three words! Here they are:"
        }
      ],
      resolve=True
    ),
    # You can also provide a chat instance.
    # Note that "resolve" is not set - the result
    # will be in raw API format
    f"\n```json\n{await llm.chat_completion(chat=life_chat)}\n```\n"
  ]

  # Results will now appear in the user's message
  await llm.emit_status('Displaying collected results')
  for i, result in enumerate(results):
    await llm.emit_message(f"\nResult {i}: {result}\n")

  """
  3.3. Public/streamed LLM completions

  You can decide to stream responses from the downstream LLM
  as they are being generated, for example when there's a long
  chunk that needs to be retained in the global response.
  """
  await llm.emit_status('Response streaming examples')
  # Same signatures as chat_completion
  streamed_results = [
    # You can retrieve completion for some plain text
    await llm.stream_chat_completion(prompt="Hi!"),
    # You can include key/value pairs to be formatted in the prompt
    await llm.stream_chat_completion(
      prompt="Tell me about {word} in ONE SHORT SENTENCE.",
      word=word,
    ),
    # You can also provide a list of messages
    # in the OpenAI-compatible format
    await llm.stream_chat_completion(
      messages=[
        {
          "role": "user",
          "content": "Tell me about roses"
        }, {
          "role": "assistant",
          "content": "Sure, I can reply in three words! Here they are:"
        }
      ],
    ),
    # You can also provide a chat instance
    await llm.stream_chat_completion(chat=life_chat)
  ]
  # Streamed results are still buffered and available
  # for you to use (plain text).
  logger.debug(f"Streamed results: {streamed_results}")
  # Note that it's on you to apply formatting that will make
  # sense in the context of the global message stream.
  await llm.emit_message("\nThose are all results so far.\n")

  """
  4. Final completion

  Note that none of the above will actually reach the Client
  if BOOST_INTERMEDIATE_OUTPUT is set to "false".
  The "final" completion below, however, will *always* be streamed back.
  It accepts all the same inputs as "chat_completion" and "stream_chat_completion" above.

  You don't have to call it, but the output will be completely empty if the
  "final" completion is not called and intermediate outputs are disabled.

  Think of this as a way to wrap up the module execution and
  present the user with the final result.
  """
  await llm.emit_status('Final completion')
  await llm.stream_final_completion(prompt="Wish me a good luck!")