What happened?
Describe the bug
OpenAIChatCompletionClient.count_tokens (from autogen_ext.models.openai) ignores several JSON Schema fields of Tools (e.g., anyOf, default, title), printing Not supported field ... warnings and producing consistent gaps between the pre-send estimate and usage.prompt_tokens.
- In a multi-tool agent, we rely on pre-send token estimates to avoid context-window overruns
# The full code is given below in To Reproduce.
token_estimate = self.model_client.count_tokens(messages=model_messages, tools=[get_current_time_tool])
create_result = await self.model_client.create(
    messages=model_messages,
    tools=[get_current_time_tool],
    cancellation_token=ctx.cancellation_token,
)
token_usage = create_result.usage.prompt_tokens
if token_usage != token_estimate:
    print(f"Token usage mismatch: estimated {token_estimate}, actual {token_usage}")
- Result
Not supported field anyOf
Not supported field default
Not supported field title
Token usage mismatch: estimated 87, actual 99
Token usage mismatch: estimated 151, actual 83
To Reproduce
Environment
pip list | grep autogen
autogen-core 0.7.4
autogen-ext 0.7.4
- Model: gpt-4o
- Scenario: Agent makes a first call with tool-calls enabled, executes the tool, then makes a second call including the tool result.
Steps to Reproduce (Minimal)
- Package imports
import asyncio
import json
from datetime import datetime
from typing import Dict, Optional, List
import pytz
from autogen_core import (
    MessageContext,
    message_handler,
    RoutedAgent, SingleThreadedAgentRuntime, AgentId
)
from autogen_core.models import SystemMessage, UserMessage, FunctionExecutionResult, FunctionExecutionResultMessage, \
    AssistantMessage
from autogen_core.tools import FunctionTool
from autogen_ext.models.openai import OpenAIChatCompletionClient
from dotenv import load_dotenv
from typing_extensions import Annotated
load_dotenv()
- Define a simple function and convert it to autogen_core.tools.FunctionTool
def get_current_time(
    timezone: Annotated[Optional[str], "Asia/Seoul"] = None
) -> str:
    """Get the current date and time in the specified timezone."""
    if timezone is None:
        timezone = "Asia/Seoul"
    tz = pytz.timezone(timezone)
    current_time = datetime.now(tz)
    return current_time.strftime("%Y-%m-%d %H:%M:%S")

get_current_time_tool = FunctionTool(
    get_current_time, description=get_current_time.__doc__
)
- Define DatetimeAgent and register the FunctionTool
- The agent uses gpt-4o with tool calls enabled
- The first call allows tool calls; if a tool_call request comes back, the agent executes the tool and sends a second request that includes the tool_result
- For each model call, compute count_tokens before sending and compare it with the usage.prompt_tokens returned by the API
class DatetimeAgent(RoutedAgent):
    def __init__(self) -> None:
        super().__init__("DatetimeAgent")
        self.model_client = OpenAIChatCompletionClient(model="gpt-4o")

    @message_handler
    async def handle_user_message(
        self, message: UserMessage, ctx: MessageContext
    ) -> dict:
        model_messages = [
            SystemMessage(
                content="You are a helpful assistant.\nYou don't rely on your own knowledge, but utilize the tools provided to deliver accurate information."),
            message
        ]
        # First response generation (with tools)
        token_estimate = self.model_client.count_tokens(messages=model_messages, tools=[get_current_time_tool])
        create_result = await self.model_client.create(
            messages=model_messages,
            tools=[get_current_time_tool],
            cancellation_token=ctx.cancellation_token,
        )
        token_usage = create_result.usage.prompt_tokens
        if token_usage != token_estimate:
            print(f"Token usage mismatch: estimated {token_estimate}, actual {token_usage}")
        # Direct response
        if isinstance(create_result.content, str):
            return {"result": create_result.content}
        else:
            # Tool execution
            tool_results: List[FunctionExecutionResult] = []
            for tool_call in create_result.content:
                if tool_call.name == "get_current_time":
                    args: Dict = json.loads(tool_call.arguments)
                    timezone = args.get("timezone", "Asia/Seoul")
                    current_time = get_current_time(timezone)
                    tool_results.append(
                        FunctionExecutionResult(call_id=tool_call.id, content=current_time, is_error=False,
                                                name=tool_call.name))
            # Append model response and tool results to messages
            model_messages.append(AssistantMessage(content=create_result.content, source="assistant"))
            model_messages.append(FunctionExecutionResultMessage(content=tool_results))
            # Final response generation (no tools)
            token_estimate = self.model_client.count_tokens(messages=model_messages)
            create_result = await self.model_client.create(
                messages=model_messages,
                cancellation_token=ctx.cancellation_token,
            )
            token_usage = create_result.usage.prompt_tokens
            if token_usage != token_estimate:
                print(f"Token usage mismatch: estimated {token_estimate}, actual {token_usage}")
            return {"result": create_result.content}
- Run the agent
async def main():
    runtime = SingleThreadedAgentRuntime()
    await DatetimeAgent.register(
        runtime=runtime,
        type="datetime_agent",
        factory=lambda: DatetimeAgent()
    )
    runtime.start()
    user_message = UserMessage(content="What is the current time in Seoul?", source="user")
    response = await runtime.send_message(user_message, AgentId("datetime_agent", "default"))
    print("Response:", response)

if __name__ == '__main__':
    asyncio.run(main())
- Observed warnings & mismatches:
- Both the first (with tools) and the second call (including tool results) show persistent divergence between estimate and actual usage.
- Repeated Not supported field ... warnings for Tool schema fields such as anyOf, default, title.
Not supported field anyOf
Not supported field default
Not supported field title
Token usage mismatch: estimated 87, actual 99
Token usage mismatch: estimated 151, actual 83
Expected behavior
- Token counting should include commonly used JSON Schema fields in Tool definitions without warnings and should closely approximate actual usage.prompt_tokens.
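To make "closely approximate" concrete, this is the kind of check I would expect to pass (a sketch meant to slot into handle_user_message above; the tolerance of 5 tokens is an arbitrary assumption, since exact parity is probably unrealistic):
token_estimate = self.model_client.count_tokens(messages=model_messages, tools=[get_current_time_tool])
create_result = await self.model_client.create(
    messages=model_messages,
    tools=[get_current_time_tool],
    cancellation_token=ctx.cancellation_token,
)
# The estimate should stay within a small margin of what the API bills,
# instead of drifting by dozens of tokens.
assert abs(create_result.usage.prompt_tokens - token_estimate) <= 5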
Root-Cause Hypotheses
- FunctionExecutionResultMessage path
- The second call (with tool results appended) tends to drift more, suggesting the tokenization path for FunctionExecutionResultMessage might be incomplete or simplified (see the diagnostic sketch below).
- I need to do a little more checking, and if I find a more detailed cause, I'll report it in this issue.
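A rough way to probe this hypothesis (a sketch meant to slot into handle_user_message right before the final create call; it assumes tiktoken maps gpt-4o to the o200k_base encoding):
import tiktoken

# Diff the estimate with and without the trailing FunctionExecutionResultMessage
# to see how many tokens count_tokens attributes to it, then compare that with a
# raw tiktoken count of the tool result text.
encoding = tiktoken.get_encoding("o200k_base")  # gpt-4o encoding (assumption)
with_result = self.model_client.count_tokens(messages=model_messages)
without_result = self.model_client.count_tokens(messages=model_messages[:-1])
attributed = with_result - without_result
raw = sum(len(encoding.encode(r.content)) for r in tool_results)
print(f"attributed to the tool result message: {attributed}, raw content tokens: {raw}")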
- Omitted Tool-schema fields
- count_tokens_openai currently only accounts for type, description, and enum; unsupported fields are skipped with a warning, leading to an undercount.
- In the example code above, the get_current_time function is converted to JSON as shown below inside the count_tokens_openai function, and the calculation is then performed on it; fields like anyOf, default, and title appear there but are missing from the token calculation, each triggering a warning.
- get_current_time tool description:
[ {
  "type" : "function",
  "function" : {
    "name" : "get_current_time",
    "description" : "Get the current date and time in the specified timezone.",
    "parameters" : {
      "type" : "object",
      "properties" : {
        "timezone" : {
          "anyOf" : [ {
            "type" : "string"
          }, {
            "type" : "null"
          } ],
          "default" : null,
          "description" : "Asia/Seoul",
          "title" : "Timezone"
        }
      },
      "required" : [ ],
      "additionalProperties" : false
    },
    "strict" : false
  }
} ]
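For reference, this JSON can be reproduced from the tool itself (a sketch; get_current_time_tool.schema is the public ToolSchema dict, and convert_tools wraps it in the outer "type": "function" object seen above):
import json

# Print the JSON Schema that the FunctionTool exposes; count_tokens_openai
# walks a converted form of this when estimating tool tokens.
print(json.dumps(get_current_time_tool.schema, indent=2))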
- The anyOf, default, and title fields are missing from the token calculation and raise a warning:
autogen/python/packages/autogen-ext/src/autogen_ext/models/openai/_openai_client.py
Lines 373 to 402 in 82df9dd
# Tool tokens.
oai_tools = convert_tools(tools)
for tool in oai_tools:
    function = tool["function"]
    tool_tokens = len(encoding.encode(function["name"]))
    if "description" in function:
        tool_tokens += len(encoding.encode(function["description"]))
    tool_tokens -= 2
    if "parameters" in function:
        parameters = function["parameters"]
        if "properties" in parameters:
            assert isinstance(parameters["properties"], dict)
            for propertiesKey in parameters["properties"]:  # pyright: ignore
                assert isinstance(propertiesKey, str)
                tool_tokens += len(encoding.encode(propertiesKey))
                v = parameters["properties"][propertiesKey]  # pyright: ignore
                for field in v:  # pyright: ignore
                    if field == "type":
                        tool_tokens += 2
                        tool_tokens += len(encoding.encode(v["type"]))  # pyright: ignore
                    elif field == "description":
                        tool_tokens += 2
                        tool_tokens += len(encoding.encode(v["description"]))  # pyright: ignore
                    elif field == "enum":
                        tool_tokens -= 3
                        for o in v["enum"]:  # pyright: ignore
                            tool_tokens += 3
                            tool_tokens += len(encoding.encode(o))  # pyright: ignore
                    else:
                        trace_logger.warning(f"Not supported field {field}")
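One possible direction (just a sketch of the idea, not a tested patch): instead of warning and skipping, unrecognized fields could fall back to counting the tokens of their JSON-serialized value, so that anyOf, default, and title at least contribute something instead of being dropped entirely. The exact per-field overhead would still need calibration against usage.prompt_tokens:
import json
import tiktoken

def estimate_unhandled_field_tokens(value: object, encoding: tiktoken.Encoding) -> int:
    # Fallback sketch for fields the loop above does not special-case
    # (anyOf, default, title, ...): count the JSON-serialized value rather
    # than skipping it. Ignoring extra per-field padding is an assumption.
    return len(encoding.encode(json.dumps(value)))

# Example with the "anyOf" value from the timezone property shown earlier.
encoding = tiktoken.get_encoding("o200k_base")
print(estimate_unhandled_field_tokens([{"type": "string"}, {"type": "null"}], encoding))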
Additional context
- I'd like to open an issue rather than a PR, in case this behavior is intended and not a bug.
- My English is not very good, so there may be some awkward phrasing from using a translator. Thank you for your understanding.
Which packages was the bug in?
Python Extensions (autogen-ext)
AutoGen library version.
Python 0.7.4
Other library version.
No response
Model used
gpt-4o
Model provider
OpenAI
Other model provider
No response
Python version
3.12
.NET version
None
Operating system
MacOS