DRAFT support Gemini ThoughtSignature and correct FinishReason for streaming updates #15173
Conversation
…eaming updates

fix: correct FinishReason for streaming updates. VertexAI returns FinishReason.Unspecified on partial updates; this was incorrectly mapped to ChatFinishReason.Stop and should be null.
fix: add ModelArmor to the supported FinishReasons, mapping it to ChatFinishReason.ContentFilter.
feat: support Gemini Thought Signatures. Use M.E.AI's TextReasoningContent for thought summaries (Part.Thought == true); store the ThoughtSignature for individual Parts in the AdditionalProperties dictionary of the corresponding AIContent.
38cc036 to 1aaf075
Just saw this, thanks. I didn't follow why TextReasoningContent.ProtectedData is insufficient. Is the issue fundamentally what you wrote: "In Gemini, reasoning isn't a separate Part, but rather a property on another valid Part (such as Text or FunctionCall)"? Did you consider just translating any part that has a thought signature instead into multiple AIContent instances, one that's a TextReasoningContent with the thought signature as ProtectedData and one for the other content (e.g. the text, the function call, etc.)? A TextReasoningContent without Text is only good for roundtripping back to the original model, so a consumer is just going to ignore it, and the IChatClient implementation can reassemble things when sending back to the model so that a TextReasoningContent just contributes its ProtectedData as the ThoughtSignature for the subsequent part.
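For illustration, a minimal sketch of that translation, assuming a simplified Part shape matching the JSON examples in this thread (the Part record and TranslatePart helper are hypothetical, not the PR's actual code):

```csharp
using System.Collections.Generic;
using Microsoft.Extensions.AI;

// Simplified stand-in for the Vertex Part shape shown in this thread (hypothetical).
record Part(string? Text, bool Thought = false, string? ThoughtSignature = null);

static class PartTranslation
{
    // Hypothetical sketch: split one Gemini Part into multiple M.E.AI contents.
    static IEnumerable<AIContent> TranslatePart(Part part)
    {
        if (part.Thought)
        {
            // Thought summary: informational text, surfaced as reasoning content.
            yield return new TextReasoningContent(part.Text);
            yield break;
        }

        if (!string.IsNullOrEmpty(part.ThoughtSignature))
        {
            // The signature rides on a regular part; emit it as a text-less
            // TextReasoningContent so its ProtectedData can round-trip.
            yield return new TextReasoningContent(null) { ProtectedData = part.ThoughtSignature };
        }

        yield return new TextContent(part.Text); // or a FunctionCallContent, etc.
    }
}
```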
Those are the only pre-defined values (static properties on ChatFinishReason), but ChatFinishReason is a string-wrapping struct, so technically you could store whatever you want.
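For example (assuming M.E.AI's public ChatFinishReason surface; "MODEL_ARMOR" is just an illustrative value):

```csharp
using System;
using Microsoft.Extensions.AI;

// ChatFinishReason wraps an arbitrary string; Stop/Length/ToolCalls/ContentFilter
// are merely the pre-defined static properties.
ChatFinishReason custom = new("MODEL_ARMOR");
Console.WriteLine(custom.Value);                             // MODEL_ARMOR
Console.WriteLine(custom == ChatFinishReason.ContentFilter); // False
```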
No, I didn't. I also mistakenly assumed that the thought signature could appear on any part in the non-streaming case. It is indeed on the first content part, so this seems doable and cleaner.

```json
{
"candidates": [
{
"content": {
"role": "model",
"parts": [
{
"text": "Okay, here's what I'm thinking:\n\nFirst, this user wants to plan a day trip, and they've got some weather preferences. Great. I see that I can use `get_current_weather` – that seems like a relevant tool.\n\nNow, I need...(snip)...",
"thought": true
},
{
"text": "I can certainly look up the weather for you in those cities. That way you can decide which city has the best weather for you to walk around in.",
"thoughtSignature": "CrMNAePx/16T5mM0WQgdAtYLJ...(snip)..."
},
{ "functionCall": {"name": "get_current_weather", "args": { "location": "Salzburg"} } },
{ "functionCall": {"name": "get_current_weather", "args": { "location": "Paris" } } },
{ "functionCall": {"name": "get_current_weather", "args": { "location": "Prague" } } }
]
},
"finishReason": "STOP",
"avgLogprobs": -0.75160148962220152
}
]
...
}
```

I'll update the PR with that approach. Thanks!
… signatures WIP: broken on streaming results
I've updated with the suggested approach; it's cleaner and works well for the non-streaming use case. The integration test for the streaming case is failing. The individual AIContents are all there, but I haven't yet debugged it extensively; I'll pick it up again in a few hours.
It shouldn't be losing the ProtectedData, but if it is, or if the logic of ToChatResponse needs to be tweaked, we can do that (ideally today if possible, to catch our next build). I will take a look when I'm at my desk. The intent was that a TextReasoningContent A could be combined with a following TextReasoningContent B if the former didn't have ProtectedData, even if the latter did, and then the combined text plus any ProtectedData from B would be on the combined instance. It's possible that's not happening.
- Streaming + IncludeThoughts
- Streaming + not IncludeThoughts
- Non-Streaming + IncludeThoughts
- Non-Streaming + not IncludeThoughts
To clarify the implementation, given the following response...

```json
[{
"text": "Okay, here's what I'm thinking:\n\nFirst, this user wants to plan a day trip, and they've got some weather preferences. Great. I see that I can use `get_current_weather` – that seems like a relevant tool.\n\nNow, I need...(snip)...",
"thought": true
},
{
"text": "I can certainly look up the weather for you in those cities. That way you can decide which city has the best weather for you to walk around in.",
"thoughtSignature": "CrMNAePx/16T5mM0WQgdAtYLJ...(snip)..."
}]
```

...I'm parsing it into:

```csharp
new TextReasoningContent(parts[0].text),
new TextReasoningContent(null) { ProtectedData = parts[1].thoughtSignature },
new TextContent(parts[1].text)
```

When we don't specify IncludeThoughts=true in the request, everything is fine, since we only have a single TextReasoningContent.
EDIT: I misread what you wrote. I see: the first text part is noted as a thought, but the second text part has a thought signature.
Here's a simple example without involving the PredictionServiceChatClient:

```csharp
[Fact]
public void ToChatResponse_SimpleText_WithThoughtsAndThoughtSignature()
{
    // Response #1: [{ "text": "This is my thought.", "thought": true }]
    var textReasoningContent1 = new TextReasoningContent("This is my thought.");

    // Response #2: [{ "text": "This is my response.", "thoughtSignature": "someThoughtSignatureBase64" }]
    var textSignatureContent = new TextReasoningContent(null) { ProtectedData = "someThoughtSignatureBase64" };
    var textContent1 = new TextContent("This is my response.");

    List<ChatResponseUpdate> responses =
    [
        new ChatResponseUpdate(ChatRole.Assistant, [textReasoningContent1]),               // 1
        new ChatResponseUpdate(ChatRole.Assistant, [textSignatureContent, textContent1]),  // 2
    ];

    var chatResponse = responses.ToChatResponse();
    var contents = chatResponse.Messages[0].Contents;

    Assert.Equal("This is my response.", chatResponse.Text);
    Assert.Equal("This is my thought.", contents.OfType<TextReasoningContent>().First(trc => trc.ProtectedData is null).Text);
    Assert.Equal(1, contents.OfType<TextContent>().Count());
    Assert.Equal(1, contents.OfType<TextReasoningContent>().Count(trc => trc.ProtectedData is null));
    Assert.Equal(1, contents.OfType<TextReasoningContent>().Count(trc => trc.ProtectedData is not null)); // <-- Fails.
}
```

Is it reasonable for ToChatResponse not to merge TextReasoningContent when ProtectedData is set? We only expect to have exactly 1 TextReasoningContent with ProtectedData set per turn, but there may be 0 or multiple TextReasoningContent("thought")s.
There's a bug in ToChatResponse I'm fixing right now where the ProtectedData is getting dropped.
The intent of the current scheme was that you could have a sequence of TextReasoningContent instances and they'd be mergeable up to and including the first one that has ProtectedData set, with the merged instance keeping the combined text plus that ProtectedData. The bug is that right now the ProtectedData is dropped from the merged instance. Once that's fixed, does that work for your needs, or is any combining with something that has ProtectedData set problematic?
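To make the intended rule concrete, a small sketch (my reading of the scheme described above, using M.E.AI types):

```csharp
using System.Collections.Generic;
using Microsoft.Extensions.AI;

// Intended rule: consecutive TextReasoningContent instances are mergeable up to
// and including the first one carrying ProtectedData, which ends the run.
List<ChatResponseUpdate> updates =
[
    new(ChatRole.Assistant, [new TextReasoningContent("Thinking, ")]),
    new(ChatRole.Assistant, [new TextReasoningContent("still thinking.") { ProtectedData = "sig" }]),
    new(ChatRole.Assistant, [new TextReasoningContent("A new run.")]),
];

var response = updates.ToChatResponse();
// Expected (post-fix): two TextReasoningContent instances:
// "Thinking, still thinking." with ProtectedData = "sig", then "A new run.".
```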
In theory it's workable. Per model turn, we expect 0 or 1 ThoughtSignatures and 0 or many Thoughts (max 1 if non-streaming). It doesn't quite map to our API, because a Thought is just a flag on a text part, and a ThoughtSignature is an attribute on a Part. But I think we could just attach it to the first appropriate Part in Content.Parts in the subsequent request to the model (which is what I have already). From the user perspective, I'm not so sure though. This means we have two uses for TextReasoningContent: one is throw-away and informational, and the other is critical and needs to be persisted. And the user needs to inspect the contents of the object to determine what they need to do with it. Even if these are correctly merged, we still have an object that has one throw-away property (Text, which can be quite large) and one critical one (ProtectedData). Could this warrant a new type for informational Thoughts, or a type explicitly for ThoughtSignature/ProtectedData? If we had something like…

Edit: I must add, when I say it's workable I mean getting the code to work and the tests to pass. I would have to defer to @amanda-tarafa to answer whether it works for our needs.
Great
I don't think so? We arrived at this based on looking at how various services represent this; e.g., other services actually expect them to be merged, such that, for example, streaming reasoning sends individual text deltas and then a final signature, and expects both the coalesced text and the signature to be roundtripped later. (Also, at this point, adding something that duplicates what's here would, I expect, lead to confusion.)
While it may be for Google, Text is not throwaway for all services, though.
Users generally don't need to look at ProtectedData; it's there just to roundtrip.
What else is there? That's a genuine question. Thought signatures / encrypted content / protected data / etc. are not user facing; they exist purely to roundtrip data back to the service/model. If everything holds together technically, such that the data can be roundtripped, what other concerns do we have?
Roger. I'll need to update the tests to match the expected merged content, but other than that it should work once the fix is there. It already works fine without IncludeThoughts for both non-streaming and streaming, and with IncludeThoughts for non-streaming. I'll add a few more tests around the thought handling.
OK, please let me know if you run into any issues. I'm certainly open to revising the coalescing strategy if it's not workable as is (after the bug fix dotnet/extensions#6936).
From https://ai.google.dev/gemini-api/docs/thinking#signatures
And from https://ai.google.dev/gemini-api/docs/function-calling?example=meeting#thinking
Recapping, and let me know if there's something wrong here: …
Proposal: …
I like the alternatives that rely on… What do you think? I can work on this myself, but it'll likely be Wednesday.
One point to clarify: the request payloads for the streaming and non-streaming endpoints are identical. The streaming endpoint streams the partial responses as they are generated. IEnumerable.ToChatResponse() is used to flatten these (as opposed to, say, compacting multiple turns into fewer to manage context length). ADK does this too. We don't need to preserve the unflattened form when we round-trip to the model; i.e., the flattened version we recreate is what the response would have been had we hit the non-streaming endpoint. In practice, this just means flattening any multiple thoughts or multiple text parts into one each, and leaving everything else as is.
Thanks for all the details, @drch-! What I'm hearing (and please correct me if I've misunderstood) is that you can make it work with what's there today, but doing so requires you to effectively bake in knowledge of the heuristics. Based on what you've said, I think all of these are potential solutions, in my order of preference (not promising we'd do any of these yet, just trying to understand the field, and then we'll need to vet against other providers):
Yes, I think this makes sense and does not contradict any of the documentation.
Not really; we shouldn't be merging the signature with the previous thoughts. Because you are coalescing, we would have to drop all mergeable thoughts on our side before returning, and I really believe we shouldn't be in the business of dropping bits. It seems we need 1 or 3; 2 won't work because we cannot guarantee that the signature will end up in its original part. I like 3 better, because 1 (and 2, if made to work) are mixing data with behaviour, and those relations may not hold in the future. What about a version of 3 where the property is named…
But that's avoidable by inserting an AIContent between them that will prevent them from being coalesced (literally…).
Yes, I hadn't either read about this (this is not exactly 2, I think) or thought about it, but that would work; I'd still insert one before and one after, etc. But yes, using an empty AIContent as a boundary as a (very) temporary solution would work.
@stephentoub The preferred solution would be a combination of 2 and 3 with…
(1) gives you the same capability as (3)... anything you don't want coalesced is just given its own part id. But it's also, in my mind, much cleaner than (3), because it's describing what the thing is rather than behaviors that should be applied to it by unrelated components that may or may not be used later on. That's why (3) is my least favorite. And various providers actually have that concept, whether it be "part" or "index" or such terminology.
Also, for (2): does Gemini/Vertex today ever actually send a thought signature with non-text, or is that a hypothetical future thing?
Yes, it does. For example, it's common that a response contains only function call(s); the thought signature appears on that part.
As @drch- said, yes, and it's even documented in https://ai.google.dev/gemini-api/docs/thinking#signatures: "Signatures are returned from the model within other parts in the response, for example function calling or text parts." If we don't have (2), we have to use empty TRCs to put these signatures before or after the actual content.
Yes, agreed in terms of capability.
But M.E.AI would actually be inferring a desired behaviour just from the description of the thing. That's precisely what I don't like: I prefer the description of the thing and the expectations of behaviour to be separate. RawResponse is a similar description of the thing, and yet you are not inferring behaviour from it (I know it's not serializable, and it would be breaking to use it now anyway). AdditionalProperties is also a description of the thing, and you are not inferring any behaviour from it, even if losing info on the way, because you only attach the first AdditionalProperties to the merged result. PartId is fundamentally no different than those, even if it's "an ID".
Yes, it would be inferring that it's OK, within a very narrow heuristic, to coalesce all text (or, separately, text reasoning) that's part of the same part. As @drch- pointed out, ADK does that too, merging viable updates within the same part. Adding a "coalescable" onto AIContent is putting external concerns onto it. This is not specific to the content, but rather a directive to an external algorithm about how that algorithm should behave. It's declaring a "how" for one possible consumer of the data, rather than declaring a "what", which is directly tied to the data. It happens that we ship an implementation of such an algorithm in the same assembly, but that's just packaging; anyone could write such a thing. And that doesn't scale. What if someone wants to write a reducer that removes content in order to reduce the size of chat history... should AIContent also be imbued with a strongly-typed property for priority on removal? What if someone wants to write a translator that transforms a history into a different language... should AIContent be given a strongly-typed property that dictates whether it's intended to be translated or whether that would result in issues? And so on. I would much rather enable the AIContent (or, probably, ChatResponseUpdate) to express the part that it's a part of, reflecting the actual data provided by the provider, which is about the what rather than the how for some unrelated algorithm. And then that algorithm's heuristics can take that into account.
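Purely to illustrate the shape of that idea, a hypothetical sketch (no PartId exists in M.E.AI today; AdditionalProperties stands in for it here):

```csharp
using Microsoft.Extensions.AI;

// Hypothetical: tag each AIContent with the provider part it came from, so a
// part-aware coalescer could merge only within a part, never across parts.
var thought = new TextReasoningContent("thought summary...")
{
    AdditionalProperties = new() { ["PartId"] = 0 }
};
var answer = new TextContent("visible answer")
{
    AdditionalProperties = new() { ["PartId"] = 1 }
};
```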
This is in the 9.10.1 release on nuget. |
First, one note: we will make this work almost certainly, with whatever surface M.E.AI ends up having. We can make it work now with AIContent as boundary, and any of the options being discussed here are already better than that. I'll continue the discussion to try to get to a better surface, but not because this is blocking. My thoughts below, but my proposal is this: add…
Not exactly; what ADK does is wrap, and return, each streamed response in an LlmResponse that preserves the original content, including the original parts with signatures, etc., and then, when possible, returns extra synthetic LlmResponses that contain the merged text or thoughts. The important bit being: merging viable content does not lose information, and the original parts can all be recovered, which is what we need for Gemini, at least for the signed parts.
I don't agree that what M.E.AI is shipping just happens to be a specific implementation. If someone writes a custom reducer and I use it, I do so at my own risk, and it's my responsibility to understand the roundtripping implications of using that custom reducer for the set of models I'm working with. But I think it's a fair expectation to put on M.E.AI's own…

I don't agree either with this being about instructions to some random algorithm. What we are trying to do is allow each provider to specify how the data needs to be preserved for roundtripping, which is a fundamental aspect; whether we call that "describing the data in terms of preservation" or "giving instructions for preservation of the data" is, if I may say so, somewhat irrelevant. And of course, algorithms may decide not to respect those instructions, or they may not need to, but that's on them.
Adding PartId won't scale either. That's just a proxy for passing on the preservation instructions by convention, but they continue to be incomplete instructions (as compared to the domain of possible transformations, etc.). M.E.AI was using ProtectedData in the same way, and that didn't scale just now; that's why we are talking about this. In three months, ProtectedData and PartId won't be enough, and there'll need to be a third hint somewhere. Last night I was thinking about pushing the merging decision down to the providers through callbacks or similar, but that won't scale either (there are plenty more transformations than just merging) and seems messy in terms of API surface. So, what about AIContent having a…
With this in, we can work on a workaround for preserving the signature. Given the following Vertex parts… we yield the following AIContents… so that they are coalesced as follows… @stephentoub, @drch-: do you think this is right, assuming M.E.AI's current coalescing strategy?
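Since the original part/content lists were lost from this transcript, a hedged reconstruction of the boundary idea discussed earlier (an empty AIContent stops M.E.AI's coalescing from merging across it):

```csharp
using System.Collections.Generic;
using Microsoft.Extensions.AI;

// Hedged reconstruction: a bare AIContent between reasoning contents keeps the
// coalescer from merging across it, so a text-less, signed TextReasoningContent
// survives the merge intact.
List<AIContent> contents =
[
    new TextReasoningContent("thought summary...(snip)..."),
    new AIContent(), // boundary: prevents coalescing with what follows
    new TextReasoningContent(null) { ProtectedData = "thoughtSignature...(snip)..." },
    new TextContent("I can certainly look up the weather for you in those cities."),
];
```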
Yes, that coalescing is accurate. I tested that specific scenario with the latest M.E.AI, and the output was exactly as described. I'd be happy if you would make the changes, @amanda-tarafa; I likely won't have a chance today.
Great. I think we're going to need to agree to disagree on the specifics ;-)
I was out for part of this week, but I'll get to this as soon as I have a chance. |
Thanks :) Are we blocking an initial release of the package for this? Or could this just be incremental bug fixes?
No, just for me to get some time to patch the current implementation and add a couple of headers to track usage. It won't be later than Wednesday. (@jskeet already fixed project generation, so we are good on that also.)
Cool, sounds good. Please let me know if I can help at all. (Also, it shouldn't affect anything, but there's a 9.10.2 for M.E.AI.Abstractions if you want to bump to that.)
@amanda-tarafa any news about that release? :)
No news yet, but I'm on it; please bear with me.
Closing, replacement PR coming shortly. |
@amanda-tarafa, any updates here on Google.Cloud.VertexAI.Extensions getting published? Can I help in some way? cc: @rogerbarreto
Any news on the followup PR for this?
Please see #13815 (comment) and in general follow #13815 for updates. Thanks.
towards #13815 as discussed with @amanda-tarafa
updated after implementing feedback from stephentoub
Summary: a small fix to correctly handle a null FinishReason in streamed updates, and a larger one to bridge the gap between our Part.ThoughtSignature and Microsoft.Extensions.AI's TextReasoningContent.
First, the small one:
FinishReason
When we send streaming updates, VertexAI sets FinishReason to Unspecified on the interim results and the overall reason (typically Stop, unless otherwise filtered) on the final update. I've updated the mapping of our FinishReason.Unspecified to a null M.E.AI ChatFinishReason, which is how the OpenAI client handles it.

Question: How should we handle new FinishReasons in the catch-all? Or, alternatively, do we expect future FinishReasons to be ContentFilter or Stop? M.E.AI has only Stop, Length, ToolCalls, and ContentFilter. I've left it as is (Stop), but I'm leaning towards ContentFilter.
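A sketch of the mapping as described, using a stand-in enum for the Vertex FinishReason values mentioned in this thread (the MaxTokens and Safety rows are my assumptions):

```csharp
using Microsoft.Extensions.AI;

// Stand-in for the Vertex FinishReason values discussed in this thread.
enum FinishReason { Unspecified, Stop, MaxTokens, Safety, ModelArmor, Other }

static class FinishReasonMapping
{
    public static ChatFinishReason? ToChatFinishReason(FinishReason reason) => reason switch
    {
        FinishReason.Unspecified => null, // partial streaming update: no reason yet
        FinishReason.Stop => ChatFinishReason.Stop,
        FinishReason.MaxTokens => ChatFinishReason.Length,     // assumption
        FinishReason.Safety => ChatFinishReason.ContentFilter, // assumption
        FinishReason.ModelArmor => ChatFinishReason.ContentFilter,
        _ => ChatFinishReason.Stop, // current catch-all; ContentFilter is being considered
    };
}
```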
Thought Summaries and Thought Signatures
Gemini 2.5 models can think. Callers can specify a thinkingBudget: 2.5 Flash variants allow a minimum of 0 (disabled); 2.5 Pro has a minimum of 128 (i.e., thinking is always enabled).

Callers can specify "includeThoughts": true in a request. Thoughts are purely informational summaries of the raw thoughts, and can provide the developer with insights into the reasoning, or an end user with a responsive experience during the thinking phase. They are returned as text Parts with "thought": true.

ThoughtSignatures are particularly effective for function calling. If the thinking budget is > 0 AND function calling is enabled, Gemini will include a ThoughtSignature on the first non-thinking Part in the response. A ThoughtSignature is an encrypted representation of the model's internal thought process. (Note: the thought signature appearing on the first Part is an observation, not specified in the API.)
Example response where includeThoughts is true, and thinkingBudget > 0
{ "candidates": [ { "content": { "role": "model", "parts": [ { // thought "text": "Okay, here's what I'm thinking:\n\nFirst, this user wants to plan a day trip, and they've got some weather preferences. Great. I see that I can use `get_current_weather` – that seems like a relevant tool.\n\nNow, I need...(snip)...", "thought": true }, { // text part "text": "I can certainly look up the weather for you in those cities. That way you can decide which city has the best weather for you to walk around in.", "thoughtSignature": "CrMNAePx/16T5mM0WQgdAtYLJ...(snip)..." }, { // functionCall part "functionCall": {"name": "get_current_weather", "args": { "location": "Prague" } } } ] } } ] // ... }Gemini is stateless, so in a multi-turn conversation, the caller has to include all the parts of the conversation so far. ThoughtSignatures should also be included to provide the model's raw thoughts back to itself, which are otherwise lost. In 1P libraries, this is straightforward - append the response's
candidates[].contentto the next request.M.E. AI uses a single type,
TextReasoningContent, for both. It has a Text property and ProtectedData property. The above snippet is converted to the following AIContent subclasses:Users of the library will also simply copy these contents into the next request, so we have to rebuild the new request from these objects.
We ignore any thought summary Text (ie, the first TextReasoningContent above).If there are any TextReasoningContent with non-null ProtectedData, we set this as the ThoughtSignature the first Part in the list. There will only ever be 0 or 1 ProtectedData in a given turn.Streaming and non-streaming
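To make that rebuild rule concrete, a hedged sketch (using the same simplified Part record as in the earlier sketch; not the PR's actual code):

```csharp
using System.Collections.Generic;
using Microsoft.Extensions.AI;

// Simplified stand-in for the Vertex Part shape (hypothetical).
record Part(string? Text, bool Thought = false, string? ThoughtSignature = null);

static class ThoughtSignatureRoundTrip
{
    // Thought-summary text is dropped; the (at most one) ProtectedData per turn
    // is re-attached to the first rebuilt Part as its ThoughtSignature.
    public static List<Part> RebuildParts(IEnumerable<AIContent> contents)
    {
        List<Part> parts = [];
        string? signature = null;

        foreach (var content in contents)
        {
            switch (content)
            {
                case TextReasoningContent { ProtectedData: not null } trc:
                    signature = trc.ProtectedData; // keep; attached below
                    break;
                case TextReasoningContent:
                    break; // informational thought summary: not sent back
                case TextContent tc:
                    parts.Add(new Part(tc.Text));
                    break;
                // FunctionCallContent etc. would map to their own Part shapes.
            }
        }

        if (signature is not null && parts.Count > 0)
        {
            parts[0] = parts[0] with { ThoughtSignature = signature };
        }

        return parts;
    }
}
```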
Streaming and non-streaming

Responses can be streamed to provide the caller with content while the model generates it, in particular for text content (both thoughts and model text output). These partial responses of ChatResponseUpdate are coalesced once the final response has been received. In our case above, the two TextReasoningContents in a streaming response will be merged into a single one with both Text and ProtectedData set. Note that ThoughtSignatures are not meant to be merged, and M.E.AI will also keep multiple TextReasoningContent instances during the merge if they have ProtectedData set. (A fix will go out shortly.)