I was trying out ingesting data beyond the two sample PDFs and quickly ran into rate limit issues:
```
Unhandled exception. System.ClientModel.ClientResultException: HTTP 429 (: RateLimitReached)
Rate limit of 20 per 60s exceeded for UserByModelByMinute. Please wait 0 seconds before retrying.
   at OpenAI.ClientPipelineExtensions.ProcessMessageAsync(ClientPipeline pipeline, PipelineMessage message, RequestOptions options)
```
This rate limit is unfortunate. More info about it:
This leads to several problems with the existing ingestion mechanism:
- It runs out of requests very quickly. Currently it calls the generator once per source document, which means you couldn't ingest more than 150 docs/day. In any case, you'd hit per-minute rate limits much sooner than that (e.g., eShopSupport tries to ingest 200 docs and would hit per-minute limits every 15 docs).
- We don't handle rate limit errors. They just cause ingestion to fail.
- We don't manage the number of input tokens. We just send all the chunks from an entire source document, even if the total is over 64k tokens, which would also fail. (One possible mitigation of the last two points is sketched below.)
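For illustration only, here is a rough sketch of how the last two problems could be mitigated by wrapping the embedding generator in a `DelegatingEmbeddingGenerator` that splits the input into batches and backs off when the service returns HTTP 429. The `ThrottledEmbeddingGenerator` name, the batch size, and the fixed 60-second delay are hypothetical choices for the sketch, not anything that exists in the template:

```csharp
using System.ClientModel;
using Microsoft.Extensions.AI;

// Sketch only: batches embedding requests and retries a batch after an HTTP 429.
public sealed class ThrottledEmbeddingGenerator(
    IEmbeddingGenerator<string, Embedding<float>> inner, int batchSize = 10)
    : DelegatingEmbeddingGenerator<string, Embedding<float>>(inner)
{
    public override async Task<GeneratedEmbeddings<Embedding<float>>> GenerateAsync(
        IEnumerable<string> values,
        EmbeddingGenerationOptions? options = null,
        CancellationToken cancellationToken = default)
    {
        var results = new GeneratedEmbeddings<Embedding<float>>();

        foreach (var batch in values.Chunk(batchSize))
        {
            while (true)
            {
                try
                {
                    // Delegate the smaller batch to the real (inner) generator.
                    foreach (var embedding in await base.GenerateAsync(batch, options, cancellationToken))
                    {
                        results.Add(embedding);
                    }
                    break;
                }
                catch (ClientResultException ex) when (ex.Status == 429)
                {
                    // Rate limited: wait out the per-minute window, then retry the same batch.
                    await Task.Delay(TimeSpan.FromSeconds(60), cancellationToken);
                }
            }
        }

        return results;
    }
}
```

A real implementation would also want to budget by token count rather than item count, and honor the retry interval reported by the service instead of sleeping for a fixed 60 seconds.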
What we have is OK for "hello world" cases (2 small PDFs) but doesn't address the question of scaling beyond that. It's good that the template is labelled "preview" because this is quite a basic issue we need to resolve.
Possible strategies:
- We could do in-proc embedding generation (e.g., via the ONNX runtime) instead of calling an external embedding generator service. A minimal example of this approach is sketched after this list.
- We could reconsider the whole approach to ingestion, making it less framework-like and more clearly just one special case for a "getting started" template, so it's obvious that developers are responsible for working out their own ingestion approach based on their needs.
- ... or we could go the other way and try to build something like a minimal ingestion framework that handles all the edge cases.
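To make the first option more concrete, here is a minimal sketch of in-proc embedding generation using the SmartComponents.LocalEmbeddings package, which runs a small embedding model locally via the ONNX runtime. This is only an illustration of the approach and is not wired into the template; the sample strings are made up:

```csharp
using SmartComponents.LocalEmbeddings;

// In-proc embedding generation: no external service, so no per-request rate limits.
var embedder = new LocalEmbedder();

var question = embedder.Embed("How do I file a warranty claim?");
var chunk = embedder.Embed("Warranty claims can be filed through the support portal.");

// Similarity between the two locally generated embeddings.
Console.WriteLine(LocalEmbedder.Similarity(question, chunk));
```

The trade-off is that the ingestion pipeline would then need an adapter to expose this as an `IEmbeddingGenerator`, and developers would own model selection and updates themselves.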
Not sure this helps the team, but it may be useful to anyone trying this out who is hitting the limits and simply wants to keep using the preview without this blocking them.
I have switched the embedding client to a locally hosted model while keeping the chat client as a GitHub model. WARNING: changing to a local model means regenerating all of your embeddings, so you may need to delete the ingestion cache (ingestioncache.db).
The end result is something like this:
```csharp
// Embedding client
builder.AddOllamaApiClient("local-embedding").AddEmbeddingGenerator();

// Use your existing chat client for inference
var credential = new ApiKeyCredential(builder.Configuration["Chat:GitHubModels:Token"]
    ?? throw new InvalidOperationException("Missing configuration: GitHubModels:Token. See the README for details."));

var openAIOptions = new OpenAIClientOptions()
{
    Endpoint = new Uri("https://models.inference.ai.azure.com")
};

var innerClient = new OpenAIClient(credential, openAIOptions);
var client = innerClient.AsChatClient("gpt-4o-mini");

builder.Services.AddChatClient(client).UseFunctionInvocation().UseLogging();
```
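For anyone reproducing this, note that the "local-embedding" connection name above has to be supplied by something, e.g. the Aspire AppHost via the CommunityToolkit Ollama hosting integration. The sketch below is an assumption about how that wiring might look; the resource names, the `all-minilm` model, and the `Projects.MyWebApp` project reference are placeholders, not part of the template:

```csharp
// AppHost Program.cs (sketch): host Ollama and expose an embedding model under the
// same connection name the web project requests ("local-embedding").
var builder = DistributedApplication.CreateBuilder(args);

var ollama = builder.AddOllama("ollama");
var embeddings = ollama.AddModel("local-embedding", "all-minilm");

builder.AddProject<Projects.MyWebApp>("webapp")
       .WithReference(embeddings);

builder.Build().Run();
```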