[FEATURE] Support multiple LLM connectors with proper rate limit #285
base: main
Conversation
Signed-off-by: Fen Qin <[email protected]>
martin-gaievski
left a comment
This is a good addition to the system, thanks for the contribution.
A few high-level comments on top of what I posted in the code:
- Can we create a single class for LLM params, e.g. LLMConfiguration, where we keep things like rateLimit, maxRetries, retryBackoff, maxTokens, etc.? Currently those params are spread all over the code in different places. A separate class should help keep a logical structure and decouple logic from parameters (see the sketch after this list).
- We need to collect metrics on retries. I think a good way to do it is the stats API that is part of the SRW repo; check the PR where it was introduced, #63. We can collect the number of retry attempts and the delay.
- In the future we need to think about a separate pool for retries. With the current implementation the system uses CompletableFuture.delayedExecutor, which is basically the common ForkJoinPool. Under high load with many retries this could exhaust the thread pool. This may be too much for this PR, more like a follow-up change.
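A minimal sketch of the parameter holder suggested in the first bullet; the class and field names are illustrative assumptions, not the PR's actual code (Lombok is assumed to be available, as suggested further down in the review):

```java
import lombok.Builder;
import lombok.Getter;

// Hypothetical holder for the LLM parameters currently spread across the code.
@Getter
@Builder
public class LLMConfiguration {
    private final long rateLimit;      // max requests allowed by the LLM host
    private final int maxRetries;      // how many times a failed call is retried
    private final long retryBackoffMs; // base delay between retry attempts
    private final int maxTokens;       // token budget passed to the model
}
```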
ignoreFailure
);
List<String> unprocessedDocIds = docIds;
// List<String> unprocessedDocIds = deduplicateFromCache(
this is not needed in the final version
Yes, sure. This should be removed.
private boolean isRetryableError(Exception e) {
    String message = e.getMessage();
    if (message == null) return true;
Why is it retryable if there is no message? Also please fix the style and use curly braces.
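A hedged sketch of the requested style fix; whether a missing message should count as retryable is still the open question above, so the return value is only a placeholder:

```java
private boolean isRetryableError(Exception e) {
    String message = e.getMessage();
    if (message == null) {
        // Placeholder: the retry decision for exceptions without a message is still under discussion.
        return true;
    }
    // ... existing message-based checks would follow here
    return false;
}
```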
import org.apache.logging.log4j.LogManager;
import org.apache.logging.log4j.Logger;

public class BedrockRateLimiter {
Not everyone will use Bedrock as a platform. We need a RateLimiter interface, and BedrockRateLimiter will be one possible implementation of it.
Following the high-level suggestions and this thread: yes, we should add a more general class, MLConfiguration, that can hold all the common pieces:
- rateLimit, maxRetries, retryBackoff, maxTokens, etc.
- the prompt formatter
- the response processor
This should make it clearer how to add a new model with a customized input/output interface and settings.
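A minimal sketch of the interface split discussed in this thread; the method name and signature are assumptions rather than the final design:

```java
// Hypothetical interface; each connector supplies its own implementation.
public interface RateLimiter {
    /** Delays or throttles the caller so requests stay within the configured rate limit. */
    void applyRateLimit(String connectorType, long customRateLimit);
}

// BedrockRateLimiter then becomes just one implementation (in its own file).
class BedrockRateLimiter implements RateLimiter {
    @Override
    public void applyRateLimit(String connectorType, long customRateLimit) {
        // Bedrock-specific throttling logic stays here.
    }
}
```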
ActionListener<String> listener
) {
    // Apply rate limiting per chunk to handle multiple chunks per query
    BedrockRateLimiter.applyRateLimit(connectorType, customRateLimit);
We need to create the concrete implementation of the RateLimiter interface through some sort of factory or factory method; it probably makes sense to hold it as a class variable. Then call only interface methods.
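One way such a factory could look; the connector key and the NoOpRateLimiter fallback are hypothetical:

```java
// Hypothetical factory: picks a RateLimiter implementation per connector type.
public final class RateLimiterFactory {
    private RateLimiterFactory() {}

    public static RateLimiter forConnector(String connectorType) {
        switch (connectorType) {
            case "bedrock":
                return new BedrockRateLimiter();
            default:
                return new NoOpRateLimiter(); // hypothetical pass-through implementation
        }
    }
}
```

The processor could then keep the returned RateLimiter as a class field and call rateLimiter.applyRateLimit(...) instead of the static Bedrock-specific call.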
this.rateLimit = rateLimitObj != null ? rateLimitObj : 0L;
}

public String getQuerySetId() {
For the bunch of getters you can use the Lombok @Getter annotation.
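A short illustration of the Lombok suggestion; the holder class name is assumed, while the fields mirror the surrounding diff:

```java
import lombok.Getter;

// @Getter generates getQuerySetId(), getRateLimit(), etc., replacing the hand-written getters.
@Getter
public class JudgmentRequest { // hypothetical class name
    private String querySetId;
    private long rateLimit;
}
```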
}
}

lastRequestTimes.put(key, System.currentTimeMillis());
The map can grow indefinitely; I cannot see where we delete from it.
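One possible mitigation, sketched under the assumption that lastRequestTimes maps a connector key to the timestamp of its last request; the staleness threshold is arbitrary:

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.TimeUnit;

// Sketch only: bounds the per-connector timestamp map by evicting stale entries.
class RateLimitBookkeeping {
    private static final long STALE_AFTER_MS = TimeUnit.MINUTES.toMillis(10); // arbitrary threshold
    private final Map<String, Long> lastRequestTimes = new ConcurrentHashMap<>();

    void recordRequest(String key) {
        long now = System.currentTimeMillis();
        // Drop entries that have not been touched recently so the map cannot grow without bound.
        lastRequestTimes.entrySet().removeIf(entry -> now - entry.getValue() > STALE_AFTER_MS);
        lastRequestTimes.put(key, now);
    }
}
```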
long sleepTime = delayMs - timeSinceLastRequest;
try {
    log.debug("Rate limiting {}: sleeping for {} ms", modelType, sleepTime);
    Thread.sleep(sleepTime);
It is inefficient to block threads with sleep. There are multiple options, for example:
- Use a ScheduledExecutorService: each time you need to delay, do
scheduler.schedule(() -> { future.complete(null); }, waitTime, TimeUnit.MILLISECONDS)
- Or an even simpler approach, a CompletableFuture with a delay:
CompletableFuture.delayedExecutor(delayMs, TimeUnit.MILLISECONDS).execute(() -> {});
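A slightly fuller sketch of the non-blocking variant; the call site and the sendChunkToModel continuation are hypothetical:

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.TimeUnit;

class NonBlockingDelay {
    // Instead of Thread.sleep(sleepTime), return a future that completes after the delay.
    static CompletableFuture<Void> delay(long delayMs) {
        CompletableFuture<Void> future = new CompletableFuture<>();
        CompletableFuture.delayedExecutor(delayMs, TimeUnit.MILLISECONDS)
            .execute(() -> future.complete(null));
        return future;
    }
}

// Hypothetical call site: the request runs once the delay elapses and no thread is blocked.
// NonBlockingDelay.delay(sleepTime).thenRun(() -> sendChunkToModel(chunk, listener));
```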
Thanks @martin-gaievski for the review. This PR refactors the existing LLM processor to support models in a more general way, and the interface should be extendable to onboard any other models.
Thanks for cc'ing me. There do look to be a lot of code conflicts with PR #264. We should think about how to merge it, and about the effort needed to resolve the conflicts without breaking functionality.
From comparing the overlap between the two PRs, it is not trivial work to merge them and ensure nothing breaks functionally.
Description
Goal:
Support more LLMs in Search Relevance Workbench.
Changes:
Introduce two new fields to LLM Judgment:
- modelType: enum type to identify the LLM type
- rateLimit: rate limit for the LLM host to receive requests
Add a prompt formatter to handle differences across connectors.
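A minimal illustration of the two new judgment fields; the enum values and holder class are assumptions, not the PR's exact code:

```java
// Illustration only: enum values and the holder class name are assumed.
public class LlmJudgmentFields {
    public enum ModelType { BEDROCK, OPENAI } // identifies which LLM connector handles the call

    private ModelType modelType; // new field: which LLM type performs the judgment
    private long rateLimit;      // new field: request rate allowed by the LLM host
}
```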
Tests
Issues Resolved
By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.
For more information on following Developer Certificate of Origin and signing off your commits, please check here.