[Inference API] Add Amazon Bedrock support to Inference API #110248
markjhoy merged 36 commits into elastic:main from markjhoy:markjhoy/add_amazon_bedrock_inference_api
Conversation
Pinging @elastic/ml-core (Team:ML)
Pinging @elastic/ent-search-eng (Team:Enterprise Search)
maxhniebergall left a comment:
Overall, great PR! Very well organized and easy to understand. I do have a concern about the embedding task settings.
@@ -0,0 +1 @@
is this intentionally left empty?
I think so -- these license files were brought over from the repository-s3 module pieces, which also use the AWS SDK (and are blank there too).
private final Integer dimensions;
private final Boolean dimensionsSetByUser;
private final Integer maxInputTokens;
private final SimilarityMeasure similarity;
Shouldn't these be task settings, since they only apply to the embedding task?
Intuitively, I would think so, but all our services (OpenAI, Cohere, etc.) have these in the service settings - I assume because these are rarely (if ever) changed when performing inference... @jonathan-buttner or @davidkyle - thoughts here? I'll keep them in the service settings for now, but just a note for the future if we ever want to change these to be task settings...
Cool, we will have to think about refactoring this in the future.
Shouldn't these be task settings, since they only apply to the embedding task?
Hmm, I'm not sure what you mean @maxhniebergall. I believe these settings are only used for embedding tasks.
Do you mean that they should be task settings because they should be modifiable after an inference endpoint is created?
I don't think we'd want a user to be able to change these after setting up an inference endpoint (I suppose max input tokens, maybe?). If a user generated text embeddings using a certain number of dimensions and an expected similarity, I would think it'd break the index if we stored different dimension sizes or potentially a different similarity during an inference request.
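To make the concern concrete, here is a hypothetical index mapping sketch (field name and values are illustrative only): a dense_vector field fixes dims and similarity at mapping time, so embeddings later produced with different values could not be stored consistently in the same field:

```json
{
  "mappings": {
    "properties": {
      "embedding": {
        "type": "dense_vector",
        "dims": 1024,
        "similarity": "cosine"
      }
    }
  }
}
```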
Is that what the difference between task settings and service settings is? Maybe we should rename them to MutableSettings and ImmutableSettings? I had thought that service settings were for every task in a service, and task settings were for when the settings were specific to a task.
Is that what the difference between task settings and service settings is?
Yep!
Maybe we should rename them to MutableSettings and ImmutableSettings?
That's true, it isn't intuitive that service settings aren't mutable and task settings are. I think the convention we went with was to reflect the field name from the body of the request, and we add the task type in the class name to indicate what it's for.
I guess we could have the reverse problem if we named them mutable and immutable, as you wouldn't know immediately what they refer to without looking at the class contents. I suppose we could include it in the name too.
Right, these are also serialized as part of the request, so we can't change the naming without a breaking change.
import java.util.List;
import java.util.Map;

public final class AmazonBedrockProviderCapabilities {
Awesome! Love how all of this info is in one spot
var actionCreator = new AmazonBedrockActionCreator(amazonBedrockSender, this.getServiceComponents(), timeout);
if (model instanceof AmazonBedrockModel baseAmazonBedrockModel) {
    var maxBatchSize = getEmbeddingsMaxBatchSize(baseAmazonBedrockModel.provider());
    var batchedRequests = new EmbeddingRequestChunker(input, maxBatchSize, EmbeddingRequestChunker.EmbeddingType.FLOAT)
Do all of the bedrock models use Float embeddings?
Yep - both Amazon Titan and Cohere use floats in their output vectors
import java.util.Map;

public final class AmazonBedrockProviderCapabilities {
    private static final int DEFAULT_MAX_CHUNK_SIZE = 2048;
Why is this value here instead of in AmazonBedrockConstants?
import static org.elasticsearch.xpack.inference.services.amazonbedrock.AmazonBedrockConstants.TOP_K_FIELD;
import static org.elasticsearch.xpack.inference.services.amazonbedrock.AmazonBedrockConstants.TOP_P_FIELD;

public record AmazonBedrockChatCompletionRequestTaskSettings(
Why do we need this class and AmazonBedrockChatCompletionTaskSettings?
The Request task settings are ones that can be overridden during an inference POST request to Elasticsearch... the AmazonBedrockChatCompletionTaskSettings are the default task settings a user can set up when they create the model (PUT)
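As an illustrative sketch of that distinction (the endpoint name my-bedrock-completion and the values are hypothetical), defaults set at PUT time can be overridden per request by including task_settings in the inference POST body:

POST _inference/completion/my-bedrock-completion

```json
{
  "input": "What is the capital of France?",
  "task_settings": {
    "temperature": 0.2,
    "max_new_tokens": 128
  }
}
```

Task settings omitted from the override fall back to the defaults stored when the endpoint was created.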
jonathan-buttner left a comment:
Great work Mark! Thanks for all the refactoring. I left a comment about the dimensions set by user. The others are just nits.
static final String DIMENSIONS_SET_BY_USER = "dimensions_set_by_user";

private final Integer dimensions;
private final Boolean dimensionsSetByUser;
I might have missed it, but do we, or do we plan to, send this to an external provider? I think I only saw that we use it in the updateModelWithEmbeddingDetails() method. If we don't send it to an external provider, then the user-set dimensions might mess up the semantic text field 🤔. What I mean is, if a user sets it to 5 but the actual stored embedding is 1000, I'm not sure what happens.
I think my vote would be to remove this if we don't plan on using it in the near future. If we do plan on using it soon, we might want to prevent a user from setting it for now 🤷
I think this is an oversight on my part... for Amazon Titan embeddings you can pass the dimensions, but Cohere does not have this... I'll fix this up with some validation on this.
Ah ok cool 👍 yeah don't forget to pass it along in the titan request, or at least I missed where we are doing that.
Sooooo - here's something silly... Amazon Bedrock Titan G1 models don't allow the dimensions parameter, but version 2 does:
Titan Embeddings G1 - Text doesn't support the use of inference parameters. The following sections detail the request and response formats and provides a code example.
I think for now, I'll omit the dimensions altogether because there's no definitive way we can easily tell which model the end user is using (we could check whether the model ID string matches the v2 version, but if they are using a custom ARN, it'd require a lot more hoops to jump through than I think is worth it at this point in time...)
If we do plan on using it soon, we might want to prevent a user from setting it for now
I'll go this route for now as it's probably the safest.
I think for now, I'll omit the dimensions all together because there's no definitive way we can tell which model the end user is using easily
👍 sounds good
ValidationException validationException = new ValidationException();

var temperature = extractOptionalDoubleInRange(
nit: I know I previously (in a different PR) asked that we add functionality to validate the range. I think I've changed my position on that 😅. To better support provider changes in these fields, it probably makes sense to not validate them on our end and let the upstream provider return an error if they're not in a specific range/positive/negative etc.
I know we have this in a few places, so you don't need to change it here, but maybe going forward we rely more on the upstream provider to do this for us and we try to ensure the returned error message is clear enough for the user to know what needs to be changed.
While this is still in the works, I'll see if it's easy enough to change this...
I think for now, I'll keep this as-is... providing an invalid value does have an error message available in the exception from the SDK, but it's quite buried:
(about 30 more lines above this here...)
...
(RequestExecutorService.java:192)\n\tat org.elasticsearch.inference@8.16.0-SNAPSHOT/org.elasticsearch.xpack.inference.external.http.sender.AmazonBedrockRequestExecutorService.start(AmazonBedrockRequestExecutorService.java:19)\n\tat org.elasticsearch.inference@8.16.0-SNAPSHOT/org.elasticsearch.xpack.inference.external.amazonbedrock.AmazonBedrockRequestSender.lambda$start$0(AmazonBedrockRequestSender.java:89)\n\tat org.elasticsearch.server@8.16.0-SNAPSHOT/org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingRunnable.run(ThreadContext.java:917)\n\tat java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1144)\n\tat java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:642)\n\tat java.base/java.lang.Thread.run(Thread.java:1570)\nCaused by: java.util.concurrent.ExecutionException: com.amazonaws.services.bedrockruntime.model.ValidationException: 1 validation error detected: Value '500.0' at 'inferenceConfig.temperature' failed to satisfy constraint: Member must have value less than or equal to 1 (Service: AmazonBedrockRuntime; Status Code: 400; Error Code: ValidationException; Request ID: fe7b8dbd-31f9-4821-a663-d00319d36e89; Proxy: null)
💚 All backports created successfully
Questions? Please refer to the Backport tool documentation
💚 Backport successful
💔 Backport failed. The backport operation could not be completed due to the following error. You can use sqren/backport to manually backport by running
…#110545)
* Initial commit; setup Gradle; start service
* initial commit
* minor cleanups, builds green; needs tests
* bug fixes; tested working embeddings & completion
* use custom json builder for embeddings request
* Ensure auto-close; fix forbidden API
* start of adding unit tests; abstraction layers
* adding additional tests; cleanups
* add requests unit tests
* all tests created
* fix cohere embeddings response
* fix cohere embeddings response
* fix lint
* better test coverage for secrets; inference client
* update thread-safe syncs; make dims/tokens + int
* add tests for dims and max tokens positive integer
* use requireNonNull;override settings type;cleanups
* use r/w lock for client cache
* remove client reference counting
* update locking in cache; client errors; noop doc
* remove extra block in internalGetOrCreateClient
* remove duplicate dependencies; cleanup
* add fxn to get default embeddings similarity
* use async calls to Amazon Bedrock; cleanups
* use Clock in cache; simplify locking; cleanups
* cleanups around executor; remove some instanceof
* cleanups; use EmbeddingRequestChunker
* move max chunk size to constants
* oof - swapped transport vers w/ master node req
* use XContent instead of Jackson JsonFactory
* remove gradle versions; do not allow dimensions
Adds Amazon Bedrock text embeddings and chat completion support for the inference API.

Prerequisites to Model Creation

Inference Model Creation:

Service Settings

Where provider is one of:
- amazontitan
- anthropic
- ai21labs
- cohere
- meta
- mistral

The model_id should match a model you have access to when using a base model, or the ARN of the model if you are using a custom model based on a base model. Also, be sure that the model you are using is supported in the region that you specify.

text_embedding tasks for Amazon Bedrock have the following additional, optional settings:
- dimensions: the output dimensions to use for the inference
- max_input_tokens: the maximum number of input tokens
- similarity: the similarity measure to use

completion tasks for Amazon Bedrock do not have any additional service settings.

Task Settings

text_embedding tasks for Amazon Bedrock do not have any task settings, and none should be sent.

chat_completion tasks have the following task settings available (all optional):
- max_new_tokens: the maximum number of new tokens to produce
- temperature: the temperature (0.0 to 1.0) to use
- top_p: the top P value to use (0.0 to 1.0)
- top_k: an alternative to top_p, from 0.0 to 1.0 (only available for Anthropic, Cohere, and Mistral)

Inference Model Inference
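As a sketch tying the settings above together (endpoint names, model IDs, and the credential field names are placeholders/assumptions here, not authoritative), creating a text embedding endpoint might look like:

PUT _inference/text_embedding/my-bedrock-embeddings

```json
{
  "service": "amazonbedrock",
  "service_settings": {
    "access_key": "<aws-access-key>",
    "secret_key": "<aws-secret-key>",
    "region": "us-east-1",
    "provider": "cohere",
    "model_id": "cohere.embed-english-v3",
    "max_input_tokens": 512,
    "similarity": "dot_product"
  }
}
```

and a chat completion endpoint, with optional default task settings, might look like:

PUT _inference/completion/my-bedrock-completion

```json
{
  "service": "amazonbedrock",
  "service_settings": {
    "access_key": "<aws-access-key>",
    "secret_key": "<aws-secret-key>",
    "region": "us-east-1",
    "provider": "amazontitan",
    "model_id": "amazon.titan-text-premier-v1:0"
  },
  "task_settings": {
    "temperature": 0.7,
    "max_new_tokens": 256
  }
}
```

Note that, per the description above, no task_settings block is sent for text_embedding endpoints.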
Manual Testing Completed:
Tested Embeddings Providers:
Tested Chat Completion Providers: