-
Notifications
You must be signed in to change notification settings - Fork 29
[SRW] LLM Judge Dynamic Template Backend #264
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Conversation
831f422 to
0c0500e
Compare
src/main/java/org/opensearch/searchrelevance/judgments/LlmJudgmentsProcessor.java
Outdated
Show resolved
Hide resolved
src/main/java/org/opensearch/searchrelevance/judgments/LlmJudgmentsProcessor.java
Outdated
Show resolved
Hide resolved
src/main/java/org/opensearch/searchrelevance/utils/RatingOutputProcessor.java
Show resolved
Hide resolved
src/main/java/org/opensearch/searchrelevance/common/RatingOutputProcessor.java
Outdated
Show resolved
Hide resolved
src/main/java/org/opensearch/searchrelevance/rest/RestPutJudgmentAction.java
Outdated
Show resolved
Hide resolved
src/main/java/org/opensearch/searchrelevance/judgments/LlmJudgmentsProcessor.java
Show resolved
Hide resolved
|
Thanks for this PR. I believe that more flexibility when generating LLM-assisted judgments hugely improves the chances of this feature being useful. Personally, I find a scale from 0-3 (or generally a 4-point scale) more useful than a 5-point scale or even more granular scales for explicit judgments and it's what I have seen most of the times in practice. It's typically increases consistency and it forces you to make a choice (it's either more on the relevant side or more on the irrelevant side, not in between). Most of the times I see metrics (not judgments) in the range from 0 to 1 is when the similarity of a document to a reference answer is calculated or for other metrics in use cases that go beyond retrieval (for example, faithfulness or response relevance). I would regard these as too granular for an LLM to be applied consistently. That being said, I would recommend to support three scales:
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
overall,
- would like to know the backward compatibility since this PR includes index and rest api interface changes.
- can you update mapping https://github.com/opensearch-project/search-relevance/tree/main/src/main/resources/mappings if index schema changed
- can you check the LLM api level output schema constraints ? i've put more details in the comments
src/main/java/org/opensearch/searchrelevance/utils/RatingOutputProcessor.java
Show resolved
Hide resolved
src/main/java/org/opensearch/searchrelevance/common/RatingOutputProcessor.java
Outdated
Show resolved
Hide resolved
src/test/java/org/opensearch/searchrelevance/common/RatingOutputProcessorTests.java
Outdated
Show resolved
Hide resolved
src/main/java/org/opensearch/searchrelevance/rest/RestPutJudgmentAction.java
Outdated
Show resolved
Hide resolved
src/main/java/org/opensearch/searchrelevance/rest/RestPutQuerySetAction.java
Outdated
Show resolved
Hide resolved
src/main/java/org/opensearch/searchrelevance/rest/RestPutQuerySetAction.java
Outdated
Show resolved
Hide resolved
...n/java/org/opensearch/searchrelevance/transport/experiment/PutExperimentTransportAction.java
Outdated
Show resolved
Hide resolved
As @wrigleyDan mentioned, I don't think need both of 0-1 scale and 1-5 scale when they both are 5 points scale. |
Shouldn't we just increase the version of the judgement for every update and evict cache when version does not match instead of asking user to decide if they want to evict cache or not? |
src/main/java/org/opensearch/searchrelevance/common/MLConstants.java
Outdated
Show resolved
Hide resolved
src/main/java/org/opensearch/searchrelevance/common/MLConstants.java
Outdated
Show resolved
Hide resolved
src/main/java/org/opensearch/searchrelevance/common/RatingOutputProcessor.java
Outdated
Show resolved
Hide resolved
src/main/java/org/opensearch/searchrelevance/common/RatingOutputProcessor.java
Outdated
Show resolved
Hide resolved
src/main/java/org/opensearch/searchrelevance/common/RatingOutputProcessor.java
Outdated
Show resolved
Hide resolved
src/main/java/org/opensearch/searchrelevance/common/RatingOutputProcessor.java
Outdated
Show resolved
Hide resolved
src/main/java/org/opensearch/searchrelevance/judgments/LlmJudgmentsProcessor.java
Outdated
Show resolved
Hide resolved
src/main/java/org/opensearch/searchrelevance/common/MLConstants.java
Outdated
Show resolved
Hide resolved
src/main/java/org/opensearch/searchrelevance/judgments/LlmJudgmentsProcessor.java
Outdated
Show resolved
Hide resolved
52e6a2a to
eb70bb9
Compare
e399c4c to
c71ce48
Compare
src/main/java/org/opensearch/searchrelevance/model/Judgment.java
Outdated
Show resolved
Hide resolved
src/main/java/org/opensearch/searchrelevance/model/QueryWithReference.java
Outdated
Show resolved
Hide resolved
src/main/java/org/opensearch/searchrelevance/rest/RestPutJudgmentAction.java
Outdated
Show resolved
Hide resolved
src/main/java/org/opensearch/searchrelevance/rest/RestPutQuerySetAction.java
Outdated
Show resolved
Hide resolved
src/main/java/org/opensearch/searchrelevance/transport/judgment/PutLlmJudgmentRequest.java
Show resolved
Hide resolved
src/main/java/org/opensearch/searchrelevance/transport/queryset/PutQuerySetTransportAction.java
Show resolved
Hide resolved
src/main/java/org/opensearch/searchrelevance/utils/TextValidationUtil.java
Show resolved
Hide resolved
src/main/java/org/opensearch/searchrelevance/executors/ExperimentTaskContext.java
Outdated
Show resolved
Hide resolved
src/main/java/org/opensearch/searchrelevance/utils/RatingOutputProcessor.java
Show resolved
Hide resolved
src/main/java/org/opensearch/searchrelevance/ml/MLInputOutputTransformer.java
Show resolved
Hide resolved
src/main/java/org/opensearch/searchrelevance/judgments/LlmJudgmentsProcessor.java
Outdated
Show resolved
Hide resolved
epugh
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Worked through some of the files.. The qa tests did run for me. I think right now I worry about the amount of plumbing in the qa/bwc stuff. (Should the dir be named just bwc instead of qa?). Also, and maybe too late for this, part are there any well maintained Java projects that handle LLM integration that we should be leveraging? LangChain4j etc? We are definitly integrating a very low level direct manner! Though maybe that is our style?
src/main/java/org/opensearch/searchrelevance/common/MLConstants.java
Outdated
Show resolved
Hide resolved
Signed-off-by: Chloe Gao <[email protected]>
Signed-off-by: Chloe Gao <[email protected]>
Signed-off-by: Chloe Gao <[email protected]>
Signed-off-by: Chloe Gao <[email protected]>
Signed-off-by: Chloe Gao <[email protected]>
Signed-off-by: Chloe Gao <[email protected]>
Signed-off-by: Chloe Gao <[email protected]>
Signed-off-by: Chloe Gao <[email protected]>
Signed-off-by: Chloe Gao <[email protected]>
Signed-off-by: Chloe Gao <[email protected]>
Signed-off-by: Chloe Gao <[email protected]>
Signed-off-by: Chloe Gao <[email protected]>
Signed-off-by: Chloe Gao <[email protected]>
Signed-off-by: Chloe Gao <[email protected]>
Signed-off-by: Chloe Gao <[email protected]>
Signed-off-by: Chloe Gao <[email protected]>
Signed-off-by: Chloe Gao <[email protected]>
1262da7 to
a908cb8
Compare
Signed-off-by: Chloe Gao <[email protected]>
Signed-off-by: Chloe Gao <[email protected]>
Signed-off-by: Chloe Gao <[email protected]>
fen-qin
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Overall looks great. please add index mapping
src/main/java/org/opensearch/searchrelevance/common/RatingOutputProcessor.java
Outdated
Show resolved
Hide resolved
src/main/java/org/opensearch/searchrelevance/model/QueryWithReference.java
Outdated
Show resolved
Hide resolved
Signed-off-by: Chloe Gao <[email protected]>
Signed-off-by: Chloe Gao <[email protected]>
martin-gaievski
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Overall code looks code with few comments, special thanks for adding BWC tests capabilities.
src/main/java/org/opensearch/searchrelevance/judgments/LlmJudgmentsProcessor.java
Outdated
Show resolved
Hide resolved
src/main/java/org/opensearch/searchrelevance/utils/RatingOutputProcessor.java
Show resolved
Hide resolved
src/main/java/org/opensearch/searchrelevance/judgments/LlmJudgmentsProcessor.java
Outdated
Show resolved
Hide resolved
src/main/java/org/opensearch/searchrelevance/utils/ParserUtils.java
Outdated
Show resolved
Hide resolved
src/main/java/org/opensearch/searchrelevance/utils/RatingOutputProcessor.java
Show resolved
Hide resolved
src/main/java/org/opensearch/searchrelevance/ml/MLInputOutputTransformer.java
Show resolved
Hide resolved
src/main/java/org/opensearch/searchrelevance/transport/queryset/PutQuerySetTransportAction.java
Outdated
Show resolved
Hide resolved
Signed-off-by: Chloe Gao <[email protected]>
src/main/java/org/opensearch/searchrelevance/rest/RestPutJudgmentAction.java
Outdated
Show resolved
Hide resolved
Signed-off-by: Chloe Gao <[email protected]>
Signed-off-by: Chloe Gao <[email protected]>
martin-gaievski
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good job, thank you for addressing my comments.
One ask from my side - please open a new issue for improving logic for doing truncation.
Yeah issue created. #314 |
fen-qin
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM. Would you like to link the following up issues to a centralized issue ?
|
This will be fantantic to have. |
|
Investigation for Index Mapping Update: We have two new fields added here, Difference between keyword and text will come into play when there is upper characters. keyword type preserves upper chars and text convert all chars into lower case. In Conclusion is that no issues for now. But will be an issue if we want to use modelId as a condition in judgement cache. |
|
The different field type from what we defined in judgment_cache.json is an issue imo. Otherwise, why not we define the type as text in that field instead of keyword? |
Description
LLM Judge Dynamic Template Backend
Overall Description of changes
Implements LLM Judge Dynamic Template Backend feature, enabling customizable prompt templates and multiple rating types for LLM-based search relevance judgments.
Key Changes
Customizable Prompt Templates
PROMPT_SEARCH_RELEVANCE_SCORE_END) in MLConstants.java:42-74
New Rating Type System
Enhanced Caching System
API Enhancements
Refactoring & Code Quality
End to End Testing Procedure
Step 1: Enable Workbench
Request:
Response:
{ "acknowledged": true, "persistent": { "plugins": { "search_relevance": { "workbench_enabled": "true" } } }, "transient": {} }Status: ✅ Success
Step 2: Create Test Products Index
Request:
Response:
{ "acknowledged": true, "shards_acknowledged": true, "index": "test_products" }Status: ✅ Success
Step 3: Load Sample Product Data
Request:
Response:
{ "took": 25, "errors": false, "items": [ {"index": {"_index": "test_products", "_id": "1", "result": "created", "status": 201}}, {"index": {"_index": "test_products", "_id": "2", "result": "created", "status": 201}}, {"index": {"_index": "test_products", "_id": "3", "result": "created", "status": 201}}, {"index": {"_index": "test_products", "_id": "4", "result": "created", "status": 201}}, {"index": {"_index": "test_products", "_id": "5", "result": "created", "status": 201}} ] }Status: ✅ Success - 5 documents indexed
Step 4: Create Query Set with Custom Fields
This query set includes custom fields (
category,targetAudience,referenceAnswer) that can be used in prompt template placeholders.Request:
Response:
{ "query_set_id": "2550758e-c346-4c9b-b6fd-52ff33a40ae0", "query_set_result": "CREATED" }Status: ✅ Success
Query Set ID:
2550758e-c346-4c9b-b6fd-52ff33a40ae0Step 5: Create Multi-Field Search Configuration
Request:
Response:
{ "search_configuration_id": "a1ce8022-1a9c-48ec-ab36-c9850680d9c2", "search_configuration_result": "CREATED" }Status: ✅ Success
Search Configuration ID:
a1ce8022-1a9c-48ec-ab36-c9850680d9c2Test 1: GPT-4 with SCORE0_1 Rating Type
Create Judgment with Custom Prompt Template
Request:
Response:
{ "judgment_id": "5d91e3d8-0aed-4ab0-a4f5-38637fc41134" }Verify Results (after 15 seconds)
Request:
Response Summary:
{ "status": "COMPLETED", "metadata": { "llmJudgmentRatingType": "SCORE0_1", "promptTemplate": "Given the query: {{queryText}}\nCategory: {{category}}\nTarget audience: {{targetAudience}}\nReference: {{referenceAnswer}}\n\nRate the relevance of this document on a scale of 0.0 to 1.0...", "overwriteCache": false }, "judgmentRatings": [ { "query": "laptop for developers#\ntargetAudience: professionals\nreferenceAnswer: A portable computer suitable for software development\ncategory: electronics", "ratings": [ {"rating": "0.9", "docId": "1"} ] }, { "query": "coffee machine#\ntargetAudience: home users\nreferenceAnswer: An appliance for brewing coffee at home\ncategory: kitchen", "ratings": [ {"rating": "1.0", "docId": "1"} ] } ] }Test 2: GPT-4 with RELEVANT_IRRELEVANT Rating Type
Create Judgment with Binary Rating
Request:
Response:
{ "judgment_id": "95077971-d0ca-4191-affa-b2c704ede066" }Verify Results
Request:
Response Summary:
{ "status": "COMPLETED", "metadata": { "llmJudgmentRatingType": "RELEVANT_IRRELEVANT", "promptTemplate": "Search query: {{queryText}}\nCategory: {{category}}\nFor: {{targetAudience}}\nExpected: {{referenceAnswer}}\n\nDetermine if this document is RELEVANT or IRRELEVANT to the query." }, "judgmentRatings": [ { "query": "laptop for developers#\ntargetAudience: professionals\nreferenceAnswer: A portable computer suitable for software development\ncategory: electronics", "ratings": [ {"rating": "1.0", "docId": "1"} ] }, { "query": "coffee machine#\ntargetAudience: home users\nreferenceAnswer: An appliance for brewing coffee at home\ncategory: kitchen", "ratings": [ {"rating": "1.0", "docId": "1"} ] } ] }Issues Resolved
List any issues this PR will resolve, e.g. Closes [...].
By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.
For more information on following Developer Certificate of Origin and signing off your commits, please check here.