[SRW] LLM Judge Dynamic Template Backend #264
fen-qin merged 36 commits into opensearch-project:main from
Conversation
Thanks for this PR. I believe that more flexibility when generating LLM-assisted judgments hugely improves the chances of this feature being useful. Personally, I find a scale from 0-3 (or, generally, a 4-point scale) more useful than a 5-point scale or even more granular scales for explicit judgments, and it's what I have seen most of the time in practice. It typically increases consistency, and it forces you to make a choice (a result is either more on the relevant side or more on the irrelevant side, not in between). Most of the time I see metrics (not judgments) in the range from 0 to 1, it is when the similarity of a document to a reference answer is calculated, or for other metrics in use cases that go beyond retrieval (for example, faithfulness or response relevance). I would regard these as too granular for an LLM to apply consistently. That being said, I would recommend supporting three scales:
Overall:
- I would like to know about backward compatibility, since this PR includes index and REST API interface changes.
- Can you update the mappings at https://github.com/opensearch-project/search-relevance/tree/main/src/main/resources/mappings if the index schema changed?
- Can you check the LLM API-level output schema constraints? I've put more details in the comments.
As @wrigleyDan mentioned, I don't think we need both the 0-1 scale and the 1-5 scale when they are both 5-point scales.
Shouldn't we just increase the version of the judgment on every update and evict the cache when the version does not match, instead of asking the user to decide whether they want to evict the cache?
epugh
left a comment
Worked through some of the files. The qa tests did run for me. I think right now I worry about the amount of plumbing in the qa/bwc stuff. (Should the dir be named just bwc instead of qa?) Also, and maybe too late for this, but are there any well-maintained Java projects that handle LLM integration that we should be leveraging? LangChain4j, etc.? We are definitely integrating in a very low-level, direct manner! Though maybe that is our style?
fen-qin
left a comment
Overall looks great. Please add the index mapping.
martin-gaievski
left a comment
Overall the code looks good, with a few comments. Special thanks for adding BWC test capabilities.
martin-gaievski
left a comment
Good job, thank you for addressing my comments.
One ask from my side: please open a new issue for improving the truncation logic.
Yeah, issue created: #314
fen-qin
left a comment
LGTM. Would you like to link the follow-up issues to a centralized issue?
Signed-off-by: Chloe Gao <chloewq@amazon.com>
Codecov Report: ✅ All modified and coverable lines are covered by tests. View the full report in Codecov.
fen-qin
left a comment
LGTM, rebasing on index mapping PR
Yes, thanks, Martin. We should track and document the changes to LLM-based judgment in the META issue, and we also need to update the blog on the new advanced settings for LLM judgment.
Description
LLM Judge Dynamic Template Backend
Overall Description of changes
Implements LLM Judge Dynamic Template Backend feature, enabling customizable prompt templates and multiple rating types for LLM-based search relevance judgments.
Key Changes
Customizable Prompt Templates
(…PROMPT_SEARCH_RELEVANCE_SCORE_END) in MLConstants.java:42-74
New Rating Type System
Enhanced Caching System
API Enhancements
Refactoring & Code Quality
End to End Testing Procedure
Step 1: Enable Workbench
Request:
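The request body was lost in this walkthrough; based on the settings echoed in the response below, a minimal sketch of the call (a `PUT` to `/_cluster/settings`, per standard OpenSearch conventions) would send a body like:

```json
{
  "persistent": {
    "plugins.search_relevance.workbench_enabled": true
  }
}
```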
Response:
{ "acknowledged": true, "persistent": { "plugins": { "search_relevance": { "workbench_enabled": "true" } } }, "transient": {} }
Status: ✅ Success
Step 2: Create Test Products Index
Request:
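The index-creation request body is missing here; only the index name `test_products` is confirmed by the response. A plausible sketch (the field names below are illustrative assumptions, not taken from the PR) would be a `PUT` to `/test_products` with a body like:

```json
{
  "mappings": {
    "properties": {
      "title": { "type": "text" },
      "description": { "type": "text" },
      "category": { "type": "keyword" }
    }
  }
}
```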
Response:
{ "acknowledged": true, "shards_acknowledged": true, "index": "test_products" }
Status: ✅ Success
Step 3: Load Sample Product Data
Request:
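The bulk request body is missing; given the five created document IDs in the response, a sketch of a `POST` to `/test_products/_bulk` (NDJSON format; the document fields shown are illustrative assumptions) might look like:

```
{ "index": { "_index": "test_products", "_id": "1" } }
{ "title": "Developer laptop", "description": "A portable computer suitable for software development", "category": "electronics" }
{ "index": { "_index": "test_products", "_id": "2" } }
{ "title": "Home coffee machine", "description": "An appliance for brewing coffee at home", "category": "kitchen" }
...
```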
Response:
{ "took": 25, "errors": false, "items": [ {"index": {"_index": "test_products", "_id": "1", "result": "created", "status": 201}}, {"index": {"_index": "test_products", "_id": "2", "result": "created", "status": 201}}, {"index": {"_index": "test_products", "_id": "3", "result": "created", "status": 201}}, {"index": {"_index": "test_products", "_id": "4", "result": "created", "status": 201}}, {"index": {"_index": "test_products", "_id": "5", "result": "created", "status": 201}} ] }
Status: ✅ Success - 5 documents indexed
Step 4: Create Query Set with Custom Fields
This query set includes custom fields (category, targetAudience, referenceAnswer) that can be used in prompt template placeholders.
Request:
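The request body is missing; judging from the custom fields named above and the query strings that surface later in the judgment ratings, a sketch of the query-set creation call could be a `PUT` to `/_plugins/_search_relevance/query_sets` (the endpoint path and exact field names are assumptions) with a body like:

```json
{
  "name": "Test query set with custom fields",
  "sampling": "manual",
  "querySetQueries": [
    {
      "queryText": "laptop for developers",
      "category": "electronics",
      "targetAudience": "professionals",
      "referenceAnswer": "A portable computer suitable for software development"
    },
    {
      "queryText": "coffee machine",
      "category": "kitchen",
      "targetAudience": "home users",
      "referenceAnswer": "An appliance for brewing coffee at home"
    }
  ]
}
```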
Response:
{ "query_set_id": "2550758e-c346-4c9b-b6fd-52ff33a40ae0", "query_set_result": "CREATED" }
Status: ✅ Success
Query Set ID: 2550758e-c346-4c9b-b6fd-52ff33a40ae0
Step 5: Create Multi-Field Search Configuration
Request:
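The request body is missing; a sketch of a multi-field search configuration (the endpoint path, body shape, and `%SearchText%` placeholder are assumptions about the workbench API, not confirmed by this walkthrough) could be a `PUT` to `/_plugins/_search_relevance/search_configurations` with a body like:

```json
{
  "name": "multi_field_test",
  "index": "test_products",
  "query": "{\"query\":{\"multi_match\":{\"query\":\"%SearchText%\",\"fields\":[\"title\",\"description\"]}}}"
}
```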
Response:
{ "search_configuration_id": "a1ce8022-1a9c-48ec-ab36-c9850680d9c2", "search_configuration_result": "CREATED" }
Status: ✅ Success
Search Configuration ID: a1ce8022-1a9c-48ec-ab36-c9850680d9c2
Test 1: GPT-4 with SCORE0_1 Rating Type
Create Judgment with Custom Prompt Template
Request:
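The request body is missing, but the metadata echoed in the response below pins down the key fields (`promptTemplate`, `llmJudgmentRatingType`, `overwriteCache`). A sketch (the endpoint path, the judgment `type` value, and the model-ID placeholder are assumptions) could be a `PUT` to `/_plugins/_search_relevance/judgments` with a body like:

```json
{
  "name": "GPT-4 judgment with SCORE0_1",
  "type": "LLM_JUDGMENT",
  "querySetId": "2550758e-c346-4c9b-b6fd-52ff33a40ae0",
  "searchConfigurationList": ["a1ce8022-1a9c-48ec-ab36-c9850680d9c2"],
  "modelId": "<your-model-id>",
  "llmJudgmentRatingType": "SCORE0_1",
  "overwriteCache": false,
  "promptTemplate": "Given the query: {{queryText}}\nCategory: {{category}}\nTarget audience: {{targetAudience}}\nReference: {{referenceAnswer}}\n\nRate the relevance of this document on a scale of 0.0 to 1.0..."
}
```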
Response:
{ "judgment_id": "5d91e3d8-0aed-4ab0-a4f5-38637fc41134" }
Verify Results (after 15 seconds)
Request:
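The verification request is missing; presumably it is a `GET` of the judgment by the ID returned above (the endpoint path is an assumption):

```
GET /_plugins/_search_relevance/judgments/5d91e3d8-0aed-4ab0-a4f5-38637fc41134
```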
Response Summary:
{
  "status": "COMPLETED",
  "metadata": {
    "llmJudgmentRatingType": "SCORE0_1",
    "promptTemplate": "Given the query: {{queryText}}\nCategory: {{category}}\nTarget audience: {{targetAudience}}\nReference: {{referenceAnswer}}\n\nRate the relevance of this document on a scale of 0.0 to 1.0...",
    "overwriteCache": false
  },
  "judgmentRatings": [
    {
      "query": "laptop for developers#\ntargetAudience: professionals\nreferenceAnswer: A portable computer suitable for software development\ncategory: electronics",
      "ratings": [ { "rating": "0.9", "docId": "1" } ]
    },
    {
      "query": "coffee machine#\ntargetAudience: home users\nreferenceAnswer: An appliance for brewing coffee at home\ncategory: kitchen",
      "ratings": [ { "rating": "1.0", "docId": "1" } ]
    }
  ]
}
Test 2: GPT-4 with RELEVANT_IRRELEVANT Rating Type
Create Judgment with Binary Rating
Request:
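As in Test 1, the request body is missing but the response metadata below pins down the rating type and prompt template. A sketch (same assumptions as Test 1: endpoint path, `type` value, and model-ID placeholder) could be a `PUT` to `/_plugins/_search_relevance/judgments` with a body like:

```json
{
  "name": "GPT-4 judgment with RELEVANT_IRRELEVANT",
  "type": "LLM_JUDGMENT",
  "querySetId": "2550758e-c346-4c9b-b6fd-52ff33a40ae0",
  "searchConfigurationList": ["a1ce8022-1a9c-48ec-ab36-c9850680d9c2"],
  "modelId": "<your-model-id>",
  "llmJudgmentRatingType": "RELEVANT_IRRELEVANT",
  "promptTemplate": "Search query: {{queryText}}\nCategory: {{category}}\nFor: {{targetAudience}}\nExpected: {{referenceAnswer}}\n\nDetermine if this document is RELEVANT or IRRELEVANT to the query."
}
```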
Response:
{ "judgment_id": "95077971-d0ca-4191-affa-b2c704ede066" }
Verify Results
Request:
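As above, presumably a `GET` of the judgment by ID (endpoint path is an assumption):

```
GET /_plugins/_search_relevance/judgments/95077971-d0ca-4191-affa-b2c704ede066
```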
Response Summary:
{
  "status": "COMPLETED",
  "metadata": {
    "llmJudgmentRatingType": "RELEVANT_IRRELEVANT",
    "promptTemplate": "Search query: {{queryText}}\nCategory: {{category}}\nFor: {{targetAudience}}\nExpected: {{referenceAnswer}}\n\nDetermine if this document is RELEVANT or IRRELEVANT to the query."
  },
  "judgmentRatings": [
    {
      "query": "laptop for developers#\ntargetAudience: professionals\nreferenceAnswer: A portable computer suitable for software development\ncategory: electronics",
      "ratings": [ { "rating": "1.0", "docId": "1" } ]
    },
    {
      "query": "coffee machine#\ntargetAudience: home users\nreferenceAnswer: An appliance for brewing coffee at home\ncategory: kitchen",
      "ratings": [ { "rating": "1.0", "docId": "1" } ]
    }
  ]
}
Issues Resolved
List any issues this PR will resolve, e.g. Closes [...].
By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.
For more information on following Developer Certificate of Origin and signing off your commits, please check here.