[SRW] LLM Judge Dynamic Template Backend #264

Merged
fen-qin merged 36 commits into opensearch-project:main from chloewqg:llm_judge_template
Dec 16, 2025

@chloewqg (Contributor) commented Oct 13, 2025

Description

LLM Judge Dynamic Template Backend

Overall Description of changes

Implements the LLM Judge Dynamic Template Backend feature, which enables customizable prompt templates and multiple rating types for LLM-based search relevance judgments.

Key Changes

Customizable Prompt Templates

  • Split monolithic prompt into modular components (PROMPT_SEARCH_RELEVANCE_SCORE_1_5_START, PROMPT_SEARCH_RELEVANCE_SCORE_0_1_START, PROMPT_SEARCH_RELEVANCE_SCORE_BINARY,
    PROMPT_SEARCH_RELEVANCE_SCORE_END) in MLConstants.java:42-74
  • Users can now provide custom prompt templates via API
  • Default templates support three rating types: 0-1 scale, 1-5 scale, and binary (RELEVANT/IRRELEVANT)

New Rating Type System

  • Added LLMJudgmentRatingType enum with three types: SCORE0_1, SCORE1_5, RELEVANT_IRRELEVANT
  • Created RatingOutputProcessor class for rating sanitization and validation with type-specific handling
  • Automatic rating clamping and normalization based on configured type
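
To illustrate the type-specific sanitization and clamping described above, here is a minimal Python sketch. It is not the plugin's RatingOutputProcessor (which is Java). The function name sanitize_rating, the regex-based number extraction, and the mapping of IRRELEVANT to 0.0 are assumptions made for illustration; only the rating type names and the RELEVANT to 1.0 mapping (seen in the Test 2 output) come from this PR.

```python
import re

# Numeric ranges per rating type; the type names mirror LLMJudgmentRatingType.
NUMERIC_RANGES = {"SCORE0_1": (0.0, 1.0), "SCORE1_5": (1.0, 5.0)}


def sanitize_rating(raw: str, rating_type: str) -> float:
    """Hypothetical helper: turn a raw LLM answer into a clamped numeric rating."""
    text = raw.strip().upper()
    if rating_type == "RELEVANT_IRRELEVANT":
        # Check IRRELEVANT first because "RELEVANT" is a substring of it.
        if "IRRELEVANT" in text:
            return 0.0  # assumption: IRRELEVANT maps to 0.0
        if "RELEVANT" in text:
            return 1.0
        raise ValueError(f"no binary label found in {raw!r}")
    if rating_type not in NUMERIC_RANGES:
        raise ValueError(f"unknown rating type {rating_type!r}")
    match = re.search(r"-?\d+(?:\.\d+)?", text)
    if match is None:
        raise ValueError(f"no number found in {raw!r}")
    low, high = NUMERIC_RANGES[rating_type]
    # Clamp out-of-range answers into the configured scale.
    return min(max(float(match.group()), low), high)
```

Checking IRRELEVANT before RELEVANT matters in this sketch because a plain substring test for RELEVANT would also match IRRELEVANT.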

Enhanced Caching System

  • Added promptTemplateCode field to JudgmentCache model to differentiate cached results by template
  • Updated JudgmentCacheDao.getJudgmentCache() to include prompt template in cache lookup
  • Introduced overwriteCache parameter to force cache refresh when needed
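
The caching behavior above can be pictured with a small in-memory Python sketch. It assumes the template code is a hash of the rating type and the prompt template; the actual scheme used by generatePromptTemplateCode() and the storage used by JudgmentCacheDao are not shown in this description, so the class and function below are hypothetical stand-ins.

```python
import hashlib


def generate_prompt_template_code(prompt_template: str, rating_type: str) -> str:
    """Assumed scheme: a short SHA-256 digest over rating type and template."""
    payload = f"{rating_type}\n{prompt_template}".encode("utf-8")
    return hashlib.sha256(payload).hexdigest()[:16]


class InMemoryJudgmentCache:
    """Hypothetical stand-in for the judgment cache, keyed by template code."""

    def __init__(self):
        self._entries = {}

    def get_or_compute(self, query, doc_id, template_code, compute,
                       overwrite_cache=False):
        key = (query, doc_id, template_code)
        # A different template code yields a different key, so ratings
        # produced with another prompt are never reused by accident.
        if overwrite_cache or key not in self._entries:
            self._entries[key] = compute()
        return self._entries[key]
```

With this shape, changing the prompt template naturally misses the cache, while overwrite_cache=True forces a fresh LLM call even when the key already exists.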

API Enhancements

  • Updated PutLlmJudgmentRequest to accept promptTemplate, ratingType, and overwriteCache parameters
  • Extended REST APIs (RestPutJudgmentAction, RestPutQuerySetAction) to support new fields
  • Backward compatible - existing APIs work without new parameters

Refactoring & Code Quality

  • Renamed queryTextWithReference → queryTextWithCustomInput throughout codebase for clarity
  • Deprecated old sanitizeLLMResponse() methods in favor of RatingOutputProcessor
  • Added utility method generatePromptTemplateCode()

End to End Testing Procedure

Step 1: Enable Workbench

Request:

curl -X PUT "http://localhost:9200/_cluster/settings" -H 'Content-Type: application/json' -d'{
  "persistent": {
    "plugins.search_relevance.workbench_enabled": true
  }
}'

Response:

{
  "acknowledged": true,
  "persistent": {
    "plugins": {
      "search_relevance": {
        "workbench_enabled": "true"
      }
    }
  },
  "transient": {}
}

Status: ✅ Success


Step 2: Create Test Products Index

Request:

curl -X PUT "http://localhost:9200/test_products" -H 'Content-Type: application/json' -d'{
  "mappings": {
    "properties": {
      "name": {"type": "text"},
      "description": {"type": "text"},
      "category": {"type": "keyword"},
      "price": {"type": "float"}
    }
  }
}'

Response:

{
  "acknowledged": true,
  "shards_acknowledged": true,
  "index": "test_products"
}

Status: ✅ Success


Step 3: Load Sample Product Data

Request:

curl -X POST "http://localhost:9200/test_products/_bulk" -H 'Content-Type: application/json' -d'
{"index":{"_id":"1"}}
{"name":"Dell Laptop","description":"High performance laptop for professionals","category":"electronics","price":1200.00}
{"index":{"_id":"2"}}
{"name":"Office Chair","description":"Ergonomic office chair with lumbar support","category":"furniture","price":299.99}
{"index":{"_id":"3"}}
{"name":"Espresso Machine","description":"Premium coffee maker for home baristas","category":"kitchen","price":499.99}
{"index":{"_id":"4"}}
{"name":"Running Shoes","description":"Comfortable athletic shoes for runners","category":"sports","price":129.99}
{"index":{"_id":"5"}}
{"name":"MacBook Pro","description":"Apple laptop with M3 chip for developers","category":"electronics","price":2499.00}
'

Response:

{
  "took": 25,
  "errors": false,
  "items": [
    {"index": {"_index": "test_products", "_id": "1", "result": "created", "status": 201}},
    {"index": {"_index": "test_products", "_id": "2", "result": "created", "status": 201}},
    {"index": {"_index": "test_products", "_id": "3", "result": "created", "status": 201}},
    {"index": {"_index": "test_products", "_id": "4", "result": "created", "status": 201}},
    {"index": {"_index": "test_products", "_id": "5", "result": "created", "status": 201}}
  ]
}

Status: ✅ Success - 5 documents indexed


Step 4: Create Query Set with Custom Fields

This query set includes custom fields (category, targetAudience, referenceAnswer) that can be used in prompt template placeholders.

Request:

curl -X PUT "http://localhost:9200/_plugins/_search_relevance/query_sets" -H 'Content-Type: application/json' -d'{
  "name": "E2E Test Query Set",
  "description": "Query set for testing LLM judgment with custom fields",
  "querySetQueries": [
    {
      "queryText": "laptop for developers",
      "category": "electronics",
      "targetAudience": "professionals",
      "referenceAnswer": "A portable computer suitable for software development"
    },
    {
      "queryText": "coffee machine",
      "category": "kitchen",
      "targetAudience": "home users",
      "referenceAnswer": "An appliance for brewing coffee at home"
    }
  ]
}'

Response:

{
  "query_set_id": "2550758e-c346-4c9b-b6fd-52ff33a40ae0",
  "query_set_result": "CREATED"
}

Status: ✅ Success
Query Set ID: 2550758e-c346-4c9b-b6fd-52ff33a40ae0
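
As a rough illustration of how the custom fields in this query set could fill {{placeholder}} slots in a prompt template, here is a minimal Python sketch. The helper name render_prompt and the choice to leave unknown placeholders untouched are assumptions; the plugin's actual substitution logic is in Java and may behave differently.

```python
import re

# Matches {{name}} placeholders such as {{queryText}} or {{category}}.
PLACEHOLDER = re.compile(r"\{\{\s*(\w+)\s*\}\}")


def render_prompt(template: str, fields: dict) -> str:
    """Hypothetical helper: replace each {{name}} with fields[name]."""

    def substitute(match):
        name = match.group(1)
        # Assumption: placeholders without a matching field are left as-is.
        return str(fields[name]) if name in fields else match.group(0)

    return PLACEHOLDER.sub(substitute, template)


query = {
    "queryText": "laptop for developers",
    "category": "electronics",
    "targetAudience": "professionals",
}
prompt = render_prompt(
    "Given the query: {{queryText}}\nCategory: {{category}}", query
)
```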


Step 5: Create Multi-Field Search Configuration

Request:

curl -X PUT "http://localhost:9200/_plugins/_search_relevance/search_configurations" -H 'Content-Type: application/json' -d'{
  "name": "Products Multi-Field Search",
  "description": "Search both name and description fields",
  "index": "test_products",
  "query": "{\"query\": {\"multi_match\": {\"query\": \"%SearchText%\", \"fields\": [\"name\", \"description\"]}}}"
}'

Response:

{
  "search_configuration_id": "a1ce8022-1a9c-48ec-ab36-c9850680d9c2",
  "search_configuration_result": "CREATED"
}

Status: ✅ Success
Search Configuration ID: a1ce8022-1a9c-48ec-ab36-c9850680d9c2


Test 1: GPT-4 with SCORE0_1 Rating Type

Create Judgment with Custom Prompt Template

Request:

curl -X PUT "http://localhost:9200/_plugins/_search_relevance/judgments" -H 'Content-Type: application/json' -d'{
  "name": "Test 1: GPT-4 SCORE0_1 Custom Template",
  "type": "LLM_JUDGMENT",
  "querySetId": "2550758e-c346-4c9b-b6fd-52ff33a40ae0",
  "searchConfigurationList": ["a1ce8022-1a9c-48ec-ab36-c9850680d9c2"],
  "modelId": "ycmnTZoBJMvqPc66Lqqh",
  "size": 5,
  "tokenLimit": 4000,
  "contextFields": ["name", "description"],
  "ignoreFailure": false,
  "llmJudgmentRatingType": "SCORE0_1",
  "promptTemplate": "Given the query: {{queryText}}\nCategory: {{category}}\nTarget audience: {{targetAudience}}\nReference: {{referenceAnswer}}\n\nRate the relevance of this document on a scale of 0.0 to 1.0, where 0.0 is completely irrelevant and 1.0 is perfectly relevant.",
  "overwriteCache": false
}'

Response:

{
  "judgment_id": "5d91e3d8-0aed-4ab0-a4f5-38637fc41134"
}
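
Because the judgment is generated asynchronously (the verification step waits before reading the result), a client could poll instead of sleeping for a fixed time. The Python sketch below assumes the status field is available at the top level, as in the response summaries in this description; the raw GET response may nest it differently. The helper names, the timeout, and the polling interval are illustrative choices, not part of the plugin.

```python
import json
import time
import urllib.request

BASE_URL = "http://localhost:9200/_plugins/_search_relevance/judgments"


def fetch_judgment(judgment_id: str) -> dict:
    """Fetch one judgment document from the local test cluster."""
    with urllib.request.urlopen(f"{BASE_URL}/{judgment_id}") as response:
        return json.load(response)


def wait_for_judgment(judgment_id, fetch=fetch_judgment, timeout_s=60.0,
                      interval_s=3.0, sleep=time.sleep):
    """Poll until the judgment reports COMPLETED or the timeout expires."""
    deadline = time.monotonic() + timeout_s
    while True:
        judgment = fetch(judgment_id)
        # Assumption: "status" is a top-level key, as in the summaries here.
        if judgment.get("status") == "COMPLETED":
            return judgment
        if time.monotonic() >= deadline:
            raise TimeoutError(f"judgment {judgment_id} did not complete")
        sleep(interval_s)
```

The fetch and sleep parameters are injectable so the loop can be exercised without a running cluster.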

Verify Results (after 15 seconds)

Request:

curl -s "http://localhost:9200/_plugins/_search_relevance/judgments/5d91e3d8-0aed-4ab0-a4f5-38637fc41134" | python3 -m json.tool

Response Summary:

{
  "status": "COMPLETED",
  "metadata": {
    "llmJudgmentRatingType": "SCORE0_1",
    "promptTemplate": "Given the query: {{queryText}}\nCategory: {{category}}\nTarget audience: {{targetAudience}}\nReference: {{referenceAnswer}}\n\nRate the relevance of this document on a scale of 0.0 to 1.0...",
    "overwriteCache": false
  },
  "judgmentRatings": [
    {
      "query": "laptop for developers#\ntargetAudience: professionals\nreferenceAnswer: A portable computer suitable for software development\ncategory: electronics",
      "ratings": [
        {"rating": "0.9", "docId": "1"}
      ]
    },
    {
      "query": "coffee machine#\ntargetAudience: home users\nreferenceAnswer: An appliance for brewing coffee at home\ncategory: kitchen",
      "ratings": [
        {"rating": "1.0", "docId": "1"}
      ]
    }
  ]
}
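
The "query" values in the response above combine the query text and the custom fields in one string: the text, a "#" separator, then one "key: value" pair per line. A minimal Python sketch for splitting such a string back apart follows. The helper name split_query is hypothetical, and the sketch assumes the query text itself contains no "#".

```python
def split_query(composite: str):
    """Hypothetical helper: split 'text#\\nkey: value...' into text and fields."""
    # Assumption: the first "#" separates the query text from the custom fields.
    query_text, _, rest = composite.partition("#")
    fields = {}
    for line in rest.splitlines():
        if ":" in line:
            key, _, value = line.partition(":")
            fields[key.strip()] = value.strip()
    return query_text, fields


text, fields = split_query(
    "coffee machine#\ntargetAudience: home users\n"
    "referenceAnswer: An appliance for brewing coffee at home\n"
    "category: kitchen"
)
```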

Test 2: GPT-4 with RELEVANT_IRRELEVANT Rating Type

Create Judgment with Binary Rating

Request:

curl -X PUT "http://localhost:9200/_plugins/_search_relevance/judgments" -H 'Content-Type: application/json' -d'{
  "name": "Test 2: GPT-4 RELEVANT_IRRELEVANT",
  "type": "LLM_JUDGMENT",
  "querySetId": "2550758e-c346-4c9b-b6fd-52ff33a40ae0",
  "searchConfigurationList": ["a1ce8022-1a9c-48ec-ab36-c9850680d9c2"],
  "modelId": "ycmnTZoBJMvqPc66Lqqh",
  "size": 5,
  "tokenLimit": 4000,
  "contextFields": ["name", "description", "category"],
  "ignoreFailure": false,
  "llmJudgmentRatingType": "RELEVANT_IRRELEVANT",
  "promptTemplate": "Search query: {{queryText}}\nCategory: {{category}}\nFor: {{targetAudience}}\nExpected: {{referenceAnswer}}\n\nDetermine if this document is RELEVANT or IRRELEVANT to the query.",
  "overwriteCache": false
}'

Response:

{
  "judgment_id": "95077971-d0ca-4191-affa-b2c704ede066"
}

Verify Results

Request:

curl -s "http://localhost:9200/_plugins/_search_relevance/judgments/95077971-d0ca-4191-affa-b2c704ede066" | python3 -m json.tool

Response Summary:

{
  "status": "COMPLETED",
  "metadata": {
    "llmJudgmentRatingType": "RELEVANT_IRRELEVANT",
    "promptTemplate": "Search query: {{queryText}}\nCategory: {{category}}\nFor: {{targetAudience}}\nExpected: {{referenceAnswer}}\n\nDetermine if this document is RELEVANT or IRRELEVANT to the query."
  },
  "judgmentRatings": [
    {
      "query": "laptop for developers#\ntargetAudience: professionals\nreferenceAnswer: A portable computer suitable for software development\ncategory: electronics",
      "ratings": [
        {"rating": "1.0", "docId": "1"}
      ]
    },
    {
      "query": "coffee machine#\ntargetAudience: home users\nreferenceAnswer: An appliance for brewing coffee at home\ncategory: kitchen",
      "ratings": [
        {"rating": "1.0", "docId": "1"}
      ]
    }
  ]
}

Issues Resolved

List any issues this PR will resolve, e.g. Closes [...].

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.
For more information on following Developer Certificate of Origin and signing off your commits, please check here.

@chloewqg force-pushed the llm_judge_template branch 3 times, most recently from 831f422 to 0c0500e, on October 14, 2025 03:22
Comment thread src/main/java/org/opensearch/searchrelevance/judgments/LlmJudgmentsProcessor.java Outdated
Comment thread src/main/java/org/opensearch/searchrelevance/judgments/LlmJudgmentsProcessor.java Outdated
Comment thread src/main/java/org/opensearch/searchrelevance/common/RatingOutputProcessor.java Outdated
Comment thread src/main/java/org/opensearch/searchrelevance/rest/RestPutJudgmentAction.java Outdated
@wrigleyDan (Collaborator) commented

Thanks for this PR. I believe that more flexibility when generating LLM-assisted judgments hugely improves the chances of this feature being useful.

Personally, I find a scale from 0-3 (or generally a 4-point scale) more useful than a 5-point scale or even more granular scales for explicit judgments, and it's what I have seen most of the time in practice. It typically increases consistency and forces you to make a choice (it's either more on the relevant side or more on the irrelevant side, not in between).
So while I appreciate being able to add custom prompts I am wondering if the three rating types support what is used in the industry.

Most of the time, when I see metrics (not judgments) in the range from 0 to 1, it is because the similarity of a document to a reference answer is calculated, or for other metrics in use cases that go beyond retrieval (for example, faithfulness or response relevance). I would regard these as too granular for an LLM to apply consistently.

That being said, I would recommend supporting three scales:

  • binary judgments: relevant/irrelevant as suggested is fine. Users should be able to use 0 and 1 instead of the words relevant/irrelevant.
  • 4-point scale: 0-3 as the default is what I would consider most widely used. However, there are also scales that use four classes, like Amazon's ESCI dataset.
  • 5-point scale: I think there are use cases where you'd want more than a 4-point scale, so offering that does make sense.

@fen-qin (Collaborator) left a comment

overall,

Comment thread src/main/java/org/opensearch/searchrelevance/common/RatingOutputProcessor.java Outdated
Comment thread src/main/java/org/opensearch/searchrelevance/rest/RestPutJudgmentAction.java Outdated
Comment thread src/main/java/org/opensearch/searchrelevance/rest/RestPutQuerySetAction.java Outdated
Comment thread src/main/java/org/opensearch/searchrelevance/rest/RestPutQuerySetAction.java Outdated
@heemin32 (Collaborator) commented Oct 20, 2025

> Default templates support three rating types: 0-1 scale, 1-5 scale, and binary (RELEVANT/IRRELEVANT)

As @wrigleyDan mentioned, I don't think we need both the 0-1 scale and the 1-5 scale when both are 5-point scales.

@heemin32 (Collaborator) commented

> Introduced overwriteCache parameter to force cache refresh when needed

Shouldn't we just increase the version of the judgment on every update and evict the cache when the version does not match, instead of asking the user to decide whether to evict the cache?

Comment thread src/main/java/org/opensearch/searchrelevance/common/MLConstants.java Outdated
Comment thread src/main/java/org/opensearch/searchrelevance/common/MLConstants.java Outdated
Comment thread src/main/java/org/opensearch/searchrelevance/common/RatingOutputProcessor.java Outdated
Comment thread src/main/java/org/opensearch/searchrelevance/common/RatingOutputProcessor.java Outdated
Comment thread src/main/java/org/opensearch/searchrelevance/common/RatingOutputProcessor.java Outdated
Comment thread src/main/java/org/opensearch/searchrelevance/common/RatingOutputProcessor.java Outdated
Comment thread src/main/java/org/opensearch/searchrelevance/judgments/LlmJudgmentsProcessor.java Outdated
Comment thread src/main/java/org/opensearch/searchrelevance/common/MLConstants.java Outdated
Comment thread src/main/java/org/opensearch/searchrelevance/judgments/LlmJudgmentsProcessor.java Outdated
Comment thread src/main/java/org/opensearch/searchrelevance/model/Judgment.java Outdated
Comment thread src/main/java/org/opensearch/searchrelevance/model/QueryWithReference.java Outdated
Comment thread src/main/java/org/opensearch/searchrelevance/rest/RestPutJudgmentAction.java Outdated
Comment thread src/main/java/org/opensearch/searchrelevance/rest/RestPutQuerySetAction.java Outdated
Comment thread src/main/java/org/opensearch/searchrelevance/executors/ExperimentTaskContext.java Outdated
Comment thread src/main/java/org/opensearch/searchrelevance/judgments/LlmJudgmentsProcessor.java Outdated
@epugh (Collaborator) left a comment

Worked through some of the files. The qa tests did run for me. Right now I worry about the amount of plumbing in the qa/bwc stuff. (Should the dir be named just bwc instead of qa?) Also, and maybe it's too late for this: are there any well-maintained Java projects that handle LLM integration that we should be leveraging, such as LangChain4j? We are definitely integrating in a very low-level, direct manner! Though maybe that is our style?

Comment thread formatter/formatting.gradle Outdated
Comment thread gradle.properties
Comment thread qa/README.md Outdated
Comment thread qa/build.gradle
Comment thread src/main/java/org/opensearch/searchrelevance/common/MLConstants.java Outdated
@fen-qin (Collaborator) left a comment

Overall looks great. Please add the index mapping.

Comment thread src/main/java/org/opensearch/searchrelevance/common/RatingOutputProcessor.java Outdated
Comment thread src/main/java/org/opensearch/searchrelevance/model/QueryWithReference.java Outdated
@martin-gaievski (Member) left a comment

Overall, the code looks good, with a few comments. Special thanks for adding BWC test capabilities.

Comment thread src/main/java/org/opensearch/searchrelevance/judgments/LlmJudgmentsProcessor.java Outdated
Comment thread .github/workflows/backwards_compatibility_tests_workflow.yml
Comment thread src/main/java/org/opensearch/searchrelevance/judgments/LlmJudgmentsProcessor.java Outdated
Comment thread src/main/java/org/opensearch/searchrelevance/utils/ParserUtils.java Outdated
Comment thread src/main/java/org/opensearch/searchrelevance/utils/ParserUtils.java
Comment thread src/main/java/org/opensearch/searchrelevance/ml/MLAccessor.java
Comment thread src/main/java/org/opensearch/searchrelevance/rest/RestPutJudgmentAction.java Outdated
Comment thread src/main/java/org/opensearch/searchrelevance/utils/ParserUtils.java
@martin-gaievski (Member) left a comment

Good job, thank you for addressing my comments.

One ask from my side: please open a new issue for improving the truncation logic.

@chloewqg (Contributor, Author) commented

> Good job, thank you for addressing my comments.
>
> One ask from my side: please open a new issue for improving the truncation logic.

Yes, issue created: #314

@fen-qin (Collaborator) previously approved these changes Nov 14, 2025

LGTM. Would you like to link the follow-up issues to a centralized issue?

Signed-off-by: Chloe Gao <chloewq@amazon.com>
@chloewqg dismissed stale reviews from fen-qin and martin-gaievski via 9467aa8 on December 15, 2025 00:35
Signed-off-by: Chloe Gao <chloewq@amazon.com>
@chloewqg closed this Dec 15, 2025
@chloewqg reopened this Dec 15, 2025
codecov (Bot) commented Dec 15, 2025

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 0.00%. Comparing base (b45d90d) to head (ad770f1).
⚠️ Report is 3 commits behind head on main.

Additional details and impacted files
@@     Coverage Diff     @@
##   main   #264   +/-   ##
===========================
===========================

@fen-qin (Collaborator) left a comment

LGTM, rebasing on index mapping PR

@martin-gaievski (Member) commented

> LGTM. Would you like to link the follow-up issues to a centralized issue?

We do have a meta issue for the LLM-as-a-judge work, #126; maybe we can use it as that centralized issue. @fen-qin, what do you think?

@fen-qin (Collaborator) commented Dec 16, 2025

> > LGTM. Would you like to link the follow-up issues to a centralized issue?
>
> We do have a meta issue for the LLM-as-a-judge work, #126; maybe we can use it as that centralized issue. @fen-qin, what do you think?

Yes, thanks, Martin. We should track and document the LLM-as-a-judge changes in the meta issue, and we also need to update the blog post on the new advanced settings for LLM judgment.

@fen-qin merged commit 11ab1a2 into opensearch-project:main on Dec 16, 2025
43 of 45 checks passed
6 participants