
Conversation

@chloewqg (Contributor) commented Oct 13, 2025

Description

LLM Judge Dynamic Template Backend

Overall Description of changes

Implements the LLM Judge Dynamic Template Backend feature, enabling customizable prompt templates and multiple rating types for LLM-based search relevance judgments.

Key Changes

Customizable Prompt Templates

  • Split monolithic prompt into modular components (PROMPT_SEARCH_RELEVANCE_SCORE_1_5_START, PROMPT_SEARCH_RELEVANCE_SCORE_0_1_START, PROMPT_SEARCH_RELEVANCE_SCORE_BINARY,
    PROMPT_SEARCH_RELEVANCE_SCORE_END) in MLConstants.java:42-74
  • Users can now provide custom prompt templates via API
  • Default templates support three rating types: 0-1 scale, 1-5 scale, and binary (RELEVANT/IRRELEVANT)
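
The constants above compose into full prompts. Below is a minimal sketch, in Java, of how a rating-specific START piece and the shared END piece could be combined into a default template, with a user-supplied template taking precedence; the constant texts shown are illustrative placeholders, not the actual strings in MLConstants.java.

// Illustrative sketch only; the real constants live in MLConstants.java and their
// exact wording differs. Shows how a modular START piece plus a shared END piece
// could be combined, with a user-supplied template taking precedence.
public final class PromptTemplateSketch {

    // Hypothetical placeholder texts standing in for the real constants.
    static final String PROMPT_SEARCH_RELEVANCE_SCORE_0_1_START =
        "Rate the relevance of the document to the query on a scale of 0.0 to 1.0.";
    static final String PROMPT_SEARCH_RELEVANCE_SCORE_END =
        "Respond with the rating only.";

    /** Returns the custom template if provided, otherwise a composed default. */
    static String resolveTemplate(String customTemplate) {
        if (customTemplate != null && !customTemplate.isBlank()) {
            return customTemplate;
        }
        return PROMPT_SEARCH_RELEVANCE_SCORE_0_1_START + "\n" + PROMPT_SEARCH_RELEVANCE_SCORE_END;
    }

    private PromptTemplateSketch() {}
}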

New Rating Type System

  • Added LLMJudgmentRatingType enum with three types: SCORE0_1, SCORE1_5, RELEVANT_IRRELEVANT
  • Created RatingOutputProcessor class for rating sanitization and validation with type-specific handling
  • Automatic rating clamping and normalization based on configured type
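
A hedged sketch of the rating-type handling described above follows; the actual LLMJudgmentRatingType and RatingOutputProcessor in this PR differ in detail, but the idea is type-specific sanitization with clamping.

// Sketch only, assuming a simplified sanitizer; the real RatingOutputProcessor differs.
enum LLMJudgmentRatingType { SCORE0_1, SCORE1_5, RELEVANT_IRRELEVANT }

final class RatingSanitizerSketch {

    /** Normalizes a raw LLM response into a rating string for the configured type. */
    static String sanitize(LLMJudgmentRatingType type, String rawResponse) {
        String trimmed = rawResponse.trim();
        switch (type) {
            case RELEVANT_IRRELEVANT:
                // Assumed mapping of the binary label onto 1.0 / 0.0
                // (Test 2 below shows "1.0" ratings for this type).
                return trimmed.equalsIgnoreCase("RELEVANT") ? "1.0" : "0.0";
            case SCORE0_1:
                return String.valueOf(clamp(Double.parseDouble(trimmed), 0.0, 1.0));
            case SCORE1_5:
            default:
                return String.valueOf(clamp(Double.parseDouble(trimmed), 1.0, 5.0));
        }
    }

    private static double clamp(double value, double min, double max) {
        return Math.max(min, Math.min(max, value));
    }

    private RatingSanitizerSketch() {}
}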

Enhanced Caching System

  • Added promptTemplateCode field to JudgmentCache model to differentiate cached results by template
  • Updated JudgmentCacheDao.getJudgmentCache() to include prompt template in cache lookup
  • Introduced overwriteCache parameter to force cache refresh when needed
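
A minimal sketch of the caching idea, assuming a simple in-memory map rather than the system index the real JudgmentCacheDao queries: including promptTemplateCode in the lookup key keeps results from different templates separate, and an overwriteCache flag forces regeneration.

import java.util.Map;
import java.util.Optional;
import java.util.concurrent.ConcurrentHashMap;

final class JudgmentCacheSketch {
    private final Map<String, String> cache = new ConcurrentHashMap<>();

    private static String key(String queryText, String docId, String promptTemplateCode) {
        return queryText + "|" + docId + "|" + promptTemplateCode;
    }

    Optional<String> getJudgment(String queryText, String docId, String promptTemplateCode, boolean overwriteCache) {
        if (overwriteCache) {
            return Optional.empty(); // skip the cache so a fresh judgment is generated
        }
        return Optional.ofNullable(cache.get(key(queryText, docId, promptTemplateCode)));
    }

    void putJudgment(String queryText, String docId, String promptTemplateCode, String rating) {
        cache.put(key(queryText, docId, promptTemplateCode), rating);
    }
}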

API Enhancements

  • Updated PutLlmJudgmentRequest to accept promptTemplate, ratingType, and overwriteCache parameters
  • Extended REST APIs (RestPutJudgmentAction, RestPutQuerySetAction) to support new fields
  • Backward compatible - existing APIs work without new parameters
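
A sketch of how backward compatibility can be preserved on the request side: the new fields are treated as optional and fall back to defaults when absent. The field handling and the default rating type shown here are assumptions for illustration, not necessarily what PutLlmJudgmentRequest does.

// Hypothetical sketch; the actual request class parses these fields from XContent.
final class LlmJudgmentParamsSketch {
    final String promptTemplate;   // null -> fall back to the built-in default template
    final String ratingType;       // e.g. "SCORE0_1", "SCORE1_5", "RELEVANT_IRRELEVANT"
    final boolean overwriteCache;

    LlmJudgmentParamsSketch(String promptTemplate, String ratingType, Boolean overwriteCache) {
        this.promptTemplate = promptTemplate;
        this.ratingType = ratingType == null ? "SCORE1_5" : ratingType;   // assumed default
        this.overwriteCache = overwriteCache != null && overwriteCache;   // default false
    }
}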

Refactoring & Code Quality

  • Renamed queryTextWithReference → queryTextWithCustomInput throughout codebase for clarity
  • Deprecated old sanitizeLLMResponse() methods in favor of RatingOutputProcessor
  • Added utility method generatePromptTemplateCode()
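
Based on the discussion later in this thread that the prompt template is encoded with SHA-256, generatePromptTemplateCode() could look roughly like the sketch below; the real utility may encode the digest differently.

import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.HexFormat;

final class PromptTemplateCodeSketch {
    static String generatePromptTemplateCode(String promptTemplate) {
        try {
            MessageDigest digest = MessageDigest.getInstance("SHA-256");
            byte[] hash = digest.digest(promptTemplate.getBytes(StandardCharsets.UTF_8));
            return HexFormat.of().formatHex(hash);   // lowercase hex, so no uppercase characters
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException("SHA-256 not available", e);
        }
    }

    private PromptTemplateCodeSketch() {}
}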

End-to-End Testing Procedure

Step 1: Enable Workbench

Request:

curl -X PUT "http://localhost:9200/_cluster/settings" -H 'Content-Type: application/json' -d'{
  "persistent": {
    "plugins.search_relevance.workbench_enabled": true
  }
}'

Response:

{
  "acknowledged": true,
  "persistent": {
    "plugins": {
      "search_relevance": {
        "workbench_enabled": "true"
      }
    }
  },
  "transient": {}
}

Status: ✅ Success


Step 2: Create Test Products Index

Request:

curl -X PUT "http://localhost:9200/test_products" -H 'Content-Type: application/json' -d'{
  "mappings": {
    "properties": {
      "name": {"type": "text"},
      "description": {"type": "text"},
      "category": {"type": "keyword"},
      "price": {"type": "float"}
    }
  }
}'

Response:

{
  "acknowledged": true,
  "shards_acknowledged": true,
  "index": "test_products"
}

Status: ✅ Success


Step 3: Load Sample Product Data

Request:

curl -X POST "http://localhost:9200/test_products/_bulk" -H 'Content-Type: application/json' -d'
{"index":{"_id":"1"}}
{"name":"Dell Laptop","description":"High performance laptop for professionals","category":"electronics","price":1200.00}
{"index":{"_id":"2"}}
{"name":"Office Chair","description":"Ergonomic office chair with lumbar support","category":"furniture","price":299.99}
{"index":{"_id":"3"}}
{"name":"Espresso Machine","description":"Premium coffee maker for home baristas","category":"kitchen","price":499.99}
{"index":{"_id":"4"}}
{"name":"Running Shoes","description":"Comfortable athletic shoes for runners","category":"sports","price":129.99}
{"index":{"_id":"5"}}
{"name":"MacBook Pro","description":"Apple laptop with M3 chip for developers","category":"electronics","price":2499.00}
'

Response:

{
  "took": 25,
  "errors": false,
  "items": [
    {"index": {"_index": "test_products", "_id": "1", "result": "created", "status": 201}},
    {"index": {"_index": "test_products", "_id": "2", "result": "created", "status": 201}},
    {"index": {"_index": "test_products", "_id": "3", "result": "created", "status": 201}},
    {"index": {"_index": "test_products", "_id": "4", "result": "created", "status": 201}},
    {"index": {"_index": "test_products", "_id": "5", "result": "created", "status": 201}}
  ]
}

Status: ✅ Success - 5 documents indexed


Step 4: Create Query Set with Custom Fields

This query set includes custom fields (category, targetAudience, referenceAnswer) that can be used in prompt template placeholders.

Request:

curl -X PUT "http://localhost:9200/_plugins/_search_relevance/query_sets" -H 'Content-Type: application/json' -d'{
  "name": "E2E Test Query Set",
  "description": "Query set for testing LLM judgment with custom fields",
  "querySetQueries": [
    {
      "queryText": "laptop for developers",
      "category": "electronics",
      "targetAudience": "professionals",
      "referenceAnswer": "A portable computer suitable for software development"
    },
    {
      "queryText": "coffee machine",
      "category": "kitchen",
      "targetAudience": "home users",
      "referenceAnswer": "An appliance for brewing coffee at home"
    }
  ]
}'

Response:

{
  "query_set_id": "2550758e-c346-4c9b-b6fd-52ff33a40ae0",
  "query_set_result": "CREATED"
}

Status: ✅ Success
Query Set ID: 2550758e-c346-4c9b-b6fd-52ff33a40ae0


Step 5: Create Multi-Field Search Configuration

Request:

curl -X PUT "http://localhost:9200/_plugins/_search_relevance/search_configurations" -H 'Content-Type: application/json' -d'{
  "name": "Products Multi-Field Search",
  "description": "Search both name and description fields",
  "index": "test_products",
  "query": "{\"query\": {\"multi_match\": {\"query\": \"%SearchText%\", \"fields\": [\"name\", \"description\"]}}}"
}'

Response:

{
  "search_configuration_id": "a1ce8022-1a9c-48ec-ab36-c9850680d9c2",
  "search_configuration_result": "CREATED"
}

Status: ✅ Success
Search Configuration ID: a1ce8022-1a9c-48ec-ab36-c9850680d9c2


Test 1: GPT-4 with SCORE0_1 Rating Type

Create Judgment with Custom Prompt Template

Request:

curl -X PUT "http://localhost:9200/_plugins/_search_relevance/judgments" -H 'Content-Type: application/json' -d'{
  "name": "Test 1: GPT-4 SCORE0_1 Custom Template",
  "type": "LLM_JUDGMENT",
  "querySetId": "2550758e-c346-4c9b-b6fd-52ff33a40ae0",
  "searchConfigurationList": ["a1ce8022-1a9c-48ec-ab36-c9850680d9c2"],
  "modelId": "ycmnTZoBJMvqPc66Lqqh",
  "size": 5,
  "tokenLimit": 4000,
  "contextFields": ["name", "description"],
  "ignoreFailure": false,
  "llmJudgmentRatingType": "SCORE0_1",
  "promptTemplate": "Given the query: {{queryText}}\nCategory: {{category}}\nTarget audience: {{targetAudience}}\nReference: {{referenceAnswer}}\n\nRate the relevance of this document on a scale of 0.0 to 1.0, where 0.0 is completely irrelevant and 1.0 is perfectly relevant.",
  "overwriteCache": false
}'

Response:

{
  "judgment_id": "5d91e3d8-0aed-4ab0-a4f5-38637fc41134"
}

Verify Results (after 15 seconds)

Request:

curl -s "http://localhost:9200/_plugins/_search_relevance/judgments/5d91e3d8-0aed-4ab0-a4f5-38637fc41134" | python3 -m json.tool

Response Summary:

{
  "status": "COMPLETED",
  "metadata": {
    "llmJudgmentRatingType": "SCORE0_1",
    "promptTemplate": "Given the query: {{queryText}}\nCategory: {{category}}\nTarget audience: {{targetAudience}}\nReference: {{referenceAnswer}}\n\nRate the relevance of this document on a scale of 0.0 to 1.0...",
    "overwriteCache": false
  },
  "judgmentRatings": [
    {
      "query": "laptop for developers#\ntargetAudience: professionals\nreferenceAnswer: A portable computer suitable for software development\ncategory: electronics",
      "ratings": [
        {"rating": "0.9", "docId": "1"}
      ]
    },
    {
      "query": "coffee machine#\ntargetAudience: home users\nreferenceAnswer: An appliance for brewing coffee at home\ncategory: kitchen",
      "ratings": [
        {"rating": "1.0", "docId": "1"}
      ]
    }
  ]
}

Test 2: GPT-4 with RELEVANT_IRRELEVANT Rating Type

Create Judgment with Binary Rating

Request:

curl -X PUT "http://localhost:9200/_plugins/_search_relevance/judgments" -H 'Content-Type: application/json' -d'{
  "name": "Test 2: GPT-4 RELEVANT_IRRELEVANT",
  "type": "LLM_JUDGMENT",
  "querySetId": "2550758e-c346-4c9b-b6fd-52ff33a40ae0",
  "searchConfigurationList": ["a1ce8022-1a9c-48ec-ab36-c9850680d9c2"],
  "modelId": "ycmnTZoBJMvqPc66Lqqh",
  "size": 5,
  "tokenLimit": 4000,
  "contextFields": ["name", "description", "category"],
  "ignoreFailure": false,
  "llmJudgmentRatingType": "RELEVANT_IRRELEVANT",
  "promptTemplate": "Search query: {{queryText}}\nCategory: {{category}}\nFor: {{targetAudience}}\nExpected: {{referenceAnswer}}\n\nDetermine if this document is RELEVANT or IRRELEVANT to the query.",
  "overwriteCache": false
}'

Response:

{
  "judgment_id": "95077971-d0ca-4191-affa-b2c704ede066"
}

Verify Results

Request:

curl -s "http://localhost:9200/_plugins/_search_relevance/judgments/95077971-d0ca-4191-affa-b2c704ede066" | python3 -m json.tool

Response Summary:

{
  "status": "COMPLETED",
  "metadata": {
    "llmJudgmentRatingType": "RELEVANT_IRRELEVANT",
    "promptTemplate": "Search query: {{queryText}}\nCategory: {{category}}\nFor: {{targetAudience}}\nExpected: {{referenceAnswer}}\n\nDetermine if this document is RELEVANT or IRRELEVANT to the query."
  },
  "judgmentRatings": [
    {
      "query": "laptop for developers#\ntargetAudience: professionals\nreferenceAnswer: A portable computer suitable for software development\ncategory: electronics",
      "ratings": [
        {"rating": "1.0", "docId": "1"}
      ]
    },
    {
      "query": "coffee machine#\ntargetAudience: home users\nreferenceAnswer: An appliance for brewing coffee at home\ncategory: kitchen",
      "ratings": [
        {"rating": "1.0", "docId": "1"}
      ]
    }
  ]
}

Issues Resolved

List any issues this PR will resolve, e.g. Closes [...].

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.
For more information on following Developer Certificate of Origin and signing off your commits, please check here.

@chloewqg force-pushed the llm_judge_template branch 3 times, most recently from 831f422 to 0c0500e on October 14, 2025 at 03:22
@wrigleyDan (Collaborator) commented

Thanks for this PR. I believe that more flexibility when generating LLM-assisted judgments hugely improves the chances of this feature being useful.

Personally, I find a 0-3 scale (or generally a 4-point scale) more useful than a 5-point scale or even more granular scales for explicit judgments, and it's what I have seen most often in practice. It typically increases consistency, and it forces you to make a choice (a result is either more on the relevant side or more on the irrelevant side, not in between).
So while I appreciate being able to add custom prompts, I am wondering whether the three rating types cover what is used in the industry.

Most of the time, I see metrics (not judgments) in the 0-to-1 range when the similarity of a document to a reference answer is calculated, or for other metrics in use cases that go beyond retrieval (for example, faithfulness or response relevance). I would regard these as too granular for an LLM to apply consistently.

That being said, I would recommend supporting three scales:

  • Binary judgments: relevant/irrelevant as suggested is fine. Users should be able to use 0 and 1 instead of the words relevant/irrelevant.
  • 4-point scale: 0-3 as the default is what I would consider most widely used. However, there are also scales that use four classes, like Amazon's ESCI dataset.
  • 5-point scale: I think there are use cases where you'd want more than a 4-point scale, so offering that does make sense.
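
To make the proposal concrete, a hypothetical sketch of the three scales as a Java enum (the numeric bounds are illustrative assumptions, not part of this PR):

enum ProposedRatingScale {
    BINARY(0, 1),       // relevant/irrelevant, optionally expressed as 0 and 1
    FOUR_POINT(0, 3),   // 0-3 as the suggested default
    FIVE_POINT(1, 5);   // for use cases that need more granularity (bounds assumed)

    final int min;
    final int max;

    ProposedRatingScale(int min, int max) {
        this.min = min;
        this.max = max;
    }
}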

@fen-qin (Collaborator) left a comment:

overall,

@heemin32 (Collaborator) commented Oct 20, 2025

> Default templates support three rating types: 0-1 scale, 1-5 scale, and binary (RELEVANT/IRRELEVANT)

As @wrigleyDan mentioned, I don't think we need both the 0-1 scale and the 1-5 scale when they are both effectively 5-point scales.

@heemin32 (Collaborator) commented

> Introduced overwriteCache parameter to force cache refresh when needed

Shouldn't we just increment the judgment version on every update and evict the cache when the version doesn't match, instead of asking the user to decide whether they want to evict the cache?

@epugh (Collaborator) left a comment:

Worked through some of the files. The qa tests did run for me. I think right now I worry about the amount of plumbing in the qa/bwc stuff (should the dir be named just bwc instead of qa?). Also, and maybe it's too late for this, are there any well-maintained Java projects that handle LLM integration that we should be leveraging? LangChain4j, etc.? We are definitely integrating in a very low-level, direct manner! Though maybe that is our style?

@fen-qin (Collaborator) left a comment:

Overall looks great. Please add the index mapping.

@martin-gaievski (Member) left a comment:

Overall the code looks good, with a few comments; special thanks for adding BWC test capabilities.

@martin-gaievski (Member) left a comment:

Good job, thank you for addressing my comments.

One ask from my side: please open a new issue to improve the truncation logic.

@chloewqg (Contributor, Author) commented:

> Good job, thank you for addressing my comments.
>
> One ask from my side: please open a new issue to improve the truncation logic.

Yeah, issue created: #314

@fen-qin (Collaborator) left a comment:

LGTM. Would you like to link the follow-up issues to a centralized issue?

@epugh (Collaborator) commented Nov 15, 2025

This will be fantastic to have.

@chloewqg (Contributor, Author) commented:

Investigation for Index Mapping Update:

We have two new fields added here: modelId and encodedPromptTemplate. I investigated locally by reverting the judgment cache JSON to the old version and running a judgment call. Here's the resulting judgment cache index mapping:

curl -s "http://localhost:9200/.plugins-search-relevance-judgment-cache/_mapping?pretty" | python3 -m json.tool 2>/dev/null
{
    ".plugins-search-relevance-judgment-cache": {
        "mappings": {
            "properties": {
                "contextFieldsStr": {
                    "type": "keyword"
                },
                "documentId": {
                    "type": "keyword"
                },
                "encodedPromptTemplate": {
                    "type": "text",
                    "fields": {
                        "keyword": {
                            "type": "keyword",
                            "ignore_above": 256
                        }
                    }
                },
                "id": {
                    "type": "keyword"
                },
                "modelId": {
                    "type": "text",
                    "fields": {
                        "keyword": {
                            "type": "keyword",
                            "ignore_above": 256
                        }
                    }
                },
                "querySet": {
                    "type": "keyword"
                },
                "queryText": {
                    "type": "text",
                    "fields": {
                        "keyword": {
                            "type": "keyword",
                            "ignore_above": 256
                        }
                    }
                },
                "rating": {
                    "type": "keyword"
                },
                "timestamp": {
                    "type": "date",
                    "format": "strict_date_time"
                }
            }
        }
    }
}

The difference between keyword and text comes into play when there are uppercase characters: the keyword type preserves case, while text fields are analyzed and lowercased at index time.

For encodedPromptTemplate, since we use SHA-256 to encode, there are no uppercase characters, so there is no risk.
For modelId, there could be uppercase characters, which would create issues if we searched the judgment cache by modelId. However, right now we don't fetch cache entries by modelId (see JudgmentCacheDao in this PR).

Conclusion: no issues for now, but it will become an issue if we want to use modelId as a condition in the judgment cache lookup.

@heemin32 (Collaborator) commented

Having a field type that differs from what we defined in judgment_cache.json is an issue, IMO. Otherwise, why not define the type as text in that field instead of keyword?
