Optimized text embedding processor for single document update #1191

will-hwang · 2025-02-19T06:30:54Z

Description

This PR implements optimized text embedding processor for single document update scenario.

Proposed State [Single Document Update]

Steps:

Process Map is generated based on IngestDocument and Field Map defined.
If ignore_existing feature is set to true, filter the process map.
1. Existing document is fetched via OpenSearch client’s Get Action to compare the existing inference text against IngestDocument
  1. if document does not exist, or any exception is thrown, fallback to calling model inference
2. Locate the embedding fields in the existing document
  1. Recursively traverse the process map to find embedding fields. Keeping track of the traversal path is required for look up in existing document.
  2. Once embedding fields are found, attempt to copy embeddings from existing document to ingest document.
3. If eligible, copy over the vector embeddings from existing document to ingest document
  1. It is eligible for copy if inference text in ingest document and existing document is the same, and embeddings for the inference text exist in the existing document.
  2. Note, if in case of values in list, the fields in the same index are compared to determine text equality
4. If eligible fields have been copied, remove the entry in process map
Inference List is generated based on entries in Filtered Process Map.
ML Commons InferenceSentence API is invoked with filtered inference list.
Embeddings for filtered inference list are generated in ML Commons.
Embeddings for filtered inference list are mapped to target fields via entries defined in process map.
Embeddings for filtered inference list are populated to target fields in IngestDocument

User Scenarios

Scenario 1: Document Update with no change to all fields

Update Document
ID: 1
"category": [
    {
        "name": {
            "en": "this is 1st name"
        }
    },
    {
        "name": {
            "en": "this is 2nd name"
        }
    }
]

Existing Document
ID: 1
"category": [
    {
        "name": {
            "category_name_vector": [
                -0.1,
                -0.1,
                -0.1,
                ...
            ],
            "en": "this is 1st name"
        }
        
    },
    {
        "name": {
            "category_name_vector": [
                -0.2,
               -0.2,
               -0.2,
                ...
            ],
            "en": "this is 2nd name"
        }
    }
]

Filtered Process Map 
{
  "category":` `
        {
            "name": {
                "category_name_vector": [~~"this is 1st name", "this is 2nd name"~~]
            }
        }
}

Filtered Inference List
[]

Ingested Document
"category": [
    {
        "name": {
            "category_name_vector": [ // copied from existing document
                -0.1,
                -0.1,
                -0.1,
                ...
            ],
            "en": "this is 1st name"
        }
        
    },
    {
        "name": {
            "category_name_vector": [ // copied from existing document
                -0.2,
                -0.2,
                -0.2,
                ...
            ],
            "en": "this is 2nd name"
        }
    }
]

Explanation: Both inference texts “this is 1st name”, “this is 2nd name” mapped with embedding have remained unchanged, so the corresponding embedding values have been copied over, instead of making inference calls

Scenario 2: Document Update with partial change to fields

Update Document
ID: 1
"category": [
   {
        "name": {
            "en": **"this is 2nd name"**
        }
   },
    {
        "name": {
            "en": "this is 2nd name"
        }
    }
]

Existing Document
ID: 1
"category": [
    {
        "name": {
            "category_name_vector": [
                -0.1,
                -0.1,
                -0.1,
                ...
            ],
            "en": "this is 1st name"
        }
        
    },
    {
        "name": {
            "category_name_vector": [
                -0.2,
                -0.2,
                -0.2,
                ...
            ],
            "en": "this is 2nd name"
        }
    }
]

Filtered Process Map 
{
  "category":` `
        {
            "name": {
                "category_name_vector": ["this is 2nd name", "this is 2nd name" ]
            }
        }
}

Filtered Inference List
["this is 2nd name"]

Ingested Document
ID: 1
"category": [
    {
        "name": {
            "category_name_vector" [ // generated via model inference
                -0.2,
                -0.2,
                -0.2,
                ...
            ],
            "en":"this is 2nd name"**
        }
    },
    {
        "name": {
            "category_name_vector": [ // copied from existing document
                -0.2,
                -0.2,
                -0.2,
                ...
            ],
            "en": "this is 2nd name"
        }
    }
]

Explanation: The inference text at index 0 changed from “this is 1st name” to “this is 2nd name”, while the text at index 1 (“this is 2nd name”) remained unchanged. An inference call is made only for the modified text at index 0. Even if the existing document already contains embeddings for “this is 2nd name”, the inference call is still triggered because text changes are evaluated based on their respective indices, not on content duplication.

Scenario 3: Document Update with change to only irrelevant fields

Update Document
ID: 1
"category": [
    {
        "name": {
            "en": "this is 1st name"
        }
    },
    {
        "name": {
            "en": "this is 2nd name"
        }
    }
]
"irrelevant_field": "hello world"

Existing Document
ID: 1
"category": [
    {
        "name": {
            "category_name_vector": [
                -0.1,
                -0.1,
                -0.1,
                ...
            ],
            "en": "this is 1st name"
        }
        
    },
    {
        "name": {
            "category_name_vector": [
                -0.2,
                -0.2,
                -0.2,
                ...
            ],
            "en": "this is 2nd name"
        }
    }
]

Filtered Process Map 
{
  "category":` `
        {
            "name": {
                "category_name_vector": ["this is 1st name", "this is 2nd name"]
            }
        }
}

Filtered Inference List
[]

Ingested Document
"category": [
    {
        "name": {
            "category_name_vector": [ // copied from existing document
                -0.1,
                -0.1,
                -0.1,
                ...
            ],
            "en": "this is 1st name"
        }
        
    },
    {
        "name": {
            "category_name_vector": [ // copied from existing document
                -0.2,
                -0.2,
                -0.2,
                ...
            ],
            "en": "this is 2nd name"
        }
    }
]
**"irrelevant_field": "hello world"**

Explanation: “irrelevant_field” is not mapped with vector embedding, so inference call is not made regardless of the feature. Meanwhile, both inference texts “this is 1st name”, “this is 2nd name” mapped with embedding have remained unchanged, so the corresponding embedding values have been copied over instead of making inference calls.

Scenario 4: Document Update with changes in list values

Update Document
ID: 1
"category": ["category3", "category2"]

Existing Document
ID: 1
"category": ["category1", "category2"]
"category_knn": [
    {
        "knn": [
             -0.1,
             -0.1,
             -0.1,
        ]
    },
    {
        "knn": [
             -0.2,
             -0.2,
             -0.2,
        ]
    }
]

Filtered Process Map 
{
  "category_knn":` ["category3", "category2"]`
}

Filtered Inference List
["category3", "category2"]

Ingested Document
ID: 1
"category": ["category3", "category2"]
"category_knn": [
    {
        **"knn": [ // generated via inference call 
             -0.3,
             -0.3,
             -0.3,
        ]**
    },
    {
        "knn": [ // generated via inference call
             -0.2,
             -0.2,
             -0.2,
        ]
    }
]

Explanation: The optimized text embedding processor compares lists as a whole. If all texts in the list match, the existing embeddings are copied over. If any of the texts doesn't match, inference call is made for all values in the list regardless of matching texts

Testing:
Benchmark Test in progress. Will discuss the results before merge to main

Related Issues

Resolves #[Issue number to be closed when this PR is merged]

HLD: #1138

Check List

New functionality includes testing.
New functionality has been documented.
API changes companion pull request created.
Commits are signed per the DCO using --signoff.
Public documentation issue/PR created.

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.
For more information on following Developer Certificate of Origin and signing off your commits, please check here.

codecov · 2025-02-19T06:41:59Z

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 81.89%. Comparing base (628cb64) to head (8bc759e).
Report is 2 commits behind head on optimized-processor.

Additional details and impacted files

@@                    Coverage Diff                    @@
##             optimized-processor    #1191      +/-   ##
=========================================================
+ Coverage                  81.74%   81.89%   +0.14%     
+ Complexity                  2511     1301    -1210     
=========================================================
  Files                        190       97      -93     
  Lines                       8564     4429    -4135     
  Branches                    1436      749     -687     
=========================================================
- Hits                        7001     3627    -3374     
+ Misses                      1006      512     -494     
+ Partials                     557      290     -267

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

junqiu-lei · 2025-02-19T23:18:04Z

...ain/java/org/opensearch/neuralsearch/processor/optimization/OptimizedInferenceProcessor.java

+ */
+
+@Log4j2
+public abstract class OptimizedInferenceProcessor extends InferenceProcessor {


I feel the term "optimization" is too generic to use

thanks for pointing this out. Will wait to see if others have better opinions on the naming

I think I agree with @junqiu-lei here. There is a already a InferenceProcessor and we are trying to Optimize what i do already. So better if we name it around what optimization it does upon the existing one.
But if we are not able to come up with something more appropriate then I am good with this one. No strong concerns.

will name as ConditionalTextEmbeddingProcessor to convey that it selectively generates embeddings when conditions are met. Let me know if there are better suggestions

will name it SelectiveTextEmbeddingProcessor for now

junqiu-lei · 2025-02-19T23:24:20Z

src/main/java/org/opensearch/neuralsearch/processor/TextEmbeddingProcessor.java


    public static final String TYPE = "text_embedding";
    public static final String LIST_TYPE_NESTED_MAP_KEY = "knn";
+    public static final String IGNORE_EXISTING = "ignore_existing";


From user's point, they might not clear what we're ignoring. maybe rename the flag to something like forceRegeneration or alwaysGenerateEmbedding?

ignore prefix was what we preferred based on discussions I had with ML-commons team because it is consistent with other flags that are in use. But open to more suggestions

I like "ignore_existing_embedding" since it's more descriptive. If the default to maintain backwards compatibility is to have optimization disabled, "ignore" makes it clearer that it's a feature they have to turn on to get the benefit (default false). like I wonder if users would be confused if a setting labelled "force" was default true?

Just saw the above comment about the name of the processor too, maybe ideally the processor name and the option name are similar since the processor exists to support the option?

I think we need to make the naming more clear. ignore_existing is a little bit confusing that we actually want to do the optimization when it is true. People can view it as ignore_existing_embedding which means we want to always do the inference work.

I think even we can document how the parameter works it should be self explain and give people a gut feel what does it control. It should convey the info that we will do some optimization when the doc has existing embedding. It's actually kind weird we have to start with ignore. And if starting with ignore can make it confusing we should propose to not do that. I feel even we simply call it as "optimize_embedding_generation" is more descriptive.

what do we think about skipExisting?

That is a good point. How about skip_existing?

junqiu-lei · 2025-02-19T23:29:00Z

src/testFixtures/java/org/opensearch/neuralsearch/BaseNeuralSearchIT.java

Do we have any integration test covered for the new processors?

will include IT tests in this PR

junqiu-lei · 2025-02-19T23:37:06Z

We might need BWC tests to verify mixed-version cluster scenarios where:

Some nodes are running the old version (without optimized processor)
Some nodes are running the new version

q-andy

If I'm understanding correctly, when a user adds a text embedding processor to the pipeline with the config option as true, then we create a different processor object than if the config was false.

Are there any scenarios where this is confusing for users that the processor object we create isn't the exact same one that they define? Or is this all abstracted away?

q-andy · 2025-02-20T18:34:40Z

...java/org/opensearch/neuralsearch/processor/optimization/OptimizedTextEmbeddingProcessor.java

+            int min = Math.min(((List) sourceList.get()).size(), ((List) existingList.get()).size());
+            for (int j = 0; j < min; j++) {
+                if (((List) sourceList.get()).get(j).equals(((List) existingList.get()).get(j))) {
+                    updatedEmbeddings.add(((List) embeddingList.get()).get(j));


Would having a helper method to cast lists would make this more readable? Like private List asList(Object object). Since raw casting requires double parentheses making it hard to parse inline, compare ((List) object) vs asList(object).

will make it more readable. thanks!

You don't need an extra method here. You just need to declare another variable to avoid repetitive casting.

List sList = (List)sourceList.get(); List eList = (List)existsingList.get(); if (sList.get(j).equals(eList.get(j)) { ... }

will-hwang · 2025-02-20T21:39:15Z

If I'm understanding correctly, when a user adds a text embedding processor to the pipeline with the config option as true, then we create a different processor object than if the config was false.

Are there any scenarios where this is confusing for users that the processor object we create isn't the exact same one that they define? Or is this all abstracted away?

the implementation detail should be abstracted away from user, and all user needs to know is the behavior of the flag.

bzhangam · 2025-02-21T18:51:20Z

...java/org/opensearch/neuralsearch/processor/optimization/OptimizedTextEmbeddingProcessor.java

+        Map<String, Object> existingSourceAndMetadataMap,
+        int index
+    ) {
+        String textKey = ProcessorUtils.findKeyFromFromValue(ProcessorDocumentUtils.unflattenJson(fieldMap), currentPath, level);


I notice we do ProcessorDocumentUtils.unflattenJson(fieldMap) multiple times. Can we only do it once?

fixed. unflattened in constructor

vibrantvarun

Completed the 1st round of review. Will wait for comments to be addressed before doing 2nd round of review.

Thanks

CHANGELOG.md

vibrantvarun · 2025-02-21T18:29:05Z

src/main/java/org/opensearch/neuralsearch/processor/InferenceProcessor.java

-            return v1;
-        } else {
-            return v2;
+        if (v1 instanceof List<?> && v2 instanceof List<?>) {


Can you Add a comment on what does this peice of code do?

This is a leftover from previous code that lacks comments explaining its functionality. It would be great if you could add a comment to clarify it while making your modifications, helping to address this technical debt.

added comments

vibrantvarun · 2025-02-21T18:29:15Z

src/main/java/org/opensearch/neuralsearch/processor/InferenceProcessor.java

+            return v1List;
        }
+
+        if (v1 instanceof Map<?, ?> && v2 instanceof Map<?, ?>) {


Can you Add a comment on what does this peice of code do?

yes, will add comments for this method

reverted back to previous after updated implementation

src/main/java/org/opensearch/neuralsearch/processor/InferenceProcessor.java

vibrantvarun · 2025-02-21T18:30:33Z

src/main/java/org/opensearch/neuralsearch/processor/InferenceProcessor.java

+        IngestDocument ingestDocument,
+        Map<String, Object> processorMap,
+        List<?> results,
+        boolean update


Suggested change

boolean update

boolean update

Can you come up with a little usecase specific variable name?

Or adding a javadoc explaining each parameter for this method might be good as well.

added javadoc for this

vibrantvarun · 2025-02-21T18:41:27Z

src/main/java/org/opensearch/neuralsearch/processor/InferenceProcessor.java

+        }
+    }
+
+    protected void setVectorFieldsToDocument(IngestDocument ingestDocument, Map<String, Object> processorMap, List<?> results) {


When is this method called? I see there are 2 methods with the same name. If we need this method then please change the name of this method to usecase specific.

Or better is to make the earlier method except this flag and put if else

Agree. We can remove this method.

refactored to remove this method

vibrantvarun · 2025-02-21T18:47:12Z

src/main/java/org/opensearch/neuralsearch/processor/InferenceProcessor.java

        // build nlp output for object in sourceValue which is map type
        Iterator<Map<String, Object>> iterator = sourceAndMetadataMapValueInList.iterator();
-        IntStream.range(0, sourceAndMetadataMapValueInList.size()).forEach(index -> {
+        IndexWrapper listIndexWrapper = new IndexWrapper(0);


We don't needs this listIndexWrapper object. The argument indexWrapper is same object what you are trying to create here. Please traceback the indexWrapper object and you will find that it is called from buildNLPResult method in this class https://github.com/opensearch-project/neural-search/pull/1191/files#diff-d2b79f65d193c79dd65558833fcf583eb2c29301325e1eb4eb83a963c737f2f8R468. Which has the same declaration. They what is the need of making this change?

indexWrapper is used to iterate the embedding list for update. listIndexWrapper is used to replace the existingIntStream.range(0, sourceAndMetadataMapValueInList.size()). Currently, every update is just another ingest so we overwrite the values and its embeddings every time. With the new way, some values and its embeddings will have been copied over, so we need a way to skip overwriting those values

If I am understanding correctly then we need only listIndexMapper not the indexMapper. As the listIndexMapper have the details of what needs to be updated right?

this change was done to increment the nestedIndex based on the existence of embedding in document, instead of sequentially incrementing like IntStream.range(0, sourceAndMetadataMapValueInList.size()). This change was done to handle recently found edge cases such as #1024. Since this change is only needed only for specific cases and seems like a distraction to overall objective of the PR, I will raise a separate PR for this change.

The IndexWrapper is necessary. We want to increase counter only when we consumed value from inputNestedMapEntryValue so we need a way to keep the counter value. I see the problem is in the class name which is confusing because index in OpenSearch has its own meaning.

updated implementation logic to remove the need for this change.

src/main/java/org/opensearch/neuralsearch/processor/InferenceProcessor.java

vibrantvarun · 2025-02-21T19:00:45Z

src/main/java/org/opensearch/neuralsearch/processor/InferenceProcessor.java

        IndexWrapper indexWrapper,
        Map<String, Object> sourceAndMetadataMap,
-        int nestedElementIndex
+        IndexWrapper listIndexWrapper


indexWrapper and listIndexWrapper are same. Please reevaluate the need of this object here.

vibrantvarun · 2025-02-21T19:02:28Z

src/main/java/org/opensearch/neuralsearch/processor/InferenceProcessor.java


    public static final String MODEL_ID_FIELD = "model_id";
    public static final String FIELD_MAP_FIELD = "field_map";
    private static final BiFunction<Object, Object, Object> REMAPPING_FUNCTION = (v1, v2) -> {


Can we make this REMAPPING _Function a method? Loading this method in memory by default is not ideal.

Not sure if making it as method is necessary. In terms of loading into memory, there is no difference between this and method.

bzhangam · 2025-02-21T19:48:42Z

src/main/java/org/opensearch/neuralsearch/processor/factory/TextEmbeddingProcessorFactory.java

-        return new TextEmbeddingProcessor(tag, description, batchSize, modelId, filedMap, clientAccessor, environment, clusterService);
+        Map<String, Object> fieldMap = readMap(TYPE, tag, config, FIELD_MAP_FIELD);
+        boolean ignoreExisting = readBooleanProperty(TYPE, tag, config, IGNORE_EXISTING, DEFAULT_IGNORE_EXISTING);
+        if (ignoreExisting == true) {


When we say ignore_existing does that means we want to always do inference work? Which is our existing behavior so we should use the existing processor rather than the optimized one?

open to suggestions, we can make the feature name more descriptive

bzhangam · 2025-02-21T19:53:22Z

...java/org/opensearch/neuralsearch/processor/optimization/OptimizedTextEmbeddingProcessor.java

+    ) {
+        String index = ingestDocument.getSourceAndMetadata().get(INDEX_FIELD).toString();
+        String id = ingestDocument.getSourceAndMetadata().get(ID_FIELD).toString();
+        openSearchClient.execute(GetAction.INSTANCE, new GetRequest(index, id), ActionListener.wrap(response -> {


In this processor we always do this get call which kind downgrade the performance of the operations we know for sure there is no existing embedding can be reused. We can control this through the ignore_existing flag but just thinking would it better in this optimized processor we also consider if the we are processing a create doc API? I think we can tell this info based on the which API is invoked and attach it to the IngestDocument.

that would require model change in ingestDocument to attach API performed on the ingestDocument. This change should be a separate feature in my opinion, and should be implemented separately if needed

There is no difference between create doc and update doc API.

q-andy · 2025-02-21T19:55:04Z

src/main/java/org/opensearch/neuralsearch/processor/factory/TextEmbeddingProcessorFactory.java

-        return new TextEmbeddingProcessor(tag, description, batchSize, modelId, filedMap, clientAccessor, environment, clusterService);
+        Map<String, Object> fieldMap = readMap(TYPE, tag, config, FIELD_MAP_FIELD);
+        boolean ignoreExisting = readBooleanProperty(TYPE, tag, config, IGNORE_EXISTING, DEFAULT_IGNORE_EXISTING);
+        if (ignoreExisting == true) {


nit: Since ignoreExisting is already a boolean you don't need to check equality to true here, you can use the boolean as the condition

following the developer guide: https://github.com/opensearch-project/OpenSearch/blob/main/DEVELOPER_GUIDE.md#java-language-formatting-guidelines

explicit true/false check is preferred

It is only applicable for false conditions. If(ignoreExisting) by all means mean a true condition.

Therefore, it is recommended to use if(boolvariable) rather than if(boolvariable==true). The end goal is to improve readability. For false conditions it makes sense to write that explicitly.

okay, will change

This is standard format for OpenSearch core where we explicitly compare it to either true or false. Even if we don't force it in neural, would be nice to follow it.

heemin32 · 2025-02-24T00:31:16Z

src/main/java/org/opensearch/neuralsearch/processor/InferenceProcessor.java

-        } else {
-            return v2;
+        if (v1 instanceof List<?> && v2 instanceof List<?>) {
+            List<Object> v1List = new ArrayList<>((List<?>) v1);


Can we avoid creating new array here?

heemin32 · 2025-02-24T00:37:14Z

src/main/java/org/opensearch/neuralsearch/processor/InferenceProcessor.java

+            List<?> v2List = (List<?>) v2;
+
+            Iterator<?> iterator = v2List.iterator();
+            for (int i = 0; i < v1List.size() && iterator.hasNext(); i++) {


In previous code, we added all item from v2 to v1. Now, we are setting null value in v1 using v2 value. Could you tell how this new logic can cover the previous use case as well?

The assumption here is that, the number of null value in v1List is same as the size of v2. Can we make it obvious from the code? For example we can do like

for (int i = 0; i , v1List.size(); i++) { if (v1List.get(i) == null) { assert iterator.hasNext() == true; v1List.set(i, iterator.next()); } } assert iterator.hasNext() == false;

refactored with comments

heemin32 · 2025-02-24T00:37:52Z

src/main/java/org/opensearch/neuralsearch/processor/InferenceProcessor.java

        }
+
+        if (v1 instanceof Map<?, ?> && v2 instanceof Map<?, ?>) {
+            Map<String, Object> v1Map = new LinkedHashMap<>((Map<String, Object>) v1);


Can we avoid creating new map? This might be costly.

We checked if v1 is instance of Map<?, ?> but not we are casting it to Map<String, Object>. This is not safe.

heemin32 · 2025-02-24T00:39:04Z

src/main/java/org/opensearch/neuralsearch/processor/InferenceProcessor.java

+            }
+            return v1Map;
+        }
+        return v2 != null ? v2 : v1;


Do we ever get to this case? What is the logic behind this returning v2 and ignoring v1?

heemin32 · 2025-02-24T00:44:32Z

src/main/java/org/opensearch/neuralsearch/processor/InferenceProcessor.java

+            Map<?, ?> v2Map = (Map<?, ?>) v2;
+
+            for (Map.Entry<?, ?> entry : v2Map.entrySet()) {
+                if (entry.getKey() instanceof String && !v1Map.containsKey(entry.getKey())) {


Is there a case where v2Map has different types of key? Like one key is string type and other int type?

src/main/java/org/opensearch/neuralsearch/processor/util/ProcessorUtils.java

heemin32 · 2025-02-24T02:00:21Z

src/main/java/org/opensearch/neuralsearch/processor/util/ProcessorUtils.java

+        Map<String, Object> currentMap = sourceAsMap;
+        String targetValue = keys[keys.length - 1];
+        for (String key : keys) {
+            if (key.equals(targetValue)) {


what if path is "level1.level2.level1" and level is 3?
The code will break here and lastFoundKey will be null.

* map: * { * "level1": { * "level2" : { * "first_text": "level1" * } * } * }

fixed, and added a test case

heemin32 · 2025-02-24T02:01:30Z

src/main/java/org/opensearch/neuralsearch/processor/util/ProcessorUtils.java

+            if (currentMap.containsKey(key)) {
+                Object value = currentMap.get(key);
+                if (value instanceof Map) {
+                    currentMap = (Map<String, Object>) value;


This is unsafe cast

src/main/java/org/opensearch/neuralsearch/processor/util/ProcessorUtils.java

vibrantvarun · 2025-02-27T17:37:29Z

src/main/java/org/opensearch/neuralsearch/processor/TextEmbeddingProcessor.java

    public static final String TYPE = "text_embedding";
    public static final String LIST_TYPE_NESTED_MAP_KEY = "knn";
+    public static final String SKIP_EXISTING = "skip_existing";
+    public static final boolean DEFAULT_SKIP_EXISTING = Boolean.FALSE;


Suggested change

public static final boolean DEFAULT_SKIP_EXISTING = Boolean.FALSE;

public static final boolean DEFAULT_SKIP_EXISTING = false;

You are using primitive datatype so false should be fine.

src/main/java/org/opensearch/neuralsearch/processor/TextEmbeddingProcessor.java

vibrantvarun · 2025-02-27T17:40:22Z

src/main/java/org/opensearch/neuralsearch/processor/TextEmbeddingProcessor.java

+                    }
+                }
+            }, e -> { handler.accept(null, e); }));
+        } else { // skip existing flag is turned off. Call model inference without filtering


Add comment in new line.

src/main/java/org/opensearch/neuralsearch/processor/optimization/InferenceFilter.java

...in/java/org/opensearch/neuralsearch/processor/optimization/TextEmbeddingInferenceFilter.java

vibrantvarun · 2025-02-27T17:45:32Z

src/main/java/org/opensearch/neuralsearch/processor/util/ProcessorUtils.java

        return currentValue;
    }

+    public static void setValueToSource(Map<String, Object> sourceAsMap, String targetKey, Object targetValue) {


Why do you need this method? There is another method with the same name below. Can you combine both of them?

i wanted to keep it consistent with getValueFromSource method, which can support two types of nested maps, one with just nested maps (link), and one with nested maps with lists (link)

src/main/java/org/opensearch/neuralsearch/processor/util/ProcessorUtils.java

vibrantvarun · 2025-02-27T17:47:31Z

src/main/java/org/opensearch/neuralsearch/processor/util/ProcessorUtils.java

        }
    }

+    public static Optional<Object> getValueFromSource(final Map<String, Object> sourceAsMap, final String targetField) {


There is a method with same name and return type below. Can you combine this and the below one? Why do we need this separate method?

this is an existing method that assumes nested structure will not contain lists.

src/main/java/org/opensearch/neuralsearch/processor/InferenceProcessor.java

src/main/java/org/opensearch/neuralsearch/processor/TextEmbeddingProcessor.java

heemin32 · 2025-02-27T19:37:30Z

src/main/java/org/opensearch/neuralsearch/processor/TextEmbeddingProcessor.java

+                            // have been copied
+            String index = ingestDocument.getSourceAndMetadata().get(INDEX_FIELD).toString();
+            String id = ingestDocument.getSourceAndMetadata().get(ID_FIELD).toString();
+            openSearchClient.execute(GetAction.INSTANCE, new GetRequest(index, id), ActionListener.wrap(response -> {


I have a concern with this async operation. This might impact to other search request by consuming all search thread.
There are three options I can think of,

we might need to have a circuit breaker to limit the total number of concurrent ingestion request.

ask cx to pass embedding if they want to skip inference call so that we only can access the source document without existing document or cx can omit the text field so that inference can be skipped for update operation.

Make it as synchronous.

discussed offline, we'll first perform benchmark testing to see the impact on the current change

heemin32 · 2025-02-27T19:45:40Z

src/main/java/org/opensearch/neuralsearch/processor/optimization/InferenceFilter.java

+        Map<String, Object> filteredProcessMap = new HashMap<>();
+        Map<String, Object> castedProcessMap = ProcessorUtils.castToMap(processMap);
+        for (Map.Entry<?, ?> entry : castedProcessMap.entrySet()) {
+            if ((entry.getKey() instanceof String) == false) {


You don't need this check. castedProcessMap is already Map<String, Object> and the key is String.

heemin32 · 2025-02-27T19:45:56Z

src/main/java/org/opensearch/neuralsearch/processor/optimization/InferenceFilter.java

+        }
+        Map<String, Object> filteredProcessMap = new HashMap<>();
+        Map<String, Object> castedProcessMap = ProcessorUtils.castToMap(processMap);
+        for (Map.Entry<?, ?> entry : castedProcessMap.entrySet()) {


Suggested change

for (Map.Entry<?, ?> entry : castedProcessMap.entrySet()) {

for (Map.Entry<String, Object> entry : castedProcessMap.entrySet()) {

heemin32 · 2025-02-27T19:49:45Z

src/main/java/org/opensearch/neuralsearch/processor/optimization/InferenceFilter.java

+            } else if (value instanceof List) {
+                List<Object> processedList = filterListValue(
+                    currentPath,
+                    (List<Object>) value,


This is not safe as well. If you know that value will be always List, let's use

@SuppressWarnings("unchecked") public static List<Object> castToObjectList(List<?> obj) { return (List<Object>) obj; }

heemin32 · 2025-02-27T19:50:59Z

src/main/java/org/opensearch/neuralsearch/processor/optimization/InferenceFilter.java

+        // return empty list if processList and existingList are equal and embeddings are copied, return empty list otherwise
+        return filterInferenceValuesInList(
+            processList,
+            (List<Object>) existingList.get(),


Same for this. Not all List is List. For example, we cannot cast List to List

heemin32 · 2025-02-28T00:36:50Z

src/main/java/org/opensearch/neuralsearch/processor/TextEmbeddingProcessor.java

+                    }
+                }
+            }, e -> { handler.accept(null, e); }));
+        } else {


Nit. Better to handle simple case at top and reduce the nested depth?

if (skipExisting == false) { makeInferenceCall(ingestDocument, processMap, inferenceList, handler); return; } // rest of codes..

heemin32 · 2025-02-28T00:40:37Z

src/main/java/org/opensearch/neuralsearch/processor/util/ProcessorUtils.java

+    // This method should be used only when you are certain the object is a `Map<String, Object>`.
+    // It is recommended to use this method as a last resort.
+    @SuppressWarnings("unchecked")
+    public static Map<String, Object> castToMap(Object obj) {


Suggested change

public static Map<String, Object> castToMap(Object obj) {

public static Map<String, Object> unsafeCastToObjectMap(Object obj) {

heemin32 · 2025-02-28T00:41:39Z

src/main/java/org/opensearch/neuralsearch/processor/util/ProcessorUtils.java

+    // This method should be used only when you are certain the object is a `List<Object>`.
+    // It is recommended to use this method as a last resort.
+    @SuppressWarnings("unchecked")
+    public static List<Object> castToObjectList(Object obj) {


Suggested change

public static List<Object> castToObjectList(Object obj) {

public static List<Object> unsafeCastToObjectList(Object obj) {

heemin32 · 2025-02-28T01:00:15Z

src/main/java/org/opensearch/neuralsearch/processor/util/ProcessorUtils.java

     *
     * @param sourceAsMap The Source map (a map of maps) to iterate through
     * @param targetField The path to take to get the desired mapping
+     * @param index       the index to use when a list is encountered during traversal; if list processing is not needed,


What is the case when list processing is not needed when the value is list? In such case, returning empty optional is expected behavior?

user should always pass in index, if they know sourceAsMap contains list. In any other case, we should return empty optional to indicate value not found.

Then the design of this method is not correct imo. If the caller already know that it contains list, they just get the value directly or there should be a separate method for that.

src/main/java/org/opensearch/neuralsearch/processor/util/ProcessorUtils.java

heemin32 · 2025-02-28T01:05:46Z

src/main/java/org/opensearch/neuralsearch/processor/util/ProcessorUtils.java

                }
-                Map<String, Object> currentMap = (Map<String, Object>) value;
-                return Optional.ofNullable(currentMap.get(key));
+                return Optional.empty();


Nit. For me better to put this under else block to make sure every block should return itself without missing any condition.

else { Optional.empty(); }

heemin32 · 2025-02-28T01:06:55Z

src/main/java/org/opensearch/neuralsearch/processor/util/ProcessorUtils.java

+                return Optional.empty();
            });

            if (currentValue.isEmpty()) {


Maybe break; could be better?

heemin32 · 2025-02-28T01:08:05Z

src/main/java/org/opensearch/neuralsearch/processor/util/ProcessorUtils.java

+
+        for (int i = 0; i < keys.length - 1; i++) {
+            Object next = current.computeIfAbsent(keys[i], k -> new HashMap<>());
+            if (next instanceof ArrayList<?> list) {


Same, is there a case where next is array list but index is -1?

if user assumes map doesn't contain lists, and doesn't specify the index, it will not set the value to map

heemin32 · 2025-02-28T17:00:57Z

src/main/java/org/opensearch/neuralsearch/processor/util/ProcessorUtils.java

     *
     * @param sourceAsMap The Source map (a map of maps) to iterate through
     * @param targetField The path to take to get the desired mapping
+     * @param index       the index to use when a list is encountered during traversal; if list processing is not needed,


Then the design of this method is not correct imo. If the caller already know that it contains list, they just get the value directly or there should be a separate method for that.

heemin32 · 2025-02-28T17:34:54Z

src/main/java/org/opensearch/neuralsearch/processor/optimization/InferenceFilter.java

+     * Stores the reverse mapping of field names to support efficient lookups for embedding keys.
+     * This is generated by flattening and flipping the provided field map.
+     */
+    protected Map<String, String> reversedFieldMap;


Could you comment what is the example key and value here?

heemin32 · 2025-02-28T17:46:43Z

src/main/java/org/opensearch/neuralsearch/processor/optimization/InferenceFilter.java

+                    existingSourceAndMetadataMap
+                );
+                filteredProcessMap.put(key, processedList);
+            } else {


Please check if value is String and throw exception otherwise.

heemin32 · 2025-02-28T17:48:22Z

src/main/java/org/opensearch/neuralsearch/processor/optimization/InferenceFilter.java

+        if (embeddingList.isPresent() == false || embeddingList.get() instanceof List == false) {
+            return processList;
+        }
+        // return empty list if processList and existingList are equal and embeddings are copied, return empty list otherwise


This is not correct

heemin32 · 2025-02-28T17:52:57Z

src/main/java/org/opensearch/neuralsearch/util/ProcessorDocumentUtils.java

+            String newKey = prefix.isEmpty() ? key : prefix + "." + key;
+
+            if (value instanceof Map) {
+                flattenAndFlip(newKey, (Map<String, Object>) value, flippedMap);


heemin32 · 2025-02-28T18:01:49Z

src/main/java/org/opensearch/neuralsearch/util/ProcessorDocumentUtils.java

+                String transformedValue = parentPath.isEmpty() ? value.toString() : parentPath + "." + value.toString();
+                if (flippedMap.containsKey(transformedValue)) {
+                    int index = 1;
+                    while (flippedMap.containsKey(transformedValue + "_" + index)) {


This is not efficient..

heemin32 · 2025-02-28T19:18:59Z

src/main/java/org/opensearch/neuralsearch/processor/optimization/InferenceFilter.java

+    private Map<String, Object> filter(
+        Map<String, Object> existingSourceAndMetadataMap,
+        Map<String, Object> sourceAndMetadataMap,
+        Object processMap,


Suggested change

Object processMap,

Map<String, Object> processMap,

heemin32 · 2025-02-28T19:19:13Z

src/main/java/org/opensearch/neuralsearch/processor/optimization/InferenceFilter.java

+        Object processMap,
+        String traversedPath
+    ) {
+        if (processMap instanceof Map == false) {


Suggested change

if (processMap instanceof Map == false) {

heemin32 · 2025-02-28T19:20:37Z

src/main/java/org/opensearch/neuralsearch/processor/optimization/InferenceFilter.java

+            Object value = entry.getValue();
+            String currentPath = traversedPath.isEmpty() ? key : traversedPath + "." + key;
+            if (value instanceof Map<?, ?>) {
+                Map<String, Object> filteredInnerMap = filter(existingSourceAndMetadataMap, sourceAndMetadataMap, value, currentPath);


Suggested change

Map<String, Object> filteredInnerMap = filter(existingSourceAndMetadataMap, sourceAndMetadataMap, value, currentPath);

Map<String, Object> filteredInnerMap = filter(existingSourceAndMetadataMap, sourceAndMetadataMap, unsafeCastToObjectMap(value), currentPath);

heemin32 · 2025-02-28T20:04:13Z

src/main/java/org/opensearch/neuralsearch/processor/util/ProcessorUtils.java

     * @return A possible result within an optional
     */
-    public static Optional<Object> getValueFromSource(final Map<String, Object> sourceAsMap, final String targetField) {
+    public static Optional<Object> getValueFromSource(final Map<String, Object> sourceAsMap, final String targetField, int index) {


Suggested change

public static Optional<Object> getValueFromSource(final Map<String, Object> sourceAsMap, final String targetField, int index) {

public static List<Optional<Object>> getValueFromSource(final Map<String, Object> sourceAsMap, final String targetField, int index) {

left it as <Optional<Object>> that can return single value, list of values and list of list values. If there's concern, please let me know

heemin32 · 2025-03-04T21:05:05Z

src/main/java/org/opensearch/neuralsearch/util/ProcessorDocumentUtils.java

+            String key = entry.getKey();
+            Object value = entry.getValue();
+            String currentKey = path.isEmpty() ? key : path + "." + key;
+            if (Objects.isNull(currentKey)) return;


Shouldn't we check key but not currentKey?

Suggested change

if (Objects.isNull(currentKey)) return;

if (Objects.isNull(key)) return;

this check is not needed in valid cases, added to avoid unit test failure

Could you fix the unit text?

this is the failing unit test case, which checks for failure case: https://github.com/opensearch-project/neural-search/blob/main/src/test/java/org/opensearch/neuralsearch/processor/TextEmbeddingProcessorTests.java#L122-L135

I can modify the implementation to bypass this unit test case

heemin32 · 2025-03-04T21:09:01Z

src/main/java/org/opensearch/neuralsearch/util/ProcessorDocumentUtils.java

+            if (value instanceof Map) {
+                flattenAndFlip(currentKey, ProcessorUtils.unsafeCastToObjectMap(value), flippedMap);
+            } else if (value instanceof String) {
+                String traversedPath = currentKey.contains(".") ? currentKey.substring(0, currentKey.lastIndexOf('.')) : "";


Suggested change

String traversedPath = currentKey.contains(".") ? currentKey.substring(0, currentKey.lastIndexOf('.')) : "";

String traversedPath = path;

path may not contain fully traversed path when we get . separated path.
we need to keep track of traversed path with currentKey value.

As an example, for given map:
{level1.level2: knn_field}

we need to insert into flipped map the full path to value: level1.level2.knn_field, instead of just knn_field in this case

Signed-off-by: Will Hwang <[email protected]>

heemin32

LGTM. Thanks.

vibrantvarun · 2025-03-05T22:37:56Z

@heemin32 I had comments added earlier in this PR. Not sure, why we closed this PR early? I will add my comments once this PR raised for main.

Thanks

heemin32 · 2025-03-05T22:53:01Z

@vibrantvarun Closed it as I thought all the comments are resolved. Could you point out comments which are not resolved or add one if there is something new? @will-hwang can address them in a new PR. Adding comment when this is merged to main will be too late.

opensearch-project#1191) Signed-off-by: Will Hwang <[email protected]>

…1238) * implement single document update scenario for text embedding processor (#1191) Signed-off-by: Will Hwang <[email protected]> * implement batch document update scenario for text embedding processor (#1217) Signed-off-by: Will Hwang <[email protected]> --------- Signed-off-by: Will Hwang <[email protected]>

…pensearch-project#1238) * implement single document update scenario for text embedding processor (opensearch-project#1191) Signed-off-by: Will Hwang <[email protected]> * implement batch document update scenario for text embedding processor (opensearch-project#1217) Signed-off-by: Will Hwang <[email protected]> --------- Signed-off-by: Will Hwang <[email protected]>

…pensearch-project#1238) * implement single document update scenario for text embedding processor (opensearch-project#1191) Signed-off-by: Will Hwang <[email protected]> * implement batch document update scenario for text embedding processor (opensearch-project#1217) Signed-off-by: Will Hwang <[email protected]> --------- Signed-off-by: Will Hwang <[email protected]> Signed-off-by: yeonghyeonKo <[email protected]>

#1342) * Implement Optimized embedding generation in text embedding processor (#1238) * implement single document update scenario for text embedding processor (#1191) Signed-off-by: Will Hwang <[email protected]> * implement batch document update scenario for text embedding processor (#1217) Signed-off-by: Will Hwang <[email protected]> --------- Signed-off-by: Will Hwang <[email protected]> Signed-off-by: yeonghyeonKo <[email protected]> * Going from alpha1 to beta1 for 3.0 release (#1245) Signed-off-by: yeonghyeonKo <[email protected]> * Implement Optimized embedding generation in sparse encoding processor (#1246) Signed-off-by: Will Hwang <[email protected]> Signed-off-by: yeonghyeonKo <[email protected]> * Implement Optimized embedding generation in text and image embedding processor (#1249) Signed-off-by: will-hwang <[email protected]> Signed-off-by: yeonghyeonKo <[email protected]> * Inner hits support with hybrid query (#1253) * Inner Hits in Hybrid query Signed-off-by: Varun Jain <[email protected]> * Inner hits support with hybrid query Signed-off-by: Varun Jain <[email protected]> * Add changelog Signed-off-by: Varun Jain <[email protected]> * fix integ tests Signed-off-by: Varun Jain <[email protected]> * Modify comment Signed-off-by: Varun Jain <[email protected]> * Explain test case Signed-off-by: Varun Jain <[email protected]> * Optimize inner hits count calculation method Signed-off-by: Varun Jain <[email protected]> --------- Signed-off-by: Varun Jain <[email protected]> Signed-off-by: yeonghyeonKo <[email protected]> * Support custom tags in semantic highlighter (#1254) Signed-off-by: yeonghyeonKo <[email protected]> * Add neural stats API (#1256) * Add neural stats API Signed-off-by: Andy Qin <[email protected]> Signed-off-by: yeonghyeonKo <[email protected]> * Added release notes for 3.0 beta1 (#1252) * Added release notes for 3.0 beta1 Signed-off-by: Martin Gaievski <[email protected]> Signed-off-by: yeonghyeonKo <[email protected]> * Update semantic highlighter test model (#1259) Signed-off-by: Junqiu Lei <[email protected]> Signed-off-by: yeonghyeonKo <[email protected]> * Fix the edge case when the value of a fieldMap key in ingestDocument is empty string (#1257) Signed-off-by: Chloe Gao <[email protected]> Signed-off-by: yeonghyeonKo <[email protected]> * Hybrid query should call rewrite before creating weight (#1268) * Hybrid query should call rewrite before creating weight Signed-off-by: Harsha Vamsi Kalluri <[email protected]> * Awaits fix Signed-off-by: Harsha Vamsi Kalluri <[email protected]> * Rewrite with searcher Signed-off-by: Harsha Vamsi Kalluri <[email protected]> * Feature flag issue Signed-off-by: Harsha Vamsi Kalluri <[email protected]> --------- Signed-off-by: Harsha Vamsi Kalluri <[email protected]> Signed-off-by: yeonghyeonKo <[email protected]> * Support phasing off SecurityManager usage in favor of Java Agent (#1265) Signed-off-by: Gulshan <[email protected]> Signed-off-by: yeonghyeonKo <[email protected]> * Fix multi node transport issue on NeuralKNNQueryBuilder originalQueryText (#1272) Signed-off-by: Junqiu Lei <[email protected]> Signed-off-by: yeonghyeonKo <[email protected]> * Add semantic field mapper. (#1225) Signed-off-by: Bo Zhang <[email protected]> Signed-off-by: yeonghyeonKo <[email protected]> * Increment version to 3.0.0-SNAPSHOT (#1286) Signed-off-by: opensearch-ci-bot <[email protected]> Co-authored-by: opensearch-ci-bot <[email protected]> Signed-off-by: yeonghyeonKo <[email protected]> * Remove beta1 qualifier (#1292) Signed-off-by: Peter Zhu <[email protected]> Signed-off-by: yeonghyeonKo <[email protected]> * Fix for merging scoreDocs when totalHits are greater than 1 and fieldDocs are 0 (#1295) (#1296) (cherry picked from commit 6f3aabb) Co-authored-by: Varun Jain <[email protected]> Signed-off-by: yeonghyeonKo <[email protected]> * add release notes for 3.0 (#1287) Signed-off-by: will-hwang <[email protected]> Signed-off-by: yeonghyeonKo <[email protected]> * Allow maven to publish to all versions (#1300) (#1301) Signed-off-by: Peter Zhu <[email protected]> (cherry picked from commit c5625db) Co-authored-by: Peter Zhu <[email protected]> Signed-off-by: yeonghyeonKo <[email protected]> * [FEAT] introduce new FixedStringLengthChunker Signed-off-by: yeonghyeonKo <[email protected]> * [TEST] initial test cases for FixedStringLengthChunker Signed-off-by: yeonghyeonKo <[email protected]> * [FIX] gradlew spotlessApply Signed-off-by: yeonghyeonKo <[email protected]> * [REFACTOR] remove unnecessary comments Signed-off-by: yeonghyeonKo <[email protected]> * [Performance Improvement] Add custom bulk scorer for hybrid query (2-3x faster) (#1289) Signed-off-by: Martin Gaievski <[email protected]> Signed-off-by: yeonghyeonKo <[email protected]> * Add TextChunkingProcessor stats (#1308) * Add TextChunkingProcessor stats Signed-off-by: Andy Qin <[email protected]> # Conflicts: # CHANGELOG.md * Update unit and integ tests Signed-off-by: Andy Qin <[email protected]> --------- Signed-off-by: Andy Qin <[email protected]> Signed-off-by: yeonghyeonKo <[email protected]> * Update Lucene dependencies (#1336) * Update Lucene dependencies Signed-off-by: Ryan Bogan <[email protected]> * Add changelog entry Signed-off-by: Ryan Bogan <[email protected]> * Update model request body for bwc and integ tests Signed-off-by: Ryan Bogan <[email protected]> --------- Signed-off-by: Ryan Bogan <[email protected]> Signed-off-by: yeonghyeonKo <[email protected]> * [REFACTOR] modify algorithm name and related parts Signed-off-by: yeonghyeonKo <[email protected]> * [REFACTOR] update test codes along with the change in CharacterLengthChunker Signed-off-by: yeonghyeonKo <[email protected]> * [REFACTOR] remove defensive check to prevent adding redundant code lines Signed-off-by: yeonghyeonKo <[email protected]> * Update CharacterLengthChunker to FixedCharLengthChunker Signed-off-by: Marcel Yeonghyeon Ko <[email protected]> Signed-off-by: yeonghyeonKo <[email protected]> * Update ChunkerFactory Signed-off-by: Marcel Yeonghyeon Ko <[email protected]> Signed-off-by: yeonghyeonKo <[email protected]> * Update CharacterLengthChunkerTests to FixedCharLengthChunkerTests Signed-off-by: Marcel Yeonghyeon Ko <[email protected]> Signed-off-by: yeonghyeonKo <[email protected]> * [FIX] handle a corner case where the content is shorter than charLimit Signed-off-by: yeonghyeonKo <[email protected]> * [TEST] Add integration test codes for fixed_char_length chunking algorithm Signed-off-by: yeonghyeonKo <[email protected]> * [TEST] integration test code for cascaded pipeline Signed-off-by: yeonghyeonKo <[email protected]> * Support analyzer-based neural sparse query (#1088) * merge main; add analyzer impl Signed-off-by: zhichao-aws <[email protected]> * two phase adaption Signed-off-by: zhichao-aws <[email protected]> * two phase adaption Signed-off-by: zhichao-aws <[email protected]> * remove analysis Signed-off-by: zhichao-aws <[email protected]> * lint Signed-off-by: zhichao-aws <[email protected]> * update Signed-off-by: zhichao-aws <[email protected]> * address comments Signed-off-by: zhichao-aws <[email protected]> * tests Signed-off-by: zhichao-aws <[email protected]> * modify plugin security policy Signed-off-by: zhichao-aws <[email protected]> * change log Signed-off-by: zhichao-aws <[email protected]> * address comments Signed-off-by: zhichao-aws <[email protected]> * modify to package-private Signed-off-by: zhichao-aws <[email protected]> --------- Signed-off-by: zhichao-aws <[email protected]> Signed-off-by: yeonghyeonKo <[email protected]> * Fixed score value as null for single shard for sorting (#1277) * Fixed score value as null for single shard for sorting Signed-off-by: Owais <[email protected]> * Addressed comment Signed-off-by: Owais <[email protected]> * Addressed more comments Signed-off-by: Owais <[email protected]> * Added UT Signed-off-by: Owais <[email protected]> --------- Signed-off-by: Owais <[email protected]> Signed-off-by: yeonghyeonKo <[email protected]> * Add IT for neural sparse query + bert-uncased mbert-uncased analyzer (#1279) * add it Signed-off-by: zhichao-aws <[email protected]> * change log Signed-off-by: zhichao-aws <[email protected]> --------- Signed-off-by: zhichao-aws <[email protected]> Signed-off-by: yeonghyeonKo <[email protected]> * Add WithFieldName implementation to QueryBuilders (#1285) Signed-off-by: Owais <[email protected]> Signed-off-by: yeonghyeonKo <[email protected]> * [AUTO] Increment version to 3.1.0-SNAPSHOT (#1288) * Increment version to 3.1.0-SNAPSHOT Signed-off-by: opensearch-ci-bot <[email protected]> * Update build.gradle Signed-off-by: Peter Zhu <[email protected]> --------- Signed-off-by: opensearch-ci-bot <[email protected]> Signed-off-by: Peter Zhu <[email protected]> Co-authored-by: opensearch-ci-bot <[email protected]> Co-authored-by: Peter Zhu <[email protected]> Signed-off-by: yeonghyeonKo <[email protected]> * add release notes for 3.0 (#1298) Signed-off-by: will-hwang <[email protected]> Signed-off-by: yeonghyeonKo <[email protected]> * Return bad request for invalid stat parameters in stats API (#1291) Signed-off-by: Andy Qin <[email protected]> Signed-off-by: yeonghyeonKo <[email protected]> * Add semantic mapping transformer. (#1276) Signed-off-by: Bo Zhang <[email protected]> Signed-off-by: yeonghyeonKo <[email protected]> * Add semantic ingest processor. (#1309) Signed-off-by: Bo Zhang <[email protected]> Signed-off-by: yeonghyeonKo <[email protected]> * [Performance Improvement] Add custom bulk scorer for hybrid query (2-3x faster) (#1289) Signed-off-by: Martin Gaievski <[email protected]> Signed-off-by: yeonghyeonKo <[email protected]> * Implement the query logic for the semantic field. (#1315) Signed-off-by: Bo Zhang <[email protected]> Signed-off-by: yeonghyeonKo <[email protected]> * Support custom weights params in RRF (#1322) * Support Weights params in RRF Signed-off-by: Varun Jain <[email protected]> Signed-off-by: yeonghyeonKo <[email protected]> * add validation for invalid nested hybrid query (#1305) * add validation for nested hybrid query Signed-off-by: will-hwang <[email protected]> Signed-off-by: yeonghyeonKo <[email protected]> * Add stats tracking for semantic highlighting (#1327) * Add stats tracking for semantic highlighting Signed-off-by: Junqiu Lei <[email protected]> Signed-off-by: yeonghyeonKo <[email protected]> * Update Lucene dependencies (#1336) * Update Lucene dependencies Signed-off-by: Ryan Bogan <[email protected]> * Add changelog entry Signed-off-by: Ryan Bogan <[email protected]> * Update model request body for bwc and integ tests Signed-off-by: Ryan Bogan <[email protected]> --------- Signed-off-by: Ryan Bogan <[email protected]> Signed-off-by: yeonghyeonKo <[email protected]> * Enhance semantic field to allow to enable/disable chunking. (#1337) * Implement the query logic for the semantic field. Signed-off-by: Bo Zhang <[email protected]> * Enhance semantic field to allow to enable/disable chunking. Signed-off-by: Bo Zhang <[email protected]> --------- Signed-off-by: Bo Zhang <[email protected]> Signed-off-by: yeonghyeonKo <[email protected]> * [REFACTOR] modify algorithm name and related parts Signed-off-by: yeonghyeonKo <[email protected]> * Update CHANGELOG.md Signed-off-by: Marcel Yeonghyeon Ko <[email protected]> Signed-off-by: yeonghyeonKo <[email protected]> * [FEAT] Add fixed_char_length chunking algorithm to STAT manager Signed-off-by: yeonghyeonKo <[email protected]> * [TEST] Add integration test codes for fixed_char_length chunking algorithm Signed-off-by: yeonghyeonKo <[email protected]> * [TEST] integration test code for cascaded pipeline Signed-off-by: yeonghyeonKo <[email protected]> * Going from alpha1 to beta1 for 3.0 release (#1245) Signed-off-by: yeonghyeonKo <[email protected]> * Fix multi node transport issue on NeuralKNNQueryBuilder originalQueryText (#1272) Signed-off-by: Junqiu Lei <[email protected]> Signed-off-by: yeonghyeonKo <[email protected]> * Add semantic field mapper. (#1225) Signed-off-by: Bo Zhang <[email protected]> Signed-off-by: yeonghyeonKo <[email protected]> * Add semantic mapping transformer. (#1276) Signed-off-by: Bo Zhang <[email protected]> Signed-off-by: yeonghyeonKo <[email protected]> * Fix multi node transport issue on NeuralKNNQueryBuilder originalQueryText (#1272) Signed-off-by: Junqiu Lei <[email protected]> * Add semantic field mapper. (#1225) Signed-off-by: Bo Zhang <[email protected]> * Add semantic mapping transformer. (#1276) Signed-off-by: Bo Zhang <[email protected]> * [FIX] minor typo Signed-off-by: yeonghyeonKo <[email protected]> * [REFACTOR] adopt FixedTokenLengthChunker's loop strategy for robust final chunking Signed-off-by: yeonghyeonKo <[email protected]> * [TEST] sum the number of processors and their executions correctly in TextChunkingProcessorIT Signed-off-by: yeonghyeonKo <[email protected]> * [REFACTOR] gradlew spotlessApply Signed-off-by: yeonghyeonKo <[email protected]> --------- Signed-off-by: Will Hwang <[email protected]> Signed-off-by: yeonghyeonKo <[email protected]> Signed-off-by: will-hwang <[email protected]> Signed-off-by: Varun Jain <[email protected]> Signed-off-by: Andy Qin <[email protected]> Signed-off-by: Martin Gaievski <[email protected]> Signed-off-by: Junqiu Lei <[email protected]> Signed-off-by: Chloe Gao <[email protected]> Signed-off-by: Harsha Vamsi Kalluri <[email protected]> Signed-off-by: Gulshan <[email protected]> Signed-off-by: Bo Zhang <[email protected]> Signed-off-by: opensearch-ci-bot <[email protected]> Signed-off-by: Peter Zhu <[email protected]> Signed-off-by: Ryan Bogan <[email protected]> Signed-off-by: Marcel Yeonghyeon Ko <[email protected]> Signed-off-by: zhichao-aws <[email protected]> Signed-off-by: Owais <[email protected]> Co-authored-by: Will Hwang <[email protected]> Co-authored-by: Martin Gaievski <[email protected]> Co-authored-by: Varun Jain <[email protected]> Co-authored-by: Junqiu Lei <[email protected]> Co-authored-by: Andy <[email protected]> Co-authored-by: Chloe Gao <[email protected]> Co-authored-by: Harsha Vamsi Kalluri <[email protected]> Co-authored-by: Gulshan <[email protected]> Co-authored-by: Bo Zhang <[email protected]> Co-authored-by: opensearch-trigger-bot[bot] <98922864+opensearch-trigger-bot[bot]@users.noreply.github.com> Co-authored-by: opensearch-ci-bot <[email protected]> Co-authored-by: Peter Zhu <[email protected]> Co-authored-by: Ryan Bogan <[email protected]> Co-authored-by: zhichao-aws <[email protected]> Co-authored-by: Owais Kazi <[email protected]>

will-hwang force-pushed the optimized_text_embedding_processor_single_doc_update branch from 9264760 to d68dca2 Compare February 19, 2025 07:18

will-hwang marked this pull request as ready for review February 19, 2025 07:21

will-hwang requested review from VijayanB, heemin32, jmazanec15, junqiu-lei, martin-gaievski, minalsha, model-collapse, naveentatikonda, navneet1v, sean-zheng-amazon, vamshin, vibrantvarun, yuye-aws, zane-neo and zhichao-aws as code owners February 19, 2025 07:21

junqiu-lei reviewed Feb 19, 2025

View reviewed changes

q-andy reviewed Feb 20, 2025

View reviewed changes

bzhangam reviewed Feb 21, 2025

View reviewed changes

vibrantvarun reviewed Feb 21, 2025

View reviewed changes

bzhangam reviewed Feb 21, 2025

View reviewed changes

q-andy reviewed Feb 21, 2025

View reviewed changes

heemin32 reviewed Feb 24, 2025

View reviewed changes

vibrantvarun reviewed Feb 27, 2025

View reviewed changes

heemin32 reviewed Feb 27, 2025

View reviewed changes

will-hwang force-pushed the optimized_text_embedding_processor_single_doc_update branch from 90e77d1 to 9bf5cce Compare February 28, 2025 00:16

will-hwang changed the base branch from main to optimized-processor February 28, 2025 00:17

will-hwang force-pushed the optimized_text_embedding_processor_single_doc_update branch 3 times, most recently from b1816f1 to 5e7f19e Compare February 28, 2025 00:43

heemin32 reviewed Feb 28, 2025

View reviewed changes

will-hwang force-pushed the optimized_text_embedding_processor_single_doc_update branch 2 times, most recently from fc83219 to 0f99e96 Compare February 28, 2025 05:59

q-andy mentioned this pull request Feb 28, 2025

Add ITs to check embeddings are updated when using TextEmbeddingProcessors in document reindex/update scenarios #1205

Open

heemin32 reviewed Feb 28, 2025

View reviewed changes

will-hwang force-pushed the optimized_text_embedding_processor_single_doc_update branch 2 times, most recently from 1daf07d to 8902065 Compare March 4, 2025 21:11

heemin32 reviewed Mar 4, 2025

View reviewed changes

implement single document update scenario for text embedding processor

8bc759e

Signed-off-by: Will Hwang <[email protected]>

will-hwang force-pushed the optimized_text_embedding_processor_single_doc_update branch from 8902065 to 8bc759e Compare March 4, 2025 23:19

heemin32 approved these changes Mar 5, 2025

View reviewed changes

junqiu-lei approved these changes Mar 5, 2025

View reviewed changes

heemin32 merged commit 1a6e58e into opensearch-project:optimized-processor Mar 5, 2025
49 of 51 checks passed

will-hwang added a commit to will-hwang/neural-search that referenced this pull request Mar 20, 2025

implement single document update scenario for text embedding processor (

77043da

opensearch-project#1191) Signed-off-by: Will Hwang <[email protected]>

will-hwang mentioned this pull request Mar 20, 2025

Implement Optimized embedding generation in text embedding processor #1238

Merged

5 tasks

will-hwang added a commit to will-hwang/neural-search that referenced this pull request Mar 20, 2025

implement single document update scenario for text embedding processor (

d4b46c8

opensearch-project#1191) Signed-off-by: Will Hwang <[email protected]>

	public static final boolean DEFAULT_SKIP_EXISTING = Boolean.FALSE;
	public static final boolean DEFAULT_SKIP_EXISTING = false;

	for (Map.Entry<?, ?> entry : castedProcessMap.entrySet()) {
	for (Map.Entry<String, Object> entry : castedProcessMap.entrySet()) {

	public static Map<String, Object> castToMap(Object obj) {
	public static Map<String, Object> unsafeCastToObjectMap(Object obj) {

	public static List<Object> castToObjectList(Object obj) {
	public static List<Object> unsafeCastToObjectList(Object obj) {

	Map<String, Object> filteredInnerMap = filter(existingSourceAndMetadataMap, sourceAndMetadataMap, value, currentPath);
	Map<String, Object> filteredInnerMap = filter(existingSourceAndMetadataMap, sourceAndMetadataMap, unsafeCastToObjectMap(value), currentPath);

	public static Optional<Object> getValueFromSource(final Map<String, Object> sourceAsMap, final String targetField, int index) {
	public static List<Optional<Object>> getValueFromSource(final Map<String, Object> sourceAsMap, final String targetField, int index) {

	if (Objects.isNull(currentKey)) return;
	if (Objects.isNull(key)) return;

	String traversedPath = currentKey.contains(".") ? currentKey.substring(0, currentKey.lastIndexOf('.')) : "";
	String traversedPath = path;

Optimized text embedding processor for single document update #1191

Optimized text embedding processor for single document update #1191

Uh oh!

Conversation

will-hwang commented Feb 19, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Description

Proposed State [Single Document Update]

User Scenarios

Related Issues

Check List

Uh oh!

codecov bot commented Feb 19, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Codecov Report

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

junqiu-lei commented Feb 19, 2025

Uh oh!

q-andy left a comment

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

will-hwang commented Feb 20, 2025

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

vibrantvarun left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Uh oh!

will-hwang commented Feb 19, 2025 •

edited

Loading

codecov bot commented Feb 19, 2025 •

edited

Loading