community[minor]: Opensearch hybridsearch implementation #25375

karthikbharadhwajKB · 2024-08-14T07:32:48Z

community: add hybrid search in opensearch

Langchain OpenSearch Hybrid Search Implementation

Implementation of Hybrid Search:

This PR adds hybrid search to LangChain's OpenSearch integration. Building on the existing OpenSearchVectorSearch class, I have implemented hybrid search, which combines keyword and semantic search. Users can now use OpenSearch's hybrid search features without leaving LangChain. By blending traditional text matching with vector-based similarity, the enhanced class can return more accurate and contextually relevant results. It fits into existing LangChain workflows, so developers can upgrade their search capabilities without restructuring their code.

In implementing the hybrid search for OpenSearch within the LangChain framework, I also incorporated filtering capabilities. It's important to note that according to the OpenSearch hybrid search documentation, only post-filtering is supported for hybrid queries. This means that the filtering is applied after the hybrid search results are obtained, rather than during the initial search process.
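
As a rough illustration of what post-filtering means at the query level, the sketch below builds a hybrid query body with a top-level post_filter clause. The helper name build_hybrid_query, the field names text and vector_field, and the example filter are assumptions made for illustration; this is not the code from this PR.

```python
from typing import Any, Dict, List, Optional


def build_hybrid_query(
    query_text: str,
    query_vector: List[float],
    k: int = 4,
    post_filter: Optional[Dict[str, Any]] = None,
    text_field: str = "text",            # assumed field name
    vector_field: str = "vector_field",  # assumed field name
) -> Dict[str, Any]:
    """Build a hybrid query body: one keyword sub-query plus one k-NN sub-query."""
    body: Dict[str, Any] = {
        "size": k,
        "query": {
            "hybrid": {
                "queries": [
                    {"match": {text_field: {"query": query_text}}},
                    {"knn": {vector_field: {"vector": query_vector, "k": k}}},
                ]
            }
        },
    }
    if post_filter is not None:
        # post_filter sits beside "query" rather than inside it,
        # so it is applied to the hits after the search itself.
        body["post_filter"] = post_filter
    return body


# Hypothetical usage with a made-up metadata filter.
example = build_hybrid_query(
    "capital cities",
    [0.1, 0.2, 0.3],
    k=3,
    post_filter={"term": {"metadata.continent": "Europe"}},
)
```

The filter is kept outside the hybrid clause on purpose: that placement is what distinguishes post-filtering from filtering inside each sub-query.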

Note: For the implementation of hybrid search, I strictly followed the official OpenSearch Hybrid search documentation and I took inspiration from https://github.com/AndreasThinks/langchain/tree/feature/opensearch_hybrid_search
Thanks Mate!

Experiments

I conducted a few experiments to verify that the hybrid search implementation is accurate and capable of reproducing the results of both plain keyword search and vector search.

Experiment - 1
Hybrid Search
keyword_weight = 1.0, vector_weight = 0.0

I conducted an experiment to verify the accuracy of my hybrid search implementation by comparing it to a plain keyword search. For this test, I set the keyword_weight to 1 and the vector_weight to 0 in the hybrid search, effectively giving full weightage to the keyword component. The results from this hybrid search configuration matched those of a plain keyword search, confirming that my implementation can accurately reproduce keyword-only search results when needed. It's important to note that while the results were the same, the scores differed between the two methods. This difference is expected because the plain keyword search in OpenSearch uses the BM25 algorithm for scoring, whereas the hybrid search still performs both keyword and vector searches before normalizing the scores, even when the vector component is given zero weight. This experiment validates that my hybrid search solution correctly handles the keyword search component and properly applies the weighting system, demonstrating its accuracy and flexibility in emulating different search scenarios.

Experiment - 2
Hybrid Search
keyword_weight = 0.0, vector_weight = 1.0

For experiment-2, I took the inverse approach to further validate my hybrid search implementation. I set the keyword_weight to 0 and the vector_weight to 1, effectively giving full weightage to the vector search component (KNN search). I then compared these results with a pure vector search. The outcome was consistent with my expectations: the results from the hybrid search with these settings exactly matched those from a standalone vector search. This confirms that my implementation accurately reproduces vector search results when configured to do so. As with the first experiment, I observed that while the results were identical, the scores differed between the two methods. This difference in scoring is expected and can be attributed to the normalization process in hybrid search, which still considers both components even when one is given zero weight. This experiment further validates the accuracy and flexibility of my hybrid search solution, demonstrating its ability to effectively emulate pure vector search when needed while maintaining the underlying hybrid search structure.
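
The comparison used in Experiments 1 and 2 (same documents in the same order, different scores) can be checked mechanically. The sketch below is a hypothetical helper, not part of the PR; it assumes each result is a (doc_id, score) pair, and the sample values are made up for illustration.

```python
from typing import List, Sequence, Tuple

Result = Tuple[str, float]  # (document id, score); an assumed shape


def same_ranking(a: Sequence[Result], b: Sequence[Result]) -> bool:
    """True when both result lists contain the same ids in the same order."""
    return [doc_id for doc_id, _ in a] == [doc_id for doc_id, _ in b]


def scores_differ(a: Sequence[Result], b: Sequence[Result], tol: float = 1e-9) -> bool:
    """True when at least one positional pair of scores differs by more than tol."""
    return any(abs(sa - sb) > tol for (_, sa), (_, sb) in zip(a, b))


# Made-up values: raw keyword scores vs. normalized hybrid scores.
keyword_only: List[Result] = [("doc-2", 7.4), ("doc-5", 3.1), ("doc-1", 0.9)]
hybrid_keyword_1: List[Result] = [("doc-2", 1.0), ("doc-5", 0.34), ("doc-1", 0.001)]

print(same_ranking(keyword_only, hybrid_keyword_1))   # True
print(scores_differ(keyword_only, hybrid_keyword_1))  # True
```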

Experiment - 3
Hybrid Search - balanced

keyword_weight = 0.5, vector_weight = 0.5

For experiment-3, I adopted a balanced approach to further evaluate the effectiveness of my hybrid search implementation. In this test, I set both the keyword_weight and vector_weight to 0.5, giving equal importance to keyword-based and vector-based search components. This configuration aims to leverage the strengths of both search methods simultaneously. By setting both weights to 0.5, I intended to create a scenario where the hybrid search would consider lexical matches and semantic similarity equally. This balanced approach is often ideal for many real-world applications, as it can capture both exact keyword matches and contextually relevant results that might not contain the exact search terms.

Kindly verify the notebook for the experiments conducted!

Notebook: https://github.com/karthikbharadhwajKB/Langchain_OpenSearch_Hybrid_search/blob/main/Opensearch_Hybridsearch.ipynb

Instructions to follow for Performing Hybrid Search:

Step-1: Instantiating OpenSearchVectorSearch Class:

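import os

from langchain_community.vectorstores import OpenSearchVectorSearch

# embedding_model is assumed to be defined elsewhere (any LangChain embeddings object).
opensearch_vectorstore = OpenSearchVectorSearch(
    index_name=os.getenv("INDEX_NAME"),
    embedding_function=embedding_model,
    opensearch_url=os.getenv("OPENSEARCH_URL"),
    http_auth=(os.getenv("OPENSEARCH_USERNAME"), os.getenv("OPENSEARCH_PASSWORD")),
    use_ssl=False,  # development setting; see the parameter notes below
    verify_certs=False,
    ssl_assert_hostname=False,
    ssl_show_warn=False,
)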

Parameters:

index_name: The name of the OpenSearch index to use.
embedding_function: The function or model used to generate embeddings for the documents. It's assumed that embedding_model is defined elsewhere in the code.
opensearch_url: The URL of the OpenSearch instance.
http_auth: A tuple containing the username and password for authentication.
use_ssl: Set to False, indicating that the connection to OpenSearch is not using SSL/TLS encryption.
verify_certs: Set to False, which means the SSL certificates are not being verified. This is often used in development environments but is not recommended for production.
ssl_assert_hostname: Set to False, disabling hostname verification in SSL certificates.
ssl_show_warn: Set to False, suppressing SSL-related warnings.
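
Because Step 1 reads its connection settings with os.getenv, a missing variable silently becomes None. A minimal sketch of a fail-fast loader follows; load_opensearch_settings is a hypothetical helper that is not part of the PR, and the variable names are the ones used in Step 1.

```python
import os
from typing import Dict, Mapping

REQUIRED_VARS = (
    "INDEX_NAME",
    "OPENSEARCH_URL",
    "OPENSEARCH_USERNAME",
    "OPENSEARCH_PASSWORD",
)


def load_opensearch_settings(env: Mapping[str, str] = os.environ) -> Dict[str, str]:
    """Return the required settings, or raise if any is unset or empty."""
    missing = [name for name in REQUIRED_VARS if not env.get(name)]
    if missing:
        raise RuntimeError("Missing environment variables: " + ", ".join(missing))
    return {name: env[name] for name in REQUIRED_VARS}
```

Passing the mapping as an argument keeps the helper easy to exercise with a plain dictionary instead of the real environment.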

Step-2: Configure Search Pipeline:

To use the hybrid search functionality, you first need to configure a search pipeline.

Implementation Details:

This method configures a search pipeline in OpenSearch that:

Normalizes the scores from both keyword and vector searches using the min-max technique.
Applies the specified weights to the normalized scores.
Calculates the final score using an arithmetic mean of the weighted, normalized scores.
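
To make the three steps concrete, here is a small worked sketch in plain Python. It is not the OpenSearch implementation: it uses textbook min-max scaling and a weighted arithmetic mean (weighted sum divided by the sum of the weights), and the real engine may treat edge cases such as the lowest-scoring hit or documents missing from one sub-query differently. All document ids and scores are made up.

```python
from typing import Dict, List


def min_max(scores: Dict[str, float]) -> Dict[str, float]:
    """Scale scores to [0, 1]; if all scores are equal, map each to 1.0."""
    if not scores:
        return {}
    lo, hi = min(scores.values()), max(scores.values())
    if hi == lo:
        return {doc: 1.0 for doc in scores}
    return {doc: (s - lo) / (hi - lo) for doc, s in scores.items()}


def combine(
    keyword: Dict[str, float],
    vector: Dict[str, float],
    keyword_weight: float,
    vector_weight: float,
) -> Dict[str, float]:
    """Weighted arithmetic mean of the two normalized score sets."""
    total = keyword_weight + vector_weight
    if total <= 0:
        raise ValueError("at least one weight must be positive")
    kw_norm, vec_norm = min_max(keyword), min_max(vector)
    return {
        # A document missing from one result set contributes 0.0 for that component.
        doc: (keyword_weight * kw_norm.get(doc, 0.0)
              + vector_weight * vec_norm.get(doc, 0.0)) / total
        for doc in set(kw_norm) | set(vec_norm)
    }


def ranking(scores: Dict[str, float]) -> List[str]:
    """Document ids from highest to lowest score (ties broken by id)."""
    return sorted(scores, key=lambda doc: (-scores[doc], doc))


keyword_scores = {"a": 8.0, "b": 4.0, "c": 2.0}
vector_scores = {"a": 0.60, "b": 0.90, "c": 0.30}

print(ranking(combine(keyword_scores, vector_scores, 0.3, 0.7)))  # ['b', 'a', 'c']
print(ranking(combine(keyword_scores, vector_scores, 1.0, 0.0)))  # ['a', 'b', 'c']
```

With the weights set to 1.0 and 0.0 the combined ranking matches the keyword-only ranking, while the combined scores are normalized values rather than the raw keyword scores.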

Parameters:

pipeline_name (str): A unique identifier for the search pipeline. It's recommended to use a descriptive name that indicates the weights used for keyword and vector searches.
keyword_weight (float): The weight assigned to the keyword search component. This should be a float value between 0 and 1. In this example, 0.3 gives 30% importance to traditional text matching.
vector_weight (float): The weight assigned to the vector search component. This should be a float value between 0 and 1. In this example, 0.7 gives 70% importance to semantic similarity.

opensearch_vectorstore.configure_search_pipelines(
    pipeline_name="search_pipeline_keyword_0.3_vector_0.7",
    keyword_weight=0.3,
    vector_weight=0.7,
)
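
For orientation, the search pipeline that such a call corresponds to can be sketched as a request body for OpenSearch's PUT /_search/pipeline/<pipeline_name> endpoint. The helper build_pipeline_body is hypothetical and the exact body the PR sends may differ. Two assumptions are made here: the weight order (keyword first, vector second) follows the order of the sub-queries in the hybrid query, and the two weights are expected to sum to 1.

```python
import math
from typing import Any, Dict


def build_pipeline_body(keyword_weight: float, vector_weight: float) -> Dict[str, Any]:
    """Build a normalization-processor pipeline body (min-max + arithmetic mean)."""
    for name, weight in (("keyword_weight", keyword_weight),
                         ("vector_weight", vector_weight)):
        if not 0.0 <= weight <= 1.0:
            raise ValueError(f"{name} must be between 0 and 1, got {weight}")
    # Assumption: the weights are expected to sum to 1.
    if not math.isclose(keyword_weight + vector_weight, 1.0):
        raise ValueError("keyword_weight and vector_weight should sum to 1")
    return {
        "description": "Post processor for hybrid search",
        "phase_results_processors": [
            {
                "normalization-processor": {
                    "normalization": {"technique": "min_max"},
                    "combination": {
                        "technique": "arithmetic_mean",
                        # Assumed order: [keyword sub-query, vector sub-query].
                        "parameters": {"weights": [keyword_weight, vector_weight]},
                    },
                }
            }
        ],
    }


body = build_pipeline_body(0.3, 0.7)
```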

Step-3: Performing Hybrid Search:

After creating the search pipeline, you can perform a hybrid search using similarity_search(), similarity_search_with_score(), or any other search method that LangChain supports. The search combines keyword-based and semantic similarity searches on your OpenSearch index, leveraging the strengths of both traditional information retrieval and vector embedding techniques.

Parameters:

query: The search query string.
k: The number of top results to return (in this case, 3).
search_type: Set to hybrid_search to use both keyword and vector search capabilities.
search_pipeline: The name of the previously created search pipeline.

query = "what are the country named in our database?"

top_k = 3

pipeline_name = "search_pipeline_keyword_0.3_vector_0.7"

matched_docs = opensearch_vectorstore.similarity_search_with_score(
    query=query,
    k=top_k,
    search_type="hybrid_search",
    search_pipeline=pipeline_name,
)

matched_docs
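
similarity_search_with_score returns a list of (Document, score) tuples, so the result can be unpacked as sketched below. The snippet uses a small stand-in class in place of LangChain's Document so that it runs without a cluster; the helper summarize, the sample documents, and the scores are all made up for illustration.

```python
from dataclasses import dataclass, field
from typing import Any, Dict, List, Sequence, Tuple


@dataclass
class FakeDocument:
    """Stand-in exposing the same two attributes as a LangChain Document."""
    page_content: str
    metadata: Dict[str, Any] = field(default_factory=dict)


def summarize(results: Sequence[Tuple[Any, float]], width: int = 40) -> List[str]:
    """Format each (document, score) pair as a ranked one-line summary."""
    lines = []
    for rank, (doc, score) in enumerate(results, start=1):
        snippet = doc.page_content[:width]
        lines.append(f"{rank}. score={score:.3f} {snippet}")
    return lines


# Made-up results in the shape returned by similarity_search_with_score.
sample_results = [
    (FakeDocument("India is a country in South Asia."), 0.812),
    (FakeDocument("France is a country in Western Europe."), 0.455),
]

for line in summarize(sample_results):
    print(line)
```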

twitter handle: @iamkarthik98

libs/community/langchain_community/vectorstores/opensearch_vector_search.py

eyurtsev · 2024-08-14T15:20:15Z

Hi @karthikbharadhwajKB, thank you for the contribution!

We need the following changes to get this code merged:

Add integration tests that show the code is working as expected
Use the opensearch client instead of issuing separate network requests via the requests library

karthikbharadhwajKB · 2024-08-15T10:34:18Z

Hey @eyurtsev,

[✅] Use the OpenSearch client instead of issuing separate network requests via the requests library - I have removed the requests library and now use the OpenSearch client API for configuring the search pipeline and performing hybrid search.
[⚒️] Add integration tests that show the code is working as expected - I'm working on it!

Here you can see the working hybrid search feature and some experiments, where I tried to reproduce the exact keyword search and approximate_search results by tweaking keyword_weight and vector_weight.
Notebook: https://github.com/karthikbharadhwajKB/Langchain_OpenSearch_Hybrid_search/blob/main/Opensearch_Hybridsearch.ipynb
I also tested the Hybrid_Search, Hybrid_Search with post filter, and default Approximate_Search functionalities.
Notebook: https://github.com/karthikbharadhwajKB/Langchain_OpenSearch_Hybrid_search/blob/main/Tests.ipynb

karthikbharadhwajKB · 2024-08-16T18:10:15Z

Hey buddy @eyurtsev

[✅] Add integration tests that show the code is working as expected - I have added integration tests for configuring_search_pipeline, hybrid_search, hybrid_search_with_score & hybrid_search_with_post_filter.

Please have a look. If everything seems fine, please merge it! 🚀

karthikbharadhwajKB · 2024-09-25T13:41:09Z

Hey @eyurtsev, I have made the changes that you requested. Please check and let me know.

libs/community/langchain_community/vectorstores/opensearch_vector_search.py

karthikbharadhwajKB · 2024-10-17T20:56:33Z

Hey @eyurtsev, I have made the changes that you requested.

Please merge if everything looks fine 🚀

Karthik-Kolluri added 5 commits August 14, 2024 08:47

added hybrid search functionality

af559dc

added post filtering feature to hybrid search

3928d5d

added create_search_pipeline method

95fb69d

minor change

04da99d

added exception handling to handle requests

883d3b2

dosubot bot added the size:L, langchain, Ɑ: vector store, and 🤖:improvement labels Aug 14, 2024

Merge branch 'master' into opensearch_hybridsearch_implementation

33ccb84

ccurme added the community label and removed the langchain label Aug 14, 2024

eyurtsev self-assigned this Aug 14, 2024

eyurtsev requested changes Aug 14, 2024

karthikbharadhwajKB and others added 6 commits August 15, 2024 10:05

modified requests to clientv

1c540ab

modified search_pipeline method and requests to client

a1bbc37

minor changes

60cc4cd

formatted whole file

a17fca4

Merge branch 'opensearch_hybridsearch_implementation' of https://gith…

6994047

…ub.com/karthikbharadhwajKB/langchain into opensearch_hybridsearch_implementation

Merge branch 'master' into opensearch_hybridsearch_implementation

2a09cd4

karthikbharadhwajKB added 6 commits August 16, 2024 12:11

added search_pipeline_exist method

35d4df0

added get_pipeline_info method

0648ed6

added integration test for configuring search pipeline

468b769

added integration test for get_search_pipeline_info functionality

f57b3ed

added integration test for hybrid search, hybrid search with post filter

21ba2b0

Merge branch 'opensearch_hybridsearch_implementation' of https://gith…

ebf5e2c

…ub.com/karthikbharadhwajKB/langchain into opensearch_hybridsearch_implementation

karthikbharadhwajKB added 5 commits September 22, 2024 12:31

removed unnecessary try/expect blocks

5caf8e7

refactor _default_hybrid_search & _hybrid_search_with_post_filter fun…

9ed9818

…ctions: Updated docstring

refactored _hybrid_search_with_post_filter

b4f8e23

added check for query_text and search_pipeline (must be not empty for…

9ae26b5

… performing Hybrid Search) and raising helpful msg to user

Merge branch 'opensearch_hybridsearch_implementation' of https://gith…

605501b

…ub.com/karthikbharadhwajKB/langchain into opensearch_hybridsearch_implementation

eyurtsev changed the title ~~Opensearch hybridsearch implementation~~ community[minor]: Opensearch hybridsearch implementation Oct 8, 2024

eyurtsev reviewed Oct 8, 2024

libs/community/langchain_community/vectorstores/opensearch_vector_search.py (resolved)

eyurtsev reviewed Oct 8, 2024

libs/community/langchain_community/vectorstores/opensearch_vector_search.py (resolved)

karthikbharadhwajKB and others added 3 commits October 17, 2024 22:41

added sanity check for pipeline_name

ca71456

removed unnecessary f-string

077508e

Merge branch 'master' into opensearch_hybridsearch_implementation

e49f810

eyurtsev approved these changes Oct 17, 2024

dosubot bot added the lgtm label Oct 17, 2024

eyurtsev enabled auto-merge (squash) October 17, 2024 22:06

eyurtsev disabled auto-merge October 17, 2024 22:06

karthikbharadhwajKB added 3 commits October 19, 2024 09:43

Merge branch 'master' into opensearch_hybridsearch_implementation

c4b840f

Merge branch 'master' into opensearch_hybridsearch_implementation

e6adfdd

Merge branch 'master' into opensearch_hybridsearch_implementation

16e6e3e

karthikbharadhwajKB requested a review from eyurtsev October 23, 2024 14:48

karthikbharadhwajKB and others added 7 commits October 28, 2024 16:25

Merge branch 'master' into opensearch_hybridsearch_implementation

de56140

Merge branch 'master' into opensearch_hybridsearch_implementation

f85dd7e

Merge branch 'master' into opensearch_hybridsearch_implementation

810ddf1

lint

829d5e2

Merge branch 'master' into opensearch_hybridsearch_implementation

9473df5

Merge branch 'master' into opensearch_hybridsearch_implementation

bbe1579

Merge branch 'master' into opensearch_hybridsearch_implementation

2c39ee7

eyurtsev approved these changes Dec 13, 2024

eyurtsev merged commit 498f024 into langchain-ai:master Dec 13, 2024
19 checks passed
