Self-query with generic query constructor #3607

Merged: hwchase17 merged 26 commits into master from dev2049/generic_query_generator on Apr 27, 2023

Contributor

@dev2049 dev2049 commented Apr 26, 2023

Alternate implementation of #3452 that relies on a generic query constructor chain and language, with a vector store-specific translation layer on top. Still refactoring and updating examples, but the general structure is there and it seems to work as well as #3452 on the examples.
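
For readers skimming the thread, the split described above (a generic filter representation plus a per-store translation step) might look roughly like the sketch below. This is not the code from the PR: `Comparison` and its `comparator`/`attribute`/`value` fields appear in the diff, but the `Operation` fields and the `to_dollar_filter` helper are assumptions made up for illustration, and the `$`-prefixed dict is just one common metadata-filter shape.

```python
from dataclasses import dataclass
from typing import Any, Dict, List, Union


@dataclass
class Comparison:
    comparator: str  # e.g. "eq", "lt", "gte"
    attribute: str
    value: Any


@dataclass
class Operation:
    # Field names here are assumed for this sketch.
    operator: str  # e.g. "and", "or"
    arguments: List[Union["Operation", Comparison]]


def to_dollar_filter(node: Union[Operation, Comparison]) -> Dict[str, Any]:
    """Hypothetical translator: generic filter tree -> "$"-prefixed dict filter."""
    if isinstance(node, Comparison):
        return {node.attribute: {"$" + node.comparator: node.value}}
    # An Operation wraps the translated form of each of its arguments.
    return {"$" + node.operator: [to_dollar_filter(arg) for arg in node.arguments]}


generic_filter = Operation(
    operator="and",
    arguments=[
        Comparison(comparator="eq", attribute="genre", value="comedy"),
        Comparison(comparator="gte", attribute="year", value=1990),
    ],
)
store_filter = to_dollar_filter(generic_filter)
```

Supporting another vector store would then mean writing a second translator over the same tree, leaving the query constructor untouched.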

@dev2049 dev2049 changed the title WIP: self-query with generic query constructor Self-query with generic query constructor Apr 27, 2023
@dev2049 dev2049 marked this pull request as ready for review April 27, 2023 01:20
return Comparison(comparator=comp, attribute=attr.strip("\"'"), value=val)


def parse_filter(_filter: str) -> Union[Operation, Comparison]:
Collaborator

What do you think about refactoring this to use lark, pyparsing, or something similar? They should make it easier to extend the code and handle a large class of errors from the get-go (e.g., checking for rule ambiguity).

Contributor Author

just for sake of time, ok if we leave this as todo for refactor in sep pr?

Collaborator

yeah def

LTE = "lte"


class Comparison(BaseModel):
Collaborator

Two suggestions:

  1. Introduce a common type hierarchy for "operators" / "directives" whatever we choose to call them. So we don't have to do List[Union[Comparison, Operation]], but instead can do Sequence[Directive].
  2. Could we remove everything from the schema file except for the schema, so it's easy to see the type hierarchy?
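
A minimal sketch of what the first suggestion could look like, assuming a base class named `Directive` (the name floated above). The field names on `Operation` and the `count_comparisons` helper are invented for illustration and are not taken from the PR.

```python
from dataclasses import dataclass
from typing import Any, Sequence


class Directive:
    """Hypothetical common base for every node in a filter expression."""


@dataclass(frozen=True)
class Comparison(Directive):
    comparator: str
    attribute: str
    value: Any


@dataclass(frozen=True)
class Operation(Directive):
    operator: str
    arguments: Sequence[Directive]  # no Union[Comparison, Operation] needed


def count_comparisons(nodes: Sequence[Directive]) -> int:
    """Walk a sequence of directives and count the leaf comparisons."""
    total = 0
    for node in nodes:
        if isinstance(node, Comparison):
            total += 1
        elif isinstance(node, Operation):
            total += count_comparisons(node.arguments)
    return total


tree = Operation(
    "or",
    (
        Comparison("eq", "genre", "drama"),
        Operation(
            "and",
            (Comparison("lt", "year", 2000), Comparison("gt", "rating", 8)),
        ),
    ),
)
```

With a shared base type, signatures only mention `Sequence[Directive]`, and adding a new node kind does not require touching every `Union`.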

Contributor Author

does ir.py currently contain what you were imagining or did you want to separate out even more?

Collaborator

Probably makes sense to move out the visitor and keep only the ast typedefs

Contributor Author

@dev2049 dev2049 Apr 27, 2023

how would you deal with the circular import (since the AST nodes define a method that takes a visitor, and the visitor defines methods that take AST nodes)?

Contributor Author

could always type something as Any
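
An alternative to `Any`, sketched under assumptions: string (forward-reference) annotations let the visitor and the AST nodes name each other without either needing the other at import time. In a two-module layout, the import of the other module's types would sit under `if typing.TYPE_CHECKING:` so it only runs for the type checker. The class names `Visitor`, `Expr`, and `Printer` below are illustrative, not the PR's.

```python
from abc import ABC, abstractmethod
from dataclasses import dataclass
from typing import Any


class Visitor(ABC):
    # "Comparison" is a string annotation, so the class does not need to
    # exist yet when this method is defined.
    @abstractmethod
    def visit_comparison(self, comparison: "Comparison") -> Any:
        ...


class Expr(ABC):
    @abstractmethod
    def accept(self, visitor: Visitor) -> Any:
        ...


@dataclass
class Comparison(Expr):
    comparator: str
    attribute: str
    value: Any

    def accept(self, visitor: Visitor) -> Any:
        # Double dispatch: the node picks the visitor method for its own type.
        return visitor.visit_comparison(self)


class Printer(Visitor):
    def visit_comparison(self, comparison: "Comparison") -> str:
        return f"{comparison.attribute} {comparison.comparator} {comparison.value!r}"


rendered = Comparison("eq", "genre", "comedy").accept(Printer())
```

This keeps the precise types visible to a type checker, at the cost of quoted annotations.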

langchain/chains/query_constructor/base.py (outdated, resolved)
arbitrary_types_allowed = True


def format_attribute_info(info: List[AttributeInfo]) -> str:
Collaborator

Suggestion to favor Sequence over List on inputs: a Sequence parameter accepts both tuples and lists, and its interface is read-only, which signals that the function will not mutate the collection.

Suggested change
- def format_attribute_info(info: List[AttributeInfo]) -> str:
+ def format_attribute_info(info: Sequence[AttributeInfo]) -> str:
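
To illustrate the point, here is a small sketch in which the same function takes a list or a tuple. The `AttributeInfo` dataclass and the output format are stand-ins assumed for this example; the real class and formatting in the PR may differ.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass
class AttributeInfo:
    # Hypothetical stand-in; the fields are assumed for this sketch.
    name: str
    description: str
    type: str


def format_attribute_info(info: Sequence[AttributeInfo]) -> str:
    """Render one line per attribute; only reads from `info`."""
    return "\n".join(f"{i.name} ({i.type}): {i.description}" for i in info)


as_list = [AttributeInfo("genre", "The genre of the movie", "string")]
as_tuple = tuple(as_list)

# Both calls type-check against Sequence[AttributeInfo] and give the same text.
from_list = format_attribute_info(as_list)
from_tuple = format_attribute_info(as_tuple)
```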

langchain/chains/query_constructor/base.py (outdated, resolved)
def parse(self, text: str) -> StructuredQuery:
    try:
        expected_keys = ["query", "filter"]
        parsed = parse_json_markdown(text, expected_keys)
Collaborator

If the output is modified to be something like

<command>
request(..., ...)
</command>

Then, after extracting the content of command, the string can be fed into ast_parse without having to parse JSON and handle query and filter separately.
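
A rough sketch of that idea, with Python's standard `ast` module standing in for the `ast_parse` mentioned above (an assumption; the comment may refer to a project-specific parser). The `request(...)` and `eq(...)` call shapes and the helper names are hypothetical.

```python
import ast
import re
from typing import Any


def extract_command(text: str) -> str:
    """Return the text between <command> and </command>."""
    match = re.search(r"<command>(.*?)</command>", text, re.DOTALL)
    if match is None:
        raise ValueError("no <command> block found")
    return match.group(1).strip()


def to_tree(node: ast.expr) -> Any:
    """Convert a parsed call expression into nested (name, args) tuples."""
    if isinstance(node, ast.Call) and isinstance(node.func, ast.Name):
        return (node.func.id, [to_tree(arg) for arg in node.args])
    if isinstance(node, ast.Constant):
        return node.value
    # Anything else (attribute access, operators, ...) is rejected.
    raise ValueError(f"unsupported syntax: {ast.dump(node)}")


llm_output = """
<command>
request("capital city", eq("state", "Vermont"))
</command>
"""

expression = ast.parse(extract_command(llm_output), mode="eval").body
tree = to_tree(expression)
```

One caveat with reusing Python's own grammar: `and`, `or`, and `not` are keywords, so a filter written as `and(...)` would fail to parse and the operator names would need to differ.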

langchain/chains/query_constructor/prompt.py (resolved)
langchain/retrievers/self_query/pinecone.py (outdated, resolved)
langchain/retrievers/self_query/pinecone.py (outdated, resolved)
langchain/retrievers/self_query/pinecone.py (resolved)
@eyurtsev
Collaborator

HOT HOT HOT 🔥 🔥 🔥

@hwchase17 hwchase17 merged commit 3b60964 into master Apr 27, 2023
@hwchase17 hwchase17 deleted the dev2049/generic_query_generator branch April 27, 2023 15:36
@eyurtsev
Collaborator

🙃 🙃 🙃 🙃

vowelparrot pushed a commit that referenced this pull request Apr 28, 2023
Alternate implementation of #3452 that relies on a generic query
constructor chain and language, with a vector store-specific
translation layer on top. Still refactoring and updating examples, but the
general structure is there and it seems to work as well as #3452 on the examples

---------

Co-authored-by: Harrison Chase <[email protected]>
samching pushed a commit to samching/langchain that referenced this pull request May 1, 2023
Alternate implementation of langchain-ai#3452 that relies on a generic query
constructor chain and language, with a vector store-specific
translation layer on top. Still refactoring and updating examples, but the
general structure is there and it seems to work as well as langchain-ai#3452 on the examples

---------

Co-authored-by: Harrison Chase <[email protected]>
expected_keys = ["query", "filter"]
parsed = parse_json_markdown(text, expected_keys)
if len(parsed["query"]) == 0:
    parsed["query"] = " "
Contributor

Why do you fall back to " " when the returned string is empty?

@cancan101
Contributor

I am also noticing that in the case NO_FILTER is returned, the query output by the LLM is often a single phrase. When this phrase is then searched on the vector store, the matches are worse than if I had taken the original query and searched it directly on the vector store.

In other words, while the self query is supposed to be a superset of the standard querying (structured + unstructured), the unstructured querying seems inferior to using the vector store directly.

from langchain.retrievers.self_query.base import SelfQueryRetriever

docsearch = ...  # any vector store
retriever = docsearch.as_retriever()
sq_retriever = SelfQueryRetriever.from_llm(
    llm, docsearch, document_content_description, metadata_field_info, verbose=True
)

# This uses the entire phrase for the similarity match.
retriever.get_relevant_documents("What is the capital city of Vermont?")

# Whereas, assuming no structured operator applies here, the phrase sent to
# the vector store might just be "Vermont".
sq_retriever.get_relevant_documents("What is the capital city of Vermont?")
