chore(wren-ai-service): retrieval improvement #599

Merged
paopa merged 34 commits into main from chore/ai-service/retrieval-improvement
Aug 30, 2024

Conversation

@cyyeh cyyeh commented Aug 14, 2024

indexing pipeline:

  1. Documents are indexed into 3 collections: db_schema, table_descriptions, view_questions
  2. To work around the LLM token window limit during indexing, there is a new env var, COLUMN_INDEXING_BATCH_SIZE (default 50), which lets users decide how many columns are indexed in a single document
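
As a rough illustration of the column batching described above, here is a minimal sketch. The function name `batch_columns` and the document shape are hypothetical and are not taken from this PR; only the batch size default of 50 comes from the code under review.

```python
from typing import Any, Dict, Iterator, List


def batch_columns(
    table_name: str,
    columns: List[Dict[str, Any]],
    batch_size: int = 50,  # mirrors the COLUMN_INDEXING_BATCH_SIZE default
) -> Iterator[Dict[str, Any]]:
    """Yield one document payload per chunk of at most `batch_size` columns.

    Hypothetical helper: the real indexing pipeline may shape documents differently.
    """
    if batch_size < 1:
        raise ValueError("batch_size must be >= 1")
    for start in range(0, len(columns), batch_size):
        yield {
            "table": table_name,
            "columns": columns[start : start + batch_size],
        }


# A table with 120 columns is split into documents of 50, 50, and 20 columns.
columns = [{"name": f"col_{i}", "type": "VARCHAR"} for i in range(120)]
docs = list(batch_columns("orders", columns, batch_size=50))
batch_lengths = [len(d["columns"]) for d in docs]
```

A smaller batch size produces more, shorter documents, which is the knob for staying under a token limit.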

retrieval pipeline:

  1. Select the top 10 (TABLE_RETRIEVAL_SIZE) tables based on table names and table descriptions (table_descriptions collection)
  2. Select up to 1000 (TABLE_COLUMN_RETRIEVAL_SIZE) table and column documents for the tables from the previous step (db_schema collection)
  3. Use the LLM to choose which tables and columns are needed to answer the question
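
The three retrieval steps above can be sketched roughly as follows. This is only an illustration under stated assumptions: the word-overlap `score` function stands in for the real similarity search, `keyword_selector` stands in for the LLM call, and none of these names come from the PR.

```python
from typing import Callable, Dict, List, Tuple

Candidate = Tuple[str, str]  # (table name, column name)


def score(question: str, text: str) -> int:
    """Toy relevance score (shared lowercase words); a stand-in for vector similarity."""
    return len(set(question.lower().split()) & set(text.lower().split()))


def retrieve(
    question: str,
    table_descriptions: Dict[str, str],
    db_schema: Dict[str, List[str]],
    select_with_llm: Callable[[str, List[Candidate]], List[Candidate]],
    table_retrieval_size: int = 10,
    table_column_retrieval_size: int = 1000,
) -> List[Candidate]:
    # Step 1: rank tables by name + description and keep the top N.
    ranked = sorted(
        table_descriptions,
        key=lambda t: score(question, f"{t} {table_descriptions[t]}"),
        reverse=True,
    )
    tables = ranked[:table_retrieval_size]
    # Step 2: collect schema entries for those tables only, capped in size.
    candidates = [(t, c) for t in tables for c in db_schema.get(t, [])]
    candidates = candidates[:table_column_retrieval_size]
    # Step 3: let the selector (an LLM in the real pipeline) pick what is needed.
    return select_with_llm(question, candidates)


def keyword_selector(question: str, candidates: List[Candidate]) -> List[Candidate]:
    """Hypothetical stub replacing the LLM: keep columns named in the question."""
    words = set(question.lower().split())
    return [(t, c) for t, c in candidates if c in words]


table_descriptions = {
    "orders": "customer orders with totals",
    "customers": "customer profile data",
    "logs": "system audit logs",
}
db_schema = {
    "orders": ["id", "customer_id", "total"],
    "customers": ["id", "name"],
    "logs": ["id", "message"],
}
selected = retrieve(
    "total of orders per customer name",
    table_descriptions,
    db_schema,
    keyword_selector,
    table_retrieval_size=2,
)
```

With `table_retrieval_size=2`, the `logs` table is dropped in step 1, so its columns never reach the selector.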

We also expose two env vars for table and column selection: TABLE_RETRIEVAL_SIZE (default 10) and TABLE_COLUMN_RETRIEVAL_SIZE (default 1000)

@cyyeh cyyeh added module/ai-service ai-service related ci/ai-service ai-service related labels Aug 14, 2024
@cyyeh cyyeh requested a review from paopa August 14, 2024 01:05
@cyyeh cyyeh force-pushed the chore/ai-service/retrieval-improvement branch 8 times, most recently from 08297f9 to 1a96ec5 Compare August 20, 2024 10:27
@cyyeh cyyeh force-pushed the chore/ai-service/retrieval-improvement branch 2 times, most recently from 99417d4 to 71e5765 Compare August 26, 2024 06:32
@cyyeh cyyeh force-pushed the chore/ai-service/retrieval-improvement branch from 71e5765 to 1b628ba Compare August 26, 2024 07:19
@cyyeh cyyeh marked this pull request as ready for review August 27, 2024 01:36
@paopa paopa left a comment

LGTM

Comment on lines +41 to +56
should_force_deploy=bool(os.getenv("SHOULD_FORCE_DEPLOY", "")),
column_indexing_batch_size=(
int(os.getenv("COLUMN_INDEXING_BATCH_SIZE"))
if os.getenv("COLUMN_INDEXING_BATCH_SIZE")
else 50
),
table_retrieval_size=(
int(os.getenv("TABLE_RETRIEVAL_SIZE"))
if os.getenv("TABLE_RETRIEVAL_SIZE")
else 10
),
table_column_retrieval_size=(
int(os.getenv("TABLE_COLUMN_RETRIEVAL_SIZE"))
if os.getenv("TABLE_COLUMN_RETRIEVAL_SIZE")
else 1000
),

Just mentioning this for now; I don't think it needs to be done in this PR. We could consider using a config class to initialize all the env vars, which would make the code more concise.
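
One possible shape for such a config class, as a minimal sketch only. The class name `RetrievalSettings` and the helper `_int_env` are hypothetical; the env var names and defaults are the ones in the lines under review.

```python
import os
from dataclasses import dataclass
from typing import Mapping


def _int_env(env: Mapping[str, str], name: str, default: int) -> int:
    """Return the env var as an int, or the default when it is unset or empty."""
    raw = env.get(name)
    return int(raw) if raw else default


@dataclass(frozen=True)
class RetrievalSettings:
    """Hypothetical config class grouping the three settings from this PR."""

    column_indexing_batch_size: int = 50
    table_retrieval_size: int = 10
    table_column_retrieval_size: int = 1000

    @classmethod
    def from_env(cls, env: Mapping[str, str] = os.environ) -> "RetrievalSettings":
        return cls(
            column_indexing_batch_size=_int_env(env, "COLUMN_INDEXING_BATCH_SIZE", 50),
            table_retrieval_size=_int_env(env, "TABLE_RETRIEVAL_SIZE", 10),
            table_column_retrieval_size=_int_env(env, "TABLE_COLUMN_RETRIEVAL_SIZE", 1000),
        )


# Passing a plain dict instead of os.environ keeps the example deterministic.
settings = RetrievalSettings.from_env({"TABLE_RETRIEVAL_SIZE": "20"})
```

Accepting the environment as a mapping also makes the class easy to test without touching the real process environment.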

@paopa paopa merged commit 4e0acc9 into main Aug 30, 2024
8 checks passed
@paopa paopa deleted the chore/ai-service/retrieval-improvement branch August 30, 2024 06:35