Hello,
I was playing around with the QA dataset generation capabilities and I am getting this error:
You have to sample more than two passages for making two-hop questions.
I am using random_single_hop with n=50 to sample from the corpus. Am I doing something wrong? I don't see any two-hop sampling functions; the only sampling functions available seem to be single-hop.
Full code (adapted from your Evaluation data creation tutorial in the docs):
import pandas as pd
from llama_index.llms.openai import OpenAI
from openai import AsyncOpenAI

from autorag.data.qa.filter.dontknow import dontknow_filter_rule_based
from autorag.data.qa.filter.passage_dependency import passage_dependency_filter_openai
from autorag.data.qa.generation_gt.llama_index_gen_gt import (
    make_basic_gen_gt,
    make_concise_gen_gt,
)
from autorag.data.qa.schema import Raw, Corpus
from autorag.data.qa.query.llama_gen_query import (
    factoid_query_gen,
    concept_completion_query_gen,
    two_hop_incremental,
)
from autorag.data.qa.sample import random_single_hop

llm = OpenAI()

raw_df = pd.read_parquet("./all_txt_parsing_output/parsed_result.parquet")
raw_instance = Raw(raw_df)

corpus_df = pd.read_parquet("all_txt_chunking_output/0.parquet")
corpus_instance = Corpus(corpus_df, raw_instance)

initial_qa = (
    corpus_instance.sample(random_single_hop, n=50)
    .map(
        lambda df: df.reset_index(drop=True),
    )
    .make_retrieval_gt_contents()
    .batch_apply(
        two_hop_incremental,
        llm=llm,  # query generation
    )
    .batch_apply(
        make_basic_gen_gt,  # answer generation (basic)
        llm=llm,
    )
    .batch_apply(
        make_concise_gen_gt,  # answer generation (concise)
        llm=llm,
    )
    .filter(
        dontknow_filter_rule_based,  # filter don't know
        lang="en",
    )
)

initial_qa.to_parquet(
    "./qa_output/factoid_twohop_basic_concise_dontknow/all_docs_qa.parquet",
    "./all_docs_corpus.parquet",
)
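To make the question concrete, this is roughly the kind of sampler I expected to find next to random_single_hop. It is only a sketch: sample_passage_pairs is a hypothetical name that I made up, it is not an AutoRAG function, and it works on a plain list of passage IDs (assumed unique) rather than on the Corpus object.

```python
import random


def sample_passage_pairs(doc_ids, n, seed=None):
    """Hypothetical helper (not part of AutoRAG): draw n pairs of passage IDs.

    Assumes doc_ids contains unique IDs, so the two IDs in a pair always differ.
    """
    if len(doc_ids) < 2:
        raise ValueError("Need at least two passages to build a two-hop pair.")
    rng = random.Random(seed)
    # rng.sample draws without replacement within a single pair;
    # the same passage may still appear in more than one pair.
    return [rng.sample(doc_ids, 2) for _ in range(n)]


# Illustrative IDs only, not taken from my corpus.
example_ids = ["doc-1", "doc-2", "doc-3", "doc-4"]
pairs = sample_passage_pairs(example_ids, n=3, seed=0)
```

Each entry in pairs holds two different passage IDs, which is what I assume a two-hop question would need as its retrieval ground truth.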