This project presents a curated dataset and an experimental analysis evaluating the factual accuracy of various Large Language Models (LLMs) enhanced with search functionality. Despite integration with search tools, our findings show that these models still exhibit factual inconsistencies.
To mitigate this, we propose a novel fact-verification pipeline that significantly improves factual accuracy. We offer this method as a baseline and pose it as an open research challenge to the community.
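The verification idea behind such a pipeline can be sketched roughly as follows. This is an illustrative toy, not the repository's actual implementation: the function names, the sentence-level claim splitting, and the substring-match verifier are all simplifying assumptions (see `pipeline/` for the real method).

```python
# Toy sketch of claim-level fact verification (illustrative names only,
# not the repo's API): split an answer into claims, check each claim
# against retrieved evidence, and report the supported fraction.

def extract_claims(answer: str) -> list[str]:
    """Naive claim splitter: treat each sentence as one claim."""
    return [s.strip() for s in answer.split(".") if s.strip()]

def verify_claim(claim: str, evidence: list[str]) -> bool:
    """Toy verifier: a claim counts as supported if any evidence
    snippet contains it verbatim (case-insensitive)."""
    return any(claim.lower() in snippet.lower() for snippet in evidence)

def factual_accuracy(answer: str, evidence: list[str]) -> float:
    """Fraction of extracted claims supported by the evidence."""
    claims = extract_claims(answer)
    if not claims:
        return 0.0
    supported = sum(verify_claim(c, evidence) for c in claims)
    return supported / len(claims)

evidence = ["The Eiffel Tower is in Paris", "It was completed in 1889"]
answer = "The Eiffel Tower is in Paris. It was completed in 1890"
print(factual_accuracy(answer, evidence))  # 0.5 (one of two claims supported)
```

A real pipeline would replace the substring check with retrieval plus an entailment or NLI model, but the claim-extract / verify / aggregate structure is the same.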
- `dataset/`: Curated queries and LLM responses.
- `evaluation/`: Scripts and metrics used for factuality evaluation.
- `pipeline/`: Our proposed method for improving factual accuracy.
- `results/`: Comparative analysis of different models and methods.
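As an illustration of working with the dataset, assuming records are stored as JSON Lines with `query` and `response` fields (the file layout and field names here are assumptions, not confirmed by the repo), loading could look like this:

```python
# Hypothetical loading sketch: one JSON object per line, with assumed
# "query" and "response" fields. Check dataset/ for the actual schema.
import io
import json

# Stand-in for open("dataset/queries.jsonl") with a sample record.
sample_file = io.StringIO(
    '{"query": "Who wrote Hamlet?", "response": "William Shakespeare"}\n'
)
records = [json.loads(line) for line in sample_file]
print(records[0]["query"])  # Who wrote Hamlet?
```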
- Dataset Creation: A benchmark dataset of fact-based queries and LLM-generated answers.
- Empirical Evaluation: Demonstrated that LLMs with search still produce hallucinated facts.
- Proposed Pipeline: A verification-based method that improves accuracy.
- Open Challenge: The pipeline is offered as a baseline for further research.
We welcome contributions from the community. You can help by:
- Extending the dataset with newer or domain-specific queries.
- Enhancing the verification pipeline.
- Proposing new evaluation metrics or integrating more LLMs.