This repository contains the winning solution for both prize nominations in the RAG Challenge competition. The system achieved state-of-the-art results in answering questions about company annual reports using a combination of:
- Custom PDF parsing with Docling
- Vector search with parent document retrieval
- LLM reranking for improved context relevance
- Structured output prompting with chain-of-thought reasoning
- Query routing for multi-company comparisons
This is competition code - it's scrappy but it works. Some notes before you dive in:
- IBM Watson integration won't work (it was competition-specific)
- The code might have rough edges and weird workarounds
- No tests, minimal error handling - you've been warned
- You'll need your own API keys for OpenAI/Gemini
- GPU helps a lot with PDF parsing (I used 4090)
If you're looking for production-ready code, this isn't it. But if you want to explore different RAG techniques and their implementations - check it out!
Clone and setup:
git clone https://github.com/IlyaRice/RAG-Challenge-2.git
cd RAG-Challenge-2
python -m venv venv
venv\Scripts\Activate.ps1 # Windows (PowerShell)
pip install -e . -r requirements.txt
Rename env
to .env
and add your API keys.
The repository includes two datasets:
- A small test set (in
data/test_set/
) with 5 annual reports and questions - The full ERC2 competition dataset (in
data/erc2_set/
) with all competition questions and reports
Each dataset directory contains its own README with specific setup instructions and available files. You can use either dataset to:
- Study example questions, reports, and system outputs
- Run the pipeline from scratch using provided PDFs
- Use pre-processed data to skip directly to specific pipeline stages
See the respective README files for detailed dataset contents and setup instructions:
data/test_set/README.md
- For the small test datasetdata/erc2_set/README.md
- For the full competition dataset
You can run any part of pipeline by uncommenting the method you want to run in src/pipeline.py
and executing:
python .\src\pipeline.py
You can also run any pipeline stage using main.py
, but you need to run it from the directory containing your data:
cd .\data\test_set\
python ..\..\main.py process-questions --config max_nst_o3m
Get help on available commands:
python main.py --help
Available commands:
download-models
- Download required docling modelsparse-pdfs
- Parse PDF reports with parallel processing optionsserialize-tables
- Process tables in parsed reportsprocess-reports
- Run the full pipeline on parsed reportsprocess-questions
- Process questions using specified config
Each command has its own options. For example:
python main.py parse-pdfs --help
# Shows options like --parallel/--sequential, --chunk-size, --max-workers
python main.py process-reports --config ser_tab
# Process reports with serialized tables config
max_nst_o3m
- Best performing config using OpenAI's o3-mini modelibm_llama70b
- Alternative using IBM's Llama 70B modelgemini_thinking
- Full context answering with using enormous context window of Gemini. It is not RAG, actually
Check pipeline.py
for more configs and detils on them.
MIT