RAG Challenge Winner Solution

This repository contains the winning solution for both prize nominations in the RAG Challenge competition. The system achieved state-of-the-art results in answering questions about company annual reports using a combination of:

Custom PDF parsing with Docling
Vector search with parent document retrieval
LLM reranking for improved context relevance
Structured output prompting with chain-of-thought reasoning
Query routing for multi-company comparisons
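
To make the "parent document retrieval" item above concrete, here is a minimal sketch of the idea: search over small chunks, then hand the LLM the larger unit each chunk came from. This is not the code from this repo. The bag-of-words embed function stands in for a real embedding model, treating a report page as the "parent" is an assumption, and every function name here is made up for illustration.

```python
import math
from collections import Counter


def embed(text):
    """Toy bag-of-words 'embedding' (stand-in for a real embedding model)."""
    return Counter(text.lower().split())


def cosine(a, b):
    """Cosine similarity between two Counter-based vectors."""
    dot = sum(a[t] * b[t] for t in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


def build_index(pages, chunk_words=8):
    """Split each page into small chunks and remember each chunk's parent page."""
    index = []
    for page_no, text in pages.items():
        words = text.split()
        for i in range(0, len(words), chunk_words):
            chunk = " ".join(words[i:i + chunk_words])
            index.append((embed(chunk), page_no))
    return index


def retrieve_parent_pages(query, pages, index, top_k=2):
    """Rank chunks by similarity, then return the de-duplicated parent pages."""
    q = embed(query)
    ranked = sorted(index, key=lambda item: cosine(q, item[0]), reverse=True)
    seen, results = set(), []
    for _, page_no in ranked:
        if page_no not in seen:
            seen.add(page_no)
            results.append((page_no, pages[page_no]))
        if len(results) == top_k:
            break
    return results


# Hypothetical page texts, invented for this example.
pages = {
    1: "The company reported total revenue of 5 billion dollars in fiscal 2022 driven by cloud services",
    2: "The board of directors approved a new dividend policy and a share buyback programme",
    3: "Risk factors include currency fluctuations and supply chain disruptions across several regions",
}
index = build_index(pages)
top = retrieve_parent_pages("total revenue in fiscal 2022", pages, index, top_k=1)
```

Small chunks match a query more precisely, while the whole page gives the model the surrounding context. An LLM reranking step would then re-score the returned pages before they go into the answer prompt.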

Disclaimer

This is competition code - it's scrappy but it works. Some notes before you dive in:

IBM Watson integration won't work (it was competition-specific)
The code might have rough edges and weird workarounds
No tests, minimal error handling - you've been warned
You'll need your own API keys for OpenAI/Gemini
GPU helps a lot with PDF parsing (I used an RTX 4090)

If you're looking for production-ready code, this isn't it. But if you want to explore different RAG techniques and their implementations - check it out!

Quick Start

Clone and setup:

git clone https://github.com/IlyaRice/RAG-Challenge-2.git
cd RAG-Challenge-2
python -m venv venv
venv\Scripts\Activate.ps1  # Windows (PowerShell)
# source venv/bin/activate  # macOS/Linux equivalent
pip install -e . -r requirements.txt

Rename env to .env and add your API keys.
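
If you want to check that your keys are actually picked up before running anything expensive, a small stdlib-only sketch like the one below can help. It is not part of this repo, and the repo may load .env differently. The helper names are made up, and the key name in the usage note is an assumption: use whatever names appear in the provided env file.

```python
import os


def parse_env_lines(lines):
    """Parse KEY=VALUE lines, skipping blanks, comments, and malformed lines."""
    env = {}
    for raw in lines:
        line = raw.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        # Drop optional surrounding quotes around the value.
        env[key.strip()] = value.strip().strip("\"'")
    return env


def load_env_file(path=".env"):
    """Load a .env file into os.environ without overriding existing variables."""
    with open(path, encoding="utf-8") as fh:
        env = parse_env_lines(fh)
    for key, value in env.items():
        os.environ.setdefault(key, value)
    return env


def missing_keys(required, environ=os.environ):
    """Return the names from `required` that are unset or empty."""
    return [name for name in required if not environ.get(name)]
```

For example, after calling load_env_file(), missing_keys(["OPENAI_API_KEY"]) returns an empty list when that variable is set.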

Test Dataset

The repository includes two datasets:

A small test set (in data/test_set/) with 5 annual reports and questions
The full ERC2 competition dataset (in data/erc2_set/) with all competition questions and reports

Each dataset directory contains its own README with specific setup instructions and available files. You can use either dataset to:

Study example questions, reports, and system outputs
Run the pipeline from scratch using provided PDFs
Use pre-processed data to skip directly to specific pipeline stages

See the respective README files for detailed dataset contents and setup instructions:

data/test_set/README.md - For the small test dataset
data/erc2_set/README.md - For the full competition dataset

Usage

You can run any part of the pipeline by uncommenting the method you want to run in src/pipeline.py and executing:

python .\src\pipeline.py

You can also run any pipeline stage using main.py, but you need to run it from the directory containing your data:

cd .\data\test_set\
python ..\..\main.py process-questions --config max_nst_o3m

CLI Commands

Get help on available commands:

python main.py --help

Available commands:

download-models - Download required docling models
parse-pdfs - Parse PDF reports with parallel processing options
serialize-tables - Process tables in parsed reports
process-reports - Run the full pipeline on parsed reports
process-questions - Process questions using specified config

Each command has its own options. For example:

python main.py parse-pdfs --help
# Shows options like --parallel/--sequential, --chunk-size, --max-workers

python main.py process-reports --config ser_tab
# Process reports with serialized tables config
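
If you would rather chain several commands from a script than type them one by one, something like the following sketch works. It only uses the command names and the --config option shown above. The stage order is assumed from the list order above, and the helper names and the dry_run flag are made up for illustration.

```python
import subprocess
import sys

# Commands listed in this README, in the order they appear (assumed pipeline order).
STAGES = [
    "download-models",
    "parse-pdfs",
    "serialize-tables",
    "process-reports",
    "process-questions",
]


def build_command(stage, main_py="main.py", config=None):
    """Build the argv list for one stage, adding --config only when given."""
    if stage not in STAGES:
        raise ValueError(f"unknown stage: {stage}")
    cmd = [sys.executable, main_py, stage]
    if config is not None:
        cmd += ["--config", config]
    return cmd


def run_stages(stages, main_py="main.py", cwd=".", configs=None, dry_run=True):
    """Run stages in order from the data directory `cwd`; stop on the first failure."""
    configs = configs or {}
    commands = [build_command(s, main_py, configs.get(s)) for s in stages]
    for cmd in commands:
        if dry_run:
            print(" ".join(cmd))  # only show what would be executed
        else:
            subprocess.run(cmd, cwd=cwd, check=True)
    return commands
```

With dry_run=True nothing is executed, so you can check the generated commands first. Remember that main.py expects to be run from the directory containing your data, hence the cwd argument.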

Some configs

max_nst_o3m - Best performing config using OpenAI's o3-mini model
ibm_llama70b - Alternative using IBM's Llama 70B model
gemini_thinking - Full-context answering that relies on Gemini's enormous context window. Strictly speaking, this is not RAG

Check pipeline.py for more configs and details on them.

License

MIT
