Haerae-Evaluation-Toolkit is an emerging open-source Python library designed to streamline and standardize the evaluation of Large Language Models (LLMs), with a particular focus on Korean.
The toolkit is described in the paper "HRET: A Self-Evolving LLM Evaluation Toolkit for Korean" (https://arxiv.org/abs/2503.22968).
- Multiple Evaluation Methods: Logit-Based, String-Match, Partial-Match, LLM-as-a-Judge, and more (a conceptual sketch follows this list).
- Reasoning Chain Analysis: Dedicated to analyzing extended Korean chain-of-thought reasoning.
- Extensive Korean Datasets: Includes HAE-RAE Bench, KMMLU, KUDGE, CLiCK, K2-Eval, HRM8K, and more.
- Scalable Inference-Time Techniques: Best-of-N, Majority Voting, Beam Search, and other advanced methods.
- Integration-Ready: Supports OpenAI-compatible endpoints, Hugging Face, and LiteLLM.
- Flexible and Pluggable Architecture: Easily extend with new datasets, evaluation metrics, and inference backends.
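To illustrate the difference between the simpler evaluation modes, here is a conceptual sketch of exact string matching versus partial (substring) matching. The helper functions and normalization rules below are illustrative assumptions, not the toolkit's internal implementation.

```python
# Conceptual sketch of string-match vs. partial-match scoring.
# These helpers are illustrative only; they are NOT HRET's internal implementation.

def normalize(text: str) -> str:
    """Lowercase and strip whitespace before comparison (assumed normalization)."""
    return text.strip().lower()

def string_match(prediction: str, reference: str) -> float:
    """1.0 if the normalized prediction exactly equals the reference, else 0.0."""
    return float(normalize(prediction) == normalize(reference))

def partial_match(prediction: str, reference: str) -> float:
    """1.0 if the normalized reference appears anywhere in the prediction."""
    return float(normalize(reference) in normalize(prediction))

print(string_match("Seoul", "seoul"))                   # 1.0
print(partial_match("The answer is Seoul.", "Seoul"))   # 1.0
print(string_match("The answer is Seoul.", "Seoul"))    # 0.0
```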
We are actively developing core features and interfaces. Current goals include:
- Unified API: Seamless loading and integration of diverse Korean benchmark datasets.
- Configurable Inference Scaling: Generate higher-quality outputs through techniques like best-of-N and beam search (see the sketch after this list).
- Pluggable Evaluation Methods: Enable chain-of-thought assessments, logit-based scoring, and standard evaluation metrics.
- Modular Architecture: Easily extendable for new backends, tasks, or custom evaluation logic.
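To make the inference-scaling goal concrete, the following is a minimal, library-agnostic sketch of best-of-N selection and majority voting over sampled candidates. The `generate_fn` and `score_fn` callables are placeholders for whatever backend and reward model you plug in; nothing here reflects HRET's internal interfaces.

```python
# Library-agnostic sketch of two inference-time scaling strategies.
# `generate_fn` and `score_fn` are placeholder callables, not HRET APIs.
from collections import Counter
from typing import Callable, List

def best_of_n(prompt: str,
              generate_fn: Callable[[str], str],
              score_fn: Callable[[str, str], float],
              n: int = 4) -> str:
    """Sample n candidates and keep the one the scorer (e.g., a reward model) prefers."""
    candidates = [generate_fn(prompt) for _ in range(n)]
    return max(candidates, key=lambda c: score_fn(prompt, c))

def majority_vote(prompt: str,
                  generate_fn: Callable[[str], str],
                  n: int = 5) -> str:
    """Sample n candidates and return the most frequent answer."""
    candidates: List[str] = [generate_fn(prompt) for _ in range(n)]
    answer, _count = Counter(candidates).most_common(1)[0]
    return answer
```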
The toolkit is organized around the following components:

- Dataset Abstraction: Load and preprocess your datasets (or subsets) with minimal configuration.
- Scalable Methods: Apply decoding strategies such as sampling, beam search, and best-of-N.
- Evaluation Library: Compare predictions to references, use judge models, or create custom scoring methods.
- Registry System: Add new components (datasets, models, scaling methods) via simple decorator-based registration (see the sketch after this list).
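The sketch below shows the general shape of decorator-based registration. `DATASET_REGISTRY`, `register_dataset`, and the toy dataset class are hypothetical names used for illustration, not the toolkit's actual registry API.

```python
# General pattern of a decorator-based registry (hypothetical names, not HRET's API).
DATASET_REGISTRY: dict[str, type] = {}

def register_dataset(name: str):
    """Register a dataset class under a string key so it can be looked up by name."""
    def decorator(cls):
        DATASET_REGISTRY[name] = cls
        return cls
    return decorator

@register_dataset("my_korean_benchmark")
class MyKoreanBenchmark:
    """A toy dataset wrapper; a real implementation would load and preprocess data."""
    def load(self, split: str = "test"):
        return [{"input": "질문 예시", "reference": "정답 예시"}]

# Components can then be instantiated by name, e.g. when parsing a config:
dataset_cls = DATASET_REGISTRY["my_korean_benchmark"]
print(dataset_cls().load("test"))
```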
To install the toolkit, first clone the repository:

```bash
git clone https://github.com/HAE-RAE/haerae-evaluation-toolkit.git
cd haerae-evaluation-toolkit
```

(Optional) Create and activate a virtual environment, either with venv:

```bash
python -m venv venv
source venv/bin/activate  # On Windows use `venv\Scripts\activate`
```

or with Conda:

```bash
conda create -n hret python=3.11 -y
conda activate hret
```

Then install the dependencies, either with pip:

```bash
pip install -r requirements.txt
```

or with uv (recommended for speed). First install uv if you haven't already (see the uv installation guide), then run:

```bash
uv pip install -r requirements.txt
```
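As an optional sanity check (assuming you run it from the repository root), you can verify that the package imports cleanly using the module shown in the usage example below:

```python
# Quick post-installation check: this import should succeed from the repository root.
from llm_eval.evaluator import Evaluator

print("HRET import OK")
```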
Below is a minimal example of how to use the Evaluator interface to load a dataset, apply a model and (optionally) a scaling method, and then evaluate the outputs. For more detailed instructions on getting up and running, see tutorial/kor(eng)/quick_start.md.
```python
from llm_eval.evaluator import Evaluator

# 1) Initialize an Evaluator with default parameters (optional).
evaluator = Evaluator()

# 2) Run the evaluation pipeline.
results = evaluator.run(
    model="huggingface",                          # or "litellm", "openai", etc.
    judge_model=None,                             # specify e.g. "huggingface_judge" if needed
    reward_model=None,                            # specify e.g. "huggingface_reward" if needed
    dataset="haerae_bench",                       # or "kmmlu", "qarv", ...
    subset=["csat_geo", "csat_law"],              # optional subset(s)
    split="test",                                 # "train" / "validation" / "test"
    dataset_params={"revision": "main"},          # example HF config
    model_params={"model_name_or_path": "gpt2"},  # example HF Transformers param
    judge_params={},                              # params for judge model (if judge_model is not None)
    reward_params={},                             # params for reward model (if reward_model is not None)
    scaling_method=None,                          # or "beam_search", "best_of_n"
    scaling_params={},                            # e.g., {"beam_size": 3, "num_iterations": 5}
    evaluator_params={},                          # e.g., custom evaluation settings
)
```
- Dataset is loaded from the registry (e.g., `haerae_bench` is just one of many).
- Model is likewise loaded via the registry (`huggingface`, `litellm`, etc.).
- `judge_model` and `reward_model` can be provided if you want LLM-as-a-Judge or reward-model logic. If both are `None`, the system uses a single model backend.
- ScalingMethod is optional if you want to do specialized decoding.
- EvaluationMethod (e.g., `string_match`, `log_likelihood`, `partial_match`, or `llm_judge`) measures performance.
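As a variation on the example above, the call below enables an LLM judge and beam-search scaling using only parameters already documented in the signature shown earlier. The contents of `judge_params` and the chosen subset are placeholder values to replace with your own configuration.

```python
from llm_eval.evaluator import Evaluator

evaluator = Evaluator()

# Variation: LLM-as-a-Judge plus beam-search scaling, reusing the run() parameters
# documented above. Values inside judge_params are placeholders, not recommendations.
results = evaluator.run(
    model="huggingface",
    judge_model="huggingface_judge",
    dataset="haerae_bench",
    subset=["csat_geo"],
    split="test",
    model_params={"model_name_or_path": "gpt2"},
    judge_params={"model_name_or_path": "gpt2"},      # placeholder judge configuration
    scaling_method="beam_search",
    scaling_params={"beam_size": 3, "num_iterations": 5},
)
```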
We also provide a simple command-line interface (CLI) via `evaluator.py`:
```bash
python llm_eval/evaluator.py \
  --model huggingface \
  --judge_model huggingface_judge \
  --reward_model huggingface_reward \
  --dataset haerae_bench \
  --subset csat_geo \
  --split test \
  --scaling_method beam_search \
  --evaluation_method string_match \
  --model_params '{"model_name_or_path": "gpt2"}' \
  --scaling_params '{"beam_size":3, "num_iterations":5}' \
  --output_file results.json
```
This command will:
- Load the `haerae_bench` dataset (subset `csat_geo`), test split.
- Create a MultiModel internally with:
  - Generate model: `huggingface` → `gpt2`
  - Judge model: `huggingface_judge` (if you pass the relevant `judge_params`)
  - Reward model: `huggingface_reward` (if you pass the relevant `reward_params`)
- Apply Beam Search (`beam_size=3`).
- Evaluate final outputs via `string_match`.
- Save the resulting JSON file to `results.json`.
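Once the run completes, the saved file can be inspected with standard tooling. Since the exact schema of the results JSON is not documented here, the snippet below only loads it and prints its top-level structure.

```python
import json

# Inspect the CLI output; the exact schema depends on the evaluation settings,
# so we only print the top-level structure here.
with open("results.json", encoding="utf-8") as f:
    results = json.load(f)

if isinstance(results, dict):
    print("Top-level keys:", list(results.keys()))
else:
    print(f"Loaded {len(results)} records")
```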
We welcome collaborators, contributors, and testers interested in advancing LLM evaluation methods, especially for Korean language tasks.
- Development Lead: [email protected]
- Research Lead: [email protected]
We look forward to hearing your ideas and contributions!
If you find HRET useful in your research, please consider citing our paper:
```bibtex
@article{lee2025hret,
  title={HRET: A Self-Evolving LLM Evaluation Toolkit for Korean},
  author={Lee, Hanwool and Kim, Soo Yong and Choi, Dasol and Baek, SangWon and Hong, Seunghyeok and Jeong, Ilgyun and Hwang, Inseon and Lee, Naeun and Son, Guijin},
  journal={arXiv preprint arXiv:2503.22968},
  year={2025}
}
```
Licensed under the Apache License 2.0.
© 2025 The HAE-RAE Team. All rights reserved.