This repository provides the code to evaluate various large language models on the FLD corpora under few-shot in-context learning (Section 5.2 of the paper).
See the entry-point repository for an overview of the whole FLD project.
The code has been tested on Python 3.11.5. Install the dependencies and the FLD-task package, then put the repository root on your PYTHONPATH:
$ pip install -r ./requirements/requirements.txt
$ git clone https://github.com/hitachi-nlp/FLD-task.git && pip install -e ./FLD-task
$ export PYTHONPATH=`pwd -P`:$PYTHONPATH
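To verify the setup, a minimal import check (assuming the FLD-task package exposes a module named FLD_task, per that repository's layout):
$ python -c "import FLD_task"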
-
Make a dataset that pairs in-context examples with test examples:
$ python ./scripts/make_dataset.py \
    --output-dir ./outputs/dataset \
    --dataset-name hitachi-nlp/FLD.v2 \
    --dataset-config-name default \
    --n-shot 10 \
    --seed 0
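To sanity-check the output, you can peek at the first record of the generated file (a minimal sketch; the file name ICL_dataset.jsonl matches the prediction commands below, while the record's fields depend on the script):

import json

# Show which fields each in-context learning example carries.
with open('./outputs/dataset/ICL_dataset.jsonl') as f:
    example = json.loads(f.readline())
print(sorted(example.keys()))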
-
Make predictions with LLMs.
For OpenAI models, specify the model name as "openai.xx":
$ python ./scripts/predict.py \
    ./outputs/dataset/ICL_dataset.jsonl \
    ./outputs/predictions/ \
    --model-name openai.gpt-3.5-turbo-16k \
    --max-examples 5
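Note that the OpenAI route requires an API key; assuming the script uses the standard openai client, it picks the key up from the environment:
$ export OPENAI_API_KEY=<your-key>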
For other models on the Hugging Face Hub, specify the model name as "hf.xx":
$ python ./scripts/predict.py \
    ./outputs/dataset/ICL_dataset.jsonl \
    ./outputs/predictions/ \
    --model-name hf.Yukang/Llama-2-13b-longlora-32k-ft \
    --tokenizer-name hf.meta-llama/Llama-2-7b-hf \
    --max-examples 5 \
    --tensor-parallel-size 1
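The --tensor-parallel-size option shards the model across GPUs; the option name suggests a vLLM backend (an assumption), in which case it should be set to the number of available devices, e.g. --tensor-parallel-size 4 on a 4-GPU node.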
-
Compute the metrics:
$ python ./scripts/evaluate_proofs.py \
    ./outputs/predictions/predictions.jsonl \
    ./outputs/metrics
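The metrics are written to ./outputs/metrics/metrics.jsonl, the file consumed by the analysis step below; to print them in readable form, a minimal sketch:

import json

# Pretty-print every metrics record produced by evaluate_proofs.py.
with open('./outputs/metrics/metrics.jsonl') as f:
    for line in f:
        print(json.dumps(json.loads(line), indent=2))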
-
Analyze predictions:
$ python ./scripts/analyze_results.py \
    ./outputs/metrics/metrics.jsonl \
    ./outputs/analysis