This repository contains the LLM evaluation code for the npj Digital Medicine paper "An In-Depth Evaluation of Federated Learning on Biomedical Natural Language Processing for Information Extraction". The datasets used in this paper were downloaded from [FedNLP Repo]. In particular, they are the NCBI-Disease and 2018 n2c2 datasets for named entity recognition (NER), and the GAD and 2018 n2c2 datasets for relation extraction (RE).
Table 1: Results of LLMs using the best score among 1/5/10/20-shot prompting on NER and RE tasks, compared with BlueBERT and GPT-2 trained with federated learning (FL).
| Model | NER: NCBI (Strict) | NER: NCBI (Lenient) | NER: 2018 n2c2 (Strict) | NER: 2018 n2c2 (Lenient) | RE: 2018 n2c2 (F1) | RE: GAD (F1) |
|---|---|---|---|---|---|---|
| Mistral 8x7B Instruct | 0.409 | 0.587 | 0.514 | 0.648 | 0.314 | 0.459 |
| GPT-3.5 | 0.575 | 0.719 | 0.565 | 0.705 | 0.290 | 0.485 |
| GPT-4 | 0.722 | 0.834 | 0.616 | 0.751 | 0.882 | 0.543 |
| PaLM 2 Bison | 0.640 | 0.756 | 0.544 | 0.653 | 0.407 | 0.468 |
| PaLM 2 Unicorn | 0.726 | 0.848 | 0.621 | 0.749 | 0.888 | 0.549 |
| Gemini 1.0 Pro | 0.654 | 0.779 | 0.566 | 0.694 | 0.411 | 0.541 |
| Llama 3 70B Instruct | 0.685 | 0.786 | 0.551 | 0.695 | 0.319 | 0.458 |
| Claude 3 Opus | 0.788 | 0.879 | 0.680 | 0.787 | 0.832 | 0.569 |
| BlueBERT (FL) | 0.824 | 0.899 | 0.954 | 0.986 | 0.950 | 0.714 |
| GPT-2 (FL) | 0.784 | 0.840 | 0.830 | 0.868 | 0.946 | 0.721 |
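The Strict and Lenient NER scores in Table 1 follow the usual span-matching convention: a prediction counts under strict matching only if its offsets and type match the gold annotation exactly, while lenient matching also credits overlapping spans of the same type. The snippet below is a minimal illustration of that distinction; the `Span` tuple and `ner_f1` helper are illustrative names, not the repository's actual evaluation code.

```python
from typing import List, NamedTuple

class Span(NamedTuple):
    start: int   # offset where the entity mention begins
    end: int     # offset where the entity mention ends (exclusive)
    label: str   # entity type, e.g. "Disease"

def ner_f1(gold: List[Span], pred: List[Span], strict: bool = True) -> float:
    """Micro F1 over entity spans.

    Strict: a prediction is correct only with exact offsets and the same label.
    Lenient: any overlap with a gold span of the same label counts.
    Simplified: assumes gold and predicted spans match at most one-to-one.
    """
    def hit(p: Span, g: Span) -> bool:
        if p.label != g.label:
            return False
        if strict:
            return (p.start, p.end) == (g.start, g.end)
        return p.start < g.end and g.start < p.end  # spans overlap

    tp = sum(any(hit(p, g) for g in gold) for p in pred)
    precision = tp / len(pred) if pred else 0.0
    recall = tp / len(gold) if gold else 0.0
    return 2 * precision * recall / (precision + recall) if precision + recall else 0.0

# One exact match and one partially overlapping prediction:
gold = [Span(0, 14, "Disease"), Span(30, 42, "Disease")]
pred = [Span(0, 14, "Disease"), Span(33, 42, "Disease")]
print(ner_f1(gold, pred, strict=True))   # 0.5 -- only the exact span counts
print(ner_f1(gold, pred, strict=False))  # 1.0 -- the overlap is also credited
```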
NOTE:
- The GPT checkpoints are `gpt-4-1106-preview` and `gpt-3.5-turbo-1106`.
- Mistral 8x7B Instruct was run in half precision (~85 GB), and Llama 3 70B Instruct was run with 4-bit quantization (~45 GB); see the loading sketch after this list.
- More details of the datasets can be found in `data`.
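For reference, the open-weight models can be loaded under the memory footprints noted above with Hugging Face `transformers` (plus `bitsandbytes` for 4-bit quantization). This is a minimal sketch under that assumption; the Hub model IDs below are the publicly hosted checkpoints and may differ from the exact loading code used in the experiments.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

# Mistral 8x7B Instruct in half precision (~85 GB of GPU memory).
mixtral_id = "mistralai/Mixtral-8x7B-Instruct-v0.1"
mixtral_tok = AutoTokenizer.from_pretrained(mixtral_id)
mixtral = AutoModelForCausalLM.from_pretrained(
    mixtral_id,
    torch_dtype=torch.float16,  # half precision
    device_map="auto",          # shard across available GPUs (requires accelerate)
)

# Llama 3 70B Instruct with 4-bit quantization (~45 GB of GPU memory).
llama_id = "meta-llama/Meta-Llama-3-70B-Instruct"
llama_tok = AutoTokenizer.from_pretrained(llama_id)
llama = AutoModelForCausalLM.from_pretrained(
    llama_id,
    quantization_config=BitsAndBytesConfig(
        load_in_4bit=True,
        bnb_4bit_compute_dtype=torch.float16,
    ),
    device_map="auto",
)
```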
Table 2: Properties of the LLMs evaluated in this paper.

| Model | RLHF-Tuned | Instruction-Tuned | Max Input Tokens |
|---|---|---|---|
| Mistral 8x7B | No | Yes | 32K |
| GPT-3.5 (Chat) | Yes | No | 16K |
| GPT-4 (Chat) | Yes | No | 128K |
| PaLM 2 Bison (Chat) | No | No | 8K |
| PaLM 2 Unicorn (Text) | No | No | 8K |
| Gemini Pro (Chat) | No | No | 32K |
| Claude 3 (Chat) | Yes | No | 200K |
| Llama 3 70B | Yes | Yes | 8K |
The models used in this paper are mostly chat models, plus one text-completion model (PaLM 2 Unicorn), and none of them were specifically tuned for NER or RE tasks. We applied in-context learning by providing labeled examples as part of the prompt. Even with 20-shot prompting, the input length stays within 8K tokens, which fits in every model's context window. A sketch of how such a few-shot prompt can be assembled is shown below.
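As an illustration of this in-context learning setup, the sketch below assembles a k-shot NER prompt from labeled demonstrations. The instruction wording, the `build_ner_prompt` helper, and the example sentences are placeholders for illustration, not necessarily the exact templates used in the notebooks.

```python
from typing import List, Tuple

def build_ner_prompt(examples: List[Tuple[str, List[str]]], query: str, k: int = 20) -> str:
    """Build a k-shot prompt asking the model to list disease mentions in a sentence.

    `examples` holds (sentence, gold entity list) pairs used as in-context demonstrations.
    """
    lines = [
        "Extract all disease mentions from the sentence. "
        "Return them as a comma-separated list, or 'None' if there are none.",
        "",
    ]
    for sentence, entities in examples[:k]:
        lines.append(f"Sentence: {sentence}")
        lines.append(f"Diseases: {', '.join(entities) if entities else 'None'}")
        lines.append("")
    lines.append(f"Sentence: {query}")
    lines.append("Diseases:")
    return "\n".join(lines)

# Hypothetical demonstrations in the style of NCBI-Disease.
shots = [
    ("Mutations in BRCA1 are linked to hereditary breast cancer.", ["hereditary breast cancer"]),
    ("The patient has no history of diabetes mellitus.", ["diabetes mellitus"]),
]
prompt = build_ner_prompt(shots, "He was diagnosed with adenomatous polyposis coli.", k=20)
print(prompt)  # send this string to the chat/completion API of the model under test
```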
The example notebooks are in the root folder.