If you use this code in your research, please cite our paper:
```bibtex
@inproceedings{zhang2024dossier,
  author       = {Haoran Zhang and Supriya Nagesh and Milind Shyani and Nina Mishra},
  title        = {DOSSIER: Fact checking in electronic health records while preserving patient privacy},
  year         = {2024},
  booktitle    = {Machine Learning for Healthcare Conference},
  organization = {PMLR}
}
```
We release a dataset of 4,250 natural language claims regarding the patient-specific EHR records of 100 admissions from the MIMIC-III dataset, which require information from one or more of the `admissions`, `chartevents`, `labevents`, and `inputevents` tables, and 35% of which require external medical knowledge to verify.
As this dataset is derived from MIMIC-III, we are unable to release it publicly. We are currently working on releasing it through PhysioNet under the PhysioNet Credentialed Health Data License. In the meantime, if you require access to the data, please email haoranz [at] mit.edu with a copy of your CITI certificate and we will reply with the `.csv` file.
Run the following commands to create the Conda environment and install required packages:
```bash
conda env create -f environment.yml
conda activate fact_check
pip install -U git+https://github.com/huggingface/transformers@9ed538f2e67ee10323d96c97284cf83d44f0c507
pip install -U medcat==1.9.0 --no-deps
pip install -U huggingface_hub
```
Obtain access to PhysioNet, and download MIMIC-III (v1.4). If the files you downloaded are in `.csv.gz` format, extract them to `.csv`.
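A minimal sketch of the decompression step, assuming the compressed tables sit in a single directory (the `mimic-iii` path below is just a placeholder for wherever you downloaded MIMIC-III):

```python
import gzip
import shutil
from pathlib import Path

# Placeholder path: point this at your MIMIC-III download directory.
mimic_dir = Path("mimic-iii")

# Decompress every .csv.gz table, keeping the original archives.
for gz_path in mimic_dir.glob("*.csv.gz"):
    csv_path = gz_path.with_suffix("")  # drops the trailing .gz
    with gzip.open(gz_path, "rb") as src, open(csv_path, "wb") as dst:
        shutil.copyfileobj(src, dst)
    print(f"Extracted {gz_path.name} -> {csv_path.name}")
```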
Obtain access to the UMLS, and download and extract the UMLS Metathesaurus Full Subset (we use v2023AA).
Download the `PREDICATION`, `PREDICATION_AUX`, and `GENERIC_CONCEPTS` CSV files from SemMedDB, and place them in a folder named `semmeddb`. Note that we use the 2023 version.
Download the UMLS Small and SNOMED International MedCAT models. You will need a UMLS account.
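Optionally, you can sanity-check that a downloaded model pack loads under the pinned `medcat` version; the path below is a placeholder for wherever you saved the model pack:

```python
from medcat.cat import CAT

# Placeholder path: point this at the downloaded MedCAT model pack (.zip).
cat = CAT.load_model_pack("medcat_mimic_snomed/model_pack.zip")

# Annotate a short string to confirm the model pack works end to end.
entities = cat.get_entities("Patient was started on metformin for type 2 diabetes.")
print(entities)
```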
Place all of the previous folders into a single `DATA_DIR`, and update the paths found in `fact_check/constants.py`. For example, your `DATA_DIR` should contain the folders `{DATA_DIR}/2023AA`, `{DATA_DIR}/semmeddb`, and `{DATA_DIR}/medcat_mimic_snomed`.
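As a rough illustration only (the variable names below are hypothetical; use whatever names `fact_check/constants.py` actually defines), the paths tie together along these lines:

```python
# Hypothetical sketch of the path layout expected by fact_check/constants.py;
# the real variable names in the repo may differ.
DATA_DIR = "/path/to/DATA_DIR"

UMLS_DIR = f"{DATA_DIR}/2023AA"                         # extracted UMLS Metathesaurus
SEMMEDDB_DIR = f"{DATA_DIR}/semmeddb"                   # SemMedDB CSV files
MEDCAT_SNOMED_DIR = f"{DATA_DIR}/medcat_mimic_snomed"   # SNOMED MedCAT model pack
```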
Obtain an Anthropic API key and a UMLS API key. Add these lines to your `~/.bashrc`, or run them before running any script:
```bash
export ANTHROPIC_API_KEY="YOUR KEY"
export UMLS_API_KEY="YOUR KEY"
```
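A small, optional check (not part of the repo) to confirm both keys are visible to the scripts before launching a long run:

```python
import os

# Fail fast if either key is missing from the environment.
for key in ("ANTHROPIC_API_KEY", "UMLS_API_KEY"):
    if not os.environ.get(key):
        raise RuntimeError(f"{key} is not set; export it before running the scripts.")
print("Both API keys are set.")
```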
Run `notebooks/process_semmeddb.ipynb` to process SemMedDB (filtering edges by citations, and merging with the SNOMED hierarchy). This will output a `semmeddb_processed_10.csv` file in your `semmeddb` folder.
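To confirm the notebook finished, you can load the processed file with pandas; the column layout is whatever the notebook writes, so this only inspects the shape and first rows:

```python
from pathlib import Path
import pandas as pd

# Same DATA_DIR as configured in fact_check/constants.py (placeholder path).
DATA_DIR = Path("/path/to/DATA_DIR")

semmed = pd.read_csv(DATA_DIR / "semmeddb" / "semmeddb_processed_10.csv")
print(semmed.shape)
print(semmed.head())
```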
To evaluate the DOSSIER pipeline, run `eval_pipeline`, e.g.:
```bash
python -m fact_check.eval_pipeline \
    --prompt full \
    --llm claude \
    --output_dir /output/dir
```
This outputs `res.pkl` in the `output_dir`, which contains the predicted stance for each claim in the cohort.
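A short snippet for peeking at the output; the exact structure of `res.pkl` is defined by the pipeline, so this only loads and inspects the object:

```python
import pickle
from pathlib import Path

# Same directory as passed to --output_dir.
output_dir = Path("/output/dir")

with open(output_dir / "res.pkl", "rb") as f:
    res = pickle.load(f)

# Inspect the container before relying on any particular structure.
print(type(res))
if hasattr(res, "__len__"):
    print(f"{len(res)} entries")
```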
To evaluate the baseline LLMs which directly take in EHR records, run `eval_baseline_llm`, e.g.:
```bash
python -m fact_check.eval_baseline_llm \
    --llm asclepius \
    --rag_type knn \
    --rag_with_kg \
    --output_dir /output/dir
```
To run MedAlpaca, you will need to clone the MedAlpaca repo into the root directory of this repo (i.e., there should be a folder `DOSSIER/medAlpaca`).
If using `rag_type = knn`, you will first have to run `notebooks/cache_sbert_embs.ipynb`.
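For reference, the general shape of that caching step is sketched below; the actual notebook pins its own model name, passage source, and cache filename, so treat all of these as placeholder assumptions:

```python
import numpy as np
from sentence_transformers import SentenceTransformer

# Placeholder model: the notebook defines which SBERT checkpoint is used.
model = SentenceTransformer("all-MiniLM-L6-v2")

# Placeholder passages: in the notebook these would be the EHR text chunks
# used for kNN retrieval.
passages = ["Heart rate 72 bpm at 08:00.", "Potassium 4.1 mEq/L."]
embeddings = model.encode(passages, convert_to_numpy=True, show_progress_bar=True)

# Hypothetical cache filename inside the repo's cache folder.
np.save("fact_check/cache/sbert_embs.npy", embeddings)
```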
The first run of the codebase may appear unresponsive for up to 10 minutes while the program caches two additional files in the `fact_check/cache` folder. Subsequent runs should be much faster.
To analyze the output of `eval_pipeline` and `eval_baseline_llm`, use `notebooks/agg_results.ipynb`.