The "LLMFactCheck" is a powerful tool designed to validate triples in different sources, ensuring the accuracy of references and enhancing the quality of your research.
Before you begin, make sure you have the following installed: - Miniconda
To ensure a seamless experience, it's crucial to have GCC installed on your MacOS. See here.
To run the LLMFactCheck tool, follow these steps to set up the necessary Conda environment. Follow these steps: 🛠️
-
Make sure you have completed the Prerequisites.
-
Clone the repository:
git clone https://github.com/KoslickiLab/LLMFactCheck.git cd LLMFactCheck
-
Create a Conda environment (if you haven't already):
conda env create -f ./env/LLMFactCheck.yml
-
If you already had a Conda environment for this project and want to update it with new dependencies, execute the conda env update command with the updated environment file. Navigate to the root of your LLMFactCheck project and run the following command:
conda env update -f env/LLMFactCheck.yml
-
Activate the Conda environment:
conda activate myLLMFactCheck
-
Create a
config
folder in the root of the project, create the filekey.py
with the following contents: OPENAI_API_KEY = 'your-openai-api-key'
After completing the installation, you can run the LLMFactCheck tool using the following command:
For LLAMA2 model:
python main.py --model llama --triple_file semmed_triple_data.csv --sentence_file semmed_sentence_data.csv
For LLAMA model with ic:
python main.py --model llama --icl --triple_file semmed_triple_data.csv --sentence_file semmed_sentence_data.csv
For GPT-3.5-turbo model:
python main.py --model gpt_3_5_turbo --triple_file semmed_triple_data.csv --sentence_file semmed_sentence_data.csv
For GPT-3.5-turbo model with icl:
python main.py --model gpt_3_5_turbo --icl --triple_file semmed_triple_data.csv --sentence_file semmed_sentence_data.csv
For GPT-4_0 model:
python main.py --model gpt_4_0 --triple_file semmed_triple_data.csv --sentence_file semmed_sentence_data.csv
For GPT-4_0 model with icl:
python main.py --model gpt_4_0 --icl --triple_file semmed_triple_data.csv --sentence_file semmed_sentence_data.csv
This part of the project follows a well-organized structure for easy navigation and management. Here's a quick overview:
- main.py: Main file that invokes the core logic of all models.
- data: Your data for validation.
human_labeled_semmed.csv
- human-marked triples with sentences for correctnesssemmed_triple_data.csv
- all the columns you need for a triple from SemMedDBsemmed_sentence_data.csv
- sentences from SemMedDBfiltered_triple_data.csv
- only those triples from SemMedDB that match the predicates from the yaml filepredicate-remap.yml
- all predicates have been re-mapped (including mapping SemMedDB predicates to Biolink)test_df_3_5_turbo_icl
- test part human_data_semmed to determine the accuracy of the model GPT 3.5-turbo (with in-context learning)test_df_4_0_icl
- test part human_data_semmed to determine the accuracy of the model GPT 4.0 (with in-context learning)test_df_llama_icl
- test part human_data_semmed to determine the accuracy of the LLAMA2 model (with in-context learning)
- result: Results of Semmed predicate validation will be stored here.
- progress: - Progress of a models
llama_semmed_result.csv
: results of the LLAMA2 modelllama_icl_semmed_result.csv
: results of the LLAMA2 model (with in-context learning)gpt_3_5_turbo_semmed_result.csv
: results of the GPT-3.5-turbo modelgpt_3_5_turbo_icl_semmed_result.csv
: results of the GPT-3.5-turbo model (with in-context learning)gpt_4_0_semmed_result.csv
: results of GPT-4.0 modelgpt_4_0_icl_semmed_result.csv
: results of GPT-4.0 model
- src: Contains the main code for working with model and the Semmed database .
data_processing.py
: File for data processing.triple_processing.py
: File for triple processing.load_model.py
: This file contains functions related to loading and initializing the language model you're using in your project. It might include setting up the Llama model or OpenAI models with appropriate configurations and API keys.get_result.py
: In this file, you can find functions related to obtaining results from the language model. This includes sending prompts to the model, receiving responses, and processing the model's output to extract meaningful information or answers.result_writing.py
: File for writing results.progress.py
- File to track the progress of predicate validation.progress_path.py
: This file related to managing file paths for storing progress information.progressing.py
: The purpose of this file is to manage the progression of tasks and processes within the project.
-
util: The util directory contains the following subdirectories and files:
-
model_accuracy directory
This directory is used to calculate the model's accuracy. It uses human-annotated SemMedDB data to verify the model's ptriple accuracy.
To see the model's accuracy, follow these steps: Make sure you are at the root of the project and then:
For LLAMA2 model:
cd util cd model_accuracy python accuracy.py --model llama --test_df_file test_df --result_file semmed_result
For LLAMA2 model with ICL:
cd util cd model_accuracy python accuracy.py --model llama --icl --test_df_file test_df --result_file semmed_result
For GPT-3.5.-turbo model:
cd util cd model_accuracy python accuracy.py --model gpt_3_5_turbo --test_df_file test_df --result_file semmed_result
For GPT-3.5.-turbo model with ICL:
cd util cd model_accuracy python accuracy.py --model gpt_3_5_turbo --icl --test_df_file test_df --result_file semmed_result
For GPT-4.0 model:
cd util cd model_accuracy python accuracy.py --model gpt_4_0 --test_df_file test_df --result_file semmed_result
For GPT-4.0 model with ICL:
cd util cd model_accuracy python accuracy.py --model gpt_4_0 --icl --test_df_file test_df --result_file semmed_result
You will see a visual representation of the model's accuracy in the form of a pie chart
-
chembl directory
This directory contains files for testing the model's performance on the ChEMBL database. It includes the
chembl_triple.py
file, which generates triples. If You want to try it: Make sure you are at the root of the project and then:cd util cd chembl python chembl_triple.py
Then, this directory includes the
semmed_triple_mapping.py
file, which processes the data from the semmed_triple_data.csv file, creates a new 'TRIPLE' column from this data, and then filters this data to select only those triples that match the predicates specified in the predicate-remap.yaml file. As a result, only those triples from SemMedDB that match the predicates from the yaml file are displayed, and are also written to the filtered_triple_data.csv file in the project data folder. If You want to try it: Make sure you are at the root of the project and then:cd util cd chembl python semmed_triple_mapping.py
You don't have to do this (unless you think it's necessary), because we've already done this work and the result is already in the filtered_triple_data.py file
-
-
test: This directory contains tests for the project. Tests help verify if the code functions correctly and identify any errors or issues.
Running Tests: Make sure you are at the root of the project and then:
To run the tests, use the following command in the terminal:
pytest
This command will run all the tests located within the test directory.
These tests utilize Python's built-in unittest.mock library to mock the file operations and csv.writer methods. This helps isolate the functions from the actual file system and ensure that the tests are repeatable and reliable. By patching the built-in open and csv.writer methods with unittest.mock.patch, we are able to simulate different scenarios and test how our functions react to them. Each test utilizes fixtures and mocks to simulate real data and code behavior. This helps ensure that the tests are reliable and repeatable.
-
Run the Tool: Execute the main script, and watch as the tool works its magic.
-
View Results: The results of the triple validation process will be stored in the "result". You can review them to identify any issues with the references.
-
Celebrate: You've successfully checked triples with LLMFactCheck! 🎉
Feel free to explore the source code in the "src" folder for customization or to understand the inner workings of the tool.
-
Support for Multiple Sources: LLMFactCheck now supports validating triples in various sources, not limited to Semmed. You can easily extend its functionality for different datasets.
-
Console Application: We've introduced a new console application that allows you to validate triples in different sources using the command line. This provides more flexibility and ease of use, especially in environments where a database connection may not be available. Enjoy using LLMFactCheck for all your triple validation needs!
LLMFactCheck simplifies the process of checking triples in various sources. Whether you're a researcher or a data enthusiast, our tool will help you ensure the accuracy and quality of your data. Happy validating!
Please do not hesitate to contact us if you have any questions or see that the code needs to be improved.