Auto-mapping ICD-10 using machine learning model

Researchers

Assistant Professor Piyapong Khumrin, MD, Faculty of Medicine, Chiang Mai University, Chiang Mai, Thailand
Assistant Professor Krit Khwanngern, MD, Faculty of Medicine, Chiang Mai University, Chiang Mai, Thailand
Associate Professor Nipon Theera-Umpon, PhD, Biomedical Engineering Institute, Chiang Mai University
Terence Siganakis, CEO, Growing Data Pty Ltd
Alexander Dokumentov, Data Scientist, Growing Data Pty Ltd

Technical support

Atcharaporn Angsuratanawech
Sittipong Moraray
Pimpaka Chuamsakul
Pitupoom Chumpoo
Prawinee Mokmoongmuang

Duration

6 months (March - August 2019)

Introduction

Over one million patients visited the outpatient department of Maharaj Nakhon Chiang Mai hospital (reported in 2018). Every year, the hospital must report data to the government for billing.

Problem statement

The budget that can be claimed through billing depends on the quality and completeness of the submitted documents. One major problem is the completeness of the diagnosis, which is represented as an ICD-10 code. Completing the diagnosis is currently labour-intensive: physicians or technical coders must review medical records and manually enter the appropriate diagnosis code. We therefore see a potential benefit in applying machine learning to automate this ICD-10 labelling process.

Prior work

ICD-10 is a medical classification list, defined by the World Health Organization (WHO), that covers medical terms such as diseases, signs and symptoms, and abnormal findings. Here, ICD-10 is used to standardize the diagnoses in the billing report before the report is submitted to the government. Prior research has shown success in applying machine learning to auto-mapping ICD-10.

Serguei et al. first applied simple filters or criteria to predict the ICD-10 code, such as assuming that a high-frequency code sharing the same keywords is the correct code, or using gender to separate female-specific and male-specific diagnoses. Cases not resolved by those filters were passed to a machine learning model (Naive Bayes) or a bag-of-words technique. They used the SNoW implementation of the Naive Bayes classifier to handle the large number of classes, and a Phrase Chunker Component for a mixed approach to classification. The model evaluation showed that over 80% of entities were correctly classified (with precision, recall, and F-measure above 90%). The study had three limitations: features were treated independently, which might lead to wrong classifications; continuous values such as age interfered with how diseases were classified; and the reference standard (cases manually labelled by human coders) was small.

Koopman et al. developed a machine learning model to automatically classify ICD-10 codes for cancers from free-text death certificates. Natural language processing and SNOMED-CT were used to extract term-based and concept-based features. An SVM was trained and deployed at two levels: 1) cancer/no cancer (F-measure 0.94) and 2) if cancer, the type of cancer (F-measure 0.7).

Medori and Fairon mapped clinical text to standardized medical terminologies (UMLS) to build features for training a Naive Bayes model to predict ICD-6 (81% recall).

Boytcheva matched ICD-10 codes to diagnoses extracted from discharge letters using an SVM. The precision, recall, and F-measure of the model were 97.3%, 74.68%, and 84.5%, respectively.
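As a quick consistency check on these figures, assuming the reported F-measure is the balanced F1 score (the harmonic mean of precision P and recall R), the three numbers agree:

```latex
F_1 = \frac{2PR}{P + R} = \frac{2 \times 97.3 \times 74.68}{97.3 + 74.68} \approx 84.5\%
```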

In summary, prior research shows that machine learning models play a significant and beneficial role in auto-mapping ICD-10 to clinical data. The common preprocessing approach is to use NLP to digest raw text and map it to standardized medical terminologies to build input features. Developing such a preprocessing protocol is the first challenge of our research. The second is to design an approach for dealing with a large number of input features and target classes (ICD-10 codes).

Our objectives are to develop machine learning models that fill in missing ICD-10 codes or verify existing ones, in order to obtain more complete billing documents. We aim to use the model to complete ICD-10 codes for a one-year report and to evaluate how much more the hospital can claim.

Objectives

Use machine learning models to predict missing ICD-10 codes.
Use machine learning models to verify ICD-10 codes labelled by humans.
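The difference between the two objectives can be sketched as a small decision routine. This is an illustration only, not the project's implementation: the function `review_visit`, the `top_k` threshold, and the ranked list of predicted codes are all hypothetical, and the model itself is replaced by made-up output.

```python
def review_visit(human_code, ranked_predictions, top_k=3):
    """Return an action for one visit given the coder's entry and the model's ranked codes."""
    if not human_code:
        # Objective 1: no code was entered, so suggest the top prediction.
        return ("fill", ranked_predictions[0])
    if human_code in ranked_predictions[:top_k]:
        # Objective 2: the coder's entry agrees with the model.
        return ("accept", human_code)
    # Objective 2: disagreement, so send the visit back for manual review.
    return ("flag", human_code)

# Made-up model output for illustration only.
ranked = ["E11", "I10", "J45"]
print(review_visit(None, ranked))
print(review_visit("I10", ranked))
print(review_visit("K21", ranked))
```

A flagged visit is not necessarily wrong; in this sketch it only means the human entry falls outside the model's top candidates and may deserve a second look.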

Aims

The machine learning model achieves precision, recall, and F-measure greater than 80%.
Present a one-year cost-benefit analysis comparing before and after using machine learning models to fill in ICD-10 codes.

Time line

March 2019

Write and submit a research proposal and ethics application.
Set up a new server.
Copy the clinical data to the server.
Map and label column names and descriptions.
Join the table data to create a single dataset.

April 2019

Apply NLP and standard medical terminologies to preprocess input features.
Design and evaluate machine learning model.

May 2019

Close the project when either the model performance exceeds 80% or the last week of May is reached.

June - August 2019

Write and submit a paper.

Materials and methods

Target group

Clinical records of outpatient visits from 2006 - 2017 (2006 - 2016 for the training set and 2017 for the test set) are retrospectively retrieved from the Maharaj Nakhon Chiang Mai electronic health records. Approximately one million records are expected to be retrieved per year. Only encoded data (numbers, strings) are included in the experiment; images and scanned documents are excluded.
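The year-based split described above can be sketched as follows. This is a minimal illustration under stated assumptions: the record layout, the field name `visit_year`, and the helper `split_by_year` are hypothetical and not taken from the project.

```python
# Hypothetical visit records; the `txn` and `visit_year` values are made up.
records = [
    {"txn": 1, "visit_year": 2006},
    {"txn": 2, "visit_year": 2016},
    {"txn": 3, "visit_year": 2017},
    {"txn": 4, "visit_year": 2018},
]

def split_by_year(rows, train_years=range(2006, 2017), test_years=(2017,)):
    """Split visits into train/test sets by calendar year; other years are ignored."""
    train = [r for r in rows if r["visit_year"] in train_years]
    test = [r for r in rows if r["visit_year"] in test_years]
    return train, test

train_set, test_set = split_by_year(records)
print(len(train_set), len(test_set))
```

Splitting by year rather than at random keeps every test visit later than every training visit, which is closer to how a deployed model would be used.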

Data preprocessing

All identifying data, such as name, surname, address, national identification number, and hospital number, will be removed to protect patient privacy. Data of interest include:

Demographic data such as date of birth, gender
History taking and physical examination (including discharge summary)
Laboratory and investigation reports
Medical prescription (investigation, drug)
ICD-10 (coded by a technical coder)
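Removing identifying fields before analysis could look like the sketch below. The field names and the helper `deidentify` are assumptions made for illustration; they are not the project's actual column names or code.

```python
# Field names here are hypothetical placeholders for the identifying columns.
IDENTIFYING_FIELDS = {"name", "surname", "address", "national_id", "hospital_number"}

def deidentify(record):
    """Return a copy of the record without the directly identifying fields."""
    return {k: v for k, v in record.items() if k not in IDENTIFYING_FIELDS}

# Made-up record for illustration only.
raw = {"name": "A", "surname": "B", "hospital_number": "000", "gender": "F", "icd10": "J45"}
clean = deidentify(raw)
print(clean)
```

The function returns a new dictionary, so the raw record is left untouched and can be discarded separately.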

Data analysis

Data from 2006 - 2016 are used to train the machine learning models, and data from 2017 are used to evaluate them. We use overall accuracy, precision, recall, F-measure, and area under the ROC curve to evaluate and compare predictive performance between models.
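For a multi-class problem such as ICD-10 coding, precision, recall, and F-measure are typically computed per code and then averaged. The sketch below shows one way to do this with the standard library, assuming a single predicted code per visit and macro-averaging; the labels are made up, and the area under the ROC curve is omitted because it needs predicted scores rather than hard labels.

```python
from collections import Counter

def per_class_scores(y_true, y_pred):
    """Compute precision, recall and F-measure for each code seen in either list."""
    tp, fp, fn = Counter(), Counter(), Counter()
    for t, p in zip(y_true, y_pred):
        if t == p:
            tp[t] += 1
        else:
            fp[p] += 1  # the predicted code was wrong
            fn[t] += 1  # the true code was missed
    scores = {}
    for code in set(y_true) | set(y_pred):
        precision = tp[code] / (tp[code] + fp[code]) if tp[code] + fp[code] else 0.0
        recall = tp[code] / (tp[code] + fn[code]) if tp[code] + fn[code] else 0.0
        f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
        scores[code] = (precision, recall, f1)
    return scores

# Made-up labels for illustration only.
y_true = ["J45", "J45", "E11", "I10"]
y_pred = ["J45", "E11", "E11", "I10"]
scores = per_class_scores(y_true, y_pred)
accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
macro_f1 = sum(f for _, _, f in scores.values()) / len(scores)
print(accuracy, macro_f1)
```

Macro-averaging weights every code equally, so rare codes influence the result as much as common ones; a frequency-weighted average would be an alternative choice.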

Dataset

Data recorded between 2006 - 2019 in the electronic health records of Maharaj Nakhon Chiang Mai were de-identified and preprocessed. All data that could potentially be traced back to an individual patient, such as the patient's name, surname, address, national identification number, phone number, and hospital number, were removed. We used TXN (a unique number representing a patient visit) as the joining key. The dataset was divided into five groups.

Registration data
Admission data
Laboratory data
Radiological report data
Drug prescription data
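Because every group shares the TXN key, rows from different groups can be collected per visit. The sketch below illustrates the idea with made-up rows; only the TXN key comes from the description above, while the other field names and the helper `group_by_txn` are hypothetical.

```python
from collections import defaultdict

# Made-up rows; only the TXN key is taken from the text.
registration = [{"TXN": 101, "gender": "F"}, {"TXN": 102, "gender": "M"}]
laboratory = [{"TXN": 101, "lab_test": "glucose"}, {"TXN": 101, "lab_test": "HbA1c"}]
drugs = [{"TXN": 102, "drug": "salbutamol"}]

def group_by_txn(**tables):
    """Collect the rows of every table under the visit (TXN) they belong to."""
    visits = defaultdict(lambda: {name: [] for name in tables})
    for name, rows in tables.items():
        for row in rows:
            visits[row["TXN"]][name].append(row)
    return dict(visits)

visits = group_by_txn(registration=registration, laboratory=laboratory, drugs=drugs)
print(visits[101]["laboratory"])
```

Keeping a list per group, rather than flattening everything into one row, preserves visits that have several laboratory results or prescriptions.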

Registration data

The registration data is the demographic information of patients who visited Maharaj Nakhon Chiang Mai hospital, mostly outpatient department (OPD) cases. See the full registration metadata in REGISTRATION_METADATA.md.

Admission data

The admission data is the demographic information of patients who were admitted to any ward (inpatient department (IPD) cases) at Maharaj Nakhon Chiang Mai hospital. See the full admission metadata in ADMISSION_METADATA.md.

Laboratory data

See the full laboratory metadata in LAB_METADATA.md.

Radiological report data

The radiological report data consists of the notes that radiologists wrote after reviewing the imaging. The notes are plain text describing the findings in the imaging and the impression of suspected abnormalities and/or a provisional diagnosis. We do not include any image data in this experiment. The notes need to be preprocessed with natural language processing techniques for cleaning and feature engineering. This work is carried out in the radio branch of this project.
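A very simple form of the text cleaning and feature engineering mentioned above is a bag-of-words count. The sketch below is an assumption-laden illustration, not the method used in the radio branch: it assumes the reports are written in English, the report text is made up, and `bag_of_words` is a hypothetical helper.

```python
import re
from collections import Counter

def bag_of_words(report):
    """Lowercase a report, keep alphabetic tokens, and count them."""
    tokens = re.findall(r"[a-z]+", report.lower())
    return Counter(tokens)

# Made-up report text for illustration only.
report = "Impression: No active pulmonary disease. No pleural effusion."
features = bag_of_words(report)
print(features.most_common(3))
```

Plain word counts lose negation ("no pleural effusion" still counts "effusion"), which is one reason clinical text is often mapped to standardized terminologies instead.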

Drug prescription data

The drug prescription data describes the types of drugs prescribed to patients. See the full drug metadata in DRUG_METADATA.md.

How to use

Clone the project and switch to the dev branch

git clone https://github.com/u4507075/icd_10.git
cd icd_10
git checkout dev

Check out and update dev branch

git fetch
git checkout dev
git pull

Commit and push

git add .
git commit -m "your message"
git push

Check remote
git remote -v
