- Assistant Professor Piyapong Khumrin, MD, Faculty of Medicine, Chiang Mai University, Chiang Mai, Thailand
- Assistant Professor Krit Khwanngern, MD, Faculty of Medicine, Chiang Mai University, Chiang Mai, Thailand
- Associate Professor Nipon Theera-Umpon, PhD, Biomedical Engineering Institute, Chiang Mai University
- Terence Siganakis, CEO, Growing Data Pty Ltd
- Alexander Dokumentov, Data Scientist, Growing Data Pty Ltd
- Atcharaporn Angsuratanawech
- Sittipong Moraray
- Pimpaka Chuamsakul
- Pitupoom Chumpoo
- Prawinee Mokmoongmuang
6 months (March - August 2019)
Over one million patients visited the outpatient department of Maharaj Nakhon Chiang Mai hospital (reported in 2018). Every year, the hospital needs to report data to the government for billing.
The budget that can be claimed through billing depends on the quality and completeness of the documents. One major problem is the completeness of the diagnoses (represented as ICD-10 codes). The current process for completing the diagnoses is labour-intensive: physicians or technical coders must review medical records and manually enter the proper diagnosis code. We therefore see a potential benefit in applying machine learning to automate this ICD-10 labelling process.
ICD-10 is a medical classification list for medical terms such as diseases, signs and symptoms, and abnormal findings, defined by the World Health Organization (WHO). In this case, ICD-10 is used to standardize the diagnoses in the billing report before the report is submitted to the government. Prior research has shown success in applying machine learning to map ICD-10 codes automatically.
Serguei et al. applied simple filters or criteria to predict ICD-10 codes, such as assuming that a high-frequency code sharing the same keywords is the correct code, or using gender to separate female-specific and male-specific diagnoses. Cases not resolved by those filters were passed to a machine learning model (Naive Bayes) or a bag-of-words technique. They used the SNoW implementation of the naïve Bayes classifier to handle the large number of classes, and the Phrase Chunker Component for the mixed approach. The model evaluation showed that over 80% of entities were correctly classified (with precision, recall, and F-measure above 90%). The limitations of the study were that features were treated independently, which might lead to wrong classifications; that continuous values such as age interfered with how the diseases were classified; and that the reference standard (cases manually labelled by human coders) was small.
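The combination of a rule-based filter with a Naive Bayes fallback can be sketched as follows. This is a simplified illustration, not the method of the cited study: it uses a small hand-written multinomial Naive Bayes instead of SNoW, applies only a gender rule (assuming that codes in ICD-10 chapter O are female-only), and trains on toy data invented for the example. The names `NaiveBayes`, `predict`, and `FEMALE_ONLY_PREFIXES` are hypothetical.

```python
import math
from collections import Counter, defaultdict

# Assumption for illustration: codes in ICD-10 chapter O (pregnancy, childbirth)
# are treated as female-only. A real rule table would be much larger.
FEMALE_ONLY_PREFIXES = ("O",)


class NaiveBayes:
    """Multinomial Naive Bayes over whitespace tokens with add-one smoothing."""

    def __init__(self):
        self.class_counts = Counter()            # documents seen per code
        self.word_counts = defaultdict(Counter)  # token counts per code
        self.vocab = set()

    def fit(self, texts, codes):
        for text, code in zip(texts, codes):
            tokens = text.lower().split()
            self.class_counts[code] += 1
            self.word_counts[code].update(tokens)
            self.vocab.update(tokens)
        return self

    def log_scores(self, text):
        """Return the unnormalised log posterior of every known code."""
        total_docs = sum(self.class_counts.values())
        scores = {}
        for code, n_docs in self.class_counts.items():
            denom = sum(self.word_counts[code].values()) + len(self.vocab)
            score = math.log(n_docs / total_docs)
            for token in text.lower().split():
                score += math.log((self.word_counts[code][token] + 1) / denom)
            scores[code] = score
        return scores


def predict(model, text, gender):
    """Apply the gender rule to the candidate codes, then take the best-scoring one."""
    scores = model.log_scores(text)
    if gender == "M":
        allowed = {c: s for c, s in scores.items()
                   if not c.startswith(FEMALE_ONLY_PREFIXES)}
        scores = allowed or scores  # keep all codes if the rule removes every candidate
    return max(scores, key=scores.get)


# Toy training data, invented for this example.
texts = ["vomiting in pregnancy", "severe vomiting pregnancy",
         "nausea and vomiting", "nausea vomiting after meal"]
codes = ["O21", "O21", "R11", "R11"]
model = NaiveBayes().fit(texts, codes)
```

With this toy model, the same text is assigned a pregnancy-related code for a female patient, while for a male patient the rule removes that candidate and the next best code is returned.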
Koopman et al. developed a machine learning model to automatically classify ICD-10 codes of cancers from free-text death certificates. Natural language processing and SNOMED-CT were used to extract term-based and concept-based features. SVMs were trained and deployed at two levels: 1) cancer/no cancer (F-measure 0.94) and 2) if cancer, the type of cancer (F-measure 0.7).
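The two-level arrangement can be sketched as a simple cascade in which only cases accepted by the first stage reach the second. The keyword-based functions below are hypothetical stand-ins for the trained SVMs, and the term lists are invented for illustration.

```python
# Hypothetical stand-ins for the two trained classifiers.
CANCER_TERMS = {"carcinoma", "malignant", "cancer", "neoplasm"}
SITE_TO_CODE = {"lung": "C34", "breast": "C50"}


def is_cancer(text):
    """Level 1: binary decision, cancer or no cancer."""
    return any(token in CANCER_TERMS for token in text.lower().split())


def cancer_type(text):
    """Level 2: choose a cancer code; only called for level-1 positives."""
    for token in text.lower().split():
        if token in SITE_TO_CODE:
            return SITE_TO_CODE[token]
    return "C80"  # malignant neoplasm without specification of site


def classify_two_level(text, level1=is_cancer, level2=cancer_type):
    """Return a cancer code, or None when level 1 rejects the case."""
    if not level1(text):
        return None
    return level2(text)
```

Because the second stage only sees cases that passed the first, errors made at level 1 carry over to level 2, which is one reason the two stages are evaluated separately.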
Medori and Fairon mapped clinical text to standardized medical terminologies (UMLS) to build features for training a Naive Bayes model to predict ICD-6 (81% recall).
Boytcheva matched ICD-10 codes to diagnoses extracted from discharge letters using SVM. The precision, recall, and F-measure of the model were 97.3%, 74.68%, and 84.5%, respectively.
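As a quick consistency check, the F-measure is the harmonic mean of precision and recall, so the reported value can be recomputed from the other two. The function name `f_measure` is chosen here for illustration.

```python
def f_measure(precision, recall):
    """Harmonic mean of precision and recall (the balanced F-measure, F1)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Values reported for Boytcheva's model, in percent.
reported_f = f_measure(97.3, 74.68)
print(round(reported_f, 1))
```

The recomputed value agrees with the reported 84.5% to one decimal place.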
In summary, prior research shows that machine learning models play a significant and beneficial role in automatically mapping ICD-10 codes to clinical data. The common preprocessing approach is to use NLP to digest raw text and map it to standardized medical terminologies to build input features. The first challenge of our research is therefore to develop a preprocessing protocol. The second is to design an approach for dealing with a large number of input features and target classes (ICD-10 codes).
Our objective is to develop machine learning models to fill in missing ICD-10 codes or verify existing ones, in order to obtain more complete billing documents. We aim to test the model by completing the ICD-10 codes for a one-year report and evaluating how much more the hospital can claim through billing.
- Use machine learning models to predict missing ICD-10.
- Use machine learning models to verify ICD-10 labelled by humans.
- The machine learning models achieve precision, recall, and F-measure greater than 80%.
- Present a one-year cost-benefit analysis comparing before and after using machine learning models to fill in ICD-10.
- Write and submit a research proposal and ethics application.
- Set up a new server.
- Duplicate clinical data to the server.
- Map and label column names and descriptions.
- Join the table data and create a single dataset.
- Apply NLP and standard medical terminologies to preprocess input features.
- Design and evaluate machine learning models.
- Close the project when either the model performance is greater than 80% or it is the last week of May.
- Write and submit a paper.
Clinical records of outpatient visits from 2006 - 2017 (2006 - 2016 for the training set and 2017 for the test set) are retrospectively retrieved from the Maharaj Nakhon Chiang Mai electronic health records. Approximately one million records are expected to be retrieved per year. Only encoded data (numbers, strings) are included in the experiment; images and scanned documents are excluded.
All identifying data such as name, surname, address, national identification number, and hospital number will be removed to protect patient privacy. Data of interest include:
- Demographic data such as date of birth, gender
- History taking and physical examination (including discharge summary)
- Laboratory and investigation reports
- Medical prescription (investigation, drug)
- ICD-10 (coded by a technical coder)
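Removing direct identifiers before analysis might look like the sketch below. The field names are hypothetical, since the actual column names depend on the hospital's metadata. Note that this only drops structured identifier columns; free-text fields may still contain identifying details and would need separate handling.

```python
# Hypothetical column names for the identifiers listed above.
IDENTIFIER_FIELDS = {"name", "surname", "address", "national_id", "hospital_number"}


def deidentify(record):
    """Return a copy of the record without any direct identifier fields."""
    return {key: value for key, value in record.items()
            if key not in IDENTIFIER_FIELDS}


# Invented example record.
visit = {
    "name": "Somchai",
    "surname": "Example",
    "hospital_number": "0000000",
    "gender": "M",
    "date_of_birth": "1970-01-01",
    "icd10": "I10",
}
clean_visit = deidentify(visit)
```

The original record is left unchanged, so the identified copy can stay on the secured source system while only `clean_visit` is passed on.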
Data from 2006 - 2016 are used to train the machine learning models and data from 2017 are used to evaluate them. We use overall accuracy, precision, recall, F-measure, and area under the ROC curve to evaluate and compare predictive performance between models.
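A minimal sketch of the year-based split and of per-code precision, recall, and F-measure is given below. It assumes each record carries a `year` field (a hypothetical name) and treats one ICD-10 code at a time as the positive class. Accuracy and area under the ROC curve are omitted for brevity.

```python
def split_by_year(records, test_year=2017):
    """Train on records before the test year, evaluate on the test year itself."""
    train = [r for r in records if r["year"] < test_year]
    test = [r for r in records if r["year"] == test_year]
    return train, test


def precision_recall_f(true_codes, predicted_codes, code):
    """Precision, recall, and F-measure with `code` as the positive class."""
    pairs = list(zip(true_codes, predicted_codes))
    tp = sum(1 for t, p in pairs if t == code and p == code)
    fp = sum(1 for t, p in pairs if t != code and p == code)
    fn = sum(1 for t, p in pairs if t == code and p != code)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    total = precision + recall
    f = 2 * precision * recall / total if total else 0.0
    return precision, recall, f
```

Splitting by year rather than at random keeps the evaluation closer to the intended use, where a model trained on past visits is applied to a later year's records.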
Data recorded between 2006 - 2019 from the electronic health records of Maharaj Nakhon Chiang Mai were de-identified and preprocessed. All data that could potentially be traced back to an individual patient, such as the patient's name, surname, address, national identification number, phone number, and hospital number, were removed. We used TXN (a unique number representing a patient visit) as the joining key. The dataset was divided into five groups.
- Registration data
- Admission data
- Laboratory data
- Radiological report data
- Drug prescription data
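Using TXN as the joining key, rows from the five groups can be collected per visit. The sketch below uses only the standard library and assumes every row is a dictionary with a `TXN` field. The table names and other fields are invented for the example, and a dataframe library would be a natural choice on the real data.

```python
from collections import defaultdict


def group_by_txn(tables):
    """Collect rows from every table under the TXN of the visit they belong to.

    `tables` maps a table name to a list of row dictionaries. A visit may have
    several rows in one table (for example several laboratory results).
    """
    visits = defaultdict(lambda: defaultdict(list))
    for table_name, rows in tables.items():
        for row in rows:
            visits[row["TXN"]][table_name].append(row)
    return visits


# Invented example rows.
tables = {
    "registration": [{"TXN": 1, "gender": "F"}],
    "laboratory": [{"TXN": 1, "test": "CBC"}, {"TXN": 2, "test": "FBS"}],
    "drug": [{"TXN": 1, "drug": "paracetamol"}],
}
visits = group_by_txn(tables)
```

Each entry of `visits` then holds everything recorded for one visit, which is the unit a model would receive as input.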
The registration data is the demographic information of patients who visited Maharaj Nakhon Chiang Mai hospital, mostly outpatient department (OPD) cases. See the full detail of registration metadata here.
The admission data is the demographic information of patients who were admitted to any internal ward of Maharaj Nakhon Chiang Mai hospital (inpatient department (IPD) cases). See the full detail of admission metadata here.
See the full detail of laboratory metadata here.
The radiological report data consists of the notes that radiologists wrote after reviewing the imaging. The notes are written in plain text and describe the findings in the imaging and the impression of suspected abnormalities and/or provisional diagnosis. We do not include any image data in this experiment. The notes need to be preprocessed using natural language processing techniques for cleaning and feature engineering. This work is contributed in the radio branch of this project.
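A first cleaning and feature-engineering step for such notes could be a plain bag-of-words representation, sketched below. It assumes the notes are written in English and uses a tiny stopword list invented for the example. The word "no" is deliberately kept, because negation changes the meaning of a finding.

```python
import re
from collections import Counter

# Tiny illustrative stopword list; "no" and "not" are kept on purpose.
STOPWORDS = {"the", "a", "an", "of", "and", "is", "in", "with"}


def tokenize(note):
    """Lowercase the note and keep alphabetic tokens only."""
    return re.findall(r"[a-z]+", note.lower())


def bag_of_words(note, stopwords=STOPWORDS):
    """Count the remaining tokens after stopword removal."""
    return Counter(token for token in tokenize(note) if token not in stopwords)


# Invented example note.
features = bag_of_words("No pleural effusion. Impression: pneumonia of the right lower lobe.")
```

The resulting counts can be used directly as input features or mapped to a standardized terminology in a later step.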
The drug prescription data describes the types of drugs that were prescribed to the patients. See the full detail of drug prescription metadata here.
- Clone the project and change to dev branch
git clone https://github.com/u4507075/icd_10.git
cd icd_10
git checkout dev
- Check out and update dev branch
git fetch
git checkout dev
git pull
- Commit and push
git add .
git commit -m "your message"
git push
- Check remote
git remote -v