This is a Java application for loading MEDLINE XML files into a relational database (currently supporting SQL Server and PostgreSQL). The application was designed with two goals in mind:
- Everything in the XML files needs to go into the database*.
- Any changes in the XML structure that occur over the years should not require changing the program.

*In 2017 we started breaking the first rule by omitting inline tags in text fields. For example, abstracts can contain <I> and <B> tags, but these are ignored when inserting into the database.
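To illustrate what ignoring inline tags can look like, here is a minimal sketch using the StAX API from the Java standard library. It is not the code used by this application. It assumes that the text inside the inline tags is kept and only the markup is dropped, which the description above does not spell out. The class and method names are hypothetical.

```java
import java.io.StringReader;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamReader;

// Hypothetical sketch, not part of MedlineXmlToDatabase.
class InlineTagSketch {

    // Returns the character data of the first element called elementName,
    // keeping the text of nested inline elements but dropping their tags.
    static String flattenedText(String xml, String elementName) {
        StringBuilder text = new StringBuilder();
        int depth = 0; // 0 means we are outside the target element
        try {
            XMLStreamReader reader =
                    XMLInputFactory.newFactory().createXMLStreamReader(new StringReader(xml));
            while (reader.hasNext()) {
                int event = reader.next();
                if (event == XMLStreamConstants.START_ELEMENT) {
                    if (depth > 0 || reader.getLocalName().equals(elementName)) {
                        depth++;
                    }
                } else if (event == XMLStreamConstants.END_ELEMENT) {
                    if (depth > 0 && --depth == 0) {
                        break; // the target element has been closed
                    }
                } else if ((event == XMLStreamConstants.CHARACTERS
                        || event == XMLStreamConstants.CDATA) && depth > 0) {
                    text.append(reader.getText());
                }
            }
            reader.close();
        } catch (XMLStreamException e) {
            throw new IllegalStateException("Malformed XML", e);
        }
        return text.toString();
    }

    public static void main(String[] args) {
        String xml = "<Abstract><AbstractText>Effect of <i>E. coli</i> on "
                + "<b>growth</b>.</AbstractText></Abstract>";
        System.out.println(flattenedText(xml, "AbstractText"));
    }
}
```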
The application is run in two phases:
- During analysis, the structure and contents of a large set of XML files are analysed, and a database structure is built to accommodate the data. This is typically done only once a year.
- During parse, all XML files in a folder are parsed and their contents are inserted into the database. This is typically done every time new XML files are available from MEDLINE.
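The idea behind the analysis phase can be sketched as follows: walk through the XML, record every element path that occurs, and remember the longest text seen for each path, so that fields of sufficient size can be created. This is only an illustration under assumptions. The actual analysis in this application may collect different or additional information, and the class and method names here are hypothetical.

```java
import java.io.StringReader;
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.Map;
import java.util.TreeMap;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamReader;

// Hypothetical sketch, not part of MedlineXmlToDatabase.
class StructureScanSketch {

    // Maps each element path (for example "A/B") to the maximum length of
    // the trimmed text found directly inside that element.
    static Map<String, Integer> maxTextLengths(String xml) {
        Map<String, Integer> lengths = new TreeMap<>();
        Deque<String> path = new ArrayDeque<>();        // paths of the open elements
        Deque<StringBuilder> text = new ArrayDeque<>(); // text of the open elements
        try {
            XMLStreamReader reader =
                    XMLInputFactory.newFactory().createXMLStreamReader(new StringReader(xml));
            while (reader.hasNext()) {
                int event = reader.next();
                if (event == XMLStreamConstants.START_ELEMENT) {
                    String parent = path.isEmpty() ? "" : path.peek() + "/";
                    path.push(parent + reader.getLocalName());
                    text.push(new StringBuilder());
                } else if (event == XMLStreamConstants.CHARACTERS
                        || event == XMLStreamConstants.CDATA) {
                    if (!text.isEmpty()) {
                        text.peek().append(reader.getText());
                    }
                } else if (event == XMLStreamConstants.END_ELEMENT) {
                    int length = text.pop().toString().trim().length();
                    lengths.merge(path.pop(), length, Math::max);
                }
            }
            reader.close();
        } catch (XMLStreamException e) {
            throw new IllegalStateException("Malformed XML", e);
        }
        return lengths;
    }

    public static void main(String[] args) {
        String xml = "<A><B>hello</B><B>hi there</B><C/></A>";
        maxTextLengths(xml).forEach((p, n) -> System.out.println(p + "\t" + n));
    }
}
```

Running such a scan over a large set of files gives one entry per distinct path, which is why new elements in later XML versions can be picked up by re-running the analysis rather than by changing the program.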
Note that the application works directly on the GZipped XML files, so there is no need to unzip them.
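Reading compressed XML without unzipping it first is possible in Java by wrapping the input in a GZIPInputStream and handing that stream to a streaming XML parser. The sketch below shows the general technique with standard library classes only. It is not taken from this application's source, the class and method names are hypothetical, and the tiny document in the demo is made up.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.zip.GZIPInputStream;
import java.util.zip.GZIPOutputStream;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamReader;

// Hypothetical sketch, not part of MedlineXmlToDatabase.
class GzipXmlSketch {

    // Counts the start tags called elementName in a gzip-compressed XML stream.
    // The data is decompressed on the fly and never written to disk.
    static int countElements(InputStream gzipped, String elementName) {
        try (GZIPInputStream in = new GZIPInputStream(gzipped)) {
            XMLStreamReader reader = XMLInputFactory.newFactory().createXMLStreamReader(in);
            int count = 0;
            while (reader.hasNext()) {
                if (reader.next() == XMLStreamConstants.START_ELEMENT
                        && reader.getLocalName().equals(elementName)) {
                    count++;
                }
            }
            reader.close();
            return count;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        } catch (XMLStreamException e) {
            throw new IllegalStateException("Malformed XML", e);
        }
    }

    // Demo helper: compresses a string in memory so the example needs no files.
    static byte[] gzip(String text) {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (GZIPOutputStream out = new GZIPOutputStream(bytes)) {
            out.write(text.getBytes(StandardCharsets.UTF_8));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return bytes.toByteArray();
    }

    public static void main(String[] args) {
        String xml = "<PubmedArticleSet><PubmedArticle/><PubmedArticle/></PubmedArticleSet>";
        int n = countElements(new ByteArrayInputStream(gzip(xml)), "PubmedArticle");
        System.out.println("Articles found: " + n);
    }
}
```

For a file on disk, the same method could be called with a java.io.FileInputStream or java.nio.file.Files.newInputStream in place of the in-memory stream.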
- Supports SQL Server, PostgreSQL and MySQL
- Scans the XML files to determine the structure of the database needed to hold the data
- All data in the XML files is loaded in the database
- Allows incremental loading of data as it is made available by NLM
- Automatically deletes old versions of citations as revisions are made available
- Also includes a parser for the MeSH database
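The replacement of old citation versions during incremental loading can be pictured as a delete-then-insert step keyed on the citation identifier. The sketch below models this with an in-memory list standing in for a database table. It is an assumption about the general approach, not a description of how this application talks to the database, and all names in it are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch, not part of MedlineXmlToDatabase.
class RevisionLoadSketch {

    // One row of a hypothetical citation table.
    static final class Row {
        final int pmid;
        final String title;

        Row(int pmid, String title) {
            this.pmid = pmid;
            this.title = title;
        }
    }

    // In-memory stand-in for the database table.
    private final List<Row> table = new ArrayList<>();

    // Loads one citation: first removes any older version with the same
    // identifier, then inserts the new version.
    void load(int pmid, String title) {
        table.removeIf(row -> row.pmid == pmid);
        table.add(new Row(pmid, title));
    }

    List<Row> rows() {
        return table;
    }

    public static void main(String[] args) {
        RevisionLoadSketch db = new RevisionLoadSketch();
        db.load(1, "Original title");
        db.load(2, "Another citation");
        db.load(1, "Revised title"); // replaces the first row
        for (Row row : db.rows()) {
            System.out.println(row.pmid + "\t" + row.title);
        }
    }
}
```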
This is a pure Java application that can only be used through the command line.
Requires Java 17 or higher, and write and create access to the database.
- Download all xml.gz files from MEDLINE (see http://www.nlm.nih.gov/databases/license/license.html for licensing information).
- Create an ini file according to the example in the iniFileExamples folder, pointing to the folder containing the xml.gz files, and the server and schema where the data should be uploaded.
- Under the Releases tab, download MedlineXmlToDatabase*.zip, and unzip the file. Alternatively, you can download the source code and use the included Ant file to build the Jar file.
- From the command line, use
  java -Xmx10000m -jar MedlineXmlToDatabase.jar -analyse -ini <path to ini file>
  to create the database structure.
- From the command line, use
  java -Xmx10000m -jar MedlineXmlToDatabase.jar -parse -ini <path to ini file>
  to load the data from the xml files into the database.
Optionally, you can also include the MeSH database:
- Download the XML gz files (descxxxx.gz and suppxxxx.gz) from NLM (see https://www.nlm.nih.gov/mesh/download_mesh.html).
- Add the path to the gz files to the ini file under MESH_XML_FOLDER.
- From the command line, use
  java -jar MedlineXmlToDatabase.jar -parse_mesh -ini <path to ini file>
  to load the data from the xml files into the database.
- Developer questions/comments/feedback: OHDSI Forum
- We use the GitHub issue tracker for all bugs/issues/enhancements
MedlineXmlToDatabase is licensed under Apache License 2.0
MedlineXmlToDatabase was developed in Eclipse. Contributions are welcome.
Development status: beta testing.
Martijn Schuemie is the author of this application.