Machine learning in rare disease

Manuscript description

This manuscript is a perspective on the use of machine learning methods with rare disease data.

Substantial technological advances have dramatically changed biomedicine by making deep characterization of patient samples routine. These technologies provide a rich portrait of the genes, cellular pathways, and cell types involved in complex phenotypes. Machine learning is often a perfect fit for the types of data now being generated, and Nature Methods routinely publishes reports of machine learning methods that extract disease-relevant patterns from these high-dimensional datasets. Often, these methods require a large number of samples to identify reproducible and biologically meaningful patterns. In rare diseases, biological specimens, and consequently data, are limited because the condition itself is rare. In this perspective, we outline the challenges and emerging solutions for using machine learning in these settings. We aim to spur the development of powerful machine learning techniques for rare diseases. We also note that precision medicine presents a similar challenge, in which a common disease is partitioned into small subsets of patients with shared etiologies and treatment strategies. Advances from rare disease research are likely to be highly informative for precision medicine applications as well.

Techniques that build on prior knowledge and indirectly related data are necessary for many rare disease applications.

This section will highlight promising approaches for analyzing rare disease data to extract biological insights. We will discuss techniques such as transfer learning, representation learning, cascade learning, integrative analysis, and knowledge-graph creation and use. These techniques leverage other knowledge and data sources to construct testable hypotheses from rare disease datasets with limited sample sizes.

Techniques and procedures must be implemented to manage model complexity without sacrificing the value of machine learning.

Inherent challenges posed by low sample numbers in rare diseases are further aggravated by disease heterogeneity, poorly defined disease phenotypes, and often a lack of control (i.e., normal) data. Machine learning approaches must be carefully designed to address these challenges. We discuss how to implement methodological solutions such as bootstrapping sample data, regularization methods for deep learning, and hyper-ensemble techniques to minimize misinterpretation of the data.
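
As a rough illustration of the bootstrapping idea mentioned above (a sketch, not code from the manuscript), the following resamples a small cohort with replacement to estimate a percentile confidence interval for a summary statistic. The function name `bootstrap_ci`, its parameters, and the expression values are hypothetical and chosen only for illustration.

```python
import random
import statistics


def bootstrap_ci(values, stat=statistics.mean, n_resamples=2000, alpha=0.05, seed=0):
    """Percentile bootstrap confidence interval for a statistic of a small sample."""
    rng = random.Random(seed)  # fixed seed so the sketch is reproducible
    n = len(values)
    # Resample with replacement and recompute the statistic each time.
    estimates = sorted(
        stat([values[rng.randrange(n)] for _ in range(n)])
        for _ in range(n_resamples)
    )
    lo_idx = int((alpha / 2) * n_resamples)
    hi_idx = int((1 - alpha / 2) * n_resamples) - 1
    return estimates[lo_idx], estimates[hi_idx]


# Hypothetical expression values for one gene in eight patient samples.
expression = [2.1, 3.4, 2.9, 5.0, 3.3, 2.7, 4.1, 3.8]
low, high = bootstrap_ci(expression)
print(f"mean = {statistics.mean(expression):.2f}, 95% CI = ({low:.2f}, {high:.2f})")
```

A wide interval suggests that a pattern observed in a handful of samples may not be reproducible.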

Techniques to manage disparities in data generation are required to power robust analyses in rare diseases.

The rarity of patients leads to heterogeneity in sample collection, causing disparities in the data. We will discuss how rigorous normalization and methods that capture sample-wise, gene-set-level information can help integrate disparate data points appropriately to power machine learning approaches.
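
One simple way to picture sample-wise, gene-set-level information is to standardize each gene across samples and then average the standardized values within each gene set. The sketch below assumes this z-score-based summary; it is not a specific method from the manuscript. The function names, gene names, and gene sets are hypothetical.

```python
import statistics


def zscore_genes(expr):
    """Standardize each gene across samples. expr maps gene -> {sample: value}."""
    scaled = {}
    for gene, by_sample in expr.items():
        values = list(by_sample.values())
        mu = statistics.mean(values)
        sd = statistics.pstdev(values)
        # Genes with no variation carry no signal; set their z-scores to 0.
        scaled[gene] = {s: (v - mu) / sd if sd > 0 else 0.0 for s, v in by_sample.items()}
    return scaled


def gene_set_scores(expr, gene_sets):
    """Average per-gene z-scores within each gene set, for every sample."""
    z = zscore_genes(expr)
    scores = {}
    for name, genes in gene_sets.items():
        present = [g for g in genes if g in z]  # ignore genes that were not measured
        if not present:
            continue
        samples = z[present[0]].keys()
        scores[name] = {s: statistics.mean(z[g][s] for g in present) for s in samples}
    return scores


# Hypothetical toy data: three genes measured in three samples.
expr = {
    "GENE_A": {"s1": 1.0, "s2": 2.0, "s3": 3.0},
    "GENE_B": {"s1": 10.0, "s2": 20.0, "s3": 30.0},
    "GENE_C": {"s1": 5.0, "s2": 5.0, "s3": 5.0},
}
gene_sets = {"pathway_x": ["GENE_A", "GENE_B"], "pathway_y": ["GENE_C", "GENE_MISSING"]}
scores = gene_set_scores(expr, gene_sets)
print(scores["pathway_x"])
```

Because GENE_A and GENE_B are on different scales, standardizing first keeps the higher-magnitude gene from dominating the pathway score.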

We will conclude by discussing the potential of the above-mentioned approaches in rare diseases, as well as in precision medicine and other biomedical areas where data is scarce.

Manubot

Manubot is a system for writing scholarly manuscripts via GitHub. Manubot automates citations and references, versions manuscripts using git, and enables collaborative writing via GitHub. An overview manuscript presents the benefits of collaborative writing with Manubot and its unique features. The rootstock repository is a general-purpose template for creating new Manubot instances, as detailed in SETUP.md. See USAGE.md for documentation on how to write a manuscript.

Please open an issue for questions related to Manubot usage, bug reports, or general inquiries.

Repository directories & files

The directories are as follows:

content contains the manuscript source, which includes markdown files as well as inputs for citations and references. See USAGE.md for more information.
output contains the outputs (generated files) from Manubot including the resulting manuscripts. You should not edit these files manually, because they will get overwritten.
webpage is a directory meant to be rendered as a static webpage for viewing the HTML manuscript.
build contains commands and tools for building the manuscript.
ci contains files necessary for deployment via continuous integration.

Local execution

The easiest way to run Manubot is to use continuous integration to rebuild the manuscript when the content changes. If you want to build a Manubot manuscript locally, install the conda environment as described in build. Then, you can build the manuscript on POSIX systems by running the following commands from this root directory.

# Activate the manubot conda environment (assumes conda version >= 4.4)
conda activate manubot

# Build the manuscript, saving outputs to the output directory
bash build/build.sh

# At this point, the HTML & PDF outputs will have been created. The remaining
# commands are for serving the webpage to view the HTML manuscript locally.
# This is required to view local images in the HTML output.

# Configure the webpage directory
manubot webpage

# You can now open the manuscript webpage/index.html in a web browser.
# Alternatively, open a local webserver at http://localhost:8000/ with the
# following commands.
cd webpage
python -m http.server

Sometimes it's helpful to monitor the content directory and automatically rebuild the manuscript when a change is detected. The following command, while running, will trigger both the build.sh script and manubot webpage command upon content changes:

bash build/autobuild.sh

Continuous Integration

Whenever a pull request is opened, CI (continuous integration) will test whether the changes break the build process to generate a formatted manuscript. The build process aims to detect common errors, such as invalid citations. If your pull request build fails, see the CI logs for the cause of failure and revise your pull request accordingly.

When a commit to the master branch occurs (for example, when a pull request is merged), CI builds the manuscript and writes the results to the gh-pages and output branches. The gh-pages branch uses GitHub Pages to host the following URLs:

HTML manuscript at https://jaybee84.github.io/ml-in-rd/
PDF manuscript at https://jaybee84.github.io/ml-in-rd/manuscript.pdf

For continuous integration configuration details, see .github/workflows/manubot.yaml if using GitHub Actions or .travis.yml if using Travis CI.

License

Except when noted otherwise, the entirety of this repository is licensed under a CC BY 4.0 License (LICENSE.md), which allows reuse with attribution. Please attribute by linking to https://github.com/jaybee84/ml-in-rd.

Since CC BY is not ideal for code and data, certain repository components are also released under the CC0 1.0 public domain dedication (LICENSE-CC0.md). All files matched by the following glob patterns are dual licensed under CC BY 4.0 and CC0 1.0:

*.sh
*.py
*.yml / *.yaml
*.json
*.bib
*.tsv
.gitignore

All other files are only available under CC BY 4.0, including:

*.md
*.html
*.pdf
*.docx
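
The licensing rule above can be sketched as a small lookup. This is an illustration only: the helper `licenses_for` is hypothetical, it assumes the glob patterns apply to a file's base name in any directory, and it does not account for files covered by the "except when noted otherwise" clause.

```python
from fnmatch import fnmatchcase
from pathlib import PurePosixPath

# Glob patterns listed in this README as dual licensed (CC BY 4.0 and CC0 1.0).
DUAL_LICENSED_PATTERNS = [
    "*.sh", "*.py", "*.yml", "*.yaml", "*.json", "*.bib", "*.tsv", ".gitignore",
]


def licenses_for(path):
    """Return the licenses that apply to a repository path under the rule above."""
    name = PurePosixPath(path).name  # assumption: match on the base name only
    if any(fnmatchcase(name, pattern) for pattern in DUAL_LICENSED_PATTERNS):
        return ["CC BY 4.0", "CC0 1.0"]
    return ["CC BY 4.0"]


print(licenses_for("build/build.sh"))
print(licenses_for("README.md"))
```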

Please open an issue for any question related to licensing.
