Rare disease researchers increasingly depend on machine learning (ML) to analyze high-dimensional datasets. A systematic review of ML applications in rare diseases (as defined in the European Union, i.e., diseases affecting fewer than five patients per 10,000 people) identified 211 human studies that used ML to study 74 rare diseases over the last ten years [@doi:10.1186/s13023-020-01424-6]. ML can be a powerful tool in biomedical research, but it does not come without pitfalls, some of which are magnified in a rare disease context [@doi:10.3389/fmed.2021.747612]. In this perspective, we discuss considerations for using two types of ML, supervised and unsupervised learning, in the study of rare diseases, with a specific focus on high-dimensional molecular data.
ML algorithms are computational methods that identify patterns in data, often in the form of lower-dimensional representations that can be used to perform useful computational tasks.
Supervised learning algorithms must be trained with data in which samples are “labeled” with a trait of interest, such as a biological or clinical phenotype.
Supervised methods can learn correlations among features (e.g., expression measurements of a large number of genes) that may be associated with these labels, and then use those correlations to predict or infer labels in unlabeled data, such as which patients will or will not respond to treatment.
Therefore, if a study aims to classify patients with a rare disease into disease subtypes based on high-throughput molecular profiling, a supervised ML algorithm is appropriate to carry out this task.
Conversely, unsupervised learning algorithms learn patterns or features from unlabeled data.
In the absence of known disease subtypes, unsupervised ML approaches can be applied to gene expression data to identify groups of samples with similar patterns of molecular states or pathway activity [@doi:10.1158/0008-5472.CAN-08-2100].
Unsupervised approaches can also extract combinations of features (e.g., genes) that may describe a certain cell type or pathway.
See Box 1 for more examples of how ML can be used in rare disease research.
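
The contrast between the two approaches can be sketched on a toy dataset. The example below is only an illustration under stated assumptions: the expression values, the gene and subtype names, and the choice of a nearest-centroid classifier and a naive k-means procedure are all hypothetical and are not taken from the studies cited here. The supervised model uses the subtype labels during training, whereas the unsupervised procedure sees only the expression values.

```python
from statistics import mean

# Hypothetical toy data: each sample is [gene_A, gene_B] expression
# (arbitrary units) paired with an assumed disease subtype label.
labeled = [
    ([1.0, 5.0], "subtype_1"),
    ([1.2, 4.8], "subtype_1"),
    ([0.8, 5.2], "subtype_1"),
    ([5.0, 1.0], "subtype_2"),
    ([4.8, 1.2], "subtype_2"),
    ([5.2, 0.8], "subtype_2"),
]


def sq_dist(a, b):
    """Squared Euclidean distance between two feature vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def centroid(points):
    """Feature-wise mean of a list of feature vectors."""
    return [mean(column) for column in zip(*points)]


# Supervised: a nearest-centroid classifier trained on labeled samples.
def fit_centroids(samples):
    by_label = {}
    for features, label in samples:
        by_label.setdefault(label, []).append(features)
    return {label: centroid(points) for label, points in by_label.items()}


def predict(centroids, features):
    """Return the label whose centroid is closest to the new sample."""
    return min(centroids, key=lambda label: sq_dist(centroids[label], features))


# Unsupervised: naive k-means that never sees the labels.
def kmeans(points, k=2, n_iter=10):
    centers = [list(p) for p in points[:k]]  # naive init: first k samples
    assignments = [0] * len(points)
    for _ in range(n_iter):
        # Assign each sample to its closest center.
        assignments = [
            min(range(k), key=lambda j: sq_dist(centers[j], p)) for p in points
        ]
        # Move each center to the mean of its assigned samples.
        for j in range(k):
            members = [p for p, a in zip(points, assignments) if a == j]
            if members:
                centers[j] = centroid(members)
    return assignments


centroids = fit_centroids(labeled)
new_sample_label = predict(centroids, [1.1, 4.9])  # an unlabeled sample

unlabeled = [features for features, _ in labeled]
clusters = kmeans(unlabeled, k=2)

print(new_sample_label)
print(clusters)
```

In this sketch the classifier returns a named subtype for a new sample, while k-means returns only arbitrary cluster indices that a researcher would still need to interpret biologically.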
While ML can be a useful tool, applying it to rare disease datasets poses challenges. ML methods are generally most effective when using large datasets; analyzing high-dimensional biomedical data, such as gene expression data with many thousands of features, is challenging when rare disease datasets typically contain relatively few samples [@https://www.fda.gov/media/99546/download; @doi:10.1186/s13023-020-01424-6]. Small datasets tend to lack statistical power and make ML more susceptible to misinterpretation and unstable performance. With insufficient data, an unsupervised model can fail to identify patterns that are useful for biological discovery. Supervised models can be adversely impacted if sample labels are uncertain or contain “label-noise” [@doi:10.1093/jamia/ocw028]. Datasets with high label-noise decrease prediction accuracy and necessitate larger sample sizes during model training, the process by which models learn patterns that distinguish samples in different classes (Box 2) [@doi:10.1109/tnnls.2013.2292894]. Rare disease datasets often come with significant label-noise. For example, if classifications of rare disease subtypes evolve over time, researchers constructing datasets for ML research may find that cohorts collected at different time periods do not have comparable labels. Additionally, a supervised ML model is of limited utility if it can accurately predict sample labels only in the data it was trained on, a problem known as overfitting. Instead, most researchers aspire to develop models that generalize (maintain performance) when applied to new data that has not yet been “seen” by the model.
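
Overfitting in a small, high-dimensional dataset can be illustrated with a short simulation. The sketch below is a hypothetical example rather than an analysis from the cited literature: the sample size, the number of features, the random seed, and the choice of a one-nearest-neighbor classifier are all assumptions made for illustration. The features are pure noise and the labels are assigned at random, so there is no real pattern to learn, yet the model reproduces its training labels perfectly.

```python
import random

N_FEATURES = 1000  # assumed: many features, e.g. genes
N_SAMPLES = 20     # assumed: few samples, as in many rare disease cohorts

rng = random.Random(0)  # fixed seed so the simulation is repeatable


def make_samples(n):
    """Noise-only features with random labels: no true signal exists."""
    return [
        ([rng.gauss(0, 1) for _ in range(N_FEATURES)], rng.choice(["A", "B"]))
        for _ in range(n)
    ]


def sq_dist(a, b):
    """Squared Euclidean distance between two feature vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def one_nn_predict(train, features):
    """Predict the label of the closest training sample (1-nearest neighbor)."""
    nearest = min(train, key=lambda sample: sq_dist(sample[0], features))
    return nearest[1]


def accuracy(train, samples):
    """Fraction of samples whose predicted label matches the given label."""
    correct = sum(one_nn_predict(train, f) == label for f, label in samples)
    return correct / len(samples)


train = make_samples(N_SAMPLES)
held_out = make_samples(N_SAMPLES)  # "unseen" data from the same process

# Each training sample is its own nearest neighbor, so the model
# simply memorizes the training labels.
train_acc = accuracy(train, train)

# On unseen samples there is nothing to generalize from, so accuracy
# is expected to be near chance rather than near the training accuracy.
held_out_acc = accuracy(train, held_out)

print(f"training accuracy: {train_acc:.2f}")
print(f"held-out accuracy: {held_out_acc:.2f}")
```

The gap between the two accuracies is the point of the sketch: performance measured on the training data alone says little about how a model will behave on newly acquired samples.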
While we expect ML in rare disease research to continue to increase in popularity, the field requires methods that can learn patterns from small datasets and can generalize to newly acquired data [@doi:10.1016/j.ebiom.2019.08.027]. In this perspective, we highlight approaches that address or better tolerate the limitations of rare disease data and discuss the future of ML applications in rare disease.