High-throughput ‘omic’ assays can generate thousands to billions of measurements from whole transcriptome and whole genome sequencing respectively, resulting in high-dimensional datasets. A typical rare disease dataset consists of a small number of samples[@doi:10.1186/s13023-020-01424-6] leading to the “curse of dimensionality” in which the feature space is much larger than the sample space, increasing the difficulty in building highly generalizable models [@doi:10.1038/nrc2294]. A larger feature space can contribute to higher data missingness (sparsity), more dissimilarity between samples (variance), and increased redundancy among individual features or combinations (multicollinearity) [@doi:10.1038/s41592-018-0019-x], all of which contribute to challenges in ML implementation.
An important factor in ML is model performance: the accuracy of a supervised model in identifying patterns relevant for a biological question, or the reliability of an unsupervised model in identifying hypothetical biological patterns supported by post-hoc validation. When small sample sizes compromise an ML model’s performance, two approaches can be taken to manage sparsity, variance, and multicollinearity: 1) increase the number of samples, 2) improve the quality of samples. In the first approach, appropriate training, evaluation, and held-out validation sets could be constructed by combining multiple rare disease cohorts (Figure [@fig:1]a, Box 2). When combining datasets, special attention should be directed towards data harmonization since data collection methods can differ between cohorts. Without careful selection of aggregation methods, one may introduce technical – in contrast to biological – variability into the combined dataset and negatively impact the ML model’s ability to learn or detect meaningful signals. Steps like reprocessing data using a single pipeline, using batch correction methods [@doi:10.1093/biostatistics/kxj037; @doi:10.1093/nar/gku864], and normalizing raw values appropriately without affecting the underlying variance in the data [@doi:10.1186/gb-2010-11-3-r25, @doi:10.1371/journal.pcbi.1003531, @doi:10.1186/s13059-014-0550-8] may be necessary to mitigate unwanted variability. (Figure [@fig:1]a) Data harmonization may also entail standardization of sample labels using biomedical ontologies to normalize how samples are described across multiple datasets.
How does one know if a composite dataset has undergone proper harmonization and annotation? Ideally, the dominant patterns of the composite dataset reflect variables of interest, such as phenotype labels rather than technical labels. In the latter case, this suggests that the datasets used to generate the composite dataset need to be corrected to overcome differences in how the data were generated or collected. In the next section, we discuss approaches that help identify and visualize structure in datasets to determine whether composite rare disease datasets are appropriate for ML use.