Outlook

This perspective highlights challenges in applying ML to rare disease data and approaches that address these challenges. Small sample size, while significant, is not the only roadblock. The high dimensionality of modern data requires creative approaches, such as learning new representations of the data, to manage the curse of dimensionality. Leveraging prior knowledge and transfer learning methods to appropriately interpret data is also required. Furthermore, we posit that researchers applying ML methods to rare disease data should use techniques that increase confidence (e.g., bootstrapping) and penalize the complexity of the resultant models (e.g., regularization) to enhance the generalizability of their work. The line between classical statistical methods and ML is fuzzy. Multiple statistical techniques that were considered out of scope for this article (e.g., hierarchical models, Bayesian frameworks, association tests) [@doi:10.1016/j.ajhg.2016.01.008; @doi:10.1016/j.ajhg.2011.05.029; @doi:10.1371/journal.pgen.1004729; @doi:10.1016/j.ajhg.2017.05.015] may have substantial potential to enhance the accuracy and generalizability of models and should be considered in the rare disease study design process.
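
As a minimal illustration of the two safeguards named above, the sketch below computes a percentile bootstrap interval for a statistic on a small sample and shows how an L2 (ridge) penalty shrinks a fitted coefficient. The function names, the toy measurements, and the single-feature, no-intercept model are assumptions made for illustration only; they are not drawn from any study discussed in this perspective.

```python
import random
import statistics


def bootstrap_ci(values, stat=statistics.mean, n_resamples=2000, alpha=0.05, seed=0):
    """Percentile bootstrap interval for `stat` (hypothetical helper)."""
    rng = random.Random(seed)
    n = len(values)
    # Resample the data with replacement and recompute the statistic each time.
    estimates = sorted(
        stat([values[rng.randrange(n)] for _ in range(n)])
        for _ in range(n_resamples)
    )
    lower = estimates[int((alpha / 2) * n_resamples)]
    upper = estimates[int((1 - alpha / 2) * n_resamples) - 1]
    return lower, upper


def ridge_slope(x, y, lam):
    """Closed-form slope of a one-feature, no-intercept ridge regression."""
    sxy = sum(a * b for a, b in zip(x, y))
    sxx = sum(a * a for a in x)
    # The penalty `lam` enlarges the denominator and shrinks the slope toward zero.
    return sxy / (sxx + lam)


# Toy data (assumed): eight measurements from a small cohort.
measurements = [2.1, 3.4, 1.9, 5.0, 4.2, 3.3, 2.8, 4.9]
lower, upper = bootstrap_ci(measurements)
print(f"95% bootstrap interval for the mean: [{lower:.2f}, {upper:.2f}]")

x = [1.0, 2.0, 3.0, 4.0]
y = [1.2, 1.9, 3.2, 3.8]
print("unpenalized slope:", ridge_slope(x, y, lam=0.0))
print("penalized slope:  ", ridge_slope(x, y, lam=10.0))
```

With so few samples, a wide bootstrap interval is a direct signal that an estimate may not generalize, and the penalty term trades a small amount of bias for lower variance.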

The approaches highlighted in this perspective come with challenges that may undermine investigators' confidence in using these techniques for rare disease research. We believe that the challenges in applying ML to rare disease are opportunities to improve data generation and method development going forward. The following two areas are particularly important for the field to explore.

Intentional data generation and sharing mechanisms are key for powering the future of rare disease data analysis

While many techniques exist to collate rare data from different sources, low-quality data may hurt the end goal even if it increases the size of the dataset. In our experience, collaboration with domain experts has proved critical for gaining insight into potential sources of variation in the datasets. An anecdotal example: conversations with a clinician revealed that samples in a particular tumor dataset were collected using vastly different surgical techniques (laser ablation and excision vs. standard excision). This information, not readily available to non-experts, was obvious to the clinician. Such instances suggest that collaboration with domain experts and sharing of well-annotated data are needed to generate robust datasets.

In addition to sample scarcity, comprehensive phenotypic-genotypic databases are also lacking. While rare disease studies that collect genomic and phenotypic data are becoming more common [@doi:10.1038/nrg3555; @doi:10.1038/nrg.2017.116; @doi:10.1056/NEJMra1711801], developing comprehensive genomics-based genotype-phenotype databases that prioritize clinical and genomics data standards is key to fueling interpretation of ML methods. This effort can be bolstered by funding or fostering collaboration between biobanking projects and patient registry initiatives. Mindful sharing of data, with proper metadata and attribution to enable prompt reuse, is important in building valuable datasets for rare disease research [@https://www.nature.com/articles/s41576-020-0257-5]. Finally, federated learning methods, such as those used in mobile health [@doi:10.1038/s41746-020-00323-1] and electronic healthcare records studies [@doi:10.1001/jamanetworkopen.2021.24946], may allow researchers to develop ML models on data from larger numbers of people with rare diseases whilst protecting patient privacy.
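
To make the federated idea concrete, the sketch below shows one aggregation step in the style of federated averaging: each site trains locally and shares only its model parameters and sample count, and a coordinator combines them in proportion to sample size. The function name, the site counts, and the parameter values are hypothetical, and a real deployment would add secure communication and repeated training rounds that this sketch omits.

```python
def federated_average(site_updates):
    """Combine per-site parameter vectors, weighted by each site's sample count.

    `site_updates` is a list of (n_samples, parameters) tuples (assumed format);
    no patient-level records ever leave a site.
    """
    total = sum(n for n, _ in site_updates)
    dim = len(site_updates[0][1])
    return [
        sum(n * params[i] for n, params in site_updates) / total
        for i in range(dim)
    ]


# Two hypothetical clinics with 10 and 30 patients, each reporting two parameters.
updates = [
    (10, [1.0, 2.0]),
    (30, [3.0, 4.0]),
]
print(federated_average(updates))  # [2.5, 3.5]
```

Weighting by sample count lets a larger registry contribute more to the shared model without forcing smaller sites to pool their raw data.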

Methods that reliably support mechanistic interrogation of specific rare diseases are an unmet need

Most ML methods for rare diseases are used for classification tasks. Few methods investigate biological mechanisms, likely because few can handle the limitations of rare disease data described throughout this perspective. Developing methods that address this gap will be critical for applying ML to rare disease data.

For example, methods developed with a focus on model explainability can identify features that may be related to the underlying disease mechanism [@doi:10.1038/s41551-018-0304-0]. Alternatively, representation learning or regularization methods may help identify multiple correlated features, which can be interrogated to identify biologically meaningful pathways. Additionally, robust error analysis of newly developed models, which helps users understand how a feature influences model performance, can provide insight into that feature's potential contribution to the underlying disease mechanism [@doi:10.1093/bioinformatics/bth060]. Interrogating disease mechanisms by adopting modifications of these approaches is necessary as ML applications become mainstream in research and clinical settings.

Finally, methods that can reliably integrate disparate datasets will likely always remain a need in rare disease research. Methods that rely on finding structural correspondence between datasets ("anchors") may be able to transform the status quo of using ML in rare disease [@https://www.aclweb.org/anthology/W06-1615; @https://dl.acm.org/doi/10.5555/2283516.2283652; @doi:10.1016/j.cell.2019.05.031]. We speculate that this is an important and burgeoning area of research, and we are optimistic about the future of applying ML approaches to rare diseases.