---
layout: page
title: Genomics
permalink: /Genomics/
---

Genomics

  1. BIDS - Zhou J, Troyanskaya O. 2015. Predicting effects of noncoding variants with deep learning-based sequence model. Nat Methods. 2015 Oct;12(10):931-4. doi: 10.1038/nmeth.3547. Epub 2015 Aug 24. https://www.ncbi.nlm.nih.gov/pubmed/26301843.

    Identifying functional effects of noncoding variants is a major challenge in human genetics. To predict the noncoding-variant effects de novo from sequence, we developed a deep learning-based algorithmic framework, DeepSEA (http://deepsea.princeton.edu/), that directly learns a regulatory sequence code from large-scale chromatin-profiling data, enabling prediction of chromatin effects of sequence alterations with single-nucleotide sensitivity. We further used this capability to improve prioritization of functional variants including expression quantitative trait loci (eQTLs) and disease-associated variants.

  2. BIDS - Y. Hao et al., Semi-supervised Learning Predicts Approximately One Third of the Alternative Splicing Isoforms as Functional Proteins. Cell Rep. 12, 183–189 (2015). https://www.ncbi.nlm.nih.gov/pubmed/26146086

    Alternative splicing acts on transcripts from almost all human multi-exon genes. Notwithstanding its ubiquity, fundamental ramifications of splicing on protein expression remain unresolved. The number and identity of spliced transcripts that form stably folded proteins remain the sources of considerable debate, due largely to low coverage of experimental methods and the resulting absence of negative data. We circumvent this issue by developing a semi-supervised learning algorithm, positive unlabeled learning for splicing elucidation (PULSE; http://www.kimlab.org/software/pulse), which uses 48 features spanning various categories. We validated its accuracy on sets of bona fide protein isoforms and directly on mass spectrometry (MS) spectra for an overall AU-ROC of 0.85. We predict that around 32% of "exon skipping" alternative splicing events produce stable proteins, suggesting that the process engenders a significant number of previously uncharacterized proteins. We also provide insights into the distribution of positive isoforms in various functional classes and into the structural effects of alternative splicing.

  3. E. R. Gamazon et al., A gene-based association method for mapping traits using reference transcriptome data. Nat. Genet. 47, 1091–1098 (2015). https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4552594/

    Genome-wide association studies (GWAS) have identified thousands of variants robustly associated with complex traits. However, the biological mechanisms underlying these associations are, in general, not well understood. We propose a gene-based association method called PrediXcan that directly tests the molecular mechanisms through which genetic variation affects phenotype. PrediXcan enjoys the benefits of gene-based approaches such as reduced multiple testing burden and a principled approach to the design of follow-up experiments. Our results demonstrate that PrediXcan can detect known and novel genes associated with disease traits and provide insights into the mechanism of these associations.

  4. C. Dwork et al., STATISTICS. The reusable holdout: Preserving validity in adaptive data analysis. Science. 349, 636–638 (2015). https://www.ncbi.nlm.nih.gov/pubmed/26250683

    Misapplication of statistical data analysis is a common cause of spurious discoveries in scientific research. Existing approaches to ensuring the validity of inferences drawn from data assume a fixed procedure to be performed, selected before the data are examined. In common practice, however, data analysis is an intrinsically adaptive process, with new analyses generated on the basis of data exploration, as well as the results of previous analyses on the same data. We demonstrate a new approach for addressing the challenges of adaptivity based on insights from privacy-preserving data analysis. As an application, we show how to safely reuse a holdout data set many times to validate the results of adaptively chosen analyses.
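The abstract above does not spell out the mechanism. The following is a minimal sketch of a threshold-based holdout check in the spirit of that idea: the analyst sees the training estimate unless it disagrees with the holdout estimate by more than a noisy threshold, in which case a noised holdout value is returned. The function names, the default `threshold` and `sigma`, and the use of Laplace noise are assumptions made for illustration, not the authors' reference implementation.

```python
import random


def laplace(scale, rng):
    # The difference of two Exp(1) draws is Laplace(0, 1); rescale by `scale`.
    return scale * (rng.expovariate(1.0) - rng.expovariate(1.0))


def thresholdout(train_vals, holdout_vals, threshold=0.04, sigma=0.01, rng=None):
    """Answer one query (a mean of per-sample values) without exposing the holdout directly.

    train_vals / holdout_vals: the query evaluated on each training / holdout sample.
    threshold, sigma: hypothetical defaults chosen for this sketch.
    """
    rng = rng or random.Random(0)
    train_mean = sum(train_vals) / len(train_vals)
    holdout_mean = sum(holdout_vals) / len(holdout_vals)
    # If the two estimates agree up to a noisy threshold, return the training estimate,
    # so nothing new about the holdout set is revealed.
    if abs(train_mean - holdout_mean) > threshold + laplace(sigma, rng):
        # They disagree (a sign of overfitting): return a noised holdout estimate.
        return holdout_mean + laplace(sigma, rng)
    return train_mean
```

In use, every adaptively chosen statistic (for example, the accuracy of a newly selected feature set) would be passed through a function like this instead of being read off the holdout set directly.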

  5. H. Y. Xiong et al., RNA splicing. The human splicing code reveals new insights into the genetic determinants of disease. Science. 347, 1254806 (2015). https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4362528/

    Advancing whole-genome precision medicine requires understanding how gene expression is altered by genetic variants, especially those that are outside of protein-coding regions. We developed a computational technique that scores how strongly genetic variants alter RNA splicing, a critical step in gene expression whose disruption contributes to many diseases, including cancers and neurological disorders. A genome-wide analysis reveals tens of thousands of variants that alter splicing and are enriched with a wide range of known diseases. Our results provide insight into the genetic basis of spinal muscular atrophy, hereditary nonpolyposis colorectal cancer and autism spectrum disorder.

  6. R. Middleton et al., IRFinder: assessing the impact of intron retention on mammalian gene expression. Genome Biol. 18, 51 (2017). https://www.ncbi.nlm.nih.gov/pubmed/28298237

    Intron retention (IR) occurs when an intron is transcribed into pre-mRNA and remains in the final mRNA. We have developed a program and database called IRFinder to accurately detect IR from mRNA sequencing data. Analysis of 2573 samples showed that IR occurs in all tissues analyzed, affects over 80% of all coding genes and is associated with cell differentiation and the cell cycle. Frequently retained introns are enriched for specific RNA binding protein sites and are often retained in clusters in the same gene. IR is associated with lower protein levels and intron-retaining transcripts that escape nonsense-mediated decay are not actively translated.

  7. T. A. Hopf et al., Mutation effects predicted from sequence co-variation. Nat. Biotechnol. 35, 128–135 (2017). https://www.ncbi.nlm.nih.gov/pubmed/28092658

    Many high-throughput experimental technologies have been developed to assess the effects of large numbers of mutations (variation) on phenotypes. However, designing functional assays for these methods is challenging, and systematic testing of all combinations is impossible, so robust methods to predict the effects of genetic variation are needed. Most prediction methods exploit evolutionary sequence conservation but do not consider the interdependencies of residues or bases. We present EVmutation, an unsupervised statistical method for predicting the effects of mutations that explicitly captures residue dependencies between positions. We validate EVmutation by comparing its predictions with outcomes of high-throughput mutagenesis experiments and measurements of human disease mutations and show that it outperforms methods that do not account for epistasis. EVmutation can be used to assess the quantitative effects of mutations in genes of any organism, and we provide pre-computed predictions for ∼7,000 human proteins at http://evmutation.org/.

  8. BIDS - N. Beerenwinkel, R. F. Schwarz, M. Gerstung, F. Markowetz, Cancer evolution: mathematical models and computational inference. Syst. Biol. 64, e1–25 (2015). https://www.ncbi.nlm.nih.gov/pubmed/25293804

    Cancer is a somatic evolutionary process characterized by the accumulation of mutations, which contribute to tumor growth, clinical progression, immune escape, and drug resistance development. Evolutionary theory can be used to analyze the dynamics of tumor cell populations and to make inference about the evolutionary history of a tumor from molecular data. We review recent approaches to modeling the evolution of cancer, including population dynamics models of tumor initiation and progression, phylogenetic methods to model the evolutionary relationship between tumor subclones, and probabilistic graphical models to describe dependencies among mutations. Evolutionary modeling helps to understand how tumors arise and will also play an increasingly important prognostic role in predicting disease progression and the outcome of medical interventions, such as targeted therapy.

  9. Chong Z, Ruan J, Gao M, Zhou W, Chen T, Fan X, Ding L, Lee AY, Boutros P, Chen J, Chen K. 2017. novoBreak: local assembly for breakpoint detection in cancer genomes. Nature Methods 14, 65–67 (2017). doi:10.1038/nmeth.4084. https://www.nature.com/articles/nmeth.4084.

    We present novoBreak, a genome-wide local assembly algorithm that discovers somatic and germline structural variation breakpoints in whole-genome sequencing data. novoBreak consistently outperformed existing algorithms on real cancer genome data and on synthetic tumors in the ICGC-TCGA DREAM 8.5 Somatic Mutation Calling Challenge primarily because it more effectively utilized reads spanning breakpoints. novoBreak also demonstrated great sensitivity in identifying short insertions and deletions.

  10. BIDS - You, Ronghui, et al. "GOLabeler: Improving Sequence-based Large-scale Protein Function Prediction by Learning to Rank." bioRxiv (2017): 145763. https://www.ncbi.nlm.nih.gov/pubmed/29522145.

    The key of this method is to extract not only homology information but also diverse, deep-rooted information/evidence from sequence inputs and integrate them into a predictor in a both effective and efficient manner. We propose GOLabeler, which integrates five component classifiers, trained from different features, including GO term frequency, sequence alignment, amino acid trigram, domains and motifs, and biophysical properties, etc., in the framework of learning to rank (LTR), a paradigm of machine learning, especially powerful for multilabel classification. The empirical results obtained by examining GOLabeler extensively and thoroughly by using large-scale datasets revealed numerous favorable aspects of GOLabeler, including significant performance advantage over state-of-the-art AFP methods.

  11. BIDS - Gong, Qingtian, Wei Ning, and Weidong Tian. "GoFDR: a sequence alignment based method for predicting protein functions." Methods 93 (2016): 3-14. https://www.ncbi.nlm.nih.gov/pubmed/26277418.

    In this study, we developed a method named GoFDR for predicting Gene Ontology (GO)-based protein functions. The input for GoFDR is simply a query sequence-based multiple sequence alignment (MSA) produced by PSI-BLAST. For each GO term annotated to the sequences in the MSA, GoFDR identifies a number of functionally discriminating residues (FDRs) specific to the GO term, and scores the query sequence using a position specific scoring matrix (PSSM) constructed for the FDRs. The raw score is then converted into a probability score according to a score-to-probability table prepared from training sequences. GoFDR outperformed three sequence-based methods for predicting GO functions in a benchmark of 18,520 sequences. In addition, GoFDR was ranked one of the top methods according to the preliminary evaluation report released by the 2nd Critical Assessment of Function Annotation (CAFA2) project. Finally, we applied GoFDR to the complete human proteome sequences, and showed that the predictions made by GoFDR with high confidence significantly expanded current annotations of human proteome. As such, GoFDR is of great value not only for annotating protein functions in newly sequenced genomes, but also for characterizing the function of proteins of interest.
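The scoring step described above (a PSSM restricted to discriminating positions, followed by a score-to-probability lookup) can be illustrated roughly as follows. This is a toy sketch, not GoFDR's code: the data layout, the function names, the zero score for residues missing from a column, and the step-function lookup table are all assumptions.

```python
from bisect import bisect_right


def pssm_score(query, pssm):
    """Sum PSSM scores over the discriminating positions only.

    pssm: {position: {residue: score}}; positions index into `query`.
    Residues absent from a column score 0.0 (an assumption of this sketch).
    """
    return sum(column.get(query[pos], 0.0) for pos, column in pssm.items())


def score_to_probability(score, table):
    """Map a raw score to a probability with a step-function lookup.

    table: list of (min_score, probability) pairs sorted by min_score.
    Scores below the first threshold map to 0.0.
    """
    thresholds = [t for t, _ in table]
    i = bisect_right(thresholds, score) - 1
    return table[i][1] if i >= 0 else 0.0


# Hypothetical example: two discriminating positions for one GO term.
pssm = {1: {"K": 2.0, "R": 1.0}, 4: {"D": 3.0}}
table = [(0.0, 0.1), (3.0, 0.5), (5.0, 0.9)]
raw = pssm_score("MKTLD", pssm)
prob = score_to_probability(raw, table)
```

Here the query has `K` at position 1 and `D` at position 4, so the raw score is 5.0 and the lookup returns 0.9.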

  12. BIDS - Tapio Pahikkala, Antti Airola. RLScore: Regularized Least-Squares Learners. Journal of Machine Learning Research 17 (2016) 1-5. http://www.jmlr.org/papers/volume17/16-470/16-470.pdf.

    RLScore is a Python open source module for kernel based machine learning. The library provides implementations of several regularized least-squares (RLS) type of learners. RLS methods for regression and classification, ranking, greedy feature selection, multi-task and zero-shot learning, and unsupervised classification are included. Matrix algebra based computational short-cuts are used to ensure efficiency of both training and cross-validation. A simple API and extensive tutorials allow for easy use of RLScore.
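One well-known example of a matrix-algebra shortcut for regularized least squares is the closed-form leave-one-out residual, which avoids refitting the model once per held-out sample. The sketch below shows it for the simplest case (one feature, no intercept) and checks it against naive refitting. It does not use RLScore's API; all function names are hypothetical.

```python
def ridge_1d(xs, ys, lam):
    """Closed-form ridge weight for one feature, no intercept."""
    return sum(x * y for x, y in zip(xs, ys)) / (sum(x * x for x in xs) + lam)


def loo_residuals_fast(xs, ys, lam):
    """Leave-one-out residuals from a single fit.

    Uses the identity r_loo_i = (y_i - w * x_i) / (1 - h_i), where
    h_i = x_i^2 / (sum_j x_j^2 + lam) is the i-th diagonal entry of the hat matrix.
    """
    sxx = sum(x * x for x in xs) + lam
    w = ridge_1d(xs, ys, lam)
    return [(y - w * x) / (1.0 - x * x / sxx) for x, y in zip(xs, ys)]


def loo_residuals_naive(xs, ys, lam):
    """Reference implementation: refit with each sample left out."""
    out = []
    for i in range(len(xs)):
        w = ridge_1d(xs[:i] + xs[i + 1:], ys[:i] + ys[i + 1:], lam)
        out.append(ys[i] - w * xs[i])
    return out


xs = [1.0, 2.0, 3.0, 4.0]
ys = [1.1, 1.9, 3.2, 3.8]
fast = loo_residuals_fast(xs, ys, lam=0.5)
naive = loo_residuals_naive(xs, ys, lam=0.5)
```

The two lists agree up to floating-point error, but the fast version needs one fit instead of one fit per sample. The same identity extends to the multi-feature case through the diagonal of the hat matrix.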

  13. BIDS - Pertea M, Pertea GM, Antonescu CM, Chang TC, Mendell JT, Salzberg SL. StringTie enables improved reconstruction of a transcriptome from RNA-seq reads. Nat Biotechnol. 2015 Mar;33(3):290-5. doi: 10.1038/nbt.3122. Epub 2015 Feb 18. https://www.ncbi.nlm.nih.gov/pubmed/25690850

    Methods used to sequence the transcriptome often produce more than 200 million short sequences. We introduce StringTie, a computational method that applies a network flow algorithm originally developed in optimization theory, together with optional de novo assembly, to assemble these complex data sets into transcripts. When used to analyze both simulated and real data sets, StringTie produces more complete and accurate reconstructions of genes and better estimates of expression levels, compared with other leading transcript assembly programs including Cufflinks, IsoLasso, Scripture and Traph. For example, on 90 million reads from human blood, StringTie correctly assembled 10,990 transcripts, whereas the next best assembly was of 7,187 transcripts by Cufflinks, which is a 53% increase in transcripts assembled. On a simulated data set, StringTie correctly assembled 7,559 transcripts, which is 20% more than the 6,310 assembled by Cufflinks. As well as producing a more complete transcriptome assembly, StringTie runs faster on all data sets tested to date compared with other assembly software, including Cufflinks.

  14. BIDS - Kim D, Paggi JM, Park C, Bennett C, Salzberg SL. Graph-based genome alignment and genotyping with HISAT2 and HISAT-genotype. Nat Biotechnol. 2019 Aug;37(8):907-915. https://www.ncbi.nlm.nih.gov/pubmed/31375807

    The human reference genome represents only a small number of individuals, which limits its usefulness for genotyping. We present a method named HISAT2 (hierarchical indexing for spliced alignment of transcripts 2) that can align both DNA and RNA sequences using a graph Ferragina Manzini index. We use HISAT2 to represent and search an expanded model of the human reference genome in which over 14.5 million genomic variants in combination with haplotypes are incorporated into the data structure used for searching and alignment. We benchmark HISAT2 using simulated and real datasets to demonstrate that our strategy of representing a population of genomes, together with a fast, memory-efficient search algorithm, provides more detailed and accurate variant analyses than other methods. We apply HISAT2 for HLA typing and DNA fingerprinting; both applications form part of the HISAT-genotype software that enables analysis of haplotype-resolved genes or genomic regions. HISAT-genotype outperforms other computational methods and matches or exceeds the performance of laboratory-based assays.

  15. BIDS - Alipanahi B, Delong A, Weirauch MT, Frey BJ. 2015. Predicting the sequence specificities of DNA- and RNA-binding proteins by deep learning. Nat Biotechnol. 2015 Aug;33(8):831-8. doi: 10.1038/nbt.3300. Epub 2015 Jul 27. https://www.nature.com/articles/nbt.3300.

    Knowing the sequence specificities of DNA- and RNA-binding proteins is essential for developing models of the regulatory processes in biological systems and for identifying causal disease variants. Here we show that sequence specificities can be ascertained from experimental data with 'deep learning' techniques, which offer a scalable, flexible and unified computational approach for pattern discovery. Using a diverse array of experimental data and evaluation metrics, we find that deep learning outperforms other state-of-the-art methods, even when training on in vitro data and testing on in vivo data. We call this approach DeepBind and have built a stand-alone software tool that is fully automatic and handles millions of sequences per experiment. Specificities determined by DeepBind are readily visualized as a weighted ensemble of position weight matrices or as a 'mutation map' that indicates how variations affect binding within a specific sequence.
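The 'mutation map' idea mentioned above, scoring every single-base substitution against the reference sequence, can be sketched with any sequence-scoring function. The toy below substitutes a single position weight matrix for the trained neural network, so it illustrates the bookkeeping only and is not DeepBind's model; the matrix values and function names are made up.

```python
BASES = "ACGT"


def pwm_score(seq, pwm):
    """Score a sequence by summing per-position weights (stand-in for a trained model)."""
    return sum(pwm[i][base] for i, base in enumerate(seq))


def mutation_map(seq, pwm):
    """Return {(position, alt_base): score change} for every single-base substitution."""
    ref = pwm_score(seq, pwm)
    return {
        (i, alt): pwm_score(seq[:i] + alt + seq[i + 1:], pwm) - ref
        for i in range(len(seq))
        for alt in BASES
        if alt != seq[i]
    }


# Hypothetical 2-position matrix that prefers the motif "AC".
pwm = [
    {"A": 1.0, "C": 0.0, "G": 0.0, "T": 0.0},
    {"A": 0.0, "C": 2.0, "G": 0.0, "T": 0.0},
]
effects = mutation_map("AC", pwm)
```

Negative entries mark substitutions predicted to weaken binding. With the matrix above, changing position 1 from `C` to any other base costs 2.0, while changing position 0 from `A` costs 1.0.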

  16. BIDS - Dobin A, Gingeras TR. Mapping RNA-seq Reads with STAR. Curr Protoc Bioinformatics. 2015 Sep 3;51:11.14.1-11.14.19. https://www.ncbi.nlm.nih.gov/pubmed/26334920

    Mapping of large sets of high-throughput sequencing reads to a reference genome is one of the foundational steps in RNA-seq data analysis. The STAR software package performs this task with high levels of accuracy and speed. In addition to detecting annotated and novel splice junctions, STAR is capable of discovering more complex RNA sequence arrangements, such as chimeric and circular RNA. STAR can align spliced sequences of any length with moderate error rates, providing scalability for emerging sequencing technologies. STAR generates output files that can be used for many downstream analyses such as transcript/gene expression quantification, differential gene expression, novel isoform reconstruction, and signal visualization. In this unit, we describe computational protocols that produce various output files, use different RNA-seq datatypes, and utilize different mapping strategies. STAR is open source software that can be run on Unix, Linux, or Mac OS X systems.

  17. BIDS - Ryan R Wick, Louise M Judd, Claire L Gorrie, Kathryn E Holt. Unicycler: Resolving bacterial genome assemblies from short and long sequencing reads. PLoS Computational Biology. 2017 13(6):e1005595. https://pubmed.ncbi.nlm.nih.gov/28594827/

    The Illumina DNA sequencing platform generates accurate but short reads, which can be used to produce accurate but fragmented genome assemblies. Pacific Biosciences and Oxford Nanopore Technologies DNA sequencing platforms generate long reads that can produce complete genome assemblies, but the sequencing is more expensive and error-prone. There is significant interest in combining data from these complementary sequencing technologies to generate more accurate "hybrid" assemblies. However, few tools exist that truly leverage the benefits of both types of data, namely the accuracy of short reads and the structural resolving power of long reads. Here we present Unicycler, a new tool for assembling bacterial genomes from a combination of short and long reads, which produces assemblies that are accurate, complete and cost-effective. Unicycler builds an initial assembly graph from short reads using the de novo assembler SPAdes and then simplifies the graph using information from short and long reads. Unicycler uses a novel semi-global aligner to align long reads to the assembly graph. Tests on both synthetic and real reads show Unicycler can assemble larger contigs with fewer misassemblies than other hybrid assemblers, even when long-read depth and accuracy are low. Unicycler is open source (GPLv3) and available at github.com/rrwick/Unicycler.

  18. Uri Ben-David, Benjamin Siranosian, Gavin Ha, ..., Todd R Golub. Genetic and transcriptional evolution alters cancer cell line drug response. Nature. 2018 560(7718):325-330. https://pubmed.ncbi.nlm.nih.gov/30089904/

    Human cancer cell lines are the workhorse of cancer research. Although cell lines are known to evolve in culture, the extent of the resultant genetic and transcriptional heterogeneity and its functional consequences remain understudied. Here we use genomic analyses of 106 human cell lines grown in two laboratories to show extensive clonal diversity. Further comprehensive genomic characterization of 27 strains of the common breast cancer cell line MCF7 uncovered rapid genetic diversification. Similar results were obtained with multiple strains of 13 additional cell lines. Notably, genetic changes were associated with differential activation of gene expression programs and marked differences in cell morphology and proliferation. Barcoding experiments showed that cell line evolution occurs as a result of positive clonal selection that is highly sensitive to culture conditions. Analyses of single-cell-derived clones demonstrated that continuous instability quickly translates into heterogeneity of the cell line. When the 27 MCF7 strains were tested against 321 anti-cancer compounds, we uncovered considerably different drug responses: at least 75% of compounds that strongly inhibited some strains were completely inactive in others. This study documents the extent, origins and consequences of genetic variation within cell lines, and provides a framework for researchers to measure such variation in efforts to support maximally reproducible cancer research.

  19. BIDS - Rubinacci, S., Ribeiro, D.M., Hofmeister, R.J. et al. Efficient phasing and imputation of low-coverage sequencing data using large reference panels. Nat Genet 53, 120–126 (2021). https://doi.org/10.1038/s41588-020-00756-0

    Low-coverage whole-genome sequencing followed by imputation has been proposed as a cost-effective genotyping approach for disease and population genetics studies. However, its competitiveness against SNP arrays is undermined because current imputation methods are computationally expensive and unable to leverage large reference panels. Here, we describe a method, GLIMPSE, for phasing and imputation of low-coverage sequencing datasets from modern reference panels. We demonstrate its remarkable performance across different coverages and human populations. GLIMPSE achieves imputation of a genome for less than US$1 in computational cost, considerably outperforming other methods and improving imputation accuracy over the full allele frequency range. As a proof of concept, we show that 1× coverage enables effective gene expression association studies and outperforms dense SNP arrays in rare variant burden tests. Overall, this study illustrates the promising potential of low-coverage imputation and suggests a paradigm shift in the design of future genomic studies.

  20. BIDS - Davies, R.W., Kucka, M., Su, D. et al. Rapid genotype imputation from sequence with reference panels. Nat Genet 53, 1104–1111 (2021). https://doi.org/10.1038/s41588-021-00877-0

    Inexpensive genotyping methods are essential to modern genomics. Here we present QUILT, which performs diploid genotype imputation using low-coverage whole-genome sequence data. QUILT employs Gibbs sampling to partition reads into maternal and paternal sets, facilitating rapid haploid imputation using large reference panels. We show this partitioning to be accurate over many megabases, enabling highly accurate imputation close to theoretical limits and outperforming existing methods. Moreover, QUILT can impute accurately using diverse technologies, including long reads from Oxford Nanopore Technologies, and a new form of low-cost barcoded Illumina sequencing called haplotagging, with the latter showing improved accuracy at low coverages. Relative to DNA genotyping microarrays, QUILT offers improved accuracy at reduced cost, particularly for diverse populations that are traditionally underserved in modern genomic analyses, with accuracy nearly doubling at rare SNPs. Finally, QUILT can accurately impute (four-digit) human leukocyte antigen types, the first such method from low-coverage sequence data.

  21. de Goede OM, Nachun DC, Ferraro NM, Gloudemans MJ, Rao AS, Smail C, Eulalio TY, Aguet F, Ng B, Xu J, Barbeira AN. Population-scale tissue transcriptomics maps long non-coding RNAs to complex disease. Cell. 2021 May 13;184(10):2633-48. https://pubmed.ncbi.nlm.nih.gov/33864768/

    Long non-coding RNA (lncRNA) genes have well-established and important impacts on molecular and cellular functions. However, among the thousands of lncRNA genes, it is still a major challenge to identify the subset with disease or trait relevance. To systematically characterize these lncRNA genes, we used Genotype Tissue Expression (GTEx) project v8 genetic and multi-tissue transcriptomic data to profile the expression, genetic regulation, cellular contexts, and trait associations of 14,100 lncRNA genes across 49 tissues for 101 distinct complex genetic traits. Using these approaches, we identified 1,432 lncRNA gene-trait associations, 800 of which were not explained by stronger effects of neighboring protein-coding genes. This included associations between lncRNA quantitative trait loci and inflammatory bowel disease, type 1 and type 2 diabetes, and coronary artery disease, as well as rare variant associations to body mass index.

  22. Castel SE, Aguet F, Mohammadi P, Ardlie KG, Lappalainen T. A vast resource of allelic expression data spanning human tissues. Genome biology. 2020 Dec;21(1):1-2. https://pubmed.ncbi.nlm.nih.gov/32912332/

    Allele expression (AE) analysis robustly measures cis-regulatory effects. Here, we present and demonstrate the utility of a vast AE resource generated from the GTEx v8 release, containing 15,253 samples spanning 54 human tissues for a total of 431 million measurements of AE at the SNP level and 153 million measurements at the haplotype level. In addition, we develop an extension of our tool phASER that allows effect sizes of cis-regulatory variants to be estimated using haplotype-level AE data. This AE resource is the largest to date, and we are able to make haplotype-level data publicly available. We anticipate that the availability of this resource will enable future studies of regulatory variation across human tissues.

  23. BIDS - Müller NF, Wagner C, Frazar CD, Roychoudhury P, Lee J, Moncla LH, Pelle B, Richardson M, Ryke E, Xie H, Shrestha L. Viral genomes reveal patterns of the SARS-CoV-2 outbreak in Washington State. Science Translational Medicine. 2021 May 26;13(595). https://pubmed.ncbi.nlm.nih.gov/33941621/

    The rapid spread of severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2) has gravely affected societies around the world. Outbreaks in different parts of the globe have been shaped by repeated introductions of new viral lineages and subsequent local transmission of those lineages. Here, we sequenced 3940 SARS-CoV-2 viral genomes from Washington State (USA) to characterize how the spread of SARS-CoV-2 in Washington State in early 2020 was shaped by differences in timing of mitigation strategies across counties and by repeated introductions of viral lineages into the state. In addition, we show that the increase in frequency of a potentially more transmissible viral variant (614G) over time can potentially be explained by regional mobility differences and multiple introductions of 614G but not the other variant (614D) into the state. At an individual level, we observed evidence of higher viral loads in patients infected with the 614G variant. However, using clinical records data, we did not find any evidence that the 614G variant affects clinical severity or patient outcomes. Overall, this suggests that with regard to D614G, the behavior of individuals has been more important in shaping the course of the pandemic in Washington State than this variant of the virus.

  24. BIDS - Rentzsch P, Schubach M, Shendure J, Kircher M. CADD-Splice—improving genome-wide variant effect prediction using deep learning-derived splice scores. Genome medicine. 2021 Dec;13(1):1-2. https://pubmed.ncbi.nlm.nih.gov/33618777/

    Background: Splicing of genomic exons into mRNAs is a critical prerequisite for the accurate synthesis of human proteins. Genetic variants impacting splicing underlie a substantial proportion of genetic disease, but are challenging to identify beyond those occurring at donor and acceptor dinucleotides. To address this, various methods aim to predict variant effects on splicing. Recently, deep neural networks (DNNs) have been shown to achieve better results in predicting splice variants than other strategies. Methods: It has been unclear how best to integrate such process-specific scores into genome-wide variant effect predictors. Here, we use a recently published experimental data set to compare several machine learning methods that score variant effects on splicing. We integrate the best of those approaches into general variant effect prediction models and observe the effect on classification of known pathogenic variants. Results: We integrate two specialized splicing scores into CADD (Combined Annotation Dependent Depletion; cadd.gs.washington.edu), a widely used tool for genome-wide variant effect prediction that we previously developed to weight and integrate diverse collections of genomic annotations. With this new model, CADD-Splice, we show that inclusion of splicing DNN effect scores substantially improves predictions across multiple variant categories, without compromising overall performance. Conclusions: While splice effect scores show superior performance on splice variants, specialized predictors cannot compete with other variant scores in general variant interpretation, as the latter account for nonsense and missense effects that do not alter splicing. Although only shown here for splice scores, we believe that the applied approach will generalize to other specific molecular processes, providing a path for the further improvement of genome-wide variant effect prediction.

  25. BIDS - Avsec, Ž., Agarwal, V., Visentin, D., Ledsam, J.R., Grabska-Barwinska, A., Taylor, K.R., Assael, Y., Jumper, J., Kohli, P. and Kelley, D.R., 2021. Effective gene expression prediction from sequence by integrating long-range interactions. Nature methods, 18(10), pp.1196-1203. https://www.ncbi.nlm.nih.gov/pmc/articles/PMC8490152/

    How noncoding DNA determines gene expression in different cell types is a major unsolved problem, and critical downstream applications in human genetics depend on improved solutions. Here, we report substantially improved gene expression prediction accuracy from DNA sequences through the use of a deep learning architecture, called Enformer, that is able to integrate information from long-range interactions (up to 100 kb away) in the genome. This improvement yielded more accurate variant effect predictions on gene expression for both natural genetic variants and saturation mutagenesis measured by massively parallel reporter assays. Furthermore, Enformer learned to predict enhancer-promoter interactions directly from the DNA sequence competitively with methods that take direct experimental data as input. We expect that these advances will enable more effective fine-mapping of human disease associations and provide a framework to interpret cis-regulatory evolution.

  26. BIDS - Zargari, A., Lodewijk, G.A., Mashhadi, N., Cook, N., Neudorf, C.W., Araghbidikashani, K., Hays, R., Kozuki, S., Rubio, S., Hrabeta-Robinson, E. and Brooks, A., 2023. DeepSea is an efficient deep-learning model for single-cell segmentation and tracking in time-lapse microscopy. Cell Reports Methods. https://www.ncbi.nlm.nih.gov/pmc/articles/PMC10326378/

    Time-lapse microscopy is the only method that can directly capture the dynamics and heterogeneity of fundamental cellular processes at the single-cell level with high temporal resolution. Successful application of single-cell time-lapse microscopy requires automated segmentation and tracking of hundreds of individual cells over several time points. However, segmentation and tracking of single cells remain challenging for the analysis of time-lapse microscopy images, in particular for widely available and non-toxic imaging modalities such as phase-contrast imaging. This work presents a versatile and trainable deep-learning model, termed DeepSea, that allows for both segmentation and tracking of single cells in sequences of phase-contrast live microscopy images with higher precision than existing models. We showcase the application of DeepSea by analyzing cell size regulation in embryonic stem cells.

  27. Orenbuch, R., Filip, I., Comito, D., Shaman, J., Pe’er, I. and Rabadan, R., 2020. arcasHLA: high-resolution HLA typing from RNAseq. Bioinformatics, 36(1), pp.33-40. https://www.ncbi.nlm.nih.gov/pmc/articles/PMC6956775/

    Motivation: The human leukocyte antigen (HLA) locus plays a critical role in tissue compatibility and regulates the host response to many diseases, including cancers and autoimmune disorders. Recent improvements in the quality and accessibility of next-generation sequencing have made HLA typing from standard short-read data practical. However, this task remains challenging given the high level of polymorphism and homology between HLA genes. HLA typing from RNA sequencing is further complicated by post-transcriptional modifications and bias due to amplification. Results: Here, we present arcasHLA: a fast and accurate in silico tool that infers HLA genotypes from RNA-sequencing data. Our tool outperforms established tools on the gold-standard benchmark dataset for HLA typing in terms of both accuracy and speed, with an accuracy rate of 100% at two-field resolution for Class I genes, and over 99.7% for Class II. Furthermore, we evaluate the performance of our tool on a new biological dataset of 447 single-end total RNA samples from nasopharyngeal swabs, and establish the applicability of arcasHLA in metatranscriptome studies. Availability and implementation: arcasHLA is available at https://github.com/RabadanLab/arcasHLA.

  28. Beagrie, R.A., Thieme, C.J., Annunziatella, C., Baugher, C., Zhang, Y., Schueler, M., Kukalev, A., Kempfer, R., Chiariello, A.M., Bianco, S. and Li, Y., 2023. Multiplex-GAM: genome-wide identification of chromatin contacts yields insights overlooked by Hi-C. Nature Methods, pp.1-11. https://www.ncbi.nlm.nih.gov/pmc/articles/PMC10333126/

    Technology for measuring 3D genome topology is increasingly important for studying gene regulation, for genome assembly and for mapping of genome rearrangements. Hi-C and other ligation-based methods have become routine but have specific biases. Here, we develop multiplex-GAM, a faster and more affordable version of genome architecture mapping (GAM), a ligation-free technique that maps chromatin contacts genome-wide. We perform a detailed comparison of multiplex-GAM and Hi-C using mouse embryonic stem cells. When examining the strongest contacts detected by either method, we find that only one-third of these are shared. The strongest contacts specifically found in GAM often involve 'active' regions, including many transcribed genes and super-enhancers, whereas in Hi-C they more often contain 'inactive' regions. Our work shows that active genomic regions are involved in extensive complex contacts that are currently underestimated in ligation-based approaches, and highlights the need for orthogonal advances in genome-wide contact mapping technologies.

  29. Zhong, Y., Perera, M.A. and Gamazon, E.R., 2019. On using local ancestry to characterize the genetic architecture of human traits: genetic regulation of gene expression in multiethnic or admixed populations. The American Journal of Human Genetics, 104(6), pp.1097-1115. https://www.ncbi.nlm.nih.gov/pmc/articles/PMC6562007/

    Understanding the nature of the genetic regulation of gene expression promises to advance our understanding of the genetic basis of disease. However, the methodological impact of the use of local ancestry on high-dimensional omics analyses, including, most prominently, expression quantitative trait loci (eQTL) mapping and trait heritability estimation, in admixed populations remains critically underexplored. Here, we develop a statistical framework that characterizes the relationships among the determinants of the genetic architecture of an important class of molecular traits. We provide a computationally efficient approach to local ancestry analysis in eQTL mapping while increasing control of type I and type II error over traditional approaches. Applying our method to National Institute of General Medical Sciences (NIGMS) and Genotype-Tissue Expression (GTEx) datasets, we show that the use of local ancestry can improve eQTL mapping in admixed and multiethnic populations, respectively. We estimate the trait variance explained by ancestry by using local admixture relatedness between individuals. By using simulations of diverse genetic architectures and degrees of confounding, we show improved accuracy in estimating heritability when accounting for local ancestry similarity. Furthermore, we characterize the sparse versus polygenic components of gene expression in admixed individuals. Our study has important methodological implications for genetic analysis of omics traits across a range of genomic contexts, from a single variant to a prioritized region to the entire genome. Our findings highlight the importance of using local ancestry to better characterize the heritability of complex traits and to more accurately map genetic associations.

  30. Chan, M.M., Sadeghi-Alavijeh, O., Lopes, F.M., Hilger, A.C., Stanescu, H.C., Voinescu, C.D., Beaman, G.M., Newman, W.G., Zaniew, M., Weber, S. and Ho, Y.M., 2022. Diverse ancestry whole-genome sequencing association study identifies TBX5 and PTK7 as susceptibility genes for posterior urethral valves. Elife, 11, p.e74777. https://www.ncbi.nlm.nih.gov/pmc/articles/PMC9512401/

    Posterior urethral valves (PUV) are the commonest cause of end-stage renal disease in children, but the genetic architecture of this rare disorder remains unknown. We performed a sequencing-based genome-wide association study (seqGWAS) in 132 unrelated male PUV cases and 23,727 controls of diverse ancestry, identifying statistically significant associations with common variants at 12q24.21 (p=7.8 × 10⁻¹²; OR 0.4) and rare variants at 6p21.1 (p=2.0 × 10⁻⁸; OR 7.2), that were replicated in an independent European cohort of 395 cases and 4151 controls. Fine mapping and functional genomic data mapped these loci to the transcription factor TBX5 and planar cell polarity gene PTK7, respectively, the encoded proteins of which were detected in the developing urinary tract of human embryos. We also observed enrichment of rare structural variation intersecting with candidate cis-regulatory elements, particularly inversions predicted to affect chromatin looping (p=3.1 × 10⁻⁵). These findings represent the first robust genetic associations of PUV, providing novel insights into the underlying biology of this poorly understood disorder and demonstrate how a diverse ancestry seqGWAS can be used for disease locus discovery in a rare disease.

  31. Fu, J.M., Satterstrom, F.K., Peng, M., Brand, H., Collins, R.L., Dong, S., Wamsley, B., Klei, L., Wang, L., Hao, S.P. and Stevens, C.R., 2022. Rare coding variation provides insight into the genetic architecture and phenotypic context of autism. Nature genetics, 54(9), pp.1320-1331. https://www.ncbi.nlm.nih.gov/pmc/articles/PMC9653013/

    Some individuals with autism spectrum disorder (ASD) carry functional mutations rarely observed in the general population. We explored the genes disrupted by these variants from joint analysis of protein-truncating variants (PTVs), missense variants and copy number variants (CNVs) in a cohort of 63,237 individuals. We discovered 72 genes associated with ASD at false discovery rate (FDR) ≤ 0.001 (185 at FDR ≤ 0.05). De novo PTVs, damaging missense variants and CNVs represented 57.5%, 21.1% and 8.44% of association evidence, while CNVs conferred greatest relative risk. Meta-analysis with cohorts ascertained for developmental delay (DD) (n = 91,605) yielded 373 genes associated with ASD/DD at FDR ≤ 0.001 (664 at FDR ≤ 0.05), some of which differed in relative frequency of mutation between ASD and DD cohorts. The DD-associated genes were enriched in transcriptomes of progenitor and immature neuronal cells, whereas genes showing stronger evidence in ASD were more enriched in maturing neurons and overlapped with schizophrenia-associated genes, emphasizing that these neuropsychiatric disorders may share common pathways to risk.

  32. BIDS - Li, J., Bzdok, D., Chen, J., Tam, A., Ooi, L.Q.R., Holmes, A.J., Ge, T., Patil, K.R., Jabbi, M., Eickhoff, S.B. and Yeo, B.T., 2022. Cross-ethnicity/race generalization failure of behavioral prediction from resting-state functional connectivity. Science advances, 8(11), p.eabj1812. https://pubmed.ncbi.nlm.nih.gov/35294251/

    Algorithmic biases that favor majority populations pose a key challenge to the application of machine learning for precision medicine. Here, we assessed such bias in prediction models of behavioral phenotypes from brain functional magnetic resonance imaging. We examined the prediction bias using two independent datasets (preadolescent versus adult) of mixed ethnic/racial composition. When predictive models were trained on data dominated by white Americans (WA), out-of-sample prediction errors were generally higher for African Americans (AA) than for WA. This bias toward WA corresponds to more WA-like brain-behavior association patterns learned by the models. When models were trained on AA only, compared to training only on WA or an equal number of AA and WA participants, AA prediction accuracy improved but stayed below that for WA. Overall, the results point to the need for caution and further research regarding the application of current brain-behavior prediction models in minority populations.

  33. Avsec, Ž., Weilert, M., Shrikumar, A., Krueger, S., Alexandari, A., Dalal, K., Fropf, R., McAnany, C., Gagneur, J., Kundaje, A., et al. 2021. Base-resolution models of transcription-factor binding reveal soft motif syntax. Nat. Genet. 53, 354–366. https://www.nature.com/articles/s41588-021-00782-6

    The arrangement (syntax) of transcription factor (TF) binding motifs is an important part of the cis-regulatory code, yet remains elusive. We introduce a deep learning model, BPNet, that uses DNA sequence to predict base-resolution chromatin immunoprecipitation (ChIP)–nexus binding profiles of pluripotency TFs. We develop interpretation tools to learn predictive motif representations and identify soft syntax rules for cooperative TF binding interactions. Strikingly, Nanog preferentially binds with helical periodicity, and TFs often cooperate in a directional manner, which we validate using clustered regularly interspaced short palindromic repeat (CRISPR)-induced point mutations. Our model represents a powerful general approach to uncover the motifs and syntax of cis-regulatory sequences in genomics data.