As of today, there are more than 140,000 E. coli genomes available in public databases. While the data are widely available, collating them and extracting meaningful information often requires multiple steps, computational resources and expert knowledge. Here, we collate a high-quality and comprehensive set of over 10,000 E. coli genomes, isolated from human hosts, into a set of manageable files that offer an accessible and usable snapshot of the currently available genome data, linked to a minimal data quality standard. The data provided include a detailed synopsis of the main lineages present, including their antimicrobial and virulence profiles, their complete gene content, and all the associated metadata for each genome. The collection also includes a database that enables the user to compare newly sequenced isolates against the assembled genomes. Additionally, we provide a searchable index that allows the user to query any DNA sequence against the assemblies of the collection. This collection paves the way for many future studies, including those investigating the differences between E. coli lineages, following the evolution of different genes in the E. coli pan-genome and exploring the dynamics of horizontal gene transfer in this important organism.
Data Summary
- The complete aggregated metadata of 10,146 high-quality genomes isolated from human hosts (https://figshare.com/s/f1c581d39b3d1dbd0091, File F1).
- A PopPUNK database which can be used to query any genome and examine its context relative to this collection (Deposited to doi.org/10.6084/m9.figshare.12650834).
- A BIGSI index of all the genomes which can be used to easily and quickly query the genomes for any DNA sequence of 61 bp or longer (Deposited to doi.org/10.6084/m9.figshare.12666497).
- Description and complete profiling of the 50 largest lineages, which represent the majority of publicly available human-isolated E. coli genomes (https://figshare.com/s/f1c581d39b3d1dbd0091, File F2). Phylogenetic trees of representative genomes of these lineages, presented in this manuscript, are also provided (https://figshare.com/s/f1c581d39b3d1dbd0091, Files tree_500.nwk and tree_50.nwk).
- The complete pan-genome of the 50 largest lineages, which includes:
a. A FASTA file containing a single representative sequence of each gene of the gene pool (https://figshare.com/s/f1c581d39b3d1dbd0091, File F3).
b. Complete gene presence-absence across all isolates (https://figshare.com/s/f1c581d39b3d1dbd0091, File F4).
c. The frequency of each gene within each of the lineages (https://figshare.com/s/f1c581d39b3d1dbd0091, File F5).
d. The representative sequences from each lineage for all the genes (https://figshare.com/s/f1c581d39b3d1dbd0091, File F6).
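To illustrate how a per-lineage gene frequency table (item c) relates to a gene presence-absence matrix (item b), the following is a minimal sketch in Python. It does not read the deposited files, whose exact layout is not described here; the in-memory structures, the isolate and gene names, and the function `gene_frequency_by_lineage` are all hypothetical and chosen only for illustration.

```python
from collections import defaultdict


def gene_frequency_by_lineage(presence, lineage_of):
    """Compute the fraction of isolates in each lineage that carry each gene.

    presence   : dict mapping gene name -> set of isolate IDs carrying the gene
                 (an assumed, simplified form of a presence-absence matrix)
    lineage_of : dict mapping isolate ID -> lineage label (assumed metadata)
    """
    # Count how many isolates belong to each lineage.
    lineage_sizes = defaultdict(int)
    for lineage in lineage_of.values():
        lineage_sizes[lineage] += 1

    frequencies = {}
    for gene, isolates in presence.items():
        # Count carriers of this gene within each lineage.
        carriers = defaultdict(int)
        for isolate in isolates:
            carriers[lineage_of[isolate]] += 1
        # Lineages with no carriers get a frequency of 0.0.
        frequencies[gene] = {
            lineage: carriers[lineage] / size
            for lineage, size in lineage_sizes.items()
        }
    return frequencies


# Toy data (hypothetical isolates, lineages and genes).
lineage_of = {"iso1": "L1", "iso2": "L1", "iso3": "L2", "iso4": "L2"}
presence = {
    "geneA": {"iso1", "iso2", "iso3", "iso4"},  # present in every isolate
    "geneB": {"iso1"},                          # present in half of L1 only
    "geneC": {"iso3", "iso4"},                  # present in all of L2 only
}

freqs = gene_frequency_by_lineage(presence, lineage_of)
for gene in sorted(freqs):
    print(gene, freqs[gene])
```

In this toy example, a gene carried by every isolate has a frequency of 1.0 in both lineages, whereas a lineage-specific gene has a frequency of 1.0 in one lineage and 0.0 in the other. Applying the same idea to the deposited data would require first parsing Files F1 and F4 into comparable structures.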