E. coli Genome Collection

At the time of writing, more than 140,000 E. coli genomes are available in public databases. Although the data are widely available, collating them and extracting meaningful information often requires multiple steps, computational resources and expert knowledge. Here, we collate a high-quality, comprehensive set of over 10,000 E. coli genomes isolated from human hosts into a set of manageable files that offer an accessible and usable snapshot of the currently available genome data, linked to a minimal data quality standard. The data provided include a detailed synopsis of the main lineages present, including their antimicrobial resistance and virulence profiles, their complete gene content, and all the associated metadata for each genome. They also include a database that enables the user to compare newly sequenced isolates against the assembled genomes. Additionally, we provide a searchable index that allows the user to query any DNA sequence against the assemblies of the collection. This collection paves the way for many future studies, including those investigating the differences between E. coli lineages, following the evolution of different genes in the E. coli pan-genome and exploring the dynamics of horizontal gene transfer in this important organism.

Data Summary

The complete aggregated metadata of 10,146 high-quality genomes isolated from human hosts (https://figshare.com/s/f1c581d39b3d1dbd0091, File F1).
A PopPUNK database that can be used to query any genome and examine its context relative to this collection (deposited at doi.org/10.6084/m9.figshare.12650834).
A BIGSI index of all the genomes that can be used to quickly query the genomes for any DNA sequence of 61 bp or longer (deposited at doi.org/10.6084/m9.figshare.12666497).
A description and complete profiling of the 50 largest lineages, which represent the majority of publicly available human-isolated E. coli genomes (https://figshare.com/s/f1c581d39b3d1dbd0091, File F2). Phylogenetic trees of representative genomes of these lineages, presented in the manuscript, are also provided (https://figshare.com/s/f1c581d39b3d1dbd0091, Files tree_500.nwk and tree_50.nwk).
The complete pan-genome of the 50 largest lineages, which includes:

a. A FASTA file containing a single representative sequence of each gene of the gene pool (https://figshare.com/s/f1c581d39b3d1dbd0091, File F3).

b. Complete gene presence-absence across all isolates (https://figshare.com/s/f1c581d39b3d1dbd0091, File F4).

c. The frequency of each gene within each of the lineages (https://figshare.com/s/f1c581d39b3d1dbd0091, File F5).

d. The representative sequences from each lineage for all the genes (https://figshare.com/s/f1c581d39b3d1dbd0091, File F6).
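The relationship between the gene presence-absence data (item b) and the per-lineage gene frequencies (item c) can be illustrated with a minimal sketch. It does not read the actual files: the layouts of File F4 and File F5 are not described here, so the isolate names, lineage labels, gene names and the function `gene_frequency_per_lineage` below are all hypothetical and exist only to show the calculation on toy data.

```python
from collections import defaultdict


def gene_frequency_per_lineage(presence, lineage_of):
    """Return {gene: {lineage: fraction of that lineage's isolates carrying the gene}}.

    presence   -- dict mapping gene name -> set of isolate IDs carrying the gene
    lineage_of -- dict mapping isolate ID -> lineage label
    Both input structures are assumptions made for this example only.
    """
    # Count how many isolates belong to each lineage.
    lineage_sizes = defaultdict(int)
    for lineage in lineage_of.values():
        lineage_sizes[lineage] += 1

    frequencies = {}
    for gene, isolates in presence.items():
        # Count the carriers of this gene within each lineage.
        carriers = defaultdict(int)
        for isolate in isolates:
            carriers[lineage_of[isolate]] += 1
        # Lineages with no carriers get a frequency of 0.0.
        frequencies[gene] = {
            lineage: carriers[lineage] / size
            for lineage, size in lineage_sizes.items()
        }
    return frequencies


# Toy data (hypothetical): four isolates in two lineages, three genes.
lineage_of = {"iso1": "L1", "iso2": "L1", "iso3": "L2", "iso4": "L2"}
presence = {
    "geneA": {"iso1", "iso2", "iso3", "iso4"},  # carried by every isolate
    "geneB": {"iso1"},                          # carried by half of L1 only
    "geneC": {"iso3", "iso4"},                  # carried by all of L2 only
}

freqs = gene_frequency_per_lineage(presence, lineage_of)
for gene in sorted(freqs):
    print(gene, freqs[gene])
```

In this toy example, a gene such as `geneA` that is present in every isolate of every lineage behaves like a core gene, whereas `geneC` is specific to a single lineage. Applying the same idea to the real collection would first require parsing File F4 and the lineage assignments in the metadata according to their actual formats.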

bioRxiv Publication

https://www.biorxiv.org/content/10.1101/2020.09.21.293175v1

Folders and files

11_pairwise_roary
12_correct_pan_genome
13_AMR_vir
4_file_paths
5_QC
6_poppunk_running
7_treemer_reps
8_phylo_tree
9_pan_genome_per_lineage
READS
.gitignore
10_choose_longest_rep.py
1_aggregate_data.R
2_fix_metadata.py
3_add_conversion.py
README.md
change_filenames.py
get_sanger_sequences.sh

ghoresh11/ecoli_genome_collection
