pn2.0-mlst-databases

These are databases for use with the PN2.0 caller.

Databases

scheme	target	directory
Campylobacter	C. jejuni, C. coli, C. fetus, C. upsaliensis, and C. lari	db/CAMPY
C. botulinum	C. botulinum	db/CBOT
Cronobacter	Cronobacter spp.	db/CRONO
Listeria	Listeria monocytogenes	db/LISTERIA
Salmonella	Salmonella spp.	db/SALM
STEC	Escherichia, especially Shiga toxin producing E. coli or Shigella	db/STEC
Vibrio	V. cholerae, V. vulnificus, and V. parahaemolyticus	db/VIBR

Database structure

An MLST database has several standard files in a directory. More details for the files are given in subsections below the table.

filename	description
alleles.fasta.gz	(optional) A compressed fasta file of all entries in the blast database
alleles_0.*	The blast database
aleleleinfo.txt_0	A four-column file describing each allele
loci.tsv	A two column file describing each locus
loci/	(optional) A directory of alternative locus labels. In the case of Vibrio, it shows different labels per organism.
OrganismSettings.json	Description of custom settings per schema

alleleinfo.txt_0

This file has four columns:

allele, e.g., SALM_1_1
locus, e.g., SALM_1
length of allele in nucleotides, e.g., 714
Is the start and stop required (1) or optional (0)? This is a boolean 1 or 0.

Example:

SALM_1_1        SALM_1  714     1
SALM_1_2        SALM_1  714     1
SALM_2_1        SALM_2  228     1
SALM_25365_823  SALM_25365      501     0
SALM_25365_824  SALM_25365      501     0
SALM_25365_825  SALM_25365      501     0

loci.tsv

This is a tab delimited file containing the locus ID and its respective core/accessory label.

Example:

ID      allele_type
SALM_12272      core
SALM_13534      core
SALM_13975      core
SALM_9997       accessory
SALM_9998       accessory
SALM_9999       accessory

OrganismSettings.json

Example:

{
  "AllowInternalStopForAccept": "0",
  "AvgQuality": "30.0",
  "DcMegaBlastWordSize": "11, 12",
  "DcMegaBlastWordSizeDefault": "11",
  "ExpectedPresentLoci": "1235",
  "kmerLen": "35",
  "Length": "1600000",
  "MaxNrGapsForAccept": "100",
  "MinHomolForAccept": "70",
  "MinHomolForDetect": "70",
  "MinOccurrenceForAccept": "1",
  "NrAFPresent": "1235",
  "NrBAFPresent": "1235",
  "NrConsensus": "1235",
  "RequireStartStopCodonForAccept": "1",
  "SubmitNewAlleleComment": "[LabID] / [EntryID]"
}

Hashing function

When the PN2.0 caller is run, this is the function for the hashing algorithm. Is is based on the MD5 algorithm, but reduces the value to 56 bits.

def hash_sequence(sequence: str) -> str:

    md5 = hashlib.md5(sequence.encode("utf-8"))
    max_bits_in_result = 56
    p = (1 << max_bits_in_result) - 1
    rest = int(md5.hexdigest(), 16)
    result = 0
    while rest != 0:
        result = result ^ (rest & p)
        rest = rest >> max_bits_in_result
    return str(result)

Name		Name	Last commit message	Last commit date
Latest commit History 4 Commits
.github/workflows		.github/workflows
db		db
README.md		README.md

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

pn2.0-mlst-databases

Databases

Database structure

alleleinfo.txt_0

loci.tsv

OrganismSettings.json

Hashing function

About

Releases 1

Packages

ncezid-biome/pn2.0-mlst-databases

Folders and files

Latest commit

History

Repository files navigation

pn2.0-mlst-databases

Databases

Database structure

alleleinfo.txt_0

loci.tsv

OrganismSettings.json

Hashing function

About

Resources

Stars

Watchers

Forks

Releases 1

Packages 0

Packages