These are databases for use with the PN2.0 caller.
scheme | target | directory |
---|---|---|
Campylobacter | C. jejuni, C. coli, C. fetus, C. upsaliensis, and C. lari | db/CAMPY |
C. botulinum | C. botulinum | db/CBOT |
Cronobacter | Cronobacter spp. | db/CRONO |
Listeria | Listeria monocytogenes | db/LISTERIA |
Salmonella | Salmonella spp. | db/SALM |
STEC | Escherichia, especially Shiga toxin-producing E. coli and Shigella | db/STEC |
Vibrio | V. cholerae, V. vulnificus, and V. parahaemolyticus | db/VIBR |
An MLST database has several standard files in a directory. More details for the files are given in subsections below the table.
filename | description |
---|---|
alleles.fasta.gz | (optional) A compressed fasta file of all entries in the blast database |
alleles_0.* | The blast database |
alleleinfo.txt_0 | A four-column file describing each allele |
loci.tsv | A two-column, tab-delimited file describing each locus |
loci/ | (optional) A directory of alternative locus labels. In the case of Vibrio, it shows different labels per organism. |
OrganismSettings.json | Custom settings for each scheme |
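A quick way to sanity-check a database directory against the table above is sketched below. The function name `missing_database_files` is hypothetical and not part of the PN2.0 caller. The sketch assumes only what the table states: it looks for the two fixed-name files and for at least one `alleles_0.*` BLAST file, and it ignores the entries marked optional.

```python
from pathlib import Path
from typing import List

# Fixed-name files from the table above that each database is expected to hold.
REQUIRED_FILES = ("loci.tsv", "OrganismSettings.json")


def missing_database_files(db_dir: str) -> List[str]:
    """Return the expected entries that are absent from db_dir (hypothetical helper)."""
    root = Path(db_dir)
    missing = [name for name in REQUIRED_FILES if not (root / name).is_file()]
    # The BLAST database is a set of files sharing the alleles_0 prefix.
    if not any(root.glob("alleles_0.*")):
        missing.append("alleles_0.*")
    return missing
```

For example, `missing_database_files("db/SALM")` would return an empty list when the Salmonella directory is complete.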
The allele information file has four columns:
- allele, e.g., SALM_1_1
- locus, e.g., SALM_1
- length of allele in nucleotides, e.g., 714
- whether the start and stop codons are required (1) or optional (0), as a boolean flag
Example:
SALM_1_1 SALM_1 714 1
SALM_1_2 SALM_1 714 1
SALM_2_1 SALM_2 228 1
SALM_25365_823 SALM_25365 501 0
SALM_25365_824 SALM_25365 501 0
SALM_25365_825 SALM_25365 501 0
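The four columns can be read into typed records with a few lines of Python. This is a minimal sketch, not code from the caller: the names `AlleleInfo` and `parse_allele_info` are hypothetical, and it assumes the columns are separated by whitespace (tabs or spaces) with no header row, as in the example above.

```python
from typing import Iterable, Iterator, NamedTuple


class AlleleInfo(NamedTuple):
    allele: str                # e.g. SALM_1_1
    locus: str                 # e.g. SALM_1
    length: int                # allele length in nucleotides
    start_stop_required: bool  # True when the fourth column is 1


def parse_allele_info(lines: Iterable[str]) -> Iterator[AlleleInfo]:
    """Yield one AlleleInfo per non-empty line of the four-column file."""
    for line in lines:
        fields = line.split()  # assumption: whitespace-separated columns
        if not fields:
            continue
        allele, locus, length, flag = fields
        yield AlleleInfo(allele, locus, int(length), flag == "1")


example = ["SALM_1_1 SALM_1 714 1", "SALM_25365_823 SALM_25365 501 0"]
for record in parse_allele_info(example):
    print(record)
```

Passing an open file handle instead of the `example` list works the same way, since the function only iterates over lines.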
The locus file, loci.tsv, is a tab-delimited file containing each locus ID and its core or accessory label.
Example:
ID allele_type
SALM_12272 core
SALM_13534 core
SALM_13975 core
SALM_9997 accessory
SALM_9998 accessory
SALM_9999 accessory
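Reading loci.tsv into a lookup table makes it easy to count or filter loci by label. The sketch below assumes the file starts with the `ID` and `allele_type` header row shown in the example and that fields are separated by tabs. The function name `read_locus_types` is hypothetical.

```python
import csv
import io
from collections import Counter
from typing import Dict, TextIO


def read_locus_types(handle: TextIO) -> Dict[str, str]:
    """Map each locus ID to its label using the ID and allele_type columns."""
    reader = csv.DictReader(handle, delimiter="\t")
    return {row["ID"]: row["allele_type"] for row in reader}


# In-memory stand-in for loci.tsv, built from rows in the example above.
example = io.StringIO(
    "ID\tallele_type\n"
    "SALM_12272\tcore\n"
    "SALM_9997\taccessory\n"
)
locus_types = read_locus_types(example)
print(Counter(locus_types.values()))
```

With a real database, replace the `io.StringIO` object with `open("db/SALM/loci.tsv", newline="")`.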
Example of OrganismSettings.json:
{
  "AllowInternalStopForAccept": "0",
  "AvgQuality": "30.0",
  "DcMegaBlastWordSize": "11, 12",
  "DcMegaBlastWordSizeDefault": "11",
  "ExpectedPresentLoci": "1235",
  "kmerLen": "35",
  "Length": "1600000",
  "MaxNrGapsForAccept": "100",
  "MinHomolForAccept": "70",
  "MinHomolForDetect": "70",
  "MinOccurrenceForAccept": "1",
  "NrAFPresent": "1235",
  "NrBAFPresent": "1235",
  "NrConsensus": "1235",
  "RequireStartStopCodonForAccept": "1",
  "SubmitNewAlleleComment": "[LabID] / [EntryID]"
}
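Every value in the example is stored as a JSON string, including the numeric ones. The sketch below converts purely numeric strings to `int` or `float` for inspection and leaves everything else, such as `"11, 12"`, as text. It is a convenience for reading the file, not a description of how the caller interprets it. The function name `load_settings` is hypothetical, and the sketch assumes all values are strings as in the example.

```python
import json
from typing import Any, Dict


def load_settings(text: str) -> Dict[str, Any]:
    """Parse the settings JSON and convert purely numeric strings to numbers."""
    converted: Dict[str, Any] = {}
    for key, value in json.loads(text).items():
        try:
            converted[key] = int(value)
        except ValueError:
            try:
                converted[key] = float(value)
            except ValueError:
                converted[key] = value  # leave non-numeric text untouched
    return converted


# A three-key excerpt of the example above.
example = '{"MinHomolForAccept": "70", "AvgQuality": "30.0", "DcMegaBlastWordSize": "11, 12"}'
settings = load_settings(example)
print(settings)
```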
The PN2.0 caller hashes sequences with the function below. It is based on the MD5 algorithm but reduces the 128-bit digest to 56 bits by XORing successive 56-bit chunks of it together.
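import hashlib

def hash_sequence(sequence: str) -> str:
    # MD5 digest of the UTF-8 encoded sequence (128 bits)
    md5 = hashlib.md5(sequence.encode("utf-8"))
    max_bits_in_result = 56
    # Mask that selects the lowest 56 bits
    p = (1 << max_bits_in_result) - 1
    rest = int(md5.hexdigest(), 16)
    result = 0
    # XOR successive 56-bit chunks of the digest together
    while rest != 0:
        result = result ^ (rest & p)
        rest = rest >> max_bits_in_result
    return str(result)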
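Because an MD5 digest is 128 bits long, the loop above sees at most three chunks: two full 56-bit chunks and a final 16-bit remainder. The sketch below spells that out without a loop, as an illustration of the folding step. The name `fold_md5_to_56_bits` is hypothetical; it returns an integer, so wrap the result in `str()` to compare it with the caller's hash.

```python
import hashlib

MASK_56 = (1 << 56) - 1  # selects the lowest 56 bits


def fold_md5_to_56_bits(sequence: str) -> int:
    """XOR the three chunks of the 128-bit MD5 digest into one 56-bit value."""
    raw = hashlib.md5(sequence.encode("utf-8")).digest()
    digest = int.from_bytes(raw, "big")  # same integer as int(hexdigest, 16)
    low = digest & MASK_56               # bits 0 to 55
    middle = (digest >> 56) & MASK_56    # bits 56 to 111
    high = digest >> 112                 # remaining 16 bits, 112 to 127
    return low ^ middle ^ high


value = fold_md5_to_56_bits("ATGAAATAG")
print(value < 2 ** 56)
```

Folding with XOR keeps every bit of the digest involved in the result, whereas simply truncating to 56 bits would discard the upper 72 bits.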