cgMLSTsearch is a proof of concept implementation for quickly searching a cgMLST profile against a database. So far there is no handling of missing data.
The implementation drew inspiration from this paper: https://arxiv.org/ftp/arxiv/papers/1505/1505.03090.pdf
--create-sequences Creates a new set of sequences
--create-index Recreates the index
--distance [int] the distance from the query within which matches are returned.
--naive Naive search that compares the query to each database sequence, and drops off when the search distance is exceeded.
--trie Search that uses a forest of random order tries to limit the search space.
--heuristic This option changes the comparison to one that uses a binomial confidence interval to end the comparison early, for very similar or disimilar profiles. Mainly useful if comparisons are slow due to high distance (=slow dropoff and more comparisons)
On a laptop cpu, actually running the program takes less than 2 seconds to search 200.000 sequences. The actual search (with indexing) takes 89 milliseconds according to cProfile.
The search is parallelizable but this is hardly necessary with current database sizes.
In the current implementation of indexed search, hits will be missed when using a large search radius.
Store index and allele profiles in mongodb.
/Outline/
- Collection for allele profiles
- Collection of indices
- Indices store ObjectID for profiles
- Explore if search can happen directly in DB or will have to be processed client side