####################################################################################################
#  This application was run using:
####################################################################################################
1. Ruby 1.9.2
2. Gems used: rspec (version 2.6.0). Installing the rspec gem is required only if you want to run the tests.
If you have RVM (see https://rvm.beginrescueend.com) and the bundler gem installed, run "bundle install"
from the command line. Otherwise just run "gem install rspec" from the command line.
3. To run the tests, simply type "rake" at the command line.
4. To run the app, type "ruby world.rb" at the command line.
Note: I added a .rvmrc file that creates a levenshtein gemset against Ruby 1.9.2, so it
doesn't interfere with any existing gemsets. It will also install bundler and use bundler to install
rspec 2. If you don't have RVM, this should not interfere with running the app or tests.

####################################################################################################
#  Execution time
####################################################################################################
This application took ~19 seconds to run (loading the population + building connections + counting nodes for
"causes"). The code was run on OS X with a 2.4 GHz dual-core processor. Execution time also depends on other
CPU/memory-hogging applications running at the same time.

####################################################################################################
#  Approach used
####################################################################################################
1. Load the population: Load the entire population into a hash of hashes with each sub-hash holding 
words of the same length.
2. Build connections: Iterate through the words in each sub-hash (or level) and add friends.
3. Walk and count: Once connections are built, use "causes" as the input, walk the network, and count
the number of nodes.
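A minimal sketch of this three-step flow (the variable names and the inline word list are illustrative
stand-ins, not the actual code in world.rb):

    words = %w[cause causes pauses abuses]          # stand-in for the real word list

    # 1. Load the population into a hash of hashes keyed by word length.
    population = Hash.new { |h, len| h[len] = {} }
    words.each { |w| population[w.length][w] = w }

    # 2. Build connections level by level (sketched in the sections below).
    # 3. Walk the network from "causes" and count the reachable nodes.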

Keeping the base problem in mind, i.e. solving for a Levenshtein distance of 1, meant finding each word's
children (1 character fewer), siblings (substitution of 1 character) and parents (1 character more).

* Finding children is straightforward, as a word of length S can have at most S children. For example,
the children of the word "causes" would be: auses, cuses, cases, caues, causs, cause, which can simply be
looked up in the hash for words of length 5. Thus finding children reduces to creating S possible
words and looking them up among the words of length S-1.
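A minimal sketch of that child lookup, assuming the population is keyed by word length as above (the
method name children_of is illustrative):

    # Generate every word obtainable by deleting one character, then keep the
    # ones that actually exist among the words that are one character shorter.
    def children_of(word, population)
      shorter = population[word.length - 1] || {}
      (0...word.length).each_with_object([]) do |i, kids|
        candidate = word[0...i] + word[(i + 1)..-1]   # drop the character at index i
        kids << candidate if shorter.key?(candidate)
      end.uniq
    end

    population = { 5 => { "cause" => "cause", "auses" => "auses" },
                   6 => { "causes" => "causes" } }
    children_of("causes", population)   # => ["auses", "cause"]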

* Finding siblings is a little less straightforward, but is similar to finding children. Two siblings always
have a child in common, obtained by eliminating the character in which they differ. However, the index position
at which the character is eliminated is important: "causes" and "abuses" both have the child "auses", but
are not siblings. On the other hand, "causes" and "pauses" have "auses" as a child and are siblings.
This means siblings can be found by building the child combinations, remembering the position of the character
eliminated, and connecting words that share the same child at the same position. For example, one possible
substitution of causes is *auses, where * = any letter from a-z. Therefore, the hash for level 6 would contain
{"auses1" => "causes"} - where 1 is the index position of the letter substituted and the value is the word.
(Technically it holds the Node object for that word, but here I just use the word for simplicity.) So the hash
for causes would look like this:
{"auses1" => "causes",
"cuses2" => "causes",
"cases3" => "causes",
"caues4" => "causes",
"causs5" => "causes",
"cause6" => "causes"}

Pretend the word causes has already been processed and the hash is loaded as above. Now, when the iteration
comes up against the word "pauses", it creates the substitution combinations below. The hash for pauses
would look like this:
{"auses1" => "pauses",
"puses2" =>  "pauses",
"pases3" =>  "pauses",
"paues4" =>  "pauses",
"pauss5" =>  "pauses",
"pause6" =>  "pauses"}

Except that instead of adding the key "auses1", which already exists, we have found a sibling of "pauses" at a
Levenshtein distance of 1 and connect it to "causes".

On the other hand, "abuses" will create "auses2", which will not match "auses1", and therefore will not be
linked to causes - which is correct, as the Levenshtein distance between "causes" and "abuses" is greater
than one. It is worth noting that the above approach is more performant than iterating from a-z,
substituting each letter of the alphabet at the respective index position and then checking
if the word exists in the population. The latter would take anywhere from S to S*26 iterations, where
S is the length of the string, whereas the above approach takes at most S iterations.
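A minimal sketch of that sibling-key idea, assuming one shared hash per length level (the method name
link_siblings is illustrative, and for simplicity it remembers only the first word that produced a key,
rather than a Node object):

    # For each word, build keys of the form "<word minus the char at i><i+1>".
    # Two words that produce the same key differ only at that position,
    # i.e. they are siblings at a Levenshtein distance of 1.
    def link_siblings(words)
      seen  = {}                      # substitution key => first word that produced it
      pairs = []
      words.each do |word|
        (0...word.length).each do |i|
          key = word[0...i] + word[(i + 1)..-1] + (i + 1).to_s
          if seen.key?(key)
            pairs << [seen[key], word]
          else
            seen[key] = word
          end
        end
      end
      pairs
    end

    link_siblings(%w[causes pauses abuses])   # => [["causes", "pauses"]]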

* What about parent nodes? Finding parent word combinations is the hardest puzzle to solve, since it is
expensive to guess what the parent of a word should be. Given there are 26 letters in the alphabet,
a word that is S letters long has 26*(S+1) candidate parent combinations (one insertion of each letter at
each position) - times 260,000 words.
We can get around this by building connections starting with the queue of longest words, finding siblings
and children, and then moving one level down and repeating. Thus we get the parents for free through
finding the children.
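A sketch of that length-descending order, continuing the population-by-length layout from the earlier
sketches (in the real code this corresponds to iterating over the sorted level keys in Population):

    # Process levels from longest words to shortest. When a word at level S
    # finds its children at level S-1, that same link is the child's parent
    # link, so parents never have to be guessed.
    population.keys.sort.reverse.each do |length|
      population[length].each_key do |word|
        # find siblings within this level and children at level (length - 1);
        # each child link doubles as the shorter word's parent link
      end
    end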

* Networks and Nodes: Each word is saved as a node object. When the population hash is loaded for the
first time, each word is converted into a node representing that word. Each node has an attribute called
first_degree_friends, to which it adds any children/siblings it runs across once connection building starts.
Each time a node is added as a friend, it is doubly linked (see the add_friend method in the Node class).
The reason for a two-way link is to be able to walk the network from any node and not end up at dead ends.
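A minimal Node sketch with a two-way add_friend (illustrative; the Node class in the repo may store its
attributes differently):

    require "set"

    # Each word becomes a node; add_friend links both directions so the
    # network can be walked starting from any node.
    class Node
      attr_reader :word, :first_degree_friends
      attr_accessor :touched

      def initialize(word)
        @word = word
        @first_degree_friends = Set.new
        @touched = false
      end

      def add_friend(other)
        return if other == self || first_degree_friends.include?(other)
        first_degree_friends << other
        other.add_friend(self)        # the reciprocal link
      end
    end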

* Walking the network: This is basically walking a doubly linked structure. To make sure each node is
counted only once and to avoid circular traversals, an attribute on each node called "touched" is set
to 1 once the node has been counted.
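A sketch of that walk, using the Node sketch above and an explicit queue (the actual code sets touched
to 1; any truthy marker works the same way):

    # Count every node reachable from a starting node, marking each as
    # touched so circular links are never counted twice.
    def network_size(start_node)
      count = 0
      queue = [start_node]
      until queue.empty?
        node = queue.shift
        next if node.touched
        node.touched = true
        count += 1
        node.first_degree_friends.each { |friend| queue << friend unless friend.touched }
      end
      count
    end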

Computational time: The complexity of the above algorithm is O(P + N) = O(P) (since N <= P), where P = size of
the population and N = size of the network, versus O(P*N) = O(P^2) when iterating through the network for each
word. Processing the population as above costs less than the combined cost of loading the population and
then iterating through every word in it to find the friends of a specific word.

####################################################################################################
# Little optimizations applied along the way:
####################################################################################################
1. Network scans only need to be applied if there is a possibility of finding siblings/children/parents.
That means if a word is the only one in its level and there are no words in the level below it or in
the level above it, then there is no possibility of finding friends, and constructing all possible
child word combinations is useless (a guard for this is sketched after this list).
2. Only add sibling word matches to the hash. Adding child word combinations is a waste, as children
can be looked up directly in the base population (also a hash).
3. Clearing the hash after processing all words at each level shaved off another second. The assumption
is that it reduces the number of memory operations (clearing keys vs. deleting the old hash and
allocating a new hash object).
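A hedged sketch of that first guard, continuing the population-by-length layout from above (the method
name worth_scanning? is illustrative):

    # Skip the expensive key generation when a level has at most one word and
    # the adjacent levels are empty: no sibling, child or parent can exist.
    def worth_scanning?(population, length)
      level = population[length] || {}
      return false if level.size <= 1 &&
                      (population[length - 1] || {}).empty? &&
                      (population[length + 1] || {}).empty?
      true
    end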

####################################################################################################
# Initial approaches used:
####################################################################################################
1. Finding friends recursively - The problem statement itself suggested this. However, due to the 
size of the network the stack went belly up at 2187 levels deep. That meant the next approach would 
be iteration.
2. Finding friends iteratively - The first iterative approach checked each word against the base
population to see if a parent/child/sibling existed that had not already been added to the network of the
word "causes". This was first tested against a smaller dataset of 65,000 words and ran in 1 hour 20 minutes
with a network size of about 16,000. However, when run against the full population it ran for 8+ hours,
suggesting a plain iterative solution was inadequate. A closer look at the algorithm revealed that it would
take N*P or ~20 billion operations (260K * 78K). Clearly I needed to reduce the size of the datasets.
3. Splitting the dataset - The most obvious way was to organize the data by length. Using the same algorithm
as above (with a few optimizations), I was able to get a run time of ~67 minutes. However, I noticed that
some of the datasets were quite large, e.g. the population of words of length 8 was ~39K. I further split this
into a hash of starts_with/ends_with parts so that it contained every possible string split for each word. The
population's starts_with/ends_with hash for causes would look like this:
{:starts_with => {"c" => ["causes"], "ca" => ["causes"], "cau" => ["causes"], "caus" => ["causes"], "cause" => ["causes"]},
:ends_with => {"auses" => ["causes"], "uses" => ["causes"], "ses" => ["causes"], "es" => ["causes"], "s" => ["causes"]}}

Then I could split my input string in half, like so: "cau"/"ses" for even-length strings or "ca"/"use" for
odd-length strings. All that was then required was looking up the strings that matched the first/last half of
the input string and iterating through those. The idea was to do as few iterations as possible, since hash
lookups are much cheaper than iterating through arrays. This approach got the run time down to 480 seconds,
or 8 minutes (a sketch of this split lookup appears after this list).
4. The remaining problem was that this type of algorithm has a complexity of P*N, where P = size of the
population and N = size of the network (or P' * N, where P' = f(P) = the reduced size of the population after
organizing by length/word splits, etc.). That means that as the population gets bigger, the network is likely
to get bigger, so the computational time grows quadratically rather than linearly. This led to the final
solution, which does not depend on the network size. It is also worth noting that the final solution
yields every possible network in the population, and can easily be scaled to process the word-length queues
in parallel, since finding friends at each level is independent of the others.
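Returning to the split lookup from step 3 above, here is a hedged sketch (names illustrative). A same-length
word at a substitution distance of 1 must match either the first half or the last half of the input exactly,
so those two buckets give a much smaller candidate list to iterate through:

    # Build the starts_with/ends_with index for one level (same-length words).
    def build_split_index(words)
      index = { starts_with: Hash.new { |h, k| h[k] = [] },
                ends_with:   Hash.new { |h, k| h[k] = [] } }
      words.each do |w|
        (1...w.length).each do |i|
          index[:starts_with][w[0...i]] << w
          index[:ends_with][w[i..-1]]   << w
        end
      end
      index
    end

    # Candidate siblings share the input's first half or its last half.
    def sibling_candidates(word, index)
      half  = word.length / 2
      front = word[0...half]          # "cau" for "causes", "ca" for "cause"
      back  = word[half..-1]          # "ses" for "causes", "use" for "cause"
      (index[:starts_with][front] + index[:ends_with][back]).uniq - [word]
    end

    index = build_split_index(%w[causes pauses abuses])
    sibling_candidates("causes", index)   # => ["pauses", "abuses"]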

####################################################################################################
# Language of choice: ruby. Why?
####################################################################################################
1. Concise: Ruby lets me use as little code as possible. As a case in point, consider
self.citizens.keys.sort.reverse (in Population.build_connnections).
2. IRB - since Ruby is interpreted, I can test as much or as little of my code as I like right away.
3. Dynamic, open classes - if I want to add a method to a class, I can simply reopen the class and add the
method I want.
4. Easy-to-use testing framework.
