How to view logging #35
Comments
I'm not quite sure I understand this question. The logger feeds into standard Python logging, so if you have this set up (e.g. via logging.getLogger("countrynames").setLevel(logging.DEBUG)
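For readers landing here with the same question, a minimal sketch of that setup with the standard library follows. The logger name "countrynames" comes from the comment above; the format string and the test message are only illustrative assumptions.

```python
import logging

# Attach a handler to the root logger. Without any handler, DEBUG records
# from library loggers are not shown. Passing filename="..." to basicConfig
# would write to a file instead of stderr.
logging.basicConfig(
    level=logging.WARNING,
    format="%(asctime)s %(name)s %(levelname)s %(message)s",
)

# Lower the threshold only for the library's logger, leaving others at WARNING.
log = logging.getLogger("countrynames")
log.setLevel(logging.DEBUG)

# Illustrative message to confirm the configuration is active.
log.debug("debug logging is enabled for %s", log.name)
```

Setting the level on the named logger rather than on the root logger keeps debug output from other libraries out of the way.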
@pudo thanks for the quick response. I think I realized what is going on. I was incorrectly expecting the logger to record data on any use of the best_distance > (len(name) * 0.15) My initial test was going to be "Untied States" instead of "United States". However, because both strings are 13 characters long and 13 * 0.15 = 1.95, the strings must have a Levenshtein distance of at most 1 to match (0 would be an exact match, and the distance is always a whole number). In fact, any name with 6 or fewer letters would be impossible to match using I wonder if it would make sense to adopt one or both of the following options:
best_distance > ((len(name) + len(cand)) * 0.15) Examples: "Untied States" and "United States" are both 13 characters long and have a Levenshtein distance of 2. Evaluating the distance of 2 against the combined length of 26 (13 + 13) may make this a bit more robust. Another example where this approach might prove beneficial is comparing 'German' to 'Germany'. These two spellings have a Levenshtein distance of 1, but because 'German' is only 6 characters long, the distance is 1/6 ≈ 0.16667 of the name's length, which exceeds the 0.15 threshold.
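The arithmetic in this comment can be checked with a short sketch. The `levenshtein` helper below is a plain textbook implementation written for illustration, not the distance function the package uses, and `rejected_current` / `rejected_combined` are hypothetical names for the two conditions quoted above.

```python
def levenshtein(a, b):
    """Classic edit distance: insertions, deletions and substitutions cost 1."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                 # deletion
                           cur[j - 1] + 1,              # insertion
                           prev[j - 1] + (ca != cb)))   # substitution
        prev = cur
    return prev[-1]


def rejected_current(name, cand, ratio=0.15):
    # Condition quoted in the thread: reject when the distance exceeds
    # 15% of the input name's length.
    return levenshtein(name, cand) > len(name) * ratio


def rejected_combined(name, cand, ratio=0.15):
    # Proposed variant: compare against the combined length of both strings.
    return levenshtein(name, cand) > (len(name) + len(cand)) * ratio


for name, cand in [("Untied States", "United States"), ("German", "Germany")]:
    print(name, cand, levenshtein(name, cand),
          rejected_current(name, cand), rejected_combined(name, cand))
```

Both pairs are rejected under the current condition and accepted under the combined-length variant. Whether the looser threshold also lets through unwanted matches is a separate question this sketch does not answer.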
@pudo just wanted to let you know that the logging info you gave helped me find the log output. Honestly, I haven't really used the While I was exploring this, though, I think I found a bug in this line of code:
log.debug("Guessing country: %s -> %s (distance %d)", name, code, best_distance)
When I run the following two searches:
>>> print(countrynames._fuzzy_search(countrynames.normalize_name('Unided States of America')))
US
>>> print(countrynames._fuzzy_search(countrynames.normalize_name('Bundesrepublik Deutschlan')))
DE
I get the following results in my log file
I believe that the line should instead be:
log.debug("Guessing country: %s -> %s (distance %d)", name, best_code, best_distance)
Otherwise it always logs 'ZW' as the code in the debug file, since that was the last code in the
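The kind of bug described here, where a loop variable is logged after the loop has finished, can be reproduced in isolation. The sketch below is not the package's code: `CANDIDATES`, `edit_distance` and `fuzzy_guess` are hypothetical stand-ins, and only the `log.debug` call mirrors the corrected line quoted above.

```python
import logging

log = logging.getLogger("countrynames")

# Hypothetical candidate table; 'ZW' is deliberately the last entry.
CANDIDATES = {"united states": "US", "germany": "DE", "zimbabwe": "ZW"}


def edit_distance(a, b):
    """Plain Levenshtein distance, for illustration only."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (ca != cb)))
        prev = cur
    return prev[-1]


def fuzzy_guess(name):
    best_code, best_distance = None, None
    for cand, code in CANDIDATES.items():
        distance = edit_distance(name, cand)
        if best_distance is None or distance < best_distance:
            best_code, best_distance = code, distance
    # After the loop, `code` still holds the LAST candidate's code ('ZW'),
    # not the best match, so logging `code` here would be misleading.
    log.debug("Guessing country: %s -> %s (distance %d)",
              name, best_code, best_distance)
    return best_code, code  # second value returned only to expose the leak


print(fuzzy_guess("unided states"))
print(fuzzy_guess("germanny"))
```

In both calls the first element is the best match while the second is always the code of the final dictionary entry, which is consistent with the 'ZW' entries reported in the log file.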
@pudo thanks for updating the log call. What are your thoughts regarding the measure of similarity between strings? I've been doing some research on this topic, and ultimately you as an end user need to pick a "goodness" of match. Is this something you'd consider updating or, if not, would you consider allowing the end user to pass it as a parameter or set it as a variable to override the default? One interesting thing I did find while doing the research: there is a function in if transpositions and last_left > 0 and last_right > 0:
    d = lev[last_left - 1][last_right - 1] + i - last_left + j - last_right - 1
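To make the transposition idea concrete, here is a minimal sketch of the optimal string alignment variant of Damerau-Levenshtein distance, in which swapping two adjacent characters counts as one edit. It is a generic textbook version named `osa_distance` for illustration; it is neither the function quoted above nor anything from the package.

```python
def osa_distance(a, b):
    """Edit distance where an adjacent transposition costs 1 (OSA variant)."""
    rows, cols = len(a) + 1, len(b) + 1
    d = [[0] * cols for _ in range(rows)]
    for i in range(rows):
        d[i][0] = i          # delete all of a[:i]
    for j in range(cols):
        d[0][j] = j          # insert all of b[:j]
    for i in range(1, rows):
        for j in range(1, cols):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,          # deletion
                          d[i][j - 1] + 1,          # insertion
                          d[i - 1][j - 1] + cost)   # substitution or match
            # Adjacent transposition, e.g. "ti" <-> "it"
            if (i > 1 and j > 1
                    and a[i - 1] == b[j - 2] and a[i - 2] == b[j - 1]):
                d[i][j] = min(d[i][j], d[i - 2][j - 2] + 1)
    return d[-1][-1]


print(osa_distance("untied states", "united states"))
print(osa_distance("german", "germany"))
```

With transpositions counted as a single edit, "untied states" is at distance 1 from "united states", which would fall under the 13 * 0.15 = 1.95 limit discussed earlier in the thread.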
Hey @nschimmoller, thanks for filing that detailed report. I'm a little slow in responding at the moment since I'm on vacation and the internet is less good than the wine. Obviously, the behaviour you've identified is a bug and needs to be addressed. First off: I'm opposed to establishing a dependency on I'm pretty sure we could achieve the same outcome of treating transpositions as a single edit using the functions contained in The best thing to do would be to build out our test cases a bit to cover not just the scenarios we want to work (
Thanks @pudo, enjoy your holiday and wine. I'm working on a project that I'm sure has a handful of spelling errors: user-entered shipping addresses from all over the world. I might look into creating a fork, modifying the code with the lower threshold and the transposition handling implemented, and sharing back what the results look like. Honestly, it might be 2-3 weeks until I get around to this, but it should be a decently large dataset to test this hypothesis on.
That would be fantastic, and very much appreciated!
Thanks so much for this great package. However, I'm having some issues trying to view the log.DEBUG info for _fuzzy_search. Any guidance on where to find this logging info would be greatly appreciated.