
How to view logging #35

Open
nschimmoller opened this issue Aug 13, 2022 · 7 comments

Comments

@nschimmoller

Thanks so much for this great package. However, I'm having some trouble viewing the log.DEBUG output from _fuzzy_search. Any guidance on where to find this logging output would be greatly appreciated.

@pudo
Member

pudo commented Aug 13, 2022

I'm not quite sure I understand this question. The logger feeds into standard Python logging, so if you have that set up (e.g. via logging.basicConfig), you should be able to raise the log level for the package like this:

logging.getLogger("countrynames").setLevel(logging.DEBUG)
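A minimal end-to-end setup might look like this (a sketch; the logger name "countrynames" matches the package name, which is the standard convention for module-level loggers):

```python
import logging

# Send log records to stderr; the root level can stay at INFO
# so other libraries don't become noisy.
logging.basicConfig(level=logging.INFO)

# Raise only the countrynames logger to DEBUG so the
# _fuzzy_search messages become visible.
logging.getLogger("countrynames").setLevel(logging.DEBUG)
```

After this, DEBUG records from the package flow to whatever handler basicConfig installed (stderr by default, or a file if you pass filename=).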

@nschimmoller
Author

nschimmoller commented Aug 13, 2022

@pudo thanks for the quick response. I think I realized what is going on. I was incorrectly expecting the logger to record data on every use of the _fuzzy_search function. However, the similarity threshold that strings must meet is quite strict, so most calls to _fuzzy_search end without a match because the best candidate fails the check

best_distance > (len(name) * 0.15)

My initial test was going to be "Untied States" instead of "United States." However, because both strings are 13 characters long and 13 * 0.15 = 1.95, the strings must have a Levenshtein distance of at most 1 (0 would be an exact match, and the distance is always a non-negative integer).

In fact, no name of 6 or fewer letters could ever be matched by _fuzzy_search, since 6 * 0.15 = 0.9 < 1.
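The arithmetic can be checked directly (a sketch of the cutoff described above, not the package's actual code):

```python
def max_allowed_distance(name: str) -> int:
    # The current check rejects candidates with
    # best_distance > len(name) * 0.15, so the largest
    # distance that survives is floor(len(name) * 0.15).
    return int(len(name) * 0.15)

# "untied states" is 13 characters: 13 * 0.15 = 1.95,
# so only a distance of 1 can pass.
assert max_allowed_distance("untied states") == 1

# A 6-letter name allows no edits at all: 6 * 0.15 = 0.9 < 1.
assert max_allowed_distance("german") == 0
```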

I wonder if it would make sense to do either (or both) of the following:

  1. revisit this threshold
  2. evaluate the distance not just against the length of the input value, name, but against the combined length of the input value, name, and the candidate value, cand:

best_distance > ((len(name) + len(cand)) * 0.15)

Examples:

Untied States and United States are both 13 characters and have a Levenshtein distance of 2. Evaluating that distance of 2 against a threshold of (13 + 13) * 0.15 = 3.9 would allow this match and be a bit more robust.

Another example where this approach might prove beneficial is comparing 'German' to 'Germany'. These two spellings have a Levenshtein distance of 1, but because 'German' is only 6 characters the allowed distance is 6 * 0.15 = 0.9, so the match is rejected (1 > 0.9).
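The proposed rule can be sketched next to the current one (illustrative helper functions, not the package's code):

```python
def passes_current(distance: int, name: str) -> bool:
    # Existing rule: reject when distance > len(name) * 0.15.
    return distance <= len(name) * 0.15

def passes_proposed(distance: int, name: str, cand: str) -> bool:
    # Proposed rule: measure against the combined length of both strings.
    return distance <= (len(name) + len(cand)) * 0.15

# 'untied states' vs 'united states': distance 2,
# current cutoff 13 * 0.15 = 1.95, proposed cutoff 26 * 0.15 = 3.9.
assert not passes_current(2, "untied states")
assert passes_proposed(2, "untied states", "united states")

# 'german' vs 'germany': distance 1,
# current cutoff 6 * 0.15 = 0.9, proposed cutoff 13 * 0.15 = 1.95.
assert not passes_current(1, "german")
assert passes_proposed(1, "german", "germany")
```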

@nschimmoller
Author

nschimmoller commented Aug 13, 2022

@pudo just wanted to let you know that the logging info you gave helped me find the output I was after. Honestly, I haven't used the logging package much, and a lot of the online documentation covers implementing it rather than using it as an end user.

While exploring this, though, I think I found a bug in this line of code:

log.debug("Guessing country: %s -> %s (distance %d)", name, code, best_distance)

When I run the following two searches

>>> print(countrynames._fuzzy_search(countrynames.normalize_name('Unided States of America')))
US
>>> print(countrynames._fuzzy_search(countrynames.normalize_name('Bundesrepublik Deutschlan')))
DE

I get the following results in my log file

DEBUG:countrynames:Guessing country: unided states of america -> ZW (distance 1)
DEBUG:countrynames:Guessing country: bundesrepublik deutschlan -> ZW (distance 1)

I believe that the line should instead be

log.debug("Guessing country: %s -> %s (distance %d)", name, best_code, best_distance)

Otherwise it always reports 'ZW' as the code in the debug output, since that was the last code checked in the COUNTRY_NAMES dictionary.

pudo added a commit that referenced this issue Aug 15, 2022
@nschimmoller
Author

@pudo thanks for updating the log call. What are your thoughts on the measure of similarity between strings?

I've been doing some research on this topic, and ultimately an end user needs to pick a "goodness" of match. Is this something you'd consider updating? And if not, would you consider letting the end user pass it as a parameter or set it as a variable to override the default?

One interesting thing I found while researching: there is a function in nltk.metrics.distance called edit_distance which lets you compare a string like Untied States against a corpus of allowable words stored as a list. It uses Levenshtein distance as well (minimum edits needed), but it also takes a parameter called transpositions. When set to True, the edit distance from Untied States to United States is 1 instead of 2, because the adjacent ...ti... and ...it... can be swapped rather than each character being edited separately, using the following snippet:

if transpositions and last_left > 0 and last_right > 0:
    d = lev[last_left - 1][last_right - 1] + i - last_left + j - last_right - 1
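For reference, the transposition behaviour can be reproduced without nltk. Here is a small self-contained sketch of the restricted Damerau-Levenshtein variant (optimal string alignment), not nltk's actual implementation:

```python
def edit_distance(s1: str, s2: str, transpositions: bool = False) -> int:
    # Standard Levenshtein DP table, optionally counting a swap of
    # adjacent characters as a single edit (optimal string alignment).
    m, n = len(s1), len(s2)
    d = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        d[i][0] = i
    for j in range(n + 1):
        d[0][j] = j
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            cost = 0 if s1[i - 1] == s2[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,         # deletion
                          d[i][j - 1] + 1,         # insertion
                          d[i - 1][j - 1] + cost)  # substitution
            if (transpositions and i > 1 and j > 1
                    and s1[i - 1] == s2[j - 2] and s1[i - 2] == s2[j - 1]):
                d[i][j] = min(d[i][j], d[i - 2][j - 2] + 1)  # transposition
    return d[m][n]

# The swapped "ti"/"it" counts as two edits normally, one as a transposition.
assert edit_distance("untied states", "united states") == 2
assert edit_distance("untied states", "united states", transpositions=True) == 1
```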

@pudo
Member

pudo commented Aug 15, 2022

Hey @nschimmoller, thanks for filing that detailed report. I'm a little slow in responding at the moment since I'm on vacation and the internet is less good than the wine. Obviously, the behaviour you've identified is a bug and needs to be addressed.

First off: I'm opposed to establishing a dependency on nltk: countrynames is meant to be a small utility library, whereas nltk is a near-infinite labyrinth of so-so maintained code that, to my mind, lives north of countrynames in the stack.

I'm pretty sure we could achieve the same outcome of treating transpositions as a single edit using the functions contained in python-Levenshtein, e.g. editops: https://rawgit.com/ztane/python-Levenshtein/master/docs/Levenshtein.html - however, I sort of like your first solution - maybe upping the threshold value and applying it to the combined length of both strings - for its simplicity.

The best thing to do would be to build out our test cases a bit to cover not just the scenarios we want to work (Untied States), but also some of the ones we want to avoid (Gambia and Guinea being messed up). I might start there.
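Such test cases could start from a table like the one below. The matcher here is a toy stand-in that applies the combined-length threshold discussed earlier in this thread; the candidate dictionary and function names are illustrative, not countrynames' actual internals:

```python
# Toy candidate table (stand-in for the real COUNTRY_NAMES data).
CANDIDATES = {"united states": "US", "germany": "DE",
              "gambia": "GM", "guinea": "GN"}

def levenshtein(s1: str, s2: str) -> int:
    # Plain iterative Levenshtein distance, one row at a time.
    prev = list(range(len(s2) + 1))
    for i, c1 in enumerate(s1, 1):
        curr = [i]
        for j, c2 in enumerate(s2, 1):
            curr.append(min(prev[j] + 1, curr[j - 1] + 1,
                            prev[j - 1] + (c1 != c2)))
        prev = curr
    return prev[-1]

def fuzzy_match(name: str):
    # Pick the closest candidate, then apply the proposed
    # combined-length cutoff; return None on a failed match.
    best_code = best_cand = best_distance = None
    for cand, code in CANDIDATES.items():
        distance = levenshtein(name, cand)
        if best_distance is None or distance < best_distance:
            best_code, best_cand, best_distance = code, cand, distance
    if best_distance > (len(name) + len(best_cand)) * 0.15:
        return None
    return best_code

# Scenarios we want to work:
assert fuzzy_match("untied states") == "US"
assert fuzzy_match("gambi") == "GM"
# Scenarios we want to avoid: garbage input must not match anything.
assert fuzzy_match("atlantis") is None
```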

@nschimmoller
Author

Thanks @pudo enjoy your holiday and wine.

I'm working on a project that I'm sure contains a handful of spelling errors: user-entered shipping addresses from around the world. I might look into creating a fork, modifying the code with the lower threshold and transpositions implemented, and sharing back what the results look like.

Honestly, it might be 2-3 weeks until I get around to this, but it should be a decently large dataset to test this hypothesis on.

@pudo
Member

pudo commented Aug 15, 2022

That would be fantastic, and very much appreciated!
