-
Notifications
You must be signed in to change notification settings - Fork 462
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Check top-level domain suggestion against domains #89
base: master
Are you sure you want to change the base?
Conversation
After correcting the top-level domain, our suggestion is sometimes close enough to a full domain that accepting the suggestion results in another bad email address that prompts a suggestion. A previous commit tried to solve this problem by changing the thresholds at which we suggest top-level domains and full domains, but some edge cases remained. This commit cuts out the middleman by directly checking for a full domain suggestion before returning a top-level domain suggestion.
|
@derrickko currently we have:
I think it's confusing and not scalable. I would like to merge this solution and use a single threshold instead. Thoughts? cc @kevinhughes27. Potentially related: #96, #97 |
I like the idea proposed in this PR - I don't think removing the individual thresholds is a good idea. Users of this lib don't have to change them if they don't need to but they are quite useful for tweaking. Btw I didn't submit this in my PR from the Shopify changes but we use much tighter thresholds (2, 1.5, 1,5) IIRC |
What's the basis for tweaking? For example, how did you decide 2, 1.5, 1.5 are the best thresholds? Trail and error? I have trouble even trying to come up with useful default values for these thresholds to avoid future issues like #96 and #97. Without digging into the code, what do the threshold mean? Do they mean anything by themselves or are they only meaningful when compared among themselves? |
the thresholds are how close a string needs to be to be considered a match and thus be corrected to the matching string (in this case domains or parts of domains). They are pretty important and being able to tweak them is also important let me give an example: I ran a pretty detailed experiment with mailcheck on our checkout page and collected well over a million data points. I measured 3 things - checkout successes, mailcheck suggestions shown and mailcheck suggestions clicked. Right away we noticed that unclicked mailcheck suggestions were related to decreased checkout successes - so in our environment its important to be conservative with suggestions. After noticing this I played with the thresholds and the default domain lists to bring down the number of unclicked suggestions (I collected the original email address if it was unclicked so I could iterate). That is how I arrived at 2, 1.5, 1.5 for thresholds. Our users going through checkout are fairly international so our domain list is large but if for example you are adding mailcheck to a signup page and most of your users will be other tech enthusiasts then your setup will be very different. In order to avoid the issues you mentioned we should reduce the default thresholds - but you should still be experimenting with the setup in your environment. Maybe @derrickko can elaborate but my take away from looking through previous PRs (most PRs that alter the default config are closed unless there is a really good reason) is that Mailcheck is intended as more of a framework and less of a plug and play solution. That being said lets tighten the thresholds a bit so the default config is more useful, I should have included this in my original PR here. |
Hey @ kevinhughes27 your explanation was amazing. are u still using this library nowadays? It's no longer maintained, so I rewrote it entirely and update it for 2022 and onward: https://github.com/ZooTools/email-spell-checker. Would love to get your thoughts and see if there is something we can improve.
Link: https://github.com/ZooTools/email-spell-checker Come check it out and give it a ⭐️ for the effort. |
After correcting the top-level domain, the suggestion is sometimes
close enough to a full domain that accepting the suggestion results in
another bad email address. A previous commit tried to solve this
problem by changing the thresholds at which we suggest top-level
domains and full domains, but some edge cases remained. This
commit cuts out the middleman by directly checking for a full domain
suggestion before returning a top-level domain suggestion.