Domain parsing fails with trailing spaces #75
I think this is the calling code's responsibility, not tldextract. 2 cents.
Perhaps. The lines are a bit blurry here for me. On the one hand, sure, it's the calling code's responsibility. On the other, I doubt my team is the first one that's stumbled on this, or that we'll be the last. In any event, that's certainly how we will respond to it. But either way, I think it deserved to be mentioned here as an issue (perhaps, as you say, one that shouldn't be fixed).
I get you; I'm glad you opened the ticket. So. Assessing whether it should be changed: I don't think other tools would do this cleaning for you either. TLDExtract expects to be given a valid domain name. A domain name with a space isn't valid. Note, for example, that you couldn't make a request to this URL.
or, aside from Python, let's try
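To illustrate the point that a domain name containing a space isn't valid, here is a minimal sketch of a hostname check using only the standard library. `is_valid_hostname` is a hypothetical helper written for this thread, not part of tldextract; it only applies the basic RFC 1035 label rules.

```python
import re

# A DNS label: 1-63 chars, letters/digits/hyphens, no leading or trailing hyphen.
HOSTNAME_LABEL = re.compile(r"^(?!-)[A-Za-z0-9-]{1,63}(?<!-)$")

def is_valid_hostname(hostname: str) -> bool:
    """Return True if hostname follows the basic RFC 1035 label rules.

    Hypothetical helper for illustration; not part of tldextract.
    """
    if not hostname or len(hostname) > 253:
        return False
    return all(HOSTNAME_LABEL.match(label) for label in hostname.split("."))

print(is_valid_hostname("example.com"))   # True
print(is_valid_hostname("example.com "))  # False: a space is never a legal hostname character
```

Trailing whitespace makes the last label fail the pattern, which is why a request to such a URL can't succeed.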
I think users would generally assume it's their responsibility to clean their input (domain name, URL, what have you). If for no other reason: because other libraries also assume they have cleaned their input.
But maybe tldextract should yell. Maybe whitespace makes the input obviously invalid, so an error should be thrown instead of returning a result that might be confusing. I must defer to @john-kurkowski on this question.
I would certainly understand if tldextract threw an exception and said "hey, that isn't a URL." I would also understand if it trimmed its input. But what it's doing now is a little counterintuitive, imo. Actually, we have noticed it handle non-URLs like
Which turned out to be useful once when we had to deal with some bad ones that had :android tacked on at the end. But at the same time, now that you mention it, I think I'd actually be more inclined to think it should scream when you throw it something that isn't a URL. I don't know... I also defer to @john-kurkowski here.
I understand that you wouldn't want tldextract to make assumptions for the user, i.e. cleaning their input for them. I definitely like the idea of raising an exception when tldextract is able to detect an invalid domain being passed as input, however. At the very least, checking for "known" illegal characters could be useful, even if the check isn't fully comprehensive.
As for the bigger issue of raising errors on obviously bad input, like However in the interest of users being able to get up and running with
(A validation example would be nice.)
In keeping with the library's attitude towards URL validation, as documented in the FAQ, it seems to me like stripping the whitespace off of URLs would make more sense than leaving trimming whitespace up to pre-validation of URLs by the user. As it stands right now, it's pretty easy to get an unexpected result without warning or exception. Stripping the whitespace out does make an assumption on behalf of the user... but in the absence of URL validation, I suspect it's better to assume that whitespace should be trimmed.
I am not sure if we should trim whitespace.
Both cases can give wrong results if we trim whitespace. Users should debug what they send to the function and understand what we are doing.
@mauricioabreu I think that @EvanV is right about keeping with the existing attitude in the library. A caller should clean their input before calling, but the library will tolerate a bit of fudgery. tldextract should never be used to validate URLs in such a broad way, however. Use other tools for that, such as tools from stdlib 😉 In any case, even if you wanted to validate user-inputted URLs, you should probably do some basic cleaning before validating or using ... what user wants a brittle input field? This line of thinking starts to get off topic from tldextract ...
@hangtwenty ok, I agree. And it is exactly my point. |
Maybe I've lost track of what you're getting at, @mauricioabreu. The consensus in this thread so far was that whitespace should be trimmed, because there is no case where valid input would have whitespace, i.e. While the library shouldn't do a bunch of validation that's out of scope (and shouldn't be relied upon for validation), it's really not a problem to call
I'm not sure I follow you either, @mauricioabreu... could you try explaining your line of thinking again? My apologies... just not sure I follow you. On my end, I still think that, as it stands right now, it's easy to get an unexpected result without warning or exception. If TLDExtract intended to validate URLs (which it doesn't), it would be a different matter... and in the absence of validation, it's very easy to silently get an unexpected result, which I would argue is really the worst thing that can happen. I think trimming whitespace is a better move than what it does right now, albeit certainly The Right Thing To Do is to sanitize your input.
I think we are having some problems communicating. I am sorry for this.
The text values passed to extract should have trailing spaces stripped, e.g.,
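Pending any change in the library itself, the calling code can take the responsibility this thread keeps coming back to: strip whitespace before handing the value to the extractor. A minimal sketch; `clean_then_extract` is a hypothetical wrapper, not part of tldextract.

```python
def clean_then_extract(raw: str) -> str:
    """Strip leading/trailing whitespace, then hand off to the extractor.

    Hypothetical caller-side wrapper; the real version would end with
    return tldextract.extract(cleaned).
    """
    cleaned = raw.strip()  # drops spaces, tabs, and newlines at either end only
    # return tldextract.extract(cleaned)  # real call, omitted in this sketch
    return cleaned

print(clean_then_extract("  http://www.example.com \n"))  # http://www.example.com
```

Note that `str.strip()` only touches the ends of the string, so interior characters (including a legitimate path or query) are left alone.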