Default wildcard rule missing from the PSL algorithm implementation #338

Open
braedon opened this issue Aug 30, 2024 · 2 comments

braedon commented Aug 30, 2024

The PSL formal algorithm includes the following step:

  • If no rules match, the prevailing rule is "*".

This rule means that if an explicit public suffix can't be found for a domain in the PSL, the (actual, not effective) TLD is treated as a public suffix. This allows new TLDs (or custom internal TLDs) to be handled without requiring an explicit update to the PSL.

(It also means that the only time the PSL algorithm doesn't return a registrable domain is when the domain is itself a public suffix, which helps disambiguate that case.)
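To make the fallback concrete, here is a minimal sketch of the lookup with the default rule included. It is not tldextract's code: `registrable_domain` and the tiny `RULES` set are hypothetical names used only for illustration, and a real implementation would load the full list and handle IDNA/punycode.

```python
# Hypothetical, tiny rule set for illustration; the real PSL is far larger.
RULES = {"com", "uk", "co.uk", "*.ck", "!www.ck"}


def registrable_domain(hostname, rules=RULES):
    """Return the registrable domain, or None if hostname is itself a public suffix."""
    labels = hostname.lower().strip(".").split(".")
    suffix_len = None  # number of labels in the public suffix, once known

    for rule in rules:
        is_exception = rule.startswith("!")
        rule_labels = rule.lstrip("!").split(".")
        if len(rule_labels) > len(labels):
            continue
        tail = labels[-len(rule_labels):]
        if all(r == "*" or r == l for r, l in zip(rule_labels, tail)):
            if is_exception:
                # An exception rule prevails; its suffix drops the leftmost label.
                suffix_len = len(rule_labels) - 1
                break
            if suffix_len is None or len(rule_labels) > suffix_len:
                # Otherwise the rule with the most labels prevails.
                suffix_len = len(rule_labels)

    if suffix_len is None:
        # No rules match: the prevailing rule is "*", so the TLD is the suffix.
        suffix_len = 1

    if len(labels) <= suffix_len:
        return None  # the hostname is itself a public suffix
    return ".".join(labels[-(suffix_len + 1):])
```

With this sketch, an unlisted TLD such as `example` falls through to the `suffix_len = 1` branch, so `b.example.example` yields `example.example` while bare `example` yields `None`.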

The PSL project has tests for this in the standard test suite:

// Unlisted TLD.
checkPublicSuffix('example', null);
checkPublicSuffix('example.example', 'example.example');
checkPublicSuffix('b.example.example', 'example.example');
checkPublicSuffix('a.b.example.example', 'example.example');

Here's the output for the test domains in tldextract 5.1.2:

>>> tldextract.extract('example').registered_domain
''
>>> tldextract.extract('example.example').registered_domain
''
>>> tldextract.extract('b.example.example').registered_domain
''
>>> tldextract.extract('a.b.example.example').registered_domain
''

The last three results are returning '', when they should return example.example.

elliotwutingfeng (Contributor) commented

Good catch. There are a bunch of issues related to this at the public suffix list repository, such as publicsuffix/list#694

Also relevant is https://wiki.mozilla.org/Public_Suffix_List/platform.sh_Problem#Further_Information (under the Further Information heading) which comments on the implications of enforcing/not enforcing the "*" rule.

john-kurkowski (Owner) commented

> The last three results are returning '', when they should return example.example.

And the ExtractResult objects for all four results should have suffix='example', instead of domain='example'?

I'm worried about adopting this change. It would be breaking for anybody using this library to distinguish recognized suffixes from bogus ones or from internal hostnames like localhost.

Maybe a new property that honors the PSL formal algorithm? And/or a warning in the documentation?
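As a rough sketch of what such an opt-in property might compute, assuming the three string fields behave as shown earlier in this thread (unlisted TLD lands in `domain`, with `suffix` empty). `Parts` and `psl_registered_domain` are hypothetical names for illustration, not part of tldextract's API; `Parts` only stands in for the extraction result so the snippet runs on its own.

```python
from typing import NamedTuple


class Parts(NamedTuple):
    """Stand-in for the three string fields of an extraction result."""
    subdomain: str
    domain: str
    suffix: str


def psl_registered_domain(parts):
    """Registrable domain with the PSL default "*" rule applied; '' if none."""
    if parts.suffix:
        # A listed suffix was recognized: same answer as today.
        return f"{parts.domain}.{parts.suffix}" if parts.domain else ""

    # No listed suffix: under "*", the last label (currently held in
    # `domain`) is the public suffix, so the registrable domain needs
    # one more label, taken from the right end of `subdomain`.
    if not parts.subdomain or not parts.domain:
        return ""  # a bare label such as 'example' or 'localhost'
    new_domain = parts.subdomain.rsplit(".", 1)[-1]
    return f"{new_domain}.{parts.domain}"
```

Because this would live alongside the existing `registered_domain` rather than replace it, callers that rely on an empty result to spot unrecognized suffixes would keep their current behavior.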
