Wrong URL encoding #274
Comments
## Remove `sanitize-html`

The dependency introduces a bug related to malformed URLs: apostrophecms/sanitize-html#274

In fact, I found it is no longer necessary, since `htmlparser2` is already present as part of the `cheerio` load method.

**Result**: Smaller bundle, less parsing time.

## Set Up Case-Insensitive CSS Rules

One of the things `sanitize-html` did was normalize some common aspects of the HTML markup. Since it is no longer a dependency, and after discovering that [CSS rules can be case-insensitive](https://twitter.com/Kikobeats/status/1083091303930494976), I enabled that wherever possible.

**Result**: Better data detection, less initial parsing time.

## Improve Date Rules

Building on the case-insensitive CSS rules, I re-checked the set of rules in `metascraper-date` and found some improvement opportunities: some rules can be merged into one, and others can be made more generic, which improves data accuracy. I also tried to prioritize *update* over *create*, so the output reflects the last modification date rather than the creation date.

**Result**: More accurate dates, more values detected.

## Improve URL detection

URL detection has been improved so that more kinds of URLs can be detected. A URL is a subtype of [URI](https://kikobeats.com/what-is-uri), and I want to be sure we detect as much data as possible. The URL-related helpers in `metascraper-helpers` can now detect URIs, such as base64-encoded image data URIs or magnet URIs. The challenge is doing that while still supporting the original functionality, so I added a lot of tests to ensure it.

**Result**: Better URL detection, with URI support.
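As a rough illustration of what "detecting URIs, not only URLs" can mean, here is a minimal sketch built on Node's global WHATWG `URL` parser. The helper name `isUri` and the sample values are assumptions made for this example. This is not the actual `metascraper-helpers` implementation.

```javascript
// Hypothetical helper for illustration only; not the metascraper-helpers code.
function isUri (value) {
  if (typeof value !== 'string') return false
  try {
    // The WHATWG URL parser accepts any scheme, so data: and magnet: URIs
    // parse as well as http(s): URLs. Strings without a scheme throw.
    new URL(value.trim())
    return true
  } catch (err) {
    return false
  }
}

console.log(isUri('https://example.com/post'))            // regular URL
console.log(isUri('data:image/png;base64,iVBORw0KGgo='))  // base64 data URI
console.log(isUri('magnet:?xt=urn:btih:abcdef'))          // magnet URI
console.log(isUri('not a url'))                           // no scheme
```

A check this permissive also accepts any `scheme:rest` string, so a real helper would probably need extra validation on top of it.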
When they appear as part of an HTML attribute,
Although I agree with you, that assumption is hard to satisfy when you are handling arbitrary HTML markup, and I expected dealing with it to be part of the library's responsibility.
And it does what you're asking, except when the markup is actually ambiguous and it really can't tell what you want, as in this case (:
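To make the ambiguity concrete, here is a toy sketch of a lenient entity decoder. The two-entry entity table, the function name `decodeLenient`, and the example URLs are all assumptions for illustration. This is not the code used by `sanitize-html` or `htmlparser2`. It only shows why an unescaped `&` followed by something that looks like an entity name can be misread.

```javascript
// Toy entity table: a hypothetical subset, just enough for the demonstration.
const ENTITIES = { amp: '&', image: 'ℑ' }

// Lenient decoder (an assumption for illustration): treats "&name" as an
// entity even when the closing ";" is missing.
function decodeLenient (text) {
  return text.replace(/&([a-z]+);?/gi, (match, name) =>
    Object.prototype.hasOwnProperty.call(ENTITIES, name) ? ENTITIES[name] : match
  )
}

// Unescaped ampersand: "&image" looks like the start of an entity.
const raw = 'https://example.com/?id=1&image=cat.png'
// Ampersand escaped as "&amp;", as expected inside an HTML attribute.
const escaped = 'https://example.com/?id=1&amp;image=cat.png'

console.log(decodeLenient(raw))     // https://example.com/?id=1ℑ=cat.png
console.log(decodeLenient(escaped)) // https://example.com/?id=1&image=cat.png
```

In this sketch only the escaped form round-trips to the intended URL. The unescaped form loses its `&image` parameter separator.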
Hello,
I noticed the library is wrongly encoding the following URL:
Specifically, the library makes the URL unreachable by converting `&image` into `ℑ`.
Sample code for reproducing that:
I'm not sure why this is happening 🤔