Wrong URL encoding #274

Kikobeats · 2019-01-09T21:28:06Z

Hello,

I noted the library is wrongly encoding the following URL:

https://o.aolcdn.com/images/dims?thumbnail=1200%2C630&quality=80&image_uri=https%3A%2F%2Fo.aolcdn.com%2Fimages%2Fdims%3Fcrop%3D4834%252C3230%252C0%252C0%26quality%3D85%26format%3Djpg%26resize%3D1600%252C1069%26image_uri%3Dhttps%253A%252F%252Fs.yimg.com%252Fos%252Fcreatr-images%252F2019-01%252F2d1e22f0-12ae-11e9-bae7-60d640081814%26client%3Da1acac3e1b3290917d92%26signature%3D269ae0af3a1a1772ffea6759d126791b9ad25184&client=amp-blogside-v2&signature=ea8c3bcfcab1cb450830a18333e5a76988365326

Specifically, the library is doing unreacheable the URL converting &image; into ℑ.

Sample code for reproducing that:

'use strict'

const sanitizeHtml = require('sanitize-html')

const sanitize = html =>
  sanitizeHtml(html, {
    allowedTags: false, // allow all tags
    allowedAttributes: false, // allow all attributes
    parser: {
      lowerCaseTags: true,
      decodeEntities: true,
      lowerCaseAttributeNames: true
    }
  })

const html = sanitize(`
<html lang="en">
<meta property="og:image" content="https://o.aolcdn.com/images/dims?thumbnail=1200%2C630&quality=80&image_uri=https%3A%2F%2Fo.aolcdn.com%2Fimages%2Fdims%3Fcrop%3D4834%252C3230%252C0%252C0%26quality%3D85%26format%3Djpg%26resize%3D1600%252C1069%26image_uri%3Dhttps%253A%252F%252Fs.yimg.com%252Fos%252Fcreatr-images%252F2019-01%252F2d1e22f0-12ae-11e9-bae7-60d640081814%26client%3Da1acac3e1b3290917d92%26signature%3D269ae0af3a1a1772ffea6759d126791b9ad25184&client=amp-blogside-v2&signature=ea8c3bcfcab1cb450830a18333e5a76988365326">
</html>
`)

console.log(html)

I'm not suring why this happening 🤔

The text was updated successfully, but these errors were encountered:

## Remove `sanitize-html` The dependency is introducing a bug related to malformed URLs: apostrophecms/sanitize-html#274 In fact, I detected it's no longer necessary since `htmlparser2` is present as part of `cheerio` load method. **Result**: Smaller bundler, less parsing time. ## Setup CSS Insensitive Rules One of the things related to `sanitize-html` was normalized some common things around the HTML markup. Because this dependency is no more dependency and after discovering that [CSS rules can be insensitive](https://twitter.com/Kikobeats/status/1083091303930494976), I enabled it properly in where is possible. **Result**: Better data detection, less initial parsing time. ## Improve Date Rules Based on the insensitive CSS rules improvement, I was re-checking the bundle set related to `metascraper-date`. I detected some interesting improvement opportunities: some rules can be merged into the same, also being possible to convert some rules into more generic, improving the data accurately. Also, I tried to prioritize *update* over *create*, so the output is more associated with the last modification date over the creation date. **Result**: Better date accurate, more value detected. ## Improve URL detection The URL detection has been improved for being possible detected more kind of URLs. An URL is a subtype of [URI](https://kikobeats.com/what-is-uri). The thing that I want to be sure is detecting as much data as possible. Now the `metascraper-helpers` related with `urls` being possible detected URIs, such data image URI encoded on base64 or magnet URIs. The challenge here is doing that while we still support original functionality. I added a lot of tests to ensure about that. **Result**: Better URLs detection, supporting URIs.

boutell · 2019-01-10T20:07:40Z

When they appear as part of an HTML attribute, & symbols must be correctly escaped as &. Your input does not do this, so while the parser follows various "try and make sense of it" rules in the absence of the escaping, it is vulnerable to misinterpretation if the characters following the & resemble a known HTML entity escape code. This is not a bug in sanitize-html or htmlparser2, your input must be well-formed.

Kikobeats · 2019-01-10T21:52:36Z

Although I agree with you, that's an assumption difficult to accomplish when you are handling an arbitrary HTML markup and I expected that as part of the library responsibility

boutell · 2019-01-10T23:18:04Z

And it does what you're asking, except when the markup is actually ambiguous and it really can't tell what you want, as in this case (:

Kikobeats changed the title ~~Wrong encoding~~ Wrong URL encoding Jan 9, 2019

Kikobeats mentioned this issue Jan 10, 2019

Improve data accurate microlinkhq/metascraper#148

Merged

boutell closed this as completed Jan 10, 2019

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Wrong URL encoding #274

Wrong URL encoding #274

Kikobeats commented Jan 9, 2019

boutell commented Jan 10, 2019

Kikobeats commented Jan 10, 2019

boutell commented Jan 10, 2019

Wrong URL encoding #274

Wrong URL encoding #274

Comments

Kikobeats commented Jan 9, 2019

boutell commented Jan 10, 2019

Kikobeats commented Jan 10, 2019

boutell commented Jan 10, 2019