Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Wrong URL encoding #274

Closed
Kikobeats opened this issue Jan 9, 2019 · 3 comments
Closed

Wrong URL encoding #274

Kikobeats opened this issue Jan 9, 2019 · 3 comments

Comments

@Kikobeats
Copy link

Hello,

I noted the library is wrongly encoding the following URL:

https://o.aolcdn.com/images/dims?thumbnail=1200%2C630&quality=80&image_uri=https%3A%2F%2Fo.aolcdn.com%2Fimages%2Fdims%3Fcrop%3D4834%252C3230%252C0%252C0%26quality%3D85%26format%3Djpg%26resize%3D1600%252C1069%26image_uri%3Dhttps%253A%252F%252Fs.yimg.com%252Fos%252Fcreatr-images%252F2019-01%252F2d1e22f0-12ae-11e9-bae7-60d640081814%26client%3Da1acac3e1b3290917d92%26signature%3D269ae0af3a1a1772ffea6759d126791b9ad25184&client=amp-blogside-v2&signature=ea8c3bcfcab1cb450830a18333e5a76988365326

Specifically, the library is doing unreacheable the URL converting ℑ into .

Sample code for reproducing that:

'use strict'

const sanitizeHtml = require('sanitize-html')

const sanitize = html =>
  sanitizeHtml(html, {
    allowedTags: false, // allow all tags
    allowedAttributes: false, // allow all attributes
    parser: {
      lowerCaseTags: true,
      decodeEntities: true,
      lowerCaseAttributeNames: true
    }
  })

const html = sanitize(`
<html lang="en">
<meta property="og:image" content="https://o.aolcdn.com/images/dims?thumbnail=1200%2C630&quality=80&image_uri=https%3A%2F%2Fo.aolcdn.com%2Fimages%2Fdims%3Fcrop%3D4834%252C3230%252C0%252C0%26quality%3D85%26format%3Djpg%26resize%3D1600%252C1069%26image_uri%3Dhttps%253A%252F%252Fs.yimg.com%252Fos%252Fcreatr-images%252F2019-01%252F2d1e22f0-12ae-11e9-bae7-60d640081814%26client%3Da1acac3e1b3290917d92%26signature%3D269ae0af3a1a1772ffea6759d126791b9ad25184&client=amp-blogside-v2&signature=ea8c3bcfcab1cb450830a18333e5a76988365326">
</html>
`)

console.log(html)

I'm not suring why this happening 🤔

@Kikobeats Kikobeats changed the title Wrong encoding Wrong URL encoding Jan 9, 2019
Kikobeats added a commit to microlinkhq/metascraper that referenced this issue Jan 10, 2019
## Remove `sanitize-html`

The dependency is introducing a bug related to malformed URLs: apostrophecms/sanitize-html#274

In fact, I detected it's no longer necessary since `htmlparser2` is present as part of `cheerio` load method.

**Result**: Smaller bundler, less parsing time.

## Setup CSS Insensitive Rules

One of the things related to `sanitize-html` was normalized some common things around the HTML markup.

Because this dependency is no more dependency and after discovering that [CSS rules can be insensitive](https://twitter.com/Kikobeats/status/1083091303930494976), I enabled it properly in where is possible.

**Result**: Better data detection, less initial parsing time.

## Improve Date Rules

Based on the insensitive CSS rules improvement, I was re-checking the bundle set related to `metascraper-date`.

I detected some interesting improvement opportunities: some rules can be merged into the same, also being possible to convert some rules into more generic, improving the data accurately.

Also, I tried to prioritize *update* over *create*, so the output is more associated with the last modification date over the creation date.

**Result**: Better date accurate, more value detected.

## Improve URL detection

The URL detection has been improved for being possible detected more kind of URLs.

An URL is a subtype of [URI](https://kikobeats.com/what-is-uri). The thing that I want to be sure is detecting as much data as possible.

Now the `metascraper-helpers` related with `urls` being possible detected URIs, such data image URI encoded on base64 or magnet URIs.

The challenge here is doing that while we still support original functionality. I added a lot of tests to ensure about that.

**Result**: Better URLs detection, supporting URIs.
@boutell
Copy link
Member

boutell commented Jan 10, 2019

When they appear as part of an HTML attribute, & symbols must be correctly escaped as &amp;. Your input does not do this, so while the parser follows various "try and make sense of it" rules in the absence of the escaping, it is vulnerable to misinterpretation if the characters following the & resemble a known HTML entity escape code. This is not a bug in sanitize-html or htmlparser2, your input must be well-formed.

@boutell boutell closed this as completed Jan 10, 2019
@Kikobeats
Copy link
Author

Although I agree with you, that's an assumption difficult to accomplish when you are handling an arbitrary HTML markup and I expected that as part of the library responsibility

@boutell
Copy link
Member

boutell commented Jan 10, 2019

And it does what you're asking, except when the markup is actually ambiguous and it really can't tell what you want, as in this case (:

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants