Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix relative URL's for pages with a <base/> element. #422

Closed
wants to merge 18 commits into from

Conversation

bradphilips
Copy link
Contributor

@bradphilips bradphilips commented Feb 25, 2018

This fixes relative URL's for pages with a element. This base element is intended to override the base path for root-less relative url's. For example:

<base href="/"/> 
...
<img src="images/image.png"/>

The image in the source, if on http://www.somedomain.com/somepath/article, should resolve to http://www.somedomain.com/images/image.png given the base element relative to the path root "/".

Readability is currently not taking this into account and simply treating it as a path relative to the page, and resolving it to http://www.somedomain.com/somepath/images/image.png which is incorrect.

Let me know if you have questions. Thanks!

@gijsk
Copy link
Contributor

gijsk commented Feb 26, 2018

I'm pretty nervous about this, given the history in https://bugzilla.mozilla.org/show_bug.cgi?id=1173823 and especially https://bugzilla.mozilla.org/show_bug.cgi?id=1358248 .

Consumers can clearly already get the correct behaviour by passing the correct baseURI in from the document that they give readability to parse (per the patches on those bugs).

From this I'm assuming you're running into this from a different consumer application (ie not Firefox for desktop/android). Can you elaborate on your setup?

I'm somewhat sympathetic towards fixing this inside readability, but I think the correct fix would be to use (and maybe cache) this._doc.baseURI (which I assume jsdom implements correctly), and implement that in JSDOMParser if it doesn't already, instead of manually looking for the <base> element every time. I don't know off-hand how the spec says multiple <base> elements ought to be handled, but it may be worth looking at that.

Copy link
Contributor

@gijsk gijsk left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for this PR! I think it's a good idea, but I would prefer we lean more heavily on the DOM baseURI stuff if we can.

Readability.js Outdated
@@ -295,6 +295,9 @@ Readability.prototype = {

// Standard relative URI; add entire path. pathBase already includes a
// trailing "/".
if (baseElementPath)
return toAbsoluteURI(baseElementPath) + uri;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is recursive if the baseElementPath is something that doesn't hit the earlier return statements, which seems like it would lead to infinite recursion if passed the 'wrong' string.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It may be worth having a test case for a relative plain <base> element as well (e.g. <base href="foopath/"/>).

Readability.js Outdated
// Determine the path base if there is a "base" element
var baseElement = this._doc.getElementsByTagName('base');
if (baseElement.length > 0) {
baseElementPath = baseElement[0].getAttribute("href");
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As noted in my comment on the PR, I think we should use this._doc.baseURI - perhaps closer to the point where we need it (rather than passing it around).

Also, ideally we would use a URL constructor in the code that gets a baseURI rather than the manual URL munging. So something like:

var baseElement = this._doc.getElementsByTagName('base');
if (baseElement.length) {
  baseElementPath = new URL(baseElement[0].getAttribute("href"), this._uri.pathBase).href;
}

For this we'd have to up the node requirement to at least 7.0, which I think should be fine. I'll also have some issues integrating it into Firefox, but I'll deal with that when we get there. :-)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@gijsk, this definitely makes sense and I'll try to get to this tonight. To the bugs you reference, do you think there's a better way of doing this that wouldn't require a change to this library? Thanks for your feedback!

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think fixing the library is the 'right' fix, that we probably should have pursued in the bugzilla entries if I had considered it at the time.

But yeah, if you need a workaround, the Readability constructor takes a URI as argument. Instead of passing URI information out of document.location as noted in the readme, you could pass it out of a URL constructed from document.baseURI.

In fact, perhaps we could remove the URI argument for the Readability constructor entirely with some work. But don't worry about that as part of this PR.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

(And thank you for the PR! It's great to see a steady stream of people finding this library useful and contributing back fixes to make it work better.)

…the method, added a unit test for recursion of plain relative paths and fixed recursion.
@bradphilips
Copy link
Contributor Author

@gijsk , you were absolutely right about the problem with recursion on plain relative URL's in the base element. I couldn't get it exactly how you requested it but I think I hit your highlights of not passing it through and fixing the recursion problem. I also added a unit test for the plain relative base URL as you mentioned. I didn't implement URL because that would have to be required via node and was hesitant to do this in the Readability module itself -- looks like we want this to be a pure implementation.

Let me know what you think! Thanks for your help on this 👍 , this is helping a project I'm working on for RSS feeds. Love this lib.

@gijsk
Copy link
Contributor

gijsk commented Feb 27, 2018

OK. We can use URL by putting the require for it in index.js for use from nodejs, and it'll just use the 'regular' DOM one if you use the Readability and/or JSDOMParser scripts from a regular DOM window. Though in any case only JSDOMParser needs it, and so you wouldn't need it for use from a regular webpage. Which is good because apparently IE doesn't support new URL(). Using this, I have an implementation + tests for baseURI finished. I'll put up a PR in about an hour (need to travel right now), and then you can base a change to Readability on that, expecting the document's DOM to have a working baseURI property. This URL should always be absolute, which simplifies things a bit. :-)

@bradphilips
Copy link
Contributor Author

@gijsk this worked beautifully and actually handled even more than (at least) I thought it would for cleaning up URI's. This is now providing absolute paths for relative URI's that had ../ so I've updated those unit tests to provide the correct absolute URI. The only scenario it didn't take care of effectively is the './' because its relative to the pathBase instead of the base element path. Any ideas here would be appreciated. This is the resulting code:

function toAbsoluteURI(uri) {
  // If this is already an absolute URI, return it.
  if (/^[a-zA-Z][a-zA-Z0-9\+\-\.]*:/.test(uri))
    return uri;

  // Dotslash relative URI.
  if (uri.indexOf("./") === 0)
    return pathBase + uri.slice(2);

  // Ignore hash URIs:
  if (uri[0] == "#")
    return uri;

  // Standard relative URI; use the baseURI and construct the URL
  return new URL(uri, baseURI).href;
}

Let me know if this works. Thanks for all your help. :)

@gijsk
Copy link
Contributor

gijsk commented Mar 1, 2018

Sorry for not having gotten to this yet, btw, I've been a bit snowed under (both literally and figuratively - weather in the UK right now is interesting...).

The only scenario it didn't take care of effectively is the './' because its relative to the pathBase instead of the base element path. Any ideas here would be appreciated.

I'm afraid I don't quite understand from this comment what the issue is. Can you elaborate? It's possible we just need to update testcase expectations...

In particular, I see that one of your tests has e.g.:

 +    <p><a href="./foo/bar/baz.html">link</a></p> 

in the source, with a base of "base/", and the expected link right now is:

 +        <p><a href="http://fakehost/test/foo/bar/baz.html">link</a></p> 

whereas I would expect the expected link to still include base/.

I was hoping we could remove the toAbsoluteURI() function entirely, and just use new URL() on its own, but it's possible I missed edgecases which mean we need to keep it for now...

Entirely separately, github claims there are conflicts merging this... you may need to rebase? Sorry!

@bradphilips
Copy link
Contributor Author

@gijsk, you were right -- there were some URL's in the unit tests that were still using the relative path functionality and ultimately incorrect per the specification of the <base /> element.

I've changed this to use a URL constructed from the baseURI as suggested. That's odd about the conflicts -- i don't see any on my side. The only conflict i had was the package.json when i merged from the upstream. I'll try to get this rebased. Thanks!

@bradphilips
Copy link
Contributor Author

@gijsk I've rebased upstream on my changes, let me know if this looks okay now. Thanks!

gijsk pushed a commit that referenced this pull request Mar 2, 2018
@gijsk
Copy link
Contributor

gijsk commented Mar 2, 2018

I dunno what github's deal is... but I realized two things:

  1. we should keep in-document links (#foo) the same if there's no base URI.
  2. I forgot to update generate-testcase.js

So I fixed those (and some of the test fallout from that) and landed all of it as part of the commit 8525c6a . I think I've done it wrong somehow because github hasn't closed this PR... sorry about that. :-\

In any case, thanks very much for this! Hopefully these changes help you in what you're doing with readability - I know that they'll fix some issues in Firefox as well, once I merge the latest set of code in. 👍

@gijsk gijsk closed this Mar 2, 2018
@bradphilips
Copy link
Contributor Author

Awesome thank you so much. I was questioning that too actually. Thanks for all your help on this, it will help immensely with my project (maybe I can share to see if there's anything you'd like to reuse). I hope it also helps Firefox as well. Hope your weather clears up to, we been having some weird stuff come our way to. Thanks again! 👍

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants