-
Notifications
You must be signed in to change notification settings - Fork 619
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Fix relative URL's for pages with a <base/> element. #422
Conversation
…> exists, this will override the relative path for all non-scoped baseless urls.
I'm pretty nervous about this, given the history in https://bugzilla.mozilla.org/show_bug.cgi?id=1173823 and especially https://bugzilla.mozilla.org/show_bug.cgi?id=1358248 . Consumers can clearly already get the correct behaviour by passing the correct From this I'm assuming you're running into this from a different consumer application (ie not Firefox for desktop/android). Can you elaborate on your setup? I'm somewhat sympathetic towards fixing this inside readability, but I think the correct fix would be to use (and maybe cache) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for this PR! I think it's a good idea, but I would prefer we lean more heavily on the DOM baseURI
stuff if we can.
Readability.js
Outdated
@@ -295,6 +295,9 @@ Readability.prototype = { | |||
|
|||
// Standard relative URI; add entire path. pathBase already includes a | |||
// trailing "/". | |||
if (baseElementPath) | |||
return toAbsoluteURI(baseElementPath) + uri; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is recursive if the baseElementPath
is something that doesn't hit the earlier return statements, which seems like it would lead to infinite recursion if passed the 'wrong' string.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It may be worth having a test case for a relative plain <base>
element as well (e.g. <base href="foopath/"/>
).
Readability.js
Outdated
// Determine the path base if there is a "base" element | ||
var baseElement = this._doc.getElementsByTagName('base'); | ||
if (baseElement.length > 0) { | ||
baseElementPath = baseElement[0].getAttribute("href"); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
As noted in my comment on the PR, I think we should use this._doc.baseURI
- perhaps closer to the point where we need it (rather than passing it around).
Also, ideally we would use a URL
constructor in the code that gets a baseURI
rather than the manual URL munging. So something like:
var baseElement = this._doc.getElementsByTagName('base');
if (baseElement.length) {
baseElementPath = new URL(baseElement[0].getAttribute("href"), this._uri.pathBase).href;
}
For this we'd have to up the node requirement to at least 7.0, which I think should be fine. I'll also have some issues integrating it into Firefox, but I'll deal with that when we get there. :-)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@gijsk, this definitely makes sense and I'll try to get to this tonight. To the bugs you reference, do you think there's a better way of doing this that wouldn't require a change to this library? Thanks for your feedback!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think fixing the library is the 'right' fix, that we probably should have pursued in the bugzilla entries if I had considered it at the time.
But yeah, if you need a workaround, the Readability constructor takes a URI as argument. Instead of passing URI information out of document.location
as noted in the readme, you could pass it out of a URL constructed from document.baseURI
.
In fact, perhaps we could remove the URI argument for the Readability constructor entirely with some work. But don't worry about that as part of this PR.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
(And thank you for the PR! It's great to see a steady stream of people finding this library useful and contributing back fixes to make it work better.)
…the method, added a unit test for recursion of plain relative paths and fixed recursion.
@gijsk , you were absolutely right about the problem with recursion on plain relative URL's in the base element. I couldn't get it exactly how you requested it but I think I hit your highlights of not passing it through and fixing the recursion problem. I also added a unit test for the plain relative base URL as you mentioned. I didn't implement URL because that would have to be required via node and was hesitant to do this in the Readability module itself -- looks like we want this to be a pure implementation. Let me know what you think! Thanks for your help on this 👍 , this is helping a project I'm working on for RSS feeds. Love this lib. |
OK. We can use |
# Conflicts: # package.json
@gijsk this worked beautifully and actually handled even more than (at least) I thought it would for cleaning up URI's. This is now providing absolute paths for relative URI's that had ../ so I've updated those unit tests to provide the correct absolute URI. The only scenario it didn't take care of effectively is the './' because its relative to the function toAbsoluteURI(uri) {
// If this is already an absolute URI, return it.
if (/^[a-zA-Z][a-zA-Z0-9\+\-\.]*:/.test(uri))
return uri;
// Dotslash relative URI.
if (uri.indexOf("./") === 0)
return pathBase + uri.slice(2);
// Ignore hash URIs:
if (uri[0] == "#")
return uri;
// Standard relative URI; use the baseURI and construct the URL
return new URL(uri, baseURI).href;
} Let me know if this works. Thanks for all your help. :) |
Sorry for not having gotten to this yet, btw, I've been a bit snowed under (both literally and figuratively - weather in the UK right now is interesting...).
I'm afraid I don't quite understand from this comment what the issue is. Can you elaborate? It's possible we just need to update testcase expectations... In particular, I see that one of your tests has e.g.:
in the source, with a base of
whereas I would expect the expected link to still include I was hoping we could remove the Entirely separately, github claims there are conflicts merging this... you may need to rebase? Sorry! |
@gijsk, you were right -- there were some URL's in the unit tests that were still using the relative path functionality and ultimately incorrect per the specification of the I've changed this to use a URL constructed from the baseURI as suggested. That's odd about the conflicts -- i don't see any on my side. The only conflict i had was the package.json when i merged from the upstream. I'll try to get this rebased. Thanks! |
…> exists, this will override the relative path for all non-scoped baseless urls.
…the method, added a unit test for recursion of plain relative paths and fixed recursion.
@gijsk I've rebased upstream on my changes, let me know if this looks okay now. Thanks! |
I dunno what github's deal is... but I realized two things:
So I fixed those (and some of the test fallout from that) and landed all of it as part of the commit 8525c6a . I think I've done it wrong somehow because github hasn't closed this PR... sorry about that. :-\ In any case, thanks very much for this! Hopefully these changes help you in what you're doing with |
Awesome thank you so much. I was questioning that too actually. Thanks for all your help on this, it will help immensely with my project (maybe I can share to see if there's anything you'd like to reuse). I hope it also helps Firefox as well. Hope your weather clears up to, we been having some weird stuff come our way to. Thanks again! 👍 |
This fixes relative URL's for pages with a element. This base element is intended to override the base path for root-less relative url's. For example:
The image in the source, if on http://www.somedomain.com/somepath/article, should resolve to http://www.somedomain.com/images/image.png given the base element relative to the path root "/".
Readability is currently not taking this into account and simply treating it as a path relative to the page, and resolving it to http://www.somedomain.com/somepath/images/image.png which is incorrect.
Let me know if you have questions. Thanks!