-
Notifications
You must be signed in to change notification settings - Fork 687
Extract image URL #975
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Extract image URL #975
Conversation
Forgot to mention: This Pull Request addresses #834. At least in parts. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for the patch!
I left some minor comments inline.
More importantly, I think some bits are missing here, unfortunately. I think we should have at least a little bit of automated test coverage. I don't have enough visibility into whether this should include JSON-LD; if there's a good reason to push that into a follow-up ticket/PR then that's fine but perhaps it could be included here if it's straightforward to do / if there are already examples of that in the existing test pages.
On the whole this part of the API is unfortunately not documented nor included in the type annotations; fixing that would also be useful but not required for this PR.
Readability.js
Outdated
// property is a space-separated list of values | ||
var propertyPattern = | ||
/\s*(article|dc|dcterm|og|twitter)\s*:\s*(author|creator|description|published_time|title|site_name)\s*/gi; | ||
/\s*(article|dc|dcterm|og|twitter)\s*:\s*(author|creator|description|image:alt|image|published_time|title|site_name)\s*/gi; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In the PR comment you wrote:
Please note that based on the nature of how the regex regarding metadata extraction works, the order of the new og:image and og:image:alt search in the regex is important.
However, I'm a bit confused why og:image:alt
is in here at all - we don't seem to use it later on? Is this just to avoid treating og:image:alt
as equivalent/overriding og:image
if both are present, because the og:image
regex would still match the og:image:alt
instance, only excluding the "alt" part?
A code comment about why this is here may help future readers. :-)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This needs a bit of explanation.
Looking into some BBC page, you find these lines (without context):
<meta data-rh="true" property="og:image" content="https://ichef.bbci.co.uk/ace/branded_news/1200/cpsprodpb/b98b/live/2554ced0-678d-11f0-8dbd-f3d32ebd3327.jpg"/>
<meta data-rh="true" property="og:image:alt" content="Bandmates Tony Iommi, Ozzy Osbourne and Geezer Butler pose in front of yellow and white Grammy logos while holding a grammy award for Best Metal Performance at the 2014 Grammy awards"/>
If you look at the regex on line 1776 in Readability.js (before my change), you notice that it does not contain a check for neither the start nor the end of the property name.
var propertyPattern =
/\s*(article|dc|dcterm|og|twitter)\s*:\s*(author|creator|description|image|published_time|title|site_name)\s*/gi;
If you check the lines from above for og:image, you will get two hits. one for every line. Because both lines contain og:image. While the first one would be a "direct" / complete hit, the second hit is not, it only matches a substring. Since there is no check for beginning/end of the property name.
In lines 1784 and following there is a loop over all meta elements and ultimately over all properties that return a hit for the regex. If there is a hit, the content is then placed in the field named after the hit.
The first line works, we get a hit for og:image, place the uri in the content as value for the property og:image.
The second line hits as well. But since we use the part of the regex that hits as a name for the property to place the content of the property into, we replace the value from the step above with (in this case) a note about Ozzy. Correct were to place that note into the property it belongs into: og:image:alt.
Placing the image:alt part right before the image part in the regex, we make sure that the first one hits first and the value is placed in the correct property (og:image:alt) and not into og:image.
True, this is some kind of hack. But if you want to fix this, you will have to completely refactor the whole _getArticleMetadata function. I tried to avoid that since this would be more intrusive and a lot more work to do. For me as a newcomer regarding Readability.js sources, probably not the best place to start with.
About your note re comment. The explanation needs some space. What about just linking this PR instead?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Linking to the PR would be fine, thanks!
A few notes about the latest changes.
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for the iteration here, and especially catching that dcterms
typo, that's very unfortunate.
I noticed there's still an issue with the favicon code (see inline comments). In a roundabout way, eslint
also noticed this and complained.
Unfortunately our infra is set up such that it then (after the eslint failure) didn't run formatting (which would also have picked up a few things) or tests... it would be interesting to see if any test output/metadata changes as a result of this.
It would also be good to have at least some test coverage for the new properties you're adding here - I would imagine some existing testcases will already have favicons that you can use to add this test coverage (probably by adding checks in test/test-readability.js
that already runs for all the pages in test/test-pages
, we wouldn't necessarily need more testcases.
// svg wins as best quality format | ||
return this._toAbsoluteURI(favicon); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This returns from the inner function but not the outer one, so I don't think this works, unfortunately?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think this is still broken, unfortunately? The code wants to use the return this._toAbsoluteURII(favicon)
as the return value for _getArticleFavicon
. But this is inside the (function) callback provided to this._forEachNode
, and the return value is not used by _forEachNode
, see
Line 349 in 1f0ec42
Array.prototype.forEach.call(nodeList, fn, this); |
To make this work, switch from using _forEachNode
to using a manual loop, for (var i = 0; i < metaElements.length; i++)
etc. - this is fine in this loop as we're not modifying the DOM while we're looping over these elements.
…k for begin and end of string to not extract the wrong properties. Therefore, removed the "image:alt" hack from before (the hack failed for the regular image:size properties and similar anyway). - Moved the toAbsoluteUri function in function to the "root", changed all calls to the new location of the function. Using an arrow function in one location because of scope issues with "this".
I moved the _toAbsoluteURI function to the root, replaced all calls for it. And fixed that regex so that it does not need a hack anymore (which only worked in a specific case anyway). |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Apologies for the slow response to the update here. Unfortunately I think this is still not quite ready - see my inline comment. Please also make sure you run eslint/prettier before submitting - I'd offer to fix it myself (I understanding linting/formatting is annoying, especially if different from what you're used to) but it seems that edits by maintainers are not turned on for this pull request, so I can't.
// svg wins as best quality format | ||
return this._toAbsoluteURI(favicon); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think this is still broken, unfortunately? The code wants to use the return this._toAbsoluteURII(favicon)
as the return value for _getArticleFavicon
. But this is inside the (function) callback provided to this._forEachNode
, and the return value is not used by _forEachNode
, see
Line 349 in 1f0ec42
Array.prototype.forEach.call(nodeList, fn, this); |
To make this work, switch from using _forEachNode
to using a manual loop, for (var i = 0; i < metaElements.length; i++)
etc. - this is fine in this loop as we're not modifying the DOM while we're looping over these elements.
I noticed that although the meta data of websites is used for title and other info, there is no extraction for the image URL (nor favicon). This pull request tries to get the image URL from a given website.
Please note that based on the nature of how the regex regarding metadata extraction works, the order of the new og:image and og:image:alt search in the regex is important.