Extract image URL #975

LordEidi · 2025-07-21T07:54:38Z

I noticed that although the meta data of websites is used for title and other info, there is no extraction for the image URL (nor favicon). This pull request tries to get the image URL from a given website.

Please note that based on the nature of how the regex regarding metadata extraction works, the order of the new og:image and og:image:alt search in the regex is important.

LordEidi · 2025-07-21T10:18:24Z

Forgot to mention: This Pull Request addresses #834. At least in parts.

gijsk

Thanks for the patch!

I left some minor comments inline.

More importantly, I think some bits are missing here, unfortunately. I think we should have at least a little bit of automated test coverage. I don't have enough visibility into whether this should include JSON-LD; if there's a good reason to push that into a follow-up ticket/PR then that's fine but perhaps it could be included here if it's straightforward to do / if there are already examples of that in the existing test pages.

On the whole this part of the API is unfortunately not documented nor included in the type annotations; fixing that would also be useful but not required for this PR.

Readability.js

gijsk · 2025-07-22T14:45:27Z

Readability.js

    // property is a space-separated list of values
    var propertyPattern =
-      /\s*(article|dc|dcterm|og|twitter)\s*:\s*(author|creator|description|published_time|title|site_name)\s*/gi;
+      /\s*(article|dc|dcterm|og|twitter)\s*:\s*(author|creator|description|image:alt|image|published_time|title|site_name)\s*/gi;


In the PR comment you wrote:

Please note that based on the nature of how the regex regarding metadata extraction works, the order of the new og:image and og:image:alt search in the regex is important.

However, I'm a bit confused why og:image:alt is in here at all - we don't seem to use it later on? Is this just to avoid treating og:image:alt as equivalent/overriding og:image if both are present, because the og:image regex would still match the og:image:alt instance, only excluding the "alt" part?

A code comment about why this is here may help future readers. :-)

This needs a bit of explanation.

Looking into some BBC page, you find these lines (without context):

<meta data-rh="true" property="og:image" content="https://ichef.bbci.co.uk/ace/branded_news/1200/cpsprodpb/b98b/live/2554ced0-678d-11f0-8dbd-f3d32ebd3327.jpg"/> <meta data-rh="true" property="og:image:alt" content="Bandmates Tony Iommi, Ozzy Osbourne and Geezer Butler pose in front of yellow and white Grammy logos while holding a grammy award for Best Metal Performance at the 2014 Grammy awards"/>

If you look at the regex on line 1776 in Readability.js (before my change), you notice that it does not contain a check for neither the start nor the end of the property name.

var propertyPattern = /\s*(article|dc|dcterm|og|twitter)\s*:\s*(author|creator|description|image|published_time|title|site_name)\s*/gi;

If you check the lines from above for og:image, you will get two hits. one for every line. Because both lines contain og:image. While the first one would be a "direct" / complete hit, the second hit is not, it only matches a substring. Since there is no check for beginning/end of the property name.

In lines 1784 and following there is a loop over all meta elements and ultimately over all properties that return a hit for the regex. If there is a hit, the content is then placed in the field named after the hit.

The first line works, we get a hit for og:image, place the uri in the content as value for the property og:image.

The second line hits as well. But since we use the part of the regex that hits as a name for the property to place the content of the property into, we replace the value from the step above with (in this case) a note about Ozzy. Correct were to place that note into the property it belongs into: og:image:alt.

Placing the image:alt part right before the image part in the regex, we make sure that the first one hits first and the value is placed in the correct property (og:image:alt) and not into og:image.

True, this is some kind of hack. But if you want to fix this, you will have to completely refactor the whole _getArticleMetadata function. I tried to avoid that since this would be more intrusive and a lot more work to do. For me as a newcomer regarding Readability.js sources, probably not the best place to start with.

About your note re comment. The explanation needs some space. What about just linking this PR instead?

Linking to the PR would be fine, thanks!

…ms".

LordEidi · 2025-07-24T15:02:07Z

A few notes about the latest changes.

I added the extraction for the favicon. Which needed two new helper functions.
I also added the extraction for the modifiedTime.
And then I fixed a typo. In several places there was a lookup for properties within the dcterms namespace. Unfortunately, the s was missing in every occurrence. But the standard is written with an s, as you can see here: https://www.dublincore.org/resources/userguide/publishing_metadata/. This probably never really worked because of the missing s.

gijsk

Thanks for the iteration here, and especially catching that dcterms typo, that's very unfortunate.

I noticed there's still an issue with the favicon code (see inline comments). In a roundabout way, eslint also noticed this and complained.

Unfortunately our infra is set up such that it then (after the eslint failure) didn't run formatting (which would also have picked up a few things) or tests... it would be interesting to see if any test output/metadata changes as a result of this.

It would also be good to have at least some test coverage for the new properties you're adding here - I would imagine some existing testcases will already have favicons that you can use to add this test coverage (probably by adding checks in test/test-readability.js that already runs for all the pages in test/test-pages, we wouldn't necessarily need more testcases.

gijsk · 2025-07-25T11:55:27Z

Readability.js

+          // svg wins as best quality format
+          return this._toAbsoluteURI(favicon);


This returns from the inner function but not the outer one, so I don't think this works, unfortunately?

I think this is still broken, unfortunately? The code wants to use the return this._toAbsoluteURII(favicon) as the return value for _getArticleFavicon. But this is inside the (function) callback provided to this._forEachNode, and the return value is not used by _forEachNode, see

readability/Readability.js

Line 349 in 1f0ec42

Array.prototype.forEach.call(nodeList, fn, this);

. So it's just going to be dropped on the floor.

To make this work, switch from using _forEachNode to using a manual loop, for (var i = 0; i < metaElements.length; i++) etc. - this is fine in this loop as we're not modifying the DOM while we're looping over these elements.

Readability.js

…k for begin and end of string to not extract the wrong properties. Therefore, removed the "image:alt" hack from before (the hack failed for the regular image:size properties and similar anyway). - Moved the toAbsoluteUri function in function to the "root", changed all calls to the new location of the function. Using an arrow function in one location because of scope issues with "this".

LordEidi · 2025-08-04T07:50:33Z

I moved the _toAbsoluteURI function to the root, replaced all calls for it. And fixed that regex so that it does not need a hack anymore (which only worked in a specific case anyway).

gijsk

Apologies for the slow response to the update here. Unfortunately I think this is still not quite ready - see my inline comment. Please also make sure you run eslint/prettier before submitting - I'd offer to fix it myself (I understanding linting/formatting is annoying, especially if different from what you're used to) but it seems that edits by maintainers are not turned on for this pull request, so I can't.

gijsk · 2025-08-15T11:16:53Z

Readability.js

+          // svg wins as best quality format
+          return this._toAbsoluteURI(favicon);


I think this is still broken, unfortunately? The code wants to use the return this._toAbsoluteURII(favicon) as the return value for _getArticleFavicon. But this is inside the (function) callback provided to this._forEachNode, and the return value is not used by _forEachNode, see

readability/Readability.js

Line 349 in 1f0ec42

Array.prototype.forEach.call(nodeList, fn, this);

. So it's just going to be dropped on the floor.

To make this work, switch from using _forEachNode to using a manual loop, for (var i = 0; i < metaElements.length; i++) etc. - this is fine in this loop as we're not modifying the DOM while we're looping over these elements.

LordEidi and others added 2 commits July 21, 2025 09:45

Extract image URL

47f4b3a

Fixed image reference to dot notation according Mozilla QS

65ad660

gijsk requested changes Jul 22, 2025

View reviewed changes

Adding extraction for favicon and modifiedTime. Fixing typo in "dcter…

5936d88

…ms".

Updating description of returned object

27a90dc

gijsk requested changes Jul 25, 2025

View reviewed changes

gijsk requested changes Aug 15, 2025

View reviewed changes

		// svg wins as best quality format
		return this._toAbsoluteURI(favicon);

Extract image URL #975

Are you sure you want to change the base?

Extract image URL #975

Conversation

LordEidi commented Jul 21, 2025

Uh oh!

LordEidi commented Jul 21, 2025

Uh oh!

gijsk left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

gijsk Jul 22, 2025

Choose a reason for hiding this comment

Uh oh!

LordEidi Jul 24, 2025

Choose a reason for hiding this comment

Uh oh!

gijsk Jul 25, 2025

Choose a reason for hiding this comment

Uh oh!

LordEidi commented Jul 24, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

gijsk left a comment

Choose a reason for hiding this comment

Uh oh!

gijsk Jul 25, 2025

Choose a reason for hiding this comment

Uh oh!

gijsk Aug 15, 2025

Choose a reason for hiding this comment

Uh oh!

Uh oh!

LordEidi commented Aug 4, 2025

Uh oh!

gijsk left a comment

Choose a reason for hiding this comment

Uh oh!

gijsk Aug 15, 2025

Choose a reason for hiding this comment

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

LordEidi commented Jul 24, 2025 •

edited

Loading