Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Readability mode drops hyperlinked paragraph #878

Open
swethapillai opened this issue Jun 11, 2024 · 8 comments
Open

Readability mode drops hyperlinked paragraph #878

swethapillai opened this issue Jun 11, 2024 · 8 comments

Comments

@swethapillai
Copy link

Hi all,
I am experiencing an issue with applying readability to the following webpage:
https://www.hartlepoolmail.co.uk/news/crime/seven-hartlepool-men-appear-in-court-charged-with-the-murder-of-michael-phillips-637809

This webpage has two paragraphs that are hyperlinked:

<div class="Markup__ParagraphWrapper-sc-13q6ywe-0 iWyzoK markup ">
<p><a href="https://www.hartlepoolmail.co.uk/news/crime/live-updates-seven-hartlepool-men-appear-court-charged-murder-michael-phillips-637633" rel="nofollow" target="_blank" data-vars-event="gaEvent" data-vars-ec="navigation" data-vars-ea="in article" data-vars-el="plain links" data-vars-aidclick="637633" data-vars-titleclick="On Friday September 27, John Musgrave, 54, Sean Musgrave, 30, both Wordsworth Avenue, Hartlepool, and Anthony Small, 39, of Rydal Street, Hartlepool all denied the charge of murder. " data-vars-urlclick="https://www.hartlepoolmail.co.uk/news/crime/live-updates-seven-hartlepool-men-appear-court-charged-murder-michael-phillips-637633"><strong>On Friday September 27, John Musgrave, 54, Sean Musgrave, 30, both Wordsworth Avenue, Hartlepool, and Anthony Small, 39, of Rydal Street, Hartlepool all denied the charge of murder. </strong></a></p>
</div>
<div class="Markup__ParagraphWrapper-sc-13q6ywe-0 iWyzoK markup ">
<p><strong><a href="https://www.hartlepoolmail.co.uk/news/crime/niramax-boss-neil-elliott-and-co-accused-face-trial-next-year-after-denying-murder-of-michael-phillips-472180" rel="nofollow" target="_blank" data-vars-event="gaEvent" data-vars-ec="navigation" data-vars-ea="in article" data-vars-el="plain links" data-vars-aidclick="472180" data-vars-titleclick="Previously, a director of Niramax waste management firm, Neil Elliott, 44, of Briarfield Close, Hartlepool, and co-accused Lee Darby, 31, of Ridley Court, Hartlepool denied killing Mr Phillips. " data-vars-urlclick="https://www.hartlepoolmail.co.uk/news/crime/niramax-boss-neil-elliott-and-co-accused-face-trial-next-year-after-denying-murder-of-michael-phillips-472180">Previously, a director of Niramax waste management firm, Neil Elliott, 44, of Briarfield Close, Hartlepool, and co-accused Lee Darby, 31, of Ridley Court, Hartlepool denied killing Mr Phillips. </a></strong></p>
</div>

However, readability drops the second hyperlinked paragraph, and only keeps the first hyperlinked paragraph amongst the other extracted text.

image image

I believe it could be an issue with readability and the density of commas in these wrapped sections of text. I found that when I remove a single comma from the first hyperlinked paragraph and then apply readability mode - both hyperlinked paragraphs show up.

The way in which this was found was:

  1. Open webpage in firefox
  2. Click on readability icon

Is this an expected limitation of the package?

@gijsk
Copy link
Contributor

gijsk commented Jun 14, 2024

Is this an expected limitation of the package?

Yes and no. :-)

No, as in, we should capture all the article text.

Yes in the sense that, the code tries to determine whether paragraphs and links are relevant based on the number of commas / text / links in them. Though the outcome is surprising to me here - running the processing with debug turned on might reveal why this is happening.

@swethapillai
Copy link
Author

Ran processing with debug turned on and it looks like we drop the missing text string in the Article content post-prep stage of readability.
I.e. we successfully grab the missing string: 'Previously, a director of...' for the first stage: Pre-GrabArticle and the second: Article content pre-prep but drop this text in Article content post-prep.

Debug output: HartlepoolNiramaxReadabilityDebugLogOutput.txt

@swethapillai
Copy link
Author

swethapillai commented Jun 17, 2024

It appears that the text is being dropped in the following location:

PrepArticle() > this.CleanConditionally(articleContent, "div"); (line 582) where flag1 is false and flag2, which is the output, is true.
The value of flag2 is determined by the following condition:

bool flag2 = 
	(double) num3 > 1.0 && (double) num2 / (double) num3 < 0.5 && !Reader.HasAncestorTag(node, "figure") ||
	!flag1 && (double) num4 > (double) num2 ||
	(double) num5 > Math.Floor((double) num2 / 3.0) ||
	!flag1 && (double) textDensity < 0.8999999761581421 && ((length1 < 25 ? 1 : 0) & (num3.CompareTo(0.0f) == 0 ? 1 : ((double) num3 > 2.0 ? 1 : 0))) != 0 && !Reader.HasAncestorTag(node, "figure") ||
	!flag1 && classWeight < 25 && linkDensity > 0.20000000298023224 ||
	(double) classWeight >= 25.0 && linkDensity > 0.5 ||
	num6 == 1 && length1 < 75 ||
	num6 > 1;

where, in this case, the following condition is true:

!flag1 &&
(double) textDensity < 0.8999999761581421 &&
((length1 < 25 ? 1 : 0) & (num3.CompareTo(0.0f) == 0 ? 1 : ((double) num3 > 2.0 ? 1 : 0))) != 0 &&
!Reader.HasAncestorTag(node, "figure")

@gijsk
Copy link
Contributor

gijsk commented Jun 18, 2024

Can you try with current main HEAD? That would provide better logging output.

What you've shared so far is confusing, as the variable names and method calls all appear to have been changed; are you using some other version of the package (not the plain JS in this github repo) ? Anyway - length1 here should be the innerText.length of the node in question, which is way longer than 25, so I don't understand how that could be true here. On main the logging will indicate exactly which condition is failing and on what grounds.

@swethapillai
Copy link
Author

Ahh sorry! We are using a package that uses Readability within it *SmartReader.
Hence why the output is not as you expected.
However the issue does still show up when viewing the webpage using readability via the Firefox browser.
The debug logs, using readability from this repo, also show that we drop the text string in: Article content post-prep
Debug logs: ReadabilityDebugOutput.txt
Here's the repo where we tested this out - if you want to have a deeper look.

@george-kirkman
Copy link

@gijsk Did you get a change to look at this since the correct output logs?

@gijsk
Copy link
Contributor

gijsk commented Jun 28, 2024

I'm sorry, I haven't - I will put it on my list for next week. The debug output unfortunately still didn't have what I was hoping for - the logging here I think should be hit for the code described in this comment.

@gijsk
Copy link
Contributor

gijsk commented Jul 1, 2024

So a log from the commandline when generating a testcase and then running the debug-testcase.js script against the same source produces:

Reader: (Readability) Cleaning Conditionally <div class="Markup__ParagraphWrapper-sc-13q6ywe-0 bhZZYl markup ">
Reader: (Readability) Checks failed [ 'Low weight and a little linky. (linkDensity=1)' ]

This makes sense - all the text in that paragraph is contained in a link (so the link density is 1).

I found that when I remove a single comma from the first hyperlinked paragraph and then apply readability mode - both hyperlinked paragraphs show up.

This would make less sense to me - but also I cannot reproduce it and you didn't specify which comma you removed... instead, I see removing a comma from that first para causing that para to also be removed...

I think the correct fix is probably to specialcase situations where a single <a> tag is the entire content of a <p> or even <h1>/<h2>/... tag and strip out the link but keep its text, or something - assuming the text is long enough (for some meaning of long enough).

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

4 participants