Readability mode drops hyperlinked paragraph #878
Comments
Yes and no. :-) No, in that we should capture all the article text. Yes, in the sense that the code tries to determine whether paragraphs and links are relevant based on the number of commas, the amount of text, and the number of links in them. The outcome here is surprising to me, though - running the processing with
Ran the processing with debug turned on, and it looks like the missing text string is dropped. Debug output: HartlepoolNiramaxReadabilityDebugLogOutput.txt
It appears that the text is being dropped in the following location: PrepArticle() >
where, in this case, the following condition is true:
Can you try with current What you've shared so far is confusing, as the variable names and method calls all appear to have been changed; are you using some other version of the package (not the plain JS in this GitHub repo)? Anyway -
Ahh, sorry! We are using a package that uses Readability internally: SmartReader.
@gijsk Did you get a chance to look at this since the correct output logs?
I'm sorry, I haven't - I will put it on my list for next week. Unfortunately, the debug output still didn't have what I was hoping for - I think the logging here should be hit for the code described in this comment.
So a log from the commandline when generating a testcase and then running the
This makes sense - all the text in that paragraph is contained in a link (so the link density is 1).
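For readers unfamiliar with the term, link density is roughly the share of a node's text that sits inside links. The following is a minimal Python sketch of that ratio, not the Readability or SmartReader implementation; the class and function names are hypothetical, and whitespace handling is simplified.

```python
from html.parser import HTMLParser


class LinkDensityParser(HTMLParser):
    """Counts text characters overall and those nested inside <a> elements."""

    def __init__(self):
        super().__init__()
        self.link_depth = 0   # > 0 while we are inside at least one <a>
        self.total_chars = 0
        self.link_chars = 0

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self.link_depth += 1

    def handle_endtag(self, tag):
        if tag == "a" and self.link_depth > 0:
            self.link_depth -= 1

    def handle_data(self, data):
        # Simplification: leading/trailing whitespace of each text chunk is ignored.
        n = len(data.strip())
        self.total_chars += n
        if self.link_depth > 0:
            self.link_chars += n


def link_density(html_fragment: str) -> float:
    """Return linked characters divided by total characters (0.0 for empty text)."""
    parser = LinkDensityParser()
    parser.feed(html_fragment)
    parser.close()
    if parser.total_chars == 0:
        return 0.0
    return parser.link_chars / parser.total_chars
```

A paragraph whose entire text is wrapped in a single link, like the ones in this report, comes out at a density of 1 under this definition, while a paragraph with no links comes out at 0.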
This would make less sense to me, but I also cannot reproduce it, and you didn't specify which comma you removed... Instead, I see that removing a comma from that first para causes that para to be removed as well. I think the correct fix is probably to special-case situations where a single
Hi all,
I am experiencing an issue with applying readability to the following webpage:
https://www.hartlepoolmail.co.uk/news/crime/seven-hartlepool-men-appear-in-court-charged-with-the-murder-of-michael-phillips-637809
This webpage has two paragraphs that are hyperlinked:
However, Readability drops the second hyperlinked paragraph and keeps only the first one among the other extracted text.
I believe it could be an issue with how Readability handles the density of commas in these link-wrapped sections of text. I found that when I remove a single comma from the first hyperlinked paragraph and then apply readability mode, both hyperlinked paragraphs show up.
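To illustrate why a single comma could plausibly flip the result, here is a minimal Python sketch of a comma-gated check. It is an assumption for illustration only, not the actual Readability or SmartReader logic: the function name, the comma threshold, and the link-density limit are all hypothetical.

```python
def should_drop_paragraph(text: str, link_density: float,
                          min_commas: int = 10,
                          max_link_density: float = 0.5) -> bool:
    """Hypothetical keep/drop rule; both thresholds are illustrative assumptions."""
    # Text with many commas is treated as prose and always kept.
    if text.count(",") >= min_commas:
        return False
    # Otherwise, link-heavy text is treated as navigation and dropped.
    return link_density > max_link_density
```

Under a rule of this shape, text sitting right at the comma threshold changes category when one comma is added or removed, which is the kind of sensitivity described above.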
The way in which this was found was:
Is this an expected limitation of the package?