Fails to detect article content #381
This page contains an article title, image, and body which Readability fails to detect (`_grabArticle` returns `null`): http://hukumusume.com/douwa/pc/aesop/01/01.htm

However, this similar (though longer) page is detected properly: http://hukumusume.com/douwa/pc/aesop/01/02.htm

Comments
The resulting text is 406 words long, and by default Readability discards texts shorter than 500 words. Try using something like this:
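For instance, something along these lines (a sketch; note that in current Readability.js the option is named `charThreshold`, while at the time of this thread it was called `wordThreshold`):

```js
// Lower the minimum-length threshold so shorter articles are not discarded.
// 250 is an illustrative value, not a recommendation.
const article = new Readability(document, { charThreshold: 250 }).parse();
```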
Thanks! I've forked readability for now to iteratively reduce the threshold if no body text is found, rather than bail completely.
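That fork's behavior can be sketched as a wrapper around an unmodified Readability (the function and threshold values here are illustrative, not the actual fork):

```js
// Retry parsing with progressively smaller thresholds instead of giving up
// on the first miss. Each attempt uses a fresh clone, since parse() mutates
// the DOM it is given.
function parseWithFallback(doc, thresholds = [500, 250, 100]) {
  for (const threshold of thresholds) {
    const clone = doc.cloneNode(true);
    const article = new Readability(clone, { charThreshold: threshold }).parse();
    if (article) {
      return article;
    }
  }
  return null; // nothing found, even at the smallest threshold
}
```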
Is the word count accurate here? I'm afraid I don't speak Japanese, but judging from the translated result, I expect our naive regex-based word parsing is also massively undercounting the number of words. From a quick look at Google, without a fully-fledged natural language parser, it's not all that feasible to "count words" in Hiragana, Katakana or Kanji (Chinese or Japanese). Perhaps we should ignore the word count if we encounter a significant proportion of non-Latin text? That would be a bit error-prone, but it feels somewhat better than just giving up for the wrong reasons... @aehlke does that sound sane, and/or is this something you would be interested in fixing? :-)
Maybe we can count characters, if that is possible in other languages? Something like "if the majority of the characters are non-Western, count X amount of characters to decide if the text was parsed correctly".
Yes, I suppose we could fall back to a character count. That's probably better than ignoring the word count completely - good idea!
Would a customized "length" function for strings and a regexp that detects when a piece of text belongs to the CJK codepoints be a good start? Note that in general Asian languages can express the same "content" with fewer characters, so the length function should account for that.
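A possible starting point along those lines; the Unicode ranges and the 2x weight below are illustrative guesses, not calibrated values:

```js
// Rough CJK detection via Unicode ranges: Han ideographs, Hiragana,
// Katakana, and Hangul. Illustrative only, not an exhaustive list.
const CJK_RE = /[\u4E00-\u9FFF\u3040-\u309F\u30A0-\u30FF\uAC00-\uD7AF]/;

// Weighted "length": count each CJK character as more than one unit,
// since CJK scripts typically pack more content per character.
// The default weight of 2 is an arbitrary placeholder.
function weightedLength(text, cjkWeight = 2) {
  let length = 0;
  for (const ch of text) {
    length += CJK_RE.test(ch) ? cjkWeight : 1;
  }
  return length;
}
```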
Performance might be an issue. I think counting characters for non-Western languages is better.
Counting characters should be OK for average Japanese or Chinese texts (though it may fall apart in edge cases like Japanese texts written for children, which use fewer kanji and more of the phonetic alphabets). I can't speak for other Asian languages, though - there are some that use purely phonetic alphabets.
I must be missing something, because I'm fairly new to JavaScript, but I've been looking into Readability and it seems that it already counts characters, not words:
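The check being referred to looks roughly like this (a paraphrase of the Readability.js source, not an exact quote):

```js
// Despite the option's name, this compares a character count, because
// String.prototype.length counts characters, not words.
if (this._getInnerText(articleContent, true).length < this._wordThreshold) {
  // ...retry with fewer restrictions, or bail out...
}
```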
What am I missing?
Huh, you're right. Guess this is fixed then... Go team! (?)
Thanks @andreskrey for looking into that. Then the only thing left is to rename the parameter to something like charThreshold? On a quick scan of the source code, the only language-dependent thing I've seen is the usage of the comma (',') here. My intuition tells me that Asian languages do not use the comma much. Maybe @aehlke can shed some light on this?
@JoanEspasa I only know Japanese. The comma is used there (as "、"), much less than in English, but it's still common.
We should stop returning no content at all when we can't find content longer than the minimum limit. However, if I remember correctly, we can't remove the limit entirely without regressing cases where the initial (strictest) pass ends up with little or no text, and then a pass that leaves more content intact finds more text. I think the easiest way to fix this is to save intermediate results and, at the end of all passes, select the one with the most text instead of returning an error. This should be relatively straightforward to implement. Would anyone be interested in contributing a patch for this? I'm sorry to say I have basically no time to dedicate to making these types of improvements myself at the moment :-( but I'm always happy to respond (hopefully quickly!) to pull requests.
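Roughly, that patch could look like this inside the pass loop (all names here are illustrative; this is a sketch of the idea, not the existing implementation):

```js
// Keep each pass's result and fall back to the longest one if no pass
// clears the threshold, instead of returning null.
const attempts = [];

for (const flags of passes) { // hypothetical list of per-pass flag sets
  const candidate = grabArticleForFlags(doc, flags); // hypothetical helper
  const length = candidate ? candidate.textContent.length : 0;
  if (length >= charThreshold) {
    return candidate; // long enough: use it directly, as today
  }
  attempts.push({ candidate, length });
}

// No pass met the threshold: return the longest non-empty attempt.
attempts.sort((a, b) => b.length - a.length);
return attempts[0].length > 0 ? attempts[0].candidate : null;
```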
Oh, and yes, we should rename that limit thing, given it's not really a word limit... @JoanEspasa For the comma stuff, I'm afraid I don't know any Asian languages, but it seems like filing a separate issue for that problem may be appropriate, ideally with a specific testcase where the current implementation produces problems. Perhaps we can at least include the alternative Unicode commas in the splitting code as a stop-gap, and/or think about a more reliable method of doing whatever that code is trying to do. I don't think we should be penalizing authors who simply don't use a lot of commas, either!
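As a sketch of that stop-gap (the exact set of comma characters is an illustrative guess):

```js
// Count commas across scripts instead of only the ASCII comma.
// U+FF0C is the fullwidth comma and U+3001 the ideographic comma used in
// Chinese and Japanese; more could be added for other scripts.
function getCommaCount(text) {
  const matches = text.match(/[,\uFF0C\u3001]/g);
  return matches ? matches.length : 0;
}
```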
Just my two cents, but do we really want to start returning whatever text we find? I think the 500 word/character limit is there to prevent selecting wrong paragraphs and to return meaningful content. Most people won't click the Readability button just to parse a text of fewer than 500 characters. If I'm using this function, I'm probably looking at a long article with distracting elements. Maybe we could lower the threshold, but removing it is, in my opinion, not a good approach to solve this problem. Just a minor rant :) @gijsk if you don't agree, I volunteer anyway to code the fix you proposed :D
In addition to the issue raised here with non-Latin languages, and how 'character count' is a bit of an imperfect metric for "is this a long-ish bit of text", it also turns out that people with various forms of disability use Reader Mode to normalize pages and then make them readable on other devices (sometimes as HTML or PDF, or by processing it into ePub). They don't care how long the texts are; they just want to read things on other devices, or without being uncomfortable / having a seizure as a result of crappy web layouts. So while I agree that having some kind of limit serves a purpose, I don't think the current situation is optimal. I'm open to other suggestions on how to fix that, though... :-)
Ok, I agree. I'll try to code a solution this weekend :)
Sorry for not being able to respond earlier. I can provide examples of some Japanese websites where Readability does not work as expected if you want.
We could say that the situation where we show the UI to users and they then end up with an error for lack of text is a false positive (FP); i.e., we said it should work but in reality it did not. In turn, sometimes there are false negatives (FN); that is, cases where we don't offer the reader view even though the page actually contains enough readable content.

Now, to the point: @gijsk, if you want to try to decrease the number of FPs, I would suggest building a corpus of pages labeled with whether they should produce a reader view, and running Readability over all of them. With that, and a given threshold, you can measure the percentage of FPs and FNs. If you change the threshold, you can determine which percentage of FPs you are willing to accept. If we want to take that further, we may have to open another issue 😸
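A minimal sketch of that measurement, assuming a hand-labeled corpus and using the isProbablyReaderable export from the current npm package to stand in for "we show the UI" (all other names are illustrative):

```js
const { Readability, isProbablyReaderable } = require("@mozilla/readability");

// FP = we offered reader view but parsing failed;
// FN = we did not offer it, but parsing would have succeeded.
function measureErrorRates(docs, threshold) {
  let fp = 0, fn = 0;
  for (const doc of docs) {
    const offered = isProbablyReaderable(doc);
    const clone = doc.cloneNode(true); // parse() mutates the DOM
    const parsed = !!new Readability(clone, { charThreshold: threshold }).parse();
    if (offered && !parsed) fp++;
    if (!offered && parsed) fn++;
  }
  return { fpRate: fp / docs.length, fnRate: fn / docs.length };
}
```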
@JoanEspasa I'd be interested in having a separate issue for that, yes please. I think this particular issue is fixed, thanks to @andreskrey and others for their help with this, so I will close it. You can verify in current Firefox Nightly ( https://nightly.mozilla.org/ ). 👍 |