
Fails to detect article content #381

Closed
aehlke opened this issue Jun 20, 2017 · 19 comments

@aehlke

aehlke commented Jun 20, 2017

This page contains an article title, image, and body which Readability fails to detect (_grabArticle returns null): http://hukumusume.com/douwa/pc/aesop/01/01.htm

However, this similar (though longer) page is detected properly: http://hukumusume.com/douwa/pc/aesop/01/02.htm

@andreskrey
Contributor

Resulting text is 406 words long, and by default Readability discards texts shorter than 500 words.
If you set the wordThreshold option to 300, the article gets parsed correctly.

Try using something like this:

var loc = document.location; // "loc" here is the page's location object
var uri = {
    spec: loc.href,
    host: loc.host,
    prePath: loc.protocol + "//" + loc.host,
    scheme: loc.protocol.substr(0, loc.protocol.indexOf(":")),
    pathBase: loc.protocol + "//" + loc.host + loc.pathname.substr(0, loc.pathname.lastIndexOf("/") + 1)
};
var article = new Readability(uri, document, {'wordThreshold': 303}).parse();

@aehlke
Author

aehlke commented Nov 26, 2017

Thanks! I've forked readability for now to iteratively reduce the threshold if no body text is found, rather than bail completely.

@gijsk
Contributor

gijsk commented Nov 27, 2017

Is the word count accurate here? I'm afraid I don't speak Japanese, but judging from the translated result, I expect our naive regex-based word parsing is also massively undercounting the number of words. From a quick look at Google, without a fully-fledged natural language parser, it's not all that feasible to "count words" in Hiragana, Katakana or Kanji (Chinese or Japanese).

Perhaps we should ignore the word count if we encounter a significant proportion of non-Latin text? That would be a bit error-prone, but it feels somewhat better than just giving up for the wrong reasons...

@aehlke does that sound sane, and/or is this something you would be interested in fixing? :-)

@andreskrey
Contributor

Maybe we can count characters, if that is possible in other languages? Something like "if the majority of the characters are non-western, count X amount of characters to decide if the text was parsed correctly"

@gijsk
Contributor

gijsk commented Nov 27, 2017

Yes, I suppose we could fall back to a character count. That's probably better than ignoring the word count completely - good idea!

@JoanEspasa

JoanEspasa commented Jan 10, 2018

Would a customized "length" function for strings and a regexp that detects when a piece of text belongs to the CJK codepoints be a good start?

Note that in general Asian languages can express the same "content" with fewer characters, so the length function should account for that.
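
As a rough illustration of that idea, something like the following could work; the helper name cjkAwareLength and the weight of 2 per CJK character are purely illustrative, not part of Readability:

// Hypothetical helper, not part of Readability: estimate an "effective length"
// for a string, weighting CJK characters more heavily since CJK text packs
// more content into each character than Latin text.
var CJK_REGEX = /[\u3040-\u30ff\u3400-\u4dbf\u4e00-\u9fff\uf900-\ufaff\uff66-\uff9f]/g;

function cjkAwareLength(text) {
    var cjkChars = (text.match(CJK_REGEX) || []).length;
    var otherChars = text.length - cjkChars;
    // The weight of 2 per CJK character is an arbitrary illustration, not a tuned value.
    return otherChars + cjkChars * 2;
}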

@andreskrey
Contributor

Performance might be an issue. I think counting characters for non-western languages is better.

@aehlke
Copy link
Author

aehlke commented Jan 10, 2018

Counting characters should be OK for average Japanese or Chinese texts (though it may fall apart in edge cases like Japanese texts written for children, which use fewer kanji and more of the phonetic alphabets). I can't speak for other Asian languages, though - there are ones that use purely phonetic alphabets.

@JoanEspasa

I must be missing something, because I'm fairly new to javascript, but I've been looking into Readability and it seems that it already counts characters, not words:

  • By its name, wordThreshold should be the parameter governing the minimum number of words needed for content to be considered valid.

  • The only place this option seems to be used is here, where it is compared with the result of invoking this._getInnerText on the extracted content. _getInnerText in turn returns the .length (number of characters) of the textContent of the given node and all its descendants.

What am I missing?
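
To make the point concrete, the behaviour being described boils down to something like the following. This is a paraphrase of the description above, not a verbatim excerpt of the Readability source:

// Paraphrase of the behaviour described above, not the actual source:
// despite its name, "wordThreshold" is compared against a character count.
var textLength = articleContent.textContent.length;
if (textLength < options.wordThreshold) {
    // Readability treats the attempt as a failure and returns no article.
}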

@andreskrey
Contributor

Huh, you're right.

Guess this is fixed then... Go team! (?)

@JoanEspasa

Thanks @andreskrey for looking into that. Then, the only thing left is to rename the parameter to something like charThreshold?

On a quick scan of the source code, the only language-dependent thing I've seen is the usage of the comma (',') here. My intuition tells me that Asian languages do not use the comma much. Maybe @aehlke can shed some light on this?

@aehlke
Author

aehlke commented Jan 12, 2018

@JoanEspasa I only know Japanese. Comma is used there (as "、"), much less than English but still common.

@gijsk
Contributor

gijsk commented Jan 29, 2018

We should stop returning no content at all when we can't find content longer than the minimum limit.

However, if I remember correctly we can't remove the limit entirely without regressing cases where the initial (strict-most) pass ends up with little or no text, and then a pass that leaves more content intact finds more text.

I think the easiest way to fix this is to save intermediate results, and at the end of all passes, select the one with the most text instead of returning an error. This should be relatively straightforward to implement. Would anyone be interested in contributing a patch for this? I'm sorry to say I have basically no time to dedicate to making these types of improvements myself at the moment ( :-( ) but I'm always happy to respond (hopefully quickly!) to pull requests.
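
A rough sketch of that idea follows; the names passes, runGrabArticlePass and getInnerText are hypothetical placeholders, not the actual Readability internals:

// Illustrative sketch of the proposed fix; helper names are hypothetical.
// Each pass relaxes the extraction flags a bit; instead of bailing out when
// a pass yields too little text, remember the result and keep going.
var attempts = [];
for (var i = 0; i < passes.length; i++) {
    var articleContent = runGrabArticlePass(passes[i]);
    var textLength = getInnerText(articleContent).length;
    if (textLength >= wordThreshold) {
        return articleContent; // long enough, stop here
    }
    attempts.push({ content: articleContent, length: textLength });
}
// No pass reached the threshold: return the longest attempt instead of an error.
attempts.sort(function (a, b) { return b.length - a.length; });
return attempts.length ? attempts[0].content : null;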

@gijsk
Contributor

gijsk commented Jan 29, 2018

Oh, and yes, we should rename that limit thing given it's not really a word limit...

@JoanEspasa For the comma stuff, I'm afraid I don't know any Asian languages, but it seems like filing a separate issue for that problem may be appropriate, ideally with a specific testcase where the current implementation produces problems. Perhaps we can at least include the alternative Unicode commas in the splitting code as a stop-gap, and/or think about a more reliable way of doing whatever that code is trying to do. I don't think we should be penalizing authors who simply don't use a lot of commas, either!
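
For concreteness, such a stop-gap could recognize the common Unicode comma variants alongside the ASCII one; this is a sketch, not the current Readability code:

// Sketch of a stop-gap: count comma-separated chunks using several comma
// variants (ASCII, fullwidth, ideographic, Arabic) instead of only ','.
var COMMA_VARIANTS = /[,\uFF0C\u3001\u060C]/;

function commaCount(text) {
    return text.split(COMMA_VARIANTS).length - 1;
}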

@andreskrey
Contributor

Just my two cents, but do we really want to start returning whatever text we find? I think the 500 words/characters limit is there to prevent selecting the wrong paragraphs and to ensure we return meaningful content. Most people won't click the readability button just to parse a text shorter than 500 characters.

If I'm using this feature, I'm probably looking at a long article with distracting elements. Maybe we could lower the threshold, but removing it entirely is not, in my opinion, a good approach to solving this problem.

Just a minor rant :) @gijsk if you don't agree, I volunteer anyway to code the fix you proposed :D

@gijsk
Contributor

gijsk commented Jan 30, 2018

We already have isProbablyReaderable which controls whether we show the reader mode icon. What I want to get rid of is this situation where we show the UI to users so they think it's usable, and then they end up with the error message instead of reader mode.

In addition to the issue raised here with non-Latin languages, and how 'character count' is a bit of an imperfect metric for "is this a long-ish bit of text", it also turns out that people with various forms of disability use Reader Mode to normalize pages and then make them readable on other devices (sometimes as HTML or PDF, or by processing it further, e.g. into ePub). They don't care how long the texts are; they just want to read things on other devices, or without being uncomfortable / having a seizure as a result of crappy web layouts.

So while I agree that having some kind of limit serves a purpose, I don't think the current situation is optimal. I'm open to other suggestions on how to fix that, though... :-)

(Note that the reason we have isProbablyReaderable is that running the entirety of readability on every page load adds a bunch of slowdown that we need to avoid - can't spend 60ms to decide not to show that button every time you load a page...)

@andreskrey
Contributor

Ok, I agree. I'll try to code a solution this weekend :)

@JoanEspasa

Sorry for not being able to respond earlier. I can provide examples of some Japanese websites where Readability does not work as expected if you want.

isProbablyReaderable is a function that can be considered a heuristic. It tries to predict if the Readability process will work or not. It returns a Boolean, as it is directly used to determine if the button in the UI should be shown or not.

We could say that the situation where we show the UI to users and then they end up with an error for lack of text is a false positive (FP), i.e. we said it should work but in reality it did not. In turn, sometimes there are false negatives (FN): isProbablyReaderable returns false, but if we executed Readability it would extract the content successfully. Normally, FP and FN are considered complementary: if you decrease the number of FPs, you will increase the number of FNs, and vice versa.

Now, to the point: @gijsk , if you want to try and decrease the number of FP, I would suggest the following:

  • Transform isProbablyReaderable to return a number (easily done, as it already uses a score value).
  • Establish an initial threshold, which will determine whether the UI is shown or not.
  • Gather a set of URLs where you know Readability works, and a set of URLs where you know it does not.

With that and a given threshold, you can measure the percentage of FPs and FNs. By varying the threshold you can decide which percentage of FPs you are willing to accept; something like the sketch below. If you want to take this further we may have to open another issue 😸
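
For illustration, measuring FP/FN rates against a candidate threshold could look roughly like this. It assumes a score-returning isProbablyReaderable and hand-curated URL lists, neither of which exists today:

// Illustrative harness; assumes isProbablyReaderable has been changed to return
// a numeric score, and that the two input arrays hold scores for hand-curated
// sets of pages where Readability is known to work / known to fail.
function evaluateThreshold(threshold, scoresWhereItWorks, scoresWhereItFails) {
    var falseNegatives = scoresWhereItWorks.filter(function (s) { return s < threshold; }).length;
    var falsePositives = scoresWhereItFails.filter(function (s) { return s >= threshold; }).length;
    return {
        fpRate: falsePositives / scoresWhereItFails.length,
        fnRate: falseNegatives / scoresWhereItWorks.length
    };
}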

@gijsk
Contributor

gijsk commented Mar 13, 2018

@JoanEspasa I'd be interested in having a separate issue for that, yes please. I think this particular issue is fixed, thanks to @andreskrey and others for their help with this, so I will close it. You can verify in current Firefox Nightly ( https://nightly.mozilla.org/ ). 👍
