-
Notifications
You must be signed in to change notification settings - Fork 620
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Return longest text after failing to detect text longer than the configured value #421
Return longest text after failing to detect text longer than the configured value #421
Conversation
… no relevant text was found)
… no relevant text was found)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for working on this! This is definitely the right direction. Some comments follow that will help with the warning your IDE is showing and also improve performance a little bit. 👍
Readability.js
Outdated
// Now that we've gone through the full algorithm, check to see if | ||
// we got any meaningful content. If we didn't, we may need to re-run | ||
// grabArticle with different flags set. This gives us a higher likelihood of | ||
// finding the content, and the sieve approach gives us a higher likelihood of | ||
// finding the -right- content. | ||
if (this._getInnerText(articleContent, true).length < this._wordThreshold) { | ||
parseSuccessful = false; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
So what I would do is move the result of the _getInnerText().length
text call to a variable before the if
statement:
var textLength = this._getInnerText(articleContent, true).length;
if (textLength < this._wordThreshold) {
Readability.js
Outdated
page.innerHTML = pageCacheHtml; | ||
|
||
if (this._flagIsActive(this.FLAG_STRIP_UNLIKELYS)) { | ||
this._removeFlag(this.FLAG_STRIP_UNLIKELYS); | ||
this._attempts.push(articleContent); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Then in all of these, I'd push an object:
this._attempts.push({articleContent: articleContent, textLength: textLength});
Readability.js
Outdated
} else { | ||
return null; | ||
// No luck after removing flags, just return the longest text we found during the different loops |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Note that you're missing the last result here - we need to push articleContent
in the array here, too!
Readability.js
Outdated
return null; | ||
// No luck after removing flags, just return the longest text we found during the different loops | ||
var that = this; | ||
var results = this._attempts.sort(function (a, b) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Then you don't need to rebind this
here and can just sort like so:
this._attempts.sort(function(a, b) {
return a.textLength < b.textLength;
});
This has the advantage that we don't recompute the text content a lot of times, and we also avoid the this
reference.
Note that in JavaScript, Array.sort
modifies the array on which it's called, so you don't need to assign to var results
here.
Readability.js
Outdated
}); | ||
|
||
// But first check if we actually have something | ||
if (this._getInnerText(results[0], true).length === 0) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Here you could then just check if (!this._attempts[0].textLength)
Should we add a test for this case, by the way? :-) |
Yeah, of course, I'll add the html of the referenced issue. I'll try to solve your reviews this week. Thanks! |
Hey @andreskrey , thanks for this. There seems to be some issue with this sequence of changes because github is telling me "This branch cannot be rebased due to conflicts". Looking at the diff view, it looks OK to me. Do you know what's up, or shall I just try to do a separate merge commit myself to land this? |
@gijsk Yep yep yep, I'm aware of that, I'm not done yet tough! :) I'm working on the test case and then I'll rewrite my commits, somewhere down the line I screwed the git log. I'll let you know once it's done |
Also, thank you a lot for the review. Helped me a lot to understand the issues I had with my code :) |
Hi,
This solves issue #381.
I altered the loop that disables flags and now the algorithm will always return some text, which will be the longest it can find across the different attempts.
There's only one concern I have with this PR and it's on line 1108. I need to access the parent
this
scope to use the_getInnerText
function and the only way I found to do so was to clone thethis
variable in thethat
variable. I'm sure there's something wrong because my IDE complains saying that the "Mutable variable is accessible from closure".@gijsk can you apply some of your JS-fu and maybe suggest a better way to do this?
Thanks