Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Return longest text after failing to detect text longer than the configured value #421

Closed
wants to merge 9 commits into from
Closed

Return longest text after failing to detect text longer than the configured value #421

wants to merge 9 commits into from

Conversation

andreskrey
Copy link
Contributor

Hi,

This solves issue #381.

I altered the loop that disables flags and now the algorithm will always return some text, which will be the longest it can find across the different attempts.

There's only one concern I have with this PR and it's on line 1108. I need to access the parent this scope to use the _getInnerText function and the only way I found to do so was to clone the this variable in the that variable. I'm sure there's something wrong because my IDE complains saying that the "Mutable variable is accessible from closure".

@gijsk can you apply some of your JS-fu and maybe suggest a better way to do this?

Thanks

Copy link
Contributor

@gijsk gijsk left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for working on this! This is definitely the right direction. Some comments follow that will help with the warning your IDE is showing and also improve performance a little bit. 👍

Readability.js Outdated
// Now that we've gone through the full algorithm, check to see if
// we got any meaningful content. If we didn't, we may need to re-run
// grabArticle with different flags set. This gives us a higher likelihood of
// finding the content, and the sieve approach gives us a higher likelihood of
// finding the -right- content.
if (this._getInnerText(articleContent, true).length < this._wordThreshold) {
parseSuccessful = false;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So what I would do is move the result of the _getInnerText().length text call to a variable before the if statement:

var textLength = this._getInnerText(articleContent, true).length;
if (textLength < this._wordThreshold) {

Readability.js Outdated
page.innerHTML = pageCacheHtml;

if (this._flagIsActive(this.FLAG_STRIP_UNLIKELYS)) {
this._removeFlag(this.FLAG_STRIP_UNLIKELYS);
this._attempts.push(articleContent);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Then in all of these, I'd push an object:

this._attempts.push({articleContent: articleContent, textLength: textLength});

Readability.js Outdated
} else {
return null;
// No luck after removing flags, just return the longest text we found during the different loops
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Note that you're missing the last result here - we need to push articleContent in the array here, too!

Readability.js Outdated
return null;
// No luck after removing flags, just return the longest text we found during the different loops
var that = this;
var results = this._attempts.sort(function (a, b) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Then you don't need to rebind this here and can just sort like so:

this._attempts.sort(function(a, b) {
  return a.textLength < b.textLength;
});

This has the advantage that we don't recompute the text content a lot of times, and we also avoid the this reference.

Note that in JavaScript, Array.sort modifies the array on which it's called, so you don't need to assign to var results here.

Readability.js Outdated
});

// But first check if we actually have something
if (this._getInnerText(results[0], true).length === 0) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Here you could then just check if (!this._attempts[0].textLength)

@gijsk
Copy link
Contributor

gijsk commented Feb 19, 2018

Should we add a test for this case, by the way? :-)

@andreskrey
Copy link
Contributor Author

Yeah, of course, I'll add the html of the referenced issue. I'll try to solve your reviews this week. Thanks!

@gijsk
Copy link
Contributor

gijsk commented Feb 26, 2018

Hey @andreskrey , thanks for this. There seems to be some issue with this sequence of changes because github is telling me "This branch cannot be rebased due to conflicts". Looking at the diff view, it looks OK to me. Do you know what's up, or shall I just try to do a separate merge commit myself to land this?

@andreskrey
Copy link
Contributor Author

@gijsk Yep yep yep, I'm aware of that, I'm not done yet tough! :)

I'm working on the test case and then I'll rewrite my commits, somewhere down the line I screwed the git log.

I'll let you know once it's done

@andreskrey
Copy link
Contributor Author

Also, thank you a lot for the review. Helped me a lot to understand the issues I had with my code :)

@andreskrey
Copy link
Contributor Author

Somehow I fucked the git log beyond repair. I'm going to create a new PR and close this one.

XKCD

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants