Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Wait for lazy loaded elements when using the fullPage option #40

Merged
merged 15 commits into from
Aug 10, 2020

Conversation

sgtrusty
Copy link
Contributor

@sgtrusty sgtrusty commented Nov 13, 2019

Fixes #28

Based on the following walkthrough initially documented by a now deprecated ("Note: Service has been discontinued, this website remains for archival purposes") platform. Thanks to its wiki's author for documenting.

i failed to archive.org the site ... wiki zegnat net_cache__md5=f7ce4fd73de0ac41f15ea708b4c8f20f


IssueHunt Summary

Referenced issues

This pull request has been submitted to:


IssueHunt has been backed by the following sponsors. Become a sponsor

@sgtrusty
Copy link
Contributor Author

Just saw the other issue ... #1 ... Pending some fixes then (for endless scrollers and also expanding horizontally)

@Vasile-Peste
Copy link

This

await wait(100);

is not good.

You should wait actually for images and other snippets to load, not giving them 100ms...
The waitForNavigation method waiting for a network idle should be used instead.

@sgtrusty
Copy link
Contributor Author

sgtrusty commented Nov 20, 2019

@Vasile-Peste
I see your message. I will do some more investigation on the topic and add the due fixes. Thanks.

@sindresorhus
Copy link
Owner

Bump :)

@sgtrusty
Copy link
Contributor Author

Thanks for the followup, I haven't had time in order to investigate further and commit the necessary changes. If it may be of interest, I'm also working on a little app of my own of similar nature. click me

@sgtrusty
Copy link
Contributor Author

sgtrusty commented Feb 24, 2020

@sindresorhus
Check it and let me know if anything is pending. I will try to find the time to tackle the other issues in the post.

@sgtrusty
Copy link
Contributor Author

sgtrusty commented Feb 25, 2020

f) For tackling FIXED elements, after taking the very first screenshot, remove them from page otherwise they will repeat in every screenshot and will cause noise.

Do you suppose this would be a cheap or actual solution? I don't think this has been implemented yet.

As well as this:

For INFINITE scroll
a) Scroll the page to the bottom
b) Check if height is increased because of lazy-included elements at the bottom
c) Repeat (a) and (b) until height is not increasing or the MAX_SCREENSHOT_HEIGHT is reached.

From: #1 (comment)

@sindresorhus
Copy link
Owner

Yes, those situations needs to be handled too.

@sindresorhus
Copy link
Owner

Do you suppose this would be a cheap or actual solution? I don't think this has been implemented yet.

That sounds like an ok solution.

@sgtrusty
Copy link
Contributor Author

sgtrusty commented Mar 7, 2020

Ok, if it's possible, i'd like to solve those as part of the other PR, in order to get the bounty for this solution first. Let me know if that is possible, so I may work both solutions out in order to get the bounties. Thanks.

(note this solves #28 and those issues are described per the PR #1)

@sindresorhus
Copy link
Owner

Ok, if it's possible, i'd like to solve those as part of the other PR, in order to get the bounty for this solution first.

Sure

index.js Outdated Show resolved Hide resolved
index.js Outdated
@@ -7,6 +7,7 @@ const puppeteer = require('puppeteer');
const devices = require('puppeteer/DeviceDescriptors');
const toughCookie = require('tough-cookie');

const sleep = promisify(setTimeout);
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
const sleep = promisify(setTimeout);
const delay = promisify(setTimeout);

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Missing delay name change in practice (use) as well. Going to commit standalone.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Note: i decided to remove this in order to add waitForNavigation instead. Let me know when we can run some tests to see if it is working adequately.

index.js Outdated Show resolved Hide resolved
index.js Outdated Show resolved Hide resolved
});

// Some extra delay to let images load
await page.waitForFunction(imagesHaveLoaded, {timeout: 60});
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think the user would expect it to take up to one minute. If we keep this, it needs to be documented.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Might also be added as an argument with default value set to timeout 60. Might want to comment on this or implement some other better functionality? Maybe it's also worth looking into alternate functions, such as the waitForNavigation Vasile mentioned earlier.

This

await wait(100);

is not good.

You should wait actually for images and other snippets to load, not giving them 100ms...
The waitForNavigation method waiting for a network idle should be used instead.

In case it might be useful for documentation purposes; I retrieved part of the solution from https://stackoverflow.com/questions/51651830/how-to-wait-for-all-images-to-load-from-page-evaluate-function-in-puppeteer-when

Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would look into something better.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

using page.waitForNavigation as per pointed by #40 (review)

@sindresorhus sindresorhus changed the title Wait for lazy loaded elements when using the ullPage option Wait for lazy loaded elements when using the fullPage option Mar 9, 2020
@sindresorhus sindresorhus changed the title Wait for lazy loaded elements when using the fullPage option Wait for lazy loaded elements when using the fullPage option Mar 9, 2020
@sindresorhus
Copy link
Owner

This will need some automated tests.

sgtrusty and others added 2 commits March 13, 2020 00:07
Co-Authored-By: Sindre Sorhus <[email protected]>
viewportIncr to elongated viewportIncrement

Co-Authored-By: Sindre Sorhus <[email protected]>
@sgtrusty
Copy link
Contributor Author

sgtrusty commented Mar 13, 2020

This will need some automated tests.

I found two websites with appropriate content which could be used as guidelines for the test cases:

Perhaps it would be wise to contact the webmasters of said pages and see if they can be used for the test scenarios, or if it would be wise to use a static html inside the capture-website repo distribution, or some other sort of scenario. For the sake of me I couldn't find a legally scrapable website like http://testing-ground.scraping.pro/ for this purpose.

P.S.: Should there be some brainstorming spare time, should probably also reconsider the whole eslint-disable/eslint-enable transaction for no-await-in-loop and such. Hoping to hear back from you. Thanks.

Edit: also please let me know if you have any clue why the last build failed to run under Node v8.

@sindresorhus
Copy link
Owner

or if it would be wise to use a static html inside the capture-website repo distribution, or some other sort of scenario.

It would be better to have something locally that is not affected by network conditions or whether the website in question is up. I think you can just put together a minimal HTML page that lazy loads some images from Unsplash. Lots of examples out there for that.

P.S.: Should there be some brainstorming spare time, should probably also reconsider the whole eslint-disable/eslint-enable transaction for no-await-in-loop and such.

Not sure I understand?

Edit: also please let me know if you have any clue why the last build failed to run under Node v8.

Just ignore Node.js 8. I plan to drop support for it.

@sgtrusty
Copy link
Contributor Author

sgtrusty commented Mar 14, 2020

The eslint part:
https://github.com/netrules/capture-website/blob/master/index.js#L337-L344

/* eslint-disable no-await-in-loop */
await page.evaluate(_viewportHeight => {
/* eslint-disable no-undef */
window.scrollBy(0, _viewportHeight);
/* eslint-enable no-undef */
}, viewportHeight);
await navigationPromise;
/* eslint-enable no-await-in-loop */

I will look into the HTML making and upload that briefly. Hope everything works out, thanks.

index.js Outdated
const viewportHeight = viewportOptions.height;
let viewportIncrement = 0;
while (viewportIncrement + viewportHeight < bodyBoundingHeight) {
const navigationPromise = page.waitForNavigation();
Copy link

@ghost ghost Apr 8, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Shouldn't this line wait for networkidle instead of the default loaded? I assume this promise will be awaited to let the images load, right?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I probably agree with you, but care to clarify?

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I probably agree with you, but care to clarify?

Waiting for network idle allows lazy elements to load (elements loaded after the "load" event of the document). I suggest doing as follows:

  1. Wait "load" event of the document.
  2. Scroll the page to its maximum Y (this will load lazy elements like lazy loaded images).
  3. Wait for network idle.

Copy link

@ghost ghost Apr 12, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's also worth to keep in mind that some (old) websites can actually never fulfill the requirements for the network idle navigation event - for example frequent chat requests.

In the end, the propsed flow above should work just fine (most cases).

sgtrusty and others added 2 commits April 11, 2020 20:13
* Fix setting an expiring cookie

Fixes sindresorhus/capture-website-cli#21

* 0.8.1

Co-authored-by: Sindre Sorhus <[email protected]>
test.js Outdated
for(let i=0;i<numItemsToGenerate;i++){
let randomImageIndex = Math.floor(Math.random() * numImagesAvailable);
renderGalleryItem(randomImageIndex);
}
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you clean up the code style here? It's a bit messy. Use async/await, template literals, etc.

Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also move the string to the top-level.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Gotcha. Will learn a few things here and there and proceed as such. Thanks for the feedback

package.json Outdated
@@ -1,6 +1,6 @@
{
"name": "capture-website",
"version": "0.8.0",
"version": "0.8.1",
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Don't bump this.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Dunno why that'd get bumped. Gotcha.

@sindresorhus
Copy link
Owner

Bump

@sgtrusty
Copy link
Contributor Author

Bump

Seems like pretty much already complete. All that is pending for me to check is #40 (comment) ... I will look into that and deliver a final result.

@sgtrusty
Copy link
Contributor Author

sgtrusty commented Jul 23, 2020

Bump

I looked into what you say, and since the necessary solution for testing isn't even that complex, I came up with two approaches:

  • Making it completely template literal, i.e.:
server.get('/', async (request, response) => {
		response.end(`
			<body>
				<div style="display: grid; grid-template-columns: repeat( auto-fill, minmax(150px, 1fr) );" id="grid">
					<img src="https://picsum.photos/150/150?random=1" loading="lazy" />
					<img src="https://picsum.photos/150/150?random=2" loading="lazy" />
					<img src="https://picsum.photos/150/150?random=3" loading="lazy" />
					<img src="https://picsum.photos/150/150?random=4" loading="lazy" />
					<img src="https://picsum.photos/150/150?random=5" loading="lazy" />
					<img src="https://picsum.photos/150/150?random=6" loading="lazy" />
					<img src="https://picsum.photos/150/150?random=7" loading="lazy" />
					<img src="https://picsum.photos/150/150?random=8" loading="lazy" />
					<img src="https://picsum.photos/150/150?random=9" loading="lazy" />
					<img src="https://picsum.photos/150/150?random=10" loading="lazy" />
					<img src="https://picsum.photos/150/150?random=11" loading="lazy" />
					<img src="https://picsum.photos/150/150?random=12" loading="lazy" />
					<img src="https://picsum.photos/150/150?random=13" loading="lazy" />
					<img src="https://picsum.photos/150/150?random=14" loading="lazy" />
					<img src="https://picsum.photos/150/150?random=15" loading="lazy" />
				</div>
			</body>
		`);
  • Or using the final ES6 code I came up with:
const imageUrl = 'https://picsum.photos/150/150';
const imageHolder = document.getElementById('grid');

const renderGallery = async _ => {
  let galleryItem;
  
  for(let i=0;i<250;i++){
    galleryItem = document.createElement('div');
    galleryItem.innerHTML = '<img src="'+imageUrl+'?random='+i+'" loading="lazy" />';
    imageHolder.appendChild(galleryItem);
  }
}

renderGallery();

Oh yeah, I switched the API to picsum which is essentially still Unsplash, but it's less chunky and the cache doesn't overcomplicate things.

Also pending: #40 (comment)

Just saw the other issue ... #1 ... Pending some fixes then (for endless scrollers and also expanding horizontally)

Missing these two things. Waiting for your input, then I'll proceed with those two. I have no idea how to deal with endless scrollers though, perhaps another waitForNavigation and catch a timeOut specified by default in your docs, could be up to 5-10 minutes, I guess, if the page keeps on scrolling at scrolling. This would be simpler than getting into the whole quantical formulation of connection speed * page content, etc.

@sindresorhus
Copy link
Owner

Making it completely template literal, i.e.:

Seems like the simplest solution. You can just do a for loop to generate the <img elements strings.

@sindresorhus
Copy link
Owner

Missing these two things. Waiting for your input, then I'll proceed with those two. I have no idea how to deal with endless scrollers though, perhaps another waitForNavigation and catch a timeOut specified by default in your docs, could be up to 5-10 minutes, I guess, if the page keeps on scrolling at scrolling. This would be simpler than getting into the whole quantical formulation of connection speed * page content, etc.

I don't have any good suggestions on how to resolve those issues.

@sgtrusty
Copy link
Contributor Author

sgtrusty commented Aug 3, 2020

Cool. Merge pulls, please. Let's see how I can go about claiming this bounty 👍
Maybe I'll also fix #1 later for next bounty (two issues I described earlier)

@sindresorhus sindresorhus merged commit 9401749 into sindresorhus:master Aug 10, 2020
@sindresorhus
Copy link
Owner

@netrules This seems to fail randomly on Travis: https://travis-ci.org/github/sindresorhus/capture-website/jobs/716595910

@sgtrusty
Copy link
Contributor Author

sgtrusty commented Aug 11, 2020

@sindresorhus Perhaps because we are running outside sources which are network expensive. Would you like to blob them inside a local storage cache, separated from the code, or maybe as part of the library, as template/skeleton files?

Edit: maybe it's even possible to find a solution which generates random image blobs on the run, while inserting lazy elements into the fullPage. It's a bit outside of the scope of the current bug fix, but I can find a work around since it is the test I designed.

Edit 2: https://stackoverflow.com/questions/53187941/why-is-puppeteer-failing-simple-tests-with-waiting-for-function-failed-timeou also found this. How do you recommend I go about testing? My own repo and a travis pipeline?

Edit 3: this might be useful? master...netrules:patch-1 ... Thanks a lot for the bounty btw.

sgtrusty added a commit to sgtrusty/capture-website that referenced this pull request Aug 11, 2020
@sindresorhus
Copy link
Owner

Edit 3: this might be useful? master...netrules:patch-1 ... Thanks a lot for the bounty btw.

Can you do a PR? That will run travis and check if it passes.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Wait for lazy loaded elements when using the fullPage option
3 participants