[Request] Real-world Gatsby sites (50k+ pages) #19512

Closed
pvdz opened this issue Nov 14, 2019 · 62 comments
Labels
help wanted, topic: performance

Comments

@pvdz
Contributor

pvdz commented Nov 14, 2019

Hello kind Gatsby user 👋

My name is Peter, and I’m a Gatsby employee focused on performance and scalability.

We at Gatsby are always looking for ways to improve the performance of building out your Gatsby applications and making Gatsby not only scalable to building out hundreds of thousands of pages (or more!) but making this process as lightning quick as the resulting Gatsby application.

To best support this endeavor, we need your help! We have benchmarks in the repo, but they tend to be quite contrived and not necessarily indicative of real-world usage of Gatsby.

Specifically, we're looking for sites that:

  • Have 50k pages (or more!)
  • Are using Gatsby v2 (ideally the latest version of Gatsby)
  • Can be relatively easily set up (e.g. no or minimally complicated build process, just gatsby build would be ideal!)
  • Secrets can be shared with us privately (see my e-mail below)

For this first batch, we'll be using these real-world applications to identify low-hanging fruit as it relates to performance so we can make the Gatsby build process ever faster and ever more scalable.

Does this sound like you? Please share a link to your application's source code below or e-mail any necessary details to peter at gatsbyjs dot com. We appreciate you 💜

Thanks! Onwards and upwards 📈🚀

@pvdz pvdz added the help wanted and no triage labels Nov 14, 2019
@DSchau DSchau pinned this issue Nov 14, 2019
@eads

eads commented Nov 15, 2019

Hi there! Yes I'd very much like to help. My biggest Gatsby site has ~24k pages but I figure it's still a pretty decent one to take a look at. You can see it live at https://govbook.chicagoreporter.com/en/ and the code is open source at https://github.com/thechicagoreporter/govbook

It runs on Gatsby v2. No secrets required, and we track the source data in the repo for now, so you shouldn't even need to pull down fresh data. It does depend on having SQLite installed; I don't think anything else is required other than the standard Gatsby dependencies.

Also tagging in @artfulaction who is contracted to work on the project through at least the end of the year.

I've had a helluva time getting it to build on Netlify and AWS Amplify, so that's been a persistent issue. Thus far to locally develop the site I just limit the size of the query manually which isn't ideal either.

The database that drives it is about 8k rows, and there's a row per page. How is it over 24k pages then? Because there's a Spanish version, an English version, and the redirect page. Any language we add will add another 8k pages to the build, and we're very much hoping to get a few more languages (especially Mandarin and Polish) into the site in the next year or so.

Longer term, we hope to bake out VCF files for each of the contacts in the database, so that will add tens of thousands of additional files as well and should represent an interesting use case for the Gatsby toolchain.

Thanks for doing this 👏

@pvdz
Contributor Author

pvdz commented Nov 18, 2019

@eads this is a great example of what I'm looking for. Thank you :)

@Swizec
Contributor

Swizec commented Nov 18, 2019

Hi,

I don't quite fit your requirements, but I do have a Gatsby build that takes upwards of 40 minutes on my local machine, crashes Zeit, and makes Netlify choke. Both also choke when I try to upload the resulting static package of 18,000+ files.

It's been fun :D

Right now Gatsby's cloud service comes the closest to working.

Here's the repo and specific branch: https://github.com/Swizec/swizec-blog/tree/many-posts

It's about 1400 pages in all, but the image and embed processing kills everything. I even have to increase Node's heap size to make the build survive.

You can't see it live anywhere because I'm trying to avoid having to set up my own VPS+CDN and such. That was one of my original motivations for moving to Gatsby in the first place: hoping for an easy way to host and set up with modern tools.

@pvdz
Contributor Author

pvdz commented Nov 19, 2019

@Swizec thanks! Your site may not have as many pages but those images sure keep the cores spinning. Not sure how much we can improve on that since image processing is simply expensive. However, we do see some problems with the social plugin, and some room for improvement on fetching external resources. Thank you :)

@pvdz
Contributor Author

pvdz commented Nov 20, 2019

cc @brod-ie @ashtonsix You've mentioned in other issues that you have mega big sites. Any chance we could get a slice of that for benchmarking? It could ultimately help your site as well.

pvdz added a commit that referenced this issue Nov 21, 2019
The particular search here started with `pages.length == 25599` and `matchPathPages.length == 2` but because `matchPathPages` was being pushed onto inside the loop, the array grew and grew. Ultimately the inner `find` was calling its callback 72 million (!) times, causing massive build delays (this was taking 2/3rds of the total build time for this site).

In contrast, this `forEach` took 390s (seconds!) before and now takes 0.47s on my machine.

This problem is caused by inefficient routing that is sometimes necessary for proper 404 handling. This becomes a big problem as the site grows and the match-path.json grows. There are some user-land improvements we can suggest to work around other problems, but at least this one we can resolve.

Thanks to @eads for offering a real-world site in #19512 so we could profile this, and @pieh for helping with the debugging.
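
To illustrate the kind of pattern this commit message describes, here is a minimal sketch. It is not the actual Gatsby source: `collectSlow`, `collectFast`, and the page shape are hypothetical names chosen for illustration. It only shows why a `find` over an array that grows inside the loop becomes quadratic, and how a `Set` lookup avoids that.

```javascript
// Hypothetical sketch: not the real Gatsby code, just the shape of the problem.
// Each page is assumed to be an object with a `path` string.

// Slow: `find` scans an array that keeps growing inside the loop, so the
// total number of callback invocations grows quadratically with page count.
function collectSlow(pages) {
  const matchPathPages = [];
  let callbackCalls = 0;
  pages.forEach((page) => {
    const seen = matchPathPages.find((p) => {
      callbackCalls++;
      return p.path === page.path;
    });
    if (!seen) matchPathPages.push(page);
  });
  return { count: matchPathPages.length, callbackCalls };
}

// Fast: a Set gives constant-time membership checks, so the loop stays linear.
function collectFast(pages) {
  const seenPaths = new Set();
  const matchPathPages = [];
  for (const page of pages) {
    if (!seenPaths.has(page.path)) {
      seenPaths.add(page.path);
      matchPathPages.push(page);
    }
  }
  return { count: matchPathPages.length };
}

const pages = Array.from({ length: 2000 }, (_, i) => ({ path: `/page/${i}/` }));
const slow = collectSlow(pages);
const fast = collectFast(pages);
console.log(slow.count === fast.count, slow.callbackCalls);
```

Both functions collect the same pages; only the number of comparisons differs, which is the gap that shows up as build time on large sites.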
wardpeet pushed a commit that referenced this issue Nov 22, 2019
@rjyo

rjyo commented Nov 22, 2019

Hi @pvdz, Gatsby is really nice and I would like to provide an example here.

It has around 48K pages with 200K+ records and growing.

With the queries optimized, it builds under 5 minutes with just gatsby build on a 9700K with 32GB RAM.

success open and validate gatsby-configs - 0.161s
success load plugins - 2.368s
success onPreInit - 0.008s
success delete html and css files from previous builds - 0.007s
success initialize cache - 0.006s
success copy gatsby files - 0.038s
success onPreBootstrap - 0.824s
success loading DatoCMS content - 4.214s
success source and transform nodes - 4.593s
warn On types with the `@dontInfer` directive, or with the `infer` extension set to `false`, automatically adding fields for children types
is deprecated.
In Gatsby v3, only children fields explicitly set with the `childOf` extension will be added.
success building schema - 0.486s
success createPages - 39.880s
success createPagesStatefully - 0.052s
success onPreExtractQueries - 0.005s
success update schema - 22.360s
success extract queries from components - 0.395s
success write out requires - 0.075s
success write out redirect data - 0.001s
success Build manifest and related icons - 0.072s
success onPostBootstrap - 0.105s
⠀
info bootstrap finished - 75.080 s
⠀
success Building production JavaScript and CSS bundles - 11.084s
success Rewriting compilation hashes - 0.002s
success run queries - 72.590s - 46529/46529 640.98/s
success Building static HTML for pages - 87.670s - 46526/46526 530.70/s
info Done building in 246.738346453 sec

Everything went smoothly until this morning, when we upgraded to 2.18+ and the build speed dropped dramatically when running batched GraphQL queries in gatsby-node.js. Still investigating the reason. (And that's why I saw this request.)

Thank you all for the great Gatsby!

@asilgag

asilgag commented Nov 22, 2019

Hi @pvdz and @sidharthachatterjee,

Coming here from #19718. Glad to help on improving Gatsby performance!

Our case scenario:

  • 10,000 JSON source files which turn into 10,000 HTML pages and 10,000 AMP pages. Thus, builds generate 20,000 pages
  • Build time: ~2 minutes on a m5.2xlarge EC2 instance. We are quite happy with these times.
  • We don't use GraphQL nor gatsby-plugin-sharp. We have a really complex set of JSON source files without a standardized schema, so, on a loop, we read all JSON files on gatsby-node.js, and call createPage for each one, passing the parsed JSON data on page context. All data needed to create the page is in those JSON files.
  • On the same loop, we call createPage for AMP pages, passing a different component as a template.
  • Gatsby version 2.17.10

This is the build log we get:

success open and validate gatsby-configs - 0.786s
success load plugins - 0.798s
success onPreInit - 0.059s
success delete html and css files from previous builds - 0.010s
success initialize cache - 0.008s
success copy gatsby files - 0.019s
success onPreBootstrap - 0.003s
success source and transform nodes - 2.530s
success building schema - 1.080s
success createPages - 20.736s
success createPagesStatefully - 0.053s
success onPreExtractQueries - 0.001s
success update schema - 0.025s
success extract queries from components - 0.316s
success write out requires - 0.072s
success write out redirect data - 0.002s
success Build manifest and related icons - 0.027s
success onPostBootstrap - 0.059s
info bootstrap finished - 28.844 s

success Building production JavaScript and CSS bundles - 7.128s
success Rewriting compilation hashes - 0.002s
success run queries - 34.002s - 19563/19563 575.36/s
success Building static HTML for pages - 37.959s - 19555/19555 515.15/s

info Done building in 105.479798822 sec

We would like to improve the following:

  • createPages step: ~20 seconds
  • "Building production JavaScript and CSS bundles" step: we would like a build flag to disable generating bundles for specific paths, like AMP.
  • "run queries" step: we don't completely understand this step. We think Gatsby is saving "page-data.json" files to disk on this step, is that right? We would like to disable creating "page-data.json" files for specific paths, like AMP.

All in all, we would like some kind of flag to make Gatsby work as a "fully static site generator". I mean:

  • Without generating page-data.json
  • Without building production JavaScript and CSS bundles
  • Without adding JS assets to page bottom (and their corresponding preload links on head)

I know this could sound stupid: "why turn Gatsby into a traditional static site generator like Hugo or Jekyll?". Well, apart from solving our scaling issues with AMP, I can't imagine working without React components, even if they are only used to generate static HTML without any further JS interactivity. Hugo and Jekyll are fine, but React's simplicity and working with components are key for us (and for a lot of people, I think).

I can't publicly share any further detail here, but I'll reach you by email with more details.

Thanks!

@ganapativs
Contributor

I had a huge problem with scalability with Gatsby earlier.

Issue: #17233

I had to switch to Next.js because of this. Happy to see that Gatsby team is prioritizing scalability 👍

@pvdz
Contributor Author

pvdz commented Nov 26, 2019

@rjyo the regression came with the shadowing feature that landed a few days ago. We're looking into the regression and how to best mitigate it. I don't suppose I could build your site myself for benchmarking purposes? :) Thanks for the feedback!

@asilgag we kind of need the page-data.json per page, if nothing else, for later parallelization. Each page becomes an individual job and that way we would be able to spread the load on multiple cores, something we can't do just yet. We should be able to improve the situation though. And if you don't save page-data.json to disk you'd have to retain it in memory, which certainly does not scale for most people (although some can certainly just throw money at it). I will take your suggestions into consideration when contemplating next steps into scaling perf and get back to you on them. Thank you!

@rjyo

rjyo commented Nov 26, 2019

@pvdz I just upgraded to 2.18.4 and the performance regression is gone! createPages took 20s more than in 2.17.x builds, and the update schema time went down from 20s+ to less than 1s, i.e. the sum is quite steady.

Thanks for the information!

@pvdz
Contributor Author

pvdz commented Nov 27, 2019

@ganapativs Sorry to hear that! I am definitely interested in your case and will be looking into it, regardless. Thanks for the test case :)

@rjyo

rjyo commented Nov 28, 2019

@pvdz After running on 2.18.4 with dozens of hourly builds on CI, around 50% of the builds failed on createPages

...
error "gatsby-node.js" threw an error while running the createPages lifecycle:
Cannot read property 'rocket' of null
  TypeError: Cannot read property 'rocket' of null
...

where rocket should be returned from the GraphQL request. Note: there are a bunch of queries running in createPages, and most had already finished without any problem.

Rerunning the job has, again, about a 50% success rate.

Hope you guys can find the problem. Please contact me directly if there is any debug info I can provide.

Thanks!
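
For intermittent failures like the `Cannot read property 'rocket' of null` error above, one defensive option is to check each query result before reading from it. The sketch below assumes that `rocket` is read from the query's `data` object (the original comment does not show the query), and `runQuery` is a hypothetical helper name. It relies only on the `graphql` function that Gatsby passes to `createPages`, which resolves to an object with `errors` and `data`.

```javascript
// Hypothetical helper for gatsby-node.js: fail loudly when a query returns
// errors or no data, instead of crashing later on `result.data.something`.
async function runQuery(graphql, query, variables) {
  const result = await graphql(query, variables);
  if (result.errors) {
    throw new Error(`GraphQL query failed: ${JSON.stringify(result.errors)}`);
  }
  if (!result.data) {
    throw new Error("GraphQL query returned no data");
  }
  return result.data;
}

// Usage inside createPages (sketch):
//   const data = await runQuery(graphql, `{ ... }`);
```

This does not fix the underlying cause, but it turns a null dereference into an error message that names the failing query result.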

@pvdz
Contributor Author

pvdz commented Nov 28, 2019

@rjyo that doesn't sound good. Can you open a new issue for this (if you haven't already)? And try it on 2.18.5? It contains a fix that may already have solved your problem.

@rjyo

rjyo commented Nov 28, 2019

@pvdz Thanks! I just tried 2.18.5 and the first attempt went well. The build time is quite similar to that of 2.17.x: createPages takes less time, and the time spent on update schema is back.

I'll let it run for some more and let you know the results.

Thanks again!

@pvdz
Contributor Author

pvdz commented Nov 28, 2019

Glad to hear that :) I'm working on keeping better tabs on scaling performance regressions. Please do feel free to ping when you see something regress unexpectedly. That goes for anyone.

@pvdz
Contributor Author

pvdz commented Nov 28, 2019

@eads good news! If you weren't using the CI=true flag yet, you're going to get an even better build time :D If you are using it already, well, good :) I'm changing the logger which drops the govbook build time from 210s to 140s for me locally :D ( #19866 )

For anyone else: this PR affects the progress bar, so if you were testing large sites with default settings, you should get a perf win as well.

Note that if you're building in CI then setting CI=true is a good idea. It'll reduce log spam. After the aforementioned PR gets merged it won't matter much anymore in terms of Gatsby perf.

@prashant1k99

@rjyo Can you share the URL of the live site? I am really curious about the website...

@pvdz
Contributor Author

pvdz commented Dec 2, 2019

@prashant1k99 that sounds like #5002 :)

@muescha
Contributor

muescha commented Dec 6, 2019

@pvdz Look at #9083 there are also 2 users with large pages:

@crock
Contributor

crock commented Dec 11, 2019

I have a Gatsby site that's currently not live as I'm still trying to work out if Gatsby is gonna work out because I have 200k+ rows in a MySQL database and each row would be a single page.

Is this a site you would want to use? It's relatively simple. It is a Twitch.tv clip aggregator that just embeds an iframe on each page along with a comment system.

@pvdz
Contributor Author

pvdz commented Mar 26, 2020

@giupas @mikaelmoller Hey, thanks for your messages. Sorry for taking so long to respond, it's been a little weird the past two weeks and some github notifications slipped through.

@giupas this is more a question for Cloud or Builds. Somebody will reach out in private about this, I think we can make this work! :)

@mikaelmoller I can triage it. First, what I need is a build output, so I can see which parts require the most time. Then an example of the gatsby-node and a template, to see what kind of queries you're running and how you're passing on data. What kind of site is it? Markdown, MDX, something else? Have you tried the usual suspects? Things like adding a GraphQL schema to prevent type inference, putting as little data in the context as possible, precomputing images, etc.? Best would be if I can just look at, or even locally build, the site.

@pvdz
Contributor Author

pvdz commented Mar 30, 2020

[email protected] contains #22574 which should improve performance for sites with many nodes that use queries containing multiple eq filters.

Before, this optimization was only applied to queries with a single eq filter. I'm in the process of also adding support for other operators.

@gerardoboss

I have a 57K-record database in MySQL, but I only manage to create 22k pages. I tried giving Gatsby more memory, but the result is always the same. Do you think MySQL has a limit on the number of rows it returns?

@xmflsct

xmflsct commented Apr 24, 2020

@pvdz I have a site with far fewer pages using gatsby-source-contentful, but it already crashes Zeit/Vercel. Maybe you want to take a look? #23463

@crock
Contributor

crock commented Apr 24, 2020

> I have a 57K-record database in MySQL, but I only manage to create 22k pages. I tried giving Gatsby more memory, but the result is always the same. Do you think MySQL has a limit on the number of rows it returns?

@gerardoboss I've had issues with gatsby-source-mysql in the past when dealing with very large datasets. It times out after a while. The best option is to write a custom source plugin and break up the SQL queries into smaller ones.
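
Breaking a large query into smaller ones, as suggested above, could be sketched like this. It is an illustration only: `runSql(sql, params)` is a hypothetical function that returns a Promise of rows (for example a thin wrapper around whichever MySQL client is in use), and the `pages` table and numeric `id` column are placeholders.

```javascript
// Sketch: page through a large table in fixed-size chunks.
// `runSql`, the `pages` table, and the `id` column are all hypothetical.
// Filtering on the last seen id keeps every query small and avoids the
// growing cost of large OFFSET values.
async function* fetchInBatches(runSql, batchSize = 5000) {
  let lastId = 0;
  for (;;) {
    const rows = await runSql(
      "SELECT * FROM pages WHERE id > ? ORDER BY id LIMIT ?",
      [lastId, batchSize]
    );
    if (rows.length === 0) return; // no more rows: stop iterating
    yield rows;
    lastId = rows[rows.length - 1].id;
  }
}
```

In a custom source plugin, `sourceNodes` could consume this with `for await (const rows of fetchInBatches(runSql))` and create one node per row, so no single query has to return the whole table.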

@gerardoboss

@crock thank you so much. I'm thinking of maybe going with a CSV. I tried breaking it into queries of 10K records, but the result is exactly the same. I don't know where to look for the problem; there is no error or log that tells me what is wrong, or whether it was an error or a timeout.

So I will try a CSV. I probably need to convert it to JSON or something; I need to check.

Thank you so much. I'll update if I am able to do it.

@pvdz
Contributor Author

pvdz commented Apr 28, 2020

@gerardoboss have you tried giving the Node.js process more memory? You can do something like node --max_old_space_size=4000 node_modules/.bin/gatsby build to bump the memory available to Node.js, which you'll need to do for larger sites. How much you need really depends on your setup and is different for every site. Generally for 50k-page sites I'd expect 2 GB to 4 GB to be enough. If you have a public repo I can check out, I can take a look.

@xmflsct I see you were able to resolve it, great! :) Fwiw, the contentful plugin adds a lot of internal nodes (the core unit of information inside Gatsby) which is resulting in scaling problems. I've seen sites with 15k pages rack up over a million internal nodes because it was creating a node for each piece of text in Contentful. I have no concrete way forward here, but that's been my observation.
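
To confirm that a flag like `--max_old_space_size` from the reply above actually reached the Node.js process, a small script along these lines can help. It is a sketch: the `heapLimitMb` helper and the file name in the usage note are hypothetical, while `v8.getHeapStatistics()` is part of Node.js itself.

```javascript
// Prints the V8 heap size limit of the current Node.js process in megabytes.
const v8 = require("v8");

function heapLimitMb() {
  // heap_size_limit is reported in bytes.
  return Math.round(v8.getHeapStatistics().heap_size_limit / (1024 * 1024));
}

console.log(`heap limit: ${heapLimitMb()} MB`);
```

Saved as, say, check-heap.js and run with `node --max_old_space_size=4000 check-heap.js`, the printed limit should be in the neighborhood of the value passed, typically a little above it.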

@gerardojaras

gerardojaras commented Apr 28, 2020

@pvdz it worked flawlessly! Thanks a lot!


@muescha
Contributor

muescha commented Apr 30, 2020

Should a note about max_old_space_size be added to the troubleshooting page https://www.gatsbyjs.org/docs/troubleshooting-common-errors/?

@pvdz
Contributor Author

pvdz commented Jun 1, 2020

Going to close this issue. Thanks to everyone who participated. Your contributions have made a great impact on the perf of Gatsby :D

Feel free to keep posting large sites (public repo, something I can build locally). The ones so far serve as excellent benchmarks.

At this point my definition of a large site is one with 100k to 1m pages, although it's more accurate to speak in terms of internal node count, which is around 1 million. You can see the node counts by running gatsby build --verbose. The node counts will be printed during bootstrap (page nodes are printed separately shortly after). A site with 1 million nodes builds in roughly 20 to 60 minutes, depending on sourcing, plugins, and type of website.

So a large site will have a million+ nodes internally and I'm still working on raising that ceiling :)

Be well. Reach out if you need help.

@pvdz pvdz closed this as completed Jun 1, 2020
@LekoArts LekoArts removed the not stale label Jul 3, 2020
@daiky00

daiky00 commented Jul 26, 2020

@pvdz I need help with my build times. My site is slow to build, and it only has about 1,200 pages, but it does contain around 12k images. Can you please help me?

@KyleAMathews
Contributor

@daiky00 have you tried Gatsby Cloud btw? It speeds up processing large numbers of images a lot by parallelization across cloud functions and better caching between builds

@daiky00

daiky00 commented Jul 26, 2020

@KyleAMathews I am using Netlify with their parallel image processing plugin with Google Cloud, but the build is slow. Can you help me improve it or take a look at it? The repo is private, though, so I will need to give you access. I am also using their incremental build solution, which is great, but my problem is with the first build.

@KyleAMathews
Contributor

@daiky00 you'd need to ask Netlify for help then as the constraint on build speed would be their plugin. You should also try Gatsby Cloud to compare the experience.

@daiky00

daiky00 commented Jul 26, 2020

@KyleAMathews I would have tried it if it weren't so expensive; $99 a month is too much. And the build is not slow because of Netlify: when I build locally, it's the same. I want to speed up the build time. Can you help me?

@KyleAMathews
Contributor

Yeah happy to take a look

@daiky00

daiky00 commented Jul 26, 2020

@KyleAMathews I will give you access to repo and you tell me what you see wrong

@daiky00

daiky00 commented Jul 26, 2020

ok @KyleAMathews I invited you

@daiky00

daiky00 commented Jul 26, 2020

@KyleAMathews also use the search-page branch for latest code

@KyleAMathews
Contributor

@daiky00 got it running locally and built it twice — the first run took 8:12 & the second run took 1:26 as the image generation was cached the second run. Most of the 1:26 is now spent in refetching the data & creating pages. What kind of build speeds are you seeing?

I totally hear you that $99/month is too much — we're actually launching a much cheaper price plan soon that gives you inc builds (which is even faster than the 1:26) — email me @ [email protected] if you'd like early access.

@pvdz
Contributor Author

pvdz commented Jul 29, 2020

Cool :)

@daiky00

daiky00 commented Jul 29, 2020

@KyleAMathews not right now I need to deploy this site ASAP to production. But I will contact you when I do 😃👍 and thanks

@ramkumarvs

@pvdz Let me know if you are still looking for sites with larger pages, I am trying to build one which might easily hit 100k pages if not more. https://bettercapital.us. The repo is not public, because I have a lot of IP in it, but happy to share the site

@pvdz
Contributor Author

pvdz commented Oct 4, 2021

Nope, cheers

@gatsbyjs gatsbyjs locked as resolved and limited conversation to collaborators Oct 8, 2021