[Request] Real-world Gatsby sites (50k+ pages) #19512
Hi there! Yes, I'd very much like to help. My biggest Gatsby site has ~24k pages, but I figure it's still a pretty decent one to take a look at. You can see it live at https://govbook.chicagoreporter.com/en/ and the code is open source at https://github.com/thechicagoreporter/govbook. It runs on Gatsby v2.

No secrets required, and we track the source data in the repo for now, so you shouldn't even need to pull down fresh data. It does depend on having SQLite installed; I don't think anything else is required other than the standard Gatsby dependencies. Also tagging in @artfulaction, who is contracted to work on the project through at least the end of the year.

I've had a helluva time getting it to build on Netlify and AWS Amplify, so that's been a persistent issue. Thus far, to develop the site locally, I just limit the size of the query manually, which isn't ideal either.

The database that drives it is about 8k rows, and there's a row per page. How is it over 24k pages, then? Because there's a Spanish version, an English version, and the redirect page. Any language we add will add another 8k pages to the build, and we're very much hoping to get a few more languages (especially Mandarin and Polish) into the site in the next year or so. Longer term, we hope to bake out VCF files for each of the contacts in the database, so that will add tens of thousands of additional files as well and should represent an interesting use case for the Gatsby toolchain.

Thanks for doing this 👏
@eads this is a great example of what I'm looking for. Thank you :)
Hi, I don't quite fit your requirements, but I do have a Gatsby build that takes upwards of 40 minutes on my local machine, crashes Zeit, and makes Netlify choke. Both also choke when I try to upload the resulting static package of 18,000+ files. It's been fun :D Right now Gatsby's cloud service comes the closest to working.

Here's the repo and specific branch: https://github.com/Swizec/swizec-blog/tree/many-posts

It's about 1,400 pages in all, but the image and embed processing kills everything. I even have to increase Node's heap size to make the build survive. You can't see it live anywhere because I'm trying to avoid having to set up my own VPS + CDN and such. That was one of my original motivations for moving to Gatsby in the first place – hoping for an easy way to host and deploy with modern tools.
@Swizec thanks! Your site may not have as many pages, but those images sure keep the cores spinning. Not sure how much we can improve on that since image processing is simply expensive. However, we do see some problems with the social plugin, and some room for improvement on fetching external resources. Thank you :)
cc @brod-ie @ashtonsix You've mentioned in other issues that you have mega big sites. Any chance we could get a slice of that for benchmarking? It could ultimately help your site as well.
The particular search here started with `pages.length == 25599` and `matchPathPages.length == 2` but because `matchPathPages` was being pushed onto inside the loop, the array grew and grew. Ultimately the inner `find` was calling its callback 72 million (!) times, causing massive build delays (this was taking 2/3rds of the total build time for this site). In contrast, this `forEach` took 390s (seconds!) before and now takes 0.47s on my machine. This problem is caused by inefficient routing that is sometimes necessary for proper 404 handling. This becomes a big problem as the site grows and the match-path.json grows. There are some user-land improvements we can suggest to work around other problems, but at least this one we can resolve. Thanks to @eads for offering a real-world site in #19512 so we could profile this, and @pieh for helping with the debugging.
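To make the described fix concrete, here is a minimal sketch of the quadratic pattern and the shape of the remedy. The `pages`/`matchPathPages` names mirror the description above, but the actual Gatsby internals differ; treat this as an illustration only.

```js
// Before (quadratic): `find` rescans matchPathPages on every iteration,
// and the array keeps growing inside the loop, so the inner callback can
// end up running tens of millions of times on a large site.
pages.forEach(page => {
  const alreadyTracked = matchPathPages.find(p => p.path === page.path);
  if (!alreadyTracked && page.matchPath) {
    matchPathPages.push({ path: page.path, matchPath: page.matchPath });
  }
});

// After (linear): remember seen paths in a Set so each lookup is O(1).
const seenPaths = new Set(matchPathPages.map(p => p.path));
pages.forEach(page => {
  if (!seenPaths.has(page.path) && page.matchPath) {
    seenPaths.add(page.path);
    matchPathPages.push({ path: page.path, matchPath: page.matchPath });
  }
});
```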
Hi @pvdz, Gatsby is really nice and I would like to provide an example here. It has around 48K pages with 200K+ records and growing. With the queries optimized, it builds in under 5 minutes with just

Everything went smoothly until this morning, when we upgraded to 2.18+ and the build speed dropped dramatically when running batched GraphQL in

Thank you all for the great Gatsby!
Hi @pvdz and @sidharthachatterjee, coming here from #19718. Glad to help on improving Gatsby performance! Our case scenario:

This is the build log we get:

We would like to improve the following:

All in all, we would like some kind of flag to make Gatsby work as a "fully static site generator". I mean:

I know this could sound stupid: "why turn Gatsby into a traditional SSG like Hugo or Jekyll?". Well, apart from solving our scaling issues with AMP, I can't imagine working without React components, even if they are only used to generate static HTML without any further JS interactivity. Hugo and Jekyll are fine, but React's simplicity and working with components are key for us (and for a lot of people, I think). I can't publicly share any further detail here, but I'll reach out to you by email with more details. Thanks!
I had a huge problem with scalability with Gatsby earlier (issue: #17233) and had to switch to Next.js because of it. Happy to see that the Gatsby team is prioritizing scalability 👍
@rjyo the regression came with the shadowing feature that landed a few days ago. We're looking into the regression and how best to mitigate it. I don't suppose I could build your site myself for benchmarking purposes? :) Thanks for the feedback!

@asilgag we kind of need the page-data.json per page, if nothing else, for later parallelization. Each page becomes an individual job, and that way we would be able to spread the load across multiple cores, something we can't do just yet. We should be able to improve the situation, though. And if you don't save page-data.json to disk, you'd have to retain it in memory, which certainly does not scale for most people (although some can certainly just throw money at it). I will take your suggestions into consideration when contemplating next steps in scaling perf and get back to you on them. Thank you!
@pvdz I just upgraded to 2.18.4 and the performance regression is gone! The

Thanks for your information!
@ganapativs Sorry to hear that! I am definitely interested in your case and will be looking into it, regardless. Thanks for the test case :)
@pvdz After running on 2.18.4 with dozens of hourly builds on CI, around 50% of the builds failed on

where

Redoing the job will, again, have about a 50% success rate. Hope you guys can find the problem. Please contact me directly if there is any debug info I can provide. Thanks!
@rjyo that doesn't sound good. Can you open a new issue for this (if you haven't already done so)? And try it on 2.18.5? It may contain a fix that already addresses your problem.
@pvdz Thanks! I just tried 2.18.5 and the first attempt went well. The build time is quite similar to that of 2.17.x. Less time on

I'll let it run for some more builds and let you know the results. Thanks again!
Glad to hear that :) I'm working on keeping better tabs on scaling performance regressions. Please do feel free to ping when you see something regress unexpectedly. That goes for anyone.
@eads good news! If you weren't using the

For anyone else: this PR affects the progress bar, so if you were testing large sites with default settings, you should get a perf win as well. Note that if you're building in a CI then setting
@rjyo Can you share the running site URL? I am really curious about the website...
@prashant1k99 that sounds like #5002 :)
@pvdz Look at #9083; there are also 2 users with large page counts:
I have a Gatsby site that's currently not live, as I'm still trying to figure out if Gatsby is gonna work out, because I have 200k+ rows in a MySQL database and each row would be a single page. Is this a site you would want to use? It's relatively simple: a Twitch.tv clip aggregator that just embeds an iframe on each page along with a comment system.
@giupas @mikaelmoller Hey, thanks for your messages. Sorry for taking so long to respond; it's been a little weird the past two weeks and some GitHub notifications slipped through.

@giupas this is more a question for Cloud or Builds. Somebody will reach out in private about this; I think we can make this work! :)

@mikaelmoller I can triage it. First, what I need is a build output, so I can see which parts require the most time. Then an example of the gatsby-node and a template, to see what kind of queries you're running and how you're passing on data. What kind of site is it? Markdown, MDX, something else? Have you tried the usual suspects? Things like adding a GraphQL schema to prevent type inference, putting as little data in the
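To illustrate the first of those suspects (an explicit GraphQL schema so Gatsby can skip type inference), here is a hedged sketch using Gatsby's schema customization API. The `MarkdownPost` type and its fields are invented for the example; a real site would declare the types it actually queries.

```js
// gatsby-node.js — declaring the node type up front (with @dontInfer)
// spares Gatsby from inferring the schema by inspecting every node,
// which gets expensive on sites with hundreds of thousands of nodes.
exports.createSchemaCustomization = ({ actions }) => {
  const { createTypes } = actions;
  createTypes(`
    type MarkdownPost implements Node @dontInfer {
      title: String!
      slug: String!
      date: Date @dateformat
      body: String
    }
  `);
};
```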
[email protected] contains #22574, which should improve performance for sites with many nodes that use queries containing multiple

Before this optimization was only applied to queries with single
I have a 57K-row database in MySQL, but I only manage to create 22k pages. I tried giving Gatsby more memory, but the result is always the same. Do you think there is a limit with MySQL for returning rows?
@gerardoboss I've had issues with gatsby-source-mysql in the past when dealing with very large datasets. It times out after a while. The best option is to write a custom source plugin and break up the SQL queries into smaller ones.
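A minimal sketch of what such a batched source plugin could look like, assuming a hypothetical `clips` table and the `mysql2` client; the connection details, column names, and batch size are placeholders, not anything from the original thread.

```js
// gatsby-node.js of a hypothetical local source plugin: pull rows from
// MySQL in small batches instead of one enormous query that may time out.
const mysql = require("mysql2/promise");

exports.sourceNodes = async ({ actions, createNodeId, createContentDigest }) => {
  const { createNode } = actions;
  const db = await mysql.createConnection({
    host: "localhost",
    user: "gatsby",
    database: "mydb",
  });

  const BATCH = 5000;
  for (let offset = 0; ; offset += BATCH) {
    // Fetch one small batch per round trip.
    const [rows] = await db.query(
      "SELECT id, title, url FROM clips ORDER BY id LIMIT ? OFFSET ?",
      [BATCH, offset]
    );
    if (rows.length === 0) break;

    for (const row of rows) {
      createNode({
        ...row,
        id: createNodeId(`clip-${row.id}`),
        parent: null,
        children: [],
        internal: {
          type: "Clip",
          contentDigest: createContentDigest(row),
        },
      });
    }
  }

  await db.end();
};
```

Keyset pagination (e.g. `WHERE id > ?` instead of `OFFSET`) would scale better on very large tables, but the batching idea is the same.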
@crock thank you so much. I'm thinking of maybe going with a CSV. I tried breaking it up into queries of 10K records, but the result is exactly the same, and I don't know where to look for the problem; there is no error or log that tells me what is wrong, whether it was an error or a timeout. So I will try a CSV; I'll probably need to convert it to JSON or something, I need to check. Thank you so much. I'll update if I am able to do it.
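For the CSV route, one commonly used setup is gatsby-source-filesystem plus gatsby-transformer-csv, which turns each CSV row into a queryable node without a manual JSON conversion step. The `data/` directory is an assumption about where the exported CSV would live.

```js
// gatsby-config.js
module.exports = {
  plugins: [
    {
      resolve: "gatsby-source-filesystem",
      options: {
        name: "data",
        path: `${__dirname}/data/`, // assumed location of the exported CSV
      },
    },
    // Creates one node per CSV row, e.g. contacts.csv becomes allContactsCsv
    "gatsby-transformer-csv",
  ],
};
```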
@gerardoboss have you tried giving the Node.js process more memory? You can do something like

@xmflsct I see you were able to resolve it, great! :) FWIW, the Contentful plugin adds a lot of internal nodes (the core unit of information inside Gatsby), which results in scaling problems. I've seen sites with 15k pages rack up over a million internal nodes because it was creating a node for each piece of text in Contentful. I have no concrete way forward here, but that's been my observation.
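For reference, the usual way to raise the Node heap limit for a Gatsby build is Node's `--max-old-space-size` flag, e.g. `NODE_OPTIONS="--max-old-space-size=8192" gatsby build`; the 8 GB value is only an example, not necessarily what was suggested above.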
@pvdz it worked flawlessly! Thanks a lot!
A note should be added to the troubleshooting page https://www.gatsbyjs.org/docs/troubleshooting-common-errors/ about
Going to close this issue. Thanks to everyone who participated. Your contributions have made a great impact on the perf of Gatsby :D

Feel free to keep posting large sites (public repo, something I can build locally). The ones so far serve as excellent benchmarks. At this point my definition of a large site is 100k to 1m pages, although it's more accurate to speak in terms of internal node count, which is around 1 million. You can see the node counts by running

So a large site will have a million+ nodes internally, and I'm still working on raising that ceiling :)

Be well. Reach out if you need help.
@pvdz I need help with my build times. My site is slow and it just has like 1,200 pages, but it does contain around 12k images. Can you please help me?
@daiky00 have you tried Gatsby Cloud, btw? It speeds up processing large numbers of images a lot through parallelization across cloud functions and better caching between builds.
@KyleAMathews I am using Netlify with their parallel image processing plugin with Google Cloud, but the build is slow. Can you help me improve it or take a look at it? The repo is private, though, so I will need to give you access. I am also using their incremental build solution, which is great, but my problem is with the first build.
@daiky00 you'd need to ask Netlify for help then, as the constraint on build speed would be their plugin. You should also try Gatsby Cloud to compare the experience.
@KyleAMathews I would have tried it if it weren't so expensive; $99 a month is too much. And the build is not slow because of Netlify; when I run it locally it is the same. I want to speed up the build time. Can you help me?
Yeah, happy to take a look
@KyleAMathews I will give you access to the repo and you can tell me what you see wrong
ok @KyleAMathews I invited you
@KyleAMathews also use the
@daiky00 got it running locally and built it twice: the first run took 8:12 and the second run took 1:26, as the image generation was cached the second time. Most of the 1:26 is now spent refetching the data and creating pages. What kind of build speeds are you seeing? I totally hear you that $99/month is too much; we're actually launching a much cheaper price plan soon that gives you incremental builds (which are even faster than the 1:26). Email me @ [email protected] if you'd like early access.
Cool :)
@KyleAMathews not right now, I need to deploy this site to production ASAP. But I will contact you when I do 😃👍 and thanks
@pvdz Let me know if you are still looking for sites with large page counts; I am trying to build one which might easily hit 100k pages, if not more: https://bettercapital.us. The repo is not public because I have a lot of IP in it, but I'm happy to share the site.
Nope, cheers
Hello kind Gatsby user 👋
My name is Peter, and I’m a Gatsby employee focused on performance and scalability.
We at Gatsby are always looking for ways to improve the performance of building out your Gatsby applications: making Gatsby not only scale to hundreds of thousands of pages (or more!) but also making the build process as lightning quick as the resulting Gatsby application.
To best support this endeavor, we need your help! We have benchmarks in the repo, but they tend to be quite contrived and not necessarily indicative of real-world usage of Gatsby.
Specifically, we're looking for sites that we can build ourselves with `gatsby build` (open source would be ideal!).

For this first batch, we'll be using these real-world applications to identify low-hanging fruit as it relates to performance so we can make the Gatsby build process ever faster and ever more scalable.
Does this sound like you? Please share a link to your application's source code below or e-mail any necessary details to peter at gatsbyjs dot com. We appreciate you 💜
Thanks! Onwards and upwards 📈🚀