Skip to content

bundler: optimize gemfile parsing#4059

Merged
Nishnha merged 1 commit intodependabot:mainfrom
skipkayhil:gemfile-parse-optimization
Sep 22, 2022
Merged

bundler: optimize gemfile parsing#4059
Nishnha merged 1 commit intodependabot:mainfrom
skipkayhil:gemfile-parse-optimization

Conversation

@skipkayhil
Copy link
Copy Markdown
Contributor

Summary

After profiling FileParser#parse for bundler, I found that the ruby
Parser was extremely hot. The same Gemfile was being parsed twice for
each dependency, when it only needed to be parsed a single time per
file.

The performance improvement depends on the size of the Gemfile being
parsed, but the following example shows the speedup for the
discourse/discourse repo.

Before:

Warming up --------------------------------------
               parse     1.000  i/100ms
Calculating -------------------------------------
               parse      0.121  (± 0.0%) i/s -   1.000  in   8.262804s

After:

Warming up --------------------------------------
               parse     1.000  i/100ms
Calculating -------------------------------------
               parse      6.319  (±15.8%) i/s -  32.000  in   5.099269s

Other Information

I wasn't sure what the policy around breaking API changes was, especially for a relatively internal class like GemfileDeclarationFinder. Any feedback would be appreciated, I believe I can accomplish something similar without breaking the method signatures if necessary.

Benchmark Script

# frozen_string_literal: true

require "bundler/inline"

gemfile(true) do
  source "https://rubygems.org"

  gem "dependabot-common", path: "./common"
  gem "dependabot-bundler", path: "./bundler"
  gem "benchmark-ips"
  # This was helping some DNS issues I was having but is optional
  gem "resolv-replace"
end

package_manager = "bundler"
repo_name = "discourse/discourse"

credentials = [
  {
    "type" => "git_source",
    "host" => "github.com",
    "username" => "x-access-token",
    "password" => ENV["GITHUB_ACCESS_TOKEN"] # A GitHub access token with read access to public repos
  }
]

source = Dependabot::Source.new(
  provider: "github",
  repo: repo_name,
  directory: "/",
)

puts "Fetching #{package_manager} dependency files for #{repo_name}"
fetcher = Dependabot::FileFetchers.for_package_manager(package_manager).new(
  source: source,
  credentials: credentials,
)

puts "Parsing dependencies information"
parser = Dependabot::FileParsers.for_package_manager(package_manager).new(
  dependency_files: fetcher.files,
  source: source,
  credentials: credentials,
)

Benchmark.ips do |x|
  x.report("parse") { parser.parse }
  x.compare!
end

@skipkayhil skipkayhil requested a review from a team as a code owner July 18, 2021 21:21
@Nishnha
Copy link
Copy Markdown
Member

Nishnha commented Jul 20, 2021

Hello @skipkayhil,

Thank you for this contribution!
Indeed, many of Dependabot Core's FileParsers read the same dependencies multiple times.

I wasn't sure what the policy around breaking API changes was, especially for a relatively internal class like GemfileDeclarationFinder

Generally we try to avoid breaking the existing API. I've made a couple of comments in the code pointing out breaking changes (e.g. removing the @dependency variable).

Have you been able to run the test suite to make sure these changes pass? There are instruction in the README on how to run the test suite.
Once you have passing tests we can run your PR against the CI 😄

These speedups are very welcome since they have an even bigger effect at scale! Let's try to find a solution that doesn't break the existing API together.

P.S., the Dependabot team is currently short-handed so we may take a while to respond.

@feelepxyz
Copy link
Copy Markdown
Contributor

Thanks! Following on from @Nishnha comment. I don't this change would break any APIs as we don't expose this module externally and don't think anyone is relying on it directly.

I also think the interface change makes sense given to goal of only parsing a gemfile once, as there are usually multiple dependencies per file 👍

I've kicked off CI, new contributions need the actions run approved.

@jurre
Copy link
Copy Markdown
Member

jurre commented Jul 21, 2021

Thanks for looking into this @skipkayhil, I think it'd be good to instrument this in a full dry-run rather than in isolation, as the parsing step only happens once during an entire update, so even though it's not very fast in isolation it might not make a huge difference overall. That's not to say this isn't worth doing, would just good to get a sense of the overall impact

@skipkayhil
Copy link
Copy Markdown
Contributor Author

Thanks for looking into this @skipkayhil, I think it'd be good to instrument this in a full dry-run rather than in isolation, as the parsing step only happens once during an entire update, so even though it's not very fast in isolation it might not make a huge difference overall. That's not to say this isn't worth doing, would just good to get a sense of the overall impact

That's a pretty good point, I hadn't thought about the other steps. So far I've been using dependabot as a library to gather dependency information but haven't yet used it to update things. In that scenario this change makes a pretty big impact, but after running the dry run script it looks like it doesn't end up being as much of an improvement in the whole process. I'll probably be looking into the UpdateChecker process in the future because that seems to be where the bulk of time is spent.

On my (not so powerful) laptop:

# main
$ time bin/dry-run.rb bundler discourse/discourse
...
real    43m9.203s
user    41m10.464s
sys     1m47.502s

And swapping out Benchmark.ips for Benchmark.measure in my original script:

# main
$ ruby benchmark_script.rb
14.032599   0.035577  14.623058 ( 14.636967)

# skipkayhil:gemfile-parse-optimization
$ ruby benchmark_script.rb
0.395320   0.000374   0.901548 (  0.899722)

@jurre
Copy link
Copy Markdown
Member

jurre commented Jul 22, 2021

I'll probably be looking into the UpdateChecker process in the future because that seems to be where the bulk of time is spent.

Nice 👍 I think most of the time is spent doing networking, but if we can find any optimizations there, that'd be fantastic!

@jeffwidman jeffwidman added the L: ruby:bundler RubyGems via bundler label Aug 4, 2022
@jeffwidman
Copy link
Copy Markdown
Member

@skipkayhil thanks again for putting up this PR... are you still interested in driving it to completion? Given the millions of jobs that we run, we're always interested in perf improvements. 👍

@skipkayhil
Copy link
Copy Markdown
Contributor Author

are you still interested in driving it to completion?

Would definitely still love to see this merged! I just gave it a rebase and will update if anything fails CI

After profiling FileParser#parse for bundler, I found that the ruby
Parser was extremely hot. The same Gemfile was being parsed twice for
each dependency, when it only needed to be parsed a single time per
file.

The performance improvement depends on the size of the Gemfile being
parsed, but the following example shows the speedup for the
discourse/discourse repo.

Before:
```
Parsing dependencies information
Warming up --------------------------------------
               parse     1.000  i/100ms
Calculating -------------------------------------
               parse      0.121  (± 0.0%) i/s -   1.000  in   8.262804s
```

After:
```
Warming up --------------------------------------
               parse     1.000  i/100ms
Calculating -------------------------------------
               parse      6.319  (±15.8%) i/s -  32.000  in   5.099269s
```
@Nishnha
Copy link
Copy Markdown
Member

Nishnha commented Sep 22, 2022

I had some initial concerns with merging this PR incase it changed some underlying behavior in how we parse dependencies.

Seeing this PR is passing our e2e smoke tests, which tests for consistent output, is reassuring!

Thanks for the massive speed up and benchmark script @skipkayhil - definitely going to try this out on our other ecosystems!

@Nishnha Nishnha merged commit 62c9265 into dependabot:main Sep 22, 2022
@skipkayhil skipkayhil deleted the gemfile-parse-optimization branch September 26, 2022 22:16
@pavera pavera mentioned this pull request Oct 31, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

L: ruby:bundler RubyGems via bundler

Projects

None yet

Development

Successfully merging this pull request may close these issues.

5 participants