Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Regexp compilation taking a long time #333

Closed
another-rex opened this issue Apr 5, 2023 · 4 comments · Fixed by #346
Closed

Regexp compilation taking a long time #333

another-rex opened this issue Apr 5, 2023 · 4 comments · Fixed by #346
Assignees

Comments

@another-rex
Copy link
Collaborator

It was noted that in some workloads regexp compliation is taking up to 60% of osv-scanner's runtime.

Especially this line:

re := regexp.MustCompile(matcher)

A solution would be to store the result of the compilation, rather than compiling every time.

FYI @spencerschrock

@another-rex another-rex self-assigned this Apr 5, 2023
@G-Rath
Copy link
Collaborator

G-Rath commented Apr 5, 2023

Interesting - do you have any more details you can share? I went back and forth on this for a while, including doing benchmarking especially because of this this huge regexp because that is large enough that it adds a few seconds to the test suite.

However at runtime I found there wasn't a significant difference which I assumed was due to Go internally caching some things that it didn't do when the tests were being run in parallel, whereas moving all those regexp compiles out into variables would mean their compilation cost has to be paid regardless of if the related function is actually used (so be clear: moving that big regexp out meant osv-detector was slower to start consistently, vs when it was inline and you weren't parsing requirements.txt files)

I had also trailed a JIT-based approach but again found no significant difference in my benchmarks.

@spencerschrock
Copy link
Contributor

spencerschrock commented Apr 5, 2023

So this came from profiling Scorecard's cron, which repeatedly calls osvscanner.DoScan. Selfishly, a one-time cost is better for my use, which is not a typical use case of osv-scanner.

osvscanner.scanDir 7.82 ->
lockfile.Parse 6.54s ->
(lockfile.parseYarnPackageGroup 1.64s + lockfile.ParseNpmLock 4.75s) ->
lockfile.tryExtractCommit 4.86s ->
regexp.MustCompile 4.39s

whereas moving all those regexp compiles out into variables would mean their compilation cost has to be paid regardless of if the related function is actually used

Which is a valid concern for the common case.
Would a package variable but lazy init be better? Perhaps behind a sync.Once.Do ?

@G-Rath
Copy link
Collaborator

G-Rath commented Apr 5, 2023

I'm assuming that is the total time across all calls rather than the time of a single call.

Would a package variable but lazy init be better? Perhaps behind a sync.Once.Do ?

That would be the preferable option - as I said, when I tried this I didn't see any significant difference in my benchmarking to consider it worth the readability cost, but I didn't know sync.Once.Do was a thing and that might make it a bit nicer.

(it also is entirely possible I mucked up my benchmarking 🤷)

@spencerschrock
Copy link
Contributor

spencerschrock commented Apr 5, 2023

I'm assuming that is the total time across all calls rather than the time of a single call.

Correct, the timings were aggregates over several 10 - 20 min samples ofpprof and the default (~4-5%) sample rate.
The results are dependent on the repos Scorecard analyzes; I'm guessing we tend to see more npm/yarn projects, which leads to this code being a hotspot for us.

I didn't see any significant difference in my benchmarking to consider it worth the readability cost, but I didn't know sync.Once.Do was a thing and that might make it a bit nicer.

I do think sync.Once helps with readability, but it is still an extra burden over the current code (didn't bother writing the array logic, but could be done here).

var whitelistR struct{
  rex *regexp.Regexp
  once sync.Once
}

func (f *foo) bar(ctx context.Context) error {
  whitelistR.once.Do(func() {
    whitelistR.rex = regexp.MustCompile(somePattern)
  })


  if !whitelistR.rex.Match(...) {...}
}

Adapted from https://stackoverflow.com/a/72185182

another-rex added a commit that referenced this issue Apr 17, 2023
Fixes #333 

I measured the time taken to compile the regexp at startup and it's
unnoticeable for these regexp, so I don't think we need the added
complexity with `sync.Once`.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging a pull request may close this issue.

3 participants