Using binary logs to significantly increase CodeQL analysis performance for C# #16346
The MSBuild team is also happy to support here.
Thank you for putting together the sample. I think your suggestion makes sense, I'm exploring it a bit further to understand what else would need to be changed. One point that is not immediately clear to me is how to get hold of the syntax trees produced by source generators. Is there a Roslyn API that would return these trees?
The default experience is that generated files are embedded into the PDB produced by the build. There is no official API for this but it's in a well known part of the PDB. The code for reading that is here. If the goal is to just read these files for a given compilation on demand then it's a fairly simple change to expose an API directly that lets you access them. It would be something along the lines of `List<SyntaxTree> ReadAllGeneratedFiles(CompilerCall compilerCall);`

There is one caveat for that approach though: it only works if the build is using portable PDBs. Native PDBs don't have a facility for that. Using native PDBs with source generators though is decently rare. Seems reasonable that for a CodeQL run you could ask that portable PDBs be generated.
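To make the shape of the proposal concrete, here is a minimal sketch of how an extractor might consume an API along those lines. Everything here is hypothetical: `CompilerLogReader`, `ReadAllCompilerCalls`, and `ReadAllGeneratedFiles` are illustrative names for the proposed surface area, not a shipped API.

```csharp
using System.Collections.Generic;
using Microsoft.CodeAnalysis;

// Hypothetical extractor-side helper: for every compiler call recorded
// in the log, pull the source-generated trees that were embedded in the
// portable PDB during the original build.
static IEnumerable<SyntaxTree> CollectGeneratedTrees(CompilerLogReader reader)
{
    foreach (CompilerCall compilerCall in reader.ReadAllCompilerCalls())
    {
        // Only works for portable PDBs; native PDBs have no facility
        // for embedding generated sources.
        foreach (SyntaxTree tree in reader.ReadAllGeneratedFiles(compilerCall))
        {
            yield return tree;
        }
    }
}
```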
Curious if you're asking about generated files in this context: essentially, how can

If so, the underlying library takes a couple of different approaches:
Thanks for the explanation.
Yes, this is the goal. BTW, what is the reason that Roslyn doesn't expose an API on
We're injecting the following properties into the build command:
We're adding
Okay will add an API for that
The underlying APIs for running the generators are available. Can see how compiler logs approaches that problem here. That provides the following:

```csharp
Compilation compilation = ...;
var driver = CreateGeneratorDriver();
driver.RunGeneratorsAndUpdateCompilation(compilation, out var compilation2, out var diagnostics, cancellationToken);
var generatedSyntaxTrees = compilation2.SyntaxTrees.Skip(compilation.SyntaxTrees.Count());
```

That does require the host to write a bit of code though to load the analyzers, run the driver, etc ... The reason there isn't a higher level API that just takes care of all of this for you is that it's hard to find a single solution that fits all hosts. In particular analyzer loading is very complex and as a result different hosts tend to make different decisions on how to approach it. Consider that even within the Roslyn compiler, analyzers are loaded with three different strategies depending on the scenario. The reason the NuPkg I used in the sample has easy APIs for creating
I'm not as familiar with this scenario. I attempted to recreate it with a .NET Framework ASP.NET application. Setting that property I do see it go into a

Are you all intercepting that call? Or is there a subtlety I missed where this also generates a separate csc call? There is a csc invocation but I see that with or without the
Yep, those are just a normal source generator at the end of the day and should behave like other generators in terms of finding the files.
I'm not that familiar with
@jaredpar I started experimenting with binlog support. Here's a variant that supports source generators and another variant that shamelessly takes your sample. The latter requires fewer changes in the extractor and it's more straightforward to understand. Also, I think for your use case with

This is how a CodeQL database can be created from the binlog:
I'm testing this variant on some larger repos and comparing the alerts reported by the tracing extractor to the binlog-based ones. The binlog file needs to be manually passed to codeql, which makes large scale testing difficult.

What is the long term plan for https://github.com/jaredpar/complog/? Is it going to be moved to the
Walked through both of those changes and they look good to me 😄
Very interested to hear how this works out.
At the moment it's a tool that the C# compiler team maintains. It started as a personal project, which is why it exists in my personal GitHub repo. The dotnet org already consumes the tool in a few places. But we hadn't moved it fully into the dotnet org because it wasn't on any production path yet. If you all ended up taking a dependency on this in codeql then we'd very likely move this into the dotnet org.
My assumption was wrong. Roslyn uses source generators in
@tamasvajk you're injecting
You guys should really be injecting
Is there a reason why you have to run generators again vs. just looking at the generated files from the build that created the binary log? The compiler log code makes those easy to extract, assuming portable PDB.
Curious: why do you all need these files? If you're re-running generators then the generated files will be in the
Yes, in certain cases, injecting
I continued with the variant in #16747 that also processes the output of source generators (in contrast to #16672, which doesn't). We get the generated ASTs from
I think there's a mix of contexts here. @KirillOsenkov is suggesting that we should be injecting the

In the context of binary log extraction, we don't need any of the injected properties. Actually, with the binary log based extractor, we're not even starting our tracer as there's no build command that we'd want to trace.
Is this a Teams chat that I can join? Always interested in how adding properties to the build can unexpectedly impact the results.
More recent versions of

It only works when portable PDBs are used, as native PDBs don't support this function. That is how the vast majority of compilations work though. Could possibly save a bit of time by having that as the fast path and falling back to the
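The fast-path/fallback split described above could be sketched roughly as follows. This is illustrative only: `TryReadGeneratedFilesFromPdb` and `RunGeneratorsAndCollectTrees` are hypothetical helper names, and `CompilerCall` stands in for the compiler log library's per-invocation type.

```csharp
using System.Collections.Generic;
using Microsoft.CodeAnalysis;

// Sketch of the fast path / fallback split: prefer the generated files
// already embedded in the portable PDB, and only re-run the source
// generators when that embedded data is unavailable (e.g. native PDBs).
static IReadOnlyList<SyntaxTree> GetGeneratedTrees(CompilerCall compilerCall)
{
    // Fast path: portable PDBs carry the generator output from the
    // original build, so no generators need to run.
    if (TryReadGeneratedFilesFromPdb(compilerCall, out List<SyntaxTree> fromPdb))
    {
        return fromPdb;
    }

    // Slow path: load the analyzers and run the generator driver again.
    return RunGeneratorsAndCollectTrees(compilerCall);
}
```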
@jaredpar I briefly looked into using
@jaredpar The binary log based extractor in #16747 has been merged and it's part of the
Awesome! Thanks for doing this! Going to start trying this out as soon as that release is available to us.
Hi @jaredpar, it has been a while since the last correspondence.
@tamasvajk Would it be possible to support creating the database from multiple binary log files, in cases where each project in a C# build is built and produces a binary log file individually?
Sorry, missed this question. Yes, the binary log mode helps significantly here. Still some work to get it turned on at scale in the builds.
The underlying compiler log stack can support this: just create a reader per log and grab the compilation. I would assume that should be able to plug into CodeQL just fine. If not, could add APIs that basically merge them together so they all come from a single reader. That is fairly straightforward work.
@DmitriyShepelev I've started working on this in #17955.
The nature of the C# driver means that it needs to see all csc invocations for a build. Today that is achieved by disabling shared compilation during the build. That unfortunately creates a significant performance penalty when building C# code. The larger the repo, the more significant the penalty. The rule of thumb is that for solutions of at least medium size this will cause somewhere from a 3-4x build slowdown. For larger repositories this can be even bigger.
Consider as a concrete example the dotnet/runtime repository. When their CodeQL pipeline runs and shared compilation is disabled it increases their build time by 1 hour and 38 minutes. Or alternatively building the dotnet/roslyn repository. Building that locally results in a ~4X perf penalty when shared compilation is disabled.
I understand from previous conversations that this is done because the extractor needs to see every C# compiler call in the build and potentially process it. An alternative approach is to have the build produce a binary log. That is a non-invasive change to builds and there is ample tooling for reading C# compiler calls from those once the build completes. The code for reading compiler calls is at the core very straightforward:
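As a rough illustration of that reading loop, a complog-style consumer might look like this. The `CompilerLogReader.Create`, `ReadAllCompilerCalls`, and `ReadCompilation` names are approximations of that library's surface area, not verified signatures, and the log file name is made up.

```csharp
using Basic.CompilerLog.Util; // assumed package namespace for the compiler log tooling
using Microsoft.CodeAnalysis;

// Sketch: walk every compiler invocation recorded in a binary log and
// hand the recreated compilation to the extractor.
using var reader = CompilerLogReader.Create(@"msbuild.binlog");
foreach (var compilerCall in reader.ReadAllCompilerCalls())
{
    // Recreate the compilation exactly as the original build saw it.
    Compilation compilation = reader.ReadCompilation(compilerCall);
    // ... run extraction over compilation.SyntaxTrees here ...
}
```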
I have a sample here of integrating this approach into the existing extractor tool. On my machine I was able to use this sample to run over a full build of roslyn without any obvious issues. And most importantly, no changes to how I built roslyn 😄
Using libraries like this has other advantages because you could also offload the work of creating `CSharpCompilation` objects. There is code in that library that will very faithfully recreate a `CSharpCompilation` for every `CompilerCall` instance. Furthermore, I think there are other optimization opportunities for the extractor once you are running the analysis in bulk: caching `MetadataReferences`, caching `SourceText`, etc ...

More than happy to chat about this, the different ways binary logs can be used, moving analysis off machine, etc ... My big end goal is to get back to us using shared compilation so we can keep our build times down.
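The reference-caching idea above can be as simple as keying references by file path. This is a sketch: `MetadataReference.CreateFromFile` is real Roslyn API, but the cache class around it is illustrative, not part of any existing extractor.

```csharp
using System;
using System.Collections.Concurrent;
using Microsoft.CodeAnalysis;

// Illustrative cache: when extracting many compiler calls from one log,
// most of them reference the same assemblies, so each PE file only
// needs to be loaded and parsed once.
sealed class MetadataReferenceCache
{
    private readonly ConcurrentDictionary<string, PortableExecutableReference> _cache =
        new(StringComparer.OrdinalIgnoreCase);

    public PortableExecutableReference Get(string assemblyPath) =>
        _cache.GetOrAdd(assemblyPath,
            static path => MetadataReference.CreateFromFile(path));
}
```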