Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Oheger bosch/analyzer/dependency representation #3502

Conversation

oheger-bosch
Copy link
Member

@oheger-bosch oheger-bosch commented Jan 13, 2021

This PR addresses #3321 by proposing a more efficient storage format for the dependency representation in Gradle analyzer results. The basic idea is to not repeat the dependency trees for all scopes, but construct an optimized dependency graph, in which dependencies are shared. Ideally, in this graph each dependency occurs once. However, as sometimes packages are referenced from scopes with different sets of dependencies, multiple occurrences are possible.

The most important class of this PR is the new GradleDependencyGraphBuilder class that integrates the project's dependencies into a dependency graph (represented by the DependencyGraph class). The class is currently specific to Gradle, but it should be possible to make it more generic, so that other analyzer implementations can make use of this optimized format as well.

Here are some numbers to give an impression of the size reduction of the dependency representation. The numbers are based on the Mozilla Focus Android project, which was used for testing. This project uses a large number of scopes causing the dependency graph in the old representation to become really huge:

Original representation:

Analyzer execution time: 55:55
Result file size: 1,2 GB
Time to load the result:  111539ms
In-memory representation of analyzer result 1)
byte[]:     45.883.121
String:     45.880.168
Identifier: 11.462.288
PackageRef: 11.461.897

New representation:

Analyzer execution time: 44:55 2)
Result file size: 46 MB
Time to load the result:  5509ms
In-memory representation of analyzer result 1)
byte[]:         36.154
String:         33.200
DependencyRef: 376.067 3)
DependencyIdx:  10.708
Identifier:        892
PackageRef:        501  4)

Remarks:

  1. Number of instances of the listed classes, taken from the Memory view of the IntelliJ Debugger.
  2. Not sure about the execution time, I actually expected the new approach to take slightly longer because of the additional effort to de-duplicate the dependency graph; however, as this is all done in-memory, it probably does not have a major impact. So the differences in the times could be caused by varying load of our Azure setup.
  3. DependencyRef is the counter-part of PackageRef in the new dependency graph representation. It references dependencies by a numeric index rather than an identifier.
  4. From the new dependency graph representation the set of Scopes is constructed, but also in an optimized form (shared references to packages). Therefore, the number of PackageRef instances is low.

@oheger-bosch oheger-bosch force-pushed the oheger-bosch/analyzer/dependency-representation branch 3 times, most recently from d6d38b2 to 6a1e2f0 Compare January 14, 2021 06:42
@oheger-bosch oheger-bosch marked this pull request as ready for review January 14, 2021 09:33
model/src/main/kotlin/DependencyGraph.kt Outdated Show resolved Hide resolved
model/src/main/kotlin/DependencyGraph.kt Outdated Show resolved Hide resolved
model/src/main/kotlin/DependencyGraph.kt Outdated Show resolved Hide resolved
model/src/main/kotlin/DependencyGraph.kt Outdated Show resolved Hide resolved
model/src/main/kotlin/DependencyGraph.kt Outdated Show resolved Hide resolved
model/src/test/kotlin/DependencyGraphTest.kt Outdated Show resolved Hide resolved
@oheger-bosch oheger-bosch force-pushed the oheger-bosch/analyzer/dependency-representation branch 3 times, most recently from e6bd23e to 8e6b318 Compare January 18, 2021 14:49
@oheger-bosch
Copy link
Member Author

By tweaking the representation of dependency references a bit, the size of the result file could be reduced even further to 25.2 MB.

As this approach uses numeric indices to reference dependencies, the result - at least the part with the dependency tree - is now not really readable for human beings. I am not sure whether this is a criterion or not. It would be possible, based on this strongly optimized format, to come to a representation that is still free of most redundancies, but would be more readable (something like a tree of packages that replaces duplicate sub trees by references). But this format would have to be implemented, and it would increase the result size again of course.

@oheger-bosch
Copy link
Member Author

@sschuberth, @mnonnenmacher, @fviernau: It would be great to have a discussion regarding the format of the dependency tree in the Analyzer result. Should it be as efficient as possible (using indices as dependency references) or do you prefer a more human-readable format?

What has been implemented here for Gradle can be applied to other Analyzer implementations as well. I would like to look at Maven next. I have some hope that the de-duplication of dependencies will also improve the performance of this package manager.

@sschuberth
Copy link
Member

Hi @oheger-bosch, can you please rebase this PR onto latest master so the Docker image can be built for it out of the box?

@oheger-bosch oheger-bosch force-pushed the oheger-bosch/analyzer/dependency-representation branch from 8e6b318 to 5c5aa45 Compare February 5, 2021 10:29
@oheger-bosch
Copy link
Member Author

@sschuberth, the PR has been rebased.

model/src/main/kotlin/DependencyGraph.kt Show resolved Hide resolved
model/src/main/kotlin/DependencyGraph.kt Outdated Show resolved Hide resolved
model/src/main/kotlin/DependencyGraph.kt Outdated Show resolved Hide resolved
model/src/test/kotlin/DependencyGraphTest.kt Outdated Show resolved Hide resolved
model/src/test/kotlin/DependencyGraphTest.kt Outdated Show resolved Hide resolved
model/src/main/kotlin/Project.kt Outdated Show resolved Hide resolved
model/src/main/kotlin/Project.kt Outdated Show resolved Hide resolved
model/src/test/kotlin/ProjectTest.kt Outdated Show resolved Hide resolved
model/src/test/kotlin/ProjectTest.kt Outdated Show resolved Hide resolved
model/src/test/kotlin/ProjectTest.kt Outdated Show resolved Hide resolved
@oheger-bosch oheger-bosch force-pushed the oheger-bosch/analyzer/dependency-representation branch from 5c5aa45 to a96a410 Compare February 9, 2021 13:31
This class implements an alternative layout to store information about
the scopes of a project and the dependencies they contain. This layout
avoids the duplication of dependency trees over different scopes by
introducing a minimized graph of shared dependency nodes. Especially
in large projects with many scopes, this storage format reduces the
size of analyzer results significantly.

Signed-off-by: Oliver Heger <[email protected]>
This class is going to replace parts of the dependency analysis
implementation of the Gradle analyzer. It constructs a representation
of dependencies, which is more efficient than the original
representation and can hence deal with projects with many scopes and
large dependency sets.

Signed-off-by: Oliver Heger <[email protected]>
The project's dependencies can now be represented either by the
classic set of scopes or a DependencyGraph.

If a DependencyGraph is used, generate the scopes representation on
demand. Use the original 'scopes' property for this purpose, so that
read access to this information does not need to be changed; however,
the construction to Project instances now has to be adapted, which
affects also the analyzer implementations.

Signed-off-by: Oliver Heger <[email protected]>
@oheger-bosch oheger-bosch force-pushed the oheger-bosch/analyzer/dependency-representation branch from a96a410 to b4d74d7 Compare February 9, 2021 14:50
Use the GradleDependencyGraphBuilder class to create a significantly
smaller result that avoids redundancies in the storage format of
dependencies. Adapt functional tests to expect the new output format.

Signed-off-by: Oliver Heger <[email protected]>
@oheger-bosch oheger-bosch force-pushed the oheger-bosch/analyzer/dependency-representation branch from b4d74d7 to 4ac7626 Compare February 9, 2021 14:55
@sschuberth
Copy link
Member

@mnonnenmacher @fviernau this is looking good to me now, and I'd like to get this merged very soon. While this is not a breaking change in the strict sense, it changes the default output format for new Gradle project analyses. Any objections that I merge this?

@mnonnenmacher
Copy link
Member

@mnonnenmacher @fviernau this is looking good to me now, and I'd like to get this merged very soon. While this is not a breaking change in the strict sense, it changes the default output format for new Gradle project analyses. Any objections that I merge this?

Please let me have another look before merging it, I'm currently busy with testing #3564. In the meantime I have a question, do you plan to implement this for the other package managers anytime soon? Archiving two different formats in parallel makes any analysis of the analyzer results very difficult.

@sschuberth
Copy link
Member

sschuberth commented Feb 9, 2021

do you plan to implement this for the other package managers anytime soon?

No, probably not soon. We're planning to do this on-demand, if and and when we run into memory issues with other package managers. But that's purely out of a lack of time and other priorities; otherwise I'd of course like to migrate consistently to the new format as soon as possible.

Archiving two different formats in parallel makes any analysis of the analyzer results very difficult.

Are you referring to using UploadResultToPostgresCommand to put analyzer results into Postgres for gathering statistics? If so, I believe we could rather easily add an optional on-the-fly conversion from new to old format to UploadResultToPostgresCommand.

@mnonnenmacher
Copy link
Member

do you plan to implement this for the other package managers anytime soon?

No, probably not soon. We're planning to do this on-demand, if and and when we run into memory issues with other package managers. But that's purely out of a lack of time and other priorities; otherwise I'd of course like to migrate consistently to the new format as soon as possible.

This would also be a chance to align the implementation of the different package managers a bit if the GradleDependencyGraphBuilder is made generic.

Archiving two different formats in parallel makes any analysis of the analyzer results very difficult.

Are you referring to using UploadResultToPostgresCommand to put analyzer results into Postgres for gathering statistics? If so, I believe we could rather easily add an optional on-the-fly conversion from new to old format to UploadResultToPostgresCommand.

Yes, good idea, I have already prepared that change and will make a PR once this is merged.

@sschuberth
Copy link
Member

This would also be a chance to align the implementation of the different package managers a bit if the GradleDependencyGraphBuilder is made generic.

Indeed, we should take that opportunity to work a bit towards a more DRY implementation of the package managers which @haikoschol was mentioning a long time ago (I thought we had an issue for that, but I cannot find it anymore).

@sschuberth sschuberth merged commit c8d69ca into oss-review-toolkit:master Feb 10, 2021
@sschuberth sschuberth deleted the oheger-bosch/analyzer/dependency-representation branch February 10, 2021 06:38
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants