@JiaLiangC
Contributor

Description of PR

Why compilation is slow: The Hadoop project has many modules, and because they cannot be compiled in parallel, the build is slow. For instance, a first compilation of Hadoop can take several hours, and even with Maven dependencies already cached locally, a subsequent compilation can still take close to 40 minutes.

How to solve it: Use mvn dependency:tree and maven-to-plantuml to investigate the dependency issues that prevent parallel compilation.

Investigate the dependencies between project modules.
Analyze the dependencies in multi-module Maven projects.
Download maven-to-plantuml:

wget https://github.com/phxql/maven-to-plantuml/releases/download/v1.0/maven-to-plantuml-1.0.jar

Generate a dependency tree:

mvn dependency:tree > dep.txt

Generate a UML diagram from the dependency tree:

java -jar maven-to-plantuml-1.0.jar --input dep.txt --output dep.puml

For more information, visit: maven-to-plantuml GitHub repository.
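
As a complement to the UML diagram, the module-to-module edges can also be pulled straight out of dep.txt. The following is only a rough sketch: it assumes the usual `mvn dependency:tree` text layout (a root line `groupId:artifactId:packaging:version` per module, with direct dependencies prefixed by `+- ` or `\- `), and the function name `direct_module_edges` and the `org.example` sample data are made up for illustration.

```python
import re
from collections import defaultdict

# Root line of a module's tree, e.g. "[INFO] org.example:demo-common:jar:1.0"
ROOT = re.compile(r"^\[INFO\] ([\w.\-]+):([\w.\-]+):[\w.\-]+:[\w.\-]+$")
# Direct (depth-1) dependency line, e.g. "[INFO] +- org.example:demo-api:jar:1.0:compile"
DIRECT = re.compile(r"^\[INFO\] [+\\]- ([\w.\-]+):([\w.\-]+):")


def direct_module_edges(tree_text, group_id):
    """Map each module in group_id to its direct dependencies in the same group."""
    edges = defaultdict(set)
    current = None
    for raw in tree_text.splitlines():
        line = raw.rstrip()
        root = ROOT.match(line)
        if root:
            # A new module's tree starts here; ignore modules of other groups.
            current = root.group(2) if root.group(1) == group_id else None
            if current is not None:
                edges[current]  # register the module even if it has no edges
            continue
        dep = DIRECT.match(line)
        if dep and current is not None and dep.group(1) == group_id:
            edges[current].add(dep.group(2))
    return {module: sorted(deps) for module, deps in edges.items()}


# Hypothetical sample in the assumed layout (not real Hadoop output).
SAMPLE = r"""[INFO] org.example:demo-api:jar:1.0
[INFO] \- junit:junit:jar:4.13.2:test
[INFO] org.example:demo-common:jar:1.0
[INFO] +- org.example:demo-api:jar:1.0:compile
[INFO] |  \- com.google.guava:guava:jar:27.0-jre:compile
[INFO] \- junit:junit:jar:4.13.2:test
"""

if __name__ == "__main__":
    print(direct_module_edges(SAMPLE, "org.example"))
```

Only direct dependencies inside the project's own groupId are kept, since those are the edges that constrain the build order between modules.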

Hadoop Parallel Compilation Submission Logic

Reasons for Parallel Compilation Failure
In sequential compilation, modules are compiled one by one in module order, so no errors occur.
In parallel compilation, modules are compiled concurrently where possible, and the order is determined by the inter-module dependencies. If Module A depends on Module B, Module B is compiled before Module A, so the compilation order always follows the declared dependencies.
When Hadoop is compiled in parallel, for example when compiling hadoop-yarn-project, the declared dependencies between modules are correct. The issue arises during the dist packaging stage, which packages the output of all the other compiled modules.
Behavior of hadoop-yarn-project in Serial Compilation:

In serial compilation, Maven compiles the modules listed in the pom one by one, in sequence. After all of them are compiled, it compiles hadoop-yarn-project. During the prepare-package phase, maven-assembly-plugin is executed for packaging, and all packages are repackaged according to the descriptor hadoop-assemblies/src/main/resources/assemblies/hadoop-yarn-dist.xml.
Behavior of hadoop-yarn-project in Parallel Compilation:
Parallel compilation compiles modules according to the dependency order among them. Modules that do not declare dependencies on each other through dependency declarations are compiled in parallel. Following the dependency definitions in the pom of hadoop-yarn-project, its dependencies are compiled first, and then hadoop-yarn-project itself is compiled and runs its maven-assembly-plugin execution.
However, not all of the modules whose files are needed for packaging by hadoop-assemblies/src/main/resources/assemblies/hadoop-yarn-dist.xml are declared as dependencies of hadoop-yarn-project. Therefore, when hadoop-yarn-project is compiled and maven-assembly-plugin runs, some of the required modules may not have been built yet, which leads to errors in parallel compilation.
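
The failure mode described above can be illustrated with a small simulation. This is a simplified model, not Maven's actual scheduler: it groups modules into "waves" that could build concurrently, based only on declared dependencies. The module names (`mod-api`, `mod-common`, `mod-server`, `dist`) are hypothetical stand-ins, with `dist` playing the role of hadoop-yarn-project.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+


def build_waves(declared_deps):
    """Group modules into waves; modules in the same wave may build concurrently."""
    sorter = TopologicalSorter(declared_deps)
    sorter.prepare()
    waves = []
    while sorter.is_active():
        ready = sorted(sorter.get_ready())  # everything whose dependencies are done
        waves.append(ready)
        sorter.done(*ready)
    return waves


def wave_of(waves, module):
    """Index of the wave in which a module is built."""
    return next(i for i, wave in enumerate(waves) if module in wave)


# Hypothetical graph: "dist" packages the output of mod-server,
# but only declares a dependency on mod-api.
BEFORE_FIX = {
    "mod-api": set(),
    "mod-common": {"mod-api"},
    "mod-server": {"mod-common"},
    "dist": {"mod-api"},
}

# After the fix, "dist" declares every module it packages.
AFTER_FIX = dict(BEFORE_FIX, dist={"mod-api", "mod-common", "mod-server"})

if __name__ == "__main__":
    for label, graph in (("before", BEFORE_FIX), ("after", AFTER_FIX)):
        print(label, build_waves(graph))
```

In the "before" graph, `dist` becomes ready as soon as `mod-api` is built, so it can be packaged before `mod-server` exists. In the "after" graph, `dist` lands in the last wave.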
Solution:
The solution is relatively straightforward: collect all modules referenced in hadoop-assemblies/src/main/resources/assemblies/hadoop-yarn-dist.xml, and declare them as dependencies in the pom of hadoop-yarn-project.
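
One way to find what is missing is to compare the modules the assembly needs against the artifactIds already declared in the pom. The sketch below assumes the list of needed modules has been collected by hand from the assembly descriptor; the helper names and the `org.example` pom are hypothetical and are not taken from the actual patch.

```python
import xml.etree.ElementTree as ET

# Standard namespace used by Maven POM files.
POM_NS = {"m": "http://maven.apache.org/POM/4.0.0"}


def declared_artifact_ids(pom_xml):
    """Return the artifactIds listed under <dependencies> in a pom."""
    root = ET.fromstring(pom_xml)
    found = root.findall("m:dependencies/m:dependency/m:artifactId", POM_NS)
    return {el.text.strip() for el in found if el.text}


def missing_dependencies(needed_by_assembly, pom_xml):
    """Modules the assembly packages that the pom does not declare."""
    return sorted(set(needed_by_assembly) - declared_artifact_ids(pom_xml))


# Hypothetical pom for a dist-style module (illustration only).
EXAMPLE_POM = """<project xmlns="http://maven.apache.org/POM/4.0.0">
  <modelVersion>4.0.0</modelVersion>
  <groupId>org.example</groupId>
  <artifactId>example-dist</artifactId>
  <version>1.0</version>
  <dependencies>
    <dependency>
      <groupId>org.example</groupId>
      <artifactId>example-api</artifactId>
      <version>1.0</version>
    </dependency>
  </dependencies>
</project>"""

if __name__ == "__main__":
    needed = ["example-api", "example-server"]  # collected manually (assumption)
    print(missing_dependencies(needed, EXAMPLE_POM))
```

Each module reported as missing is a candidate for a new dependency entry in the packaging module's pom.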

How was this patch tested?

manual test

For code changes:

  • Does the title of this PR start with the corresponding JIRA issue id (e.g. 'BIGTOP-3638. Your PR title ...')?
  • Make sure that newly added files do not have any licensing issues. When in doubt refer to https://www.apache.org/licenses/

@JiaLiangC
Contributor Author

@sekikn @iwasakims
Could you help review this PR?

@JiaLiangC
Contributor Author

@sekikn @iwasakims @guyuqi
The patch for parallel compilation of Hadoop used in this pull request has already been merged into the Hadoop master branch. Could you please help review this pull request?
apache/hadoop#6373

Tested on Rocky Linux 8:

./docker-hadoop.sh --create 3 --image bigtop/puppet:trunk-rockylinux-8 --docker-compose-plugin --memory 8g --repo file:///bigtop-home/output --disable-gpg-check --stack hdfs,yarn,mapreduce --smoke-tests hdfs,yarn,mapreduce


@guyuqi
Member

guyuqi commented Jan 29, 2024

+1.

@guyuqi guyuqi merged commit d481826 into apache:master Jan 29, 2024
@guyuqi
Member

guyuqi commented Jan 29, 2024

Merged.
Thanks @JiaLiangC

@JiaLiangC JiaLiangC deleted the BIGTOP-4054a branch February 2, 2024 01:01