
Clean build: SourceFile(s) read twice #2691

Open
jukzi opened this issue Jul 8, 2024 · 20 comments

Comments

@jukzi
Contributor

jukzi commented Jul 8, 2024

During Project > "Clean all projects", SourceFile(s) are read multiple times.
Once from ReadManager - expected.
And later a second time from Parser.getMethodBodies() - without further digging, I guess this is wrong. The code is prepared to use a ReadManager there, but readManager == null.

Thread [Compiler Processing Task] (Suspended (breakpoint at line 11694 in org.eclipse.jdt.internal.compiler.parser.Parser))	
	org.eclipse.jdt.internal.compiler.parser.Parser.getMethodBodies(org.eclipse.jdt.internal.compiler.ast.CompilationUnitDeclaration) line: 11694	
	org.eclipse.jdt.internal.compiler.Compiler.process(org.eclipse.jdt.internal.compiler.ast.CompilationUnitDeclaration, int) line: 890	
	org.eclipse.jdt.internal.compiler.ProcessTaskManager.compile() line: 163	

[screenshot]
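For context, the relevant part of Parser.getMethodBodies() looks roughly like this (paraphrased, not an exact copy of the JDT sources): with readManager == null it falls back to ICompilationUnit.getContents(), which for a builder SourceFile reads the .java file from disk again.

// Paraphrased sketch of Parser.getMethodBodies(CompilationUnitDeclaration unit);
// field and method names may differ slightly from the actual sources.
char[] contents = this.readManager != null
    ? this.readManager.getContents(unit.compilationResult.compilationUnit)
    : unit.compilationResult.compilationUnit.getContents(); // second read from disk
this.scanner.setSource(contents, unit.compilationResult);
// ... the method bodies are then parsed against this source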

@jukzi
Contributor Author

jukzi commented Jul 9, 2024

Both reads happen during org.eclipse.jdt.internal.compiler.Compiler.compile(ICompilationUnit[], boolean).
The first read happens during org.eclipse.jdt.internal.compiler.Compiler.beginToCompile(org.eclipse.jdt.internal.compiler.env.ICompilationUnit[]) line: 393,
and the second read is triggered from
org.eclipse.jdt.internal.compiler.Compiler.processCompiledUnits(int, boolean) line: 606.

Reproducible with org.eclipse.jdt.core.tests.builder.AnnotationDependencyTests.testTypeAnnotationDependency()

@jukzi
Contributor Author

jukzi commented Jul 9, 2024

In Compiler.internalBeginToCompile() a decision is made whether a full parse is done or only a dietParse. By default it is only a dietParse, because parseThreshold <= 0. That diet parse skips the method bodies, so a second parse has to occur later to get them. That second parse is done in another thread ("Compiler Processing Task"); nevertheless the main compilation has to wait on that result anyway.
I measured an experiment with diet parse vs full parse.
[screenshots: diet parse vs. full parse measurements]
i.e. directly using a full parse uses more CPU time but is overall 10% faster, because it removes the extra read and the second parse.
So I do not see a good reason for a first diet parse.
@iloveeclipse WDYT, any idea why there is first a diet parse only?
Also, code like
this.parseThreshold = 0; // will request a full parse
seems to be wrong: any parseThreshold <= 0 may do the opposite - it does a dietParse if totalUnits > 0.
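For reference, the decision in Compiler.internalBeginToCompile() looks roughly like this (paraphrased, not an exact copy of the JDT sources):

// Units below the threshold get a full parse, everything else only a diet parse.
// With the default parseThreshold == 0 the condition is never true, so every
// unit is diet-parsed and the method bodies have to be parsed again later.
CompilationUnitDeclaration parsedUnit;
if (this.totalUnits < this.parseThreshold) {
    parsedUnit = this.parser.parse(sourceUnits[i], unitResult);
} else {
    parsedUnit = this.parser.dietParse(sourceUnits[i], unitResult);
}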

@iloveeclipse
Member

WDYT, any idea why there is first a diet parse only?

See https://github.com/eclipse-jdt/eclipse.jdt.core/wiki/ECJ-Parse

I'm not a compiler expert (but @srikanth-sankaran is), but I assume diet parse looks for obvious errors first (as it skips method bodies): e.g. if the class definition is already wrong, there is no reason to parse methods and do more work if the basics are broken anyway.

@srikanth-sankaran
Contributor

For a number of IDE operations it may suffice to work with empty method bodies altogether or drill down into just one method alone (where the cursor is).

Think of providing an outline view, generating a Java model, code completion, jump to declaration F3, search for overrides, hierarchy computations etc.

So a dietParse need not always be followed by parse of all method bodies.

In the batch compiler, it is certainly the case that we will populate all method bodies. But even here, the design breakdown into various phases may aesthetically fit better with the model of dietParse followed by parse of method bodies.

This is just one design. Javac doesn't have an equivalent mode.

Do you have numbers for a large project's full build ?

Any change here would also have to take into account memory utilization.

We also have the capability to null out method bodies once we no longer require them - and we do deploy it.
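A minimal sketch of what nulling out method bodies can look like on the AST (a hypothetical illustration of the idea, not the actual JDT mechanism; nested types omitted for brevity):

// Hypothetical helper: drop parsed statements once they are no longer
// required, so the AST shrinks back to roughly its diet-parse size.
static void releaseMethodBodies(CompilationUnitDeclaration unit) {
    if (unit.types == null) return;
    for (TypeDeclaration type : unit.types) {
        if (type.methods == null) continue;
        for (AbstractMethodDeclaration method : type.methods) {
            method.statements = null; // free the method body AST
        }
    }
}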

@jukzi
Contributor Author

jukzi commented Jul 9, 2024

For a number of IDE operations it may suffice to work with empty method bodies altogether or drill down into just one method alone (where the cursor is).

All such operations where only a single file is parsed complete in a blink of an eye anyway. A performance difference can only be perceived when parsing many files, like build, search or type hierarchy.

Do you have numbers for a large project's full build ?

See the screenshots in #2691 (comment):
building the platform workspace takes 65 sec vs. 73 sec in Compiler.compile.

@srikanth-sankaran
Contributor

Well, by all means propose a patch, let us see what it takes to make it green with the full test suite, and take a call.

Short answer is there is nothing sacrosanct about the split into diet parse plus populating that diet AST with parsed method bodies. As I said javac doesn't follow such an approach.

It may be significant work though, given this is a long-standing design choice that is pervasive in the implementation.

But before you invest much effort let us ask @stephan-herrmann for his opinion on this too to make sure I haven't overlooked some key observation here.

jukzi pushed a commit to jukzi/eclipse.jdt.core that referenced this issue Jul 9, 2024
Use parse() instead of dietParse() to avoid parse during
Parser.getMethodBodies()

eclipse-jdt#2691
@stephan-herrmann
Contributor

As this challenges one of the foundations of JDT's design, any change must be done with utmost caution.

Some comments:

All such operations where only a single file is parsed complete in a blink of an eye anyway.

Consider the example of code completion: indeed parsing may be super fast, but when later the AST is resolved not having to resolve uninteresting methods may very well make a significant difference. Type inference is a candidate that can take several seconds per file.

MatchLocator takes the opposite approach, its method purgeMethodStatements() implements what @srikanth-sankaran mentioned above: remove statements after a full parse. But this method cannot directly be re-used for completion, as it uses a search-specific approach.

I think it would be very hard to demonstrate that a change of strategy would be beneficial for all uses of the compiler.

I could imagine, though, an approach that specifically tells the compiler when it is called by a builder or as batch compiler that it should do a full parse right away (and make sure it doesn't still try a second parse for method bodies :) ).
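A hypothetical sketch of such a hint - the flag name and wiring below are invented for illustration and do not exist in the current API:

// Invented flag: the builder / batch entry point requests full parsing up front.
options.requestFullParseUpFront = true; // hypothetical CompilerOptions field

// Inside Compiler.internalBeginToCompile() the parse decision could then become:
boolean fullParse = this.options.requestFullParseUpFront
        || this.totalUnits < this.parseThreshold;
CompilationUnitDeclaration parsedUnit = fullParse
        ? this.parser.parse(sourceUnits[i], unitResult)
        : this.parser.dietParse(sourceUnits[i], unitResult);
// ...and Parser.getMethodBodies() would skip its second parse (and second read)
// whenever the bodies are already present.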

@stephan-herrmann
Contributor

[...] removes the extra read [...]

Do we know how much of the overhead relates to parsing vs. re-reading? Some implementations of ICompilationUnit cache the file content, so that's another way to avoid the extra read.

@jukzi
Contributor Author

jukzi commented Jul 9, 2024

Do we know how much of the overhead relates to parsing vs. re-reading? Some implementations of ICompilationUnit cache the file content, so that's another way to avoid the extra read.

SourceFile intentionally does not cache the content to save memory, and the second read does not use the ReadManager, which results in ~6 seconds of read overhead in my example:
[screenshot: read overhead measurement]

@stephan-herrmann
Contributor

SourceFile intentionally does not cache the content

@srikanth-sankaran if the CU doesn't cache, do you see value in caching the source in CompilationUnitDeclaration instead? At least this could mean we no longer need Parser.stashTextualRepresentation(FunctionalExpression) :)

Of course such cached source can be freed during compilation (if no functional expressions: after parseMethodBodies(), otherwise after resolve).
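A hypothetical sketch of that idea - cachedSource is an invented field, not existing JDT code:

// In the first (diet) parse, keep the once-read source on the AST root:
char[] contents = sourceUnit.getContents();          // the only read from disk
compilationUnitDeclaration.cachedSource = contents;  // invented field

// In Parser.getMethodBodies(), reuse it instead of re-reading the file:
char[] source = unit.cachedSource != null
        ? unit.cachedSource
        : unit.compilationResult.compilationUnit.getContents();

// Release it as soon as it is no longer needed - after parseMethodBodies(),
// or after resolve() when functional expressions needed the textual form:
unit.cachedSource = null;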

@srikanth-sankaran
Contributor

All such operations where only a single file is parsed complete in a blink of an eye anyway. A performance difference can only be perceived when parsing many files, like build, search or type hierarchy.

I am all for experimentation - but I echo @stephan-herrmann's concerns: "As this challenges one of the foundations of JDT's design, any change must be done with utmost caution." and "I think it would be very hard to demonstrate that a change of strategy would be beneficial for all uses of the compiler."

In fact I would rephrase his last sentence and say I think it would be very hard to demonstrate that a change of strategy would NOT BE HARMFUL for ANY uses of the compiler.

So let us by all means explore but Festina lente! Let us make haste slowly!

I could imagine, though, an approach that specifically tells the compiler when it is called by a builder or as batch compiler that it should do a full parse right away (and make sure it doesn't still try a second parse for method bodies :) ).

As per my earlier comment, let us not focus exclusively on time to the exclusion of space. Case in point, while we are in this call stack:

LookupEnvironment.completeTypeBindings() line: 558	
Compiler.internalBeginToCompile(ICompilationUnit[], int) line: 880	
Compiler.beginToCompile(ICompilationUnit[]) line: 393	
Compiler.compile(ICompilationUnit[], boolean) line: 443	
Compiler.compile(ICompilationUnit[]) line: 425	
BatchImageBuilder(AbstractImageBuilder).compile(SourceFile[], SourceFile[], boolean) line: 414	
BatchImageBuilder.compile(SourceFile[], SourceFile[], boolean) line: 211	
BatchImageBuilder(AbstractImageBuilder).compile(SourceFile[]) line: 345	
BatchImageBuilder.build() line: 79	
JavaBuilder.buildAll() line: 286	
JavaBuilder.build(int, Map, IProgressMonitor) line: 192	

we have ALL the source files parsed in diet mode and that is perfectly good enough to carry out the operation of "completing type bindings" which has these stages:

		CHECK_AND_SET_IMPORTS,
		CONNECT_TYPE_HIERARCHY,
		SEAL_TYPE_HIERARCHY,
		BUILD_FIELDS_AND_METHODS,
		INTEGRATE_ANNOTATIONS_IN_HIERARCHY,
		CHECK_PARAMETERIZED_TYPES

Now I worry what is the implication of a full parse at this point for memory utilization - we will be holding complete parse trees of all compilation units which is not necessary for the operation at hand.

While we may observe a 6 second saving for the platform workspace, are there projects out there where the memory pressure could zoom out of control, resulting in thrashing?

@srikanth-sankaran
Contributor

As this challenges one of the foundations of JDT's design, any change must be done with utmost caution.

Some comments:

All such operations where only a single file is parsed complete in a blink of an eye anyway.

Consider the example of code completion: indeed parsing may be super fast, but when later the AST is resolved not having to resolve uninteresting methods may very well make a significant difference. Type inference is a candidate that can take several seconds per file.

Correct: in a large file, if code assist starts resolving fuller parse trees, we will start seeing sluggish interactive performance.
I also second the point about inference in some corner (and not-so-corner) cases taking significant time - this can cause a slowdown in interactive performance at a distant point in the source.

MatchLocator takes the opposite approach, its method purgeMethodStatements() implements what @srikanth-sankaran mentioned above: remove statements after a full parse. But this method cannot directly be re-used for completion, as it uses a search-specific approach.

I think it would be very hard to demonstrate that a change of strategy would be beneficial for all uses of the compiler.

I could imagine, though, an approach that specifically tells the compiler when it is called by a builder or as batch compiler that it should do a full parse right away (and make sure it doesn't still try a second parse for method bodies :) ).

@srikanth-sankaran
Contributor

SourceFile intentionally does not cache the content

@srikanth-sankaran if the CU doesn't cache, do you see value in caching the source in CompilationUnitDeclaration instead? At least this could mean we no longer need Parser.stashTextualRepresentation(FunctionalExpression) :)

I can connect the dots from across different tickets and make out you don't like this design 😆

To answer: maybe.

@jukzi
Contributor Author

jukzi commented Jul 10, 2024

Now I worry what is the implication of a full parse at this point for memory utilization - we will be holding complete parse trees of all compilation units which is not necessary for the operation at hand.

That's an argument. So maybe the initial idea of having two parse phases is to use less memory. I took a measurement: building the platform workspace, I put a breakpoint while compiling the biggest project, which is jdt.ui with 1759 sourceUnits. The breakpoint was at the completeTypeBindings() call. I forced a GC and took a heap dump; I did it once with a diet parse and another time with a full parse.
With a full parse the heap dump was ~481 MB.
With a diet parse the heap dump was ~415 MB.

The difference is a total of ~ 66 MB (+15%) distributed like this:
[screenshot: heap dump comparison]
An example GC root of the most added memory is:
[screenshot: example GC root]

At this point I don't draw any conclusion yet, but just want to share the result.

@stephan-herrmann
Contributor

@srikanth-sankaran if the CU doesn't cache, do you see value in caching the source in CompilationUnitDeclaration instead? At least this could mean we no longer need Parser.stashTextualRepresentation(FunctionalExpression) :)

I can connect the dots from across different tickets and make out you don't like this design 😆

There's nothing "wrong" with your design, only a funny connection between a trick in the parser and a rabbit-out-of-the-hat in LE.copy() :) - my comment was only meant to suggest that I prefer one design that cleanly covers several issues, rather than several strategies. OTOH, memory considerations already speak against my suggestion.

@stephan-herrmann
Contributor

stephan-herrmann commented Jul 10, 2024

The difference is a total of ~ 66 MB (+15%) distributed like this:

Compared to an 11% speedup, this sounds like a consideration we would perhaps want to pass on to users to decide? OTOH, build time has no hard limit, memory may have one ...

@srikanth-sankaran
Contributor

The difference is a total of ~ 66 MB (+15%) distributed like this:

Compared to an 11% speedup, this sounds like a consideration we would perhaps want to pass on to users to decide? OTOH, build time has no hard limit, memory may have one ...

I am not aware of serious complaints about build times. It is great to proactively track and improve, of course. That said, if our performance improvement resource budget is limited - which I suspect it is - I would imagine there ought to be other things we can pursue that don't have an equivalent downside risk. This is not to be seen as discouraging further exploration here.

@jukzi
Contributor Author

jukzi commented Jul 11, 2024

I am not aware of serious complaints about build times

Nothing to complain about, but if we could make it faster, my colleagues who are waiting 15 min each day for their builds to complete could use their time better than for waiting. So if you have any idea how to make it faster, please tell. According to my measurements there is no hotspot except the file accesses on Windows.

I think we could also avoid the second reading of the files by caching a SoftReference of the content, as I have plenty of unused memory... but for now it's vacation time.

@iloveeclipse
Member

I think we could also avoid the second reading of the files by caching a SoftReference of the content

That would be a less invasive change than everything I saw proposed before.

@stephan-herrmann
Contributor

my colleagues who are waiting 15min each day for their builds

In my experience the most significant waste of time typically happens when builders are triggered unnecessarily, in particular when builders trigger each other, or when full builds happen after a restart although the previous exit should have saved a clean state, etc. Perhaps this was already improved recently, and anyway those issues are hard to reproduce & fix, but to me that sounds more like dealing with minutes rather than the few seconds reported above.

ymmv

jukzi pushed a commit to jukzi/eclipse.jdt.core that referenced this issue Sep 4, 2024
During compile parsing happens in two stages:
1. diet parse (any blocks like method bodies are skipped)
2. parse bodies
Both phases did read the source .java file from file system. With this
change the file contents is kept until no longer needed. It is cached in
a SoftReference to avoid OutOfMemoryError.

eclipse-jdt#2691
jukzi linked a pull request Sep 4, 2024 that will close this issue
jukzi pushed a commit that referenced this issue Oct 7, 2024
During compile parsing happens in two stages:
1. diet parse (any blocks like method bodies are skipped)
2. parse bodies
Both phases did read the source .java file from file system. With this
change the file contents is kept in CompilationResult.contentRef until
no longer needed. It is cached in
a SoftReference to avoid OutOfMemoryError.

#2691
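The merged change keeps the file contents in CompilationResult.contentRef. A minimal sketch of that caching pattern follows (simplified; the field name is taken from the commit message above, the accessor shape is an assumption, not a copy of the actual patch):

import java.lang.ref.SoftReference;

public class CompilationResult {
    // Source read by the first (diet) parse; the GC may clear it under memory pressure.
    private SoftReference<char[]> contentRef;

    public void cacheContents(char[] contents) {
        this.contentRef = new SoftReference<>(contents);
    }

    // Returns the cached source, or null if it was never set or was already collected.
    public char[] getCachedContents() {
        SoftReference<char[]> ref = this.contentRef;
        return ref == null ? null : ref.get();
    }

    public void releaseContents() {
        this.contentRef = null; // not needed once the method bodies are parsed
    }
}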