Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Concurrency issue in org.eclipse.equinox.weaving.caching might cause "ClassFormatError: truncated class file" #233

Closed
xpomul opened this issue Mar 16, 2023 · 2 comments

Comments

@xpomul
Copy link
Contributor

xpomul commented Mar 16, 2023

Summary

If Equinox Weaving with AspectJ in conjunction with caching is used in a multithreaded setting, it can happen that a class file that is currently written to disk is read with 0 byte in a concurrent thread. This can lead to a ClassFormatError: truncated class file non-deterministically.

Setting / Environment

We are building an Eclipse RCP / IDE product that is tested on a virtualized CI environment on Windows VMs with RCPTT.
The IDE product itself is composed of multiple custom and 3rd party bundles that also perform a lot of background initialization; also the IDE product includes a small amount of Aspects and thus, includes the Equinox AspectJ load-time weaving and caching bundles.

The RCPTT test runtime uses Equinox AspectJ load-time weaving as well to hook into all kinds of SWT and JFace classes.
Our RCPTT test cases include a lot of restarts of the IDE product.

So basically now the product under test includes concurrent class loads, and Equinox AspectJ load-time weaving and caching with a larger number of Aspects and is restarted frequently.

The Symptom

Almost on every build a different test case fails with the root cause being a failed class load with the error java.lang.ClassFormatError: truncated class file.

Analysis

Since the bug is only reproducible on our CI, not on a local machine (most likely the timing on our CI VMs is more on the exotic side), live debugging is difficult to impossible. So I added logging statements here and there to nail down the root cause.
It turns our that BundleCachingService.storeClass() is called with a valid byte[] array. A few lines of log output later, BundleCachingService.findStoredClass() for the same class is called and results in a CacheEntry containing a 0-byte array.
That 0-byte array is propagated into the ClassLoader and results in the aforementioned ClassFormatError.

What happens internally:
Whenever a candidate class for Aspect Weaving is encountered, the cache checks whether a cache file for the woven class exists in the file system. If so, it is loaded.

If not, the class is woven and then given to the cache for storing. The cache does not store it directly (synchronously), but puts it in a writer queue from which a writer thread takes classes one by one to write to disk.

It can now happen that a class is already woven and exists in the cache writer queue, while another thread wants to load the same class and checks the file system, resulting in a cache miss. In this case, the class is even woven twice (still no problem, only a small performance penalty).

But in a rarer case, the writer thread could currently be in the process of writing the file to disk (so, e.g., an empty file could already exist, just the content is not yet flushed), while another thread checks the cache and file system. This can lead to a 0-byte read, and thus, a 0-byte array.

Solution

I had a short discussion with @martinlippert and from that, I have tried to first implement a lookup map for classes that are currently in the writer queue. This solves the issue of weaving the same class multiple times, because even if the class is not yet written to disk, we have a cache hit and can already reuse that class in the writer queue.

This implementation made things a bit better, but still resulted in the same symptom; it only became rarer.

The problem is most likely that there is no synchronization between cache reads and writes, and it can be the case that the reader has a cache miss, and immediately at that point (before the reader checks the file system), the writer is given the same class to store and it directly starts to store it, which still can result in an incomplete file being present when the reader now checks the file system.

So, I implemented a synchronization mechanism in addition to the queue lookup. This synchronization uses one RWLock per class name, so that we still have optimal throughput as long as no conflicting access occurs.

With this implementation, the RCPTT tests are now rock-solid.

@martinlippert
Copy link
Contributor

I am wondering why the lookup map idea isn't enough in case we remove the item from that lookup map AFTER writing the cache item to disk, but let's continue that discussion probably on the PR: #234

tjwatson pushed a commit that referenced this issue Mar 24, 2023
#233

Fix concurrency issue in Equinox Caching leading to 0-length byte array
to be read

Signed-off-by: Stefan Winkler <[email protected]>
tjwatson pushed a commit that referenced this issue Mar 24, 2023
#233

Bumped version of org.eclipse.equinox.weaving.caching to 1.2.300

Signed-off-by: Stefan Winkler <[email protected]>
@xpomul
Copy link
Contributor Author

xpomul commented Mar 25, 2023

fixed by #234

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants