Concurrency issue in org.eclipse.equinox.weaving.caching might cause "ClassFormatError: truncated class file" #233

xpomul · 2023-03-16T14:52:55Z

Summary

If Equinox Weaving with AspectJ is used together with caching in a multithreaded setting, a class file that is still being written to disk can be read as 0 bytes by a concurrent thread. This non-deterministically leads to a ClassFormatError: truncated class file.

Setting / Environment

We are building an Eclipse RCP / IDE product that is tested with RCPTT in a virtualized CI environment on Windows VMs.
The IDE product is composed of multiple custom and 3rd-party bundles that perform a lot of background initialization. It also includes a small number of aspects and therefore the Equinox AspectJ load-time weaving and caching bundles.

The RCPTT test runtime uses Equinox AspectJ load-time weaving as well to hook into all kinds of SWT and JFace classes.
Our RCPTT test cases include a lot of restarts of the IDE product.

So the product under test combines concurrent class loading with Equinox AspectJ load-time weaving and caching, uses a larger number of aspects, and is restarted frequently.

The Symptom

On almost every build, a different test case fails. The root cause is always a failed class load with the error java.lang.ClassFormatError: truncated class file.

Analysis

Since the bug is only reproducible on our CI and not on a local machine (most likely the timing on our CI VMs is on the exotic side), live debugging is difficult to impossible. So I added logging statements to pin down the root cause.
It turns out that BundleCachingService.storeClass() is called with a valid byte[] array. A few lines of log output later, BundleCachingService.findStoredClass() is called for the same class and returns a CacheEntry containing a 0-byte array.
That 0-byte array is propagated into the ClassLoader and results in the aforementioned ClassFormatError.

What happens internally:
Whenever a candidate class for Aspect Weaving is encountered, the cache checks whether a cache file for the woven class exists in the file system. If so, it is loaded.

If not, the class is woven and then given to the cache for storing. The cache does not store it directly (synchronously), but puts it in a writer queue from which a writer thread takes classes one by one to write to disk.

It can now happen that a class has already been woven and is waiting in the cache writer queue while another thread wants to load the same class and checks the file system, resulting in a cache miss. In this case, the class is simply woven twice (not a correctness problem, only a small performance penalty).

But in a rarer case, the writer thread may be in the middle of writing the file to disk (e.g., an empty file already exists, but its content has not been flushed yet) while another thread checks the cache and the file system. This can lead to a 0-byte read, and thus a 0-byte array.
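To illustrate the window described above, here is a minimal, single-threaded sketch. It does not use the actual caching bundle code; the class EmptyFileWindowDemo and its method are hypothetical names chosen for this example. It only shows that a file opened for writing already exists on disk with 0 bytes before any content is written, which is the state a concurrent reader could observe.

```java
import java.io.IOException;
import java.io.OutputStream;
import java.nio.file.Files;
import java.nio.file.Path;

/** Hypothetical demo class; not part of org.eclipse.equinox.weaving.caching. */
class EmptyFileWindowDemo {

    /**
     * Opens the file for writing, reads it back before any content has been
     * written, then completes the write. Returns the number of bytes the
     * early read saw.
     */
    static int bytesSeenByEarlyReader(Path cacheFile, byte[] classBytes) throws IOException {
        int seen;
        try (OutputStream out = Files.newOutputStream(cacheFile)) {
            // The file exists (created or truncated) but is still empty here.
            seen = Files.readAllBytes(cacheFile).length;
            out.write(classBytes);
        }
        return seen;
    }

    public static void main(String[] args) throws IOException {
        Path file = Files.createTempFile("Woven", ".class");
        byte[] classBytes = {(byte) 0xCA, (byte) 0xFE, (byte) 0xBA, (byte) 0xBE};
        System.out.println("early reader saw " + bytesSeenByEarlyReader(file, classBytes) + " bytes");
        System.out.println("after close: " + Files.readAllBytes(file).length + " bytes");
        Files.delete(file);
    }
}
```

In the real scenario the early read happens on a different thread, so whether it hits this window depends on timing, which would be consistent with the failures appearing only on the CI VMs.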

Solution

I had a short discussion with @martinlippert, and based on that I first tried implementing a lookup map for classes that are currently in the writer queue. This solves the issue of weaving the same class multiple times: even if the class is not yet written to disk, we get a cache hit and can reuse the class bytes from the writer queue.
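The lookup map idea could be sketched roughly as follows. This is not the code from the PR; PendingWriteIndex, DiskWriter, and all method names are hypothetical and only illustrate the technique. The sketch assumes the entry is removed from the map only after the disk write has finished, so that a lookup during the write still hits the in-memory bytes.

```java
import java.io.IOException;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;
import java.util.concurrent.LinkedBlockingQueue;

/** Hypothetical sketch of a writer queue with a lookup map for pending classes. */
class PendingWriteIndex {

    /** Hypothetical callback that performs the actual disk write. */
    interface DiskWriter {
        void write(String className, byte[] bytes) throws IOException;
    }

    private final ConcurrentMap<String, byte[]> pending = new ConcurrentHashMap<>();
    private final BlockingQueue<String> queue = new LinkedBlockingQueue<>();

    /** Called after weaving: register the bytes first, then queue the write. */
    void enqueue(String className, byte[] bytes) {
        pending.put(className, bytes);
        queue.add(className);
    }

    /** Called by class loading threads before they check the file system. */
    byte[] findPending(String className) {
        return pending.get(className);
    }

    /** Called by the writer thread. Returns false if the queue was empty. */
    boolean writeNext(DiskWriter writer) throws IOException {
        String className = queue.poll();
        if (className == null) {
            return false;
        }
        byte[] bytes = pending.get(className);
        if (bytes != null) {
            writer.write(className, bytes);
            // Remove only AFTER the write, and only if the entry was not replaced.
            pending.remove(className, bytes);
        }
        return true;
    }

    public static void main(String[] args) throws IOException {
        PendingWriteIndex index = new PendingWriteIndex();
        index.enqueue("com.example.Foo", new byte[] {1, 2, 3});
        System.out.println("hit before write: " + (index.findPending("com.example.Foo") != null));
        index.writeNext((name, bytes) -> { /* the disk write would happen here */ });
        System.out.println("hit after write: " + (index.findPending("com.example.Foo") != null));
    }
}
```

Note that nothing in this sketch coordinates the map lookup with a reader's subsequent file system check, so it does not by itself rule out a reader seeing a partially written file.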

This implementation made things a bit better, but still resulted in the same symptom; it only became rarer.

The problem is most likely that there is no synchronization between cache reads and writes. The reader can get a cache miss in the lookup map, and at exactly that point (before the reader checks the file system) the writer is handed the same class and immediately starts to store it. The reader then checks the file system and can still find an incomplete file.

So I implemented a synchronization mechanism in addition to the queue lookup. It uses one read/write lock (RWLock) per class name, so that we still get optimal throughput as long as no conflicting access occurs.
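A per-class-name read/write lock could look roughly like the sketch below. This is an illustration under assumptions, not the implementation from the PR: PerClassLocks and its methods are hypothetical names, and the sketch never removes locks from the map, which a real implementation would probably want to handle.

```java
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;
import java.util.concurrent.locks.Lock;
import java.util.concurrent.locks.ReadWriteLock;
import java.util.concurrent.locks.ReentrantReadWriteLock;
import java.util.function.Supplier;

/** Hypothetical sketch: one read/write lock per class name. */
class PerClassLocks {

    private final ConcurrentMap<String, ReadWriteLock> locks = new ConcurrentHashMap<>();

    private ReadWriteLock lockFor(String className) {
        // Atomically creates the lock on first use of this class name.
        return locks.computeIfAbsent(className, name -> new ReentrantReadWriteLock());
    }

    /** Runs a cache lookup under the read lock; readers of the same class do not block each other. */
    <T> T read(String className, Supplier<T> lookup) {
        Lock lock = lockFor(className).readLock();
        lock.lock();
        try {
            return lookup.get();
        } finally {
            lock.unlock();
        }
    }

    /** Runs a cache store under the write lock; excludes readers and writers of the same class only. */
    void write(String className, Runnable store) {
        Lock lock = lockFor(className).writeLock();
        lock.lock();
        try {
            store.run();
        } finally {
            lock.unlock();
        }
    }

    public static void main(String[] args) {
        PerClassLocks locks = new PerClassLocks();
        byte[][] cache = new byte[1][];
        locks.write("com.example.Foo", () -> cache[0] = new byte[] {1, 2, 3});
        int length = locks.read("com.example.Foo", () -> cache[0].length);
        System.out.println("bytes read under lock: " + length);
    }
}
```

Because locks are keyed by class name, reads and writes for different classes never contend. Only a read that overlaps a write of the same class has to wait until the file is complete.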

With this implementation, the RCPTT tests are now rock-solid.

martinlippert · 2023-03-20T14:42:57Z

I am wondering why the lookup map idea isn't enough if we remove the item from the lookup map AFTER writing the cache item to disk, but let's continue that discussion on the PR: #234

xpomul · 2023-03-25T15:02:23Z

fixed by #234

xpomul mentioned this issue Mar 16, 2023

Fix concurrency issue in equinox caching #234

Merged

tjwatson pushed a commit that referenced this issue Mar 24, 2023

Fix concurrency issue in Equinox Caching

e827957

#233 Fix concurrency issue in Equinox Caching leading to 0-length byte array to be read Signed-off-by: Stefan Winkler <[email protected]>

tjwatson pushed a commit that referenced this issue Mar 24, 2023

Fix concurrency issue in Equinox Caching

f60ed03

#233 Bumped version of org.eclipse.equinox.weaving.caching to 1.2.300 Signed-off-by: Stefan Winkler <[email protected]>

xpomul closed this as completed Mar 25, 2023

xpomul mentioned this issue Apr 21, 2023

Fix concurrency issue in Equinox Caching (errata) #252

Merged

tjwatson pushed a commit that referenced this issue Apr 25, 2023

Fix concurrency issue in Equinox Caching

d1fbae8

#233 Fix bug in original PR Signed-off-by: Stefan Winkler <[email protected]>

xpomul mentioned this issue Aug 7, 2023

Equinox Caching: Improved Exception and Shutdown Handling #300

Merged
