Make transaction log publish atomic in local file system#27904
chenjian2664 wants to merge 3 commits into trinodb:master
Conversation
@chenjian2664 note that this PR relates to #27484
This is not true.
@chenjian2664 could you please add a reference in the description to the recorded CI failures against object storage providers?
findinpath left a comment
The flakiness is generated by testing against the local file system.
See #27484 (comment)
If the same issue occurred against object storage, we would have a bug in the implementation.
@findinpath just considering the reading phase: when the queries arrive at the connector, it is possible they read the same version of the table, right?
yes, sure. This is an operation which reads from the same table it inserts into. When dealing with a non-blind append, we check whether other concurrent operations that are also non-blind inserts have inserted data, and if so, fail the operation. So even if two operations read the same version of the table, only one of them will actually succeed in performing the commit.
@findinpath Thanks for guiding me to the place that commits the insert: `checkForConcurrentTransactionConflicts`.
findinpath left a comment
Well done @chenjian2664 ❤️
Please add
@findinpath Can we re-enable the disabled tests in `TestDeltaLakeLocalConcurrentWritesTest` now?
@findinpath Nice! @chenjian2664 Could you re-enable those tests?
CI failures: https://github.com/trinodb/trino/actions/runs/21194534248/job/60967948329?pr=27904 - the rename is atomic but not exclusive :(
@findinpath @ebyhr I am afraid we can't enable the tests of #21725, the
@chenjian2664 yes, I remember now - let's handle this in a separate contribution.
Previously, transaction log publishing relied on file creation with `CREATE_NEW` to ensure that only one concurrent writer succeeds. While this prevents multiple writers from winning, it does not guarantee atomic visibility of the log contents. Readers may observe an empty or partially written file while the writer is still writing the transaction log. This change writes the transaction log to a temporary file and publishes it using an atomic rename, ensuring readers only observe a fully written log or no file at all.
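As a minimal sketch of the approach described in this commit message (assuming local-filesystem semantics; the class and method names here are illustrative, not the actual Trino API):

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;

import static java.nio.file.StandardCopyOption.ATOMIC_MOVE;
import static java.nio.file.StandardOpenOption.CREATE_NEW;
import static java.nio.file.StandardOpenOption.WRITE;
import static java.util.UUID.randomUUID;

public class AtomicPublishSketch
{
    // Write the data to a sibling temp file first, then atomically rename it
    // into place. A reader either finds no file at the target path, or a file
    // that already contains the complete content.
    public static void publish(Path target, byte[] data)
            throws IOException
    {
        Path tmp = target.resolveSibling(target.getFileName() + "." + randomUUID() + ".tmp");
        Files.write(tmp, data, CREATE_NEW, WRITE);
        Files.move(tmp, target, ATOMIC_MOVE);
    }

    public static void main(String[] args)
            throws IOException
    {
        Path dir = Files.createTempDirectory("atomic-publish-");
        Path target = dir.resolve("00000000000000000001.json");
        publish(target, "{\"commitInfo\":{}}".getBytes());
        if (!new String(Files.readAllBytes(target)).equals("{\"commitInfo\":{}}")) {
            throw new AssertionError("published file is not fully visible");
        }
    }
}
```

Note this sketch addresses only visibility; as discussed below in this thread, the atomic rename alone does not make the publish exclusive.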
`testConcurrentInsertsSelectingFromTheSameTable`

```java
boolean lockCreated = false;
Path lockPath = path.resolveSibling(path.getFileName() + ".lock");
```
It looks like we have two mechanisms (file locks and atomic moves) to achieve the same goal (exclusion).
So, what is the lock needed for?
The file lock is for exclusion - only one thread should be able to create the target file at a time.
The move is for visibility (atomicity): readers should either not see the target file, or see it exist with its full content.
Please document this line of thinking with a comment.
@chenjian2664 lock file is "cooperative locking" - the parties involved need to agree on the "protocol" they are following (the lock file existence, its name, locked region).
rename seems to be sufficient for "cooperative exclusion"
i.e. if there is no lock file at all, what could possibly go wrong?
@findepi The reason is that `Files.move` is not guaranteed to be exclusive; it is only guaranteed to be atomic.
If there is no lock, multiple threads may each think they moved the file successfully, while in fact they overwrite each other - whoever commits last wins.
I believe that Files.move with ATOMIC_MOVE and without REPLACE_EXISTING is exclusive.
At least it's IMO documented as such.
Am I wrong?
I tested with the code below:

```java
import java.io.IOException;
import java.io.OutputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.concurrent.CyclicBarrier;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

import static java.nio.file.StandardCopyOption.ATOMIC_MOVE;
import static java.nio.file.StandardOpenOption.CREATE_NEW;
import static java.nio.file.StandardOpenOption.WRITE;

public static void main(String[] args)
        throws Exception
{
    Path tempDirectory = Files.createTempDirectory("trino-test-");
    ExecutorService executorService = Executors.newFixedThreadPool(10);
    CyclicBarrier barrier = new CyclicBarrier(10);
    Path targetFilePath = tempDirectory.resolve("target.test");
    for (int i = 0; i < 10; i++) {
        Path tmpFilePath = tempDirectory.resolve("file-" + i + ".test");
        executorService.execute(() -> {
            try {
                writeFile(tmpFilePath);
                barrier.await();
                Files.move(tmpFilePath, targetFilePath, ATOMIC_MOVE);
                System.out.println("Moved " + tmpFilePath.getFileName() + " to target.test");
            }
            catch (Exception e) {
                throw new RuntimeException(e);
            }
        });
    }
    Thread.sleep(5000);
    executorService.shutdown();
}

private static void writeFile(Path tmpFilePath)
        throws IOException
{
    try (OutputStream out = Files.newOutputStream(tmpFilePath, CREATE_NEW, WRITE)) {
        out.write(("this is " + tmpFilePath.getFileName()).getBytes());
    }
}
```

Multiple tmp files think they were moved successfully. Could you please help to confirm, or am I using the move wrongly?
You're right. I have not read the lengthy `Files.move` javadoc carefully enough. Thanks for setting me straight.
We may get away from this by:
- calling some native rename / renameat / renameat2 function. The Linux kernel awesomely supports this basic operation of renaming a file without overwriting the target, should it exist. It's Java's fault it doesn't work
  - not great, as it's going to be some unpretty JNA / FFM call and may or may not work on macOS & Linux the same way
- working around the limitation with `Files.createLink`. It seems to deliver exclusive target creation (on macOS at least)
  - not great, as this function seems quite platform-dependent. It doesn't seem to guarantee exclusiveness
- reminding ourselves we're doing this for our tests only. While `LocalFileSystem` is a production facility which needs to be implemented well, the only call site we're really concerned about is `LocalTransactionLogSynchronizer` (which was supposed to be moved to the test dir). For internal test purposes we can simply synchronize in memory on a lock. BTW this is how `FileHiveMetastore` performs synchronization around filesystem operations.

Let's do this last option.
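A rough sketch of that last option (hypothetical; the actual `LocalTransactionLogSynchronizer` and `FileHiveMetastore` code differ), assuming all test writers run inside one JVM:

```java
import java.io.IOException;
import java.nio.file.FileAlreadyExistsException;
import java.nio.file.Files;
import java.nio.file.Path;

public class InMemoryLockedWriter
{
    // One process-wide lock is enough for tests, where every concurrent
    // writer lives in the same JVM.
    private static final Object LOCK = new Object();

    public static void writeExclusive(Path target, byte[] data)
            throws IOException
    {
        synchronized (LOCK) {
            // Inside the lock, exists() + write() is effectively atomic:
            // only one writer can create the target file.
            if (Files.exists(target)) {
                throw new FileAlreadyExistsException(target.toString());
            }
            Files.write(target, data);
        }
    }

    public static void main(String[] args)
            throws IOException
    {
        Path dir = Files.createTempDirectory("in-memory-lock-");
        Path target = dir.resolve("00000000000000000001.json");
        writeExclusive(target, "first".getBytes());
        try {
            writeExclusive(target, "second".getBytes());
            throw new AssertionError("second writer should have failed");
        }
        catch (FileAlreadyExistsException expected) {
            // only the first writer wins
        }
    }
}
```

This trades cross-process safety for simplicity, which is acceptable only because the call site is test-only.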
Sorry for the late reply.
I did consider an in-memory lock at first. My concern is about how we validate the log synchronizer design. We're currently relying entirely on the semantics of `createExclusive` and assuming the synchronizer logic is correct.
If we can't prove the local concurrency behavior is correct, how can we trust it in the cloud, even if the filesystem provides the same `createExclusive` guarantees? That's why I wanted this to work across multiple clusters. It's still not a strong guarantee, but at least it would give us some confidence that matching `createExclusive` semantics won't lead to corrupted Delta logs or to wrongly thinking a commit succeeded.
But looking at the `FileHiveMetastore` you mentioned, it seems I am overthinking?
Also, I can't find tests for concurrent writes in the cloud log synchronizers. Is that because they would be flaky, or am I missing the tests somewhere? What's the reason behind that?
```java
    throw new FileAlreadyExistsException(location.toString());
}

Closer closer = Closer.create();
Path lockPath = path.resolveSibling(path.getFileName() + ".lock");
FileChannel channel = FileChannel.open(lockPath, CREATE_NEW, WRITE);
closer.register(channel);
closer.register(() -> Files.deleteIfExists(lockPath));
```
The lock handling is quite problematic:
- what if the lock file was left behind by a previous attempt (process killed and didn't remove it)? The CREATE_NEW will trigger lock file creation failure, and the application won't recover
- assuming this is fixed, imagine 3 threads contending for the write. A locks, B fails and removes the lock file during its cleanup, C creates a new lock file and also locks. Now A and C both think they have the lock.

If locks are gone (https://github.com/trinodb/trino/pull/27904/changes#r2730875323), great.
If not, these problems need to be addressed somehow.
For the first question, I agree it is a problem, but note there will always be an issue if we use the `CREATE_NEW` strategy to create a file; i.e. in the previous implementation we used it to create the file and then wrote into it, and if the process is killed during writing we face the same issue.
For the second question, `FileChannel channel = FileChannel.open(lockPath, CREATE_NEW, WRITE);` guarantees that only one thread can reach line 83 and register the cleanup; a thread that fails to create the lock should never have registered the cleanup logic.
> For the first question, I agree it is a problem, but note there will always be an issue if we use the `CREATE_NEW` strategy to create a file; i.e. in the previous implementation we used it to create the file and then wrote into it, and if the process is killed during writing we face the same issue.

Except that we're writing to a temp file with a randomized name, so it's less likely to matter that we left a file behind.

> For the second question, `FileChannel channel = FileChannel.open(lockPath, CREATE_NEW, WRITE);` guarantees that only one thread can reach line 83 and register the cleanup

If line numbers match what GH is showing me, then the deletion in 83 is registered before the lock is taken.
I am sure we can make file locks work reasonably well, but let's first consider this one:

> if locks are gone (https://github.com/trinodb/trino/pull/27904/changes#r2730875323), great

i.e. maybe we don't need any file locks at all
I am not sure I understand your point correctly - if you are saying we could remove the `FileLock lock = channel.lock()` line, then I think it makes sense and I will do it, since only one thread can pass line 81 at a time.
I would like to discuss it further if you think we should remove the FileChannel at all :)
Refer to #27904 (comment) - the move itself doesn't guarantee exclusiveness.
> then, the deletion in 83 is registered before lock is taken.

Oh, I think there is a gap in our understanding of the "lock". Line 81, `FileChannel _ = FileChannel.open(lockPath, CREATE_NEW, WRITE)`, has the "lock" (exclusive) semantics: only one thread can pass here. The lock taken from the file channel that I put there was misleading, so I removed it now.
```java
Files.createDirectories(path.getParent());

// see if we can stop early without acquire locking
if (Files.exists(path)) {
    throw new FileAlreadyExistsException(location.toString());
}

Path lockPath = path.resolveSibling(path.getFileName() + ".lock");
Closer closer = Closer.create();
try {
    try (FileChannel _ = FileChannel.open(lockPath, CREATE_NEW, WRITE)) {
        closer.register(() -> Files.deleteIfExists(lockPath));
        if (Files.exists(path)) {
            throw new FileAlreadyExistsException(location.toString());
        }

        Path tmpFilePath = path.resolveSibling(path.getFileName() + "." + randomUUID() + ".tmp");
        try (OutputStream out = Files.newOutputStream(tmpFilePath, CREATE_NEW, WRITE)) {
            closer.register(() -> Files.deleteIfExists(tmpFilePath));
            out.write(data);
        }

        // Ensure that the file is only visible when fully written
        Files.move(tmpFilePath, path, ATOMIC_MOVE);
    }
}
catch (IOException e) {
    throw closer.rethrow(handleException(location, e));
}
finally {
    closer.close();
}
```
Suggested change:

```java
Files.createDirectories(path.getParent());

// Check target path before creating lock file
if (Files.exists(path)) {
    throw new FileAlreadyExistsException(location.toString());
}

Path lockPath = path.resolveSibling(path.getFileName() + ".lock");
Path tmpFilePath = path.resolveSibling(path.getFileName() + "." + randomUUID() + ".tmp");
try (Closer closer = Closer.create()) {
    Files.write(lockPath, new byte[0], CREATE_NEW);
    closer.register(() -> Files.delete(lockPath));
    Files.write(tmpFilePath, data, CREATE_NEW);
    closer.register(() -> Files.deleteIfExists(tmpFilePath));
    // Ensure that the file is only visible when fully written
    // Files.move with ATOMIC_MOVE is not guaranteed to be exclusive, hence the lock file
    Files.move(tmpFilePath, path, ATOMIC_MOVE);
}
```
|
superseded by #28092
Description
`testConcurrentInsertsSelectingFromTheSameVersionedTable` and `testConcurrentInsertsSelectingFromTheSameTemporalVersionedTable` in `TestDeltaLakeLocalConcurrentWritesTest` (#21725): The Delta Lake connector does not guarantee that queries are executed in a strict order. Queries may start concurrently, which makes count-based assertions in tests non-deterministic.
Additional context and related issues
Implementation refers to https://github.com/trinodb/trino/blob/master/plugin/trino-delta-lake/src/main/java/io/trino/plugin/deltalake/transactionlog/writer/AzureTransactionLogSynchronizer.java

The current Delta transaction commit flow for `INSERT` operations works as follows:
1. Read the current table version.
2. If the table version is greater than the current log version, this indicates a concurrent transaction is being committed. Before writing a new log entry, we perform conflict detection:
   a) Read all actions newly added to the table since the current transaction started.
   b) If any of those actions conflict (overlap) with the current transaction, the commit is considered unsafe and fails.
3. If no conflicts are detected, write the transaction log entry with `currentVersion + 1`.
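The steps above could be modeled roughly like this (a hypothetical in-memory sketch; the class, method names, and string "actions" are illustrative, not the actual connector code):

```java
import java.util.ArrayList;
import java.util.List;

public class CommitFlowSketch
{
    static class ConflictException extends RuntimeException {}

    // The "transaction log": one list of actions per version, starting at version 0.
    private final List<List<String>> log = new ArrayList<>();

    public synchronized long commit(long readVersion, List<String> actions)
    {
        long currentVersion = log.size() - 1;      // step 1: read the current table version
        if (currentVersion > readVersion) {        // step 2: concurrent commits happened since we read
            for (long v = readVersion + 1; v <= currentVersion; v++) {
                // step 2a/2b: any action added since our read conflicts with a non-blind insert
                if (!log.get((int) v).isEmpty()) {
                    throw new ConflictException();
                }
            }
        }
        log.add(actions);                          // step 3: write the entry as currentVersion + 1
        return currentVersion + 1;
    }

    public static void main(String[] args)
    {
        CommitFlowSketch table = new CommitFlowSketch();
        long v0 = table.commit(-1, List.of("ADD part-0"));
        if (v0 != 0) {
            throw new AssertionError("first commit should produce version 0");
        }
        try {
            // A second writer that read the same (empty) table must now conflict.
            table.commit(-1, List.of("ADD part-1"));
            throw new AssertionError("conflicting commit should have failed");
        }
        catch (ConflictException expected) {
        }
    }
}
```

The race described next arises exactly in step 2a: if the conflict check reads a partially written log entry, it behaves as if that version were empty and the commit wrongly passes.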
There is a subtle race condition in the retry path, since we retry when committing the log fails.
On a retry, it is possible (with very low probability) that when the commit logic reaches the conflict detection step (2.a), another concurrent writer has created the log file but has not finished writing its contents. In this case:
* The retry observes a partially written transaction log.
* The newly added actions appear empty or incomplete when the current commit reads from the log file.
As a result, the conflict detection passes trivially (no overlapping actions are observed). The retry then proceeds to successfully write its own log entry (step 3).
This leads to an incorrect outcome: the commit succeeds even though it should have detected a conflict with the concurrent transaction.
Update

The PR uses the workaround of implementing `createExclusive` for the `LocalOutputFile`.

Release notes
(x) This is not user-visible or is docs only, and no release notes are required.
( ) Release notes are required. Please propose a release note for me.
( ) Release notes are required, with the following suggested text: