Ordering between zil_sync() and uberblock_update() in a txg synchronization context. Do we trim the persistent ZIL before we update the new uberblock? #17057

dgiantsidi · 2025-02-14T17:33:32Z

Hello,

I have the following question: during transaction group synchronization, does ZFS guarantee that the uberblock is updated with the latest state first, and only then is the persistent ZIL trimmed (synced) so that it includes only the uncommitted blocks that have not yet been stored in the persistent data tree? I am only talking about the persistent data structures.

From the code, I am confused regarding zil_sync. It seems to me that it frees all committed ZIL blocks (lwbs that were allocated in a previous txg and have now been completed/persisted). However, a completed lwb means (in my understanding) only that it has been persisted as a ZIL block. That does not guarantee that its data has also been persisted to the actual persistent data tree. Now, if we delete the blkptr pointed to by a completed lwb without ensuring that the persisted ZFS data tree has this update (basically, ensuring that the uberblock has been updated), then upon a crash we might lose this piece of data.

I would also appreciate some help navigating the code. My understanding is the following: during spa_sync, we first call into spa_sync_iterate_to_convergence, which calls into dsl_pool_sync, dsl_dataset_sync and then dmu_objset_sync, which finally invokes zil_sync. zil_sync waits for all the lwbs to be flushed properly (zil_lwb_flush_wait_all), then deletes them and updates the zil_header (the in-memory one). Afterwards, the thread that drives the txg synchronization calls into vdev_config_sync, which persists the new uberblock. Doesn't this mean that there is a time window after zil_sync and before the uberblock update where a crash might lead to data loss? Why don't we first persist the current uberblock and then trim the ZIL?

Another confusing part is that an lwb with lwb_alloc_txg=8 might have lwb_issued_txg=9. Is this relevant to when it is completed and deleted by zil_sync?

Thanks much (again),
Dimitra

amotin · 2025-02-14T19:16:18Z

Collaborator

Same as for any other block free on ZFS, freed ZIL blocks are not allowed to be reused in fewer than 3 following TXGs. So if your system crashes while committing TXG 9, then a following pool import at TXG 8 or 7 (I don't remember about 6) should be able to claim and replay all the ZIL blocks from that TXG up to the latest open one (likely 11). So there should be no data loss. Theoretically an even longer replay might be possible, but it is not guaranteed due to possible block reuse or TRIM.

lwb_alloc_txg means when the ZIL block space allocation was accounted. lwb_max_txg means maximum TXG number for records stored in that block. lwb_issued_txg means when the block was actually written. All are important in certain cases.
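
The deferred-reuse rule described above can be illustrated with a small toy model in C. This is only a sketch, not OpenZFS code: `toy_freed_block_t`, `toy_block_reusable`, `toy_first_reuse_txg` and `DEFER_TXGS` are hypothetical names, and the value 2 for `DEFER_TXGS` is an assumption chosen so that a block is not reused in fewer than 3 following TXGs. Check the real definition of `TXG_DEFER_SIZE` before relying on any number.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/*
 * Hypothetical stand-in for the defer window (see TXG_DEFER_SIZE in the
 * real source). The value 2 is an assumption made for this sketch.
 */
#define DEFER_TXGS 2

/* One freed block, tagged with the TXG whose sync freed it. */
typedef struct toy_freed_block {
	uint64_t fb_id;        /* toy block identifier */
	uint64_t fb_freed_txg; /* TXG in which the free was synced */
} toy_freed_block_t;

/* First TXG in which the allocator may hand the block out again. */
static uint64_t
toy_first_reuse_txg(uint64_t freed_txg)
{
	return (freed_txg + DEFER_TXGS + 1);
}

/*
 * A freed block stays untouched on disk until more than DEFER_TXGS
 * TXGs have passed since the one that freed it.
 */
static bool
toy_block_reusable(const toy_freed_block_t *fb, uint64_t syncing_txg)
{
	assert(syncing_txg >= fb->fb_freed_txg);
	return (syncing_txg >= toy_first_reuse_txg(fb->fb_freed_txg));
}
```

In this model a ZIL block freed while syncing TXG 9 cannot be overwritten while TXGs 10 and 11 sync, which is why an import that falls back to TXG 8 can still read it.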

4 replies

dgiantsidi Feb 14, 2025
Author

wow! thanks that was really helpful!

So basically you are saying that the lwb's block might be freed but won't be reused until at least 3 more txg syncs have run, right? So even in the case of a crash right after their deletion, they will eventually be replayed on an import.

Could you please give me some pointers in the code on how to do that?

Let's say I recover: how do I find the starting point of the ZIL? Importantly, when do we update the ZIL header persistently during txg sync?

amotin Feb 14, 2025
Collaborator

Each dataset in each transaction group has its own ZIL starting block pointer. Once TXG 9 is fully committed, an import from it will replay only ZIL blocks that include some records from TXGs 10 and up. But if TXG 9 appears corrupted (it could be just not fully committed, but not necessarily), the import will fall back to TXG 8 and will replay ZIL blocks in the chain from an earlier point, including records belonging to TXG 9 and up, since from the perspective of TXG 8 those ZIL blocks are not yet freed.

For the mechanism of not reusing freed blocks search for TXG_DEFER_SIZE.
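
The fallback described above can be sketched as a toy model in C. It is not OpenZFS code and every name in it (`toy_txg_state_t`, `toy_pick_import_txg`, `toy_blocks_to_replay`) is hypothetical. It assumes that each committed TXG remembers its own ZIL start as an index into one contiguous chain, and that import simply picks the newest TXG that validates.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* What one committed TXG knows about a dataset's ZIL (toy model). */
typedef struct toy_txg_state {
	uint64_t ts_txg;       /* TXG number */
	bool     ts_valid;     /* uberblock and trees of this TXG check out */
	size_t   ts_zil_start; /* chain index this TXG's ZIL header points at */
} toy_txg_state_t;

/* Pick the newest TXG that validates; NULL if none does. */
static const toy_txg_state_t *
toy_pick_import_txg(const toy_txg_state_t *states, size_t nstates)
{
	const toy_txg_state_t *best = NULL;

	for (size_t i = 0; i < nstates; i++) {
		if (!states[i].ts_valid)
			continue;
		if (best == NULL || states[i].ts_txg > best->ts_txg)
			best = &states[i];
	}
	return (best);
}

/*
 * Replay walks from the chosen TXG's start index to the end of the
 * chain, so an older TXG replays a longer suffix of the same chain.
 */
static size_t
toy_blocks_to_replay(const toy_txg_state_t *import, size_t chain_len)
{
	assert(import != NULL);
	assert(import->ts_zil_start <= chain_len);
	return (chain_len - import->ts_zil_start);
}
```

With a chain of 5 blocks, a TXG 9 header at index 3 and a TXG 8 header at index 1, a clean import replays 2 blocks, while a fallback to TXG 8 replays 4, including the blocks that TXG 9 had already freed.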

dgiantsidi Feb 17, 2025
Author

Okay, I see. However, I struggle to see that in the code.
Can you elaborate more on zil_sync? It seems that it destroys lwbs (and their associated bps) in the current txg synchronization phase and does not defer the deletion.
Basically, I have the following quick question: when zil_sync frees an lwb (and its associated bp), do we have the guarantee that the content of this block has been persisted in the dataset and that the uberblock reflecting the change has been persisted too? Is there any way for this lwb and bp to be replayed on recovery?

amotin Feb 18, 2025
Collaborator

zil_sync() moves along the LWB list, updating the ZIL head block pointer zh->zh_log and freeing old blocks as long as those blocks contain no data that is not part of this TXG (lwb->lwb_max_txg > txg is false). It does call zio_free(), but as I said, the space allocation code will defer frees (and so reuse) of any blocks on a pool for at least TXG_DEFER_SIZE TXGs. That guarantees that we can safely import the pool starting from any of the last 2 or 3 TXGs and replay what's needed up to the latest state of acknowledged fsync(). The number of uberblocks stored depends on the vdev's ashift, but it is definitely larger than 2 or 3. If the uberblock or any other critical structure for the latest TXG is corrupted, ZFS will try to import from the previous TXG, whose uberblock obviously knows nothing of what happened after, but since ZIL chains are always contiguous and only the head pointer moves, the replay process will happily go into the "future", as long as there are no corruptions in the chain caused by external factors.
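
The head-advancing walk described above can be sketched as a toy model in C. This is not the real zil_sync(): the chain is a plain array, `toy_lwb_t`, `toy_zil_sync` and `lwb_freed_txg` are hypothetical, and only `lwb_max_txg` borrows its name from the real structure. The point it tries to show is that freeing only records the free and moves the head; the block contents stay where they are.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Toy log write block; only the field needed for the walk is modeled. */
typedef struct toy_lwb {
	uint64_t lwb_max_txg;   /* highest TXG of any record in the block */
	uint64_t lwb_freed_txg; /* hypothetical: 0 while live, else freeing TXG */
} toy_lwb_t;

/*
 * Advance the head past every leading block whose records all belong
 * to TXGs <= txg, and stop at the first block that still carries a
 * newer record. Returns the new head index, which plays the role of
 * the ZIL start pointer stored with this TXG.
 */
static size_t
toy_zil_sync(toy_lwb_t *chain, size_t nblocks, size_t head, uint64_t txg)
{
	assert(head <= nblocks);
	while (head < nblocks && chain[head].lwb_max_txg <= txg) {
		/* The free is only recorded; the data is not overwritten. */
		chain[head].lwb_freed_txg = txg;
		head++;
	}
	return (head);
}
```

For a chain whose blocks have lwb_max_txg values 8, 9, 9, 10, 11, syncing TXG 8 moves the head to index 1 and syncing TXG 9 moves it to index 3. An import that falls back to TXG 8 still starts at index 1 and finds the two blocks freed in TXG 9 intact, as long as they have not yet been reused.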
