Ordering between zil_sync() and uberblock_update() in a txg synchronization context. Do we trim the persistent ZIL before we update the new uberblock? #17057
As with any other block freed in ZFS, freed ZIL blocks may not be reused until at least 3 further TXGs have passed. So if your system crashes while committing TXG 9, a subsequent pool import at TXG 8 or 7 (I don't remember about 6) should be able to claim and replay all the ZIL blocks from that transaction group up to the latest open one (likely 11). So there should be no data loss. In theory an even longer replay might be possible, but it is not guaranteed because of possible block reuse or TRIM.

lwb_alloc_txg is the TXG in which the ZIL block's space allocation was accounted. lwb_max_txg is the maximum TXG number among the records stored in that block. lwb_issued_txg is the TXG in which the block was actually written. All three are important in certain cases.
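The reuse rule above can be sketched as a small toy model. This is not OpenZFS code: the constant `TOY_REUSE_DELAY_TXGS` and both `toy_` functions are hypothetical, and the exact boundary (reuse allowed once 3 later TXGs have been reached) is an assumption read off the wording of this reply.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Assumed deferral window, taken from the "3 TXGs" figure in the reply. */
#define TOY_REUSE_DELAY_TXGS 3

/*
 * Toy rule: a block freed in freed_txg may be reallocated only once
 * the pool has advanced by at least TOY_REUSE_DELAY_TXGS TXGs.
 */
static bool
toy_block_reusable(uint64_t freed_txg, uint64_t current_txg)
{
	assert(current_txg >= freed_txg);
	return (current_txg - freed_txg >= TOY_REUSE_DELAY_TXGS);
}

/*
 * Toy check: if the crash happened while committing crash_txg and the
 * pool is imported at the older import_txg, are the ZIL blocks that the
 * import_txg state still references guaranteed to be untouched?
 * The earliest TXG that could have freed them is import_txg + 1.
 */
static bool
toy_zil_intact_at_import(uint64_t import_txg, uint64_t crash_txg)
{
	assert(crash_txg > import_txg);
	return (!toy_block_reusable(import_txg + 1, crash_txg));
}
```

Under these assumptions, `toy_zil_intact_at_import(8, 9)` and `toy_zil_intact_at_import(7, 9)` both hold, which matches the example of a crash while committing TXG 9 followed by an import at TXG 8 or 7.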
Hello,
I have the following question: during transaction group synchronization, does ZFS guarantee that the uberblock is first updated with the latest state, and only then is the persistent ZIL trimmed (synced) so that it includes only the uncommitted blocks that have not yet been stored in the persistent data tree? I am only asking about the persistent data structures.
From the code, I am confused about zil_sync. It seems to me that it frees all committed ZIL blocks (lwbs that were allocated in a previous txg and have now been completed/persisted). However, a completed lwb means (in my understanding) only that it has been persisted as a ZIL block. That does not guarantee that its data has also been persisted to the actual persistent data tree. Now, if we delete the blkptr pointed to by a completed lwb without ensuring that the persisted ZFS data tree has this update (basically, ensuring that the uberblock has been updated), then upon a crash we might lose this piece of data.
I would also appreciate some guidance through the code. My understanding is the following: during spa_sync, we first call into spa_sync_iterate_to_convergence, which calls into dsl_pool_sync, dsl_dataset_sync and then dmu_objset_sync, which finally invokes zil_sync. zil_sync waits for all the lwbs to be flushed properly (zil_lwb_flush_wait_all), then deletes them and updates the zil_header (the in-memory one). Afterwards, the thread that drives the txg synchronization calls into vdev_config_sync, which persists the new uberblock. Doesn't this mean that there is a time window after zil_sync and before the uberblock update in which a crash might lead to data loss? Why don't we first persist the current uberblock and then trim the ZIL?
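The call order as I understand it can be written down as a toy trace. This is only a sketch of the sequence described in this post, not OpenZFS code: every `toy_` function is a hypothetical stand-in that just records its name, while the real functions take arguments and do far more work.

```c
#include <stddef.h>
#include <string.h>

#define TOY_TRACE_MAX 16

/* Names of the stand-in functions, in the order they were entered. */
static const char *toy_trace[TOY_TRACE_MAX];
static size_t toy_trace_len;

static void
toy_record(const char *name)
{
	if (toy_trace_len < TOY_TRACE_MAX)
		toy_trace[toy_trace_len++] = name;
}

/* Stand-ins that only log their name and call the next step. */
static void toy_zil_sync(void) { toy_record("zil_sync"); }
static void toy_dmu_objset_sync(void) { toy_record("dmu_objset_sync"); toy_zil_sync(); }
static void toy_dsl_dataset_sync(void) { toy_record("dsl_dataset_sync"); toy_dmu_objset_sync(); }
static void toy_dsl_pool_sync(void) { toy_record("dsl_pool_sync"); toy_dsl_dataset_sync(); }
static void toy_vdev_config_sync(void) { toy_record("vdev_config_sync"); }

static void
toy_spa_sync_iterate_to_convergence(void)
{
	toy_record("spa_sync_iterate_to_convergence");
	toy_dsl_pool_sync();
}

/* Assumed order: the ZIL is synced first, the uberblock is written last. */
static void
toy_spa_sync(void)
{
	toy_trace_len = 0;
	toy_record("spa_sync");
	toy_spa_sync_iterate_to_convergence();
	toy_vdev_config_sync();
}

/* Position of name in the trace, or -1 if it was never recorded. */
static int
toy_trace_index(const char *name)
{
	for (size_t i = 0; i < toy_trace_len; i++) {
		if (strcmp(toy_trace[i], name) == 0)
			return ((int)i);
	}
	return (-1);
}
```

After `toy_spa_sync()` runs, `toy_trace_index("zil_sync")` is smaller than `toy_trace_index("vdev_config_sync")`, which is exactly the ordering that my question is about.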
Another confusing part is that an lwb with lwb_alloc_txg=8 might have lwb_issued_txg=9. Is this relevant to when it is completed and deleted by zil_sync?
Thanks much (again),
Dimitra