Commit da2013b
drbd: allow application IO concurrently with resync requests sent to same peer
Since the previous commit ("drbd: respect peer writes held due to
conflicts when barrier received"), the receiver also waits for requests
that are held due to conflicts when a barrier is received. If these
conflicts are due to resync requests towards the same peer, then they
will only be resolved once the resync reply has been received. This
causes a deadlock.

Avoid this by allowing application IO to be submitted while there are
conflicting resync requests pending, but only when they relate to the
same peer.

Implement this by adding "sent" and "received" flags for the intervals.
Resync requests that we initiate pass through two conflict resolution
phases. The first one is for waiting until the request can be sent.
This involves waiting until there are no conflicting application writes
from other peers. The second one occurs when we receive the reply and
has the purpose of ensuring we do not submit conflicting writes.

This change has similarities to

    4dc38cd drbd: Break resync deadlock

which was reverted because a different solution was required to avoid a
distributed deadlock with 3 nodes:

    c0cd45a drbd: Break distributed deadlock in request processing
    ddc742b drbd: drop special-casing peer_device in al_begin_io_for_peer

However, this is distinctly different and this time it is valid. The
key difference is that resync requests are no longer blocked by an
active activity log extent. This means that they only wait for local
disk writes to complete, and not for a peer ack, which is a distributed
event. It is no longer necessary to include a special case for
compatibility in receive_Data().

This approach removes the exclusivity of peer writes with resync
requests that we have sent and are awaiting a reply for. In fact, the
previous solution was incomplete, because it still held peer writes
back when they conflicted with resync requests that had been sent and
were pending a reply. This approach handles that scenario as well.
1 parent 15f2580 commit da2013b

7 files changed, +167 -61 lines
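
The lifecycle the commit message describes can be condensed into a
small user-space model. This is an editor's illustrative sketch, not
DRBD code: the flag names mirror the patch below, but the interval
tree, locking and error paths are stubbed out, and the two phase
functions are simplified stand-ins for
drbd_conflict_send_resync_request() and
drbd_conflict_submit_resync_request().

/*
 * Illustrative user-space model of the two conflict resolution phases
 * a sync-target resync request passes through. Everything except the
 * flag names is a stub.
 */
#include <stdbool.h>
#include <stdio.h>

#define INTERVAL_SENT		(1u << 0)
#define INTERVAL_RECEIVED	(1u << 1)
#define INTERVAL_SUBMITTED	(1u << 2)

struct interval {
	unsigned int flags;
};

/* Phase 1, before sending: wait only for conflicting application
 * writes from *other* peers; same-peer writes are ignored, which is
 * what breaks the deadlock described above. */
static bool phase1_send(struct interval *i, bool conflict_with_other_peer)
{
	if (conflict_with_other_peer)
		return false;	/* queued; retried when the conflict resolves */
	i->flags |= INTERVAL_SENT;
	/* ... record the dagtag and send P_RS_DATA_REQUEST ... */
	return true;
}

/* Phase 2, on receiving the reply: wait for any conflicting IO that is
 * already submitted locally, so overlapping writes never reach the
 * backing disk concurrently. */
static bool phase2_submit(struct interval *i, bool conflict_with_submitted_io)
{
	i->flags |= INTERVAL_RECEIVED;
	if (conflict_with_submitted_io)
		return false;	/* queued until the local IO completes */
	i->flags |= INTERVAL_SUBMITTED;
	return true;
}

int main(void)
{
	struct interval i = { 0 };

	if (phase1_send(&i, false))
		printf("sent, flags=%#x\n", i.flags);
	if (phase2_submit(&i, false))
		printf("submitted, flags=%#x\n", i.flags);
	return 0;
}

The only point of the model is the order in which the three bits are
set, and which class of conflict is waited for in each phase.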

Documentation/application-resync-synchronization.rst (+18 -8)

@@ -56,8 +56,12 @@ scheme:
 * Secondary: The application IO interval is locked while the data is being
   written to the backing disk.
 * Sync target: The ``P_RS_DATA_REQUEST`` interval is locked. The lock is taken
-  then the dagtag is recorded and the request is sent. The lock is released
-  when the reply has been received and the data written to the backing disk.
+  in two phases. Before sending the request, the interval is locked for
+  conflicts with other peers. Then the dagtag is recorded and the request is
+  sent. When the reply is received, the interval is additionally locked for
+  conflicts with ongoing IO, in particular writes from the same peer. The lock
+  is released when the reply has been received and the data written to the
+  backing disk.
 * Sync source: The ``P_RS_DATA_REQUEST`` interval is locked. The lock is taken
   when the dagtag for the request is reached. It is released when the
   ``P_RS_WRITE_ACK`` is received. This lock is a read lock; it is exclusive
@@ -94,12 +98,12 @@ Correctness of data

 We only consider the synchronization between application and resync IO here.

-The locking scheme prevents any writes to the resync request interval from when
-the request is initiated until the received data is written. After the lock is
-taken on the target, the dagtag is recorded and the request is sent to the
-source. The source then waits until it has reached this dagtag before reading.
-This ensures that the resync data is at least as new as the data on the target
-when the request was made.
+The locking scheme prevents any writes from other peers to the resync request
+interval from when the request is initiated until the received data is written.
+After the lock is taken on the target, the dagtag is recorded and the request
+is sent to the source. The source then waits until it has reached this dagtag
+before reading. This ensures that the resync data is at least as new as the
+data on the target when the request was made.

 Conflicting application writes that reach the target while the resync request
 is in progress are held until the resync data has been written. Hence they
@@ -108,6 +112,12 @@ this application write when it performed the resync read, the application write
 will overwrite the resync write on the target with identical data. This is
 harmless.

+Resync requests sent from the target are not exclusive with application writes
+from the same peer. However, since the resync and application data originate
+from the same node, they are transmitted in the correct order in the data
+stream. Application IO defers to received resync IO, ensuring that a resync
+write received before an application write is also submitted first.
+
 Correctness of bitmap bits
 --------------------------

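The same-peer ordering argument in the hunk above can be made concrete
with a toy model. Everything in it is hypothetical (one event stream, a
single in-flight resync write, printf in place of real submission); it
only demonstrates why a resync reply received before an application
write is also submitted first.

/* Toy model of the same-peer ordering guarantee. The peer emits the
 * resync reply before the newer application write, the single stream
 * preserves that order, and the application write is held while the
 * overlapping resync write is in flight. Names are illustrative. */
#include <stdbool.h>
#include <stdio.h>

enum kind { RESYNC_REPLY, APP_WRITE };
struct event { enum kind kind; int sector; };

int main(void)
{
	const struct event stream[] = {
		{ RESYNC_REPLY, 8 },	/* older resync data for sector 8 */
		{ APP_WRITE,    8 },	/* newer application data, same sector */
	};
	bool resync_in_flight[16] = { false };
	int held_sector = -1;
	unsigned int n;

	for (n = 0; n < sizeof(stream) / sizeof(stream[0]); n++) {
		const struct event *e = &stream[n];

		if (e->kind == RESYNC_REPLY) {
			resync_in_flight[e->sector] = true;
			printf("submit resync write, sector %d\n", e->sector);
		} else if (resync_in_flight[e->sector]) {
			held_sector = e->sector;	/* defer to the resync write */
			printf("hold app write, sector %d\n", e->sector);
		} else {
			printf("submit app write, sector %d\n", e->sector);
		}
	}

	/* Backing disk IO for the resync write completes; only now is the
	 * held application write submitted, so the newer data wins. */
	resync_in_flight[8] = false;
	if (held_sector >= 0)
		printf("submit held app write, sector %d\n", held_sector);
	return 0;
}
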
drbd/drbd_debugfs.c (+2 -0)

@@ -1111,6 +1111,8 @@ static void seq_printf_interval_tree(struct seq_file *m, struct rb_root *root)
 	char sep = ' ';

 	seq_printf(m, "%llus+%u %s", (unsigned long long) i->sector, i->size, drbd_interval_type_str(i));
+	seq_print_rq_state_bit(m, test_bit(INTERVAL_SENT, &i->flags), &sep, "sent");
+	seq_print_rq_state_bit(m, test_bit(INTERVAL_RECEIVED, &i->flags), &sep, "received");
 	seq_print_rq_state_bit(m, test_bit(INTERVAL_SUBMIT_CONFLICT_QUEUED, &i->flags), &sep, "submit-conflict-queued");
 	seq_print_rq_state_bit(m, test_bit(INTERVAL_SUBMITTED, &i->flags), &sep, "submitted");
 	seq_print_rq_state_bit(m, test_bit(INTERVAL_BACKING_COMPLETED, &i->flags), &sep, "backing-completed");

drbd/drbd_int.h (+61 -4)

@@ -1985,7 +1985,7 @@ extern void drbd_rs_controller_reset(struct drbd_peer_device *);
 extern void drbd_rs_all_in_flight_came_back(struct drbd_peer_device *, int);
 extern void drbd_check_peers(struct drbd_resource *resource);
 extern void drbd_check_peers_new_current_uuid(struct drbd_device *);
-extern void drbd_conflict_submit_resync_request(struct drbd_peer_request *peer_req);
+extern void drbd_conflict_send_resync_request(struct drbd_peer_request *peer_req);
 extern void drbd_ping_peer(struct drbd_connection *connection);
 extern struct drbd_peer_device *peer_device_by_node_id(struct drbd_device *, int);
 extern void repost_up_to_date_fn(struct timer_list *t);
@@ -2094,6 +2094,7 @@ extern void drbd_send_peer_ack_wf(struct work_struct *ws);
 extern bool drbd_rs_c_min_rate_throttle(struct drbd_peer_device *);
 extern void drbd_verify_skipped_block(struct drbd_peer_device *peer_device,
 				      const sector_t sector, const unsigned int size);
+extern void drbd_conflict_submit_resync_request(struct drbd_peer_request *peer_req);
 extern void drbd_conflict_submit_peer_read(struct drbd_peer_request *peer_req);
 extern void drbd_conflict_submit_peer_write(struct drbd_peer_request *peer_req);
 extern int drbd_submit_peer_request(struct drbd_peer_request *);
@@ -2654,10 +2655,57 @@ static inline struct drbd_connection *first_connection(struct drbd_resource *resource)

 #define NODE_MASK(id) ((u64)1 << (id))

+/*
+ * drbd_interval_same_peer - determine whether "interval" is for the same peer as "i"
+ *
+ * "i" must be an interval corresponding to a drbd_peer_request.
+ */
+static inline bool drbd_interval_same_peer(struct drbd_interval *interval, struct drbd_interval *i)
+{
+	struct drbd_peer_request *interval_peer_req, *i_peer_req;
+
+	/* Ensure we only call "container_of" if it is actually a peer request. */
+	if (interval->type == INTERVAL_LOCAL_WRITE ||
+	    interval->type == INTERVAL_LOCAL_READ ||
+	    interval->type == INTERVAL_PEERS_IN_SYNC_LOCK)
+		return false;
+
+	interval_peer_req = container_of(interval, struct drbd_peer_request, i);
+	i_peer_req = container_of(i, struct drbd_peer_request, i);
+	return interval_peer_req->peer_device == i_peer_req->peer_device;
+}
+
+/*
+ * drbd_should_defer_to_resync - determine whether "interval" should defer to "i"
+ */
+static inline bool drbd_should_defer_to_resync(struct drbd_interval *interval, struct drbd_interval *i)
+{
+	if (!drbd_interval_is_resync(i))
+		return false;
+
+	/* Always defer to resync requests once the reply has been received.
+	 * These just need to wait for conflicting local I/O to complete. This
+	 * is necessary to ensure that resync replies received before
+	 * application writes are submitted first, so that the resync writes do
+	 * not overwrite newer data. */
+	if (test_bit(INTERVAL_RECEIVED, &i->flags))
+		return true;

+	/* If we are still waiting for a reply from the peer, only defer to the
+	 * request if it is towards a different peer. The exclusivity between
+	 * resync requests and application writes from another peer is
+	 * necessary to avoid overwriting newer data with older in the resync.
+	 * When the data in both cases is coming from the same peer, this is
+	 * not necessary. The peer ensures that the data stream is correctly
+	 * ordered. */
+	return !drbd_interval_same_peer(interval, i);
+}
+
 #define CONFLICT_FLAG_WRITE (1 << 0)
 #define CONFLICT_FLAG_DEFER_TO_RESYNC (1 << 1)
-#define CONFLICT_FLAG_EXCLUSIVE_UNTIL_COMPLETED (1 << 2)
-#define CONFLICT_FLAG_APPLICATION_ONLY (1 << 3)
+#define CONFLICT_FLAG_IGNORE_SAME_PEER (1 << 2)
+#define CONFLICT_FLAG_EXCLUSIVE_UNTIL_COMPLETED (1 << 3)
+#define CONFLICT_FLAG_APPLICATION_ONLY (1 << 4)

 /*
  * drbd_find_conflict - find conflicting interval, if any
@@ -2670,6 +2718,7 @@ static inline struct drbd_interval *drbd_find_conflict(struct drbd_device *device,
 	int size = interval->size;
 	bool write = flags & CONFLICT_FLAG_WRITE;
 	bool defer_to_resync = flags & CONFLICT_FLAG_DEFER_TO_RESYNC;
+	bool ignore_same_peer = flags & CONFLICT_FLAG_IGNORE_SAME_PEER;
 	bool exclusive_until_completed = flags & CONFLICT_FLAG_EXCLUSIVE_UNTIL_COMPLETED;
 	bool application_only = flags & CONFLICT_FLAG_APPLICATION_ONLY;

@@ -2693,7 +2742,15 @@ static inline struct drbd_interval *drbd_find_conflict(struct drbd_device *device,
 		/* Ignore, if not yet submitted, unless we should defer to a
 		 * resync request. */
 		if (!test_bit(INTERVAL_SUBMITTED, &i->flags) &&
-		    !(defer_to_resync && drbd_interval_is_resync(i)))
+		    !(defer_to_resync && drbd_should_defer_to_resync(interval, i)))
+			continue;
+
+		/* Ignore peer writes from the peer that this request relates
+		 * to, if requested. This is only used for determining whether
+		 * to send a request. It must not be used for determining
+		 * whether to submit a request, because that would allow
+		 * concurrent writes to the backing disk. */
+		if (ignore_same_peer && i->type == INTERVAL_PEER_WRITE && drbd_interval_same_peer(interval, i))
 			continue;

 		if (unlikely(application_only)) {
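
The defer rule added above reduces to a small decision table. The
sketch below is an assumption-laden simplification: peer identity is
reduced to an integer and the interval tree is ignored entirely; it
merely restates the two commented cases of
drbd_should_defer_to_resync() as assertions.

/* Decision table of the defer rule, modelled in user space: defer once
 * the reply is in; while the reply is outstanding, defer only to
 * resyncs towards a different peer. */
#include <assert.h>
#include <stdbool.h>

static bool should_defer(bool reply_received, int write_peer, int resync_peer)
{
	if (reply_received)
		return true;			/* only local IO left to wait for */
	return write_peer != resync_peer;	/* same peer orders its own stream */
}

int main(void)
{
	assert(should_defer(true, 1, 1));	/* received: always defer */
	assert(should_defer(true, 1, 2));
	assert(!should_defer(false, 1, 1));	/* sent, same peer: proceed */
	assert(should_defer(false, 1, 2));	/* sent, other peer: defer */
	return 0;
}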

drbd/drbd_interval.h (+6 -0)

@@ -20,6 +20,12 @@ enum drbd_interval_type {
 };

 enum drbd_interval_flags {
+	/* Whether this resync write has been sent yet. */
+	INTERVAL_SENT,
+
+	/* Whether this resync write has been received yet. */
+	INTERVAL_RECEIVED,
+
 	/* Whether this has been queued after conflict. */
 	INTERVAL_SUBMIT_CONFLICT_QUEUED,


drbd/drbd_receiver.c (+49 -39)

@@ -2550,7 +2550,7 @@ static struct drbd_peer_request *find_resync_request(struct drbd_peer_device *peer_device,
 	 * the requests. Hence looking the request up with a list traversal
 	 * should not be too slow. */
 	list_for_each_entry(p, list, w.list) {
-		if (p->i.sector == sector && test_bit(INTERVAL_SUBMITTED, &p->i.flags)) {
+		if (p->i.sector == sector && test_bit(INTERVAL_SENT, &p->i.flags)) {
 			peer_req = p;
 			break;
 		}
@@ -2566,6 +2566,46 @@ static struct drbd_peer_request *find_resync_request(struct drbd_peer_device *peer_device,
 	return peer_req;
 }

+static void drbd_cleanup_after_failed_submit_resync_request(struct drbd_peer_request *peer_req)
+{
+	struct drbd_peer_device *peer_device = peer_req->peer_device;
+	struct drbd_connection *connection = peer_device->connection;
+	struct drbd_device *device = peer_device->device;
+
+	drbd_err(device, "submit failed, triggering re-connect\n");
+	spin_lock_irq(&connection->peer_reqs_lock);
+	list_del(&peer_req->w.list);
+	list_del(&peer_req->recv_order);
+	spin_unlock_irq(&connection->peer_reqs_lock);
+
+	drbd_remove_peer_req_interval(peer_req);
+	drbd_free_peer_req(peer_req);
+	put_ldev(device);
+
+	change_cstate(connection, C_PROTOCOL_ERROR, CS_HARD);
+}
+
+void drbd_conflict_submit_resync_request(struct drbd_peer_request *peer_req)
+{
+	struct drbd_peer_device *peer_device = peer_req->peer_device;
+	struct drbd_device *device = peer_device->device;
+	bool conflict;
+
+	spin_lock_irq(&device->interval_lock);
+	clear_bit(INTERVAL_SUBMIT_CONFLICT_QUEUED, &peer_req->i.flags);
+	set_bit(INTERVAL_RECEIVED, &peer_req->i.flags);
+	conflict = drbd_find_conflict(device, &peer_req->i, CONFLICT_FLAG_WRITE);
+	if (!conflict)
+		set_bit(INTERVAL_SUBMITTED, &peer_req->i.flags);
+	spin_unlock_irq(&device->interval_lock);
+
+	if (!conflict) {
+		int err = drbd_submit_peer_request(peer_req);
+		if (err)
+			drbd_cleanup_after_failed_submit_resync_request(peer_req);
+	}
+}
+
 static int recv_resync_read(struct drbd_peer_device *peer_device,
 			    struct drbd_peer_request_details *d) __releases(local)
 {
@@ -2611,9 +2651,7 @@ static int recv_resync_read(struct drbd_peer_device *peer_device,
 	size = peer_req->i.size;
 	drbd_set_all_out_of_sync(device, sector, size);

-	err = drbd_submit_peer_request(peer_req);
-	if (err)
-		goto out;
+	drbd_conflict_submit_resync_request(peer_req);
 	peer_req = NULL; /* since submitted, might be destroyed already */

 	for_each_peer_device_ref(peer_device, im, device) {
@@ -2622,16 +2660,6 @@ static int recv_resync_read(struct drbd_peer_device *peer_device,
 		drbd_send_out_of_sync(peer_device, sector, size);
 	}
 	return 0;
-out:
-	/* don't care for the reason here */
-	drbd_err(device, "submit failed, triggering re-connect\n");
-	spin_lock_irq(&connection->peer_reqs_lock);
-	list_del(&peer_req->w.list);
-	list_del(&peer_req->recv_order);
-	spin_unlock_irq(&connection->peer_reqs_lock);
-
-	drbd_free_peer_req(peer_req);
-	return err;
 }

 /* caller must hold interval_lock */
@@ -3187,7 +3215,6 @@ static int receive_Data(struct drbd_connection *connection, struct packet_info *pi)
 	struct drbd_peer_request *peer_req;
 	struct drbd_peer_request_details d;
 	int err, tp;
-	unsigned long conflict_flags = CONFLICT_FLAG_WRITE;
 	bool conflict = false;

 	peer_device = conn_peer_device(connection, pi->vnr);
@@ -3385,16 +3412,8 @@ static int receive_Data(struct drbd_connection *connection, struct packet_info *pi)
 			goto out_del_list;
 		}
 	}
-	/* In protocol < 110 (compatibility with DRBD 8.4), we must not defer
-	 * to resync. The peer may have received a resync request from us in
-	 * the same "resync extent" as this write. In this case, it will block
-	 * until it has received the corresponding barrier ack, so we must
-	 * submit the write.
-	 */
-	if (connection->agreed_pro_version >= 110)
-		conflict_flags |= CONFLICT_FLAG_DEFER_TO_RESYNC;
-	conflict = connection->agreed_pro_version >= 110 &&
-		   drbd_find_conflict(device, &peer_req->i, conflict_flags);
+	conflict = drbd_find_conflict(device, &peer_req->i,
+				      CONFLICT_FLAG_WRITE | CONFLICT_FLAG_DEFER_TO_RESYNC);
 	drbd_insert_interval(&device->requests, &peer_req->i);
 	if (!conflict)
 		set_bit(INTERVAL_SUBMITTED, &peer_req->i.flags);
@@ -4066,6 +4085,8 @@ static int receive_common_ov_reply(struct drbd_connection *connection, struct packet_info *pi)
 	if (err)
 		goto fail;

+	set_bit(INTERVAL_RECEIVED, &peer_req->i.flags);
+
 	drbd_alloc_page_chain(&peer_device->connection->transport,
 			      &peer_req->page_chain, DIV_ROUND_UP(size, PAGE_SIZE), GFP_TRY);
 	if (!peer_req->page_chain.head) {
@@ -8587,20 +8608,7 @@ static int receive_rs_deallocated(struct drbd_connection *connection, struct packet_info *pi)
 		spin_unlock_irq(&connection->peer_reqs_lock);

 		atomic_add(pi->size >> 9, &device->rs_sect_ev);
-		err = drbd_submit_peer_request(peer_req);
-
-		if (err) {
-			drbd_err(device, "discard submit failed, triggering re-connect\n");
-			spin_lock_irq(&connection->peer_reqs_lock);
-			list_del(&peer_req->w.list);
-			list_del(&peer_req->recv_order);
-			spin_unlock_irq(&connection->peer_reqs_lock);
-
-			drbd_remove_peer_req_interval(peer_req);
-			drbd_free_peer_req(peer_req);
-			peer_req = NULL;
-			put_ldev(device);
-		}
+		drbd_conflict_submit_resync_request(peer_req);

 		/* No put_ldev() here. Gets called in drbd_endio_write_sec_final(). */
 	} else {
@@ -9701,6 +9709,8 @@ static int got_IsInSync(struct drbd_connection *connection, struct packet_info *pi)

 	dec_rs_pending(peer_device);

+	set_bit(INTERVAL_RECEIVED, &peer_req->i.flags);
+
 	spin_lock_irq(&connection->peer_reqs_lock);
 	list_del(&peer_req->w.list);
 	spin_unlock_irq(&connection->peer_reqs_lock);

drbd/drbd_req.c (+9 -1)

@@ -462,6 +462,11 @@ void drbd_release_conflicts(struct drbd_device *device, struct drbd_interval *re
 		if (test_bit(INTERVAL_SUBMITTED, &i->flags))
 			continue;

+		/* If we are waiting for a reply from the peer, then there is
+		 * no need to process the conflict. */
+		if (test_bit(INTERVAL_SENT, &i->flags) && !test_bit(INTERVAL_RECEIVED, &i->flags))
+			continue;
+
 		dynamic_drbd_dbg(device,
 			"%s %s request at %llus+%u after conflict with %llus+%u\n",
 			test_bit(INTERVAL_SUBMIT_CONFLICT_QUEUED, &i->flags) ? "Already queued" : "Queue",
@@ -2112,7 +2117,10 @@ void drbd_do_submit_conflict(struct work_struct *ws)

 	list_for_each_entry_safe(peer_req, peer_req_tmp, &resync_writes, submit_list) {
 		list_del_init(&peer_req->submit_list);
-		drbd_conflict_submit_resync_request(peer_req);
+		if (!test_bit(INTERVAL_SENT, &peer_req->i.flags))
+			drbd_conflict_send_resync_request(peer_req);
+		else
+			drbd_conflict_submit_resync_request(peer_req);
 	}

 	list_for_each_entry_safe(peer_req, peer_req_tmp, &resync_reads, submit_list) {
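
The two drbd_req.c hunks encode complementary rules: an interval that
has been sent but not yet received is skipped during conflict release
(the resync reply, not the conflicting request, will requeue it), and a
queued resync write is routed to the send or submit phase based on
INTERVAL_SENT. A minimal user-space restatement follows; the helper
names are hypothetical.

/* Editor's sketch of the two rules above, reduced to pure functions. */
#include <assert.h>
#include <stdbool.h>
#include <string.h>

struct ival {
	bool submitted, sent, received;
};

/* drbd_release_conflicts(): should this waiting interval be processed
 * now that a conflicting interval went away? */
static bool process_conflict(const struct ival *i)
{
	if (i->submitted)
		return false;	/* already running */
	if (i->sent && !i->received)
		return false;	/* the resync reply will trigger processing */
	return true;
}

/* drbd_do_submit_conflict(): which phase handles a queued resync write? */
static const char *resync_phase(const struct ival *i)
{
	return i->sent ? "submit" : "send";
}

int main(void)
{
	const struct ival fresh = { 0 };
	const struct ival waiting_on_peer = { .sent = true };
	const struct ival reply_in = { .sent = true, .received = true };

	assert(!process_conflict(&waiting_on_peer));
	assert(process_conflict(&reply_in));
	assert(strcmp(resync_phase(&fresh), "send") == 0);
	assert(strcmp(resync_phase(&reply_in), "submit") == 0);
	return 0;
}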
