prov/verbs: account for off-by-one credit initialization #6212

Merged
merged 1 commit into from
Aug 26, 2020


ooststep

When enabling flow control, we artificially inject
a credit so that the credit messaging itself is not blocked
by a lack of credits. To compensate for this extra credit, we reduce
the number of credits sent the first time by initializing the count to -1.

Signed-off-by: Stephen Oost <[email protected]>

shefty commented Aug 26, 2020

Intel CI is reporting a failure, but looking at the details, everything shows passing. Maybe it was restarted? Anyway, changes look good. Thanks

@shefty shefty merged commit 220af33 into ofiwg:master Aug 26, 2020

swelch commented Aug 28, 2020

@shefty, @ooststep - It looks like the commit associated with this PR breaks ofi_rxm;verbs with FI_OFI_RXM_USE_SRX=1 when running the osu_bw benchmark.

```
Program terminated with signal SIGSEGV, Segmentation fault.
#0 0x00007fa6e23ff7da in vrb_add_credits (ep_fid=0x18aa5e0, credits=18446744073709551615)
    at prov/verbs/src/verbs_ep.c:49
49 cq->cq_fastlock_acquire(&cq->cq_lock);
[Current thread is 1 (Thread 0x7fa6e5062c40 (LWP 13355))]
(gdb) bt
#0 0x00007fa6e23ff7da in vrb_add_credits (ep_fid=0x18aa5e0, credits=18446744073709551615)
    at prov/verbs/src/verbs_ep.c:49
#1 0x00007fa6e2423f97 in rxm_handle_credit (rxm_ep=0x18ab580, rx_buf=0x7fa6dcb1f510) at prov/rxm/src/rxm_cq.c:1334
#2 0x00007fa6e24242fd in rxm_handle_comp (rxm_ep=0x18ab580, comp=0x7ffe6fc05b30) at prov/rxm/src/rxm_cq.c:1414
#3 0x00007fa6e24256e8 in rxm_ep_do_progress (util_ep=0x18ab580) at prov/rxm/src/rxm_cq.c:1723
#4 0x00007fa6e24258a3 in rxm_ep_progress (util_ep=0x18ab580) at prov/rxm/src/rxm_cq.c:1761
#5 0x00007fa6e23a60e2 in ofi_cq_progress (cq=0x18a3820) at prov/util/src/util_cq.c:599
#6 0x00007fa6e23a5311 in ofi_cq_readfrom (cq_fid=0x18a3820, buf=0x7ffe6fc05e90, count=8, src_addr=0x0)
    at prov/util/src/util_cq.c:247
#7 0x00007fa6e23a5601 in ofi_cq_read (cq_fid=0x18a3820, buf=0x7ffe6fc05e90, count=8) at prov/util/src/util_cq.c:314
#8 0x00007fa6e3c3478a in MPIR_Waitall_impl () from /scratch/jenkins/builds/mpich/stable/lib/libmpi.so.12
#9 0x00007fa6e3c5cb5a in MPIR_Waitall () from /scratch/jenkins/builds/mpich/stable/lib/libmpi.so.12
#10 0x00007fa6e3c5da8e in PMPI_Waitall () from /scratch/jenkins/builds/mpich/stable/lib/libmpi.so.12
#11 0x0000000000401751 in main (argc=, argv=) at osu_bw.c:137
(gdb) l
44 struct util_cq *cq;
45
46 ep = container_of(ep_fid, struct vrb_ep, util_ep.ep_fid);
47 cq = ep->util_ep.tx_cq;
48
49 cq->cq_fastlock_acquire(&cq->cq_lock);
50 ep->peer_rq_credits += credits;
51 cq->cq_fastlock_release(&cq->cq_lock);
52 }
53
(gdb) p cq
$1 = (struct util_cq *) 0x0
(gdb)
(gdb) p *ep
$1 = {util_ep = {ep_fid = {fid = {fclass = 6, context = 0x0, ops = 0x7fa6e2688d40 <vrb_srq_ep_ops>},
ops = 0x7fa6e2688b60 <vrb_srq_ep_base_ops>, cm = 0x7fa6e2688ba0 <vrb_srq_cm_ops>,
msg = 0x7fa6e2688940 <vrb_srq_msg_ops>, rma = 0x7fa6e2688c00 <vrb_srq_rma_ops>, tagged = 0x0,
atomic = 0x7fa6e2688c60 <vrb_srq_atomic_ops>, collective = 0x0}, domain = 0x18ab1a8, av = 0x18a8a80, av_entry = {
next = 0x18a6390, prev = 0x100000001}, eq = 0x0, rx_cq = 0x0, rx_op_flags = 0, tx_cq = 0x0, tx_op_flags = 0,
inject_op_flags = 0, tx_msg_flags = 0, rx_msg_flags = 0, tx_cntr = 0x0, rx_cntr = 0x0, rd_cntr = 0x0,
wr_cntr = 0x21, rem_rd_cntr = 0x7fa6e2c60030 <_IO_wmem_jumps+144>, rem_wr_cntr = 0x18a6410, tx_cntr_inc = 0x20,
rx_cntr_inc = 0x21, rd_cntr_inc = 0x7362726576, wr_cntr_inc = 0x6d617267, rem_rd_cntr_inc = 0x18aa780,
rem_wr_cntr_inc = 0x91, type = 25920544, caps = 25920544, flags = 112, progress = 0x400, lock = {impl = 24205392,
is_initialized = 0, in_use = 1}, lock_acquire = 0x1c070, lock_release = 0x1c000, coll_cid_mask = 0x50,
coll_ready_queue = {head = 0x10, tail = 0x0}}, ibv_qp = 0x400, sq_credits = 0, peer_rq_credits = 0,
rq_credits_avail = 0, threshold = 0, {id = 0x6, {ep_name = {gid = {
raw = "\006\000\000\000\000\000\000\000A\000\000\000\000\000\000", global = {subnet_prefix = 6,
interface_id = 65}}, qpn = 25784960, lid = 0, pkey = 0, service = 34320, sl = 138 '\212',
padding = "\001\000\000\000"}, service = 25849728}}, inject_limit = 25849696, eq = 0x0, srq_ep = 0x1897650,
info = 0x0, wrs = 0x91, rx_cq_size = 0, conn_param = {private_data = 0x0, private_data_len = 144 '\220',
responder_resources = 0 '\000', initiator_depth = 0 '\000', flow_control = 0 '\000', retry_count = 0 '\000',
rnr_retry_count = 0 '\000', srq = 0 '\000', qp_num = 0}, cm_hdr = 0x0, cm_priv_data = 0x0}
(gdb)
```

@ooststep ooststep deleted the creditfc_fix branch October 13, 2022 21:02