Merge pull request ofiwg#7534 from cornelisnetworks/opx-beta-upstream
Opx beta upstream
shefty authored Mar 20, 2022
2 parents fbd4751 + 89cbe4d commit 9cf98bd
Showing 98 changed files with 34,480 additions and 4 deletions.
1 change: 1 addition & 0 deletions Makefile.am
@@ -423,6 +423,7 @@ include prov/rxm/Makefile.include
include prov/mrail/Makefile.include
include prov/rxd/Makefile.include
include prov/bgq/Makefile.include
include prov/opx/Makefile.include
include prov/shm/Makefile.include
include prov/tcp/Makefile.include
include prov/rstream/Makefile.include
17 changes: 17 additions & 0 deletions README.md
@@ -141,6 +141,23 @@ See the `fi_gni(7)` man page for more details.

- The `gni` provider requires `gcc` version 4.9 or higher.

### opx

***

The OPX provider is an updated libfabric provider for Omni-Path HPC
fabrics. The other provider for Omni-Path is PSM2.

The OPX provider began as a fork of the libfabric BGQ provider, with the
hardware-specific parts rewritten for the Omni-Path hfi1 fabric
interface card. OPX therefore inherits several desirable characteristics
of the BGQ provider. Analysis of instruction counts and cache line
footprints for most HPC operations shows OPX to be lighter weight than
PSM2 in the host software stack, leading to better overall performance.

See the `fi_opx(7)` man page for more details. See [Cornelis Customer
Center](https://customercenter.cornelisnetworks.com/) for support information.

### psm

***
10 changes: 10 additions & 0 deletions configure.ac
@@ -801,6 +801,15 @@ AS_IF([test "$have_rocr" = "1" && test x"$with_rocr" != x"yes"],

AC_CHECK_SIZEOF([void *])

dnl patching ${archive_cmds} to affect generation of file "libtool" to fix linking with clang
dnl allows for building with clang lto and fast linking with non gnu ld
dnl --> checking if the linker (clang) is GNU ld... no
AS_CASE(["$LD"],
[*clang*],
[AS_CASE(["${host_os}"],
[*linux*],
[archive_cmds='$CC -shared $pic_flag $libobjs $deplibs $compiler_flags $wl-soname $wl$soname $wl-version-script $wl$libname.map -o $lib'])])

dnl Provider-specific checks
FI_PROVIDER_INIT
FI_PROVIDER_SETUP([psm])
@@ -825,6 +834,7 @@ FI_PROVIDER_SETUP([perf])
FI_PROVIDER_SETUP([hook_debug])
FI_PROVIDER_SETUP([hook_hmem])
FI_PROVIDER_SETUP([dmabuf_peer_mem])
FI_PROVIDER_SETUP([opx])
FI_PROVIDER_FINI
dnl Configure the .pc file
FI_PROVIDER_SETUP_PC
3 changes: 2 additions & 1 deletion fabtests/multinode/src/harness.c
@@ -253,7 +253,8 @@ int pm_get_oob_server_addr()

ret = getaddrinfo(opts.src_addr, NULL, NULL, &res);
if (ret) {
FT_ERR( "getaddrinfo failed\n");
FT_ERR("getaddrinfo failed. src_addr = %s\n",
opts.src_addr ? opts.src_addr : "<NULL>");
return ret;
}

11 changes: 11 additions & 0 deletions include/ofi_prov.h
@@ -287,4 +287,15 @@ HOOK_DMABUF_PEER_MEM_INI ;
# define HOOK_NOOP_INIT fi_hook_noop_ini()
HOOK_NOOP_INI ;

#if (HAVE_OPX) && (HAVE_OPX_DL)
# define OPX_INI FI_EXT_INI
# define OPX_INIT NULL
#elif (HAVE_OPX)
# define OPX_INI INI_SIG(fi_opx_ini)
# define OPX_INIT fi_opx_ini()
OPX_INI ;
#else
# define OPX_INIT NULL
#endif

#endif /* _OFI_PROV_H_ */
3 changes: 3 additions & 0 deletions include/rbtree.h
@@ -87,6 +87,9 @@ RbtIterator rbtEnd(RbtHandle h);
void rbtKeyValue(RbtHandle h, RbtIterator i, void **key, void **value);
// returns key/value pair associated with iterator

void ** rbtValuePtr(RbtHandle h, RbtIterator it);
// returns pointer to the value associated with iterator

RbtIterator rbtFindLeftmost(RbtHandle h, void *key,
int(*compare)(void *a, void *b));
// returns iterator associated with left-most match. This is useful when a new
2 changes: 2 additions & 0 deletions include/rdma/fabric.h
@@ -206,6 +206,7 @@ enum {
FI_ADDR_IB_UD, /* uint64_t[4] */
FI_ADDR_EFA,
FI_ADDR_PSMX3, /* uint64_t[4] */
FI_ADDR_OPX,
};

#define FI_ADDR_UNSPEC ((uint64_t) -1)
@@ -321,6 +322,7 @@ enum {
FI_PROTO_EFA,
FI_PROTO_PSMX3,
FI_PROTO_RXM_TCP,
FI_PROTO_OPX,
};

enum {
181 changes: 181 additions & 0 deletions man/fi_opx.7.md
@@ -0,0 +1,181 @@
---
layout: page
title: fi_opx(7)
tagline: Libfabric Programmer's Manual
---
{%include JB/setup %}

# NAME

fi_opx \- The Omni-Path Express Fabric Provider

# OVERVIEW

The OPX provider is a native implementation of the libfabric interfaces
that makes direct use of Omni-Path fabrics as well as libfabric
acceleration features.
The purpose of this provider is to show the scalability and
performance of libfabric, providing an "extreme scale" development
environment for applications and middleware using the libfabric API, and
to support a functional and performant version of MPI on Omni-Path fabrics.

# SUPPORTED FEATURES

The OPX provider supports most features defined for the libfabric API.

Key features include:

Endpoint types
: The Omni-Path HFI hardware is connectionless and reliable.
The OPX provider only supports the *FI_EP_RDM* endpoint type.

Capabilities
: Supported capabilities include *FI_MSG*, *FI_RMA*, *FI_TAGGED*, *FI_ATOMIC*,
*FI_NAMED_RX_CTX*, *FI_SOURCE*, *FI_SEND*, *FI_RECV*, *FI_MULTI_RECV*,
and *FI_DIRECTED_RECV*.

Notes on the *FI_DIRECTED_RECV* capability: The immediate data sent
by the "senddata" call to support *FI_DIRECTED_RECV* for OPX
must be exactly 4 bytes. OPX uses these 4 bytes to completely identify the
source address, up to an exascale\-level number of ranks, for tag matching on
the receive side, and they can be managed within the MU packet.
The domain attribute "cq_data_size" is therefore set to 4, which is the OFI
standard minimum.

Modes
: Two modes are defined: *FI_CONTEXT2* and *FI_ASYNC_IOV*.
The OPX provider requires *FI_CONTEXT2*.

Additional features
: Supported additional features include *FABRIC_DIRECT*, *scalable endpoints*,
and *counters*.

Progress
: Only *FI_PROGRESS_MANUAL* is supported.

Address vector
: Only the *FI_AV_MAP* address vector format is supported.

Memory registration modes
: Only *FI_MR_SCALABLE* is supported.

# UNSUPPORTED FEATURES

Endpoint types
: Unsupported endpoint types include *FI_EP_DGRAM* and *FI_EP_MSG*.

Capabilities
: The OPX provider does not support *FI_RMA_EVENT* and *FI_TRIGGER*
capabilities.

Address vector
: The OPX provider does not support the *FI_AV_TABLE* address vector
format. This may be added in the future.

# LIMITATIONS

As OPX is under development this list of limitations is subject
to change.

It runs under the following MPI versions:

- Intel MPI from Parallel Studio 2020, update 4.
- Intel MPI from oneAPI 2021, update 3.
- Open MPI 4.1.2a1 (older versions of Open MPI will not work).
- MPICH 3.4.2.

Currently, this provider is PIO-only. SDMA is not supported
at this time.

Usage:

- If using Open MPI 4.1.x, disable the UCX and openib transports.
- OPX is not compatible with the Open MPI 4.1.x PML/BTL.
- DMA, RDMA and SDMA are not implemented.
- Performance falls off for message sizes larger than 1 MTU (4K max size).
- Shared memory is not cleaned up after an application crashes. Use
  "rm -rf /dev/shm/*" to remove old shared-memory files.

# RUNTIME PARAMETERS

*FI_OPX_UUID*
: OPX requires a unique ID for each job. For all processes in a
job to communicate with each other, they must use the same UUID.
This variable can be set with FI_OPX_UUID=${RANDOM}.
The default UUID is 00112233445566778899aabbccddeeff.

*FI_OPX_RELIABILITY_SERVICE_USEC_MAX*
: This setting controls how frequently the reliability/replay function
will issue PING requests to a remote connection. Reducing this value
may improve performance at the expense of increased traffic on the OPX
fabric.
Default setting is 100.

*FI_OPX_RELIABILITY_SERVICE_PRE_ACK_RATE*
: This setting controls how frequently a receiving rank will send ACKs
for packets it has received without being prompted through a PING request.
A non-zero value N tells the receiving rank to send an ACK for the
last N packets every Nth packet. Using this setting together with an increased
value for FI_OPX_RELIABILITY_SERVICE_USEC_MAX may improve performance.

Valid values are 0 (disabled) and powers of 2 in the range of 1-32,768, inclusive.

Default setting is 0 (disabled).
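
The valid-value rule above (0, or a power of 2 from 1 through 32,768) can be sketched as a small check in C. The function name `pre_ack_rate_is_valid` is hypothetical and is not taken from the OPX sources; it only illustrates the rule as stated here.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical helper (an assumption, not part of the OPX provider):
 * returns true if `rate` is an acceptable value for
 * FI_OPX_RELIABILITY_SERVICE_PRE_ACK_RATE as documented above. */
bool pre_ack_rate_is_valid(uint32_t rate)
{
    if (rate == 0)
        return true;                 /* 0 means "disabled" */
    if (rate > 32768)
        return false;                /* above the documented range */
    return (rate & (rate - 1)) == 0; /* exactly one bit set: power of 2 */
}
```

With this check, 0, 1, 64 and 32768 are accepted, while 3, 100 and 65536 are rejected.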

*FI_OPX_HFI_SELECT*
: Controls how OPX chooses which HFI to use when opening a context.
Has two forms:
- `<hfi-unit>` Force OPX provider to use `hfi-unit`.
- `<selector1>[,<selector2>[,...,<selectorN>]]` Select HFI based on first matching `selector`

Where `selector` is one of the following forms:
- `default` to use the default logic
- `fixed:<hfi-unit>` to fix to one `hfi-unit`
- `<selector-type>:<hfi-unit>:<selector-data>`

The above fields have the following meaning:
- `selector-type` The selector criteria the caller opening the context is evaluated against.
- `hfi-unit` The HFI to use if the caller matches the selector.
- `selector-data` Data the caller must match (e.g. NUMA node ID).

Where `selector-type` is one of the following:
- `numa` True when caller is local to the NUMA node ID given by `selector-data`.
- `core` True when caller is local to the CPU core given by `selector-data`.

And `selector-data` is one of the following:
- `value` The specific value to match
- `<range-start>-<range-end>` Matches with any value in that range

In the second form, when opening a context, OPX uses the `hfi-unit` of the
first-matching selector. Selectors are evaluated left-to-right. OPX will
return an error if the caller does not match any selector.

In either form, it is an error if the specified or selected HFI is not in the
Active state. In this case, OPX will return an error and execution will not
continue.

With this option, it is possible to cause OPX to try to open more contexts on
an HFI than there are free contexts on that HFI. In this case, one or more of
the context-opening calls will fail and OPX will return an error.
For the second form, as which HFI is selected depends on properties of the
caller, deterministic HFI selection requires deterministic caller properties.
E.g. for the `numa` selector, if the caller can migrate between NUMA domains,
then HFI selection will not be deterministic.

The first selector in the list that matches the caller always determines the logic used.
For example, `default` and `fixed` match all callers, so if either appears at the beginning
of a selector list, it is the only selector used, regardless of any selectors that follow.

Examples:
- `FI_OPX_HFI_SELECT=0` all callers will open contexts on HFI 0.
- `FI_OPX_HFI_SELECT=numa:0:0,numa:1:1,numa:0:2,numa:1:3` callers local to NUMA nodes 0 and 2 will use HFI 0, callers local to NUMA nodes 1 and 3 will use HFI 1.
- `FI_OPX_HFI_SELECT=numa:0:0-3,default` callers local to NUMA nodes 0 through 3 (including 0 and 3) will use HFI 0, and all others will use default selection logic.
- `FI_OPX_HFI_SELECT=core:1:0,fixed:0` callers local to CPU core 0 will use HFI 1, and all others will use HFI 0.
- `FI_OPX_HFI_SELECT=default,core:1:0` all callers will use default HFI selection logic.
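
The left-to-right, first-match evaluation of `FI_OPX_HFI_SELECT` described above can be sketched in C as follows. This is a minimal illustration under stated assumptions, not the provider's parser: the names `opx_hfi_select_sketch`, `match_selector`, `HFI_USE_DEFAULT` and `HFI_NO_MATCH` are hypothetical, the caller's NUMA node and CPU core are passed in as plain integers, and the check that the chosen HFI is in the Active state is not modeled.

```c
#include <limits.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

#define HFI_USE_DEFAULT (-1) /* a `default` selector matched */
#define HFI_NO_MATCH    (-2) /* malformed input, or no selector matched */

/* Hypothetical helper: test one selector against the caller.
 * Returns 1 on a match (HFI stored in *unit), 0 on no match,
 * and -1 if the selector is malformed. */
static int match_selector(const char *sel, int numa, int core, int *unit)
{
    char type[8];
    int u, lo, hi, value, n = 0;

    if (strcmp(sel, "default") == 0) {
        *unit = HFI_USE_DEFAULT;
        return 1;
    }
    if (sscanf(sel, "fixed:%d%n", &u, &n) == 1 && sel[n] == '\0' && u >= 0) {
        *unit = u;
        return 1;
    }

    /* <selector-type>:<hfi-unit>:<range-start>-<range-end> */
    n = 0;
    if (sscanf(sel, "%7[a-z]:%d:%d-%d%n", type, &u, &lo, &hi, &n) != 4 ||
        sel[n] != '\0') {
        /* <selector-type>:<hfi-unit>:<value> */
        n = 0;
        if (sscanf(sel, "%7[a-z]:%d:%d%n", type, &u, &lo, &n) != 3 ||
            sel[n] != '\0')
            return -1;
        hi = lo;
    }
    if (u < 0 || lo < 0 || hi < lo)
        return -1;

    if (strcmp(type, "numa") == 0)
        value = numa;
    else if (strcmp(type, "core") == 0)
        value = core;
    else
        return -1;

    if (value < lo || value > hi)
        return 0;
    *unit = u;
    return 1;
}

/* Returns the HFI unit to use, HFI_USE_DEFAULT, or HFI_NO_MATCH. */
int opx_hfi_select_sketch(const char *spec, int numa, int core)
{
    char sel[64], *end;
    const char *p = spec;
    long v;

    if (spec == NULL || *spec == '\0')
        return HFI_NO_MATCH;

    /* First form: a bare <hfi-unit>. */
    v = strtol(spec, &end, 10);
    if (end != spec && *end == '\0')
        return (v >= 0 && v <= INT_MAX) ? (int)v : HFI_NO_MATCH;

    /* Second form: evaluate selectors left to right; first match wins. */
    while (*p != '\0') {
        size_t len = strcspn(p, ",");
        int unit = HFI_NO_MATCH, rc;

        if (len == 0 || len >= sizeof(sel))
            return HFI_NO_MATCH;
        memcpy(sel, p, len);
        sel[len] = '\0';

        rc = match_selector(sel, numa, core, &unit);
        if (rc < 0)
            return HFI_NO_MATCH;
        if (rc > 0)
            return unit;

        p += len;
        if (*p == ',')
            p++;
    }
    return HFI_NO_MATCH;
}
```

For instance, `opx_hfi_select_sketch("core:1:0,fixed:0", 0, 7)` falls through the `core` selector (the caller is on core 7) and returns 0 from the `fixed` selector. The sketch folds malformed input and "no selector matched" into one error value for brevity; a real implementation would likely report them separately.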

# SEE ALSO

[`fabric`(7)](fabric.7.html),
[`fi_provider`(7)](fi_provider.7.html),
[`fi_getinfo`(7)](fi_getinfo.7.html),
5 changes: 5 additions & 0 deletions man/fi_provider.7.md
@@ -41,6 +41,10 @@ This distribution of libfabric contains the following providers
: High-speed Ethernet networking from Intel. See
[`fi_psm3`(7)](fi_psm3.7.html) for more information.

*OPX*
: High-speed Omni-Path networking from Cornelis Networks. See
[`fi_opx`(7)](fi_opx.7.html) for more information.

*Sockets*
: A general purpose provider that can be used on any network that
supports TCP/UDP sockets. This provider is not intended to provide
@@ -237,3 +241,4 @@ Logging is performed using the FI_ERR, FI_LOG, and FI_DEBUG macros.
[`fi_usnic`(7)](fi_usnic.7.html),
[`fi_verbs`(7)](fi_verbs.7.html),
[`fi_bgq`(7)](fi_bgq.7.html),
[`fi_opx`(7)](fi_opx.7.html),