fabtests: make source addressing usage more user friendly and universal #9807

Open
danielap1996 opened this issue Feb 13, 2024 · 5 comments

@danielap1996

danielap1996 commented Feb 13, 2024

Hi, I was trying to run some of the fabtests tests, but they fail with: fi_getinfo(): unit/av_test.c:1148, ret=-61 (No data available)

This is how I installed libfabric:

wget https://github.com/ofiwg/libfabric/releases/download/v1.20.0/libfabric-1.20.0.tar.bz2 
tar -xf ./libfabric-1.20.0.tar.bz2 
cd ./libfabric-1.20.0
./configure --prefix=/opt/libfabric --enable-debug
make -j 32
make install
export PATH=$PATH:/opt/libfabric/bin

This is how I installed fabtests:

wget https://github.com/ofiwg/libfabric/releases/download/v1.20.1/fabtests-1.20.1.tar.bz2
tar -xf  ./fabtests-1.20.1.tar.bz2 
cd ./fabtests-1.20
./configure --with-libfabric=/opt/libfabric --prefix=/opt/fabtests 
make -j 32 
make install

test run example:

$>hostname -i
127.0.1.1
$>cd /opt/fabtests/bin/
$>./fi_av_test -g 127.0.1.1 -n 1
fi_getinfo(): unit/av_test.c:1148, ret=-61 (No data available)

fi_info -l output:

dpitossi@dpitossi-vm-u22:/opt/fabtests/bin$ fi_info -l
opx:
    version: 120.0
usnic:
    version: 1.0
psm3:
    version: 305.1010
ofi_rxm:
    version: 120.0
ofi_rxd:
    version: 120.0
shm:
    version: 120.0
udp:
    version: 120.0
tcp:
    version: 120.0
sockets:
    version: 120.0
ofi_hook_perf:
    version: 120.0
ofi_hook_trace:
    version: 120.0
ofi_hook_debug:
    version: 120.0
ofi_hook_noop:
    version: 120.0
ofi_hook_hmem:
    version: 120.0
ofi_hook_dmabuf_peer_mem:
    version: 120.0
off_coll:
    version: 120.0
sm2:
    version: 120.0
ofi_mrail:
    version: 120.0
@ToddRimmer
Contributor

The error returns from fi_getinfo need significant improvement. In general an error like ret=-61 (No data available) means libfabric attempted to enumerate all the NICs, but did not find an acceptable provider which offered an acceptable NIC.

This error often occurs at customer sites and carries no actionable information. Usually the cause is that the desired provider was not available on the system, or that the desired provider was unable to find an acceptable NIC to offer.

The next step is often to repeat the test with FI_LOG_LEVEL=info. However, some patches to these code paths a couple of years ago (commit f4715e8) changed FI_WARN and FI_INFO calls into FI_DBG, so typical non-debug builds lack the key messages about device and provider discovery that are needed to debug this. End users and in-distro libfabric users are therefore typically stuck at this point. They must either resort to provider-specific mechanisms to debug what is happening, or locate the libfabric source and rebuild it with debug enabled while making sure not to change other options. That task is beyond a typical sysadmin using an in-distro libfabric or an ISV-provided MPI or application stack that includes libfabric.

The ideal customer-facing answer would be for provider enumeration to accumulate a set of text messages from each provider. When a provider fails to find an acceptable device, it could provide a more detailed string explaining why (probably a list of strings naming the NICs it looked at and why it rejected each one). Then, if fi_getinfo fails to find any provider, it could output (or return) a detailed message showing which providers it attempted and why each one indicated it could not find a device. Such strings may be long. I've implemented logging mechanisms like this in past products. It amounted to retaining a tree of error messages, with a list per provider, then outputting the tree only at the higher-level routine where the issue was "realized" and discarding it if at least 1 provider successfully found NICs.
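
A minimal sketch of that accumulate-then-report idea, written in Python for brevity rather than in libfabric's C. The `ProviderReport` and `DiscoveryLog` names are hypothetical and not part of libfabric; the reason strings are copied from the log output posted in this issue.

```python
from dataclasses import dataclass, field


@dataclass
class ProviderReport:
    """Reasons one provider gave for rejecting devices (hypothetical structure)."""
    name: str
    reasons: list[str] = field(default_factory=list)
    found_device: bool = False


class DiscoveryLog:
    """Collects per-provider rejection reasons during enumeration."""

    def __init__(self) -> None:
        self.reports: dict[str, ProviderReport] = {}

    def _report(self, provider: str) -> ProviderReport:
        # Create the provider's entry on first use.
        return self.reports.setdefault(provider, ProviderReport(provider))

    def reject(self, provider: str, reason: str) -> None:
        """Record why a provider could not offer a device."""
        self._report(provider).reasons.append(reason)

    def accept(self, provider: str) -> None:
        """Record that a provider found at least one usable device."""
        self._report(provider).found_device = True

    def summary(self) -> str:
        """Return "" if any provider succeeded, else the full tree of reasons."""
        if any(r.found_device for r in self.reports.values()):
            return ""
        lines = ["no provider found a usable device:"]
        for report in self.reports.values():
            lines.append(f"  {report.name}:")
            lines.extend(f"    - {reason}" for reason in report.reasons)
        return "\n".join(lines)


log = DiscoveryLog()
log.reject("opx", "Detected 0 hfi1(s) in the system")
log.reject("tcp", "FI_SOURCE set, but no node or service")
print(log.summary())
```

Called once at the end of a failed enumeration, `summary()` lists every provider's reasons. After any `accept()` it returns an empty string, so a successful run stays quiet.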

@danielap1996
Author

Running the same fab test with FI_LOG_LEVEL=info

$>/opt/fabtests/bin$ ./fi_av_test -g 127.0.1.1 -n 1
libfabric:845162:1707847821::core:core:fi_param_get_():372<info> variable perf_cntr=<not set>
libfabric:845162:1707847821::core:core:fi_param_get_():372<info> variable hook=<not set>
libfabric:845162:1707847821::core:core:fi_param_get_():372<info> variable hmem=<not set>
libfabric:845162:1707847821::core:core:ofi_hmem_init():605<info> Hmem iface FI_HMEM_CUDA not supported
libfabric:845162:1707847821::core:core:ofi_hmem_init():605<info> Hmem iface FI_HMEM_ROCR not supported
libfabric:845162:1707847821::core:core:ofi_hmem_init():605<info> Hmem iface FI_HMEM_ZE not supported
libfabric:845162:1707847821::core:core:ofi_hmem_init():605<info> Hmem iface FI_HMEM_NEURON not supported
libfabric:845162:1707847821::core:core:ofi_hmem_init():605<info> Hmem iface FI_HMEM_SYNAPSEAI not supported
libfabric:845162:1707847821::core:core:fi_param_get_():372<info> variable hmem_disable_p2p=<not set>
libfabric:845162:1707847821::core:core:fi_param_get_():372<info> variable mr_cache_max_size=<not set>
libfabric:845162:1707847821::core:core:fi_param_get_():372<info> variable mr_cache_max_count=<not set>
libfabric:845162:1707847821::core:core:fi_param_get_():372<info> variable mr_cache_monitor=<not set>
libfabric:845162:1707847821::core:core:fi_param_get_():372<info> variable mr_cuda_cache_monitor_enabled=<not set>
libfabric:845162:1707847821::core:core:fi_param_get_():372<info> variable mr_rocr_cache_monitor_enabled=<not set>
libfabric:845162:1707847821::core:core:fi_param_get_():372<info> variable mr_ze_cache_monitor_enabled=<not set>
libfabric:845162:1707847821::core:mr:ofi_default_cache_size():78<info> default cache size=1750878720
libfabric:845162:1707847821::core:core:fi_param_get_():372<info> variable provider=<not set>
libfabric:845162:1707847821::core:core:fi_param_get_():372<info> variable universe_size=<not set>
libfabric:845162:1707847821::core:core:fi_param_get_():372<info> variable av_remove_cleanup=<not set>
libfabric:845162:1707847821::core:core:fi_param_get_():372<info> variable offload_coll_provider=<not set>
libfabric:845162:1707847821::core:core:fi_param_get_():372<info> variable provider_path=<not set>
libfabric:845162:1707847821::psm3:core:fi_psm3_ini():928<info> xxxxxxVM:pid845162: build options: VERSION=305.1010=3.5.1.1, HAVE_PSM3_SRC=1, PSM3_CUDA=0
libfabric:845162:1707847821::psm3:core:psmx3_param_get_bool():94<info> xxxxxxVM:pid845162: variable FI_PSM3_NAME_SERVER=<not set>
libfabric:845162:1707847821::psm3:core:psmx3_param_get_bool():94<info> xxxxxxVM:pid845162: variable FI_PSM3_TAGGED_RMA=<not set>
libfabric:845162:1707847821::psm3:core:psmx3_param_get_str():124<info> xxxxxxVM:pid845162: variable FI_PSM3_UUID=<not set>
libfabric:845162:1707847821::psm3:core:psmx3_param_get_int():109<info> xxxxxxVM:pid845162: variable FI_PSM3_DELAY=<not set>
libfabric:845162:1707847821::psm3:core:psmx3_param_get_int():109<info> xxxxxxVM:pid845162: variable FI_PSM3_TIMEOUT=<not set>
libfabric:845162:1707847821::psm3:core:psmx3_param_get_int():109<info> xxxxxxVM:pid845162: variable FI_PSM3_PROG_INTERVAL=<not set>
libfabric:845162:1707847821::psm3:core:psmx3_param_get_str():124<info> xxxxxxVM:pid845162: variable FI_PSM3_PROG_AFFINITY=<not set>
libfabric:845162:1707847821::psm3:core:psmx3_param_get_int():109<info> xxxxxxVM:pid845162: variable FI_PSM3_INJECT_SIZE=<not set>
libfabric:845162:1707847821::psm3:core:psmx3_param_get_int():109<info> xxxxxxVM:pid845162: variable FI_PSM3_LOCK_LEVEL=<not set>
libfabric:845162:1707847821::psm3:core:psmx3_param_get_bool():94<info> xxxxxxVM:pid845162: variable FI_PSM3_LAZY_CONN=<not set>
libfabric:845162:1707847821::psm3:core:psmx3_param_get_int():109<info> xxxxxxVM:pid845162: variable FI_PSM3_CONN_TIMEOUT=<not set>
libfabric:845162:1707847821::psm3:core:psmx3_param_get_bool():94<info> xxxxxxVM:pid845162: variable FI_PSM3_DISCONNECT=<not set>
libfabric:845162:1707847821::psm3:core:psmx3_param_get_str():124<info> xxxxxxVM:pid845162: variable FI_PSM3_TAG_LAYOUT=<not set>
libfabric:845162:1707847821::psm3:core:psmx3_param_get_bool():94<info> xxxxxxVM:pid845162: variable FI_PSM3_YIELD_MODE=<not set>
libfabric:845162:1707847821::core:core:ofi_register_provider():504<info> registering provider: psm3 (305.1010)
libfabric:845162:1707847821::core:core:ofi_register_provider():504<info> registering provider: usnic (1.0)
libfabric:845162:1707847821::shm:core:fi_param_get_():372<info> variable sar_threshold=<not set>
libfabric:845162:1707847821::shm:core:fi_param_get_():372<info> variable tx_size=<not set>
libfabric:845162:1707847821::shm:core:fi_param_get_():372<info> variable rx_size=<not set>
libfabric:845162:1707847821::shm:core:fi_param_get_():372<info> variable disable_cma=<not set>
libfabric:845162:1707847821::shm:core:fi_param_get_():372<info> variable use_dsa_sar=<not set>
libfabric:845162:1707847821::shm:core:fi_param_get_():372<info> variable use_xpmem=<not set>
libfabric:845162:1707847821::core:core:ofi_register_provider():504<info> registering provider: shm (120.0)
libfabric:845162:1707847821::core:core:ofi_register_provider():504<info> registering provider: sm2 (120.0)
libfabric:845162:1707847821::ofi_rxm:core:fi_param_get_():372<info> variable enable_passthru=<not set>
libfabric:845162:1707847821::ofi_rxm:core:fi_param_get_():372<info> variable buffer_size=<not set>
libfabric:845162:1707847821::ofi_rxm:core:fi_param_get_():372<info> variable tx_size=<not set>
libfabric:845162:1707847821::ofi_rxm:core:fi_param_get_():372<info> variable rx_size=<not set>
libfabric:845162:1707847821::ofi_rxm:core:fi_param_get_():372<info> variable msg_tx_size=<not set>
libfabric:845162:1707847821::ofi_rxm:core:fi_param_get_():372<info> variable msg_rx_size=<not set>
libfabric:845162:1707847821::ofi_rxm:core:fi_param_get_():372<info> variable cm_progress_interval=<not set>
libfabric:845162:1707847821::ofi_rxm:core:fi_param_get_():372<info> variable cq_eq_fairness=<not set>
libfabric:845162:1707847821::ofi_rxm:core:fi_param_get_():372<info> variable data_auto_progress=<not set>
libfabric:845162:1707847821::ofi_rxm:core:fi_param_get_():372<info> variable use_rndv_write=<not set>
libfabric:845162:1707847821::ofi_rxm:core:fi_param_get_():372<info> variable def_wait_obj=<not set>
libfabric:845162:1707847821::ofi_rxm:core:fi_param_get_():372<info> variable def_tcp_wait_obj=<not set>
libfabric:845162:1707847821::core:core:ofi_register_provider():504<info> registering provider: ofi_rxm (120.0)
libfabric:845162:1707847821::ofi_mrail:core:fi_param_get_():372<info> variable config=<not set>
libfabric:845162:1707847821::ofi_mrail:core:fi_param_get_():372<info> variable addr=<not set>
libfabric:845162:1707847821::ofi_mrail:core:fi_param_get_():372<info> variable addr_strc=<not set>
libfabric:845162:1707847821::ofi_mrail:core:mrail_parse_env_vars():115<info> unable to read FI_OFI_MRAIL_ADDR env variable
libfabric:845162:1707847821::core:core:ofi_register_provider():504<info> registering provider: ofi_mrail (120.0)
libfabric:845162:1707847821::ofi_rxd:core:fi_param_get_():372<info> variable spin_count=<not set>
libfabric:845162:1707847821::ofi_rxd:core:fi_param_get_():372<info> variable retry=<not set>
libfabric:845162:1707847821::ofi_rxd:core:fi_param_get_():372<info> variable max_peers=<not set>
libfabric:845162:1707847821::ofi_rxd:core:fi_param_get_():372<info> variable max_unacked=<not set>
libfabric:845162:1707847821::core:core:ofi_register_provider():504<info> registering provider: ofi_rxd (120.0)
libfabric:845162:1707847821::core:core:ofi_register_provider():504<info> registering provider: opx (120.0)
libfabric:845162:1707847821::core:core:ofi_register_provider():504<info> registering provider: udp (120.0)
libfabric:845162:1707847821::core:core:ofi_register_provider():504<info> registering provider: sockets (120.0)
libfabric:845162:1707847821::tcp:core:fi_param_get_():372<info> variable prov_name=<not set>
libfabric:845162:1707847821::tcp:core:fi_param_get_():372<info> variable port_high_range=<not set>
libfabric:845162:1707847821::tcp:core:fi_param_get_():372<info> variable port_low_range=<not set>
libfabric:845162:1707847821::tcp:core:fi_param_get_():372<info> variable tx_size=<not set>
libfabric:845162:1707847821::tcp:core:fi_param_get_():372<info> variable rx_size=<not set>
libfabric:845162:1707847821::tcp:core:fi_param_get_():372<info> variable max_inject=<not set>
libfabric:845162:1707847821::tcp:core:fi_param_get_():372<info> variable max_saved=<not set>
libfabric:845162:1707847821::tcp:core:fi_param_get_():372<info> variable max_saved_size=<not set>
libfabric:845162:1707847821::tcp:core:fi_param_get_():372<info> variable max_rx_size=<not set>
libfabric:845162:1707847821::tcp:core:fi_param_get_():372<info> variable nodelay=<not set>
libfabric:845162:1707847821::tcp:core:fi_param_get_():372<info> variable staging_sbuf_size=<not set>
libfabric:845162:1707847821::tcp:core:fi_param_get_():372<info> variable prefetch_rbuf_size=<not set>
libfabric:845162:1707847821::tcp:core:fi_param_get_():372<info> variable zerocopy_size=<not set>
libfabric:845162:1707847821::tcp:core:fi_param_get_():372<info> variable trace_msg=<not set>
libfabric:845162:1707847821::tcp:core:fi_param_get_():372<info> variable disable_auto_progress=<not set>
libfabric:845162:1707847821::tcp:core:fi_param_get_():372<info> variable io_uring=<not set>
libfabric:845162:1707847821::core:core:ofi_register_provider():504<info> registering provider: tcp (120.0)
libfabric:845162:1707847821::core:core:ofi_register_provider():504<info> registering provider: ofi_hook_perf (120.0)
libfabric:845162:1707847821::core:core:ofi_register_provider():504<info> registering provider: ofi_hook_trace (120.0)
libfabric:845162:1707847821::core:core:ofi_register_provider():504<info> registering provider: ofi_hook_debug (120.0)
libfabric:845162:1707847821::core:core:fi_param_get_():372<info> variable hmem=<not set>
libfabric:845162:1707847821::core:core:ofi_hmem_init():605<info> Hmem iface FI_HMEM_CUDA not supported
libfabric:845162:1707847821::core:core:ofi_hmem_init():605<info> Hmem iface FI_HMEM_ROCR not supported
libfabric:845162:1707847821::core:core:ofi_hmem_init():605<info> Hmem iface FI_HMEM_ZE not supported
libfabric:845162:1707847821::core:core:ofi_hmem_init():605<info> Hmem iface FI_HMEM_NEURON not supported
libfabric:845162:1707847821::core:core:ofi_hmem_init():605<info> Hmem iface FI_HMEM_SYNAPSEAI not supported
libfabric:845162:1707847821::core:core:fi_param_get_():372<info> variable hmem_disable_p2p=<not set>
libfabric:845162:1707847821::core:core:ofi_register_provider():504<info> registering provider: ofi_hook_hmem (120.0)
libfabric:845162:1707847821::core:core:ofi_register_provider():504<info> registering provider: ofi_hook_dmabuf_peer_mem (120.0)
libfabric:845162:1707847821::core:core:ofi_register_provider():504<info> registering provider: ofi_hook_noop (120.0)
libfabric:845162:1707847821::core:core:ofi_register_provider():504<info> registering provider: off_coll (120.0)
libfabric:845162:1707847821::opx:fabric:fi_opx_getinfo():518<trace> Detected 0 hfi1(s) in the system
libfabric:845162:1707847821::core:core:fi_getinfo_():1302<info> fi_getinfo: provider opx returned -61 (No data available)
libfabric:845162:1707847821::usnic:fabric:usdf_getinfo():763<trace> 
libfabric:845162:1707847821::usnic:fabric:usdf_getinfo():777<warn> failed to usdf_get_devinfo, ret=-19 (No such device)
libfabric:845162:1707847821::usnic:fabric:usdf_getinfo():848<info> returning -61 (No data available)
libfabric:845162:1707847821::core:core:fi_getinfo_():1302<info> fi_getinfo: provider usnic returned -61 (No data available)
libfabric:845162:1707847821::psm3:core:psmx3_getinfo():714<info> xxxxxxVM:pid845162: 
libfabric:845162:1707847821::psm3:core:psmx3_init_prov_info():361<info> xxxxxxVM:pid845162: Unsupported address format
libfabric:845162:1707847821::psm3:core:psmx3_init_prov_info():363<info> xxxxxxVM:pid845162: Supported: FI_ADDR_PSMX3
libfabric:845162:1707847821::psm3:core:psmx3_init_prov_info():365<info> xxxxxxVM:pid845162: Supported: FI_ADDR_STR
libfabric:845162:1707847821::psm3:core:psmx3_init_prov_info():367<info> xxxxxxVM:pid845162: Requested: FI_SOCKADDR
libfabric:845162:1707847821::core:core:fi_getinfo_():1302<info> fi_getinfo: provider psm3 returned -61 (No data available)
libfabric:845162:1707847821::ofi_rxm:core:fi_param_get_():372<info> variable use_srx=<not set>
libfabric:845162:1707847821:ofi_rxm:core:core:ofi_layering_ok():1181<info> Provider ofi_rxm is excluded
libfabric:845162:1707847821:ofi_rxm:core:core:ofi_layering_ok():1192<info> Need core provider, skipping ofi_rxd
libfabric:845162:1707847821:ofi_rxm:core:core:ofi_layering_ok():1192<info> Need core provider, skipping ofi_mrail
libfabric:845162:1707847821::ofi_rxm:core:fi_param_get_():372<info> variable use_srx=<not set>
libfabric:845162:1707847821:ofi_rxm:core:core:ofi_layering_ok():1181<info> Provider ofi_rxm is excluded
libfabric:845162:1707847821:ofi_rxm:core:core:ofi_layering_ok():1192<info> Need core provider, skipping ofi_rxd
libfabric:845162:1707847821:ofi_rxm:tcp:core:util_getinfo():218<info> FI_SOURCE set, but no node or service
libfabric:845162:1707847821:ofi_rxm:core:core:fi_getinfo_():1302<info> fi_getinfo: provider tcp returned -22 (Invalid argument)
libfabric:845162:1707847821:ofi_rxm:core:core:ofi_layering_ok():1192<info> Need core provider, skipping ofi_mrail
libfabric:845162:1707847821::ofi_rxm:core:fi_param_get_():372<info> variable use_srx=<not set>
libfabric:845162:1707847821:ofi_rxm:opx:fabric:fi_opx_getinfo():518<trace> Detected 0 hfi1(s) in the system
libfabric:845162:1707847821:ofi_rxm:core:core:fi_getinfo_():1302<info> fi_getinfo: provider opx returned -61 (No data available)
libfabric:845162:1707847821:ofi_rxm:usnic:fabric:usdf_getinfo():763<trace> 
libfabric:845162:1707847821:ofi_rxm:usnic:fabric:usdf_getinfo():848<info> returning -61 (No data available)
libfabric:845162:1707847821:ofi_rxm:core:core:fi_getinfo_():1302<info> fi_getinfo: provider usnic returned -61 (No data available)
libfabric:845162:1707847821:ofi_rxm:core:core:ofi_layering_ok():1199<info> Skipping util;psm3 layering
libfabric:845162:1707847821:ofi_rxm:core:core:ofi_layering_ok():1192<info> Need core provider, skipping ofi_rxm
libfabric:845162:1707847821:ofi_rxm:core:core:ofi_layering_ok():1192<info> Need core provider, skipping ofi_rxd
libfabric:845162:1707847821:ofi_rxm:core:core:ofi_layering_ok():1199<info> Skipping util;shm layering
libfabric:845162:1707847821:ofi_rxm:udp:core:util_getinfo():218<info> FI_SOURCE set, but no node or service
libfabric:845162:1707847821:ofi_rxm:core:core:fi_getinfo_():1302<info> fi_getinfo: provider udp returned -22 (Invalid argument)
libfabric:845162:1707847821:ofi_rxm:tcp:core:util_getinfo():218<info> FI_SOURCE set, but no node or service
libfabric:845162:1707847821:ofi_rxm:core:core:fi_getinfo_():1302<info> fi_getinfo: provider tcp returned -22 (Invalid argument)
libfabric:845162:1707847821:ofi_rxm:core:core:ofi_layering_ok():1199<info> Skipping util;sockets layering
libfabric:845162:1707847821:ofi_rxm:sm2:core:util_getinfo():218<info> FI_SOURCE set, but no node or service
libfabric:845162:1707847821:ofi_rxm:core:core:fi_getinfo_():1302<info> fi_getinfo: provider sm2 returned -22 (Invalid argument)
libfabric:845162:1707847821:ofi_rxm:core:core:ofi_layering_ok():1192<info> Need core provider, skipping ofi_mrail
libfabric:845162:1707847821::ofi_rxm:core:fi_param_get_():372<info> variable use_srx=<not set>
libfabric:845162:1707847821:ofi_rxm:opx:fabric:fi_opx_getinfo():518<trace> Detected 0 hfi1(s) in the system
libfabric:845162:1707847821:ofi_rxm:core:core:fi_getinfo_():1302<info> fi_getinfo: provider opx returned -61 (No data available)
libfabric:845162:1707847821:ofi_rxm:usnic:fabric:usdf_getinfo():763<trace> 
libfabric:845162:1707847821:ofi_rxm:usnic:fabric:usdf_getinfo():848<info> returning -61 (No data available)
libfabric:845162:1707847821:ofi_rxm:core:core:fi_getinfo_():1302<info> fi_getinfo: provider usnic returned -61 (No data available)
libfabric:845162:1707847821:ofi_rxm:core:core:ofi_layering_ok():1199<info> Skipping util;psm3 layering
libfabric:845162:1707847821:ofi_rxm:core:core:ofi_layering_ok():1192<info> Need core provider, skipping ofi_rxm
libfabric:845162:1707847821:ofi_rxm:core:core:ofi_layering_ok():1192<info> Need core provider, skipping ofi_rxd
libfabric:845162:1707847821:ofi_rxm:core:core:ofi_layering_ok():1199<info> Skipping util;shm layering
libfabric:845162:1707847821:ofi_rxm:udp:core:util_getinfo():218<info> FI_SOURCE set, but no node or service
libfabric:845162:1707847821:ofi_rxm:core:core:fi_getinfo_():1302<info> fi_getinfo: provider udp returned -22 (Invalid argument)
libfabric:845162:1707847821:ofi_rxm:tcp:core:util_getinfo():218<info> FI_SOURCE set, but no node or service
libfabric:845162:1707847821:ofi_rxm:core:core:fi_getinfo_():1302<info> fi_getinfo: provider tcp returned -22 (Invalid argument)
libfabric:845162:1707847821:ofi_rxm:core:core:ofi_layering_ok():1199<info> Skipping util;sockets layering
libfabric:845162:1707847821:ofi_rxm:sm2:core:util_getinfo():218<info> FI_SOURCE set, but no node or service
libfabric:845162:1707847821:ofi_rxm:core:core:fi_getinfo_():1302<info> fi_getinfo: provider sm2 returned -22 (Invalid argument)
libfabric:845162:1707847821:ofi_rxm:core:core:ofi_layering_ok():1192<info> Need core provider, skipping ofi_mrail
libfabric:845162:1707847821::core:core:fi_getinfo_():1302<info> fi_getinfo: provider ofi_rxm returned -61 (No data available)
libfabric:845162:1707847821:ofi_rxd:opx:fabric:fi_opx_getinfo():518<trace> Detected 0 hfi1(s) in the system
libfabric:845162:1707847821:ofi_rxd:core:core:fi_getinfo_():1302<info> fi_getinfo: provider opx returned -61 (No data available)
libfabric:845162:1707847821:ofi_rxd:usnic:fabric:usdf_getinfo():763<trace> 
libfabric:845162:1707847821:ofi_rxd:usnic:fabric:usdf_getinfo():848<info> returning -61 (No data available)
libfabric:845162:1707847821:ofi_rxd:core:core:fi_getinfo_():1302<info> fi_getinfo: provider usnic returned -61 (No data available)
libfabric:845162:1707847821:ofi_rxd:core:core:ofi_layering_ok():1199<info> Skipping util;psm3 layering
libfabric:845162:1707847821:ofi_rxd:core:core:ofi_layering_ok():1192<info> Need core provider, skipping ofi_rxm
libfabric:845162:1707847821:ofi_rxd:core:core:ofi_layering_ok():1192<info> Need core provider, skipping ofi_rxd
libfabric:845162:1707847821:ofi_rxd:core:core:ofi_layering_ok():1199<info> Skipping util;shm layering
libfabric:845162:1707847821:ofi_rxd:udp:core:util_getinfo():218<info> FI_SOURCE set, but no node or service
libfabric:845162:1707847821:ofi_rxd:core:core:fi_getinfo_():1302<info> fi_getinfo: provider udp returned -22 (Invalid argument)
libfabric:845162:1707847821:ofi_rxd:tcp:core:util_getinfo():218<info> FI_SOURCE set, but no node or service
libfabric:845162:1707847821:ofi_rxd:core:core:fi_getinfo_():1302<info> fi_getinfo: provider tcp returned -22 (Invalid argument)
libfabric:845162:1707847821:ofi_rxd:core:core:ofi_layering_ok():1199<info> Skipping util;sockets layering
libfabric:845162:1707847821:ofi_rxd:sm2:core:util_getinfo():218<info> FI_SOURCE set, but no node or service
libfabric:845162:1707847821:ofi_rxd:core:core:fi_getinfo_():1302<info> fi_getinfo: provider sm2 returned -22 (Invalid argument)
libfabric:845162:1707847821:ofi_rxd:core:core:ofi_layering_ok():1192<info> Need core provider, skipping ofi_mrail
libfabric:845162:1707847821::core:core:fi_getinfo_():1302<info> fi_getinfo: provider ofi_rxd returned -61 (No data available)
libfabric:845162:1707847821::shm:core:util_getinfo():218<info> FI_SOURCE set, but no node or service
libfabric:845162:1707847821::core:core:fi_getinfo_():1302<info> fi_getinfo: provider shm returned -22 (Invalid argument)
libfabric:845162:1707847821::udp:core:util_getinfo():218<info> FI_SOURCE set, but no node or service
libfabric:845162:1707847821::core:core:fi_getinfo_():1302<info> fi_getinfo: provider udp returned -22 (Invalid argument)
libfabric:845162:1707847821::tcp:core:util_getinfo():218<info> FI_SOURCE set, but no node or service
libfabric:845162:1707847821::core:core:fi_getinfo_():1302<info> fi_getinfo: provider tcp returned -22 (Invalid argument)
libfabric:845162:1707847821::sockets:core:util_getinfo():218<info> FI_SOURCE set, but no node or service
libfabric:845162:1707847821::core:core:fi_getinfo_():1302<info> fi_getinfo: provider sockets returned -22 (Invalid argument)
libfabric:845162:1707847821::sm2:core:util_getinfo():218<info> FI_SOURCE set, but no node or service
libfabric:845162:1707847821::core:core:fi_getinfo_():1302<info> fi_getinfo: provider sm2 returned -22 (Invalid argument)
libfabric:845162:1707847821::ofi_mrail:fabric:mrail_get_core_info():285<info> OFI_MRAIL_ADDR_STRC env variable not set!
libfabric:845162:1707847821::core:core:fi_getinfo_():1302<info> fi_getinfo: provider ofi_mrail returned -61 (No data available)
fi_getinfo(): unit/av_test.c:1148, ret=-61 (No data available)
libfabric:845162:1707847821::usnic:fabric:usdf_fini():1039<trace> 
libfabric:845162:1707847821::psm3:core:psmx3_fini():887<info> xxxxxxVM:pid845162: 
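
As a stopgap for reading output like the above, the per-provider results can be pulled out with a short script. This is a sketch only: the regular expression assumes the `fi_getinfo: provider <name> returned <code> (<message>)` wording seen in this log, which may differ in other libfabric versions, and `provider_results` is a hypothetical helper name.

```python
import re

# Matches the result lines that the core prints for each provider in this log.
RESULT = re.compile(r"fi_getinfo: provider (\S+) returned (-?\d+) \(([^)]*)\)")


def provider_results(log_text: str) -> dict[str, set[tuple[int, str]]]:
    """Map each provider name to the distinct (code, message) pairs it returned."""
    results: dict[str, set[tuple[int, str]]] = {}
    for line in log_text.splitlines():
        match = RESULT.search(line)
        if match:
            name, code, message = match.group(1), int(match.group(2)), match.group(3)
            results.setdefault(name, set()).add((code, message))
    return results


sample = (
    "libfabric:845162:1707847821::core:core:fi_getinfo_():1302<info> "
    "fi_getinfo: provider opx returned -61 (No data available)\n"
    "libfabric:845162:1707847821::core:core:fi_getinfo_():1302<info> "
    "fi_getinfo: provider tcp returned -22 (Invalid argument)\n"
)
for name, outcomes in provider_results(sample).items():
    print(name, sorted(outcomes))
```

Applied to the full log above, this separates the providers that returned -61 (for example opx, usnic, psm3) from those that returned -22 (for example tcp, udp, shm), which points at the `FI_SOURCE set, but no node or service` messages.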

@aingerson
Contributor

aingerson commented Feb 13, 2024

Hi there @danielap1996 and thanks for opening the issue!
The av test takes in the address to insert (-g) as well as the source address (-s). You'll need both to properly run the test.
In addition, I recommend explicitly setting the provider (-p) that you're hoping to target to make sure the provider you want on your system is working. For example, to run with the tcp provider:
fi_av_test -g 127.0.0.1 -n 1 -p tcp -s 127.0.0.1
Let me know if you're still seeing an issue.

@danielap1996
Author

That response was super fast!!!
It solved the issue:

$> ./fi_av_test -g 127.0.0.1 -n 1 -p tcp -s 127.0.0.1
Testing AVs on fabric 127.0.0.1/32

Testing with type = FI_AV_MAP
Running av_open_close [Test open and close AVs of varying sizes]...PASS!
Running av_good_sync [Test sync AV insert with good address]...PASS!
Running av_null_fi_addr [Test AV insert without specifying fi_addr]...skipped because: test not valid for AV type FI_AV_MAP
Running av_good_vector_async [Test async AV insert with vector of good addresses]...PASS!
Running av_zero_async [Test async insert AV insert of zero addresses]...PASS!
Running av_good_2vector_async [Test async AV inserts with two address vectors]...PASS!
Running av_insert_stages [Test AV insert at various stages]...PASS!

Testing with invalid address
Running av_bad_sync [Test sync AV insert of bad address]...PASS!
Running av_goodbad_vector_sync [Test sync AV insert of 1 good and 1 bad address]...PASS!
Running av_goodbad_vector_async [Test async AV insert with good and bad address]...PASS!
Running av_goodbad_vector_sync_err [Test AV insert of 1 good, 1 bad address using FI_SYNC_ERR]...skipped because: test not valid for AV type FI_AV_MAP

Testing with type = FI_AV_TABLE
Running av_open_close [Test open and close AVs of varying sizes]...PASS!
Running av_good_sync [Test sync AV insert with good address]...PASS!
Running av_null_fi_addr [Test AV insert without specifying fi_addr]...PASS!
Running av_good_vector_async [Test async AV insert with vector of good addresses]...PASS!
Running av_zero_async [Test async insert AV insert of zero addresses]...PASS!
Running av_good_2vector_async [Test async AV inserts with two address vectors]...PASS!
Running av_insert_stages [Test AV insert at various stages]...PASS!

Testing with invalid address
Running av_bad_sync [Test sync AV insert of bad address]...PASS!
Running av_goodbad_vector_sync [Test sync AV insert of 1 good and 1 bad address]...PASS!
Running av_goodbad_vector_async [Test async AV insert with good and bad address]...PASS!
Running av_goodbad_vector_sync_err [Test AV insert of 1 good, 1 bad address using FI_SYNC_ERR]...PASS!

Summary: all tests passed

Could you please change the test to be a bit more "friendly" to users?
Something like:
if the user doesn't give server info, take the current server IP from the hostname -i command
if the user gives only the server IP, without a client IP, set the client IP to be the same as the server IP
Things like that.
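
The two defaults suggested above could be sketched as follows. This is an illustration in Python, not fabtests code: `fill_addresses` and `local_ip` are hypothetical names, and it assumes "server IP" maps to the `-g` address and "client IP" to the `-s` source address used in the working command. Whether such defaults are correct for every provider is not verified here.

```python
import socket
from typing import Callable, Optional


def local_ip() -> str:
    """Rough equivalent of `hostname -i`: resolve this host's own name."""
    return socket.gethostbyname(socket.gethostname())


def fill_addresses(
    dest: Optional[str],
    src: Optional[str],
    lookup: Callable[[], str] = local_ip,
) -> tuple[str, str]:
    """Fill in missing addresses using the defaults proposed in this issue."""
    if dest is None:
        # No server address given: fall back to this host's own address.
        dest = lookup()
    if src is None:
        # Only the server address given: reuse it as the source address.
        src = dest
    return dest, src


# Only -g was supplied, so the source address defaults to the same value.
print(fill_addresses("127.0.0.1", None))
```

Note that on the reporter's machine `hostname -i` returned 127.0.1.1 while the passing run used 127.0.0.1, so a default derived from the hostname would still need checking against each provider.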

@aingerson
Contributor

@danielap1996 Yeah, there are definitely some issues with how fabtests handles source addressing. Some providers handle it differently, so it's difficult to make a universal solution that is also correct with the API without forcing something that works. I'm going to change your issue title to reflect the request so we can track it and make sure we address it in the future.
Thank you!

@aingerson changed the title from "Fabtests tests are getting fail on fi_getinfo(): unit/av_test.c:1148, ret=-61 (No data available)" to "fabtests: make source addressing usage more user friendly and universal" Feb 13, 2024
@aingerson added enhancement and removed bug labels Feb 13, 2024