prov/ucx: fi_rdm_tagged_bw fi_av_insert error #10148

Open
miharulidze opened this issue Jul 3, 2024 · 1 comment

Describe the bug
I'm trying to run the fi_rdm_tagged_bw benchmark using the UCX provider.
On the client side I get the following error:

[1720021258.401279] [slimfly2:990044:0] ucp_ep.c:1054 UCX ERROR the parameter params->address must not be NULL
[error] fabtests:common/shared.c:1502: fi_av_insert: number of addresses inserted = 0; number of addresses given = 1

To Reproduce

Server: fi_rdm_tagged_bw -p ucx -e rdm -I 512 -w 100 -W 1 -S 2097152 --pin-core 31
Client: fi_rdm_tagged_bw 192.168.1.11 -p ucx -e rdm -I 512 -w 100 -W 1 -S 2097152 --pin-core 31

Output
Client output with FI_LOG_LEVEL=debug:

libfabric:990134:1720021525::core:core:fi_param_get_():372<info> variable perf_cntr=<not set>
libfabric:990134:1720021525::core:core:fi_param_get_():372<info> variable hook=<not set>                
libfabric:990134:1720021525::core:core:fi_param_get_():372<info> variable hmem=<not set>   
libfabric:990134:1720021525::core:core:ofi_hmem_init():658<info> Hmem iface FI_HMEM_CUDA not supported    
libfabric:990134:1720021525::core:core:ofi_hmem_init():658<info> Hmem iface FI_HMEM_ROCR not supported              
libfabric:990134:1720021525::core:core:ofi_hmem_init():658<info> Hmem iface FI_HMEM_ZE not supported                 
libfabric:990134:1720021525::core:core:ofi_hmem_init():658<info> Hmem iface FI_HMEM_NEURON not supported             
libfabric:990134:1720021525::core:core:ofi_hmem_init():658<info> Hmem iface FI_HMEM_SYNAPSEAI not supported
libfabric:990134:1720021525::core:core:fi_param_get_():372<info> variable hmem_disable_p2p=<not set>  
libfabric:990134:1720021525::core:mr:ofi_monitors_init():222<info> Initializing memory monitor uffd   
libfabric:990134:1720021525::core:mr:ofi_monitors_init():222<info> Initializing memory monitor memhooks
libfabric:990134:1720021525::core:mr:ofi_monitors_init():222<info> Initializing memory monitor cuda     
libfabric:990134:1720021525::core:mr:ofi_monitors_init():222<info> Initializing memory monitor cuda_ipc    
libfabric:990134:1720021525::core:mr:ofi_monitors_init():222<info> Initializing memory monitor rocr 
libfabric:990134:1720021525::core:mr:ofi_monitors_init():222<info> Initializing memory monitor rocr_ipc             
libfabric:990134:1720021525::core:mr:ofi_monitors_init():222<info> Initializing memory monitor xpmem                           
libfabric:990134:1720021525::core:mr:ofi_monitors_init():222<info> Initializing memory monitor ze                   
libfabric:990134:1720021525::core:mr:ofi_monitors_init():222<info> Initializing memory monitor import          
libfabric:990134:1720021525::core:core:fi_param_get_():372<info> variable mr_cache_max_size=<not set>
libfabric:990134:1720021525::core:core:fi_param_get_():372<info> variable mr_cache_max_count=<not set>
libfabric:990134:1720021525::core:core:fi_param_get_():372<info> variable mr_cache_monitor=<not set>
libfabric:990134:1720021525::core:core:fi_param_get_():372<info> variable mr_cuda_cache_monitor_enabled=<not set>
libfabric:990134:1720021525::core:core:fi_param_get_():372<info> variable mr_rocr_cache_monitor_enabled=<not set>
libfabric:990134:1720021525::core:core:fi_param_get_():372<info> variable mr_ze_cache_monitor_enabled=<not set>
libfabric:990134:1720021525::core:mr:ofi_default_cache_size():83<info> default cache size=1041463168
libfabric:990134:1720021525::core:mr:ofi_monitors_init():306<info> Default memory monitor is: memhooks
libfabric:990134:1720021525::core:core:fi_param_get_():372<info> variable provider=<not set>
libfabric:990134:1720021525::core:core:fi_param_get_():372<info> variable universe_size=<not set>
libfabric:990134:1720021525::core:core:fi_param_get_():372<info> variable av_remove_cleanup=<not set>
libfabric:990134:1720021525::core:core:fi_param_get_():372<info> variable offload_coll_provider=<not set>            
libfabric:990134:1720021525::core:core:fi_param_get_():372<info> variable provider_path=<not set>
libfabric:990134:1720021525::shm:core:fi_param_get_():372<info> variable sar_threshold=<not set>
libfabric:990134:1720021525::shm:core:fi_param_get_():372<info> variable tx_size=<not set>       
libfabric:990134:1720021525::shm:core:fi_param_get_():372<info> variable rx_size=<not set>       
libfabric:990134:1720021525::shm:core:fi_param_get_():372<info> variable disable_cma=<not set>     
libfabric:990134:1720021525::shm:core:fi_param_get_():372<info> variable use_dsa_sar=<not set>
libfabric:990134:1720021525::shm:core:fi_param_get_():372<info> variable use_xpmem=<not set>
libfabric:990134:1720021525::core:core:ofi_register_provider():518<info> registering provider: shm (121.0)                                                    
libfabric:990134:1720021525::core:core:ofi_register_provider():518<info> registering provider: sm2 (121.0)
libfabric:990134:1720021525::ofi_rxm:core:fi_param_get_():372<info> variable enable_passthru=<not set>
libfabric:990134:1720021525::ofi_rxm:core:fi_param_get_():372<info> variable buffer_size=<not set>
libfabric:990134:1720021525::ofi_rxm:core:fi_param_get_():372<info> variable tx_size=<not set>     
libfabric:990134:1720021525::ofi_rxm:core:fi_param_get_():372<info> variable rx_size=<not set>     
libfabric:990134:1720021525::ofi_rxm:core:fi_param_get_():372<info> variable msg_tx_size=<not set>                          
libfabric:990134:1720021525::ofi_rxm:core:fi_param_get_():372<info> variable msg_rx_size=<not set>                      
libfabric:990134:1720021525::ofi_rxm:core:fi_param_get_():372<info> variable cm_progress_interval=<not set>         
libfabric:990134:1720021525::ofi_rxm:core:fi_param_get_():372<info> variable cq_eq_fairness=<not set>
libfabric:990134:1720021525::ofi_rxm:core:fi_param_get_():372<info> variable data_auto_progress=<not set>
libfabric:990134:1720021525::ofi_rxm:core:fi_param_get_():372<info> variable use_rndv_write=<not set>    
libfabric:990134:1720021525::ofi_rxm:core:fi_param_get_():372<info> variable def_wait_obj=<not set>  
libfabric:990134:1720021525::ofi_rxm:core:fi_param_get_():372<info> variable def_tcp_wait_obj=<not set>  
libfabric:990134:1720021525::ofi_rxm:core:fi_param_get_():372<info> variable detect_hmem_iface=<not set>
libfabric:990134:1720021525::core:core:ofi_register_provider():518<info> registering provider: ofi_rxm (121.0)
libfabric:990134:1720021525::core:core:ofi_register_provider():518<info> registering provider: verbs (121.0)
libfabric:990134:1720021525::ofi_mrail:core:fi_param_get_():372<info> variable config=<not set>    
libfabric:990134:1720021525::ofi_mrail:core:fi_param_get_():372<info> variable addr_strc=<not set>
libfabric:990134:1720021525::ofi_mrail:core:mrail_parse_env_vars():115<info> unable to read FI_OFI_MRAIL_ADDR env variable
libfabric:990134:1720021525::core:core:ofi_register_provider():518<info> registering provider: ofi_mrail (121.0)
libfabric:990134:1720021525::ofi_rxd:core:fi_param_get_():372<info> variable spin_count=<not set>
libfabric:990134:1720021525::ofi_rxd:core:fi_param_get_():372<info> variable retry=<not set>
libfabric:990134:1720021525::ofi_rxd:core:fi_param_get_():372<info> variable max_peers=<not set>
libfabric:990134:1720021525::ofi_rxd:core:fi_param_get_():386<info> read int var max_unacked=128
libfabric:990134:1720021525::core:core:ofi_register_provider():518<info> registering provider: ofi_rxd (121.0)
libfabric:990134:1720021525::efa:fabric:efa_device_construct():67<info> efadv_query_device: Unknown error -95(-95)
libfabric:990134:1720021525::core:core:ofi_register_provider():518<info> registering provider: ucx (121.0)
libfabric:990134:1720021525::core:core:ofi_register_provider():518<info> registering provider: udp (121.0)
libfabric:990134:1720021525::core:core:ofi_register_provider():518<info> registering provider: sockets (121.0)
libfabric:990134:1720021525::tcp:core:fi_param_get_():372<info> variable prov_name=<not set>
libfabric:990134:1720021525::tcp:core:fi_param_get_():372<info> variable port_high_range=<not set>
libfabric:990134:1720021525::tcp:core:fi_param_get_():372<info> variable port_low_range=<not set>
libfabric:990134:1720021525::tcp:core:fi_param_get_():372<info> variable tx_size=<not set>
libfabric:990134:1720021525::tcp:core:fi_param_get_():372<info> variable rx_size=<not set>
libfabric:990134:1720021525::tcp:core:fi_param_get_():372<info> variable max_inject=<not set>
libfabric:990134:1720021525::tcp:core:fi_param_get_():372<info> variable max_saved=<not set>
libfabric:990134:1720021525::tcp:core:fi_param_get_():372<info> variable max_saved_size=<not set>
libfabric:990134:1720021525::tcp:core:fi_param_get_():372<info> variable max_rx_size=<not set>
libfabric:990134:1720021525::tcp:core:fi_param_get_():372<info> variable nodelay=<not set>
libfabric:990134:1720021525::tcp:core:fi_param_get_():372<info> variable staging_sbuf_size=<not set>
libfabric:990134:1720021525::tcp:core:fi_param_get_():372<info> variable prefetch_rbuf_size=<not set>
libfabric:990134:1720021525::tcp:core:fi_param_get_():372<info> variable zerocopy_size=<not set>
libfabric:990134:1720021525::tcp:core:fi_param_get_():372<info> variable trace_msg=<not set>
libfabric:990134:1720021525::tcp:core:fi_param_get_():372<info> variable disable_auto_progress=<not set>
libfabric:990134:1720021525::tcp:core:fi_param_get_():372<info> variable io_uring=<not set>
libfabric:990134:1720021525::core:core:ofi_register_provider():518<info> registering provider: tcp (121.0)
libfabric:990134:1720021525::core:core:ofi_register_provider():518<info> registering provider: ofi_hook_perf (121.0)
libfabric:990134:1720021525::core:core:ofi_register_provider():518<info> registering provider: ofi_hook_trace (121.0)
libfabric:990134:1720021525::core:core:ofi_register_provider():518<info> registering provider: ofi_hook_debug (121.0)
libfabric:990134:1720021525::core:core:fi_param_get_():372<info> variable hmem=<not set>
libfabric:990134:1720021525::core:core:ofi_hmem_init():658<info> Hmem iface FI_HMEM_CUDA not supported
libfabric:990134:1720021525::core:core:ofi_hmem_init():658<info> Hmem iface FI_HMEM_ROCR not supported
libfabric:990134:1720021525::core:core:ofi_hmem_init():658<info> Hmem iface FI_HMEM_ZE not supported
libfabric:990134:1720021525::core:core:ofi_hmem_init():658<info> Hmem iface FI_HMEM_NEURON not supported
libfabric:990134:1720021525::core:core:ofi_hmem_init():658<info> Hmem iface FI_HMEM_SYNAPSEAI not supported
libfabric:990134:1720021525::core:core:fi_param_get_():372<info> variable hmem_disable_p2p=<not set>
libfabric:990134:1720021525::core:core:ofi_register_provider():518<info> registering provider: ofi_hook_hmem (121.0)
libfabric:990134:1720021525::core:core:ofi_register_provider():518<info> registering provider: ofi_hook_dmabuf_peer_mem (121.0)
libfabric:990134:1720021525::core:core:ofi_register_provider():518<info> registering provider: ofi_hook_noop (121.0)
libfabric:990134:1720021525::core:core:ofi_register_provider():518<info> registering provider: off_coll (121.0)
libfabric:990134:1720021525::ucx:core:fi_param_get_():372<info> variable devices=<not set>
libfabric:990134:1720021525::ucx:core:ucx_getinfo():228<info> primary detected device: mlx5_0 
libfabric:990134:1720021525::ucx:core:fi_param_get_():372<info> variable inject_limit=<not set>
libfabric:990134:1720021525::ucx:core:ucx_getinfo():267<info> used inject size = 1024 
libfabric:990134:1720021525::ucx:core:fi_param_get_():372<info> variable config=<not set>
libfabric:990134:1720021525::ucx:core:fi_param_get_():372<info> variable ns_enable=<not set>
libfabric:990134:1720021525::ucx:core:fi_param_get_():372<info> variable ns_port=<not set>
libfabric:990134:1720021525::ucx:core:fi_param_get_():372<info> variable tls=<not set>
libfabric:990134:1720021525::ucx:core:fi_param_get_():372<info> variable ep_flush=<not set>
libfabric:990134:1720021525::ucx:core:fi_param_get_():372<info> variable check_req_leak=<not set>
libfabric:990134:1720021525::ucx:core:ucx_getinfo():306<info> Loaded UCX version 1.17.0
libfabric:990134:1720021525::ucx:core:ucx_getinfo():326<warn> fi_getinfo with non-NULL node or service is unsupported
libfabric:990134:1720021525::ucx:core:fi_param_get_():372<info> variable enable_spawn=<not set>
libfabric:990134:1720021525::ucx:core:ucx_getinfo():356<warn> UCX: spawn support 0 
libfabric:990134:1720021525::core:core:ofi_layering_ok():1289<info> Skipping ucx;ofi_rxm layering
libfabric:990134:1720021525::core:core:ofi_layering_ok():1289<info> Skipping ucx;ofi_rxd layering
libfabric:990134:1720021525::core:core:ofi_layering_ok():1289<info> Skipping ucx;ofi_mrail layering
libfabric:990134:1720021525::ucx:core:ucx_fabric_open():160<info> 
libfabric:990134:1720021525::core:core:fi_fabric_():1577<info> Opened fabric: ucx
libfabric:990134:1720021525::core:core:fi_fabric_():1584<info> Using ucx provider 1.21, path:/home/mkhalilo/Development/pcc/libfabric/build/lib/libfabric.so.1
libfabric:990134:1720021525::ucx:core:ofi_check_rx_attr():865<info> Tx only caps ignored in Rx caps
libfabric:990134:1720021525::ucx:core:ofi_check_tx_attr():963<info> Rx only caps ignored in Tx caps
libfabric:990134:1720021525::core:core:fi_param_get_():372<info> variable universe_size=<not set>
libfabric:990134:1720021525::ucx:core:ofi_check_rx_attr():865<info> Tx only caps ignored in Rx caps
libfabric:990134:1720021525::ucx:core:ofi_check_tx_attr():963<info> Rx only caps ignored in Tx caps
libfabric:990134:1720021525::ucx:core:ucx_av_insert():151<info> Try to insert address #0, offset=0 (size=1) fi_addr=0x4165b0
[1720021525.665062] [slimfly2:990134:0]          ucp_ep.c:1054 UCX  ERROR the parameter params->address must not be NULL
[error] fabtests:common/shared.c:1502: fi_av_insert: number of addresses inserted = 0; number of addresses given = 1

Environment:
OS Rocky Linux 9.4
UCX v1.16.0
libfabric master

j-xiong commented Jul 3, 2024

The ucx provider doesn't support remote address resolution via fi_getinfo(). Please add the -b option to the command line to enable out-of-band address exchange.
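As a sketch of what that suggestion could look like, the snippet below reuses the options from the report and adds only `-b`. It builds and prints the two command lines rather than running them, so it works without fabtests installed. The variable names are illustrative, and the commands have not been verified on the reporter's setup.

```shell
# Options copied from the original report; -b (out-of-band address
# exchange) is the only addition.
OPTS="-p ucx -e rdm -b -I 512 -w 100 -W 1 -S 2097152 --pin-core 31"

# The server takes no address; the client passes the server's IP as in the report.
SERVER_CMD="fi_rdm_tagged_bw $OPTS"
CLIENT_CMD="fi_rdm_tagged_bw 192.168.1.11 $OPTS"

echo "Server: $SERVER_CMD"
echo "Client: $CLIENT_CMD"
```

Both sides need `-b`, since the address exchange is a handshake between client and server.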

j-xiong removed the bug label on Jul 3, 2024