Dual InfiniBand Network Cards in Same Subnet Solution
Background
In our lab server cluster, each machine is equipped with two InfiniBand (IB) network cards that support both RDMA and TCP/IP protocols.
The IPoIB (IP over InfiniBand) addressing convention in our lab follows this format: 172.16.[host_number].[interface_number]/16, where interface_number is either 1 or 2, corresponding to network interfaces 0 or 1. Both NICs of every machine, and the IB cards of different machines, are all within the same /16 subnet.
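As a small illustration of the convention, the sketch below derives an IPoIB address from a host number and a NIC index. The helper name ipoib_addr is hypothetical and not part of any lab tooling; it only assumes the 172.16.[host_number].[interface_number]/16 format described above.

```shell
#!/bin/sh
# Build the IPoIB address for a host under the lab convention
# 172.16.[host_number].[interface_number]/16.
# ipoib_addr is a hypothetical helper; iface_index is the NIC index (0 or 1),
# which maps to interface_number 1 or 2.
ipoib_addr() {
    host_number=$1
    iface_index=$2
    case $iface_index in
        0|1) ;;
        *) echo "iface_index must be 0 or 1" >&2; return 1 ;;
    esac
    echo "172.16.${host_number}.$((iface_index + 1))/16"
}

ipoib_addr 34 0   # first IB NIC of memserver34
ipoib_addr 16 1   # second IB NIC of cpuserver16
```

The two calls print 172.16.34.1/16 and 172.16.16.2/16, which match the addresses listed for these machines.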
Typically, each researcher uses two machines - one as a CPU server and another as a memory server - connected via IB and RDMA for Far Memory experiments. The following table shows the configuration for the machines I use:
| Hostname | OS Version | IB Interface Name | IPoIB |
|---|---|---|---|
| cpuserver16 | Ubuntu 20.04.5 LTS (with desktop) | ibs5f0 | 172.16.16.1/16 |
| cpuserver16 | Ubuntu 20.04.5 LTS (with desktop) | ibs5f1 | 172.16.16.2/16 |
| memserver34 | Ubuntu 18.04.3 LTS (no desktop) | ib0 | 172.16.34.1/16 |
| memserver34 | Ubuntu 18.04.3 LTS (no desktop) | ib1 | 172.16.34.2/16 |
Other researchers can refer to this configuration and check their own machines' IB interface names and IP addresses using the ip address or ifconfig commands. If IB network cards aren't detected, the MLNX_OFED driver needs to be installed.
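Besides reading the ip address output by eye, the IB interfaces can be picked out programmatically. The following is a minimal sketch, assuming a Linux sysfs layout where /sys/class/net/<iface>/type holds the ARP hardware type and 32 denotes InfiniBand; list_ib_ifaces is a hypothetical helper name.

```shell
#!/bin/sh
# List network interfaces whose link type is InfiniBand.
# The kernel exposes each interface's ARP hardware type in
# /sys/class/net/<iface>/type; 32 is ARPHRD_INFINIBAND.
# list_ib_ifaces is a hypothetical helper; the optional argument overrides
# the sysfs directory (useful for trying the function without IB hardware).
list_ib_ifaces() {
    net_dir=${1:-/sys/class/net}
    for type_file in "$net_dir"/*/type; do
        [ -r "$type_file" ] || continue
        if [ "$(cat "$type_file")" = "32" ]; then
            basename "$(dirname "$type_file")"
        fi
    done
}

list_ib_ifaces
```

If nothing is printed on a machine that should have IB cards, that is consistent with the driver not being installed.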
RDMA Connectivity Testing
In the RDMA protocol stack, the rping tool is used to test network connectivity, which is equivalent to ping in TCP/IP. Unlike ping, rping requires first starting a server-side process before the client-side can initiate a connection.
RDMA Server Example:
rping -s -a 172.16.34.1 -p 9401 -v
-s: Start server process
-a: Bind to IP address (specific IPoIB, using first IB NIC of memserver34)
-p: Listening port (default is 9400)
-v: Print output information
Server process will block waiting for client connection
Server stops only after client completes connection and disconnects
RDMA Client Example:
rping -c -I 172.16.16.1 -a 172.16.34.1 -p 9401 -v
-c: Start client process
-I: Specify local IPoIB address (optional, defaults to routing table lookup)
-a: Server's IPoIB address
-p: Server's listening port
-v: Print output information
Client continuously sends test data after successful connection, stop with Ctrl+C
Optional -C option to limit test data transmissions, e.g., -C 10
If RDMA connection is normal, both server and client terminals will display data output, for example:
ping data: rdma-ping-0: ABCDEFGHIJKLMNOPQRSTUVWXYZ[\]^_abcdefghijklmnopqr
ping data: rdma-ping-1: BCDEFGHIJKLMNOPQRSTUVWXYZ[\]^_abcdefghijklmnopqrs
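Because the server and client invocations must agree on address and port, it can help to generate them as a pair. The sketch below only prints the two commands and does not run rping; rping_pair is a hypothetical helper, and the default port 9401 and count 10 are simply the values used in the examples above.

```shell
#!/bin/sh
# Print a matching rping server/client command pair for one test.
# rping_pair is a hypothetical helper; it only prints the commands
# (it does not run rping), so it is safe to try anywhere.
#   $1 server IPoIB address   $2 client IPoIB address
#   $3 port (optional, 9401 as in the examples above)
#   $4 number of transmissions for -C (optional, 10)
rping_pair() {
    server_ip=$1
    client_ip=$2
    port=${3:-9401}
    count=${4:-10}
    echo "server: rping -s -a $server_ip -p $port -v"
    echo "client: rping -c -I $client_ip -a $server_ip -p $port -v -C $count"
}

rping_pair 172.16.34.1 172.16.16.1
```

Run the printed server command on the memory server first, then the client command on the CPU server.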
Problem Description and Reproduction
When using the second IB network card on the Memory Server as the RDMA Server, the RDMA Client on the CPU Server cannot communicate with the RDMA Server, resulting in an RDMA_CM_EVENT_REJECTED error. Example commands:
memserver34
rping -s -a 172.16.34.2 -p 9401 -v
cpuserver16
rping -c -a 172.16.34.2 -p 9401 -v
Then the client side encounters an error:
cma event RDMA_CM_EVENT_REJECTED, error 8
Routing Table Troubleshooting
Using the route -n or ip route command to check memserver34's routing table yields the following:
$ ip route
default via 10.208.130.254 dev enp49s0f1 proto static
10.208.130.0/24 dev enp49s0f1 proto kernel scope link src 10.208.130.34
172.16.0.0/16 dev ib0 proto kernel scope link src 172.16.34.1
172.16.0.0/16 dev ib1 proto kernel scope link src 172.16.34.2
The system automatically generates two routes to 172.16.0.0/16, one per IB NIC, but only the first one (via ib0) is matched for traffic to that subnet. Therefore, if the second IB NIC (ib1) is used as the server, the connection cannot be established.
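To spot this situation on other machines, the routing table can be scanned for prefixes that appear on more than one device. This is a rough sketch, assuming the usual "<prefix> dev <iface> ..." line format of ip route output; find_shared_prefixes is a hypothetical helper and not part of iproute2.

```shell
#!/bin/sh
# Report destination prefixes that have routes on more than one device,
# reading `ip route` output on stdin.
# find_shared_prefixes is a hypothetical helper, not part of iproute2.
find_shared_prefixes() {
    awk '
        # Only consider lines of the form "<prefix> dev <iface> ...".
        $2 == "dev" {
            count[$1]++
            devs[$1] = (devs[$1] == "" ? $3 : devs[$1] "," $3)
        }
        END {
            for (prefix in count)
                if (count[prefix] > 1)
                    print prefix, devs[prefix]
        }
    '
}

# Typical use on a live machine:
#   ip route | find_shared_prefixes
```

Fed the memserver34 routing table shown above, it reports 172.16.0.0/16 ib0,ib1.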
Separating Routing Rules
Since all machines' IB NICs should be in the same subnet, default routing rules will inevitably conflict. To enable both NICs to be used simultaneously, each IB NIC should use its own routing table instead of the global routing table.
Referencing this solution, the following routing table modification commands are provided (taking memserver34 as an example):
These commands must be executed as root or with sudo!
Delete all entries related to IB NICs from the global routing table
ip route del 172.16.0.0/16 dev ib0
ip route del 172.16.0.0/16 dev ib1
Add back IB NIC routes with table option, using different table numbers for each NIC
This indicates that the routing information belongs to the specified routing table
ip route add 172.16.0.0/16 dev ib0 proto kernel scope link src 172.16.34.1 table 941
ip route add 172.16.0.0/16 dev ib1 proto kernel scope link src 172.16.34.2 table 942
Specify routing rules, each IB NIC uses its own routing table
ip rule add from 172.16.34.1 table 941
ip rule add from 172.16.34.2 table 942
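The same three commands recur for every NIC, with only the device, address, and table number changing. The sketch below generates them for review rather than running them; split_route_cmds is a hypothetical helper, and the hard-coded subnet is the lab-wide 172.16.0.0/16 used throughout this document.

```shell
#!/bin/sh
# Print the three commands that move one IB NIC's subnet route into its own
# routing table. split_route_cmds is a hypothetical helper; it prints the
# commands instead of running them so they can be reviewed first.
#   $1 device   $2 IPoIB address (no prefix length)   $3 table number
split_route_cmds() {
    dev=$1
    addr=$2
    table=$3
    subnet=172.16.0.0/16   # lab-wide IPoIB subnet from this document
    echo "ip route del $subnet dev $dev"
    echo "ip route add $subnet dev $dev proto kernel scope link src $addr table $table"
    echo "ip rule add from $addr table $table"
}

# cpuserver16, using its interface names and addresses from the Background table
split_route_cmds ibs5f0 172.16.16.1 941
split_route_cmds ibs5f1 172.16.16.2 942
```

Printing first keeps the destructive ip route del step visible before anything is executed as root.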
The configuration for cpuserver is similar, just replace the IPoIB addresses. After configuration, the reference routing table looks like this:
$ ip route
default via 10.208.130.254 dev enp49s0f1 proto static
10.208.130.0/24 dev enp49s0f1 proto kernel scope link src 10.208.130.34
$ ip rule
0: from all lookup local
32764: from 172.16.34.2 lookup 942
32765: from 172.16.34.1 lookup 941
32766: from all lookup main
32767: from all lookup default
$ ip route show table 941
172.16.0.0/16 dev ib0 proto kernel scope link src 172.16.34.1
$ ip route show table 942
172.16.0.0/16 dev ib1 proto kernel scope link src 172.16.34.2
Now the IB NIC routing information has been separated into two routing tables (ip route table) and bound to the NICs through routing rules (ip rule).
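The presence of the per-NIC rules can also be checked in a script. This is a minimal sketch, assuming ip rule prints lines of the form "<priority>: from <address> lookup <table>" with numeric table IDs as in the output above; has_rule is a hypothetical helper.

```shell
#!/bin/sh
# Check that `ip rule` output (on stdin) contains a source-based rule sending
# the given address to the given table. has_rule is a hypothetical helper.
has_rule() {
    addr=$1
    table=$2
    # Match lines such as "32765: from 172.16.34.1 lookup 941".
    awk -v addr="$addr" -v table="$table" '
        $2 == "from" && $3 == addr && $4 == "lookup" && $5 == table { found = 1 }
        END { exit(found ? 0 : 1) }
    '
}

# Typical use on memserver34:
#   ip rule | has_rule 172.16.34.1 941 && echo "rule for ib0 present"
#   ip rule | has_rule 172.16.34.2 942 && echo "rule for ib1 present"
```

The function exits with status 0 when the rule is found and 1 otherwise, so it can be combined with && or used in an if statement.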
Note that after IB NICs have their routing rules separated (especially after deleting global routes), when using rping and ping, you must explicitly specify the network interface to use (with the -I option):
16 ping 34
ping -I ibs5f0 172.16.34.1
ping -I ibs5f0 172.16.34.2
ping -I ibs5f1 172.16.34.1
ping -I ibs5f1 172.16.34.2
16 rping 34(ib0)
runs on 34
rping -s -a 172.16.34.1 -p 9401 -v
runs on 16
rping -c -I 172.16.16.1 -a 172.16.34.1 -p 9401 -v
rping -c -I 172.16.16.2 -a 172.16.34.1 -p 9401 -v
16 rping 34(ib1)
runs on 34
rping -s -a 172.16.34.2 -p 9401 -v
runs on 16
rping -c -I 172.16.16.1 -a 172.16.34.2 -p 9401 -v
rping -c -I 172.16.16.2 -a 172.16.34.2 -p 9401 -v
Currently, the routing table modifications made via the ip route command are not permanently saved and will be lost after a reboot. You can save the above commands as a shell script and execute them after the first login following a system reboot.
$ vim ~/route-ib-mem34.sh
#!/bin/bash
set -e

if [ "$(whoami)" != "root" ]; then
    echo "Error: Must run as root!"
    exit 1
fi

echo "delete global ib routes"
ip route del 172.16.0.0/16 dev ib0
ip route del 172.16.0.0/16 dev ib1

echo "add ib routes with table"
ip route add 172.16.0.0/16 dev ib0 proto kernel scope link src 172.16.34.1 table 941
ip route add 172.16.0.0/16 dev ib1 proto kernel scope link src 172.16.34.2 table 942

echo "add ip rule for ib interface"
ip rule add from 172.16.34.1 table 941
ip rule add from 172.16.34.2 table 942
Permanent Configuration
To avoid running the script manually after every boot, it should be deployed to run automatically at startup, more specifically once each IB interface has come up.
Different Ubuntu releases, or even individual machines, may use different network management software. Determine which tool currently manages the network on your machine before configuring anything; otherwise, errors may occur.
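One way to find out is to ask systemd which of the usual candidates is active. The sketch below is only a heuristic: active_net_units is a hypothetical helper, the unit names are the common Ubuntu ones (NetworkManager, systemd-networkd, and networking for ifupdown), and more than one of them may be active at the same time.

```shell
#!/bin/sh
# Print the network-management services that systemd reports as active.
# active_net_units is a hypothetical helper, and the unit names below are the
# common Ubuntu ones (an assumption for other distributions).
active_net_units() {
    for unit in NetworkManager systemd-networkd networking; do
        if systemctl is-active --quiet "$unit" 2>/dev/null; then
            echo "$unit"
        fi
    done
    return 0
}

active_net_units
```

On machines that use netplan, the renderer key in the files under /etc/netplan/ is a further hint about which backend is in charge.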
NetworkManager / ifupdown
The cpuserver16 machine has a desktop, so its network is managed by NetworkManager, with corresponding terminal tools nmcli and nmtui. The static IP configuration for IB NICs is located in /etc/NetworkManager/system-connections/.
Referencing this solution, add a script named route-ib-cpu16 in the /etc/network/if-up.d directory. Scripts in the if-up.d directory are automatically executed when a network interface connects, with the current interface name passed through the IFACE variable.
Script content (needs to check the current network interface using the IFACE variable):
#!/bin/bash
set -e

if [ "$IFACE" == "ibs5f0" ]; then
    ip route del 172.16.0.0/16 dev ibs5f0
    ip route add 172.16.0.0/16 dev ibs5f0 proto kernel scope link src 172.16.16.1 table 941
    ip rule add from 172.16.16.1 table 941
elif [ "$IFACE" == "ibs5f1" ]; then
    ip route del 172.16.0.0/16 dev ibs5f1
    ip route add 172.16.0.0/16 dev ibs5f1 proto kernel scope link src 172.16.16.2 table 942
    ip rule add from 172.16.16.2 table 942
fi
After creating the script file, executable permissions must be added: sudo chmod +x route-ib-cpu16. Then, reboot the server for the above configuration to take effect permanently.
Additionally, based on some research, if the current Linux system uses the ifupdown tool to manage network connections, the above method might also work, but it hasn't been tested.
netplan
Unlike cpuserver16, memserver34 uses netplan to manage network connections. This machine has Ubuntu 18.04 LTS installed without a desktop, meaning it doesn't have NetworkManager.
After some exploration, the static IP configuration for IB NICs was found in the /etc/netplan/ directory:
$ ls /etc/netplan/
01-netcfg.yaml  99-netcfg.yaml
$ cat /etc/netplan/99-netcfg.yaml
network:
  version: 2
  renderer: networkd
  ethernets:
    ......
    ib0:
      addresses: [172.16.34.1/16]
    ib1:
      addresses: [172.16.34.2/16]
For netplan, referencing this solution, scripts executed after a network interface connects are located in /etc/networkd-dispatcher/; routing operations should be placed in the routable.d subdirectory. The script conventions are the same as above: networkd-dispatcher passes the current interface name through the IFACE variable.
Create route-ib-mem34 in the /etc/networkd-dispatcher/routable.d/ directory:
#!/bin/bash
set -e

if [ "$IFACE" == "ib0" ]; then
    ip route del 172.16.0.0/16 dev ib0
    ip route add 172.16.0.0/16 dev ib0 proto kernel scope link src 172.16.34.1 table 941
    ip rule add from 172.16.34.1 table 941
elif [ "$IFACE" == "ib1" ]; then
    ip route del 172.16.0.0/16 dev ib1
    ip route add 172.16.0.0/16 dev ib1 proto kernel scope link src 172.16.34.2 table 942
    ip rule add from 172.16.34.2 table 942
fi
Similarly, executable permissions need to be added with sudo chmod +x. After rebooting, the routing configuration will take effect permanently.
If you're unsure whether the permanent configuration is correct, you can test it after reboot using the ip rule and ip route show table commands mentioned in the Separating Routing Rules section.
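Before rebooting, the hook script's logic can also be exercised by hand without touching the real routing tables. The sketch below runs a hook with IFACE set and with a stub ip command placed first in PATH that only prints what it is asked to do; dry_run_hook is a hypothetical helper, and the approach assumes the hook calls ip by name through PATH, as the scripts in this document do.

```shell
#!/bin/sh
# Dry-run a hook script: run it with IFACE set and with a stub `ip` first in
# PATH that only prints the arguments it receives.
# dry_run_hook is a hypothetical helper.
#   $1 path to the hook script   $2 interface name to pass as IFACE
dry_run_hook() {
    hook=$1
    iface=$2
    stub_dir=$(mktemp -d) || return 1
    # The stub echoes the command line instead of changing any routes.
    printf '#!/bin/sh\necho "ip $*"\n' > "$stub_dir/ip"
    chmod +x "$stub_dir/ip"
    if IFACE=$iface PATH="$stub_dir:$PATH" "$hook"; then
        status=0
    else
        status=$?
    fi
    rm -rf "$stub_dir"
    return $status
}

# Typical use on memserver34:
#   dry_run_hook /etc/networkd-dispatcher/routable.d/route-ib-mem34 ib1
```

The printed lines should be exactly the three ip commands for the chosen interface; a typo in a branch shows up here rather than after a reboot.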