Dual InfiniBand Network Cards in Same Subnet Solution

Background

In our lab server cluster, each machine is equipped with two InfiniBand (IB) network cards that support both RDMA and TCP/IP protocols.

The IPoIB (IP over InfiniBand) addresses in our lab follow the format 172.16.[host_number].[interface_number]/16, where interface_number is 1 or 2 for the first (index 0) or second (index 1) IB interface. As a result, both IB interfaces of a machine, and the IB interfaces of every other machine, sit in the same /16 subnet.
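The addressing convention can be captured in a small helper. This is only a sketch: the function name ipoib_addr is hypothetical, and the 172.16 prefix and /16 mask are taken from the convention described above.

```shell
# Hypothetical helper: derive an IPoIB address from the lab convention
# 172.16.[host_number].[interface_number]/16, where interface_number is
# the zero-based IB interface index plus one.
ipoib_addr() (
    host_number=$1
    iface_index=$2   # 0 for the first IB interface, 1 for the second
    echo "172.16.${host_number}.$((iface_index + 1))/16"
)

ipoib_addr 34 0   # prints 172.16.34.1/16
ipoib_addr 16 1   # prints 172.16.16.2/16
```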

Typically, each researcher uses two machines - one as a CPU server and another as a memory server - connected via IB and RDMA for Far Memory experiments. The following table shows the configuration for the machines I use:

| Hostname | OS Version | IB Interface Name | IPoIB |
| --- | --- | --- | --- |
| cpuserver16 | Ubuntu 20.04.5 LTS (with desktop) | ibs5f0 | 172.16.16.1/16 |
| | | ibs5f1 | 172.16.16.2/16 |
| memserver34 | Ubuntu 18.04.3 LTS (no desktop) | ib0 | 172.16.34.1/16 |
| | | ib1 | 172.16.34.2/16 |

Other researchers can refer to this configuration and check their own machines' IB interface names and IP addresses using the ip address or ifconfig commands. If IB network cards aren't detected, the MLNX_OFED driver needs to be installed.
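To pull only the IB interfaces out of the ip output, a small filter along these lines may help. It is a sketch that assumes IB interface names start with "ib" (true for the ib0 and ibs5f0 style names in the table above, but not guaranteed on every system); list_ib_addrs is a hypothetical name.

```shell
# Hypothetical filter: reads `ip -br address` output on stdin and prints
# "<interface> <first address>" for every interface whose name starts
# with "ib" (an assumption about the naming scheme).
list_ib_addrs() {
    awk '$1 ~ /^ib/ { print $1, $3 }'
}

# Usage on a real machine:
#   ip -br address | list_ib_addrs
```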

RDMA Connectivity Testing

In the RDMA protocol stack, the rping tool tests connectivity and plays the role that ping plays for TCP/IP. Unlike ping, rping requires a server-side process to be running before the client side can initiate a connection.

RDMA Server Example:

rping -s -a 172.16.34.1 -p 9401 -v

-s: Start server process

-a: Bind to IP address (specific IPoIB, using first IB NIC of memserver34)

-p: Listening port (9401 here; when omitted, rping falls back to its built-in default, 7174)

-v: Print output information

Server process will block waiting for client connection

Server stops only after client completes connection and disconnects

RDMA Client Example:

rping -c -I 172.16.16.1 -a 172.16.34.1 -p 9401 -v

-c: Start client process

-I: Specify local IPoIB address (optional, defaults to routing table lookup)

-a: Server's IPoIB address

-p: Server's listening port

-v: Print output information

Client continuously sends test data after successful connection, stop with Ctrl+C

Optional -C option to limit test data transmissions, e.g., -C 10

If RDMA connection is normal, both server and client terminals will display data output, for example:

ping data: rdma-ping-0: ABCDEFGHIJKLMNOPQRSTUVWXYZ[\]^_`abcdefghijklmnopqr
ping data: rdma-ping-1: BCDEFGHIJKLMNOPQRSTUVWXYZ[\]^_`abcdefghijklmnopqrs
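When the test is scripted rather than watched in a terminal, the "ping data" lines can be counted instead of read. The helper below is a sketch; count_rping_replies is a hypothetical name, and it assumes the verbose output format shown above.

```shell
# Hypothetical helper: count the "ping data" lines in rping -v output
# read from stdin. Prints 0 when the connection produced no data lines.
count_rping_replies() {
    grep -c 'ping data: rdma-ping-' || true
}

# Usage on the client, limiting the run with -C:
#   rping -c -a 172.16.34.1 -p 9401 -C 10 -v | count_rping_replies
```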

Problem Description and Reproduction

When the second IB network card on the Memory Server is used as the RDMA Server, the RDMA Client on the CPU Server cannot communicate with it, and the attempt fails with an RDMA_CM_EVENT_REJECTED error. Example commands:

memserver34

rping -s -a 172.16.34.2 -p 9401 -v

cpuserver16

rping -c -a 172.16.34.2 -p 9401 -v

Then the client side encounters an error:

cma event RDMA_CM_EVENT_REJECTED, error 8

Routing Table Troubleshooting

Using the route -n or ip route command to check memserver34's routing table yields the following:

$ ip route
default via 10.208.130.254 dev enp49s0f1 proto static
10.208.130.0/24 dev enp49s0f1 proto kernel scope link src 10.208.130.34
172.16.0.0/16 dev ib0 proto kernel scope link src 172.16.34.1
172.16.0.0/16 dev ib1 proto kernel scope link src 172.16.34.2

The system automatically generates one route to 172.16.0.0/16 per IB interface, so the routing table holds two entries for the same subnet. Only the first one is ever selected, which means traffic for that subnet always leaves through ib0. Therefore, if the second IB NIC (ib1) is used as the server, the connection cannot be established.
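The same conflict can be spotted mechanically on other machines. The filter below is a sketch (find_duplicate_routes is a hypothetical name) that reads ip route output and reports every destination reachable through more than one device.

```shell
# Hypothetical helper: reads `ip route` output on stdin and prints each
# destination that appears with more than one device.
find_duplicate_routes() {
    awk '
        {
            dev = ""
            for (i = 1; i < NF; i++) if ($i == "dev") dev = $(i + 1)
            if (dev == "") next
            count[$1]++
            devs[$1] = devs[$1] " " dev
        }
        END {
            for (dst in count)
                if (count[dst] > 1) print dst ":" devs[dst]
        }
    '
}

# Usage on a real machine:
#   ip route | find_duplicate_routes
```

For the memserver34 table shown above this reports 172.16.0.0/16 with ib0 and ib1. To see which device the kernel actually picks for a given destination, ip route get 172.16.16.1 can be used as a complementary check.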

Separating Routing Rules

Since all machines' IB NICs should be in the same subnet, default routing rules will inevitably conflict. To enable both NICs to be used simultaneously, each IB NIC should use its own routing table instead of the global routing table.

Referencing this solution, the following routing table modification commands are provided (taking memserver34 as an example):

These commands must be executed as root or with sudo!

Delete all entries related to IB NICs from the global routing table

ip route del 172.16.0.0/16 dev ib0
ip route del 172.16.0.0/16 dev ib1

Add back IB NIC routes with table option, using different table numbers for each NIC

This indicates that the routing information belongs to the specified routing table

ip route add 172.16.0.0/16 dev ib0 proto kernel scope link src 172.16.34.1 table 941
ip route add 172.16.0.0/16 dev ib1 proto kernel scope link src 172.16.34.2 table 942

Specify routing rules, each IB NIC uses its own routing table

ip rule add from 172.16.34.1 table 941
ip rule add from 172.16.34.2 table 942
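Because the three steps differ only in device, source address, and table number, they can be generated rather than typed. The following is a dry-run sketch: ib_policy_cmds is a hypothetical helper that only prints the commands, and the subnet default is the lab's 172.16.0.0/16.

```shell
# Hypothetical dry-run helper: print (do not execute) the three commands
# that move one IB interface's subnet route into its own routing table.
ib_policy_cmds() (
    dev=$1; src=$2; table=$3
    subnet=${4:-172.16.0.0/16}
    echo "ip route del $subnet dev $dev"
    echo "ip route add $subnet dev $dev proto kernel scope link src $src table $table"
    echo "ip rule add from $src table $table"
)

ib_policy_cmds ib0 172.16.34.1 941
ib_policy_cmds ib1 172.16.34.2 942
```

Printing first makes it easy to review the commands before running them; once they look right, the output can be piped into sudo sh.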

The configuration for cpuserver is similar, just replace the IPoIB addresses. After configuration, the reference routing table looks like this:

$ ip route
default via 10.208.130.254 dev enp49s0f1 proto static
10.208.130.0/24 dev enp49s0f1 proto kernel scope link src 10.208.130.34
$ ip rule
0: from all lookup local
32764: from 172.16.34.2 lookup 942
32765: from 172.16.34.1 lookup 941
32766: from all lookup main
32767: from all lookup default
$ ip route show table 941
172.16.0.0/16 dev ib0 proto kernel scope link src 172.16.34.1
$ ip route show table 942
172.16.0.0/16 dev ib1 proto kernel scope link src 172.16.34.2

Now the IB NIC routing information has been separated into two routing tables (ip route table), each bound to its NIC through a routing rule (ip rule).

Note that after IB NICs have their routing rules separated (especially after deleting global routes), when using rping and ping, you must explicitly specify the network interface to use (with the -I option):

16 ping 34

ping -I ibs5f0 172.16.34.1
ping -I ibs5f0 172.16.34.2
ping -I ibs5f1 172.16.34.1
ping -I ibs5f1 172.16.34.2

16 rping 34(ib0)

runs on 34

rping -s -a 172.16.34.1 -p 9401 -v

runs on 16

rping -c -I 172.16.16.1 -a 172.16.34.1 -p 9401 -v
rping -c -I 172.16.16.2 -a 172.16.34.1 -p 9401 -v

16 rping 34(ib1)

runs on 34

rping -s -a 172.16.34.2 -p 9401 -v

runs on 16

rping -c -I 172.16.16.1 -a 172.16.34.2 -p 9401 -v
rping -c -I 172.16.16.2 -a 172.16.34.2 -p 9401 -v
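The four ping combinations can also be run in one pass. This is a sketch: ping_matrix is a hypothetical helper, and the packet count (-c 3) and per-reply timeout (-W 2) are arbitrary choices.

```shell
# Hypothetical helper: ping every destination from every local IB
# interface. Arguments: a space-separated interface list, then a
# space-separated destination list. Exits non-zero if any pair fails.
ping_matrix() (
    failed=0
    for iface in $1; do
        for dst in $2; do
            if ping -c 3 -W 2 -I "$iface" "$dst" >/dev/null 2>&1; then
                echo "ok   $iface -> $dst"
            else
                echo "FAIL $iface -> $dst"
                failed=1
            fi
        done
    done
    exit "$failed"
)

# Usage on cpuserver16:
#   ping_matrix "ibs5f0 ibs5f1" "172.16.34.1 172.16.34.2"
```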

Currently, the routing table modifications made via the ip route command are not permanently saved and will be lost after a reboot. You can save the above commands as a shell script and execute them after the first login following a system reboot.

$ vim ~/route-ib-mem34.sh
#!/bin/bash
set -e

if [ $(whoami) != "root" ]; then
    echo "Error: Must run as root!"
    exit 1
fi

echo "delete global ib routes"
ip route del 172.16.0.0/16 dev ib0
ip route del 172.16.0.0/16 dev ib1

echo "add ib routes with table"
ip route add 172.16.0.0/16 dev ib0 proto kernel scope link src 172.16.34.1 table 941
ip route add 172.16.0.0/16 dev ib1 proto kernel scope link src 172.16.34.2 table 942

echo "add ip rule for ib interface"
ip rule add from 172.16.34.1 table 941
ip rule add from 172.16.34.2 table 942
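One caveat with the script above: under set -e it aborts on a second run, because ip route del fails once the global route is already gone. A re-runnable variant could check the current state before each step. The sketch below is one way to do that, not the original author's method; move_ib_route is a hypothetical name, and it must still be run as root.

```shell
# Hypothetical idempotent variant: each step first checks the current
# state, so running it twice does not abort or create duplicates.
move_ib_route() (
    dev=$1; src=$2; table=$3
    subnet=${4:-172.16.0.0/16}

    # Remove the route from the main table only if it is still there.
    if ip route show exact "$subnet" dev "$dev" | grep -q .; then
        ip route del "$subnet" dev "$dev"
    fi

    # `ip route replace` adds the route or overwrites an existing one.
    ip route replace "$subnet" dev "$dev" proto kernel scope link src "$src" table "$table"

    # Add the source-based rule only if it is not listed yet.
    if ! ip rule show | awk -v s="$src" -v t="$table" \
        '$2 == "from" && $3 == s && $4 == "lookup" && $5 == t { f = 1 } END { exit !f }'; then
        ip rule add from "$src" table "$table"
    fi
)

# Usage on memserver34 (as root):
#   move_ib_route ib0 172.16.34.1 941
#   move_ib_route ib1 172.16.34.2 942
```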

Permanent Configuration

To avoid manually executing the script every time the system boots, the routing modification script needs to be deployed so that it runs automatically at startup. More specifically, it should run right after each IB interface comes up.

Different Ubuntu systems, or even different machines running the same release, may use different network management software. Determine which tool currently manages the network on your machine before configuring anything; a hook placed in the wrong tool's directory will not take effect.
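A quick way to find out is to ask systemd which service is active. This is a sketch under the assumption that the machine runs systemd; detect_net_manager is a hypothetical name, and a result of "unknown" may simply mean that ifupdown or another tool is in use.

```shell
# Hypothetical helper: report which network management service is active.
# NetworkManager is checked first; systemd-networkd is the backend used
# by netplan's "networkd" renderer.
detect_net_manager() {
    if systemctl is-active --quiet NetworkManager; then
        echo "NetworkManager"
    elif systemctl is-active --quiet systemd-networkd; then
        echo "systemd-networkd"
    else
        echo "unknown"
    fi
}

detect_net_manager
```

Both services can be active on the same machine, so it is worth confirming per interface with nmcli device status or networkctl list.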

NetworkManager / ifupdown

The cpuserver16 machine has a desktop, so its network is managed by NetworkManager, with corresponding terminal tools nmcli and nmtui. The static IP configuration for IB NICs is located in /etc/NetworkManager/system-connections/.

Referencing this solution, add a script named route-ib-cpu16 in the /etc/network/if-up.d directory. Scripts in the if-up.d directory are automatically executed when a network interface connects, with the current interface name passed through the IFACE variable.

Script content (needs to check the current network interface using the IFACE variable):

#!/bin/bash
set -e

if [ "$IFACE" == "ibs5f0" ]; then
    ip route del 172.16.0.0/16 dev ibs5f0
    ip route add 172.16.0.0/16 dev ibs5f0 proto kernel scope link src 172.16.16.1 table 941
    ip rule add from 172.16.16.1 table 941
elif [ "$IFACE" == "ibs5f1" ]; then
    ip route del 172.16.0.0/16 dev ibs5f1
    ip route add 172.16.0.0/16 dev ibs5f1 proto kernel scope link src 172.16.16.2 table 942
    ip rule add from 172.16.16.2 table 942
fi

After creating the script file, executable permissions must be added: sudo chmod +x route-ib-cpu16. Then, reboot the server for the above configuration to take effect permanently.

Additionally, based on some research, if the current Linux system uses the ifupdown tool to manage network connections, the above method might also work, but it hasn't been tested.
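The two branches of the hook differ only in the source address and table number, so the mapping can be kept in one place. The lookup below is a sketch of that idea for cpuserver16; ib_params is a hypothetical name, and the commented hook body mirrors the commands used in the script above.

```shell
# Hypothetical lookup: print "<source address> <table id>" for an IB
# interface of cpuserver16, or fail for any other interface.
ib_params() {
    case "$1" in
        ibs5f0) echo "172.16.16.1 941" ;;
        ibs5f1) echo "172.16.16.2 942" ;;
        *)      return 1 ;;
    esac
}

# Sketch of a hook body using it (IFACE is set by the caller):
#   if params=$(ib_params "$IFACE"); then
#       set -- $params   # $1 = source address, $2 = table id
#       ip route del 172.16.0.0/16 dev "$IFACE"
#       ip route add 172.16.0.0/16 dev "$IFACE" proto kernel scope link src "$1" table "$2"
#       ip rule add from "$1" table "$2"
#   fi
```

Interfaces that are not listed make ib_params fail, so the hook does nothing for them.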

netplan

Unlike cpuserver16, memserver34 uses netplan to manage network connections. This machine has Ubuntu 18.04 LTS installed without a desktop, meaning it doesn't have NetworkManager.

After some exploration, the static IP configuration for IB NICs was found in the /etc/netplan/ directory:

$ ls /etc/netplan/
01-netcfg.yaml  99-netcfg.yaml
$ cat /etc/netplan/99-netcfg.yaml
network:
  version: 2
  renderer: networkd
  ethernets:
    ......
    ib0:
      addresses: [172.16.34.1/16]
    ib1:
      addresses: [172.16.34.2/16]

For netplan, referencing this solution, scripts that run after a network interface connects are located under /etc/networkd-dispatcher/. Routing operations should be placed in the routable.d subdirectory. The script follows the same conventions as above: networkd-dispatcher passes the current network interface through the IFACE variable.

Create route-ib-mem34 in the /etc/networkd-dispatcher/routable.d/ directory:

#!/bin/bash
set -e

if [ "$IFACE" == "ib0" ]; then
    ip route del 172.16.0.0/16 dev ib0
    ip route add 172.16.0.0/16 dev ib0 proto kernel scope link src 172.16.34.1 table 941
    ip rule add from 172.16.34.1 table 941
elif [ "$IFACE" == "ib1" ]; then
    ip route del 172.16.0.0/16 dev ib1
    ip route add 172.16.0.0/16 dev ib1 proto kernel scope link src 172.16.34.2 table 942
    ip rule add from 172.16.34.2 table 942
fi

Similarly, executable permissions need to be added with sudo chmod +x. After rebooting, the routing configuration will take effect permanently.

If you're unsure whether the permanent configuration is correct, you can test it after reboot using the ip rule and ip route show table commands mentioned in the Separating Routing Rules section.
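That post-reboot check can be scripted. The filter below is a sketch (has_rule is a hypothetical name) that succeeds only when ip rule output contains a rule sending traffic from a given source address to a given table.

```shell
# Hypothetical check: reads `ip rule` output on stdin and succeeds only
# if it contains "from <source> lookup <table>".
has_rule() {
    awk -v src="$1" -v table="$2" '
        $2 == "from" && $3 == src && $4 == "lookup" && $5 == table { found = 1 }
        END { exit !found }
    '
}

# Usage on memserver34 after a reboot:
#   ip rule | has_rule 172.16.34.1 941 && ip route show table 941
#   ip rule | has_rule 172.16.34.2 942 && ip route show table 942
```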
