Dual InfiniBand Network Cards in Same Subnet Solution

Tech Apr 20 12

Background

In our lab server cluster, each machine is equipped with two InfiniBand (IB) network cards that support both RDMA and TCP/IP protocols.

The IPoIB (IP over InfiniBand) addressing convention in our lab follows this format: 172.16.[host_number].[interface_number]/16, where interface_number is 1 or 2, corresponding to IB interface 0 or 1 respectively. Both IB NICs of a machine, and the IB NICs of every other machine, are therefore in the same /16 subnet.
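
As a small illustration of this convention, the sketch below derives an IPoIB address from a host number and a zero-based interface index. The function name ipoib_addr is hypothetical and only serves this example.

```shell
#!/bin/bash
# Hypothetical helper: build the lab's IPoIB address for one IB interface.
# $1 = host number (e.g. 34), $2 = zero-based IB interface index (0 or 1)
ipoib_addr() {
    local host="$1" index="$2"
    # The last octet is the interface index plus one: interface 0 -> .1, interface 1 -> .2
    echo "172.16.${host}.$((index + 1))/16"
}

ipoib_addr 34 1   # second IB interface of memserver34
```

The call at the end prints 172.16.34.2/16, which matches the address of the second IB interface on memserver34.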

Typically, each researcher uses two machines, one as a CPU server and the other as a memory server, connected via IB and RDMA for Far Memory experiments. The following table shows the configuration of the machines I use:

Hostname     OS Version                         IB Interface Name  IPoIB
cpuserver16  Ubuntu 20.04.5 LTS (with desktop)  ibs5f0             172.16.16.1/16
cpuserver16  Ubuntu 20.04.5 LTS (with desktop)  ibs5f1             172.16.16.2/16
memserver34  Ubuntu 18.04.3 LTS (no desktop)    ib0                172.16.34.1/16
memserver34  Ubuntu 18.04.3 LTS (no desktop)    ib1                172.16.34.2/16

Other researchers can refer to this configuration and check their own machines' IB interface names and IP addresses using the ip address or ifconfig commands. If IB network cards aren't detected, the MLNX_OFED driver needs to be installed.
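
To find the IB interfaces quickly, one option is to filter the one-line link listing for the InfiniBand link type. The sketch below assumes the usual `ip -o link show` output format; list_ib_ifaces is a hypothetical name.

```shell
#!/bin/bash
# Hypothetical helper: read `ip -o link show` output on stdin and print
# the names of the interfaces whose link type is InfiniBand.
list_ib_ifaces() {
    # Each line looks like "4: ib0: <FLAGS> ... link/infiniband ...",
    # so with ": " as the separator the second field is the interface name.
    awk -F': ' '/link\/infiniband/ { print $2 }'
}

# Usage on a real machine:
#   ip -o link show | list_ib_ifaces
```

If the command prints nothing, the IB cards were probably not detected and the driver should be checked first.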

RDMA Connectivity Testing

In the RDMA protocol stack, the rping tool tests network connectivity and plays the same role as ping in TCP/IP. Unlike ping, rping requires a server process to be started first; only then can the client initiate a connection.

RDMA Server Example:

rping -s -a 172.16.34.1 -p 9401 -v
# -s: Start server process
# -a: Bind to IP address (specific IPoIB, using first IB NIC of memserver34)
# -p: Listening port (rping falls back to 7174 when -p is omitted)
# -v: Print output information
# Server process will block waiting for client connection
# Server stops only after client completes connection and disconnects

RDMA Client Example:

rping -c -I 172.16.16.1 -a 172.16.34.1 -p 9401 -v
# -c: Start client process
# -I: Specify local IPoIB address (optional, defaults to routing table lookup)
# -a: Server's IPoIB address
# -p: Server's listening port
# -v: Print output information
# Client continuously sends test data after successful connection, stop with Ctrl+C
# Optional -C option to limit test data transmissions, e.g., -C 10

If the RDMA connection works, both the server and client terminals display the ping data, for example:

ping data: rdma-ping-0: ABCDEFGHIJKLMNOPQRSTUVWXYZ[\]^_`abcdefghijklmnopqr
ping data: rdma-ping-1: BCDEFGHIJKLMNOPQRSTUVWXYZ[\]^_`abcdefghijklmnopqrs
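
For scripted checks, the client can be wrapped so that it sends a fixed number of pings and reports a single result line. This is only a sketch: rping_check is a hypothetical name, and it assumes that rping exits with a non-zero status when the connection fails.

```shell
#!/bin/bash
# Hypothetical wrapper: run a bounded rping client test and report the result.
# $1 = local IPoIB address, $2 = server IPoIB address, $3 = port, $4 = ping count (default 10)
rping_check() {
    local src="$1" dst="$2" port="$3" count="${4:-10}"
    # -C limits the number of pings so that the client exits on its own.
    # Assumption: rping returns a non-zero exit status when the connection fails.
    if rping -c -I "$src" -a "$dst" -p "$port" -C "$count" >/dev/null 2>&1; then
        echo "OK: $src -> $dst:$port"
    else
        echo "FAILED: $src -> $dst:$port"
    fi
}

# Usage (the rping server must already be listening on memserver34):
#   rping_check 172.16.16.1 172.16.34.1 9401
```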

Problem Description and Reproduction

When the second IB network card on the memory server is used as the RDMA server, the RDMA client on the CPU server cannot communicate with it, and the connection attempt ends in an RDMA_CM_EVENT_REJECTED error. Example commands:

# memserver34
rping -s -a 172.16.34.2 -p 9401 -v
# cpuserver16
rping -c -a 172.16.34.2 -p 9401 -v

Then the client side encounters an error:

cma event RDMA_CM_EVENT_REJECTED, error 8

Routing Table Troubleshooting

Using the route -n or ip route command to check memserver34's routing table yields the following:

$ ip route
default via 10.208.130.254 dev enp49s0f1 proto static 
10.208.130.0/24 dev enp49s0f1 proto kernel scope link src 10.208.130.34 
172.16.0.0/16 dev ib0 proto kernel scope link src 172.16.34.1 
172.16.0.0/16 dev ib1 proto kernel scope link src 172.16.34.2

The system automatically generates two routes to 172.16.0.0/16, one per IB NIC, but a lookup only ever uses the first matching entry (ib0). Traffic for this subnet is therefore routed through ib0 even when it belongs to 172.16.34.2, so a connection cannot be established when the second IB NIC acts as the server.
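
Conflicts of this kind can be spotted mechanically. The sketch below reads `ip route` output and prints every prefix that is attached to more than one device; find_duplicate_routes is a hypothetical name.

```shell
#!/bin/bash
# Hypothetical helper: read `ip route` output on stdin and print every prefix
# that is reachable through more than one device, followed by those devices.
find_duplicate_routes() {
    awk '$2 == "dev" {
             count[$1]++
             devs[$1] = devs[$1] " " $3
         }
         END {
             for (prefix in count)
                 if (count[prefix] > 1) print prefix devs[prefix]
         }'
}

# Usage on a real machine:
#   ip route | find_duplicate_routes
```

For the memserver34 routing table shown above, this prints 172.16.0.0/16 ib0 ib1.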

Separating Routing Rules

Because the IB NICs of all machines must stay in the same subnet, the automatically generated routes will always conflict. To make both NICs usable at the same time, each IB NIC should use its own routing table instead of the global routing table.

Referencing this solution, the following routing table modification commands are provided (taking memserver34 as an example):

# These commands must be executed as root or with sudo!

# Delete all entries related to IB NICs from the global routing table
ip route del 172.16.0.0/16 dev ib0
ip route del 172.16.0.0/16 dev ib1
# Add back IB NIC routes with table option, using different table numbers for each NIC
# This indicates that the routing information belongs to the specified routing table
ip route add 172.16.0.0/16 dev ib0 proto kernel scope link src 172.16.34.1 table 941
ip route add 172.16.0.0/16 dev ib1 proto kernel scope link src 172.16.34.2 table 942
# Specify routing rules, each IB NIC uses its own routing table
ip rule add from 172.16.34.1 table 941
ip rule add from 172.16.34.2 table 942

The configuration for cpuserver16 is similar; just replace the interface names and IPoIB addresses. After configuration, the routing state on memserver34 looks like this:

$ ip route
default via 10.208.130.254 dev enp49s0f1 proto static 
10.208.130.0/24 dev enp49s0f1 proto kernel scope link src 10.208.130.34 
$ ip rule
0:      from all lookup local 
32764:  from 172.16.34.2 lookup 942 
32765:  from 172.16.34.1 lookup 941 
32766:  from all lookup main 
32767:  from all lookup default 
$ ip route show table 941
172.16.0.0/16 dev ib0 proto kernel scope link src 172.16.34.1 
$ ip route show table 942
172.16.0.0/16 dev ib1 proto kernel scope link src 172.16.34.2 

Now the IB NIC routing information has been separated into two routing tables (ip route table) and bound through routing rules (ip rule).
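
One way to confirm that the rules take effect is to ask the kernel which device it would pick for a given source address. The sketch below uses `ip route get ... from ...` and extracts the device name from the answer; route_dev_for is a hypothetical name.

```shell
#!/bin/bash
# Hypothetical helper: print the device the kernel would use for traffic
# from a given local source address to a destination address.
# $1 = local source address, $2 = destination address
route_dev_for() {
    local src="$1" dst="$2"
    # The lookup takes the `ip rule` entries into account; the device name
    # is the word that follows "dev" in the reply.
    ip route get "$dst" from "$src" |
        awk '{ for (i = 1; i < NF; i++) if ($i == "dev") { print $(i + 1); exit } }'
}

# Usage on memserver34:
#   route_dev_for 172.16.34.1 172.16.16.1
#   route_dev_for 172.16.34.2 172.16.16.1
```

If the separation works as intended, the first call should report ib0 and the second ib1.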

Note that once the IB routes have been separated (and in particular once the global routes have been deleted), you must tell ping and rping explicitly which NIC to use with the -I option: ping takes the local interface name, while rping takes the local IPoIB address.

# 16 ping 34
ping -I ibs5f0 172.16.34.1
ping -I ibs5f0 172.16.34.2
ping -I ibs5f1 172.16.34.1
ping -I ibs5f1 172.16.34.2

# 16 rping 34(ib0)
# runs on 34
rping -s -a 172.16.34.1 -p 9401 -v
# runs on 16
rping -c -I 172.16.16.1 -a 172.16.34.1 -p 9401 -v
rping -c -I 172.16.16.2 -a 172.16.34.1 -p 9401 -v

# 16 rping 34(ib1)
# runs on 34
rping -s -a 172.16.34.2 -p 9401 -v
# runs on 16
rping -c -I 172.16.16.1 -a 172.16.34.2 -p 9401 -v
rping -c -I 172.16.16.2 -a 172.16.34.2 -p 9401 -v
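
The four ping combinations can also be run in a loop. The following is a sketch with a hypothetical function name, ping_matrix, that sends one packet per combination and prints one result line each.

```shell
#!/bin/bash
# Hypothetical helper: ping every target from every local IB interface.
# $1 = space-separated local interfaces, $2 = space-separated target addresses
ping_matrix() {
    local iface target
    for iface in $1; do
        for target in $2; do
            # -c 1: send one packet; -W 1: wait at most one second for the reply
            if ping -c 1 -W 1 -I "$iface" "$target" >/dev/null 2>&1; then
                echo "$iface -> $target: reachable"
            else
                echo "$iface -> $target: unreachable"
            fi
        done
    done
}

# Usage on cpuserver16:
#   ping_matrix "ibs5f0 ibs5f1" "172.16.34.1 172.16.34.2"
```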

Currently, the routing table modifications made via the ip route command are not permanently saved and will be lost after a reboot. You can save the above commands as a shell script and execute them after the first login following a system reboot.

$ vim ~/route-ib-mem34.sh
#!/bin/bash
set -e
# Quote the command substitution so the test stays valid even if it is empty
if [ "$(whoami)" != "root" ]; then
    echo "Error: Must run as root!" >&2
    exit 1
fi
echo "delete global ib routes"
ip route del 172.16.0.0/16 dev ib0
ip route del 172.16.0.0/16 dev ib1
echo "add ib routes with table"
ip route add 172.16.0.0/16 dev ib0 proto kernel scope link src 172.16.34.1 table 941
ip route add 172.16.0.0/16 dev ib1 proto kernel scope link src 172.16.34.2 table 942
echo "add ip rule for ib interface"
ip rule add from 172.16.34.1 table 941
ip rule add from 172.16.34.2 table 942
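
Because of set -e, the script above stops at the first ip route del if it is run a second time, since the global route is already gone. A rerunnable variant could look like the sketch below. The function name separate_ib_route is hypothetical, and using ip route replace plus a rule check is my own substitution, not part of the original solution.

```shell
#!/bin/bash
# Hypothetical helper: move one IB interface's subnet route into its own table.
# $1 = interface, $2 = its IPoIB address, $3 = table number, $4 = subnet (optional)
separate_ib_route() {
    local iface="$1" src="$2" table="$3" subnet="${4:-172.16.0.0/16}"
    # Remove the route from the main table; ignore the error if it is already gone.
    ip route del "$subnet" dev "$iface" 2>/dev/null || true
    # "replace" creates the route or overwrites an existing one, so reruns are safe.
    ip route replace "$subnet" dev "$iface" proto kernel scope link src "$src" table "$table"
    # Add the source-based rule only if it is not present yet.
    if ! ip rule show | grep -qFw "from $src lookup $table"; then
        ip rule add from "$src" table "$table"
    fi
}

# Usage on memserver34 (as root):
#   separate_ib_route ib0 172.16.34.1 941
#   separate_ib_route ib1 172.16.34.2 942
```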

Permanent Configuration

To avoid running the script by hand after every boot, the routing modifications need to be applied automatically at startup, more precisely right after each IB interface comes up.

Different Ubuntu versions, or even different machines, may use different network management software. Determine which tool currently manages the network on your machine before configuring anything; otherwise the script may never run or errors may occur.
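
A quick way to check is to ask systemd which network service is active. The sketch below is only a first hint: detect_net_manager is a hypothetical name, both services can be active at the same time, and ifupdown is not covered.

```shell
#!/bin/bash
# Hypothetical helper: report which network service is active on this machine.
detect_net_manager() {
    if systemctl is-active --quiet NetworkManager; then
        echo "NetworkManager"
    elif systemctl is-active --quiet systemd-networkd; then
        # netplan with "renderer: networkd" hands its configuration to this service
        echo "systemd-networkd"
    else
        echo "unknown"
    fi
}

# Usage:
#   detect_net_manager
```

On a NetworkManager machine, nmcli device status additionally shows whether the IB interfaces themselves are managed.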

NetworkManager / ifupdown

The cpuserver16 machine has a desktop, so its network is managed by NetworkManager, with corresponding terminal tools nmcli and nmtui. The static IP configuration for IB NICs is located in /etc/NetworkManager/system-connections/.

Referencing this solution, add a script named route-ib-cpu16 in the /etc/network/if-up.d directory. Scripts in the if-up.d directory are automatically executed when a network interface connects, with the current interface name passed through the IFACE variable.

Script content (it checks the IFACE variable to determine which interface has just come up):

#!/bin/bash
set -e
if [ "$IFACE" == "ibs5f0" ]; then
    ip route del 172.16.0.0/16 dev ibs5f0
    ip route add 172.16.0.0/16 dev ibs5f0 proto kernel scope link src 172.16.16.1 table 941
    ip rule add from 172.16.16.1 table 941
elif [ "$IFACE" == "ibs5f1" ]; then
    ip route del 172.16.0.0/16 dev ibs5f1
    ip route add 172.16.0.0/16 dev ibs5f1 proto kernel scope link src 172.16.16.2 table 942
    ip rule add from 172.16.16.2 table 942
fi

After creating the script file, executable permissions must be added: sudo chmod +x route-ib-cpu16. Then, reboot the server for the above configuration to take effect permanently.
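
One detail worth checking is the file name. Assuming the if-up.d scripts are executed through run-parts, as ifupdown does, the default run-parts naming rule only accepts letters, digits, underscores, and hyphens, so a name ending in .sh would be skipped. The sketch below mimics that rule; runparts_name_ok is a hypothetical name.

```shell
#!/bin/bash
# Hypothetical helper: succeed if a file name passes the default run-parts
# naming rule (ASCII letters, digits, underscores, and hyphens only).
runparts_name_ok() {
    case "$1" in
        ""|*[!A-Za-z0-9_-]*) return 1 ;;
        *) return 0 ;;
    esac
}

for name in route-ib-cpu16 route-ib-cpu16.sh; do
    if runparts_name_ok "$name"; then
        echo "$name: would be run"
    else
        echo "$name: would be skipped"
    fi
done
```

On the machine itself, run-parts --test /etc/network/if-up.d lists the scripts that would actually be executed, without running them.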

Additionally, based on some research, if the current Linux system uses the ifupdown tool to manage network connections, the above method might also work, but it hasn't been tested.

netplan

Unlike cpuserver16, memserver34 uses netplan to manage network connections. This machine has Ubuntu 18.04 LTS installed without a desktop, meaning it doesn't have NetworkManager.

After some exploration, the static IP configuration for IB NICs was found in the /etc/netplan/ directory:

$ ls /etc/netplan/
01-netcfg.yaml  99-netcfg.yaml
$ cat /etc/netplan/99-netcfg.yaml
network:
  version: 2
  renderer: networkd
  ethernets:
......
    ib0:
      addresses: [172.16.34.1/16]
    ib1:
      addresses: [172.16.34.2/16]

For netplan, referencing this solution, scripts that run after a network interface connects are located in /etc/networkd-dispatcher/. Routing operations specifically belong in the routable.d subdirectory. The scripting conventions are the same as above: networkd-dispatcher passes the current network interface to the script through the IFACE variable.

Create route-ib-mem34 in the /etc/networkd-dispatcher/routable.d/ directory:

#!/bin/bash
set -e
if [ "$IFACE" == "ib0" ]; then
    ip route del 172.16.0.0/16 dev ib0
    ip route add 172.16.0.0/16 dev ib0 proto kernel scope link src 172.16.34.1 table 941
    ip rule add from 172.16.34.1 table 941
elif [ "$IFACE" == "ib1" ]; then
    ip route del 172.16.0.0/16 dev ib1
    ip route add 172.16.0.0/16 dev ib1 proto kernel scope link src 172.16.34.2 table 942
    ip rule add from 172.16.34.2 table 942
fi

Similarly, executable permissions need to be added with sudo chmod +x. After rebooting, the routing configuration will take effect permanently.

If you're unsure whether the permanent configuration is correct, you can test it after reboot using the ip rule and ip route show table commands mentioned in the Separating Routing Rules section.
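
The same check can be scripted. The sketch below verifies that a source-based rule exists and that its table contains a route through the expected device; check_ib_table is a hypothetical name.

```shell
#!/bin/bash
# Hypothetical helper: check that a source-based rule and its table route exist.
# $1 = source address, $2 = table number, $3 = expected device
check_ib_table() {
    local src="$1" table="$2" dev="$3"
    if ! ip rule show | grep -qFw "from $src lookup $table"; then
        echo "missing rule: from $src lookup $table"
        return 1
    fi
    if ! ip route show table "$table" 2>/dev/null | grep -qFw "dev $dev"; then
        echo "missing route via $dev in table $table"
        return 1
    fi
    echo "ok: $src -> table $table -> $dev"
}

# Usage on memserver34 after a reboot:
#   check_ib_table 172.16.34.1 941 ib0
#   check_ib_table 172.16.34.2 942 ib1
```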
