=====InfiniBand=====

Configuration is a real pain. On my Debian hosts, it all lives in /etc/rc.local, which is gross. I haven't figured out how to do it on Ubuntu yet, though Netplan apparently supports VXLAN and IP over InfiniBand (IPoIB) now. **We'll get there**.

===The Problem===

All the InfiniBand hardware I have is Mellanox FDR-generation: ConnectX-3, Connect-IB, and SX6005 IB-only switches. (Every time I think "the Ethernet version sure would be nice," I remind myself that I'd use InfiniBand mode anyway.)

For hardware support, Mellanox provides MLNX_OFED, an overlay for several distributions. Unfortunately, MLNX_OFED supports either Debian through 11, RHEL through 8, and Ubuntu through 20.04, **OR** ConnectX-4 and newer cards only. The drivers built into Linux still recognize and work with ConnectX-3 and Connect-IB, but the kernel drivers packaged with newer MLNX_OFED won't even recognize those cards; they just show up as unclaimed PCI devices.

^ Version ^ Minimum hardware ^ Debian ^ Ubuntu ^ OpenSM version ^
| Inbox | | All | All | 3.3.23-2 |
| MLNX_OFED 4.9-x | ConnectX-2 | ≤ 11 | ≤ 20.04 | 5.7.2 |
| MLNX_OFED 5.8-x | ConnectX-4 | ≥ 9 | ≥ 18.04 | 5.17.0 |

===How I'm Getting Around It===

"Inbox" (distribution-provided) drivers and utilities are basically good enough, and will probably support my hardware until I'm ready to throw it away.

====The Inbox Part====

Install stuff:

<code>
apt install \
    rdma-core ibacm sockperf qperf ibutils infiniband-diags \
    nvme-cli srptools \
    mstflint mstflint-dkms \
    libibmad-dev libibverbs-dev libibnetdisc-dev ibverbs-utils \
    dapl2-utils libdapl-dev \
    libvma libvma-utils ucx-utils
</code>

I still downloaded MLNX_OFED 4.9-7.1.0.0, which includes .debs and sources. I don't know how to use the sources yet, but that's OK because I don't need them yet.

====The MLNX Part====

Old OpenSM has an annoying problem: if the HCA goes away while OpenSM is running, it starts spewing into its logfile and will run the system out of disk space. I don't know when that got fixed, but as of August 2024 it doesn't do it anymore.

<code>
MLNX_OFED_LINUX-24.07-0.6.1.0-debian12.5-x86_64/DEBS# dpkg -i opensm*deb libopensm*deb libibumad*deb
</code>

There's also ibdump, with which you must be very careful: it really does capture everything!

<code>
# dpkg -i ibdump_6.0.0-1.2407061_amd64.deb
</code>

=====The Subnet Manager: OpenSM=====

====Virtualization====

For virtualization to work, you have to be using Mellanox's OpenSM fork. When I installed it and ran `opensm -c` to create its default configuration, it had this block in it:

<code>
# Virtualization support
# 0: Ignore Virtualization - No virtualization support
# 1: Disable Virtualization - Disable virtualization on all
#    Virtualization supporting ports
# 2: Enable Virtualization - Enable virtualization on all
#    Virtualization supporting ports
virt_enabled 2
</code>

====Partitions: OpenSM configuration====

Atop an InfiniBand fabric, one or more partitions must be defined for hosts to join before they can communicate. IB has a concept of "full" and "partial" membership: full members can communicate with any other host in the partition, while partial members can communicate with full members but not with each other. Whether a host is a full or partial member is controlled by the high bit of the partition number. (There's also a "both" option; I can't come up with a use case for it.) Linux has very limited support for partial membership, so it's best to give all hosts in a partition full membership. Partial membership is probably useful if you have Windows guests and want to keep them isolated at layer 2; I have no need to play in that playground.
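Concretely, the membership bit is the top bit of the 16-bit P_Key: set it for full membership, leave it clear for partial. A quick shell sanity check, using partition numbers that appear later on this page:

<code>
# 0x8000 is the full-membership bit
$ printf '0x%x\n' $(( 0x3128 | 0x8000 ))
0xb128
$ printf '0x%x\n' $(( 0x7fff | 0x8000 ))
0xffff
</code>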
The default partition configuration sucks. Make it look like this:

<code>
Default=0x7fff, ipoib, rate=12, mtu=5, scope=2, defmember=full:
    mgid=ff12:401b::ffff:ffff   # IPv4 Broadcast address
    mgid=ff12:401b::1           # IPv4 All Hosts group
    mgid=ff12:401b::2           # IPv4 All Routers group
    mgid=ff12:401b::16          # IPv4 IGMP group
    mgid=ff12:401b::fb          # IPv4 mDNS group
    mgid=ff12:401b::fc          # IPv4 Multicast Link Local Name Resolution group
    mgid=ff12:401b::101         # IPv4 NTP group
    mgid=ff12:401b::202         # IPv4 Sun RPC
    mgid=ff12:601b::1           # IPv6 All Hosts group
    mgid=ff12:601b::2           # IPv6 All Routers group
    mgid=ff12:601b::16          # IPv6 MLDv2-capable Routers group
    mgid=ff12:601b::fb          # IPv6 mDNS group
    mgid=ff12:601b::101         # IPv6 NTP group
    mgid=ff12:601b::202         # IPv6 Sun RPC group
    mgid=ff12:601b::1:3         # IPv6 Multicast Link Local Name Resolution group
    ALL=full, ALL_SWITCHES=full;
</code>

The default config file only lists rates through QDR. For FDR and newer rates, see include/iba/ib_types.h in the OpenSM source repository; they're listed as parameters to ib_path_rec_rate() in the comments below the function declaration. 12 is correct for FDR x4 links. (Wider links are only possible between managed switches, and support for narrower links got removed at some point.)

Here's a block for my ATA over Ethernet experiments; it's subject to change. The IP addresses are necessary for setting up VXLAN tunnels. Checking whether IPv6 tunnels perform differently from IPv4 tunnels is on the to-do list; I suspect they perform better, but that needs testing.

<code>
storage=0xb128, ipoib, rate=12, mtu=5, scope=2, defmember=full:
    mgid=ff12:401b::ffff:ffff   # IPv4 Broadcast address
    mgid=ff12:401b::1           # IPv4 All Hosts group
    ALL=full, ALL_SWITCHES=full;
</code>

====Partitions: Host configuration====

There's no functional netlink interface for creating child interfaces; you must use the sysfs interface.

<code>
# echo 0xb128 > /sys/class/net/ib0/create_child
</code>

Resist the temptation to rename the interface to something descriptive. **It's already self-descriptive**. Creative naming is for VXLAN tunnels and bridges, e.g.:

<code>
# ip link add vx128 type vxlan id 128 local 172.20.128.13 group 225.172.20.128
# ip link set master aoe1 dev vx128
</code>

The sysfs interface for deleting child interfaces doesn't work (for me, at least); you must use the netlink interface.

<code>
# ip link del ib0.b128
</code>

If you leave the high bit unset on the partition number (0x3128 instead of 0xb128), Linux will set the high bit before joining the partition. If OpenSM's configuration has that partition's membership set to "partial" or "both", the Linux host will not be able to connect to everything on that subnet, or possibly //anything// on that subnet, regardless of which value you use. It's worth finding out whether Netplan can manage IB child interfaces.

====Connected vs. Datagram====

IPoIB can run in one of three modes:

  - Datagram, in which the IP MTU matches the subnet MTU. Performance is pretty trash.
  - Connected, in which the MTU is limited only by kernel and network structures. Practically, this means 64k, or 65520 bytes after protocol overhead is accounted for. Performance for some workloads is greatly improved.
  - Enhanced, which is not compatible with Connected mode but has more offloads and is (from what I hear) generally better.
    * Enhanced IPoIB is only available on ConnectX-4 and newer cards.

IPoIB is still a dog, performance-wise. It's a very fast, very expensive winmodem. It should be thought of as a way of getting IP addresses, which you then use to set up RDMA-aware protocols.

As far as I can tell, IPoIB interfaces are always created in datagram mode. The documentation says that Connect-IB cards default to connected mode; that has not been my experience. There's also no way to set connected mode as the default when the module is loaded, or otherwise to ensure that interfaces get created in connected mode. Before changing modes, the interface must be down. The netlink interface for setting the connection mode has not worked for me; I might just be doing it wrong:

<code>
# ip link set mode connected ib0.b129
Error: argument "connected" is wrong: Invalid link mode
# ^connected^datagram
ip link set mode datagram ib0.b129
Error: argument "datagram" is wrong: Invalid link mode
</code>

The sysfs interface works:

<code>
# echo connected > /sys/class/net/ib0.b129/mode
</code>

Since I don't have any newer hardware, I don't have any information about Enhanced IPoIB.
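Putting the sysfs pieces together, a child-interface bring-up for the storage partition might look like this in /etc/rc.local. This is only a sketch: the address, prefix length, and the jump to a 65520-byte MTU are illustrative, not a recommendation.

<code>
# create the 0xb128 child on ib0, switch it to connected mode (the
# interface must be down first), raise the MTU, and address it
echo 0xb128 > /sys/class/net/ib0/create_child
ip link set ib0.b128 down
echo connected > /sys/class/net/ib0.b128/mode
ip link set ib0.b128 mtu 65520 up
ip addr add 172.20.128.13/24 dev ib0.b128
</code>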
=====SR-IOV=====

====Hardware Settings====

The BIOS needs to have SR-IOV, ARI, and ACS support enabled. On Connect-IB cards, these firmware options need to be set:

<code>
NUM_OF_VFS    7
NUM_OF_PF     1
FPP_EN        True(1)
SRIOV_EN      True(1)
</code>

FPP_EN (Function Per Port ENable) controls whether the card appears as two PCI devices or as a single device with two ports. Under mlx4, every VF on a dual-port HCA has both ports, and NUM_OF_VFS is how many dual-port devices to create. Under mlx5, each port gets its own pool of VFs, and NUM_OF_VFS is per-port.

I haven't tried large numbers of VFs. The hardware upper limit is 63 for Connect-IB and 127 for ConnectX-3, but any number of system components could impose lower limits. For example, my consumer boards that are SR-IOV capable can only have VFs on port 1, not on port 2; the EPYC server system can create VFs on both ports. I don't expect to need so many guests with IOV networking anyway...
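Those firmware options can be queried and set from Linux with mstconfig, which ships in the mstflint package installed earlier. This is only a sketch: it assumes the open-source mstconfig accepts these TLV names on a Connect-IB, and uses the PCI address from the lspci output below; new values don't take effect until the card is reset (a reboot).

<code>
# query current firmware configuration (06:00.0 is this host's Connect-IB)
mstconfig -d 06:00.0 query
# set the options listed above; takes effect after a reboot
mstconfig -d 06:00.0 set SRIOV_EN=1 NUM_OF_VFS=7 FPP_EN=1 NUM_OF_PF=1
</code>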
To make VFs exist, put a number <= NUM_OF_VFS into sriov_numvfs for that device. Before doing so, I recommend turning off VF probing; otherwise the VFs will all create IPoIB interfaces, which probably isn't what you want. The setting is per PF. I'm still checking whether there's a way to configure the driver so this becomes the default.

<code>
# echo 0 > /sys/class/infiniband/ibp13s0f0/device/sriov_drivers_autoprobe
# echo 0 > /sys/class/infiniband/ibp13s0f1/device/sriov_drivers_autoprobe
</code>

If it works, there will be new PCI devices, as well as VFs listed under `ip link`:

<code>
# echo 7 > /sys/class/infiniband/ibp6s0f0/device/sriov_numvfs
# no output on success; check dmesg for interesting but probably useless messages.
# lspci | grep nfi
06:00.0 Infiniband controller: Mellanox Technologies MT27600 [Connect-IB]
06:00.1 Infiniband controller: Mellanox Technologies MT27600 [Connect-IB]
06:00.2 Infiniband controller: Mellanox Technologies MT27600 Family [Connect-IB Virtual Function]
06:00.3 Infiniband controller: Mellanox Technologies MT27600 Family [Connect-IB Virtual Function]
06:00.4 Infiniband controller: Mellanox Technologies MT27600 Family [Connect-IB Virtual Function]
06:00.5 Infiniband controller: Mellanox Technologies MT27600 Family [Connect-IB Virtual Function]
06:00.6 Infiniband controller: Mellanox Technologies MT27600 Family [Connect-IB Virtual Function]
06:00.7 Infiniband controller: Mellanox Technologies MT27600 Family [Connect-IB Virtual Function]
06:01.0 Infiniband controller: Mellanox Technologies MT27600 Family [Connect-IB Virtual Function]
</code>

Warning: the output from `ip link` is very wide.

<code>
# ip link
1: lo: mtu 65536 qdisc noqueue state UNKNOWN mode DEFAULT group default qlen 1000
    link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
2: eth0: mtu 1500 qdisc noop state DOWN mode DEFAULT group default qlen 1000
    link/ether 3c:ec:ef:6d:10:62 brd ff:ff:ff:ff:ff:ff
3: eth1: mtu 1500 qdisc noop state DOWN mode DEFAULT group default qlen 1000
    link/ether 3c:ec:ef:6d:10:63 brd ff:ff:ff:ff:ff:ff
4: ib0: mtu 4092 qdisc fq_codel state UP mode DEFAULT group default qlen 256
    link/infiniband 80:00:00:29:fe:80:00:00:00:00:00:00:58:49:56:0e:53:b7:0b:01 brd 00:ff:ff:ff:ff:12:40:1b:ff:ff:00:00:00:00:00:00:ff:ff:ff:ff
    vf 0 link/infiniband 80:00:00:29:fe:80:00:00:00:00:00:00:58:49:56:0e:53:b7:0b:01 brd 00:ff:ff:ff:ff:12:40:1b:ff:ff:00:00:00:00:00:00:ff:ff:ff:ff, spoof checking off, NODE_GUID 00:00:00:00:00:00:00:00, PORT_GUID 00:00:00:00:00:00:00:00, link-state disable, trust off, query_rss off
    vf 1 link/infiniband 80:00:00:29:fe:80:00:00:00:00:00:00:58:49:56:0e:53:b7:0b:01 brd 00:ff:ff:ff:ff:12:40:1b:ff:ff:00:00:00:00:00:00:ff:ff:ff:ff, spoof checking off, NODE_GUID 00:00:00:00:00:00:00:00, PORT_GUID 00:00:00:00:00:00:00:00, link-state disable, trust off, query_rss off
    vf 2 link/infiniband 80:00:00:29:fe:80:00:00:00:00:00:00:58:49:56:0e:53:b7:0b:01 brd 00:ff:ff:ff:ff:12:40:1b:ff:ff:00:00:00:00:00:00:ff:ff:ff:ff, spoof checking off, NODE_GUID 00:00:00:00:00:00:00:00, PORT_GUID 00:00:00:00:00:00:00:00, link-state disable, trust off, query_rss off
    vf 3 link/infiniband 80:00:00:29:fe:80:00:00:00:00:00:00:58:49:56:0e:53:b7:0b:01 brd 00:ff:ff:ff:ff:12:40:1b:ff:ff:00:00:00:00:00:00:ff:ff:ff:ff, spoof checking off, NODE_GUID 00:00:00:00:00:00:00:00, PORT_GUID 00:00:00:00:00:00:00:00, link-state disable, trust off, query_rss off
    vf 4 link/infiniband 80:00:00:29:fe:80:00:00:00:00:00:00:58:49:56:0e:53:b7:0b:01 brd 00:ff:ff:ff:ff:12:40:1b:ff:ff:00:00:00:00:00:00:ff:ff:ff:ff, spoof checking off, NODE_GUID 00:00:00:00:00:00:00:00, PORT_GUID 00:00:00:00:00:00:00:00, link-state disable, trust off, query_rss off
    vf 5 link/infiniband 80:00:00:29:fe:80:00:00:00:00:00:00:58:49:56:0e:53:b7:0b:01 brd 00:ff:ff:ff:ff:12:40:1b:ff:ff:00:00:00:00:00:00:ff:ff:ff:ff, spoof checking off, NODE_GUID 00:00:00:00:00:00:00:00, PORT_GUID 00:00:00:00:00:00:00:00, link-state disable, trust off, query_rss off
    vf 6 link/infiniband 80:00:00:29:fe:80:00:00:00:00:00:00:58:49:56:0e:53:b7:0b:01 brd 00:ff:ff:ff:ff:12:40:1b:ff:ff:00:00:00:00:00:00:ff:ff:ff:ff, spoof checking off, NODE_GUID 00:00:00:00:00:00:00:00, PORT_GUID 00:00:00:00:00:00:00:00, link-state disable, trust off, query_rss off
5: ib1: mtu 4092 qdisc fq_codel state DOWN mode DEFAULT group default qlen 256
    link/infiniband 80:00:00:28:fe:80:00:00:00:00:00:00:58:49:56:0e:53:b7:0b:09 brd 00:ff:ff:ff:ff:12:40:1b:ff:ff:00:00:00:00:00:00:ff:ff:ff:ff
</code>

====VF Configuration====

The official documentation covers a sysfs interface for configuring VF properties. That interface hasn't existed for years. Before using a VF, you must set node_guid, port_guid, and state using ip link. Make port_guid and node_guid equal, and unique on the fabric. (I use the base port GUID + VF number + 1.) GUIDs need to be set before attaching a VF to a VM. It should be possible to change state (simulating unplugging the cable) while a VM is using a VF, but I haven't tested this. Configuration is managed in /etc/rc.local.
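For example, for VF 0 on ib0, deriving the GUID from the PF port GUID shown in the `ip link` output above (58:49:56:0e:53:b7:0b:01) plus one — the exact values are illustrative:

<code>
# node_guid == port_guid == PF port GUID + VF number + 1
ip link set dev ib0 vf 0 node_guid 58:49:56:0e:53:b7:0b:02
ip link set dev ib0 vf 0 port_guid 58:49:56:0e:53:b7:0b:02
ip link set dev ib0 vf 0 state enable
</code>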
=====Upper-Layer Protocols (ULPs)=====

RDMA opens all kinds of possibilities for RDMA-aware protocols to be amazing and fast. They probably all deserve their own pages.

===IP over InfiniBand (IPoIB)===

It's already been used extensively on this page. IPoIB encapsulates IP traffic in InfiniBand datagrams so that protocols built for Ethernet (primarily TCP and UDP) can run on IB networks. Performance is generally poor. There are some hacks to make it faster, but the real reason to use IPoIB is to give your IB hosts IP addresses for setting up other ULPs.

===VMA===

libvma speeds up normal TCP sockets by transparently replacing them with RDMA transfers. It's basically the successor to Sockets Direct Protocol (SDP), which is no longer maintained. VMA might be useful for accelerating connections with:

  * Apache
  * PHP
  * MySQL (though I think MySQL has its own RDMA backend?)
  * SSH (and consequently rsync and scp, right?)
  * (It would also be cool to find out if newer cards with crypto functions could do hardware-accelerated SSH with RDMA.)

None of these services is so performance-critical that I'll spend time configuring it for VMA, except maybe as a learning exercise. **Later**.

====Storage====

  * NFS/RDMA (probably needs a page)
    * rpcrdma kernel module
    * insecure in /etc/exports
    * stuff in /etc?
  * See [[iSCSI]] for information about enabling and using iSER.
  * See [[NVMe-oF]] for using NVMe over InfiniBand.
  * I'm curious about the performance of SRP. I have gotten it to work before, but it's not working now. Also not a huge priority.

====Networking====

===VXLAN===

VXLAN is not the only way to get an Ethernet device on InfiniBand, but as far as I can tell it's the only decent one. None of my hardware has VXLAN offload support.

  * The VXLAN ID can be anything from 0 to 16777215 inclusive. I make it match the network number.
  * You're making a virtual network interface.
  * "local" is the IP address of the parent interface. (Probably 172.20.64.something.)
  * "group" is a multicast IP address.
    * The first octet is 225 (two-twenty-five), not 255 (two-fifty-five).
    * I make the last three octets match the first three octets of the network, so ibnet0 will be 225.172.20.64.

<code>
export vxlan=64
export vxlan_name=ibnet0   # this doesn't get used by anything or stored anywhere
export local=172.20.64.9
export group=225.172.20.64
export dev=ib0
ip link add name vxlan$vxlan type vxlan id $vxlan local $local group $group dev $dev dstport 0
</code>

Once the VXLAN interface exists, it can be added to a bridge:

<code>
ip link set master green dev vxlan64
</code>

//Et voilà//, Ethernet on top of InfiniBand. If/when I get hardware capable of VXLAN offload, the dstport might have to change.

====Multimedia====

Yeah, someday I want to throw video frames around. There's an RFC or ISO standard for that, IIRC. There's also lgproxy, which is RDMA-aware. I also want to throw audio frames around with "no latency added". Someday, someday, someday.

=====GUIDs=====

  * 5849560e59150301 - shark Connect-IB
  * 5849560e53b70b01 - southpark Connect-IB
  * 5849560e53660101 - duckling Connect-IB
  * 7cfe900300a0a080 - uninstalled Connect-IB
  * (there are several more uninstalled Connect-IB cards)
  * f4521403002c18b0 - uninstalled ConnectX-3 2014-01-29
  * 0002c90300b37f10 - uninstalled ConnectX-3 with no date on the label
  * 001175000079b560 - uninstalled qib
  * 001175000079b856 - uninstalled qib