InfiniBand

Configuration is a real pain. On my Debian hosts, it's all in /etc/rc.local, which is gross. I haven't figured out how to do it on Ubuntu yet; Netplan apparently has support for vxlan and ipoib now, I think, so we'll get there.

The Problem

All the InfiniBand hardware I have is Mellanox FDR-generation: ConnectX-3, Connect-IB, and SX60xx IB-only switches. The officially-supported distribution overlay (MLNX_OFED) either supports Debian through 11, RHEL through 8, and Ubuntu through 20.04, or it only supports ConnectX-4 and newer cards. The newer drivers won't even load for CX-3 or C-IB adapters.

  MLNX_OFED version   Minimum hardware   Debian   Ubuntu    OpenSM version
  Inbox               All                All      All       3.3.23-2
  4.9-x               ConnectX-2         ≤ 11     ≤ 20.04   5.7.2
  5.8-x               ConnectX-4         ≥ 9      ≥ 18.04   5.17.0

How I'm Getting Around It

“Inbox” (distribution-provided) drivers and utilities are basically good enough, and will probably support my hardware until I'm ready to throw it away.

The Inbox Part

Install stuff:

  apt install \
      rdma-core ibacm sockperf qperf ibutils infiniband-diags \
      nvme-cli srptools \
      mstflint mstflint-dkms \
      libibmad-dev libibverbs-dev libibnetdisc-dev ibverbs-utils \
      dapl2-utils libdapl-dev \
      libvma libvma-utils ucx-utils

I still downloaded MLNX_OFED 4.9-7.1.0.0, which includes .debs and sources. I don't know how to use the sources yet, but that's OK because I don't need them.

The MLNX Part

It is possible to install OpenSM 5.7 from OFED 4.9; there may be other parts of MLNX_OFED worth stealing too, but I haven't tried. From this path, run this command:

  MLNX_OFED_LINUX-4.9-7.1.0.0-ubuntu20.04-x86_64/DEBS/MLNX_LIBS# dpkg -i opensm*deb libopensm*deb libibumad*deb

There's also ibdump, with which you must be very careful: it really does capture everything!

  # dpkg -i ibdump_6.0.0-1.49710_amd64.deb
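
Basic usage is something like this (the device name, port, and output path are just examples; double-check the flags against ibdump's own help). The result is an ordinary pcap you can open in Wireshark:

  # capture everything crossing port 1 of the first mlx4 HCA into a pcap
  ibdump -d mlx4_0 -i 1 -w /tmp/fabric.pcap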

SR-IOV

OpenSM Configuration

For virtualization to work, you have to be using Mellanox's OpenSM fork. When I installed it and ran `opensm -c` to create its default configuration, it had this block in it:

  # Virtualization support
  # 0: Ignore Virtualization - No virtualization support
  # 1: Disable Virtualization - Disable virtualization on all
  #                              Virtualization supporting ports
  # 2: Enable Virtualization - Enable (virtualization on all
  #                             Virtualization supporting ports
  virt_enabled 2
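
For reference, a minimal sketch of generating that config file and making sure virt_enabled is set, assuming the config lives at /etc/opensm/opensm.conf and the service unit is called opensm or opensmd (both vary by package):

  # dump the default config, make sure virtualization is on, restart the SM
  opensm --create-config /etc/opensm/opensm.conf
  sed -i 's/^virt_enabled .*/virt_enabled 2/' /etc/opensm/opensm.conf
  systemctl restart opensm    # or opensmd, depending on which package provided it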

In partitions.conf, well, just make the default partition look like this:

  Default=0x7fff, ipoib, rate=12, mtu=5, scope=2, defmember=full:
      mgid=ff12:401b::ffff:ffff   # IPv4 Broadcast address
      mgid=ff12:401b::1           # IPv4 All Hosts group
      mgid=ff12:401b::2           # IPv4 All Routers group
      mgid=ff12:401b::16          # IPv4 IGMP group
      mgid=ff12:401b::fb          # IPv4 mDNS group
      mgid=ff12:401b::fc          # IPv4 Multicast Link Local Name Resolution group
      mgid=ff12:401b::101         # IPv4 NTP group
      mgid=ff12:401b::202         # IPv4 Sun RPC
      mgid=ff12:601b::1           # IPv6 All Hosts group
      mgid=ff12:601b::2           # IPv6 All Routers group
      mgid=ff12:601b::16          # IPv6 MLDv2-capable Routers group
      mgid=ff12:601b::fb          # IPv6 mDNS group
      mgid=ff12:601b::101         # IPv6 NTP group
      mgid=ff12:601b::202         # IPv6 Sun RPC group
      mgid=ff12:601b::1:3         # IPv6 Multicast Link Local Name Resolution group
      ALL=full, ALL_SWITCHES=full;

It's important to use “full” and not “both” or “limited” for every group membership. Otherwise, VFs inside VMs will end up with a pkey of 0x8000, which is invalid, and they'll show NO-CARRIER.
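
A quick way to check from inside a guest is to read the VF's pkey table out of sysfs (the device name mlx5_0 is just an example; yours may differ):

  # 0xffff = full member of the default partition (good)
  # 0x7fff = limited member; 0x8000 = no real partition assigned (bad)
  cat /sys/class/infiniband/mlx5_0/ports/1/pkeys/0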

Hardware Settings

The BIOS needs to have SR-IOV, ARI, and ACS support enabled.
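
A quick sanity check that those capabilities are actually exposed on the HCA (the PCI address 06:00.0 matches the example host later on this page; the grep patterns are approximate):

  # look for the SR-IOV, ARI, and Access Control Services capability blocks
  lspci -vvv -s 06:00.0 | grep -E 'SR-IOV|Alternative Routing|Access Control'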

On Connect-IB cards, these firmware options need to be set:

  NUM_OF_VFS                                  7     
  NUM_OF_PF                                   1
  FPP_EN                                      True(1)
  SRIOV_EN                                    True(1)
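
These can be set with mstconfig from the mstflint package installed above; a sketch, assuming the same PCI address as the lspci output below (the card needs a reboot before new firmware values take effect):

  # query current firmware configuration, then enable SR-IOV with 7 VFs and function-per-port mode
  mstconfig -d 06:00.0 query
  mstconfig -d 06:00.0 set SRIOV_EN=1 NUM_OF_VFS=7 FPP_EN=1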

FPP_EN (Function Per Port) controls whether the card appears as two PCI devices, one per port, or as a single device with two ports. Under mlx4, every VF on a dual-port HCA has both ports, and NUM_OF_VFS is how many dual-port devices to create. Under mlx5, each port gets its own pool of VFs and NUM_OF_VFS is per-port.

I haven't tried large numbers of VFs. The hardware upper limit is 63 for Connect-IB and 127 for ConnectX-3. Firmware might have a lower limit than that. I don't expect to need that many guests with IOV networking…

After a reboot, there should be a new file, /sys/bus/pci/devices/0000:b:d:f/sriov_numvfs. Try writing the number of VFs you want into it. If it works, there will be new PCI devices as well as VFs listed under `ip link`:

  # echo 7 > /sys/class/infiniband/ibp6s0f0/device/sriov_numvfs # no output on success; check dmesg for interesting but probably useless messages.
  
  # lspci | grep nfi
  06:00.0 Infiniband controller: Mellanox Technologies MT27600 [Connect-IB]
  06:00.1 Infiniband controller: Mellanox Technologies MT27600 [Connect-IB]
  06:00.2 Infiniband controller: Mellanox Technologies MT27600 Family [Connect-IB Virtual Function]
  06:00.3 Infiniband controller: Mellanox Technologies MT27600 Family [Connect-IB Virtual Function]
  06:00.4 Infiniband controller: Mellanox Technologies MT27600 Family [Connect-IB Virtual Function]
  06:00.5 Infiniband controller: Mellanox Technologies MT27600 Family [Connect-IB Virtual Function]
  06:00.6 Infiniband controller: Mellanox Technologies MT27600 Family [Connect-IB Virtual Function]
  06:00.7 Infiniband controller: Mellanox Technologies MT27600 Family [Connect-IB Virtual Function]
  06:01.0 Infiniband controller: Mellanox Technologies MT27600 Family [Connect-IB Virtual Function]

Warning: The output from ip link is very wide.

  # ip link
  1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN mode DEFAULT group default qlen 1000
      link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
  2: eth0: <BROADCAST,MULTICAST> mtu 1500 qdisc noop state DOWN mode DEFAULT group default qlen 1000
      link/ether 3c:ec:ef:6d:10:62 brd ff:ff:ff:ff:ff:ff
  3: eth1: <BROADCAST,MULTICAST> mtu 1500 qdisc noop state DOWN mode DEFAULT group default qlen 1000
      link/ether 3c:ec:ef:6d:10:63 brd ff:ff:ff:ff:ff:ff
  4: ib0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 4092 qdisc fq_codel state UP mode DEFAULT group default qlen 256
      link/infiniband 80:00:00:29:fe:80:00:00:00:00:00:00:58:49:56:0e:53:b7:0b:01 brd 00:ff:ff:ff:ff:12:40:1b:ff:ff:00:00:00:00:00:00:ff:ff:ff:ff
      vf 0     link/infiniband 80:00:00:29:fe:80:00:00:00:00:00:00:58:49:56:0e:53:b7:0b:01 brd 00:ff:ff:ff:ff:12:40:1b:ff:ff:00:00:00:00:00:00:ff:ff:ff:ff, spoof checking off, NODE_GUID 00:00:00:00:00:00:00:00, PORT_GUID 00:00:00:00:00:00:00:00, link-state disable, trust off, query_rss off
      vf 1     link/infiniband 80:00:00:29:fe:80:00:00:00:00:00:00:58:49:56:0e:53:b7:0b:01 brd 00:ff:ff:ff:ff:12:40:1b:ff:ff:00:00:00:00:00:00:ff:ff:ff:ff, spoof checking off, NODE_GUID 00:00:00:00:00:00:00:00, PORT_GUID 00:00:00:00:00:00:00:00, link-state disable, trust off, query_rss off
      vf 2     link/infiniband 80:00:00:29:fe:80:00:00:00:00:00:00:58:49:56:0e:53:b7:0b:01 brd 00:ff:ff:ff:ff:12:40:1b:ff:ff:00:00:00:00:00:00:ff:ff:ff:ff, spoof checking off, NODE_GUID 00:00:00:00:00:00:00:00, PORT_GUID 00:00:00:00:00:00:00:00, link-state disable, trust off, query_rss off
      vf 3     link/infiniband 80:00:00:29:fe:80:00:00:00:00:00:00:58:49:56:0e:53:b7:0b:01 brd 00:ff:ff:ff:ff:12:40:1b:ff:ff:00:00:00:00:00:00:ff:ff:ff:ff, spoof checking off, NODE_GUID 00:00:00:00:00:00:00:00, PORT_GUID 00:00:00:00:00:00:00:00, link-state disable, trust off, query_rss off
      vf 4     link/infiniband 80:00:00:29:fe:80:00:00:00:00:00:00:58:49:56:0e:53:b7:0b:01 brd 00:ff:ff:ff:ff:12:40:1b:ff:ff:00:00:00:00:00:00:ff:ff:ff:ff, spoof checking off, NODE_GUID 00:00:00:00:00:00:00:00, PORT_GUID 00:00:00:00:00:00:00:00, link-state disable, trust off, query_rss off
      vf 5     link/infiniband 80:00:00:29:fe:80:00:00:00:00:00:00:58:49:56:0e:53:b7:0b:01 brd 00:ff:ff:ff:ff:12:40:1b:ff:ff:00:00:00:00:00:00:ff:ff:ff:ff, spoof checking off, NODE_GUID 00:00:00:00:00:00:00:00, PORT_GUID 00:00:00:00:00:00:00:00, link-state disable, trust off, query_rss off
      vf 6     link/infiniband 80:00:00:29:fe:80:00:00:00:00:00:00:58:49:56:0e:53:b7:0b:01 brd 00:ff:ff:ff:ff:12:40:1b:ff:ff:00:00:00:00:00:00:ff:ff:ff:ff, spoof checking off, NODE_GUID 00:00:00:00:00:00:00:00, PORT_GUID 00:00:00:00:00:00:00:00, link-state disable, trust off, query_rss off
  5: ib1: <NO-CARRIER,BROADCAST,MULTICAST,UP> mtu 4092 qdisc fq_codel state DOWN mode DEFAULT group default qlen 256
      link/infiniband 80:00:00:28:fe:80:00:00:00:00:00:00:58:49:56:0e:53:b7:0b:09 brd 00:ff:ff:ff:ff:12:40:1b:ff:ff:00:00:00:00:00:00:ff:ff:ff:ff
  10: ib2: <BROADCAST,MULTICAST> mtu 4092 qdisc noop state DOWN mode DEFAULT group default qlen 256
      link/infiniband 80:00:00:27:fe:80:00:00:00:00:00:00:00:00:00:00:00:00:00:00 brd 00:ff:ff:ff:ff:12:40:1b:ff:ff:00:00:00:00:00:00:ff:ff:ff:ff
  11: ib3: <BROADCAST,MULTICAST> mtu 4092 qdisc noop state DOWN mode DEFAULT group default qlen 256
      link/infiniband 80:00:00:27:fe:80:00:00:00:00:00:00:00:00:00:00:00:00:00:00 brd 00:ff:ff:ff:ff:12:40:1b:ff:ff:00:00:00:00:00:00:ff:ff:ff:ff
  12: ib4: <BROADCAST,MULTICAST> mtu 4092 qdisc noop state DOWN mode DEFAULT group default qlen 256
      link/infiniband 80:00:00:27:fe:80:00:00:00:00:00:00:00:00:00:00:00:00:00:00 brd 00:ff:ff:ff:ff:12:40:1b:ff:ff:00:00:00:00:00:00:ff:ff:ff:ff
  13: ib5: <BROADCAST,MULTICAST> mtu 4092 qdisc noop state DOWN mode DEFAULT group default qlen 256
      link/infiniband 80:00:00:27:fe:80:00:00:00:00:00:00:00:00:00:00:00:00:00:00 brd 00:ff:ff:ff:ff:12:40:1b:ff:ff:00:00:00:00:00:00:ff:ff:ff:ff
  14: ib6: <BROADCAST,MULTICAST> mtu 4092 qdisc noop state DOWN mode DEFAULT group default qlen 256
      link/infiniband 80:00:00:27:fe:80:00:00:00:00:00:00:00:00:00:00:00:00:00:00 brd 00:ff:ff:ff:ff:12:40:1b:ff:ff:00:00:00:00:00:00:ff:ff:ff:ff
  15: ib7: <BROADCAST,MULTICAST> mtu 4092 qdisc noop state DOWN mode DEFAULT group default qlen 256
      link/infiniband 80:00:00:27:fe:80:00:00:00:00:00:00:00:00:00:00:00:00:00:00 brd 00:ff:ff:ff:ff:12:40:1b:ff:ff:00:00:00:00:00:00:ff:ff:ff:ff
  16: ib8: <BROADCAST,MULTICAST> mtu 4092 qdisc noop state DOWN mode DEFAULT group default qlen 256
      link/infiniband 80:00:00:27:fe:80:00:00:00:00:00:00:00:00:00:00:00:00:00:00 brd 00:ff:ff:ff:ff:12:40:1b:ff:ff:00:00:00:00:00:00:ff:ff:ff:ff

VF Configuration

To set the GUID for each VF, set node_guid, port_guid, and state using ip link. Make port_guid and node_guid the same value, and make that value unique across the fabric. (I use the PF's port GUID + VF index + 1.)
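
The per-host lists below are the literal commands. The same thing could be generated with a small loop, assuming the GUID scheme above; BASE is this host's PF port GUID, and this sketch doesn't handle the last byte wrapping past 0xff:

  # Assign each VF a GUID derived from the PF port GUID: base + (vf index + 1).
  BASE="58:49:56:0e:53:b7:0b:01"   # PF port GUID for this host (southpark)
  PREFIX=${BASE%:*}                # everything except the last byte
  LAST=$((16#${BASE##*:}))         # last byte as a number

  for vf in 0 1 2 3 4 5 6; do
      guid=$(printf '%s:%02x' "$PREFIX" $((LAST + vf + 1)))
      ip link set dev ib0 vf "$vf" node_guid "$guid"
      ip link set dev ib0 vf "$vf" port_guid "$guid"
      ip link set dev ib0 vf "$vf" state enable
  done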

Lazy copy-pasta for southpark:

  ip link set dev ib0 vf 0 node_guid 58:49:56:0e:53:b7:0b:02
  ip link set dev ib0 vf 0 port_guid 58:49:56:0e:53:b7:0b:02
  ip link set dev ib0 vf 0 state enable
  ip link set dev ib0 vf 1 node_guid 58:49:56:0e:53:b7:0b:03
  ip link set dev ib0 vf 1 port_guid 58:49:56:0e:53:b7:0b:03
  ip link set dev ib0 vf 1 state enable
  ip link set dev ib0 vf 2 node_guid 58:49:56:0e:53:b7:0b:04
  ip link set dev ib0 vf 2 port_guid 58:49:56:0e:53:b7:0b:04
  ip link set dev ib0 vf 2 state enable
  ip link set dev ib0 vf 3 node_guid 58:49:56:0e:53:b7:0b:05
  ip link set dev ib0 vf 3 port_guid 58:49:56:0e:53:b7:0b:05
  ip link set dev ib0 vf 3 state enable
  ip link set dev ib0 vf 4 node_guid 58:49:56:0e:53:b7:0b:06
  ip link set dev ib0 vf 4 port_guid 58:49:56:0e:53:b7:0b:06
  ip link set dev ib0 vf 4 state enable
  ip link set dev ib0 vf 5 node_guid 58:49:56:0e:53:b7:0b:07
  ip link set dev ib0 vf 5 port_guid 58:49:56:0e:53:b7:0b:07
  ip link set dev ib0 vf 5 state enable
  ip link set dev ib0 vf 6 node_guid 58:49:56:0e:53:b7:0b:08
  ip link set dev ib0 vf 6 port_guid 58:49:56:0e:53:b7:0b:08
  ip link set dev ib0 vf 6 state enable

Lazy copy-pasta for sadness:

  ip link set dev ib0 vf 0 node_guid 58:49:56:0e:58:5c:03:02
  ip link set dev ib0 vf 0 port_guid 58:49:56:0e:58:5c:03:02
  ip link set dev ib0 vf 0 state enable
  ip link set dev ib0 vf 1 node_guid 58:49:56:0e:58:5c:03:03
  ip link set dev ib0 vf 1 port_guid 58:49:56:0e:58:5c:03:03
  ip link set dev ib0 vf 1 state enable
  ip link set dev ib0 vf 2 node_guid 58:49:56:0e:58:5c:03:04
  ip link set dev ib0 vf 2 port_guid 58:49:56:0e:58:5c:03:04
  ip link set dev ib0 vf 2 state enable
  ip link set dev ib0 vf 3 node_guid 58:49:56:0e:58:5c:03:05
  ip link set dev ib0 vf 3 port_guid 58:49:56:0e:58:5c:03:05
  ip link set dev ib0 vf 3 state enable
  ip link set dev ib0 vf 4 node_guid 58:49:56:0e:58:5c:03:06
  ip link set dev ib0 vf 4 port_guid 58:49:56:0e:58:5c:03:06
  ip link set dev ib0 vf 4 state enable
  ip link set dev ib0 vf 5 node_guid 58:49:56:0e:58:5c:03:07
  ip link set dev ib0 vf 5 port_guid 58:49:56:0e:58:5c:03:07
  ip link set dev ib0 vf 5 state enable
  ip link set dev ib0 vf 6 node_guid 58:49:56:0e:58:5c:03:08
  ip link set dev ib0 vf 6 port_guid 58:49:56:0e:58:5c:03:08
  ip link set dev ib0 vf 6 state enable

Lazy copy-pasta for shark:

  ip link set dev ib0 vf 0 node_guid 58:49:56:0e:59:11:02:02
  ip link set dev ib0 vf 0 port_guid 58:49:56:0e:59:11:02:02
  ip link set dev ib0 vf 0 state enable

Upper-Layer Protocols (ULPs)

RDMA opens all kinds of possibilities for RDMA-aware protocols to be amazing and fast.

VMA

libvma is a way of speeding up normal TCP sockets by transparently replacing them behind the scenes with RDMA transfers. It's basically the successor to the Sockets Direct Protocol (SDP), which is no longer maintained. VMA might be useful for accelerating connections with:

  • Apache
  • PHP
  • MySQL
  • SSH (and consequently rsync and scp, right?)

None of these services is so performance-critical that I'll spend time configuring it for VMA, except maybe as a learning exercise. Later.
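
If and when I do, my understanding is that VMA gets preloaded into an unmodified binary rather than requiring code changes. A sketch using sockperf from the package list above (the exact library name/path varies by distro, so check what libvma actually installed):

  # run an ordinary TCP benchmark with its sockets redirected through VMA
  LD_PRELOAD=libvma.so sockperf server --tcp -p 11111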

Storage

  • NFS/RDMA (probably needs a page; see the sketch after this list)
    • rpcrdma kernel module
    • insecure in /etc/exports
    • stuff in /etc?
  • See iSCSI for information about enabling and using iSER.
  • See NVMe-oF for using NVMe over InfiniBand.
  • I'm curious about the performance of SRP. I have gotten it to work before but it's not working now. Also not a huge priority.
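
Here's the rough NFS/RDMA sketch, from memory and untested here, so verify against current kernel docs. The export path, network, and server address are placeholders; 20049 is the registered NFS-over-RDMA port:

  # server: load the RDMA transport and tell nfsd to listen for RDMA on port 20049
  modprobe rpcrdma
  echo "rdma 20049" > /proc/fs/nfsd/portlist

  # server: the export needs "insecure" because RDMA mounts don't come from privileged ports
  #   /tank  172.20.64.0/24(rw,insecure,no_subtree_check)

  # client: mount using the rdma transport
  modprobe rpcrdma
  mount -t nfs -o rdma,port=20049 172.20.64.9:/tank /mnt/tank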

Networking

VXLAN

VXLAN is not the only way to get an Ethernet device on InfiniBand, but as far as I can tell it's the only decent one. Neither ConnectX-3 nor Connect-IB has VXLAN offload support, so it's like a Winmodem, but the number is a million times bigger.

  • VXLAN id can be anything from 0-16777215 inclusive. I make it match the network number.
  • You're making a virtual network interface; the id is the VXLAN Network Identifier (VNI).
  • “local” is the IP address of the parent interface. (Probably 172.20.64.something)
  • group is a multicast IP address.
  • The first octet is 225 (two-twenty-five), not 255 (two-fifty-five).
  • I make the last 3 octets match the first 3 octets of the network. So, ibnet0 will be 225.172.20.64.
  export vxlan=64
  export vxlan_name=ibnet0    # not actually used by anything below; just a reminder of the name
  export local=172.20.64.9    # IP of the parent IPoIB interface on this host
  export group=225.172.20.64
  ip link add name vxlan${vxlan} type vxlan id ${vxlan} local ${local} group ${group} dev ib0 dstport 0

If/when I get hardware capable of VXLAN offload, the dstport might have to change.

Once the VXLAN interface exists, it can be added to a bridge:

  ip link set master green dev vxlan64
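
“green” is just what my bridge happens to be called. If you don't already have one, a minimal version would look something like this:

  ip link add name green type bridge
  ip link set up dev green
  ip link set up dev vxlan64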

Et voilà, Ethernet on top of InfiniBand.

Multimedia

Yeah, someday I want to throw video frames around. There's an RFC or ISO standard for that, IIRC. There's also lgproxy.

I also want to throw audio frames around with “no latency added”. Someday, someday, someday.
