nndocs:infiniband

Differences

This shows you the differences between two versions of the page.

nndocs:infiniband [2024/03/25 17:40] – [Connected vs. Datagram] Elucidate naptastic
nndocs:infiniband [2025/01/21 14:38] (current) – [Networking] correct a thing naptastic
Line 7: Line 7:
 For hardware support, Mellanox provides MLNX_OFED, an overlay for several distributions. Unfortunately, MLNX_OFED only supports Debian through 11, RHEL through 8, and Ubuntu through 20.04, **OR** ConnectX-4 or newer cards only. The drivers built into Linux still recognize and work with ConnectX-3 and Connect-IB, but the kernel drivers packaged won't even recognize the cards. They just show up as unclaimed PCI devices.
  
-^ MLNX_OFED version ^ Minimum hardware ^ Debian ^ Ubuntu ^ OpenSM version ^
+^ Version ^ Minimum hardware ^ Debian ^ Ubuntu ^ OpenSM version ^
 | Inbox |  | All | All | 3.3.23-2 |
-| 4.9-x | ConnectX-2 | ≤ 11 | ≤ 20.04 | 5.7.2 |
+| MLNX_OFED 4.9-x | ConnectX-2 | ≤ 11 | ≤ 20.04 | 5.7.2 |
-| 5.8-x | ConnectX-4 | ≥ 9 | ≥ 18.04 | 5.17.0 |
+| MLNX_OFED 5.8-x | ConnectX-4 | ≥ 9 | ≥ 18.04 | 5.17.0 |
  
 ===How I'm Getting Around It===
Line 29: Line 29:
  
 ====The MLNX part====
-It's worth investigating other tools provided with MLNX_OFED to see if they offer compelling advantages over inbox versions. I'm not doing that right now because I suspect the Mellanox version of OpenSM is the only thing I actually //need//. It is possible to install OpenSM 5.7 from OFED 4.9. From this path run this command:
+Old OpenSM has this annoying problem where, if the HCA goes away while OpenSM is running, it will start to spew into its logfile, and it will run the system out of disk space. I don't know when it got fixed, but as of August 2024 it doesn't do that anymore.
  
-    MLNX_OFED_LINUX-4.9-7.1.0.0-ubuntu20.04-x86_64/DEBS/MLNX_LIBS# dpkg -i opensm*deb libopensm*deb libibumad*deb
+    MLNX_OFED_LINUX-24.07-0.6.1.0-debian12.5-x86_64/DEBS# dpkg -i opensm*deb libopensm*deb libibumad*deb
-
-Newer versions of MLNX_OFED have newer versions of OpenSM. I haven't tried them.
  
 There's also ibdump, with which you must be very careful: it really does capture everything!
  
-    # dpkg -i ibdump_6.0.0-1.49710_amd64.deb
+    # dpkg -i ibdump_6.0.0-1.2407061_amd64.deb
  
 =====The Subnet Manager: OpenSM=====
Line 80: Line 78:
 Here's a block for my ATA over Ethernet experiments. Subject to change. IP addresses are necessary for setting up VXLAN tunnels. Checking if IPv6 tunnels perform differently from IPv4 tunnels is on the to-do list. I suspect they perform better. Needs testing.
  
-  mp1=0x3128, ipoib, rate=12, mtu=5, scope=2, defmember=full:
+  storage=0xb128, ipoib, rate=12, mtu=5, scope=2, defmember=full:
       mgid=ff12:401b::ffff:ffff   # IPv4 Broadcast address
       mgid=ff12:401b::          # IPv4 All Hosts group
Line 87: Line 85:
 ====Partitions: Host configuration====
 There's no functional netlink interface for creating child interfaces. You must use the sysfs interface.
-  echo 0xb129 > /sys/class/net/ib0/create_child
+  echo 0xb128 > /sys/class/net/ib0/create_child
  
-The sysfs interface for deleting child interfaces doesn't work (for me at least). You must use the netlink interface.
-  ip link del ib0.b129
+Resist the temptation to rename the interface to something descriptive. **It's already self-descriptive**. Creative naming is for VXLAN tunnels and bridges, e.g.:
  
-Resist the temptation to rename the interface to something descriptive. **It's already self-descriptive**. Creative naming is for VXLAN tunnels and bridges.
+  # ip link add vx128 type vxlan id 128 local 172.20.128.13 group 225.172.20.128
+  # ip link set master aoe1 dev vx128
+
+The sysfs interface for deleting child interfaces doesn't work (for me at least). You must use the netlink interface.
+  # ip link del ib0.b128
  
-If you unset the high bit on the partition number (0x3129 instead of 0xb129) Linux will set the high bit before joining the partition. If OpenSM's configuration has that partition's membership set for "partial" or "both", the Linux host will not be able to connect to everything on that subnet, or possibly //anything// on that subnet, regardless of which value you use.
+If you unset the high bit on the partition number (0x3128 instead of 0xb128) Linux will set the high bit before joining the partition. If OpenSM's configuration has that partition's membership set for "partial" or "both", the Linux host will not be able to connect to everything on that subnet, or possibly //anything// on that subnet, regardless of which value you use.
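The high-bit arithmetic above can be sketched in shell. This is only an illustration, assuming the "high bit" is bit 15 (0x8000) of the 16-bit partition key, which is consistent with 0x3128 becoming 0xb128. The name `full_pkey` is hypothetical and not part of any tool.

```shell
# full_pkey: print a 16-bit partition key with the high (membership) bit set.
# Assumption: the high bit is 0x8000, as implied by 0x3128 -> 0xb128.
full_pkey() {
    printf '0x%04x\n' "$(( ($1 | 0x8000) & 0xffff ))"
}

full_pkey 0x3128   # prints 0xb128
```

The result is the value Linux ends up joining with, so it is the value that shows up in the child interface name (ib0.b128).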
  
 It's worth finding out if Netplan can manage IB child interfaces.
Line 119: Line 120:
  
 Since I don't have any newer hardware, I don't have any information about Enhanced IPoIB.
+
 =====SR-IOV=====
 ====Hardware Settings====
Line 129: Line 131:
     SRIOV_EN                                    True(1)
  
-FPP_EN (Flow Priority something) controls whether the card appears as two PCI devices, or as a single device with two ports. Under mlx4, every VF on a dual-port HCA has both ports, and NUM_OF_VFs is how many dual-port devices to create. Under mlx5, each port gets its own pool of VFs and NUM_OF_VFs is per-port.
+FPP_EN (Function Per Port ENable) controls whether the card appears as two PCI devices, or as a single device with two ports. Under mlx4, every VF on a dual-port HCA has both ports, and NUM_OF_VFs is how many dual-port devices to create. Under mlx5, each port gets its own pool of VFs and NUM_OF_VFs is per-port.
  
 I haven't tried large numbers of VFs. The hardware upper limit is 63 for Connect-IB and 127 for ConnectX-3. Any number of system components could impose lower limits. For example, my consumer boards that are SR-IOV capable can only have VFs on port 1, not on port 2; the EPYC server system can create VFs on both ports. I don't expect to need so many guests with IOV networking anyway...
  
-After reboot, there should be a new file, /sys/bus/pci/devices/0000:b:d:f/sriov_numvfs. Try turning it on. If it works, there will be new PCI devices as well as VFs listed under `ip link`:
+To make VFs exist, put a number <= NUM_OF_VFS into sriov_numvfs for that device. Before doing so, I recommend turning off VF probing. Otherwise the VFs will all make IPoIB interfaces, which probably isn't what you want. This setting is per PF.
+
+I'm still checking if there's a way to configure the driver so this becomes the default setting.
+
+  # echo 0 > /sys/class/infiniband/ibp13s0f0/device/sriov_drivers_autoprobe
+  # echo 0 > /sys/class/infiniband/ibp13s0f1/device/sriov_drivers_autoprobe
+
+If it works, there will be new PCI devices as well as VFs listed under `ip link`:
  
   # echo 7 > /sys/class/infiniband/ibp6s0f0/device/sriov_numvfs # no output on success; check dmesg for interesting but probably useless messages.
-
   # lspci | grep nfi
   06:00.0 Infiniband controller: Mellanox Technologies MT27600 [Connect-IB]
Line 168: Line 176:
  5: ib1: <NO-CARRIER,BROADCAST,MULTICAST,UP> mtu 4092 qdisc fq_codel state DOWN mode DEFAULT group default qlen 256
      link/infiniband 80:00:00:28:fe:80:00:00:00:00:00:00:58:49:56:0e:53:b7:0b:09 brd 00:ff:ff:ff:ff:12:40:1b:ff:ff:00:00:00:00:00:00:ff:ff:ff:ff
-  10: ib2: <BROADCAST,MULTICAST> mtu 4092 qdisc noop state DOWN mode DEFAULT group default qlen 256 
-      link/infiniband 80:00:00:27:fe:80:00:00:00:00:00:00:00:00:00:00:00:00:00:00 brd 00:ff:ff:ff:ff:12:40:1b:ff:ff:00:00:00:00:00:00:ff:ff:ff:ff 
-  11: ib3: <BROADCAST,MULTICAST> mtu 4092 qdisc noop state DOWN mode DEFAULT group default qlen 256 
-      link/infiniband 80:00:00:27:fe:80:00:00:00:00:00:00:00:00:00:00:00:00:00:00 brd 00:ff:ff:ff:ff:12:40:1b:ff:ff:00:00:00:00:00:00:ff:ff:ff:ff 
-  12: ib4: <BROADCAST,MULTICAST> mtu 4092 qdisc noop state DOWN mode DEFAULT group default qlen 256 
-      link/infiniband 80:00:00:27:fe:80:00:00:00:00:00:00:00:00:00:00:00:00:00:00 brd 00:ff:ff:ff:ff:12:40:1b:ff:ff:00:00:00:00:00:00:ff:ff:ff:ff 
-  13: ib5: <BROADCAST,MULTICAST> mtu 4092 qdisc noop state DOWN mode DEFAULT group default qlen 256 
-      link/infiniband 80:00:00:27:fe:80:00:00:00:00:00:00:00:00:00:00:00:00:00:00 brd 00:ff:ff:ff:ff:12:40:1b:ff:ff:00:00:00:00:00:00:ff:ff:ff:ff 
-  14: ib6: <BROADCAST,MULTICAST> mtu 4092 qdisc noop state DOWN mode DEFAULT group default qlen 256 
-      link/infiniband 80:00:00:27:fe:80:00:00:00:00:00:00:00:00:00:00:00:00:00:00 brd 00:ff:ff:ff:ff:12:40:1b:ff:ff:00:00:00:00:00:00:ff:ff:ff:ff 
-  15: ib7: <BROADCAST,MULTICAST> mtu 4092 qdisc noop state DOWN mode DEFAULT group default qlen 256 
-      link/infiniband 80:00:00:27:fe:80:00:00:00:00:00:00:00:00:00:00:00:00:00:00 brd 00:ff:ff:ff:ff:12:40:1b:ff:ff:00:00:00:00:00:00:ff:ff:ff:ff 
-  16: ib8: <BROADCAST,MULTICAST> mtu 4092 qdisc noop state DOWN mode DEFAULT group default qlen 256 
-      link/infiniband 80:00:00:27:fe:80:00:00:00:00:00:00:00:00:00:00:00:00:00:00 brd 00:ff:ff:ff:ff:12:40:1b:ff:ff:00:00:00:00:00:00:ff:ff:ff:ff 
  
 ====VF Configuration====
-To set the GUID for VFs, set node_guid, port_guid, and state using ip link. Make port_guid == node_guid == unique. (I use the base port guid + VF + 1.)
+The official documentation covers a sysfs interface for configuring VF properties. That interface hasn't existed for years. Before using a VF, you must set node_guid, port_guid, and state using ip link. Make port_guid == node_guid == unique. (I use the base port guid + VF + 1.)
  
-Lazy copy-pasta for southpark. Note this only sets up VFs for port 1, but that's the only port plugged in right now anyway, so w/e.
+GUIDs need to be set before attaching a VF to a VM. It should be possible to change state (simulating unplugging the cable) while a VM is using a VF but I haven't tested this.
-  ip link set dev ib0 vf 0 node_guid 58:49:56:0e:53:b7:0b:02
-  ip link set dev ib0 vf 0 port_guid 58:49:56:0e:53:b7:0b:02
-  ip link set dev ib0 vf 0 state enable
-  ip link set dev ib0 vf 1 node_guid 58:49:56:0e:53:b7:0b:03
-  ip link set dev ib0 vf 1 port_guid 58:49:56:0e:53:b7:0b:03
-  ip link set dev ib0 vf 1 state enable
-  ip link set dev ib0 vf 2 node_guid 58:49:56:0e:53:b7:0b:04
-  ip link set dev ib0 vf 2 port_guid 58:49:56:0e:53:b7:0b:04
-  ip link set dev ib0 vf 2 state enable
-  ip link set dev ib0 vf 3 node_guid 58:49:56:0e:53:b7:0b:05
-  ip link set dev ib0 vf 3 port_guid 58:49:56:0e:53:b7:0b:05
-  ip link set dev ib0 vf 3 state enable
-  ip link set dev ib0 vf 4 node_guid 58:49:56:0e:53:b7:0b:06
-  ip link set dev ib0 vf 4 port_guid 58:49:56:0e:53:b7:0b:06
-  ip link set dev ib0 vf 4 state enable
-  ip link set dev ib0 vf 5 node_guid 58:49:56:0e:53:b7:0b:07
-  ip link set dev ib0 vf 5 port_guid 58:49:56:0e:53:b7:0b:07
-  ip link set dev ib0 vf 5 state enable
-  ip link set dev ib0 vf 6 node_guid 58:49:56:0e:53:b7:0b:08
-  ip link set dev ib0 vf 6 port_guid 58:49:56:0e:53:b7:0b:08
-  ip link set dev ib0 vf 6 state enable
-
-Lazy copy-pasta for sadness:
-  ip link set dev ib0 vf 0 node_guid 58:49:56:0e:58:5c:03:02
-  ip link set dev ib0 vf 0 port_guid 58:49:56:0e:58:5c:03:02
-  ip link set dev ib0 vf 0 state enable
-  ip link set dev ib0 vf 1 node_guid 58:49:56:0e:58:5c:03:03
-  ip link set dev ib0 vf 1 port_guid 58:49:56:0e:58:5c:03:03
-  ip link set dev ib0 vf 1 state enable
-  ip link set dev ib0 vf 2 node_guid 58:49:56:0e:58:5c:03:04
-  ip link set dev ib0 vf 2 port_guid 58:49:56:0e:58:5c:03:04
-  ip link set dev ib0 vf 2 state enable
-  ip link set dev ib0 vf 3 node_guid 58:49:56:0e:58:5c:03:05
-  ip link set dev ib0 vf 3 port_guid 58:49:56:0e:58:5c:03:05
-  ip link set dev ib0 vf 3 state enable
-  ip link set dev ib0 vf 4 node_guid 58:49:56:0e:58:5c:03:06
-  ip link set dev ib0 vf 4 port_guid 58:49:56:0e:58:5c:03:06
-  ip link set dev ib0 vf 4 state enable
-  ip link set dev ib0 vf 5 node_guid 58:49:56:0e:58:5c:03:07
-  ip link set dev ib0 vf 5 port_guid 58:49:56:0e:58:5c:03:07
-  ip link set dev ib0 vf 5 state enable
-  ip link set dev ib0 vf 6 node_guid 58:49:56:0e:58:5c:03:08
-  ip link set dev ib0 vf 6 port_guid 58:49:56:0e:58:5c:03:08
-  ip link set dev ib0 vf 6 state enable
-
-Lazy copy-pasta for shark:
-  ip link set dev ib0 vf 0 node_guid 58:49:56:0e:59:11:02:02
-  ip link set dev ib0 vf 0 port_guid 58:49:56:0e:59:11:02:02
-  ip link set dev ib0 vf 0 state enable
-
-These should really go on their own page. Or better yet, figure out how to configure them on the host!
  
+Configuration is managed in /etc/rc.local.
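The "base port guid + VF + 1" scheme can be generated instead of pasted. What follows is a rough sketch, not the configuration actually in use: `vf_guid_cmds` is a hypothetical helper, the default PF name `ib0` is taken from the removed copy-pasta, and the arithmetic assumes the base GUID fits in the shell's signed 64-bit integers (true for the 58:49:56:... Connect-IB GUIDs on this page). It only prints the `ip link` commands and changes nothing by itself.

```shell
# vf_guid_cmds BASE_GUID NUM_VFS [PF_DEV]
# Print the ip link commands that give each VF the GUID (base + VF + 1).
# Hypothetical helper; assumes BASE_GUID fits in signed 64-bit shell arithmetic.
vf_guid_cmds() {
    base=$(printf '%s' "$1" | tr -d ':')    # accept 58:49:... or 5849...
    num_vfs=$2
    dev=${3:-ib0}                           # PF interface name (assumed default)
    vf=0
    while [ "$vf" -lt "$num_vfs" ]; do
        # Add the offset, then re-insert a colon after every byte.
        guid=$(printf '%016x' "$(( 0x$base + vf + 1 ))" | sed 's/../&:/g; s/:$//')
        echo "ip link set dev $dev vf $vf node_guid $guid"
        echo "ip link set dev $dev vf $vf port_guid $guid"
        echo "ip link set dev $dev vf $vf state enable"
        vf=$(( vf + 1 ))
    done
}

# Example: the seven VFs on southpark's first port.
vf_guid_cmds 5849560e53b70b01 7
```

Review the printed commands before using them anywhere; they could then be pasted into /etc/rc.local or piped to a root shell.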
 =====Upper-Layer Protocols (ULPs)=====
 RDMA opens all kinds of possibilities for RDMA-aware protocols to be amazing and fast. They probably all deserve their own pages.
Line 264: Line 208:
 ====Networking====
 ===VXLAN===
-VXLAN is not the only way to get an Ethernet device on Infiniband, but as far as I can tell it's the only decent one. Neither ConnectX-3 nor Connect-IB has VXLAN offload support. Despite this, in connected mode, VXLAN is still blazing despite being basically a million-times-faster winmodem.
+VXLAN is not the only way to get an Ethernet device on Infiniband, but as far as I can tell it's the only decent one. None of my hardware has VXLAN offload support.
  
   * VXLAN id can be anything from 0-16777215 inclusive. I make it match the network number.
Line 292: Line 236:
  
 I also want to throw audio frames around with "no latency added". Someday, someday, someday.
+
+=====GUIDs=====
+  * 5849560e59150301 - shark Connect-IB
+  * 5849560e53b70b01 - southpark Connect-IB
+  * 5849560e53660101 - duckling Connect-IB
+  * 7cfe900300a0a080 - uninstalled Connect-IB
+  * (there are several more uninstalled Connect-IB cards)
+  * f4521403002c18b0 - uninstalled ConnectX-3 2014-01-29
+  * 0002c90300b37f10 - uninstalled ConnectX-3 with no date on the label
+  * 001175000079b560 - uninstalled qib
+  * 001175000079b856 - uninstalled qib
+
nndocs/infiniband.1711388407.txt.gz · Last modified: 2024/03/25 17:40 by naptastic