=====InfiniBand=====
Configuration is a real pain. On my Debian hosts, it's all in /etc/rc.local, which is gross. I haven't figured out how to do it on Ubuntu yet; Netplan apparently supports vxlan and IP over IB (IPoIB) now. **We'll get there**.
  
===The Problem===
All the InfiniBand hardware I have is Mellanox FDR-generation: ConnectX-3, Connect-IB, and SX6005 IB-only switches. (Every time I think "the Ethernet version sure would be nice" I remind myself that I'd use InfiniBand mode anyway.)
  
For hardware support, Mellanox provides MLNX_OFED, an overlay for several distributions. Unfortunately, each MLNX_OFED release supports either older distributions (Debian through 11, RHEL through 8, Ubuntu through 20.04), **OR** only ConnectX-4 and newer cards. The drivers built into Linux still recognize and work with ConnectX-3 and Connect-IB, but the kernel drivers packaged with newer MLNX_OFED won't even recognize those cards; they just show up as unclaimed PCI devices.

^ Version ^ Minimum hardware ^ Debian ^ Ubuntu ^ OpenSM version ^
| Inbox |  | All | All | 3.3.23-2 |
| MLNX_OFED 4.9-x | ConnectX-2 | ≤ 11 | ≤ 20.04 | 5.7.2 |
| MLNX_OFED 5.8-x | ConnectX-4 | ≥ 9 | ≥ 18.04 | 5.17.0 |
  
===How I'm Getting Around It===
"Inbox" (distribution-provided) drivers and utilities are basically good enough, and will probably support my hardware until I'm ready to throw it away.
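
A quick sanity check that the inbox stack actually sees the cards (ibstat comes from the infiniband-diags package; the module names assume mlx4/mlx5 hardware like mine):

  # lsmod | grep -E 'mlx4_ib|mlx5_ib|ib_ipoib'
  # ibstat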
  
====The Inbox Part====
  
====The MLNX Part====
Old OpenSM has an annoying problem where, if the HCA goes away while OpenSM is running, it spews into its logfile until it runs the system out of disk space. I don't know when it got fixed, but as of August 2024 it doesn't do that anymore. To install the newer OpenSM from the MLNX_OFED tree, run this from the DEBS directory:
  
    MLNX_OFED_LINUX-24.07-0.6.1.0-debian12.5-x86_64/DEBS# dpkg -i opensm*deb libopensm*deb libibumad*deb
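
To confirm the Mellanox build actually landed (rather than the inbox package):

  # dpkg -s opensm | grep -i '^version'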
  
There's also ibdump, with which you must be very careful: it really does capture everything!
  
    # dpkg -i ibdump_6.0.0-1.2407061_amd64.deb
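
Basic invocation looks like this, from memory (the device name, port, and output file are assumptions for my setup; check `ibdump -h`):

  # ibdump -d mlx4_0 -i 1 -w ib.pcap

The resulting pcap opens in Wireshark like any other capture.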
  
=====The Subnet Manager: OpenSM=====
====Virtualization====
For virtualization to work, you have to be using Mellanox's OpenSM fork. When I installed it and ran `opensm -c` to create its default configuration, it had this block in it:
  
    virt_enabled 2
  
====Partitions: OpenSM configuration====
Atop an InfiniBand fabric, one or more partitions must be defined for hosts to join before they can communicate. IB has a concept of "full" and "partial" membership. Full members can communicate with any other host in the partition. Partial members can communicate with full members, but not with each other. Whether a host is a full or partial member is controlled by the high bit of the partition number. There's also a "both" option. I can't come up with a use case for "both".
  
Linux has very limited support for partial membership. It's best to give all hosts in a partition full membership. Partial membership is probably useful if you have Windows guests and want to keep them isolated at the link layer. I have no need to play in that playground.
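
Concretely, the membership bit is the top bit of the 16-bit pkey, so a full-member pkey is just the partition number with 0x8000 ORed in (values from this page's own partitions):

  # printf '0x%x\n' $(( 0x7fff | 0x8000 ))   # default partition as a full member
  0xffff
  # printf '0x%x\n' $(( 0x3128 | 0x8000 ))   # the storage partition below, as a full member
  0xb128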
  
The default partition configuration sucks. Make it look like this:
  
  Default=0x7fff, ipoib, rate=12, mtu=5, scope=2, defmember=full:
      mgid=ff12:401b::ffff:ffff   # IPv4 Broadcast address
      mgid=ff12:401b::1           # IPv4 All Hosts group
      mgid=ff12:401b::2           # IPv4 All Routers group
      mgid=ff12:401b::16          # IPv4 IGMP group
      mgid=ff12:401b::fb          # IPv4 mDNS group
      mgid=ff12:401b::fc          # IPv4 Multicast Link Local Name Resolution group
      mgid=ff12:401b::101         # IPv4 NTP group
      mgid=ff12:401b::202         # IPv4 Sun RPC group
      mgid=ff12:601b::1           # IPv6 All Hosts group
      mgid=ff12:601b::2           # IPv6 All Routers group
      mgid=ff12:601b::16          # IPv6 MLDv2-capable Routers group
      mgid=ff12:601b::fb          # IPv6 mDNS group
      mgid=ff12:601b::101         # IPv6 NTP group
      mgid=ff12:601b::202         # IPv6 Sun RPC group
      mgid=ff12:601b::1:3         # IPv6 Multicast Link Local Name Resolution group
      ALL=full, ALL_SWITCHES=full;
The default config file lists rates through QDR. For FDR and newer rates, see include/iba/ib_types.h in the OpenSM source repository. They're listed as parameters to ib_path_rec_rate() in the comments below the function declaration. 12 is correct for FDR x4 links. (Wider links are only possible between managed switches. Support for narrower links got removed at some point.)
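
For quick reference, the encodings I care about, as I read ib_types.h (worth double-checking against your version):

  rate=3    # 10 Gb/s (SDR x4)
  rate=6    # 20 Gb/s (DDR x4)
  rate=7    # 40 Gb/s (QDR x4)
  rate=12   # 56 Gb/s (FDR x4)
  mtu=4     # 2048 bytes
  mtu=5     # 4096 bytes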
Here's a block for my ATA over Ethernet experiments. Subject to change. IP addresses are necessary for setting up VXLAN tunnels. Checking whether IPv6 tunnels perform differently from IPv4 tunnels is on the to-do list; I suspect they perform better. Needs testing.

  storage=0xb128, ipoib, rate=12, mtu=5, scope=2, defmember=full:
      mgid=ff12:401b::ffff:ffff   # IPv4 Broadcast address
      mgid=ff12:401b::1           # IPv4 All Hosts group
      ALL=full, ALL_SWITCHES=full;
====Partitions: Host configuration====
There's no functional netlink interface for creating child interfaces. You must use the sysfs interface:

  # echo 0xb128 > /sys/class/net/ib0/create_child

Resist the temptation to rename the interface to something descriptive. **It's already self-descriptive**. Creative naming is for VXLAN tunnels and bridges, e.g.:

  # ip link add vx128 type vxlan id 128 local 172.20.128.13 group 225.172.20.128
  # ip link set master aoe1 dev vx128
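
(aoe1 above is a bridge from my setup; if it doesn't exist yet, creating it is the usual two lines:)

  # ip link add aoe1 type bridge
  # ip link set aoe1 up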
The sysfs interface for deleting child interfaces doesn't work (for me, at least). You must use the netlink interface:

  # ip link del ib0.b128

If you unset the high bit on the partition number (0x3128 instead of 0xb128), Linux will set the high bit before joining the partition. If OpenSM's configuration has that partition's membership set to "partial" or "both", the Linux host will not be able to connect to everything on that subnet, or possibly //anything// on that subnet, regardless of which value you use.

It's worth finding out whether Netplan can manage IB child interfaces.
====Connected vs. Datagram====
IPoIB can run in one of three modes:
  - Datagram, in which the IP MTU matches the subnet MTU. Performance is pretty trash.
  - Connected, in which the MTU is limited only by kernel and network structures. Practically, this means 64k, or 65520 bytes after protocol overhead is accounted for. Performance for some workloads is greatly improved.
  - Enhanced, which is not compatible with Connected mode, but has more offloads and is (from what I hear) generally better.
    * Enhanced IPoIB is only available on ConnectX-4 and newer cards.

IPoIB is still a dog, performance-wise. It's a very fast, very expensive winmodem. It should be thought of as a way of getting IP addresses, which you then use to set up RDMA-aware protocols.

As far as I can tell, IPoIB interfaces are always created in datagram mode. The documentation says that Connect-IB cards default to connected mode; that has not been my experience. There's also no way to set connected mode as the default when the module is loaded, or otherwise ensure that interfaces get created in connected mode. Before changing modes, the interface must be down. The netlink interface for setting the connection mode has not worked for me; I might just be doing it wrong:

  # ip link set mode connected ib0.b129
  Error: argument "connected" is wrong: Invalid link mode
  
  # ^connected^datagram
  ip link set mode datagram ib0.b129
  Error: argument "datagram" is wrong: Invalid link mode

The sysfs interface works:

  # echo connected > /sys/class/net/ib0.b129/mode
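
So the whole dance, using the 65520-byte MTU from above:

  # ip link set ib0.b129 down
  # echo connected > /sys/class/net/ib0.b129/mode
  # ip link set ib0.b129 mtu 65520 up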
Since I don't have any newer hardware, I don't have any information about Enhanced IPoIB.

=====SR-IOV=====
====Hardware Settings====
The BIOS needs to have SR-IOV, ARI, and ACS support enabled.
    SRIOV_EN                                    True(1)
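
These firmware settings get flipped with mlxconfig from MFT, followed by a reboot. (The device name below is a guess for my Connect-IB; `mst status` lists the real ones.)

  # mst start
  # mlxconfig -d /dev/mst/mt4113_pciconf0 set SRIOV_EN=1 NUM_OF_VFS=8
  # mlxconfig -d /dev/mst/mt4113_pciconf0 query | grep -E 'SRIOV_EN|NUM_OF_VFS|FPP_EN'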
  
FPP_EN (Function Per Port ENable) controls whether the card appears as two PCI devices, or as a single device with two ports. Under mlx4, every VF on a dual-port HCA has both ports, and NUM_OF_VFS is how many dual-port devices to create. Under mlx5, each port gets its own pool of VFs and NUM_OF_VFS is per-port.
  
I haven't tried large numbers of VFs. The hardware upper limit is 63 for Connect-IB and 127 for ConnectX-3. Any number of system components could impose lower limits. For example, my consumer boards that are SR-IOV capable can only have VFs on port 1, not on port 2; the EPYC server system can create VFs on both ports. I don't expect to need so many guests with IOV networking anyway...
  
To make VFs exist, put a number <= NUM_OF_VFS into sriov_numvfs for that device. Before doing so, I recommend turning off VF probing; otherwise the VFs will all make IPoIB interfaces on the host, which probably isn't what you want. This setting is per PF.
  
I'm still checking if there's a way to configure the driver so this becomes the default setting.

  # echo 0 > /sys/class/infiniband/ibp13s0f0/device/sriov_drivers_autoprobe
  # echo 0 > /sys/class/infiniband/ibp13s0f1/device/sriov_drivers_autoprobe

If it works, there will be new PCI devices as well as VFs listed under `ip link`:

  # echo 7 > /sys/class/infiniband/ibp6s0f0/device/sriov_numvfs # no output on success; check dmesg for interesting but probably useless messages.
  # lspci | grep nfi
  06:00.0 Infiniband controller: Mellanox Technologies MT27600 [Connect-IB]
  06:00.1 Infiniband controller: Mellanox Technologies MT27600 [Connect-IB]
  06:00.2 Infiniband controller: Mellanox Technologies MT27600 Family [Connect-IB Virtual Function]
  06:00.3 Infiniband controller: Mellanox Technologies MT27600 Family [Connect-IB Virtual Function]
  06:00.4 Infiniband controller: Mellanox Technologies MT27600 Family [Connect-IB Virtual Function]
  06:00.5 Infiniband controller: Mellanox Technologies MT27600 Family [Connect-IB Virtual Function]
  06:00.6 Infiniband controller: Mellanox Technologies MT27600 Family [Connect-IB Virtual Function]
  06:00.7 Infiniband controller: Mellanox Technologies MT27600 Family [Connect-IB Virtual Function]
  06:01.0 Infiniband controller: Mellanox Technologies MT27600 Family [Connect-IB Virtual Function]
  
Warning: The output from ip link is very wide.
  
  # ip link
  1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN mode DEFAULT group default qlen 1000
      link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
  2: eth0: <BROADCAST,MULTICAST> mtu 1500 qdisc noop state DOWN mode DEFAULT group default qlen 1000
      link/ether 3c:ec:ef:6d:10:62 brd ff:ff:ff:ff:ff:ff
  3: eth1: <BROADCAST,MULTICAST> mtu 1500 qdisc noop state DOWN mode DEFAULT group default qlen 1000
      link/ether 3c:ec:ef:6d:10:63 brd ff:ff:ff:ff:ff:ff
  4: ib0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 4092 qdisc fq_codel state UP mode DEFAULT group default qlen 256
      link/infiniband 80:00:00:29:fe:80:00:00:00:00:00:00:58:49:56:0e:53:b7:0b:01 brd 00:ff:ff:ff:ff:12:40:1b:ff:ff:00:00:00:00:00:00:ff:ff:ff:ff
      vf 0     link/infiniband 80:00:00:29:fe:80:00:00:00:00:00:00:58:49:56:0e:53:b7:0b:01 brd 00:ff:ff:ff:ff:12:40:1b:ff:ff:00:00:00:00:00:00:ff:ff:ff:ff, spoof checking off, NODE_GUID 00:00:00:00:00:00:00:00, PORT_GUID 00:00:00:00:00:00:00:00, link-state disable, trust off, query_rss off
      vf 1     link/infiniband 80:00:00:29:fe:80:00:00:00:00:00:00:58:49:56:0e:53:b7:0b:01 brd 00:ff:ff:ff:ff:12:40:1b:ff:ff:00:00:00:00:00:00:ff:ff:ff:ff, spoof checking off, NODE_GUID 00:00:00:00:00:00:00:00, PORT_GUID 00:00:00:00:00:00:00:00, link-state disable, trust off, query_rss off
      vf 2     link/infiniband 80:00:00:29:fe:80:00:00:00:00:00:00:58:49:56:0e:53:b7:0b:01 brd 00:ff:ff:ff:ff:12:40:1b:ff:ff:00:00:00:00:00:00:ff:ff:ff:ff, spoof checking off, NODE_GUID 00:00:00:00:00:00:00:00, PORT_GUID 00:00:00:00:00:00:00:00, link-state disable, trust off, query_rss off
      vf 3     link/infiniband 80:00:00:29:fe:80:00:00:00:00:00:00:58:49:56:0e:53:b7:0b:01 brd 00:ff:ff:ff:ff:12:40:1b:ff:ff:00:00:00:00:00:00:ff:ff:ff:ff, spoof checking off, NODE_GUID 00:00:00:00:00:00:00:00, PORT_GUID 00:00:00:00:00:00:00:00, link-state disable, trust off, query_rss off
      vf 4     link/infiniband 80:00:00:29:fe:80:00:00:00:00:00:00:58:49:56:0e:53:b7:0b:01 brd 00:ff:ff:ff:ff:12:40:1b:ff:ff:00:00:00:00:00:00:ff:ff:ff:ff, spoof checking off, NODE_GUID 00:00:00:00:00:00:00:00, PORT_GUID 00:00:00:00:00:00:00:00, link-state disable, trust off, query_rss off
      vf 5     link/infiniband 80:00:00:29:fe:80:00:00:00:00:00:00:58:49:56:0e:53:b7:0b:01 brd 00:ff:ff:ff:ff:12:40:1b:ff:ff:00:00:00:00:00:00:ff:ff:ff:ff, spoof checking off, NODE_GUID 00:00:00:00:00:00:00:00, PORT_GUID 00:00:00:00:00:00:00:00, link-state disable, trust off, query_rss off
      vf 6     link/infiniband 80:00:00:29:fe:80:00:00:00:00:00:00:58:49:56:0e:53:b7:0b:01 brd 00:ff:ff:ff:ff:12:40:1b:ff:ff:00:00:00:00:00:00:ff:ff:ff:ff, spoof checking off, NODE_GUID 00:00:00:00:00:00:00:00, PORT_GUID 00:00:00:00:00:00:00:00, link-state disable, trust off, query_rss off
  5: ib1: <NO-CARRIER,BROADCAST,MULTICAST,UP> mtu 4092 qdisc fq_codel state DOWN mode DEFAULT group default qlen 256
      link/infiniband 80:00:00:28:fe:80:00:00:00:00:00:00:58:49:56:0e:53:b7:0b:09 brd 00:ff:ff:ff:ff:12:40:1b:ff:ff:00:00:00:00:00:00:ff:ff:ff:ff
  
====VF Configuration====
The official documentation covers a sysfs interface for configuring VF properties. That interface hasn't existed for years. Before using a VF, you must set node_guid, port_guid, and state using ip link. Make port_guid == node_guid == unique. (I use the base port GUID + VF number + 1.)
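
For example, for VF 0 on southpark's card (base port GUID ends in 0b:01, so VF 0 gets 0b:02):

  # ip link set dev ib0 vf 0 node_guid 58:49:56:0e:53:b7:0b:02
  # ip link set dev ib0 vf 0 port_guid 58:49:56:0e:53:b7:0b:02
  # ip link set dev ib0 vf 0 state enable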
  
GUIDs need to be set before attaching a VF to a VM. It should be possible to change state (simulating unplugging the cable) while a VM is using a VF, but I haven't tested this.
  
Configuration is managed in /etc/rc.local.
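
A minimal sketch of the rc.local fragment, stitched together from the commands above (device names and GUIDs are from my hosts; adjust to taste):

  #!/bin/sh
  # Keep VFs from autoprobing and spawning host-side IPoIB interfaces
  echo 0 > /sys/class/infiniband/ibp6s0f0/device/sriov_drivers_autoprobe
  # Create the VFs
  echo 7 > /sys/class/infiniband/ibp6s0f0/device/sriov_numvfs
  # Give each VF a unique GUID pair and enable it before any VM attaches it
  ip link set dev ib0 vf 0 node_guid 58:49:56:0e:53:b7:0b:02
  ip link set dev ib0 vf 0 port_guid 58:49:56:0e:53:b7:0b:02
  ip link set dev ib0 vf 0 state enable
  exit 0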
=====Upper-Layer Protocols (ULPs)=====
RDMA opens all kinds of possibilities for RDMA-aware protocols to be amazing and fast. They probably all deserve their own pages.
===IP over InfiniBand (IPoIB)===
IPoIB has already been used extensively on this page. It encapsulates IP traffic in InfiniBand datagrams so that protocols built for Ethernet (primarily TCP and UDP) can run on IB networks. Performance is generally poor. There are some hacks to make it faster, but the real reason to use IPoIB is to give your IB hosts IP addresses for setting up other ULPs.
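
In practice that just means addressing the interface like any other NIC (the address and prefix here are made up for illustration):

  # ip link set ib0 up
  # ip addr add 172.20.64.9/24 dev ib0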
  
===VMA===
  * Apache
  * PHP
  * MySQL (though I think MySQL has its own RDMA backend?)
  * SSH (and consequently rsync and scp, right?)
    * (It would also be cool to find out if newer cards with crypto functions could do hardware-accelerated SSH with RDMA.)
  
None of these services is so performance-critical that I'll spend time configuring it for VMA, except maybe as a learning exercise. **Later**.
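
For when that day comes: VMA is enabled per-process by preloading the library, no recompiling needed (iperf is just an example workload):

  # LD_PRELOAD=libvma.so iperf -s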
====Networking====
===VXLAN===
VXLAN is not the only way to get an Ethernet device on InfiniBand, but as far as I can tell it's the only decent one. None of my hardware has VXLAN offload support.
  
  * VXLAN id can be anything from 0-16777215 inclusive. I make it match the network number.
    export id=64
    export local=172.20.64.9
    export group=225.172.20.64
    export dev=ib0
    ip link add name vxlan$id type vxlan id $id local $local group $group dev $dev dstport 0
  
Once the VNI exists, it can be added to a bridge:
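
Same pattern as the vx128 example earlier (br64 here is a hypothetical bridge name; use whatever bridge you actually have):

    ip link set master br64 dev vxlan64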
  
//Et voilà//, Ethernet on top of InfiniBand.

If/when I get hardware capable of VXLAN offload, the dstport might have to change.
====Multimedia====
Yeah, someday I want to throw video frames around. There's an RFC or ISO standard for that, IIRC. There's also lgproxy, which is RDMA-aware.
  
I also want to throw audio frames around with "no latency added". Someday, someday, someday.
=====GUIDs=====
  * 5849560e59150301 - shark Connect-IB
  * 5849560e53b70b01 - southpark Connect-IB
  * 5849560e53660101 - duckling Connect-IB
  * 7cfe900300a0a080 - uninstalled Connect-IB
  * (there are several more uninstalled Connect-IB cards)
  * f4521403002c18b0 - uninstalled ConnectX-3 2014-01-29
  * 0002c90300b37f10 - uninstalled ConnectX-3 with no date on the label
  * 001175000079b560 - uninstalled qib
  * 001175000079b856 - uninstalled qib