nndocs:infiniband
Differences
This shows you the differences between two versions of the page.
nndocs:infiniband [2024/03/15 00:56] – [VF Configuration] partial lazy copy-pasta for southpark SR-IOV naptastic | nndocs:infiniband [2025/01/21 14:38] (current) – [Networking] correct a thing naptastic | ||
---|---|---|---|
Line 1: | Line 1: | ||
=====InfiniBand===== | =====InfiniBand===== | ||
- | Configuration is a real pain. On my Debian hosts, it's all in / | + | Configuration is a real pain. On my Debian hosts, it's all in / |
===The Problem=== | ===The Problem=== | ||
- | All the InfiniBand hardware I have is Mellanox FDR-generation, | + | All the InfiniBand hardware I have is Mellanox FDR-generation, |
- | ^ MLNX_OFED | + | For hardware support, Mellanox provides |
+ | |||
+ | ^ Version | ||
| Inbox | | All | All | 3.3.23-2 | | | Inbox | | All | All | 3.3.23-2 | | ||
- | | 4.9-x | ConnectX-2 | ≤ 11 | ≤ 20.04 | 5.7.2 | | + | | MLNX_OFED |
- | | 5.8-x | ConnectX-4 | ≥ 9 | ≥ 18.04 | 5.17.0 | | + | | MLNX_OFED |
===How I'm Getting Around It=== | ===How I'm Getting Around It=== | ||
- | " | + | " |
====The Inbox Part==== | ====The Inbox Part==== | ||
Line 27: | Line 29: | ||
====The MLNX part==== | ====The MLNX part==== | ||
- | There may be other parts of MLNX_OFED worth stealing; | + | Old OpenSM has this annoying problem where, if the HCA goes away while OpenSM is running, it will start to spew into its logfile, and it will run the system out of disk space. |
- | MLNX_OFED_LINUX-4.9-7.1.0.0-ubuntu20.04-x86_64/ | + | MLNX_OFED_LINUX-24.07-0.6.1.0-debian12.5-x86_64/ |
There' | There' | ||
- | # dpkg -i ibdump_6.0.0-1.49710_amd64.deb | + | # dpkg -i ibdump_6.0.0-1.2407061_amd64.deb |
- | =====SR-IOV===== | + | =====The Subnet Manager: OpenSM===== |
- | ====OpenSM Configuration==== | + | ====Virtualization==== |
For virtualization to work, you have to be using Mellanox' | For virtualization to work, you have to be using Mellanox' | ||
Line 47: | Line 49: | ||
virt_enabled 2 | virt_enabled 2 | ||
- | In partitions.conf, well, just make the default | ====Partitions: |
+ | Atop an InfiniBand fabric, one or more partitions | ||
- | Default=0x7fff, | + | Linux has very limited support for partial membership. It's best to give all hosts in a partition |
- | mgid=ff12: | + | |
- | mgid=ff12: | + | |
- | mgid=ff12: | + | |
- | mgid=ff12: | + | |
- | mgid=ff12: | + | |
- | mgid=ff12: | + | |
- | mgid=ff12: | + | |
- | mgid=ff12: | + | |
- | mgid=ff12: | + | |
- | mgid=ff12: | + | |
- | mgid=ff12: | + | |
- | mgid=ff12: | + | |
- | mgid=ff12: | + | |
- | mgid=ff12: | + | |
- | mgid=ff12: | + | |
- | ALL=full, ALL_SWITCHES=full; | + | |
- | It's important to use " | + | The default partition configuration sucks. Make it look like this: |
+ | Default=0x7fff, | ||
+ | mgid=ff12: | ||
+ | mgid=ff12: | ||
+ | mgid=ff12: | ||
+ | mgid=ff12: | ||
+ | mgid=ff12: | ||
+ | mgid=ff12: | ||
+ | mgid=ff12: | ||
+ | mgid=ff12: | ||
+ | mgid=ff12: | ||
+ | mgid=ff12: | ||
+ | mgid=ff12: | ||
+ | mgid=ff12: | ||
+ | mgid=ff12: | ||
+ | mgid=ff12: | ||
+ | mgid=ff12: | ||
+ | ALL=full, ALL_SWITCHES=full; | ||
+ | |||
+ | The default config file lists rates through QDR. For FDR and newer rates, see include/ | ||
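As a rough aid for picking a rate= value, here is a minimal lookup sketch. The function name ib_rate_code is made up for illustration, and the numbers are the InfiniBand path-record rate encoding as commonly listed; treat them as an assumption and check them against the header the page points to before using them.

```shell
# ib_rate_code GBPS
# Hypothetical helper: map a link rate in Gb/s to the integer used for
# rate= in partitions.conf. Values are assumed from the InfiniBand
# path-record rate encoding; verify before relying on them.
ib_rate_code() {
    case "$1" in
        2.5) echo 2 ;;
        10)  echo 3 ;;
        30)  echo 4 ;;
        5)   echo 5 ;;
        20)  echo 6 ;;
        40)  echo 7 ;;   # QDR 4x
        60)  echo 8 ;;
        80)  echo 9 ;;
        120) echo 10 ;;
        14)  echo 11 ;;  # FDR 1x
        56)  echo 12 ;;  # FDR 4x
        112) echo 13 ;;
        168) echo 14 ;;
        25)  echo 15 ;;  # EDR 1x
        100) echo 16 ;;  # EDR 4x
        *)   echo "unknown rate: $1" >&2; return 1 ;;
    esac
}

# FDR at 4x width is nominally 56 Gb/s.
echo "rate=$(ib_rate_code 56)"
```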
+ | |||
+ | Here's a block for my ATA over Ethernet experiments. Subject to change. IP addresses are necessary for setting up VXLAN tunnels. Checking if IPv6 tunnels perform differently from IPv4 tunnels is on the to-do list. I suspect they perform better. Needs testing. | ||
+ | |||
+ | storage=0xb128, | ||
+ | mgid=ff12: | ||
+ | mgid=ff12: | ||
+ | ALL=full, ALL_SWITCHES=full; | ||
+ | |||
+ | ====Partitions: | ||
+ | There' | ||
+ | # echo 0xb128 > / | ||
+ | |||
+ | Resist the temptation to rename the interface to something descriptive. **It's already self-descriptive**. Creative naming is for VXLAN tunnels and bridges, e.g.: | ||
+ | |||
+ | # ip link add vx128 type vxlan id 128 local 172.20.128.13 group 225.172.20.128 | ||
+ | # ip link set master aoe1 dev vx128 | ||
+ | |||
+ | The sysfs interface for deleting child interfaces doesn' | ||
+ | # ip link del ib0.b128 | ||
+ | |||
+ | If you unset the high bit on the partition number (0x3128 instead of 0xb128), Linux will set the high bit before joining the partition. If OpenSM' | ||
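The high-bit behavior can be sketched with a little shell arithmetic. The helper names pkey_full and pkey_child_name are invented for this example; the naming pattern (parent, dot, four hex digits) is taken from the ib0.b128 interface used on this page.

```shell
# pkey_full PKEY
# Set the full-membership bit (0x8000) on a 16-bit partition key.
pkey_full() {
    printf '0x%04x\n' $(( ($1 | 0x8000) & 0xffff ))
}

# pkey_child_name PARENT PKEY
# Name of the child interface for that partition: parent, a dot, then
# the key with the high bit set, in hex without the 0x prefix.
pkey_child_name() {
    printf '%s.%04x\n' "$1" $(( ($2 | 0x8000) & 0xffff ))
}

pkey_full 0x3128            # prints 0xb128
pkey_child_name ib0 0x3128  # prints ib0.b128
```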
+ | |||
+ | It's worth finding out if Netplan can manage IB child interfaces. | ||
+ | |||
+ | ====Connected vs. Datagram==== | ||
+ | IPoIB can run in one of three modes: | ||
+ | - Datagram, in which the IP MTU matches the subnet MTU. Performance is pretty trash. | ||
+ | - Connected, in which the MTU is limited only by kernel and network structures. Practically, | ||
+ | - Enhanced, which is not compatible with Connected mode, but has more offloads and is (from what I hear) generally better. | ||
+ | * Enhanced IPoIB is only available on ConnectX-4 and newer cards. | ||
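One detail on the datagram case: IPoIB prepends a 4-byte encapsulation header, so the IP MTU normally comes out 4 bytes below the InfiniBand MTU rather than exactly equal to it. A minimal sketch of that arithmetic (ipoib_datagram_mtu is a made-up name):

```shell
# ipoib_datagram_mtu IB_MTU
# Datagram-mode IP MTU: the IB MTU minus the 4-byte IPoIB header.
ipoib_datagram_mtu() {
    echo $(( $1 - 4 ))
}

for ib_mtu in 2048 4096; do
    echo "IB MTU $ib_mtu -> IP MTU $(ipoib_datagram_mtu "$ib_mtu")"
done
```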
+ | |||
+ | IPoIB is still a dog, performance-wise. It's a very fast, very expensive winmodem. It should be thought of as a way of getting IP addresses, which you then use to set up RDMA-aware protocols. | ||
+ | |||
+ | As far as I can tell, IPoIB interfaces are always created in datagram mode. The documentation says that Connect-IB cards default to connected mode; that has not been my experience. There' | ||
+ | # ip link set mode connected ib0.b129 | ||
+ | Error: argument " | ||
+ | | ||
+ | # ^connected^datagram | ||
+ | ip link set mode datagram ib0.b129 | ||
+ | Error: argument " | ||
+ | |||
+ | The sysfs interface works: | ||
+ | # echo connected > / | ||
+ | |||
+ | Since I don't have any newer hardware, I don't have any information about Enhanced IPoIB. | ||
+ | |||
+ | =====SR-IOV===== | ||
====Hardware Settings==== | ====Hardware Settings==== | ||
The BIOS needs to have SR-IOV, ARI, and ACS support enabled. | The BIOS needs to have SR-IOV, ARI, and ACS support enabled. | ||
Line 78: | Line 131: | ||
SRIOV_EN | SRIOV_EN | ||
- | FPP_EN (Flow Priority something) controls whether the card appears as two PCI devices, or as a single device with two ports. Under mlx4, every VF on a dual-port HCA has both ports, and NUM_OF_VFS is how many dual-port devices to create. Under mlx5, each port gets its own pool of VFs and NUM_OF_VFS is per-port. | + | FPP_EN (Function Per Port ENable) controls whether the card appears as two PCI devices, or as a single device with two ports. Under mlx4, every VF on a dual-port HCA has both ports, and NUM_OF_VFS is how many dual-port devices to create. Under mlx5, each port gets its own pool of VFs and NUM_OF_VFS is per-port. |
- | I haven' | + | I haven' |
- | After a reboot, there should be a new file, / | + | To make VFs exist, put a number <= NUM_OF_VFS into sriov_numvfs |
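A hedged sketch of a guard around that write. set_numvfs is a hypothetical helper; it takes the device's sysfs directory as an argument rather than assuming a path, and it assumes the kernel's standard sriov_totalvfs attribute sits next to sriov_numvfs and reflects the firmware's NUM_OF_VFS.

```shell
# set_numvfs DEVDIR COUNT
# DEVDIR: the sysfs directory that holds sriov_numvfs and sriov_totalvfs
#         for the physical function (supplied by the caller).
# COUNT:  how many VFs to create.
set_numvfs() (
    devdir=$1
    want=$2

    total=$(cat "$devdir/sriov_totalvfs") || exit 1
    if [ "$want" -gt "$total" ]; then
        echo "requested $want VFs, device allows at most $total" >&2
        exit 1
    fi

    current=$(cat "$devdir/sriov_numvfs") || exit 1
    # The kernel rejects a change from one nonzero count to another,
    # so drop to 0 first when VFs already exist.
    if [ "$current" -ne 0 ] && [ "$current" -ne "$want" ]; then
        echo 0 > "$devdir/sriov_numvfs" || exit 1
    fi

    echo "$want" > "$devdir/sriov_numvfs"
)
```

Call it as root with the device directory and a count, e.g. set_numvfs "$devdir" 7, where $devdir is whatever directory holds sriov_numvfs on your host.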
- | | + | I'm still checking if there' |
- | + | ||
- | | + | # echo 0 > / |
- | 06:00.0 Infiniband controller: Mellanox Technologies MT27600 [Connect-IB] | + | # echo 0 > / |
- | 06:00.1 Infiniband controller: Mellanox Technologies MT27600 [Connect-IB] | + | |
- | 06:00.2 Infiniband controller: Mellanox Technologies MT27600 Family [Connect-IB Virtual Function] | + | If it works, there will be new PCI devices as well as VFs listed under `ip link`: |
- | 06:00.3 Infiniband controller: Mellanox Technologies MT27600 Family [Connect-IB Virtual Function] | + | |
- | 06:00.4 Infiniband controller: Mellanox Technologies MT27600 Family [Connect-IB Virtual Function] | + | |
- | 06:00.5 Infiniband controller: Mellanox Technologies MT27600 Family [Connect-IB Virtual Function] | + | # lspci | grep nfi |
- | 06:00.6 Infiniband controller: Mellanox Technologies MT27600 Family [Connect-IB Virtual Function] | + | 06:00.0 Infiniband controller: Mellanox Technologies MT27600 [Connect-IB] |
- | 06:00.7 Infiniband controller: Mellanox Technologies MT27600 Family [Connect-IB Virtual Function] | + | 06:00.1 Infiniband controller: Mellanox Technologies MT27600 [Connect-IB] |
- | 06:01.0 Infiniband controller: Mellanox Technologies MT27600 Family [Connect-IB Virtual Function] | + | 06:00.2 Infiniband controller: Mellanox Technologies MT27600 Family [Connect-IB Virtual Function] |
+ | 06:00.3 Infiniband controller: Mellanox Technologies MT27600 Family [Connect-IB Virtual Function] | ||
+ | 06:00.4 Infiniband controller: Mellanox Technologies MT27600 Family [Connect-IB Virtual Function] | ||
+ | 06:00.5 Infiniband controller: Mellanox Technologies MT27600 Family [Connect-IB Virtual Function] | ||
+ | 06:00.6 Infiniband controller: Mellanox Technologies MT27600 Family [Connect-IB Virtual Function] | ||
+ | 06:00.7 Infiniband controller: Mellanox Technologies MT27600 Family [Connect-IB Virtual Function] | ||
+ | 06:01.0 Infiniband controller: Mellanox Technologies MT27600 Family [Connect-IB Virtual Function] | ||
Warning: The output from ip link is very wide. | Warning: The output from ip link is very wide. | ||
- | | + | |
- | 1: lo: < | + | 1: lo: < |
- | link/ | + | link/ |
- | 2: eth0: < | + | 2: eth0: < |
- | link/ether 3c: | + | link/ether 3c: |
- | 3: eth1: < | + | 3: eth1: < |
- | link/ether 3c: | + | link/ether 3c: |
- | 4: ib0: < | + | 4: ib0: < |
- | link/ | + | link/ |
- | vf 0 | + | vf 0 |
- | vf 1 | + | vf 1 |
- | vf 2 | + | vf 2 |
- | vf 3 | + | vf 3 |
- | vf 4 | + | vf 4 |
- | vf 5 | + | vf 5 |
- | vf 6 | + | vf 6 |
- | 5: ib1: < | + | 5: ib1: < |
- | link/ | + | link/ |
- | 10: ib2: < | + | |
- | link/ | + | |
- | 11: ib3: < | + | |
- | link/ | + | |
- | 12: ib4: < | + | |
- | link/ | + | |
- | 13: ib5: < | + | |
- | link/ | + | |
- | 14: ib6: < | + | |
- | link/ | + | |
- | 15: ib7: < | + | |
- | link/ | + | |
- | 16: ib8: < | + | |
- | link/ | + | |
====VF Configuration==== | ====VF Configuration==== | ||
- | To set the GUID for VFs, set node_guid, port_guid, and state using ip link. Make port_guid == node_guid == unique. (I use the base port guid + VF + 1.) | + | The official documentation covers a sysfs interface |
- | Lazy copy-pasta for southpark: | + | GUIDs need to be set before attaching a VF to a VM. It should be possible to change |
- | ip link set dev ib0 vf 0 node_guid 58: | + | |
- | ip link set dev ib0 vf 0 port_guid 58: | + | |
- | ip link set dev ib0 vf 0 state enable | + | |
- | ip link set dev ib0 vf 1 node_guid 58: | + | |
- | ip link set dev ib0 vf 1 port_guid 58: | + | |
- | ip link set dev ib0 vf 1 state enable | + | |
- | ip link set dev ib0 vf 2 node_guid 58: | + | |
- | ip link set dev ib0 vf 2 port_guid 58: | + | |
- | ip link set dev ib0 vf 2 state enable | + | |
- | ip link set dev ib0 vf 3 node_guid 58: | + | |
- | ip link set dev ib0 vf 3 port_guid 58: | + | |
- | ip link set dev ib0 vf 3 state enable | + | |
- | ip link set dev ib0 vf 4 node_guid 58: | + | |
- | ip link set dev ib0 vf 4 port_guid 58: | + | |
- | ip link set dev ib0 vf 4 state enable | + | |
- | ip link set dev ib0 vf 5 node_guid 58: | + | |
- | ip link set dev ib0 vf 5 port_guid 58: | + | |
- | ip link set dev ib0 vf 5 state enable | + | |
- | ip link set dev ib0 vf 6 node_guid 58: | + | |
- | ip link set dev ib0 vf 6 port_guid 58: | + | |
- | ip link set dev ib0 vf 6 state enable | + | |
- | Lazy copy-pasta for shark: | + | Configuration is managed in / |
- | ip link set dev ib0 vf 0 node_guid 58: | + | |
- | ip link set dev ib0 vf 0 port_guid 58: | + | |
- | ip link set dev ib0 vf 0 state enable | + | |
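The repetitive per-VF commands from the earlier revision can be generated instead of pasted. This is only a sketch of that older scheme (GUID = base port GUID + VF + 1, node_guid equal to port_guid); vf_guid and vf_guid_commands are invented names, the base GUID is the southpark entry from the GUID list, and the script only prints commands so they can be reviewed before running.

```shell
# vf_guid BASE_HEX VF
# Print BASE + VF + 1 as a colon-separated 64-bit GUID.
# BASE_HEX is 16 hex digits without a 0x prefix.
vf_guid() {
    printf '%016x\n' $(( 0x$1 + $2 + 1 )) | sed 's/../&:/g; s/:$//'
}

# vf_guid_commands DEV BASE_HEX COUNT
# Print (not run) the ip link commands for VFs 0..COUNT-1.
vf_guid_commands() (
    dev=$1
    base=$2
    count=$3
    vf=0
    while [ "$vf" -lt "$count" ]; do
        guid=$(vf_guid "$base" "$vf")
        echo "ip link set dev $dev vf $vf node_guid $guid"
        echo "ip link set dev $dev vf $vf port_guid $guid"
        echo "ip link set dev $dev vf $vf state enable"
        vf=$((vf + 1))
    done
)

# Seven VFs on ib0, using the southpark GUID as the base (illustrative).
vf_guid_commands ib0 5849560e53b70b01 7
```

Pipe the output to sh once it looks right.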
=====Upper-Layer Protocols (ULPs)===== | =====Upper-Layer Protocols (ULPs)===== | ||
- | RDMA opens all kinds of possibilities for RDMA-aware protocols to be amazing and fast. | + | RDMA opens all kinds of possibilities for RDMA-aware protocols to be amazing and fast. They probably all deserve their own pages. |
+ | |||
+ | ===IP over InfiniBand (IPoIB)=== | ||
+ | It's already been used on this page extensively. IPoIB encapsulates IP traffic in InfiniBand datagrams so that protocols built for Ethernet (primarily TCP and UDP) can run on IB networks. Performance is generally poor. There are some hacks to make it faster, but the real reason to use IPoIB is to give your IB hosts IP addresses for setting up other ULPs. | ||
===VMA=== | ===VMA=== | ||
Line 169: | Line 193: | ||
* Apache | * Apache | ||
* PHP | * PHP | ||
- | * MySQL | + | * MySQL (though I think MySQL has its own RDMA backend?) |
* SSH (and consequently rsync and scp, right?) | * SSH (and consequently rsync and scp, right?) | ||
+ | * (It would also be cool to find out if newer cards with crypto functions could do hardware-accelerated SSH with RDMA.) | ||
None of these services is so performance-critical that I'll spend time configuring it for VMA, except maybe as a learning exercise. **Later**. | None of these services is so performance-critical that I'll spend time configuring it for VMA, except maybe as a learning exercise. **Later**. | ||
Line 183: | Line 208: | ||
====Networking==== | ====Networking==== | ||
===VXLAN=== | ===VXLAN=== | ||
- | VXLAN is not the only way to get an Ethernet device on InfiniBand, but as far as I can tell it's the only decent one. Neither ConnectX-3 nor Connect-IB | + | VXLAN is not the only way to get an Ethernet device on InfiniBand, but as far as I can tell it's the only decent one. None of my hardware |
* VXLAN id can be anything from 0-16777215 inclusive. I make it match the network number. | * VXLAN id can be anything from 0-16777215 inclusive. I make it match the network number. | ||
Line 196: | Line 221: | ||
export local=172.20.64.9 | export local=172.20.64.9 | ||
export group=225.172.20.64 | export group=225.172.20.64 | ||
- | ip link add name vxlan64 | + | |
- | + | | |
- | If/when I get hardware capable of VXLAN offload, the dstport might have to change. | + | |
Once the VNI exists, it can be added to a bridge: | Once the VNI exists, it can be added to a bridge: | ||
Line 205: | Line 229: | ||
//Et voilà//, Ethernet on top of InfiniBand. | //Et voilà//, Ethernet on top of InfiniBand. | ||
+ | |||
+ | If/when I get hardware capable of VXLAN offload, the dstport might have to change. | ||
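The examples on this page follow a pattern: the VXLAN id is the third octet of the local address, and the multicast group is 225 followed by the first three octets. That convention is inferred from the two examples here (172.20.64.9 and 172.20.128.13), not from any standard. A small sketch that derives both from the local address, with vxlan_params as a made-up name:

```shell
# vxlan_params LOCAL_IP
# Derive the VXLAN id and multicast group from a dotted-quad local
# address, following the convention inferred from this page's examples.
vxlan_params() (
    local_ip=$1
    IFS=.
    # Split the address into its four octets ($1..$4).
    set -- $local_ip
    [ $# -eq 4 ] || { echo "not a dotted quad: $local_ip" >&2; exit 1; }
    echo "id=$3 group=225.$1.$2.$3 local=$local_ip"
)

vxlan_params 172.20.64.9    # prints id=64 group=225.172.20.64 local=172.20.64.9
```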
+ | |||
====Multimedia==== | ====Multimedia==== | ||
- | Yeah, someday I want to throw video frames around. There' | + | Yeah, someday I want to throw video frames around. There' |
I also want to throw audio frames around with "no latency added" | I also want to throw audio frames around with "no latency added" | ||
+ | |||
+ | =====GUIDs===== | ||
+ | * 5849560e59150301 - shark Connect-IB | ||
+ | * 5849560e53b70b01 - southpark Connect-IB | ||
+ | * 5849560e53660101 - duckling Connect-IB | ||
+ | * 7cfe900300a0a080 - uninstalled Connect-IB | ||
+ | * (there are several more uninstalled Connect-IB cards) | ||
+ | * f4521403002c18b0 - uninstalled ConnectX-3 2014-01-29 | ||
+ | * 0002c90300b37f10 - uninstalled ConnectX-3 with no date on the label | ||
+ | * 001175000079b560 - uninstalled qib | ||
+ | * 001175000079b856 - uninstalled qib | ||
+ |
nndocs/infiniband.1710464178.txt.gz · Last modified: 2024/03/15 00:56 by naptastic