nndocs:ata-over-ethernet

Differences

This shows you the differences between two versions of the page.

nndocs:ata-over-ethernet [2023/01/07 00:54] – created naptastic
nndocs:ata-over-ethernet [2024/08/23 16:02] (current) – remove bug; no longer able to reproduce. naptastic
Line 3: Line 3:
 ====Perfect vs. Good: A Fight to the Death====
 
-----
 ===Preface (maybe doesn't belong in this video?)===
 
 A few months ago I had to move suddenly and put my lab into storage. Where I moved, there was basic WiFi, and nowhere to set up a desktop. My web services were offline for weeks and I got pretty discouraged. Now I've got an opportunity to set it all up again, and enough people have expressed interest, I'm going to document and publish the whole process, or try anyway.
  
-Follow-through is not my forte though, so I'm giving myself an incentive--an ulterior motive, if you will: I really want some faster network gear and equipment that can do SR-IOV so that I can play with hyperconvergence. If I finish this project, I'm going to buy myself that gear (unless somebody else buys it for me.) Even though I really can't justify the expense, space used, or power consumed, if I can prove to myself that I'm capable of finishing a project like this, then dammit I have earned my shinies. And I will use them to make videos about the cool things you can do with the faster networking equipment, software-defined networking and all that.
+The first set of videos is going to be details on how my SAN is set up, along with a comparison of some of the things I've tried. The format consists of a description of each technology, when I do and don't use it and why, and then a little bit of actual how-to in case that technology appeals to you. I hope any instruction I provide is helpful.
 
-The first set of videos is going to be details on how my SAN is set up, along with a comparison of all the things I've tried. The format consists of a description of each technology, when I do and don't use it and why, and then a little bit of actual how-to in case that technology appeals to you. I hope any instruction I provide is helpful.
+===Introduction to ATA over Ethernet (AoE)===
+You will almost certainly never see ATA over Ethernet used in production. It was used in a few SAN products but eventually lost out to iSCSI and Fibre Channel. I'm covering it anyway, and first mainly because it's a good teaching tool. It's easy to get started, and easy to show off different concepts that will become relevant with the more popular technologies. It's also a really handy tool to have in your toolbox for moving data.
  
-----
+For full support (initiator and target) you just need two packages:
-
+  apt install vblade aoetools
-You will almost certainly never see ATA over Ethernet used in production. It was used in a few SAN products but eventually lost out to iSCSI and Fibre Channel. I'm covering it anyway, and first mainly because it's a good teaching tool. It's easy to get started, and easy to show off different concepts that will become relevant with the more popular technologies. It's a really handy tool to have in your toolbox for moving data if all you have is Ethernet. +
- +
-Right now, it has a bug that can cause systems on the network not to shut down or reboot if there's an AoE server on the network, so it shouldn't be used in production. (I need to dig into this.)+
  
 To export a block device to the network, you use a program called vblade. A daemonized version, vbladed, works with the same options. It starts a server that listens on layer 2 for ATA commands and responds to them. Here is (basically) how you use vblade:
- +  vbladed shelf slot ethdev filename
-vbladed shelf slot ethdev filename+
  
 Other options include sharing only part of a file, SYNC and DIRECT I/O modes, and buffer count. I/O modes and buffer counts require testing. Partial file sharing is there so the operator can logically divide a disk or file but in my opinion that's a bad enough idea I'm not even going to try it. Splitting a device for export is a concern that belongs to a filesystem, or a controller, or something that provides thin provisioning behind a strong layer of abstraction, like... a filesystem.
Line 26: Line 22:
 On BTRFS, if a directory has the +C attribute, you can preallocate a file of a given size and (is it contiguous?) it's pretty close to native I/O. (How close?)
  
-ATA over Ethernet organizes disks by "slots" in "shelves". The operator supplies the values. Shelf can be any value from 0-65534 except 4095. Slot can be any value from 0-254. Unfortunately, vblade doesn't protect you from setting invalid values. If an invalid value is set, the initiator/client machine will get confused, probably fail to read the drive, and maybe give you a helpful error message, but probably not. ("Check DIP switches"?)
+ATA over Ethernet organizes disks by "slots" in "shelves". The operator supplies the values.
 +  * Shelf can be any value from 0-65534 except 4095. 
 +  * Slot can be any value from 0-254. 
 +  * Ethdev must be an **Ethernet** device. Bridges, VLAN interfaces, and VXLAN tunnels are all as good as gigabit Ethernet. 
 + 
 +Unfortunately, vblade doesn't protect you from setting invalid values. If an invalid value is set, the initiator/client machine will get confused, probably fail to read the drive, and maybe give you a helpful error message, but probably not. ("Check DIP switches"?)
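
Since vblade won't check the address for you, a small wrapper can. This is only a sketch: `aoe_addr_ok` is a made-up helper name, and the ranges are the ones listed above, nothing more authoritative than that.

```shell
#!/bin/sh
# Sketch: refuse AoE addresses outside the ranges given above before
# handing them to vbladed. aoe_addr_ok is a hypothetical helper name.
aoe_addr_ok() {
    shelf=$1 slot=$2
    # Reject anything that is not a plain non-negative integer.
    case $shelf in ''|*[!0-9]*) return 1 ;; esac
    case $slot  in ''|*[!0-9]*) return 1 ;; esac
    # Shelf: 0-65534, but not 4095. Slot: 0-254.
    [ "$shelf" -le 65534 ] && [ "$shelf" -ne 4095 ] && [ "$slot" -le 254 ]
}

# Usage (not run here): only start the export if the address is valid.
# aoe_addr_ok 1 1 && vbladed 1 1 eth0 vm.raw
```

The function only returns a status, so it can sit in front of any vblade or vbladed call without changing it.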
  
 So, if you want to export a raw VM image from your current directory, you'd do this:
- +  vbladed 1 1 eth0 vm.raw
-vbladed 1 1 eth0 vm.raw+
  
 ...then on the initiator machine, run aoe-discover, aoe-stat, ls -al /dev/etherd
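
To make the initiator side a bit more concrete, here is a sketch of where the exported disk should show up. The Linux aoe driver names devices by shelf and slot under /dev/etherd; `aoe_dev_path` is a hypothetical helper, and the commands in the comments need root and the aoetools package.

```shell
#!/bin/sh
# Sketch: build the device path the Linux aoe driver uses for a given
# shelf and slot. aoe_dev_path is a hypothetical helper name.
aoe_dev_path() {
    shelf=$1 slot=$2 part=$3
    # Whole device: /dev/etherd/e<shelf>.<slot>
    # Partition:    /dev/etherd/e<shelf>.<slot>p<partition>
    printf '/dev/etherd/e%s.%s%s\n' "$shelf" "$slot" "${part:+p$part}"
}

# On the initiator (needs root and the aoetools package; not run here):
# modprobe aoe
# aoe-discover
# aoe-stat
# ls -al "$(aoe_dev_path 1 1)"
```

For the vm.raw export above (shelf 1, slot 1), that works out to /dev/etherd/e1.1.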
  
-Once the remote device is in /dev, you can use it like any other device. If it has partitions, Linux will find them automatically. Attaching it to a virtual machine is especially handy, since it will have the same device name on any system that can access it. (Demo VM migration with AoE backing store?)
+Once the remote device is in /dev, you can use it almost like any other block device. If it has partitions, Linux will find them automatically.
 
-----
+The one downside is that, for reasons I haven't investigated, AoE devices can't be attached directly to virtual machines. I wonder if KVM/QEMU doesn't like the device name; if that's the case, could udev rename the block device something more consistent with modern sensibilities? (E.g., /dev/aoe${shelf}s${slot}p${partition} and the partition indicator is optional?)
  
 +===Aside: Why "Target" and "Initiator"?===
 Difference between server/client and target/initiator:
-  - server/client connections are ephemeral
-  - targets are expected never to go away. (Confuse aoe-discover)
-  - the paradigm is different
-  - high performance
-  - sacrifices in flexibility
-  - requires low latency and delivery guarantees
+  - server/client connections are ephemeral, but targets are expected never to go away. (Demo confusing aoe-discover)
+  - target/initiator connections are much higher-performance, but typically require more configuration, and sacrifice flexibility.
+  - storage protocols require guarantees about reliability, ordering, latency, and well-defined behavior if those guarantees aren't met.
+  - Most importantly: target/initiator connections are **only** for storage and retrieval of data. The data that's returned from a target will always be whatever the initiator stored there most recently.
  
 +===Multipath===
 +ATA over Ethernet supports multipath natively and automatically. If AoE discovers a new link to the same slot and shelf on a different Ethernet interface, it will start sending commands and responses on both links round-robin, providing both failover and load balancing. In my experience, speed increases almost linearly with the number of links added. My personal best over GigE is 519 MiB/s using five links. Using VXLAN over InfiniBand, I got 1.4 GiB/s, but that was probably limited by the drive. Benchmarks later.
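
For a rough sense of what "almost linearly" means for the GigE number, here is the arithmetic as a sketch. It compares the measured figure to the raw line rate of the links and ignores Ethernet and AoE framing overhead, so the real ceiling is somewhat lower. `link_efficiency` is a hypothetical helper name.

```shell
#!/bin/sh
# Sketch: compare a measured throughput to the raw line rate of N
# gigabit links. link_efficiency is a hypothetical helper name.
# Framing overhead is ignored, so the real ceiling is lower than this.
link_efficiency() {
    links=$1 measured_mib=$2
    LC_ALL=C awk -v n="$links" -v m="$measured_mib" 'BEGIN {
        per_link = 1e9 / 8 / 1048576      # 1 Gb/s expressed in MiB/s
        ceiling  = per_link * n
        printf "%.1f MiB/s raw ceiling, %.1f%% used\n", ceiling, 100 * m / ceiling
    }'
}

link_efficiency 5 519
```

Five gigabit links come to a raw ceiling of about 596 MiB/s, so 519 MiB/s is roughly 87% of it.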
  
 Section 1.1 of the ATA over Ethernet standard:
Line 52: Line 53:
 https://web.archive.org/web/20161025044402/http://brantleycoilecompany.com/AoEr11.pdf
  
 +When something goes wrong, such as a link disappearing, AoE blocks for 10 seconds by default. That's a long time for your users to be wondering what's going on, and it only has to happen a couple of times before they stop trusting you. The timeout value lives in FIXME and is specified in seconds. A shorter value would make more sense. In the context of a modern SAN, 10 ms is **plenty** for a timeout. Maybe it should be higher for spindle drives--whatever, I can't change the kernel. At least I don't think I can.
  
-When something goes wrong such as a link disappearing, AoE blocks for 10 seconds by default. That's a long time for your users to be wondering what's going on, and it only has to happen a couple of times before they stop trusting you.
+===Persistent Configuration===
-
+vblade and vbladed do not maintain state between or across instances. If you need an ATA over Ethernet export to come back after reboot, you will need your OS to manage vblade processes. On Debian, that's done by putting shell script fragments in /etc/vblade.conf.d/. There is an example file installed there with all directives commented out:
-----
  
-So I glossed over security and VLANs earlier. ATA over Ethernet is designed to run inside of trusted networks. By default, it runs wide open: any host in the same layer 2 broadcast domain can access Originally that meant physical separation. Now that separation is more likely to be implemented with VLANs. [show off VLAN setup iterations. Does performance change? Testing needed...]
+  # This is a POSIX shell fragment
 +   
 +  # configuration of a single vblade instance 
 +   
 +  # Supported variables: 
 +   
 +  # shelf address. Mandatory
 +  # shelf=
 +   
 +  # slot address. Mandatory
 +  # slot=
 +   
 +  # Network interface name. Mandatory
 +  # netif=
 +   
 +  # The name of the regular file or block device to export. Mandatory
 +  # filename=
 +   
 +  # Other options, see vblade(8) 
 +  # options= 
 +   
 +  # ionice= 
 +  # Set the I/O scheduling class and priority. 
 +  # Must be understood by ionice(1) 
 +   
 +  # Example: 
 +  # shelf=10 
 +  # slot=3 
 +  # netif=em3 
 +  # filename=/dev/mapper/export 
 +  # options='-m 11:22:33:44:55:66 -o 8' 
 +  # ionice='--class best-effort --classdata 7'
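
To see how a fragment like that maps onto the command line from earlier, here is a sketch that sources one and prints the equivalent vbladed invocation. `vblade_cmdline` is a hypothetical helper and not part of the Debian package; the packaged service scripts may well do this differently. It assumes options go before the shelf and slot, and it ignores the ionice variable.

```shell
#!/bin/sh
# Sketch: turn a vblade.conf.d-style fragment into the equivalent vbladed
# command line. vblade_cmdline is a hypothetical helper, not part of the
# Debian package; the packaged service scripts may do this differently.
vblade_cmdline() (
    # The function body is a subshell, so the fragment's variables
    # do not leak into the caller.
    shelf= slot= netif= filename= options=
    . "$1" || exit 1
    # The four mandatory variables must all be set.
    for v in "$shelf" "$slot" "$netif" "$filename"; do
        [ -n "$v" ] || { echo "missing mandatory variable in $1" >&2; exit 1; }
    done
    # Assumption: options are placed before the shelf and slot arguments.
    echo "vbladed${options:+ $options} $shelf $slot $netif $filename"
)

# Usage (path must contain a slash so the shell does not search PATH):
# vblade_cmdline /etc/vblade.conf.d/vm.conf
```

It only prints the command, which makes it a harmless way to check a fragment before letting the OS start anything from it.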
  
-----
+===Security===
 +ATA over Ethernet is intended to run inside of trusted networks. By default, it runs wide open: any host in the same layer 2 broadcast domain has full access to any exported volume. There is no distinction between read-only and read-write access. Preventing unwanted access has to be done by dividing broadcast domains. Originally that meant physical separation--different network adapters, cables, and switches. Now, that separation is more likely to be implemented inside the switch using VLANs or VXLAN tunnels.
  
-AoE can also restrict access by MAC address. MAC addresses are easy to spoof so this isn't actually secure. It's just good practice for what's coming. [lightning strike, lol?]
+SAN technologies generally have some kind of ACL mechanism. This has benefits for security and discoverability. As a configuration or command-line option, vblade can take one or more MAC addresses to which to restrict access. Hosts not on the list can't (see|access) that device. This should not be considered an especially robust mechanism since Ethernet addresses are nearly trivial to spoof.
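
A sketch of what the MAC restriction might look like in practice. The -m option is the one used in the Debian example file; that it accepts a comma-separated list is my assumption here, so check vblade(8) before relying on it. `mac_list` is a hypothetical helper that just validates and joins the addresses.

```shell
#!/bin/sh
# Sketch: build the argument for vblade's -m option from a list of MAC
# addresses, rejecting anything malformed. mac_list is a hypothetical helper.
mac_list() {
    out=
    for mac in "$@"; do
        # Six colon-separated pairs of hex digits.
        printf '%s\n' "$mac" |
            grep -Eq '^([0-9A-Fa-f]{2}:){5}[0-9A-Fa-f]{2}$' || return 1
        out=${out:+$out,}$mac
    done
    printf '%s\n' "$out"
}

# Usage (not run here), assuming -m takes a comma-separated list:
# vbladed -m "$(mac_list 11:22:33:44:55:66 66:55:44:33:22:11)" 1 1 eth0 vm.raw
```

This catches typos in the list, nothing more; it does nothing about spoofing.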
  
 As you put these values into these configuration files, imagine that you are actually plugging different hard drives into different computers. It's not about moving data to a different drive anymore; it's about moving the drive to where the user needs it to be, and doing so in a completely virtual way.
  
--> ACLs
+===Boot===
--> VLANs
+And that brings us neatly to maybe the most useful thing about a SAN: it makes local storage unnecessary. iPXE supports ATA over Ethernet directly. The DHCP server has to provide a suitable root-path option. For isc-dhcp-server, telling a host to boot from shelf 12, slot 9 looks like this:
  
-----
+  option root-path "aoe:e12.9";
  
-ATA over Ethernet supports multipath natively and automatically. If AoE discovers a new link to the same device, it will start sending commands and responses on each link round-robin, providing both failover and load balancing. In my experience, speed increases almost linearly with the number of links added. My personal best is 519 MiB/s using five gigabit Ethernet links, but I might be able to do better than that. Benchmarks later.
+The DHCP server must not also provide a TFTP next-server and filename. If it does, iPXE will boot via TFTP instead.
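
If you have more than a couple of hosts, generating the root-path line from the shelf and slot saves typos. A minimal sketch, with `aoe_root_path` as a hypothetical helper name; it only formats the option shown above and knows nothing else about dhcpd.

```shell
#!/bin/sh
# Sketch: print the dhcpd option line for booting from a given shelf and
# slot. aoe_root_path is a hypothetical helper name.
aoe_root_path() {
    printf 'option root-path "aoe:e%s.%s";\n' "$1" "$2"
}

aoe_root_path 12 9
```

For shelf 12, slot 9 this prints the same line as the isc-dhcp-server example.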
  
 +FIXME As far as I can tell, there's no way to have your root volume on ATA over Ethernet. iPXE can use AoE to fetch a bootloader, but that's it: neither Linux nor Windows can use it as a root volume.
nndocs/ata-over-ethernet.1673052868.txt.gz · Last modified: 2023/01/07 00:54 by naptastic