More storage for Utah Cloudlab cluster
Spinning this off from the more general #567. So sayeth that issue:
> Utah: 43.5TB in one server, 6.7TB free. We have a couple of options for more space here. One is the DriveScale chassis which have over 100TB between them but needs more work to integrate with the model. The other is to wire up the Apt half of the Dell storage box. This would add a second server with 36TB. Requires running a 40Gb link from the Apt side of the pod over to bighp1.
These are still the primary hardware options, with a few updates:
- DriveScale got bought up, so we don't have to worry about using their SW anymore! The JBOD is pretty basic, I think, as is the controller node it connects with (just a PC running Linux), so we could probably just get it running with FreeNAS as a regular storage server.
- The "steal Apt's storage server" solution is more practical because we never put the Apt server back in operation after rebuilding the storage box. Right now it is just an oversized FreeNAS testing box.
- A cheaper and possibly faster-to-implement alternative to running a 40Gb fiber around the pod would be to just export the current Apt RAID volumes out the Cloudlab SATA ports. This leaves the second storage server with just its 1TB of fast local storage, which should be enough for testing. It does limit the potential throughput we could get from two servers with twice as much SATA and network BW.
- We could also consider populating more of the slots in the MD3260; we are only using 25 of the 60 bays. Dell-certified 4TB drives from harddrivesdirect.com range from $235 (qty 1) to $181 (qty 10) each; 6TB models range from $218 to $202, though we would need to double-check that those are compatible. Rough cost math below.
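
Back-of-the-envelope, assuming we could fill all 35 empty bays with 4TB drives at the qty-10 price: 35 × $181 ≈ $6,335 for roughly 140TB of additional raw capacity, before any RAID or ZFS overhead.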
On the software side, I have discovered that the ZFS zvol volume size (-V when creating) is a space limit from the perspective of the user, and not a limit on how much space the zvol will consume. While I knew this at some level, I failed to appreciate just how much overhead ZFS can introduce on top of that, including RAID parity, checksums, and fragmentation due to misalignment of the various blocksizes. Apparently the recommendation for iSCSI on zvols is to not allocate more than 50% of the pool capacity. In our most extreme example, we have a 15TB dataset which is using 21TB of disk space, even though the overlaid Linux filesystem is only using 7TB. Potential improvements here:
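
For reference, this is roughly how the mismatch shows up from the FreeNAS shell. The pool/zvol names and the 64K volblocksize below are placeholders for illustration; the properties are the ones worth watching:

```sh
# -V sets the size the iSCSI client sees; it is NOT a cap on pool space consumed.
# "tank/iscsi/node01" and the 64K volblocksize are made-up examples.
zfs create -V 15T -o volblocksize=64K tank/iscsi/node01

# Compare what the client was promised (volsize) against what the pool has
# actually consumed (used), the pre-parity/checksum payload (logicalused),
# and the space held back for a non-sparse zvol (refreservation).
zfs get volsize,used,logicalused,refreservation,compressratio tank/iscsi/node01
```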
- Use "thin" zvols. This really doesn't solve anything though, it just allows us to overbook storage and probably everyone will run out of space in a truly ugly way down the road.
- Turn on compression for the zvols. This is one way of implementing "don't allocate the full disk capacity", but of course it depends on the nature of the data and doesn't address the metadata overhead.
- Switch to using iSCSI volumes backed by files on a ZFS filesystem instead of zvols. This is what a number of people say you should do, as apparently it has fewer "misalignment" problems and performs better overall.
- Turn on the "discard" option when we create ext4 filesystems, so that they TRIM. This will help when the overlayed filesystem has a lot of free space.
- Use something other than ZFS? This would be a lot of work to implement at this point.
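
A rough sketch of what the first four options look like from the command line; all pool, dataset, and device names here are placeholders, and in practice the FreeNAS GUI would drive most of the server-side pieces:

```sh
# Thin (sparse) zvol: no refreservation, so it only consumes what is actually
# written -- which is exactly how we would end up overbooked.
zfs create -s -V 15T tank/iscsi/node02

# Compression on an existing zvol; only affects blocks written after the change.
zfs set compression=lz4 tank/iscsi/node01

# File-backed iSCSI extent instead of a zvol: a sparse file on a regular ZFS
# filesystem, which the FreeNAS iSCSI target then exports.
zfs create tank/extents
truncate -s 15T /mnt/tank/extents/node03.img

# On the Linux client, mount ext4 with online discard (or run fstrim
# periodically) so freed blocks get handed back to the backing store.
mount -o discard /dev/sdb1 /data
fstrim -v /data
```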