SethBlog

I've spent the better part of the last two days working with a co-worker, a Hyper-V administrator, trying to figure out why backups of VMs on his new Server 2012 R2 Hyper-V cluster were failing with error 156. We were initially using the default (auto) snapshot provider type, and that failed, logging a number of errors on the Hyper-V host relating to SNMP being unable to communicate with physical disks in the SAN.

This clued us in that it must be attempting to use snapshots at the array level rather than in the operating system. Changing the option to "Software" worked some of the time but still resulted in occasional 156 errors, and changing it to "System" failed completely. We opted not to engage Microsoft on this, and instead tried to get the hardware-based snapshots working first.

Today, he contacted Equallogic support, and together they properly configured the Equallogic's support for VSS: notably, enabling VSS access and changing the access permissions from "Volume Only" to "Volumes and Snapshots". That cleared up the error below:

In addition, the Host Integration Tools (HIT) kit was upgraded to 4.7. After that was done, setting the snapshot type back to "auto" was successful, and performance is quite good, with the 1 Gbps Hyper-V host NIC acting as the bottleneck for the backups.

Watching in the Equallogic management console, you can clearly see the snapshots being created at the array level during backups now.

I thought I'd post this because, as someone who administers vSphere on Equallogic, I would not have expected the array to get itself involved in backup operations unless explicitly configured to do so. Apparently it does for Hyper-V deployed against Equallogic, at least when using NetBackup 7.6.0.2 with the snapshot provider type set to auto.

I had an older Equallogic PS6000E SAN, configured for RAID 6, that was attached to a couple of vSphere hosts. Composed of a bunch of 1TB 7200 RPM SATA disks, it wasn't exactly built for performance, and I would often see it top out on IOPS for long periods of time in SAN HQ. After a bit of shuffling in our other datacenter, I freed up a PS6000XV SAN (600GB 15,000 RPM disks, in RAID 10) and decided to add it to the same pool in order to utilize the auto-tiering capabilities and boost the performance of the SATA SAN. My problems with IOPS were solved, but read latency remained stubbornly high. As I spent more time looking at the graphs, I realized that, strangely, the latency was highest when the IOPS were lowest, which is the opposite of what you'd expect. Shouldn't requests be answered faster when there is less work to do?

I did a bit of Googling and decided to re-read Dell's Best Practices for VMware guide for Equallogic storage. Buried inside are two very helpful tips that I don't remember being there years ago when I set up those SANs for the first time.

The important bits are found on pages 9-11. The section on Delayed ACK describes EXACTLY what I was seeing, so I disabled it, along with Large Receive Offload (LRO) for good measure. Note that this will require a reboot of your hosts, but that's what we have vMotion for, right?
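
To make the "latency is highest when IOPS are lowest" pattern a bit more concrete, here is a toy Python sketch of how a delayed ACK could add more wait time on a quiet link than on a busy one. It is not taken from the Dell guide and is not a model of the Equallogic firmware. The 200 ms timer, the evenly spaced I/Os, and the function name added_ack_delay_ms are all assumptions made purely for illustration.

```python
# Toy model only: every number here is an illustrative assumption,
# not a measurement from the arrays or hosts described in this post.

ACK_TIMEOUT_MS = 200.0  # assumed delayed-ACK timer; real values vary by TCP stack


def added_ack_delay_ms(iops: float, timeout_ms: float = ACK_TIMEOUT_MS) -> float:
    """Estimate the extra wait before a delayed ACK is sent.

    Simplification: the ACK goes out when the next segment arrives or
    when the timer expires, whichever comes first. Segments are assumed
    to arrive evenly spaced, one per I/O.
    """
    if iops <= 0:
        raise ValueError("iops must be positive")
    gap_ms = 1000.0 / iops  # time between consecutive I/Os
    return min(gap_ms, timeout_ms)


if __name__ == "__main__":
    for iops in (2, 10, 100, 1000, 5000):
        print(f"{iops:>5} IOPS -> up to {added_ack_delay_ms(iops):6.1f} ms extra")
```

In this simplified model the extra wait shrinks as IOPS rise, because the next segment shows up before the timer fires. That is at least consistent with the graphs I was staring at, though real behavior depends on the TCP stacks on both ends.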

As you can see in the graphs below, the improvements in my read latency were pretty stunning and instant. If you are experiencing high latency during periods of relatively low IOPS with your Equallogic SANs, then definitely give this a try.