Very odd slowdown after a host recovery

    Posted May 18, 2015 11:45 AM

    My environment is on a budget and not optimal, but please read through. I find some of the behavior after a host freeze and reset weird, and I can't pinpoint the issue.

    I have 3 hosts, a Dell R720 (192GB) and two HP DL380 G8 (96GB), running VMware 5.1, updated as of a couple of weeks ago. I have "Essentials Plus" with vCenter.

    My SAN is a Dell R720xd running FreeNAS (FreeBSD, istgt). The disks are not fast: it has 8x 7k spindles. HW RAID is not used; I use ZFS directly on each spindle.

    My switch, which handles SAN and all other traffic, is a Cisco 3750X, with separate ports for vMotion, san1, san2, mgm, and guest network VLANs (4x 1gig to each host).

    I have 70 VMs or so, 90% Windows Server 2008 R2, spread over 10 shared datastores (iSCSI targets), as vmdk files.

    All of this is working fine and dandy for the most part. The SAN is a bit slow when a lot of things are going on, but that is expected. Things like vMotion have no issues, restart

    Every now and again (about once a month), a host will freeze on me. Which one appears to be random, and this question is not about narrowing down that issue (someday I will get a decent SAN and make all hosts in the cluster equal hardware). The way it freezes is that I lose network connectivity: oddly, it seems to start with some VMs, then within minutes all VMs are unreachable. I can usually reach the host itself, but restarting the management agents doesn't solve it. I have spent hours on that and I always end up resetting the host.

    So, when a host is reset, all its VMs fire up on the other hosts. A few weeks back I thought my slowdown issue was simply that my SAN was way overworked due to all these VMs firing up at the same time, so I disabled HA for the VMs. I wanted the ability to fire up VMs manually so that I can control what is happening. Here is where the weird behavior kicks in: it is still incredibly slow. I assume it is related to disk or I/O, but I am unable to pinpoint it. Nothing shows any spikes or ceilings that I can find. I can't find a way to measure IOPS, so I don't know the numbers, but CPU and the numbers from iostat don't seem high to me.
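    One rough approach to the IOPS question, as a sketch only: on the FreeNAS side, the r/s and w/s columns of `iostat -x` added together give operations per second for each disk. The Python below assumes the output has a header row containing `r/s` and `w/s`; the helper name `iops_per_device`, the device names, and every number in the sample text are made up for illustration and are not measurements from this SAN.

```python
def iops_per_device(iostat_text):
    """Return {device: r/s + w/s} from text shaped like `iostat -x` output.

    Assumption: a header row names the columns and includes "r/s" and "w/s".
    Columns are located by name, so extra columns do not matter.
    """
    per_device = {}
    read_col = write_col = None
    for line in iostat_text.splitlines():
        fields = line.split()
        if not fields:
            continue
        if "r/s" in fields and "w/s" in fields:
            # Header row: remember where the two rate columns are.
            read_col = fields.index("r/s")
            write_col = fields.index("w/s")
            continue
        if read_col is None or len(fields) <= max(read_col, write_col):
            continue  # banner line or a row too short to hold both columns
        try:
            per_device[fields[0]] = float(fields[read_col]) + float(fields[write_col])
        except ValueError:
            continue  # not a data row
    return per_device


# Illustrative sample only; these are invented numbers.
SAMPLE = """\
                        extended device statistics
device     r/s   w/s    kr/s    kw/s qlen svc_t  %b
da0       55.0  20.5   880.0   310.0    2  14.1  61
da1       60.0  18.5   960.0   290.0    3  15.8  66
"""

if __name__ == "__main__":
    result = iops_per_device(SAMPLE)
    for device, iops in sorted(result.items()):
        print(device, iops)
    print("total", sum(result.values()))
```

    Comparing the total during normal operation with the total right after a host reset would at least show whether the spindles are actually being asked to do more work.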

    To summarize

      - One host is frozen and all its VMs are unreachable, but the rest of the stuff (2/3 of the VMs) is running fine

      - the host reset is performed

      - as the host recovers, all the "jailed" guests show up as powered off

      - I fire up a single VM

      - all VMs running on the other two hosts become syrup; this probably happens even without any VMs being fired back up, but I haven't really tested that

    Given that this affects all hosts, I assume it is storage related. IOPS capacity is not great, I am sure, but I cannot fathom what is happening. Why is there increased activity that literally takes all my capacity simply from bringing up a recovered host?

    Do VMware (5.1) hosts do a bunch of datastore cleanup during such a recovery?