ESXi


Storage DRS interrupts network on guests

  • 1.  Storage DRS interrupts network on guests

    Posted Feb 05, 2014 08:41 PM

    I am migrating virtual machine files off of datastores in a datastore cluster by putting each datastore into SDRS maintenance mode. While the VM files are migrating, I've noticed that the guests sometimes experience a network interruption. In one recent case a guest could not be pinged for over two minutes. Usually it is more like 15 seconds. There is nothing in the OS logs that indicates the OS is aware of any network outage.

    Does anyone know if this is considered "normal" or not? Is there any way to avoid this? My guest OS is Red Hat Linux. We are using the VMXNET3 NIC.



  • 2.  RE: Storage DRS interrupts network on guests

    Posted Feb 05, 2014 08:52 PM

    Is your storage connected via the network (e.g. iSCSI, NFS)?

    Have you checked performance data for the hosts involved to see if a network link was overloaded?



  • 3.  RE: Storage DRS interrupts network on guests

    Posted Feb 05, 2014 09:00 PM
    > Is your storage connected via the network (e.g. iSCSI, NFS)?

    No, this is Fibre Channel SAN storage on EMC disks.

    I have not monitored the network on the hosts while the storage vMotion was running. I can do that, but I was not expecting that this storage migration would result in any extra network traffic.  



  • 4.  RE: Storage DRS interrupts network on guests

    Posted Feb 05, 2014 09:06 PM

    It shouldn't cause network traffic. 

    Heavy storage utilization can drive up host CPU, and that could cause network drops. You may also be pounding your storage into submission, which causes extremely high latency for the VMs on that array. I would check host CPU stats, disk queue, and disk latency on the VMs that are dropping, and also check load on the SAN and the storage itself.

    How many Storage vMotions are you doing at once? I've found my high-end NetApp storage can only handle 2-3 at a time before I see the controller CPU go through the roof. Storage vMotions can create some serious throughput and load.



  • 5.  RE: Storage DRS interrupts network on guests

    Posted Feb 05, 2014 09:49 PM

    Thanks for the response. I am looking at VM performance stats from a VM that was unreachable for about a minute and a half today. Unfortunately the historical perf charts look like they use a 5 minute sample size, so I'll have to monitor again in real time. I do not see any CPU or I/O peaks during the time of the network outage, but I'm not sure why I'd see them on the VM.  If I look at network or I/O stats using Linux tools on the guest, it was not doing much at the time of the network outage, and it did not log any problems.

    I can't see why heavy SAN traffic would affect the network at all, unless it caused problems on one of the ESXi hosts.

    When I put a datastore into SDRS maintenance mode, all the files on the datastore start to migrate.  I suppose a good test would be to do these only one at a time, and see if we still have guest network problems.  

    We do have alerts set up for our ESXi hosts, and I would expect that if a host was experiencing IO latency or pegged CPU, I'd see an alert.  There are ten hosts on this cluster. I suppose I could limit my migration to files belonging to VMs running on just one of the hosts.



  • 6.  RE: Storage DRS interrupts network on guests

    Posted Feb 06, 2014 02:22 PM

    Well I looked closely at the ESXi host that was running one of the VMs that had the network outage.  In one case I did see a spike in CPU and IO (busy, not latency).  In another case I saw only a minor spike, less than previous spikes that caused no outage.  I can't help but think that if this was an issue of performance on the host, then other VMs on that host would also have network outages, not just the VM whose files were being moved. 



  • 7.  RE: Storage DRS interrupts network on guests

    Posted Feb 06, 2014 04:33 PM

    Did you check the stats on the SAN itself? Sure sounds like storage overloading to me. With 10 hosts, default rules would allow each host to perform two migrations at a time, resulting in up to 20 Storage vMotions at once.



  • 8.  RE: Storage DRS interrupts network on guests

    Posted Feb 07, 2014 07:40 PM

    I tried a single manual Storage vMotion, using the 'svmotion' command from the vMA. I wrapped it in a script that also pinged the VM every second. The VM went off the network for about 1.5 minutes. I then looked closely at all the performance charts on the ESXi host that ran this VM. The only interesting chart was Disk, which showed a big, wide spike in disk Usage (also Read and Write Rate). This spike dropped right at the time that the VM went off the network. So the vMotion was running just fine, and generating lots of IO without affecting my VM. Then something happened and the VM went off the network. The vMotion also failed with the "exceeded maximum switchover time of 100 seconds" error.
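
    For anyone wanting to reproduce the ping-every-second measurement, here is a minimal sketch in Python. It is not the original wrapper script: the function names `probe`, `monitor`, and `outage_windows` are made up for illustration, and `probe` assumes a Linux-style `ping` that accepts `-c` (count) and `-W` (timeout in seconds). It only records reachability; starting the migration itself is left out.

```python
import subprocess
import time
from typing import Iterable, List, Tuple

# One sample: (timestamp in seconds, True if the host answered the ping).
Sample = Tuple[float, bool]


def probe(host: str, timeout_s: int = 1) -> bool:
    """Send a single ICMP echo request and report whether a reply came back.

    Assumption: a Linux-style ping binary with -c and -W options.
    """
    result = subprocess.run(
        ["ping", "-c", "1", "-W", str(timeout_s), host],
        stdout=subprocess.DEVNULL,
        stderr=subprocess.DEVNULL,
    )
    return result.returncode == 0


def monitor(host: str, duration_s: float, interval_s: float = 1.0) -> List[Sample]:
    """Probe the host roughly once per interval for the given duration."""
    samples: List[Sample] = []
    deadline = time.monotonic() + duration_s
    while time.monotonic() < deadline:
        samples.append((time.time(), probe(host)))
        time.sleep(interval_s)
    return samples


def outage_windows(samples: Iterable[Sample]) -> List[Tuple[float, float]]:
    """Return (start, end) for each run of consecutive failed probes.

    start is the timestamp of the first failed probe; end is the timestamp
    of the first successful probe after it. An outage still open at the
    final sample is closed at that sample's timestamp.
    """
    windows: List[Tuple[float, float]] = []
    start = None
    last_ts = None
    for ts, ok in samples:
        if not ok and start is None:
            start = ts  # outage begins
        elif ok and start is not None:
            windows.append((start, ts))  # outage ends
            start = None
        last_ts = ts
    if start is not None:
        windows.append((start, last_ts))
    return windows
```

    Run `monitor` for the length of the migration, then pass its samples to `outage_windows` to get the start and end of each gap. Those timestamps can be lined up against the host's real-time performance charts.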



  • 9.  RE: Storage DRS interrupts network on guests

    Posted Feb 10, 2014 08:58 PM

    I've seen entire VMs crash if the back-end storage array is overloaded when multiple svMotions are happening simultaneously. SDRS maintenance mode is putting a lot of stress on your storage, so it may not be able to handle the spike in load.



  • 10.  RE: Storage DRS interrupts network on guests

    Posted Feb 10, 2014 09:16 PM

    Thanks, but I'm pretty sure this is not the problem. I see the same thing happening when I migrate a single VM, and there is no indication on the guest OS of processes waiting on IO. Sometimes the ESXi host shows a pretty big spike in IO usage, but not contention.  I migrated some more this weekend, and it seems real common for the VM to go into a quiescent state at the end of its migration, successful or not. This state lasts between 5 and 150 seconds.



  • 11.  RE: Storage DRS interrupts network on guests

    Posted Feb 10, 2014 09:52 PM

    Do you mind uploading the host and VM logs?



  • 12.  RE: Storage DRS interrupts network on guests

    Posted Feb 11, 2014 03:12 PM
    > Do you mind uploading the host and VM logs?

    Yes, I would mind that. I do appreciate your willingness to help, though.  I opened a support request for this issue last Thursday, and I'll try to post any good info here.  I am also about to start migrating a bunch of other disks on a different cluster; if for some reason the problem does not occur there, I'll let you all know.



  • 13.  RE: Storage DRS interrupts network on guests

    Posted Feb 12, 2014 03:07 PM

    An update to this: I am currently doing storage vMotions of VMs on another cluster, attached to a different storage array, in a different physical data center. I am doing these one at a time. I am still seeing network interruptions. Usually it is only one or two seconds, but some VMs are not reachable for up to 45 seconds. The long network outages appear to happen at the end of the storage migration, and in most cases the migration is successful.



  • 14.  RE: Storage DRS interrupts network on guests

    Posted Feb 18, 2014 02:41 PM

    UPDATE: VMware says this is normal.  It is normal for the guest OS to be frozen for a period of time after a storage migration.

    For us the average time the server is frozen is about 15 seconds. Plenty of the VMs are only unavailable for one second, but for others it is as high as 90 seconds. Usually anything over 100 seconds means the migration fails.