Hello,
For the past several days, maybe more than a week, we have been seeing some strange ping latency.
At first we thought it had something to do with our network equipment, but after a few hours of investigation we came to the following conclusions:
1) Pinging VM to VM (both hosted on the same ESXi host) results in an average above 1.5ms, with spikes up to 50ms, over 100 pings at the default interval.
2) We created two VMs on a separate vSwitch without any dedicated NIC and pinged between them. The result was better but still not good enough: average above 0.5ms, with spikes up to 3ms and even 10ms.
3) Pinging the ESXi management interfaces from other devices on the same LANs showed good latency: average around 0.2ms, with spikes up to 1.7ms.
4) Pinging devices on the network from the ESXi console itself (over SSH) showed higher latency than expected: average above 0.6ms, with spikes up to 5ms.
5) The interesting part: pinging localhost from the ESXi console gives an average above 0.3ms, with spikes up to 2-3ms.
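To make the numbers above easier to reproduce and compare between runs, the ping output can be summarised with a small script. This is only a minimal sketch: it assumes the usual per-packet line format ending in "time=0.345 ms", and the function name summarize_ping and the sample address 192.0.2.10 are made up for illustration.

```python
import math
import re
import statistics

# Assumed per-packet line format: "... icmp_seq=1 ttl=64 time=0.345 ms"
RTT_RE = re.compile(r"time[=<]([\d.]+)\s*ms")


def summarize_ping(output):
    """Summarise the per-packet round-trip times found in ping output."""
    rtts = sorted(float(m.group(1)) for m in RTT_RE.finditer(output))
    if not rtts:
        raise ValueError("no RTT samples found in ping output")
    # Nearest-rank 95th percentile: the smallest sample that has at
    # least 95% of all samples at or below it.
    p95_index = math.ceil(0.95 * len(rtts)) - 1
    return {
        "count": len(rtts),
        "avg_ms": statistics.mean(rtts),
        "p95_ms": rtts[p95_index],
        "max_ms": rtts[-1],
    }


if __name__ == "__main__":
    # Hypothetical sample; in practice, paste or pipe the real ping output.
    sample = (
        "64 bytes from 192.0.2.10: icmp_seq=0 ttl=64 time=0.2 ms\n"
        "64 bytes from 192.0.2.10: icmp_seq=1 ttl=64 time=1.0 ms\n"
        "64 bytes from 192.0.2.10: icmp_seq=2 ttl=64 time=0.3 ms\n"
    )
    print(summarize_ping(sample))
```

Reporting a percentile next to the average and maximum helps separate a generally slow path from one that is fast most of the time but has occasional spikes, which is what we seem to be seeing.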
We suspect contention or a bottleneck somewhere on the ESXi host, but we have not been able to confirm this yet. CPU usage in esxtop is around 65-80%, with spikes up to 85%. Could this be the cause of our issue? Here is some output from esxtop:
PCPU USED(%): 59 61 52 63 74 59 68 74 AVG: 64
PCPU UTIL(%): 60 61 53 63 75 60 69 74 AVG: 64
ID      GID     NAME            NWLD  %USED   %RUN  %SYS   %WAIT %VMWAIT   %RDY  %IDLE %OVRLP %CSTP %MLMTD %SWPWT
1       1       idle               8 275.55 546.00  0.01    0.00       - 224.18   0.00   7.03  0.00   0.00   0.00
8       8       helper            86  97.66  99.76  0.00 8101.96       -  51.99   0.00   2.24  0.00   0.00   0.00
1786346 1786346 FreeBSD9_037      10  71.96  71.78  2.37  771.33    1.69  58.69 196.71   2.49 53.55   0.00   0.00
6218425 6218425 FreeBSD9_152       8  69.71  70.18  2.55  657.31    0.43  35.38  85.34   3.17  0.36   0.00   0.00
4825332 4825332 webhosting01.wh   12  41.80  39.86  3.44 1070.75    0.43  36.05 305.25   2.12  1.45   0.00   0.00
6363251 6363251 esxtop.36586035    1  20.00  19.33  0.00   75.09       -   0.11   0.00   0.04  0.00   0.00   0.00
5587218 5587218 CentOS5_148       10  17.43  15.48  1.89  907.99    1.08  32.95 333.24   0.62  0.00   0.00   0.00
1528430 1528430 FreeBSD9_116       8  17.00  16.97  0.81  707.03    0.13  39.60 134.28   0.95  0.19   0.00   0.00
4108400 4108400 FreeBSD9_140       8  13.54  13.67  0.38  725.60    4.52  22.63 146.88   0.57  4.08   0.00   0.00
1884461 1884461 FreeBSD9_134       8  12.79  12.49  0.67  738.98    0.18  13.53 165.37   0.53  0.00   0.00   0.00
6112231 6112231 FreeBSD9_143       7  12.24  11.99  0.96  647.13    0.00   7.75  75.78   0.79  0.00   0.00   0.00
4409984 4409984 Win7_128           8   9.06   9.20  0.04  742.67    0.02   3.87 176.57   0.25  0.00   0.00   0.00
6285951 6285951 Unattended_Depl    9   8.73   8.01  0.84  835.92    0.04   6.54 174.64   0.48  0.00   0.00   0.00
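To see which VMs stand out in terms of CPU scheduling, the %RDY and %CSTP columns from the output above can be screened with a short script. This is a rough sketch under stated assumptions: the helper names parse_rows and flag_rows are made up, the two limits (5.0 and 3.0) are arbitrary placeholders and not official thresholds, and dividing %RDY by NWLD is only an approximation, since NWLD counts every world in the group and not just the vCPUs.

```python
# Column order as printed by esxtop in the output above.
COLUMNS = ["ID", "GID", "NAME", "NWLD", "%USED", "%RUN", "%SYS", "%WAIT",
           "%VMWAIT", "%RDY", "%IDLE", "%OVRLP", "%CSTP", "%MLMTD", "%SWPWT"]

# A few rows copied from the esxtop output above.
SAMPLE = """\
1786346 1786346 FreeBSD9_037 10 71.96 71.78 2.37 771.33 1.69 58.69 196.71 2.49 53.55 0.00 0.00
6218425 6218425 FreeBSD9_152 8 69.71 70.18 2.55 657.31 0.43 35.38 85.34 3.17 0.36 0.00 0.00
4108400 4108400 FreeBSD9_140 8 13.54 13.67 0.38 725.60 4.52 22.63 146.88 0.57 4.08 0.00 0.00
4409984 4409984 Win7_128 8 9.06 9.20 0.04 742.67 0.02 3.87 176.57 0.25 0.00 0.00 0.00
"""


def parse_rows(text):
    """Parse whitespace-separated esxtop rows (names must not contain spaces)."""
    rows = []
    for line in text.strip().splitlines():
        fields = dict(zip(COLUMNS, line.split()))
        rows.append({
            "name": fields["NAME"],
            "nwld": int(fields["NWLD"]),
            "rdy": float(fields["%RDY"]),
            "cstp": float(fields["%CSTP"]),
        })
    return rows


def flag_rows(rows, rdy_per_world_limit=5.0, cstp_limit=3.0):
    """Return (name, %RDY per world, %CSTP) for rows above either limit.

    Both limits are placeholder assumptions to be tuned, not official values.
    """
    flagged = []
    for row in rows:
        rdy_per_world = row["rdy"] / row["nwld"]
        if rdy_per_world > rdy_per_world_limit or row["cstp"] > cstp_limit:
            flagged.append((row["name"], round(rdy_per_world, 2), row["cstp"]))
    return flagged


if __name__ == "__main__":
    for name, rdy, cstp in flag_rows(parse_rows(SAMPLE)):
        print(f"{name}: %RDY per world {rdy}, %CSTP {cstp}")
```

With these placeholder limits, FreeBSD9_037 is flagged on both counts and FreeBSD9_140 on %CSTP alone, which at least suggests where to look first.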
The helper process is using quite a lot of CPU, but I have no idea how to debug it further or whether it could be the cause of this.
There is no bottleneck/contention on the network side.
Our setup is pretty simple:
One ESXi 5.1.0 build 1065491 host running on an old HP DL585 G2 with 8x Opteron 8218 CPUs and 64GB RAM. The host is connected to the rest of the infrastructure via two switches: one gigabit switch for production, reachable from outside (public IP addresses), and one 100Mbps switch for internal management using private IP addresses. One NIC is connected to each of these two switches and we're using a vDS. There are two management/VMkernel interfaces: one on the public network and one on the internal network. Customer VMs are in the same LAN/network as the public management interface, with no VLANs.
For storage we are using a SAN, an EMC Clariion CX3-20, connected to the ESXi server via two Brocade switches running at 4Gbps.
If anyone has had similar issues, or has any idea what could cause such latencies, I would appreciate a little help :)
Regards,
Raul