I've been looking at esxtop stats lately to try to understand why VMs sometimes feel sluggish. Before I tackle the storage side of things, I figured I'd start with the hosts. I've been reading a lot, but I'm still having a hard time understanding it, so hopefully someone can help me.
In watching esxtop, I saw a high %VMWAIT on a VM so I expanded the VM and saw this:
ID        GID       NAME             NWLD   %USED    %RUN    %SYS   %WAIT %VMWAIT    %RDY   %IDLE  %OVRLP   %CSTP  %MLMTD  %SWPWT
30289238  57794167  vmx                 1    0.81    0.06    0.75   96.40       -    0.01    0.00    0.00    0.00    0.00    0.00
30289240  57794167  vmast.30289239      1    0.02    0.02    0.00   96.45       -    0.00    0.00    0.00    0.00    0.00    0.00
30289241  57794167  vmx-vthread-4:V     1    0.00    0.00    0.00   96.48       -    0.00    0.00    0.00    0.00    0.00    0.00
30289242  57794167  vmx-vthread-5:V     1    0.00    0.00    0.00   96.48       -    0.00    0.00    0.00    0.00    0.00    0.00
30289243  57794167  vmx-mks:VM863_W     1    0.00    0.00    0.00   96.47       -    0.00    0.00    0.00    0.00    0.00    0.00
30289244  57794167  vmx-svga:VM863_     1    0.20    0.22    0.00   96.17       -    0.09    0.00    0.00    0.00    0.00    0.00
30289245  57794167  vmx-vcpu-0:VM86     1    9.31    9.31    0.00   86.69   29.53    0.48   57.16    0.23    0.00    0.00    0.00
Can someone tell me what %VMWAIT is telling me in this case? The VM looks like it is doing nothing, so what is it "waiting" for? The value drops back to zero after a few refreshes but seems to come back a little later.
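As a rough cross-check of the vcpu row above, here is a minimal Python sketch that parses the header and that one row and compares %WAIT minus %IDLE against %VMWAIT. The helper name parse_row is made up for illustration, and reading %VMWAIT as "wait time excluding idle" is an assumption suggested by these numbers, not something this output confirms on its own.

```python
# Header and vCPU row copied from the esxtop output above.
HEADER = ("ID GID NAME NWLD %USED %RUN %SYS %WAIT %VMWAIT "
          "%RDY %IDLE %OVRLP %CSTP %MLMTD %SWPWT")
VCPU_ROW = ("30289245 57794167 vmx-vcpu-0:VM86 1 9.31 9.31 0.00 86.69 "
            "29.53 0.48 57.16 0.23 0.00 0.00 0.00")


def parse_row(header, row):
    """Hypothetical helper: map column names to values.

    Numeric-looking fields become floats; anything else (the world
    name, or "-" where esxtop shows no value) stays a string.
    """
    fields = {}
    for name, raw in zip(header.split(), row.split()):
        try:
            fields[name] = float(raw)
        except ValueError:
            fields[name] = raw
    return fields


vcpu = parse_row(HEADER, VCPU_ROW)

# Assumption being tested: %VMWAIT is the part of %WAIT that is not idle time.
non_idle_wait = round(vcpu["%WAIT"] - vcpu["%IDLE"], 2)
print("%WAIT - %IDLE =", non_idle_wait)
print("%VMWAIT       =", vcpu["%VMWAIT"])
```

For this sample both values come out to 29.53, so at least here the vCPU's %WAIT of 86.69 splits into 57.16 idle and 29.53 of some other kind of waiting.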
Another thing I see is that sometimes %USED > %RUN and sometimes %RUN > %USED, and I'm not sure what that means. Is %USED what is physically used on the processor and %RUN what the VM is actually trying to use? If so, does %RUN > %USED mean I have a resource problem?
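To make the %USED versus %RUN comparison concrete, here is a small Python sketch using the non-idle worlds from the output above. It assumes the approximation %USED ~ %RUN + %SYS - %OVRLP that is commonly quoted in write-ups on interpreting esxtop. That formula ignores effects such as CPU frequency scaling and hyper-threading, so treat it as an assumption rather than an exact definition; the function name used_estimate is made up for illustration.

```python
# (world name, %USED, %RUN, %SYS, %OVRLP) copied from the esxtop output above.
WORLDS = [
    ("vmx",             0.81, 0.06, 0.75, 0.00),
    ("vmast.30289239",  0.02, 0.02, 0.00, 0.00),
    ("vmx-svga:VM863_", 0.20, 0.22, 0.00, 0.00),
    ("vmx-vcpu-0:VM86", 9.31, 9.31, 0.00, 0.23),
]


def used_estimate(run, sys_pct, ovrlp):
    """Assumed approximation: %USED ~ %RUN + %SYS - %OVRLP."""
    return run + sys_pct - ovrlp


for name, used, run, sys_pct, ovrlp in WORLDS:
    est = used_estimate(run, sys_pct, ovrlp)
    # A gap near zero means the approximation explains %USED for that world.
    print(f"{name:<16} used={used:5.2f} estimate={est:5.2f} gap={used - est:+.2f}")
```

In this sample the vmx world's %USED of 0.81 is well above its %RUN of 0.06, and the difference is exactly its %SYS of 0.75. The other worlds only match to within a few tenths of a point, which is why the formula is presented here as an approximation.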
Example:
On a host with 30% CPU used (as reported by both the vSphere Client and esxtop), %USED > %RUN. Sometimes things flip and %RUN > %USED for a few seconds, but generally the values are within 10 points of each other, with the VM using the most CPU being closer to 5-10.
On a 2nd host with 46% reported in the vSphere Client, the load averages in esxtop are more like 0.8 and %RUN > %USED (some VMs have a %RUN twice as high as their %USED).
Why is there such a difference between the GUI and esxtop in regard to load on the 2nd host? Is the 2nd host overloaded because %RUN > %USED for almost all VMs? %RDY is anywhere from 1 to 10%. I think this is telling me that even though I seem to have CPU speed to spare, I have too many VMs on the 2nd host, and scheduling them all means none of them is getting enough CPU time. Is that right?
Thanks