ESXi

  • 1.  help interpreting esxtop vm cpu values

    Posted Mar 22, 2015 06:36 PM

    I've been looking at esxtop stats lately to try and understand why VMs sometimes seem to feel sluggish. Before I go on to tackle the storage side of things, I figured I'd start with the hosts. I've been reading a bunch of stuff but I'm still having a hard time understanding so hopefully someone can help me.

    In watching esxtop, I saw a high %VMWAIT on a VM so I expanded the VM and saw this:

          ID      GID NAME             NWLD   %USED    %RUN    %SYS   %WAIT %VMWAIT    %RDY   %IDLE  %OVRLP   %CSTP  %MLMTD  %SWPWT
    30289238 57794167 vmx                 1    0.81    0.06    0.75   96.40       -    0.01    0.00    0.00    0.00    0.00    0.00
    30289240 57794167 vmast.30289239      1    0.02    0.02    0.00   96.45       -    0.00    0.00    0.00    0.00    0.00    0.00
    30289241 57794167 vmx-vthread-4:V     1    0.00    0.00    0.00   96.48       -    0.00    0.00    0.00    0.00    0.00    0.00
    30289242 57794167 vmx-vthread-5:V     1    0.00    0.00    0.00   96.48       -    0.00    0.00    0.00    0.00    0.00    0.00
    30289243 57794167 vmx-mks:VM863_W     1    0.00    0.00    0.00   96.47       -    0.00    0.00    0.00    0.00    0.00    0.00
    30289244 57794167 vmx-svga:VM863_     1    0.20    0.22    0.00   96.17       -    0.09    0.00    0.00    0.00    0.00    0.00
    30289245 57794167 vmx-vcpu-0:VM86     1    9.31    9.31    0.00   86.69   29.53    0.48   57.16    0.23    0.00    0.00    0.00


    Can someone tell me what %VMWAIT is telling me in this case? It looks like the VM is doing nothing so what is the vm "waiting" for? The value goes back down to zero after a few refreshes but seems to come back a little later.

    Another thing I see is %USED > %RUN and sometimes %RUN > %USED, and I'm not sure what that means. Is %USED what is physically used on the processor and %RUN what the VM is actually trying to use? If so, does that mean that when %RUN > %USED, I have a resource problem?

    Example:

    On a host with 30% CPU used (as reported by the vSphere client and esxtop), %USED > %RUN. Sometimes I see things flip and %RUN > %USED for a few seconds, but generally the values are within 10 points of each other, with the VM using the most CPU being closer to 5-10.

    On a second host with 46% reported in the vSphere client, the load averages in esxtop are more like 0.8 and %RUN > %USED (some VMs have %RUN twice as high as %USED).

    Why is there such a difference between the GUI and esxtop with regard to load for the second host? Is the second host overloaded because %RUN > %USED for almost all VMs? %RDY is anywhere from 1 to 10%. I think this is telling me that even though I seem to have CPU speed to spare, I have too many VMs on the second host, and trying to schedule them all means none of them is getting enough CPU time. Is that right?

    Thanks



  • 2.  RE: help interpreting esxtop vm cpu values

    Posted Mar 22, 2015 07:57 PM

    Do not rely on the GUI for real-time stats. Esxtop will always be in real time.

    You are also seeing %OVRLP.

    I would look at I/O bottlenecks. Is that VM on different storage or requesting a lot of I/O?



    Info on esxtop below.  Hope some of this helps.

    http://www.yellow-bricks.com/esxtop/

    Interpreting esxtop 4.1 Statistics

    http://www.running-system.com/wp-content/uploads/2012/08/esxtop_english_v11.pdf

    "Another thing I see is %USED > %RUN and sometimes %RUN > %USED and I'm not sure what that means. Is %USED what is physically used on processor and %RUN what the VM is actually trying to use? If so, does that mean that when %RUN > %USED , I have a resource problem?"


    From the esxtop bible.

    • "%USED"

    The percentage physical CPU time accounted to the world. If a system service runs on behalf of this world, the time spent by that service (i.e. %SYS) should be charged to this world. If not, the time spent (i.e. %OVRLP) should not be charged against this world. See notes on %SYS and %OVRLP.

    %USED = %RUN + %SYS - %OVRLP

    • "%RUN"

    The percentage of total scheduled time for the world to run.

    Q: What is the difference between %USED and %RUN?

    A: %USED = %RUN + %SYS - %OVRLP. (%USED takes care of the system service time.) Details above.

    Q: What does it mean if %RUN of a VM is high?

    A: The VM is using lots of CPU resource. It does not necessarily mean the VM is under resource constraint. Check the description of %RDY below, for determining CPU contention.

    • "%WAIT"

    The percentage of time the world spent in wait state.

    This %WAIT is the total wait time, i.e., the time the world is waiting for some VMkernel resource. This wait time includes I/O wait time, idle time, and time spent waiting on other resources. Idle time is presented as %IDLE.

    --------------

    "%VMWAIT" does not take into account idle time.
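    As a rough illustration of the relationship described above, the vmx-vcpu-0 row from the first post can be checked by hand: subtracting %IDLE from %WAIT gives the reported %VMWAIT for that world. The sketch below only reproduces that arithmetic on the posted numbers. The variable names are made up for the example, and it is an assumption that the same subtraction holds for other worlds or for group-level rows.

```python
# Values copied from the vmx-vcpu-0 row in the first post.
wait_pct = 86.69     # %WAIT: total wait time, idle included
idle_pct = 57.16     # %IDLE: the idle part of that wait
vmwait_pct = 29.53   # %VMWAIT as reported by esxtop

# Wait time with idle time taken out (assumed relationship, see lead-in).
non_idle_wait = round(wait_pct - idle_pct, 2)

print(f"%WAIT - %IDLE = {non_idle_wait:.2f}, reported %VMWAIT = {vmwait_pct:.2f}")
```

    For this one sample the two figures agree, which fits the reading that %VMWAIT is wait time that is not simply the vCPU sitting idle.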





  • 3.  RE: help interpreting esxtop vm cpu values

    Posted Mar 22, 2015 08:57 PM

    I've read all that stuff multiple times, but I still have a hard time grasping everything. I figured maybe someone might be able to explain it in a slightly different way.

    Anyway, the %USED formula confuses me because the numbers never add up. Taking another VM as an example (because the %VMWAIT on the original doesn't seem to be going high anymore):

    %USED: 120.70

    %RUN: 190.43

    %SYS: 1.07

    %WAIT: 619.36

    %VMWAIT: 6.15

    %RDY: 0.44

    %IDLE: 5.75

    %OVRLP: 0.25

    %CSTP: 0.05

    The used/run numbers are high, but I don't think that's a problem because the VM is actually busy doing work. The problem is that %USED = %RUN + %SYS - %OVRLP doesn't add up in this example. What am I missing?
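    To make the mismatch concrete, here is a minimal sketch that plugs the posted values into the quoted formula and compares the result with the reported %USED. The helper name `expected_used` is hypothetical, and the vmx row from the first post is included only for contrast. The sketch shows the size of the gap; it does not explain where the gap comes from.

```python
def expected_used(run_pct, sys_pct, ovrlp_pct):
    """%USED according to the quoted formula: %RUN + %SYS - %OVRLP."""
    return run_pct + sys_pct - ovrlp_pct

# Busy VM from this post (group-level values).
reported_used = 120.70
predicted_used = expected_used(run_pct=190.43, sys_pct=1.07, ovrlp_pct=0.25)
gap = predicted_used - reported_used

# vmx world from the first post, where the formula does match.
vmx_reported = 0.81
vmx_predicted = expected_used(run_pct=0.06, sys_pct=0.75, ovrlp_pct=0.00)

print(f"busy VM: predicted {predicted_used:.2f}, reported {reported_used:.2f}, gap {gap:.2f}")
print(f"vmx:     predicted {vmx_predicted:.2f}, reported {vmx_reported:.2f}")
```

    On these numbers the formula holds for the nearly idle vmx world but overshoots the busy VM by roughly 70 points, so the discrepancy is far larger than rounding or sampling noise would produce.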

    As for the IO potentially causing VMWAIT to be high, this is what I am seeing for that VM:

    CMDS/s: 72.83

    READS/s: 0.35

    WRITES/s: 72.48

    MBREAD/s: 0.02

    MBWRITE/s: 2.23

    Lat/rd: 3.98

    Lat/wr: 1.46
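    A quick sanity check on the figures above, as a sketch only: reads plus writes should equal the command rate, and dividing write throughput by write rate gives an approximate average write size. The variable names are invented for the example, and treating 1 MB as 1024 KB is an assumption about how esxtop reports MBWRITE/s.

```python
# Disk counters copied from the post above.
cmds_per_s = 72.83
reads_per_s = 0.35
writes_per_s = 72.48
mb_written_per_s = 2.23

# Reads and writes together should account for all commands.
io_per_s = reads_per_s + writes_per_s

# Approximate average write size, assuming 1 MB = 1024 KB.
avg_write_kb = mb_written_per_s * 1024 / writes_per_s

print(f"reads + writes = {io_per_s:.2f} (CMDS/s = {cmds_per_s:.2f})")
print(f"average write size is about {avg_write_kb:.1f} KB")
```

    The counters are internally consistent and describe a light, almost entirely write-based load of small I/Os.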

    We have a couple of EqualLogic SANs (not grouped together) carved into 1.5TB LUNs and presented to the hosts. There are roughly 400 VMs in total, and we end up with 20-30 VMs per LUN.