I understand your frustration ... I am also not satisfied with some parts of the vSphere documentation in terms of technical depth; sometimes it leaves many questions open. (For me this is the one place where MS beats VMware... :)
Quite often it tells you A but not B ... it seems to me it's a business decision.
If I put your info together correctly, you have:
AES-DynSQL - CRM database, 80 GB RAM, 350 GB disk capacity, 2 vCPUs averaging 65% to 68% CPU usage with a few spikes to 100%; CPU latency = 10%, max. 27%; SAN latency 2-7 ms
This VM is running on a dual-socket hex-core host with HT enabled and 192 GB of RAM (2.5 GHz Westmere processors).
vSAN-50-2 - Windows iSCSI target server, 1 vCPU, averaging 40% CPU usage, CPU latency from 60 to 88%, SAN latency 2-7 ms
This VM is running on a quad-socket, 10-core (Intel(R) Xeon(R) CPU E7-4870 @ 2.40GHz) host with HT enabled and 512 GB RAM.
Back to the CPU latency discussion - from the ESXTOP metrics you posted, this is my assessment for the vSAN VM:
(It's just my understanding, based on my experience and the public vSphere documentation.)
vSAN-50-2 ESXTOP metrics:
%SYS (yours 75%) - shows how busy the vmkernel services are (on behalf of the world) satisfying the needs of the VM/worlds; it's typical for high-I/O VMs.
%WAIT (58.1%) vs %IDLE (56.4%) - the difference between these two metrics (%VMWAIT; yours is 1.7%) is the time the VM spent in a blocked state waiting for some events to complete, usually waiting for I/O.
High %VMWAIT can be caused by poor storage performance or by high latency of a pass-through device configured for the VM.
%OVRLP (yours is zero) - time the VM spent "in the system queue" while the vmkernel serviced another VM/world ... (i.e. time the VM was not scheduled because other worlds were being serviced).
High %OVRLP is usually an indication that the host is experiencing high I/O.
You also have zero for %CSTP, %MLMTD, %SWPWT and %RDY.
Based on that I am really convinced that your VMs are not constrained by I/O, but to prove this we should also look at the GAVG or DAVG/KAVG metrics.
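If you want to check those quickly, here is a minimal sketch (Python) that pulls the storage latency counters out of an esxtop batch export, e.g. created with esxtop -b -d 5 -n 60 > esxtop.csv. The file name, the column labels ("Average Guest/Kernel/Device MilliSec/Command") and the 20 ms threshold are just my assumptions - adjust them to what your export actually contains.

import csv

# Counter labels as I remember esxtop names GAVG/KAVG/DAVG in batch mode;
# treat them as assumptions and adjust if your export differs.
WANTED = ("Average Guest MilliSec/Command",
          "Average Kernel MilliSec/Command",
          "Average Device MilliSec/Command")

with open("esxtop.csv", newline="") as f:          # esxtop -b -d 5 -n 60 > esxtop.csv
    reader = csv.reader(f)
    header = next(reader)
    cols = [i for i, name in enumerate(header) if any(w in name for w in WANTED)]
    for row in reader:
        for i in cols:
            if row[i] and float(row[i]) > 20.0:    # flag samples above ~20 ms
                print(header[i], "->", row[i], "ms")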
I have found quite a good explanation of what CPU latency means (someone posted an answer from VMware support), see:
***********************************************************************************************************************************************************************************************
CPU latency rises when the VM cannot run on the best core (where it was running), while ready time rises when none of the cores on the entire motherboard is available.
[ CPU latency includes: ready, cstp, HT busy time and the effects of dynamic voltage/frequency scaling, and doesn't include mlmtd ]
***********************************************************************************************************************************************************************************************
My understanding, if the above is true, is that CPU latency (%LAT_C) PRECEDES ready time (%RDY) and can be seen when the scheduler is not able to schedule the VM optimally.
So from that point of view, existing CPU latency is not a "critical" performance problem until you are HIT by excessive ready time (%RDY).
For that reason ready time is always referred to as a key metric for performance troubleshooting, and it's worth keeping an eye on it.
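Just to illustrate how I read your numbers - and this is only a rough sketch based on the quoted description, not an official VMware formula: if %RDY, %CSTP and %MLMTD are all zero, whatever is left of %LAT_C must come from the remaining ingredients, i.e. HT busy time and power management effects.

# Illustrative only: split %LAT_C according to the quoted description
# (ready + co-stop + HT busy + frequency-scaling effects, mlmtd excluded).
def unexplained_latency(lat_c, rdy, cstp):
    """Share of %LAT_C not explained by %RDY and %CSTP; per the quote,
    what remains should be HT contention plus voltage/frequency scaling."""
    return max(lat_c - rdy - cstp, 0.0)

# vSAN-50-2 numbers from this thread: %LAT_C up to 88%, %RDY = 0, %CSTP = 0
print(unexplained_latency(lat_c=88.0, rdy=0.0, cstp=0.0))   # -> 88.0
# Practically all of the latency points to HT sharing / power management,
# not to a shortage of schedulable pCPUs (that would show up as %RDY).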
CPU latency causes which come to my mind:
- Extensive switching/migrations between logical CPUs (core sharing) - HT busy?! ***
- Dynamic voltage/frequency scaling (BIOS CPU power management and CPU sleep states); Intel Turbo Boost
- VM scheduled outside of its preferred NUMA node boundary?! This could be verified through the N%L memory metric, which is the current percentage of memory accessed by the VM that is local (see the sketch after this list).
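As a quick sanity check for the NUMA point above: a VM stays inside one node only if both its vCPU count and its memory fit there. A rough sketch, assuming a balanced DIMM population (so roughly 96 GB per node on your 192 GB dual-socket hex-core host):

# Rough NUMA-fit check: a VM needing more vCPUs or RAM than one node offers
# will span nodes and typically shows a lower N%L in the esxtop memory view.
def fits_in_one_node(vcpus, vm_ram_gb, cores_per_node, ram_per_node_gb):
    return vcpus <= cores_per_node and vm_ram_gb <= ram_per_node_gb

# AES-DynSQL: 2 vCPUs / 80 GB on the dual-socket hex-core host with 192 GB
# (~6 cores and ~96 GB per node, assuming balanced DIMM population)
print(fits_in_one_node(vcpus=2, vm_ram_gb=80, cores_per_node=6, ram_per_node_gb=96))
# -> True, so NUMA spanning is unlikely to be the cause here; if N%L were low
#    anyway, I would look at memory placement/affinity settings instead.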
>But it's only my personal interpretation, which could be totally wrong. I would also appreciate it if some VMware staff/guru could post a comment regarding the %LAT_C counter...<
--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
*** vSphere documentation:
The ESXi CPU scheduler can interpret processor topology, including the relationship between sockets, cores, and logical processors. The scheduler uses topology
information to optimize the placement of virtual CPUs onto different sockets to maximize overall cache utilization, and to improve cache affinity by minimizing virtual CPU
migrations. Virtual machines are preferentially scheduled on two different cores rather than on two logical processors on the same core.
--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
Anyway, I would try to handle power management differently. Instead of using the (OS Control) option I always choose Max. Performance (on Dell RXXX hosts), so the hypervisor cannot interfere with BIOS PM. You will then see in the host PM settings under Active Policy: Not Supported.
The assumption is that the hardware always knows best how to manage its power policies, and offloading another task from the vmkernel is always a good thing.
Also, using Intel HT technology for SQL databases (unless it is fully supported) should always be tested. So you could try how your SQL workloads behave without HT (a quick check of the current host settings is sketched below, after the links)...
e.g. for MS Dynamics AX 2012 it is recommended to disable hyperthreading for the SQL server.
http://technet.microsoft.com/en-us/library/dd309734.aspx
In some special cases/workloads, turning off Intel Turbo Boost also brings some gains...
For more info see:
http://www.vmware.com/files/pdf/techpaper/VMW-Tuning-Latency-Sensitive-Workloads.pdf
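Before changing anything in the BIOS you can quickly check what each host currently reports for the active power policy and hyperthreading. A small pyVmomi sketch - the vCenter name and credentials are placeholders, and the property paths (cpuPowerManagementInfo, hyperThread) are how I remember them, so verify against your vSphere version:

import ssl
from pyVim.connect import SmartConnect, Disconnect
from pyVmomi import vim

ctx = ssl._create_unverified_context()           # lab use only, skips cert checks
si = SmartConnect(host="vcenter.example.local",  # placeholder vCenter / credentials
                  user="administrator@vsphere.local", pwd="***", sslContext=ctx)

view = si.content.viewManager.CreateContainerView(
    si.content.rootFolder, [vim.HostSystem], True)
for host in view.view:
    pm = host.hardware.cpuPowerManagementInfo    # active power policy (may be None)
    ht = host.config.hyperThread                 # hyperthreading state
    print(host.name,
          "| power policy:", pm.currentPolicy if pm else "n/a",
          "| HT active:", ht.active if ht else "n/a")
Disconnect(si)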
"At the time of the second screenshots, the host was running 65 VMs with a total of 104 vCPUs. With 80, logical CPUs, the ration of vCPU to logical CPU is only
1.3 to 1. Even vCPU to core ration is less than 2.6 to 1."
In that case I would take into account only physical cores, because hyperthreading delivers only about a 30% performance gain at best (mostly lower, between 10-15%). So although a vCPU to pCPU/core ratio of 2.6 could be absolutely sufficient for some workloads, it can be the edge for others... You can try to lower the vCPU count on some VMs on that host or spread the VMs across other hosts (see the quick calculation below).
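For completeness, the arithmetic behind those ratios, with the numbers taken from your quote (104 vCPUs on 80 logical CPUs, i.e. 40 physical cores with HT):

# Consolidation ratios for the quoted host: 65 VMs, 104 vCPUs,
# 40 physical cores = 80 logical CPUs with HT enabled.
vcpus, logical_cpus, physical_cores = 104, 80, 40

print(round(vcpus / logical_cpus, 1))    # 1.3 vCPUs per logical CPU
print(round(vcpus / physical_cores, 1))  # 2.6 vCPUs per physical core
# Counting only physical cores (HT adds roughly 10-30% at best), 2.6:1 can
# already be the edge for latency-sensitive workloads on that host.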
Hope this helps you make some progress.
Regards,
P.