Hi
I was interested in your scenario, so I tested it in my own lab, and I found some interesting things that I want to share with you. I hope this is helpful:
There are four columns in the ESXTOP output that we should consider when checking memory: NHN, NMIG, NRMEM, and NLMEM.
- NHN, or NUMA Home Node, is the NUMA node the NUMA scheduler has placed your VM on.
- NMIG is the number of NUMA migrations the NUMA scheduler has performed for the VM, both as part of its normal rebalancing and because of Action Affinity.
- NRMEM, or NUMA remote memory, is the amount of remote memory accessed by the VM.
- NLMEM, or NUMA local memory, is the amount of local memory accessed by the VM (the short sketch after this list shows how NRMEM and NLMEM combine into the percentages I use below).
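To make the percentages below concrete, here is a tiny Python sketch (the numbers are made up, not from my lab captures) that shows how NLMEM and NRMEM combine into the local share that esxtop reports as N%L:

```python
# Toy numbers, not values from my lab: how NLMEM and NRMEM (both in MB)
# turn into the local-memory share that esxtop also shows as N%L.
def local_percent(nlmem_mb: float, nrmem_mb: float) -> float:
    """Return the percentage of the VM's memory that sits on its home node."""
    total = nlmem_mb + nrmem_mb
    return 100.0 * nlmem_mb / total if total else 100.0

print(local_percent(1638.0, 6554.0))   # ~20% local, so ~80% remote (right after a migration)
print(local_percent(6554.0, 1638.0))   # ~80% local once most pages have moved home
```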
The behavior of the NUMA scheduler:
As soon as a VM powers on, the NUMA scheduler places it on a single NUMA node or spreads it across several (depending on the VM's configuration). But that is not the end of the story: whenever another NUMA node becomes a better home for that VM (because of free memory or CPU metrics), the NUMA scheduler migrates the VM there, and this happens all the time. The catch is that vCPUs migrate far faster than memory does, so right after such a move you will see that the amount of remote memory accessed by the VM, shown in NRMEM, is much larger than the amount in NLMEM; as you said, for example 80% remote, and then suddenly it changes to 20%.
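To picture why the remote share starts high and then drops, here is a toy Python model; the 8 GB VM size and the per-interval migration rate are invented for illustration and say nothing about how ESXi actually paces page migration:

```python
# A toy model, not ESXi internals: the moment the vCPUs get a new home node,
# almost every page is still on the old node, so the remote share is high;
# each sample interval the memory scheduler moves some pages over, so the
# share falls. The 8 GB size and 1 GB-per-interval rate are invented.
total_mb = 8192.0
remote_mb = 8192.0            # everything is remote right after the vCPUs move
migrate_per_tick = 1024.0     # assumed page-migration rate per sample interval

for tick in range(9):
    pct_remote = 100.0 * remote_mb / total_mb
    print(f"sample {tick}: NRMEM ~{remote_mb:6.0f} MB ({pct_remote:5.1f}% remote)")
    remote_mb = max(0.0, remote_mb - migrate_per_tick)
```

The point is only the shape: NRMEM spikes the moment the vCPUs move and then shrinks as the memory scheduler catches up.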
The behavior of the memory scheduler:
When the memory scheduler has to reach for remote memory (because the local node is short on memory, or for other reasons), the NUMA scheduler decides whether or not to migrate the whole VM to that node. So even though your VM fits into a single NUMA node, you will see a mix of remote and local memory access, for example 20% remote and 80% local.
What the Action Affinity feature is:
In your virtualization environment there are two places a vCPU can get its data from: the CPU's L3 cache and main memory. Because the latency difference between the two is so large, VMware always treats the L3 cache as the better place for vCPUs to access data. So when two vCPUs (for example, one from VM1 and one from VM2) share the same data or communicate heavily with each other, the NUMA scheduler decides to place them close together so they can both hit the same L3 cache; the memory of one of them may then sit on another NUMA node, which means remote memory access (even though your VM would fit into one NUMA node). This can cause CPU contention, but as VMware says, the NUMA scheduler can handle that contention (see KB 2097369).
You can watch this with the NMIG counter, but it happens fast, so you would have to keep your eyes on the screen; see the batch-mode sketch below for a less painful way.
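If you would rather not stare at the screen, you can record esxtop in batch mode and look at the capture afterwards. Here is a rough Python sketch of that, assuming the capture was taken with all counters exported (esxtop -b -a -d 5 -n 60 > numa.csv); the VM name and the "NUMA" substring filter are placeholders you will likely need to adjust, since batch-mode column names differ from the interactive NMIG/NRMEM labels:

```python
# Reads an esxtop batch capture (for example: esxtop -b -a -d 5 -n 60 > numa.csv)
# and prints the columns that look NUMA related for one VM, one row per sample.
# The "NUMA" substring filter and the VM name are assumptions: batch-mode column
# names are longer than the interactive NMIG/NRMEM labels, so adjust as needed.
import csv
import sys

capture = sys.argv[1] if len(sys.argv) > 1 else "numa.csv"
vm_name = sys.argv[2] if len(sys.argv) > 2 else "MyVM"      # placeholder VM name

with open(capture, newline="") as f:
    reader = csv.reader(f)
    header = next(reader)
    # Column 0 is the sample timestamp; keep it plus anything NUMA-ish for this VM.
    wanted = [0] + [i for i, name in enumerate(header)
                    if vm_name in name and "NUMA" in name.upper()]
    print([header[i] for i in wanted])
    for row in reader:
        print([row[i] for i in wanted])
```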
With Action Affinity you may see that most of your VMs share the same NHN. That does not mean one of your NUMA nodes is overloaded while the other sits free; it is expected behavior and it generally improves performance, even if such a concentrated placement causes non-negligible ready time.
If the increased contention is hurting performance, you can turn Action Affinity off by setting NUMA.LocalityWeightActionAffinity to 0 in the host's advanced configuration. Be aware that this affects every workload on the host, so be careful.
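For reference, here is a minimal Python sketch of that change, assuming it runs in the ESXi Shell (or that you send the same esxcli commands over SSH); the two esxcli calls are the whole story, the Python wrapper is only for illustration:

```python
# A minimal sketch, assuming it runs in the ESXi Shell (which ships with a Python
# interpreter); you can just as well type the two esxcli commands by hand or send
# them over SSH. /Numa/LocalityWeightActionAffinity is the esxcli path of the
# advanced setting mentioned above.
import subprocess

OPTION = "/Numa/LocalityWeightActionAffinity"

# Print the current value first so you can roll back later (the default is 1).
subprocess.run(["esxcli", "system", "settings", "advanced", "list", "-o", OPTION],
               check=True)

# Set it to 0 to disable action affinity for the whole host.
subprocess.run(["esxcli", "system", "settings", "advanced", "set", "-o", OPTION,
                "-i", "0"], check=True)
```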
Conclusion:
Accessing remote memory is normal behavior for the NUMA scheduler; I ran several different scenarios and saw the same thing each time. The NUMA scheduler will keep trying to fit your VM onto the best NUMA node, which means VM migration between NUMA nodes, and as I said before, vCPUs migrate much faster than memory, so for a while you see a large amount of remote memory access.
You can check out screenshots of my different scenarios at this link.