Hi Manivel,
Your detailed breakdown helps a lot in understanding your current situation. Let's break this down a bit.
Memory Overcommitment: VMware ESXi allows physical memory to be overcommitted, so the sum of the configured memory of all VMs doesn't tell you how much is actually in use. The host-level usage figures reflect consumed memory, i.e. the host physical memory currently backing each VM, along with per-VM overhead and the hypervisor's own usage, not what the guests are actively touching. ESXi backs guest pages on demand as the guest first touches them, and it doesn't take them back until the host comes under memory pressure. So even if your VMs show only 5% of their configured memory as active, consumed can be far higher, for example when a Linux guest has filled its page cache at some point.
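To put rough numbers on the difference between configured, consumed and active memory, here's a small Python sketch. Every figure in it is made up for illustration (a 512 GB host and three VMs), so substitute the values you see in vCenter.

```python
# Hypothetical figures in GB; replace them with values from your own cluster.
HOST_PHYSICAL_GB = 512

# (name, configured, consumed, active) per VM -- all invented sample data
vms = [
    ("app01", 256, 180, 12),
    ("db01", 256, 240, 30),
    ("web01", 128, 60, 5),
]


def memory_summary(host_physical_gb, vms):
    """Sum the per-VM figures and derive a few ratios from them."""
    configured = sum(v[1] for v in vms)
    consumed = sum(v[2] for v in vms)
    active = sum(v[3] for v in vms)
    return {
        "configured_gb": configured,
        "consumed_gb": consumed,
        "active_gb": active,
        # > 1.0 means more memory is configured than the host physically has
        "overcommit_ratio": configured / host_physical_gb,
        "consumed_pct_of_host": 100 * consumed / host_physical_gb,
        "active_pct_of_configured": 100 * active / configured,
    }


summary = memory_summary(HOST_PHYSICAL_GB, vms)
for key, value in summary.items():
    print(f"{key}: {value:.2f}" if isinstance(value, float) else f"{key}: {value}")
```

With these invented numbers the host is only 1.25x overcommitted and active is around 7% of configured, yet consumed already sits near 94% of physical RAM, which is the sort of picture where the host may start reclaiming.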
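Memory Ballooning: If you're seeing high balloon values, that indicates contention, either now or at some point recently. Ballooning is a mechanism where the hypervisor asks the balloon driver inside the guest to allocate memory, so the guest gives up pages that the host can then reclaim. ESXi only does this when free memory on the host drops below its thresholds. One thing to double-check: if 'high' is the host memory state shown in esxtop, that state actually means the host has enough free memory; ballooning starts in the 'soft' state.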
Resource Pools & Expandable Reservation: The expandable reservation on the resource pool means that when the VMs or child pools inside it ask for more reserved memory than the pool itself has reserved, the pool can borrow unreserved capacity from its parent. This can lead to a pool holding more reserved memory than you might expect, but it governs reservations and admission control rather than how much memory the VMs actually consume. So I don't think this is the core of your problem if no reservation is set.
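As a rough illustration of how an expandable reservation behaves when a reservation is requested, here's a simplified Python model. The `Pool` class and all of its fields are hypothetical, and this is not how vCenter implements it; it only shows the idea that a request the pool can't cover from its own reservation gets passed up to the parent.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Pool:
    """Toy model of a resource pool; every name here is hypothetical."""
    name: str
    reservation_mb: int               # memory reserved for this pool
    expandable: bool = True
    parent: Optional["Pool"] = None
    used_mb: int = 0                  # reservation already handed out

    def unreserved_mb(self) -> int:
        return self.reservation_mb - self.used_mb

    def reserve(self, amount_mb: int) -> bool:
        """Try to admit a reservation, borrowing from ancestors if allowed."""
        local = min(amount_mb, self.unreserved_mb())
        shortfall = amount_mb - local
        if shortfall > 0:
            # Only an expandable pool with a parent may borrow the remainder.
            if not (self.expandable and self.parent):
                return False
            if not self.parent.reserve(shortfall):
                return False
        self.used_mb += local
        return True


root = Pool("cluster", reservation_mb=8192, expandable=False)
child = Pool("prod", reservation_mb=0, parent=root)
fixed = Pool("test", 1024, expandable=False, parent=root)

print(child.reserve(0))      # True: a VM with no reservation asks for nothing
print(child.reserve(2048))   # True: borrowed entirely from the parent
print(fixed.reserve(2048))   # False: pool is not expandable, 1024 MB short
```

In this toy model a request for zero reserved memory never reaches the parent at all, which is why the expandable flag matters little when no reservations are set.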
open-vm-tools vs VMware Tools: open-vm-tools is the recommended option for Linux VMs like RHEL, and it should handle memory management functions in much the same way as the original VMware Tools. I doubt this is the root of your problem, though it's always good to make sure the package is up to date.
vSAN Memory Usage: You mentioned vSAN is using 50 GB of memory per node, which is significant. Some of this is expected, since vSAN needs memory for things like caching, but it comes straight out of what the host has left for your VMs.
I'd recommend the following steps:
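Deep Dive with esxtop: Run esxtop and press m for the memory view, then look beyond the balloon size (MCTLSZ) at swap (SWCUR, SWR/s, SWW/s) and compression (ZIP/s, UNZIP/s). Sustained non-zero swap rates indicate serious memory contention.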
Check Allocated Memory: The 'Consumed' memory metric in vCenter tells you how much host physical memory is currently backing the VM, not just what's active. Overhead is tracked as its own counter, so look at both.
VM Memory Metrics: Dive deeper into each VM's memory usage in vCenter. Look for metrics like 'active', 'consumed', 'overhead', and 'shared'.
Cluster-Wide Settings: Ensure there aren't any cluster-wide memory settings or reservations that might be causing unexpected behaviors.
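To make the per-VM metric checks above a bit more concrete, here's a small Python sketch that flags VMs from a handful of counters. The field names are modeled on the quick stats that pyVmomi exposes under `vm.summary.quickStats` (values in MB), but treat the names, the sample data and the 10% threshold as assumptions on my part.

```python
def triage(vm_stats, active_ratio_warn=0.10):
    """Return (vm_name, finding) tuples from per-VM memory stats in MB."""
    findings = []
    for name, s in vm_stats.items():
        if s["swappedMemory"] > 0:
            findings.append((name, "swapped: has hit host-level swapping"))
        elif s["balloonedMemory"] > 0:
            findings.append((name, "ballooned: host is reclaiming memory"))
        elif (s["hostMemoryUsage"] > 0
              and s["guestMemoryUsage"] / s["hostMemoryUsage"] < active_ratio_warn):
            findings.append((name, "consumed far above active"))
    return findings


# Invented sample data; in practice these would come from vCenter.
sample = {
    "app01": {"guestMemoryUsage": 1200, "hostMemoryUsage": 60000,
              "balloonedMemory": 0, "swappedMemory": 0},
    "db01": {"guestMemoryUsage": 9000, "hostMemoryUsage": 50000,
             "balloonedMemory": 8192, "swappedMemory": 0},
    "web01": {"guestMemoryUsage": 3000, "hostMemoryUsage": 20000,
              "balloonedMemory": 2048, "swappedMemory": 512},
    "ok01": {"guestMemoryUsage": 4000, "hostMemoryUsage": 8000,
             "balloonedMemory": 0, "swappedMemory": 0},
}

for name, finding in triage(sample):
    print(f"{name}: {finding}")
```

Keep in mind that a non-zero swapped value can linger after the pressure has passed, so it's worth confirming against the live swap rates in esxtop.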
Lastly, consider reaching out to VMware support. They can provide deeper insights by analyzing your logs and metrics directly.
Hope this helps clarify things a bit!