You originally wrote "one virtual socket with 16 cores", which was quite confusing under the circumstances. Your screenshot now shows "16 virtual sockets with 1 core each".
With this, the vNUMA layout of the VM is already optimal, as we can see in the esxtop memory screenshot:
The VM resides on NUMA home nodes 0 and 1, with 32GB of memory on each node, and memory locality is 100% local. This means there is no remote memory access that could impact memory throughput or latency (as long as the guest OS is also aware of the NUMA layout, which it is, as shown in your in-guest screenshot).
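For anyone else wanting to check this on their own VMs: in esxtop, press "m" for the memory view and "f" to enable the NUMA STATS fields. If I remember the columns correctly, the ones to look at are NHN (NUMA home node), NLMEM/NRMEM (local/remote memory in MB) and N%L (percentage of memory that is local). An illustrative line for a layout like yours would look roughly like this (the VM name and numbers below are made up to match the 2 x 32GB case, not taken from your screenshot):

    NAME       NHN   NMIG   NRMEM   NLMEM   N%L
    myVM       0,1   0      0       65536   100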
I'm not sure whether the preferHT parameter will actually help overall performance in this case, because the memory layout is already optimal. It might improve the CPU cache hit rate, but the VM would also be confined to one socket's CPU resources (8 physical cores plus 8 HT threads) instead of the 16 physical cores across both sockets it uses now.
What is PreferHT and When To Use It | VMware vSphere Blog - VMware Blogs
It is important to recognize though that by using this setting, you are telling vSphere you’d rather have access to processor cache and NUMA memory locality as priority, over the additional compute cycles. So there is a trade-off.
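In case you do want to test preferHT at some point, it can be enabled per VM via an advanced configuration parameter (with the VM powered off), or host-wide on the ESXi host. If I remember the parameter names correctly (please double-check against the blog post above and the related KB article), it looks like this:

    Per VM (.vmx / Configuration Parameters):
    numa.vcpu.preferHT = "TRUE"

    Host-wide (Advanced System Settings):
    Numa.PreferHT = 1

The per-VM setting is usually the safer test, since it only changes scheduling for that one VM and is easy to revert.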
Apart from this, your initial question wasn't about performance issues or improvements, but about a difference in measurement values between the VM and host stats. As mentioned earlier, the %usage differs because the host includes HT threads (32*100% vs 16*100%), so compare the advanced performance charts in the vSphere Client for total MHz usage instead. I'm not sure why vCOPS (or whichever tool you're using) reports more MHz than your physical host has, since MHz capacity counts physical cores only, not HT threads.
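To make the %usage vs MHz point concrete, here's the back-of-the-envelope math, assuming a 2.6 GHz clock purely for the sake of the example (substitute your actual CPU speed):

    Host MHz capacity  = 16 physical cores * 2600 MHz = 41600 MHz
    Host %usage scale  = 32 logical threads * 100%    = 3200%
    VM %usage scale    = 16 vCPUs * 100%              = 1600%

The percentage scales differ by a factor of two because of HT, while the MHz capacity ignores HT threads entirely, which is why the MHz charts are the better apples-to-apples comparison between VM and host.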