Personally, I ignore #2. Intel VT-x has matured to the point that both the frequency of VM exits (a VM having to drop back to the hypervisor, similar to an ordinary process context-switching into the OS) and the cost of each transition (clock cycles spent) have come down. Guest RAM is also now mapped by the CPU itself (hardware-assisted paging) rather than through VMware's software shadow page tables. And if you go back to the mid 2000s, when multicore CPUs were not so common and guest RAM was still managed in software, you HAD to run multiple VMs anyway, so that advice would have been useless; sure, it could be as slow as molasses, but what choice did you have?
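If you want to confirm that your host actually has those hardware assists, here is a minimal sketch, assuming a Linux host with an Intel CPU (on an ESXi or Windows host you'd pull the same info from the vendor's own tools instead):

```python
# Minimal sketch (Linux host, Intel CPU assumed): check whether the CPU
# advertises hardware virtualization (vmx) and hardware-assisted paging (ept).
def cpu_flags(path="/proc/cpuinfo"):
    with open(path) as f:
        for line in f:
            if line.startswith("flags"):
                return set(line.split(":", 1)[1].split())
    return set()

flags = cpu_flags()
print("VT-x (vmx):", "vmx" in flags)
print("EPT (hardware page tables):", "ept" in flags)
```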
And if Hyper-V is also in the picture (e.g. the Windows host has Hyper-V/VBS enabled), things will be slower and the chances of a VM transition go up, because the VMware layer then has to run on top of the Hyper-V API instead of talking to the hardware directly.
It really boils down to the workload of the VM(s) and the host.
Anyway, the 16 vCPU limit for your R630 is about RAM locality (NUMA) and L1/L2/L3 cache. L1/L2 live in each core and are shared by its two hyperthreads; L3 is shared by all the cores in one socket. Once a process (whether on the host or in a VM) has to cross over to the other CPU, the L1/L2/L3 cache it has built up is no longer available, so it has to go back out to RAM, and that RAM is now attached to the other socket.
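You can see that boundary directly on a Linux host; here's a small sketch (Linux sysfs assumed) that lists each NUMA node and the logical CPUs attached to it, which tells you how many vCPUs a VM can use before it has to span sockets:

```python
# Minimal sketch (Linux host assumed): list each NUMA node and its logical
# CPUs so you can see where a VM would start spanning sockets.
import glob, os

for node in sorted(glob.glob("/sys/devices/system/node/node[0-9]*")):
    with open(os.path.join(node, "cpulist")) as f:
        cpus = f.read().strip()
    print(f"{os.path.basename(node)}: CPUs {cpus}")
```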
For example, you don't want a situation where you assign 24 vCPUs to a database/application server VM, it pulls a large amount of data into RAM on CPU0, then gets scheduled onto CPU1 for the sorting phase and has to ask CPU0 for the data it already retrieved. CPU0 now keeps getting interrupted to serve data from its local RAM to CPU1 and can't get on with work for other processes (whether on the host or in a VM). This is a simplistic illustration, but it shows the extra overhead once the RAM locality and L1/L2/L3 cache advantages are lost.
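If you want to feel that penalty for yourself, here is a rough toy experiment, assuming a Linux box with numactl installed and at least two NUMA nodes (node numbers 0 and 1 are assumptions); it runs the same memory-heavy snippet with RAM bound to the local node and then to the remote node:

```python
# Rough illustration (Linux with numactl assumed): time the same toy workload
# with memory bound to the local node vs the remote node. The remote run pays
# the cross-socket penalty described above. Timings include interpreter startup,
# so treat the numbers as a rough comparison only.
import subprocess, time

WORKLOAD = "import array; a = array.array('d', range(20_000_000)); print(sum(a))"

def run(cpu_node, mem_node):
    start = time.perf_counter()
    subprocess.run(
        ["numactl", f"--cpunodebind={cpu_node}", f"--membind={mem_node}",
         "python3", "-c", WORKLOAD],
        check=True, stdout=subprocess.DEVNULL)
    return time.perf_counter() - start

print("local  memory:", run(0, 0), "s")
print("remote memory:", run(0, 1), "s")
```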
If the two CPUs were 20c/40t each, you'd still be fine creating a monster 32 vCPU VM, since all of its vCPUs can stay on one socket: RAM locality is preserved and there's still a good chance of retaining the L1/L2/L3 cache.
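The arithmetic behind that is simple enough to put in a tiny hypothetical helper (the 16 vCPU figure assumes your R630's CPUs are 8c/16t each; adjust the numbers to your actual SKU):

```python
# Hypothetical helper: does a requested vCPU count fit inside a single
# NUMA node? Core counts below are assumptions matching the examples above.
def fits_one_node(vcpus, cores_per_socket, threads_per_core=2):
    return vcpus <= cores_per_socket * threads_per_core

print(fits_one_node(16, 8))   # True  - 2 x 8c/16t: 16 vCPUs stays on one socket
print(fits_one_node(24, 8))   # False - the 24 vCPU VM above must span sockets
print(fits_one_node(32, 20))  # True  - 2 x 20c/40t: the 32 vCPU VM still fits
```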