PreferHT is used to consolidate the vCPUs as much as possible and create the fewest number of NUMA clients. In your situation, there are four NUMA nodes, 22 cores with each 384 GB of memory (assuming the DIMMs are equally distributed across sockets and offer the same capacity). With PreferHT, the NUMA scheduler takes the SMT capabilities into account, and therefore, each NUMA node can now accept NUMA clients similar to the HT thread count. In your situation, that is 44.
With this theory, your VM of 64 vCPUs should be distributed across two NUMA clients, each NUMA client grouping 32 vCPUs.
The advice of setting numa.autosize.vcpu.maxPerVirtualNode in the book is to propagate the virtual NUMA topology to the guest OS. This is typically recommended for VMs that exceed the memory capacity of a NUMA node while being able to all the vCPUs in a single NUMA node. In your scenario, setting it to 11 restricts the scheduler to prefer HT threads for sizing the NUMA client. 64/11 = 5.8; thus, it should round up to 6, but it doesn't do this, so I expect other advanced settings are influencing the NUMA client configuration.
Instead of diving into and wasting much time on figuring out this anomaly, my recommendation is to remove the setting numa.autosize.vcpu.maxPerVirtualNode, and allow the NUMA scheduler to align the NUMA client configuration to the PreferHT setting. Typically most customers do not use PreferHT on a virtual machine that has that many CPUs. PreferHT is an advanced setting designed to help workloads that are cache-intensive, but not CPU intensive. By scheduling all the vCPUs into the same NUMA node, it can leverage the same L1/L2/L3 cache and thus take advantage of the spatial locality of memory access. I suspect, you assigning 64 vCPUs is to have a lot of CPU resources available for the application.
We decoupled the NUMA scheduling constructs from the user setting Cores per Socket (CPS). Thus the setting provides customers to align with licensing requirements while the NUMA scheduler can manage its constructs to optimize memory access. When setting it, it provides a socket topology, and the Guest OS translates that as cache scheduling domains. In your situation, with 64 vCPUs with PreferHT enabled and maxPerVirtualNode disabled, you should get two NUMA clients with each containing 32 vCPUs. Your CPS setting should be 32 cores per Socket. That will present a dual-socket configuration, and your Guest OS will create a Cache map, where 32 CPUs share the same cache. The application and Guest OS can now determine where to place the threads when they expect these threads to share and access the same memory blocks.
Hope this clarifies it for you, if you decide to apply my recommendation, please test it first in a non-production environment and post the results as others might experience a similar situation and are helped by your experience.