Hello everyone,
I have an update to share on this - it may be fairly "wordy" (I know that's not a word), but here goes....
Issue
Linked-clone Windows 10 VMs take a considerable amount of time (anywhere from 5 to 30 minutes) to complete a guest OS restart cycle.
Troubleshooting
To make a long story short, I eventually noticed that disk IO was ZERO during a Windows 10 VM restart - it zeroes out until the VM gets to the Windows 10 logo with the spinning circles, then suddenly spikes once it gets to the login screen. From this, I began to look at the disk configuration settings directly on the VM (vSphere -> right-click VM -> edit). I noticed the disk was using the LSI SAS controller. I reviewed the event logs and noticed (on several Win10 VMs I tested with) a span of 10+ minutes filled with LSI_SAS event warnings. From there, I created a brand new Windows 10 master/gold VM using Paravirtual SCSI instead of LSI SAS. I then sanitized the master VM and created a new pool - the Win10 linked-clone VMs are now down to 5-minute restarts. Much better, but still not great by any means.
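For anyone who wants to measure that warning window on their own VMs, here is a minimal sketch in Python. It assumes you have already exported the System event log into (timestamp, source) pairs somehow; the function name and the sample rows are made up for illustration and are not from my logs.

```python
from datetime import datetime, timedelta


def warning_span(events, source="LSI_SAS"):
    """Return the time between the first and last event from `source`.

    `events` is an iterable of (timestamp, source_name) pairs, for example
    rows exported from the Windows System event log.
    Returns None if no event matches.
    """
    times = sorted(ts for ts, name in events if name.upper() == source.upper())
    if not times:
        return None
    return times[-1] - times[0]


# Made-up sample rows, for illustration only (not taken from a real log).
sample = [
    (datetime(2020, 1, 1, 9, 0, 5), "Kernel-General"),
    (datetime(2020, 1, 1, 9, 0, 30), "LSI_SAS"),
    (datetime(2020, 1, 1, 9, 6, 0), "LSI_SAS"),
    (datetime(2020, 1, 1, 9, 11, 45), "LSI_SAS"),
]

print(warning_span(sample))  # 0:11:15
```

A span that roughly matches the restart delay would point at the storage controller, which is what I saw before switching to Paravirtual SCSI.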
We then noticed that the disk on the VM (on all the linked-clone VMs, in fact) was a "vmname.checkpoint.vmdk" - the "checkpoint" in the file name indicates it's using a snapshot from the master VM. After many hours of testing, we found a sort of fix/workaround....
Fix
If you:
1. Shut down the VM and migrate the storage (storage only)
2. Select configure per disk
3. Change the storage datastore for each disk on the VM, click Next, and let it complete (this can take anywhere from a few minutes to much longer, depending on disk size)
4. You should then have an alert on the VM indicating a consolidation is needed
5. Right-click the VM -> Snapshots -> Consolidate
6. Once the consolidation is finished, right-click the VM - notice that you can now change the disk size (if needed)
7. Power up the VM, log in, and restart
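If you have many VMs to fix, the steps above could be scripted. The following is only a rough sketch, not something I have run in production. It assumes `vm` behaves like a pyVmomi `vim.VirtualMachine`, that you build the `vim.vm.RelocateSpec` yourself (pointing each disk at the new datastore), and that `wait_for_task` is your own helper that blocks until a vSphere task finishes. The function name is made up.

```python
def migrate_and_consolidate(vm, relocate_spec, wait_for_task):
    """Storage-only migration, consolidate if flagged, then power on.

    Assumptions (not verified against any particular vSphere version):
      - `vm` behaves like a pyVmomi vim.VirtualMachine.
      - `relocate_spec` is a vim.vm.RelocateSpec built by the caller that
        places each disk on the target datastore (storage only).
      - `wait_for_task` is a caller-supplied helper that blocks until the
        given task completes.
    Returns True if a consolidation was run.
    """
    # Step 1: the VM must already be shut down.
    if str(vm.runtime.powerState) != "poweredOff":
        raise RuntimeError("shut the VM down first (step 1)")

    # Steps 1-3: migrate the storage only.
    wait_for_task(vm.RelocateVM_Task(relocate_spec))

    # Step 4: vSphere flags the VM when a consolidation is needed.
    consolidated = False
    if vm.runtime.consolidationNeeded:
        # Step 5: Snapshots -> Consolidate.
        wait_for_task(vm.ConsolidateVMDisks_Task())
        consolidated = True

    # Step 7: power the VM back on; the guest restart is left to you.
    wait_for_task(vm.PowerOnVM_Task())
    return consolidated
```

The function refuses to touch a running VM rather than powering it off itself, since a hard power-off is not the same as the guest shutdown in step 1.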
The restart time is now down to under 30 seconds (I've seen it as low as 7-8 seconds). I tested this using both thick and thin disk provisioning, and neither seemed to make a difference. With 100% certainty (at least in my case), it has to do with the VM running off a "checkpoint.vmdk" disk - once the VM's disk is pointed back to its own self-named .vmdk, the restart issue is resolved.
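A quick way to check which disk file a VM is running off, before and after the fix, is to look at the disk's file name. Here is a tiny Python sketch of that check. The function name and the example datastore paths are made up, and it only looks at the name pattern I described above.

```python
def uses_checkpoint_disk(disk_file_name):
    """True if a disk's backing file looks like "vmname.checkpoint.vmdk".

    `disk_file_name` is the datastore path shown for the disk, e.g.
    "[datastore1] vm01/vm01.checkpoint.vmdk" (example path is made up).
    """
    base = disk_file_name.rsplit("/", 1)[-1]
    return base.lower().endswith(".checkpoint.vmdk")


print(uses_checkpoint_disk("[datastore1] vm01/vm01.checkpoint.vmdk"))  # True
print(uses_checkpoint_disk("[datastore1] vm01/vm01.vmdk"))             # False
```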
Now to provide some additional context to this:
In vSphere, I create a new VM, set the configuration options, point it to an .iso, power up the VM, and proceed through our imaging process. Once this is complete, I log into the VM using a domain admin account, configure the OS, sanitize it (prep it for cloning), and take a single snapshot.
I then create (or edit) a View pool (for Win10 testing they have all been persistent/dedicated pools) which provisions VMs off the snapshot from that new gold VM. I noticed something interesting during this composing process earlier today - while the linked clone is being provisioned (we unfortunately do not have licensing for instant clones), the disk in use is its own named "vmname.vmdk" - great! However, once provisioning is complete, the disk changes to "vmname.checkpoint.vmdk".
Does anyone have any insight into why that occurs? Why wouldn't the cloned VM just continue to use its own named .vmdk? Any helpful answer on that would be much appreciated!
At any rate, we're able to fix/work around the Win10 restart slowness by, again, migrating the VM's storage only to a new datastore, consolidating under Snapshots, and powering up the VM - and BAM, fixed.
I hope all of this makes sense - if anyone is still experiencing Win10 restart slowness and is able to try this storage migration/consolidate process, please give it a try and let me know the result. So far it's been consistent for me, but I'd like to hear what others' experience is.
This almost feels like a possible bug in vCenter or View somewhere, but I can't say for sure - does anyone at VMware have any thoughts to share?
Also, fwiw, we're running a Pure Storage array (not vSAN) with plenty of space available and excellent dedupe. We're using EFI (though I've tried both EFI and BIOS, and neither seems to make a difference as far as restart times are concerned).
-Nathan