I posted a question about this over at Duncan Epping's blog, but of course he cannot comment on futures.
So let me see if I can rustle up some useful dialog here instead.
In a nutshell, the problem is as follows:
While VAAI introduced reclamation of deallocated blocks at the VMFS level through the T10 UNMAP & WRITE_SAME primitives, this really only solves half the problem. It does reduce overhead on thinly provisioned storage systems (though currently neither by default nor automatically), but only for deleted VMDKs.
Thin reclamation won't be properly sorted until VMFS is able to recoup blocks deallocated within guests. The currently supported way of doing this is to run the VM through Converter. I'm sure most of us can agree that this is a non-option for production servers; and anyway, imagine doing that for hundreds or thousands of VMs on a regular basis.
Lately, the filesystems themselves have begun to provide a solution to the problem. EXT4 as of kernel 2.6.27 supports the mount option "discard", which uses the same mechanism as VMFS to reclaim space, i.e. the T10 UNMAP & WRITE_SAME commands. This is primarily there to support the TRIM requirement of SSDs, but it also means that a thinly provisioned storage system will be told when a block is dereferenced and can move it into the spare pool. NTFS also needs to support SSDs and implements the TRIM command in Windows 2008 R2 (I believe). Server 8 may be smarter and also built for thin provisioning; or not, since they seem to be trying yet again to move into the storage space with their server products.
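To make that concrete, enabling discard on EXT4 is a one-word change to the mount options (the device and mount point below are placeholders, of course):

```
# /etc/fstab -- "discard" tells ext4 to issue TRIM/UNMAP as blocks are freed
/dev/sda3   /data   ext4   defaults,discard   0   2
```

With that in place, deleting a file inside the guest results in discard requests going down the block stack instead of the space silently staying allocated.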
Anyway: recent and future versions of the two most common operating systems deployed within guests will come equipped with various ways of automatically and near-instantly reclaiming deleted blocks, but currently the SCSI controller layer in the VM does not honour UNMAP/TRIM in any useful fashion. And this is a shame, since it provides the missing piece of the puzzle for end-to-end optimal thin provisioning.
Unfortunately, it's not entirely trivial to solve. Since there is no 1-to-1 mapping between a logical block in the guest filesystem and the VMFS blocks, you cannot just pass the UNMAP all the way through from the guest filesystem down to the array. Instead, you must track these unmapped blocks, and once an entire VMFS block has been deallocated, it can be unmapped from the storage array. Or maybe the sub-block addressing in VMFS allows for a partial unmap?
A possible interim solution would be to optionally translate the unmap into a zero block, but only if there is a way to avoid that becoming an actual write IOp consisting of zeroes hitting the spindles of your storage system. Many storage systems consider an entirely empty block to be equivalent to an unmap (this is how you reclaim disk using vmkfstools -y, after all), but whether it is a penalty-free operation is hard to say.
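The "zero equals deallocate" behaviour can be shown with a toy model. Again, this is a sketch under my own assumptions, not any vendor's implementation: a thin backend that detects an all-zero block write and returns the block to the spare pool instead of writing zeroes to disk.

```python
# Illustrative toy model of a thin-provisioned backend that treats an
# all-zero block write as an unmap. Class and method names are invented.

BLOCK_SIZE = 4096
ZERO_BLOCK = bytes(BLOCK_SIZE)


class ThinBackend:
    def __init__(self):
        self.blocks = {}     # only mapped (non-zero) blocks consume space

    def write(self, lba, data):
        if data == ZERO_BLOCK:
            # No write IOp hits the spindles; the block simply goes
            # back to the spare pool.
            self.blocks.pop(lba, None)
        else:
            self.blocks[lba] = data

    def allocated(self):
        return len(self.blocks)


b = ThinBackend()
b.write(0, b"x" * BLOCK_SIZE)
b.write(1, b"y" * BLOCK_SIZE)
b.write(0, ZERO_BLOCK)       # zero-fill reclaims block 0
print(b.allocated())         # -> 1
```

Whether a real array short-circuits the zero write like this, or only reclaims the block during a later scrub pass, is exactly the "penalty-free or not" question above.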
Anyway, if anyone has further insights they wish to share, then please do. For me, this is one of the major storage-related issues that need to be tackled, and soon, but I may be in the minority, and maybe making a big deal out of a non-issue?
Message was edited by: schistad; added more information for clarity. Vital parts were left in my head and not in original text ;)