I have been offering VMFS recovery services for about 6 years. Because of that I see a lot of vSphere environments of various sizes.
Let me sum up the most important lessons I learned by visiting those environments.
The chance of running into serious problems with vSphere is quite small in very, very small environments. Environments that are large enough to implement replication and daily automated backups also run a very small risk.
The highest chance of running into serious problems with vSphere goes to environments with a handful of ESXi hosts that use local storage.
So environments like the one you have run the highest risk.
Why is that?
My theory:
- Neither previous experience with running traditional Windows servers nor the VMware Best Practice documentation really applies to typical small setups.
- Tips for selecting a good RAID level usually only use 2 parameters: amount of effective storage space versus performance. So far I have not seen any RAID-level discussion that also talks about the parameter "survival rate".
- Discussions and blogs also rarely use the "survival rate" parameter in the context of the VMFS filesystem. VMFS is considered to be rock-stable and "enterprise-class".
- Most new VMware admins transfer their experience with NTFS to VMFS and are very surprised by the tips I have to offer after a recovery.
When admins read best practice tips or the VMware design suggestions, they assume these tips apply to all scales of vSphere environments.
This is a very dangerous assumption if you run a small environment with local disks - better be prepared for some big surprises.
Let me sum up my experiences after 5 years of doing VMware recovery.
VMFS is rock-stable: this is correct if you configure your VMs with eager-zeroed thick provisioned vmdks only and do not use snapshots.
VMFS is "enterprise-class" software: this only applies to environments that are large enough to implement regular automatic backups plus replication.
Thin-provisioned vmdks and snapshots are as safe as normal vmdks: this is a very dangerous assumption. VMFS datastores that hold a large number of vmdks that change their allocation during normal use do not handle unexpected power failures well.
A datastore with static vmdks will survive a power failure about as well as a Windows server running on bare metal using NTFS.
A datastore with lots of thin-provisioned vmdks and snapshots can lose all its content after a single power failure - this is something that is completely unexpected by 99% of the admins of small environments.
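If you want to know how many of your vmdks fall into the "changes its allocation" group, the small text descriptor file of each vmdk usually tells you. Here is a minimal sketch in Python, not a finished tool. It assumes the descriptor marks thin disks with `ddb.thinProvisioned = "1"` and snapshot disks with a `parentFileNameHint` entry (or a `parentCID` other than `ffffffff`); check that against your own descriptors before trusting it. The function name is made up for this example, and it cannot tell eager-zeroed from lazy-zeroed thick disks.

```python
def classify_vmdk_descriptor(text: str) -> str:
    """Return 'snapshot', 'thin' or 'thick' for the text of a vmdk descriptor.

    Assumption: key names as described above; 'thick' here only means
    'neither thin nor snapshot', it does not prove eager-zeroed.
    """
    fields = {}
    for line in text.splitlines():
        line = line.strip()
        # Skip blanks, comments and extent lines (extent lines have no '=').
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        fields[key.strip()] = value.strip().strip('"')

    # A snapshot (delta) disk points back at its parent disk.
    has_parent = fields.get("parentCID", "ffffffff").lower() != "ffffffff"
    if "parentFileNameHint" in fields or has_parent:
        return "snapshot"
    if fields.get("ddb.thinProvisioned") == "1":
        return "thin"
    return "thick"


if __name__ == "__main__":
    # Hypothetical descriptor content, only for illustration.
    sample = '''# Disk DescriptorFile
version=1
CID=fffffffe
parentCID=ffffffff
createType="vmfs"
RW 41943040 VMFS "vm-flat.vmdk"
ddb.thinProvisioned = "1"
'''
    print(classify_vmdk_descriptor(sample))
```

Run it over copies of the descriptor files, not over the flat or delta files themselves. Every disk reported as thin or snapshot belongs to the risky group described above.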
RAID level 5 is a good compromise of space vs performance. It sure is better than using single disks without RAID: almost everybody would agree with that, I think. But I learned it the hard way and cannot say it applies to the combination small environment + RAID5 + VMFS.
Instead I tell all my customers that run a small environment:
Never even consider using RAID5 for a VMFS datastore!!!
I would even claim that using thick provisioned vmdks on single disks is a safer choice.
RAID5 is only acceptable if your environment is large enough to have replication.
In 2016 I saw about one small environment per week that had severe RAID5-related problems. Typical problems are:
- datastore appears blank after a power failure
- after exchanging a disk marked as faulty the RAID rebuild was a complete disaster and all data seems lost
- even service crews of RAID controller vendors misconfigure the RAID5 config after problems that looked harmless
I have more customers with severe data loss using RAID5 than I have with all other possible configs summed up.
To add to that ... the type of corruption you get after a RAID5 rebuild can be really really ugly.
So let me sum it up: IMHO running a small environment with local disks + VMFS + vSphere 5 and higher + thin-provisioned disks + RAID5 + lack of an emergency power supply is a completely unacceptable risk.
Do the math yourself - if you add the "survival rate" parameter when deciding the free space vs performance question, then RAID10 or RAID1 is the only option that makes sense.
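To make "do the math" a bit more concrete, here is a rough sketch in Python that puts usable space next to one simple survival number: the chance that a second disk failure kills the array while it is degraded after the first one. It assumes independent disk failures and a made-up per-disk failure probability `p_fail` during the rebuild window. The function name and all figures are mine, not measured data. It also says nothing about botched rebuilds, controller misconfiguration or VMFS metadata damage, which is where most of the ugly cases I see come from.

```python
def raid_estimate(level: str, disks: int, disk_tb: float, p_fail: float):
    """Return (usable_tb, p_loss_while_degraded) for a simple RAID model.

    p_fail is an assumed probability that any single remaining disk
    fails before the rebuild has finished.
    """
    if level == "RAID1":
        if disks != 2:
            raise ValueError("this sketch models RAID1 as a 2-disk mirror")
        usable, critical = disk_tb, 1             # only the surviving mirror matters
    elif level == "RAID5":
        if disks < 3:
            raise ValueError("RAID5 needs at least 3 disks")
        usable, critical = (disks - 1) * disk_tb, disks - 1   # every remaining disk is fatal
    elif level == "RAID10":
        if disks < 4 or disks % 2:
            raise ValueError("RAID10 needs an even number of disks, 4 or more")
        usable, critical = disks / 2 * disk_tb, 1  # only the failed disk's mirror partner is fatal
    else:
        raise ValueError(f"level not covered by this sketch: {level}")

    # Chance that at least one of the critical disks fails during the rebuild.
    p_loss = 1 - (1 - p_fail) ** critical
    return usable, p_loss


if __name__ == "__main__":
    # Hypothetical numbers: 2 TB disks, 2% chance per disk during rebuild.
    for level, disks in (("RAID1", 2), ("RAID5", 4), ("RAID10", 4)):
        usable, p_loss = raid_estimate(level, disks, 2.0, 0.02)
        print(f"{level:6} {disks} disks: {usable:.0f} TB usable, "
              f"loss risk while degraded {p_loss:.1%}")
```

Even in this friendly model RAID5 buys its extra space with a risk that grows with every disk you add, while RAID1 and RAID10 only ever depend on one specific disk surviving the rebuild.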
Ulli
My view is biased - if you run a medium or large environment these tips don't apply to you.