This is a long post, but hopefully someone will get some useful info out of it, and hopefully I can get some useful info from the replies as well.
I have deployed one VSAN cluster, and based on my experience with it so far I am hesitant to recommend it again.
Let me walk you through my nightmare.
(3) Dell R530 servers, each with:
- PERC H730 Mini storage controller (queue depth of 925, last time I checked)
- 1x 200G SSD
- 4x 2T HDDs
- 10G networking with redundant switches
- 48G RAM
Dataset size
6T of data
Based on VMware's recommendation to start with SSD capacity that is 10% of the dataset size, we started with 600G of SSD capacity.
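(For the math: 10% of 6T = 600G of cache; spread across 3 hosts, that works out to the one 200G SSD in each server.)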
Issue #1
Customer calls complaining of poor performance. VSAN benchmarked very well in the lab and the VMs were flying along for several weeks, so I did not immediately think storage was the issue.
Looking into the performance issue, I eventually found VSAN doing component resyncs; it had several hundred gig remaining, with 4 hours left until completion.
During this time VM performance degraded so much that the customer sent employees home because email, files, print, applications, etc. were all effectively down. Storage latency went into the thousands of ms. Exchange complains above 20ms; imagine how happy it was at 2000ms.
I noticed the components doing a resync belonged to VMs that another engineer had made changes to earlier in the day. He had increased the virtual disk sizes since the VMs were running out of room. Expanding the disks of VMs has never caused an issue before, and it's something we regularly do during business hours.
When a drive is expanded on VSAN, the drive is not simply expanded. Here's the process as best as I can tell:
1. Tech expands drive from 500G to 650G
2. VSAN creates 3 new components (a component is basically a chunk of up to 256G).
3. VSAN then copies data from the 2 existing 256G components into the 3 new components.
- By default a component has a stripe width of 1, so a single disk is hammered hard while the data is being read or written.
- Our fault tolerance settings keep data on 2 servers, so the high I/O is happening on multiple disks on multiple hosts.
4. When that copy is done, it deletes the original 2 components.
KEY THING TO NOTE:
1. You need double the disk space during an expansion. If you have a 500G disk you are expanding to 650G, you need enough free capacity for that 500G of data to be duplicated. If you watch free capacity on the VSAN while this is happening, space is slowly consumed as the copy happens, and once it completes there is a large jump in free space when the original components are deleted. If your disks are too large and there is not enough free space on the VSAN, you will not be able to expand your drive.
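If you want to watch this happen, RVC (the Ruby vSphere Console that ships with vCenter) can show component layout, resync progress, and per-disk free space. A minimal sketch; the vCenter address, datacenter, cluster, and VM names below are placeholders for your own environment:

    # Connect to vCenter with RVC
    rvc administrator@vsphere.local@vcenter.example.com

    # Bytes left to resync, broken out per object
    vsan.resync_dashboard /vcenter.example.com/Datacenter/computers/Cluster

    # Component layout for a single VM (how many components each virtual disk object has, and where they live)
    vsan.vm_object_info /vcenter.example.com/Datacenter/vms/SomeVM

    # Per-disk capacity and usage, to watch free space dip during the copy and jump back afterwards
    vsan.disks_stats /vcenter.example.com/Datacenter/computers/Cluster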
Resolution to #1
After discussions with VMware, and within VMware, it was decided there was not enough SSD capacity, so the system was going to spinning disk too often, causing the slowness. At the time it sounded reasonable, but during a resync it's copying all the data, so it has no choice but to go to disk no matter how much cache you have.
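In hindsight I'd verify the read cache hit rate before spending money on SSDs. RVC's vsan.observer was the standard tool for that in this era; a minimal sketch, using the same placeholder cluster path as above:

    # Collect live stats for 1 hour and serve them on the built-in web server (port 8010)
    vsan.observer /vcenter.example.com/Datacenter/computers/Cluster --run-webserver --force --interval 30 --max-runtime 1

Browse to https://<vcenter>:8010 while it runs and check the read cache hit rate per host. A consistently low hit rate is actual evidence for "not enough SSD"; high latency during a resync is not.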
So we bought 3 new 200G SSDs and I started a project of adding an SSD to each server.
I couldn't get a clear set of instructions on how to do it with a 3 host VSAN, so I will lay that out here for anyone that might need it.
The final result will be 2 disk groups per server, each group with 1x 200G SSD and 2x 2T HDDs. The starting state is 1 disk group with 1x 200G SSD and 4x 2T HDDs.
1. Update firmware and drivers on everything in your server to meet the VMware HCL
2. Install the SSD drive into a host, and ONLY one host (See Issue #3 for why only one host)
3. Make sure the disk is in pass-thru mode on the storage controller
4. Edit the existing disk group on that host.
- Remove 2 of the HDDs from the existing group. You'll get the option to evacuate data to another disk or to just limp along.
- Evacuate data if you can, otherwise limp along. I chose limp along since that was my only option with a 3 node cluster.
5. Create a new disk group using the new SSD and the 2 HDDs you just removed.
- VSAN will start to resync data back onto these disks. It does not check whether those components already existed, so it does a full copy of all the data the disks previously held.
6. When the resync is complete, repeat steps 1-5 on the remaining hosts, one host at a time. (A rough esxcli version of steps 3-5 is sketched below.)
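If you'd rather do the disk group surgery from the ESXi shell than the web client, esxcli can handle it. A minimal sketch, assuming a vSAN 6.x host (that's when the --evacuation-mode flag appeared) and made-up naa IDs; double-check the device IDs with the list commands before removing anything:

    # Confirm the new SSD shows up as a pass-through local device
    esxcli storage core device list

    # See which disks vSAN has claimed and how the current disk group looks
    esxcli vsan storage list

    # Step 4: pull 2 HDDs out of the existing disk group. With only 3 hosts there is
    # nowhere to evacuate a full copy to, so this keeps objects accessible (degraded)
    # instead of evacuating, i.e. the "limp along" option.
    esxcli vsan storage remove -d naa.hdd3 -m ensureObjectAccessibility
    esxcli vsan storage remove -d naa.hdd4 -m ensureObjectAccessibility

    # Step 5: build the second disk group from the new SSD plus the 2 freed HDDs
    esxcli vsan storage add -s naa.newssd -d naa.hdd3 -d naa.hdd4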
Issue #2
After adding the SSD to one of the hosts and reconfiguring the disk group, I figured I'd get to the next host the next day. I was hoping to be able to do them all in a single weekend.
About 24 hours after the first host was done, and before I started the second, we got a bunch of VSAN errors and VSAN on the host I had just updated had gone belly up, serving no data at all.
Working through it with VMware support, the only explanation anyone could come up with for why VSAN had crashed was that the SSDs were running different firmware versions.
Resolution #2
Update the disk firmware.
I updated the firmware on the SSDs and HDDs and it did a complete resync of data again. Recall that during resyncs VMs were basically useless, so it wasn't a pleasant experience for anyone.
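For what it's worth, you can compare disk firmware across hosts from the shell before VSAN finds the mismatch for you; the Revision field in the device listing is the firmware version:

    # Dump each device's name next to its firmware revision; run on every host and compare
    esxcli storage core device list | grep -E "Display Name|Revision"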
Once that was sorted and it ran smoothly for a week I proceeded to do all the same steps on the remaining 2 hosts, each 1 week apart.
Issue #3
We decided we needed to add a dedicated HDD to each host that is not part of the VSAN cluster. Since our hosts boot off SD cards, there are capacity issues and warnings that never go away unless temp/scratch data can be redirected to a HDD.
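(For anyone chasing the same SD card warnings: the usual fix is pointing the host's scratch location at persistent storage. A minimal sketch, assuming the new disk is formatted as a local VMFS datastore named scratch-ds; the change takes effect after a host reboot:)

    # Point the scratch partition at a folder on the local datastore
    esxcli system settings advanced set -o /ScratchConfig/ConfiguredScratchLocation -s /vmfs/volumes/scratch-ds/.locker-host1

    # Then reboot the host for the change to take effect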
We ordered 3 500G HDDs from Dell. One of our engineers put them in today but did no configuration. 45 minutes later we started getting errors about VSAN being down on a host: "Virtual SAN device is under permanent failure". That can't possibly be good.
VSAN disk claiming is in manual mode, so I know the disks weren't added to the disk groups.
I had him immediately pull the 500G disks from all hosts. It was too late, however. Host #2 had lost both disk groups, and host #1 had lost one of its disk groups. That meant some data components were inaccessible, and VMs started crashing.
Resolution #3
Reboot host #2, since it was in the worst shape. It came up normally and the disk groups looked fine, but data was out of sync. A resync started, so instead of rebooting host #1 right away I let the resync finish. When it was done, ~50% of the components were still in a degraded state, and some were inaccessible. Mind you, I had been on hold with VMware support for 1.5 hours waiting for help. When the resync was done I rebooted host #1. It came up normally, all components are now showing normal, and it's resyncing a ton of data with an estimated 5 hours to completion.
Remember issue #1, poor performance during resync? I can definitively say adding SSD made little difference. Most of my VMs are currently powered off since latency is so high that they won't boot successfully. Some machines that are running show disk latency of 500-4000ms.
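If you land in the same spot, RVC can enumerate exactly which objects are inaccessible instead of leaving you to infer it from crashing VMs. A minimal sketch, same placeholder cluster path as earlier:

    # List inaccessible/orphaned objects; -r re-checks and refreshes their state first
    vsan.check_state -r /vcenter.example.com/Datacenter/computers/Cluster

    # And watch how much data is left to resync
    vsan.resync_dashboard /vcenter.example.com/Datacenter/computers/Cluster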
Cause of #3
I think it might be due to a disk firmware mismatch, but I won't know until VMware answers the phone and we can take a look at it.
-----------------------------
My key takeaways from my experiences:
1. If it ain't broke, don't touch it.
2. If you have to touch it, only ever work on a single host at a time, and do host changes a week apart.
3. Expand disks after hours or during slow times. The resync can affect VMs you did not make changes to.
4. Expect VSAN to crash when adding new disks to the host, thus only work on one host at a time. I am 2 for 2 on crashes when adding disks.
5. Update firmware on everything when making any physical changes, especially the disks themselves which I think is often overlooked.
6. Get more SSD capacity than you think you need; cache never hurts anyone.
7. Never use a single disk group; multiple groups make managing and repairing easier.
8. Don't use 7200 rpm drives. They are generally fine, but during a resync you will wish you had gotten 10k or better.
9. Consider using stripe width greater than 1.
10. Use 4 hosts minimum if you can.
11. Resync won't cause slowness on all VMs, just the ones whose components share disks with the components being resynced.