Hi all,
This year we migrated a system to new hardware and software. The old system was ESXi 5.5 on a Lenovo x3550 M5 with 64GB RAM, 1x SSD datastore (RAID10) and 1x HDD datastore (RAID5). All the production systems were on a single VM running Windows Server 2008 R2.
The new system is ESXi 6.7 on a Fujitsu RX2530 M4 with 64GB RAM and a single all-SSD datastore (RAID10). The production server is now Windows Server 2016. Both servers use MegaRAID-based RAID controllers: the Fujitsu has a PRAID EP540i and the IBM has a ServeRAID M5201.
The reason for this post is that since upgrading we have been experiencing some issues that we were not expecting.
Firstly, the strangest issue. Previously (for a couple of years) we were periodically taking snapshots of the main VM during working hours and never had any issues. Since we moved to the Fujitsu, users complain of performance issues when using the system during and after snapshots, and on more than one occasion the whole Windows VM has frozen and we have had to reset it (no errors shown on the ESXi side). We now avoid taking any snapshots during working hours.
Another issue is that when running backups (Veeam) the system becomes quite unresponsive at times, so we are now avoiding any backups within working hours. Previously we didn't see this issue. (We are using CBT backups, but often, and without any error, Veeam insists on reading practically the entire VM, which is over 2TB in size.) Veeam reports backup throughput of about 300-400MB/sec with the source as the bottleneck which, while not slow, doesn't seem particularly amazing for a RAID 10 array of 6 SSDs. I'm not worried about the 300-400MB/sec speed as such, just mentioning it in case it seems unusual to anyone else.
And lastly, and most importantly, it seems that whenever the Windows VM uses any page file, users experience general lag in the system. We have spent the last week ensuring that the system fits within the physical memory and have therefore reduced the impact of the issue substantially, but Windows still seems to like using some page file even when there is a lot of free RAM, and in any case, with SSD RAID 10 we'd hope that any page file usage would be pretty speedy.
Worth mentioning: we are using the paravirtualized SCSI controller for the VM disks. Also worth noting that this VM is currently the only active VM on the Fujitsu and there is sufficient physical memory for it to run without using swap.
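To put the Veeam figures above into perspective, here is a quick back-of-the-envelope calculation of how long one full read of the VM takes at the reported throughput. It is only a rough sketch: it assumes exactly 2TB (the VM is actually somewhat larger), decimal units (1 TB = 1,000,000 MB) and a sustained rate for the whole job. The function name is just something made up for illustration.

```python
def full_read_hours(size_tb: float, rate_mb_s: float) -> float:
    """Hours needed to read size_tb terabytes at rate_mb_s MB/sec (decimal units)."""
    size_mb = size_tb * 1_000_000  # assumption: 1 TB = 1,000,000 MB
    return size_mb / rate_mb_s / 3600  # seconds -> hours


# Throughput range reported by Veeam, applied to an assumed 2 TB VM
for rate in (300, 400):
    print(f"{rate} MB/sec: {full_read_hours(2, rate):.2f} hours")
```

So each time CBT is effectively ignored and the whole VM is read, the datastore is under heavy read load for something like an hour and a half to two hours at least, which matches how long users notice the slowdown.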
We haven't done any specific tuning of the RAID controller, as I didn't find anything specific to MegaRAID controllers when searching for info, so it's using the default settings for a RAID10 array over 6 drives.
So basically I'm wondering if anyone has any thoughts, or has experienced and/or fixed any similar issues. It's as disappointing to be having issues on newer, faster hardware that we did not experience on the old hardware as it is to have the lagging issue on the Windows VM after investing in an all-SSD solution. So any input gratefully received,
thanks, Andy