Hi everyone,
Rather bizarre one that I'm struggling to figure out. We have suffered data loss on a handful of VMs running on two different ESXi hosts.
I first noticed a problem when one of the Linux VMs I was SSHed into dropped the connection, and our monitoring software showed it as unavailable at the same time. The VM appeared to have rebooted: when I SSHed back in a short time later, the uptime reflected a reboot. I then decided to shut down the VM, but noticed in vSphere that the VM was still running even though it was no longer reachable over SSH. Odd. I used the vSphere web console to connect to the VM, found that the network interface was down, and brought it up. While on the VM I ran uptime, and the same VM that had just been rebooted and shut down showed an uptime of 1 day (there had been a power outage the day before). Okay, now this is super weird.
Looking further at the health of the VM, I found data loss going back almost two months (this VM runs Jenkins, and pipelines/builds were missing). What's odd is that /var/log/messages on the VM appears to be populated over the period for which data was lost, yet other files are missing.
Regarding the power loss: because of where in the world these ESXi hosts are located, they are subject to frequent power cuts, probably every other month.
Some further background on our ESXi hosts: a little over a month ago we started using ghettoVCB. When I restored this VM from a ghettoVCB backup made a week ago, the backup showed the same data loss described above, so the ghettoVCB backups are missing the data as well. However, when I restored a backup of a different VM, the data in it appeared to be up to date as of when the backup was made.
The data loss in both the VM described and its backup appears to start before ghettoVCB backups were implemented. Does anything described here sound familiar or indicate anything? I've checked logs in a few places, at both the VM level and the ESXi host level, and nothing stands out. Unfortunately I'm unable to determine whether the power losses, the ghettoVCB backups or something else caused this. Any suggestions would be much appreciated.
My questions are:
Is there any way to check the health of virtual machines and their disks, so that data loss on other VMs can be prevented?
Has anyone experienced something weird like this?
Is the lost data recoverable? For the time being we have restored from file-based backups rather than ghettoVCB's image-based backups.
What is the behaviour of ESXi data storage during a power loss?
Could ghettoVCB have caused this?
If you have any questions, feel free to ask and I'll provide as much information as possible.
Thanks for taking the time to read this,
Kind regards