Hi Bob
Thanks for the detailed response; let me go through your questions one by one.
Can you elaborate on what you mean by "corruption" - e.g. was the VM unusable and required some form of file-system check or restore from backup to become functional or do you just mean the VM crashed?
I assume that it was a corrupted disk, though I am still not sure what exactly happened. The system did not respond anymore because it was offline. From what I could see, it was offline because HA killed the machine and then started moving it to the other host. Upon boot, the machine would prompt me for the decryption password of one of the disks. After entering the password, the boot process would freeze, shortly after the machine would be offline again, and a move to the other ESXi host would begin.
I am assuming that the disk became unreadable for some reason, though I have no evidence to back that up. The issue persisted even after disabling HA on that cluster completely. My gut tells me that HA only kicked in because something happened during snapshotting that led to the system becoming unresponsive. After that, HA just kept trying to start the machine again on another host.
Are these vmdks by any chance shared with another VM? (and/or have some other configuration which would make them unsuitable/unsupported for snapshot-based backups)
No, none of the 4 disks that server used were shared with another VM. But now that you mention it, I just recalled that during my first Veeam backup attempt I got a message saying that one of the disks, SERVERNAME.vmdk, was skipped because it was in independent mode (sorry I didn't mention this earlier; this all happened about 4 weeks ago, so I had completely forgotten about it):
Disk SERVERNAME.vmdk has been skipped due to an unsupported type (independent disk)
That disk has a size of 2 TB and stores customer data. I believe the reasoning behind the independent flag was that if, for whatever reason, the machine had to be restored, that data partition would not be touched (though I'm not sure that really was the reason). Once I saw that message, I stopped the backup job. Since I wanted the job to stop immediately, I used the "immediate stop" option instead of the "graceful stop" (AFAIK the graceful stop would've just skipped any other machines in the backup queue, but still completed the backup of the machine already running). According to Veeam, any snapshots already created should've been deleted, even when the "immediate stop" option is used.
I then edited the virtual machine, removed the independent flag on that disk, and started the backup again. This time I waited about 5 minutes before noticing that another production backup job had started running and was blocking the resources for my data center migration job. I decided to stop the job again (it hadn't even started reading data yet), waited for the other backup job to complete, and then started the job once more.
This time the job ran for about 6 minutes (it had already read 14.5 GB) before I noticed that the server being backed up wasn't responding anymore. I decided to cancel the backup job again, but this time the Veeam job exited with the following error message in the log:
Removing VM snapshot Details: The operation cannot be allowed at the current time because the virtual machine has a question pending: 'msg.hbacommon.corruptredo:The redo log of SERVERNAME-000001.vmdk is corrupted. If the problem persists, discard the redo log. '.
This all happened in a time window of around 35-40 minutes. At that point I was pretty stressed out and panicking, as I feared I had just destroyed the old production system, so I immediately started creating a backup copy of the machine directory, just to be safe (I still have that copy if you think there are logs in it that might be helpful).
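Coming back to the independent flag for a second: I believe it boils down to a single mode line in the VMX (the device number below is just an example, not necessarily the one on that server):

    grep -i ".mode" SERVERNAME.vmx
    #   scsi0:1.mode = "independent-persistent"

After removing the flag through the VM settings, that mode line should be gone (back to the default dependent mode), which is presumably why Veeam stopped skipping the disk - as far as I understand, independent disks are excluded from snapshots and therefore from snapshot-based backups.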
This would have the added benefit of you having a copy that you can test whatever you want with while retaining the other until you are sure what the plan is - since you are moving this anyway, you could just clone it to the new datacenter and leave it at that (is the storage at the new location accessible from the current infrastructure to allow this and/or can it be attached?).
I can't attach the storage of the new data center directly, though I could theoretically mount the CIFS share of the Veeam server on the ESXi directly and just copy the machine directory there. I've never actually attached another share on an ESXi host using SSH, and there doesn't seem to be an /etc/fstab file - is there anything I should be aware of when attaching external shares? Is there any specific command I should use to copy the directory, or just plain old cp (there doesn't seem to be an rsync command on ESXi)?
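In case it matters, this is roughly what I had in mind, assuming the Veeam server could expose the share via NFS instead (from what I've read, ESXi can only mount NFS natively, not CIFS) - all names here are made up:

    # mount an NFS export from the Veeam server as a datastore (hypothetical names)
    esxcli storage nfs add --host=veeam-server --share=/backups --volume-name=veeam_nfs
    mkdir /vmfs/volumes/veeam_nfs/SERVERNAME

    # plain cp for the small files (vmx, nvram, logs), with the VM powered off ...
    cp /vmfs/volumes/datastore1/SERVERNAME/*.vmx /vmfs/volumes/veeam_nfs/SERVERNAME/

    # ... and vmkfstools for the disks, since it copies the descriptor + flat file
    # pair properly and can thin-provision the target
    vmkfstools -i /vmfs/volumes/datastore1/SERVERNAME/SERVERNAME.vmdk \
        /vmfs/volumes/veeam_nfs/SERVERNAME/SERVERNAME.vmdk -d thin

Does that look about right to you, or would you copy it differently?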
If you delete (or move to a temporary folder) the .vmsd file the consolidation message may go away (and a new .vmsd created automatically the next time the VM is snapshotted).
Thanks, I will keep that in mind and try it out once I've created a backup copy of the server.
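Just so I'm sure I understood you correctly, I assume you mean something along these lines (paths made up):

    # move the snapshot database file aside; a new one should be created on the next snapshot
    mv /vmfs/volumes/datastore1/SERVERNAME/SERVERNAME.vmsd \
       /vmfs/volumes/datastore1/SERVERNAME/SERVERNAME.vmsd.bak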
Anything else about this VM (e.g. vm-tools version, VM hardware version, OS, multiple VSS's configured) or the version of Veeam/ESXi (e.g. incompatible) that might cause issues during backups?
It's a legacy system, so it's quite old. It's running SLES 11.4.16 (2015-06-18), VMware Tools is 10.0.0.50046 build-3000743 (according to the output of vmware-toolbox-cmd -v), the VM hardware version is 8, and the ESXi hosts themselves are 5.5.0 build-3248547. I'm not sure how I'd go about finding out whether anything about this combination would cause backup trouble, though. I'd rather migrate the application on that machine to our new application server than move the machine itself, but the risk of total loss in that data center grows every day, so I don't have a lot of time to plan an application migration (the hardware down there has been running uninterrupted for about 2.5 years without any maintenance or monitoring).
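For completeness, this is roughly how I collected those numbers, in case you want me to pull anything else:

    # inside the guest (SLES)
    cat /etc/SuSE-release
    vmware-toolbox-cmd -v

    # on the ESXi host
    vmware -vl                                    # ESXi version and build
    grep -i "virtualHW.version" SERVERNAME.vmx    # VM hardware version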
Were you attempting this from the ESXi host that the VM is registered and running on?
I just checked, and yes, the host I ran this command on is the host that is running the virtual machine.
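If it helps, this is how I verified it (SERVERNAME and the VM id are placeholders):

    # on the ESXi host: check that the VM is registered here and note its id
    vim-cmd vmsvc/getallvms | grep SERVERNAME
    # and confirm it is actually powered on on this host
    vim-cmd vmsvc/power.getstate <vmid>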
Can you elaborate on this? Did you just revert to base (e.g. skip them) and delete them from the VM folder? This could also be the cause of the .vmsd stating what it does.
Gladly. After we created a copy of the now-corrupted machine, the first thing we tried was to run a disk consolidation, but that threw the following error:
An error occurred while consolidating disks: 9 (Bad file descriptor)
Next we tried to get rid of the snapshots, in the hope that the base disks were unaffected. We had a look at the directory, saw that there were snapshot files lying around (judging by the 00000X-flat.vmdk suffixes), and tried moving all of them into a temporary directory (including SERVERNAME_2-000001-flat.vmdk). As this was the weekend and our customer monitoring tool told us that no new data had been delivered in the affected time frame, we decided it would be safe to discard the snapshot deltas anyway.
At this point we didn't yet realize that SERVERNAME_2-000001-flat.vmdk was actually a base disk, so we moved it away too. We then edited the VMX file and pointed the disk entries to the VMDK files without the snapshot suffixes. It was around this time that we noticed that something didn't add up (4 disks configured, but the numbers in the disk names only went up to 2 instead of 3, as we'd expect). That's when we checked the copy directory and saw that every disk pointed to a snapshot disk except SERVERNAME_2-000001-flat.vmdk - that one didn't seem to have changed (SERVERNAME-flat.vmdk, for example, now pointed to SERVERNAME-000001-flat.vmdk).
Once we realized that there was a strangely named disk, we knew we couldn't trust the file names and started checking the VMDK files to work out which ones were the real snapshot disks. We then edited the VMX file to point to the base disks, moved the delta files away, and that's when everything started working again.
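In case it's useful, this is roughly what that check looks like - a delta descriptor references its parent, while a base descriptor doesn't, regardless of how the file is named (names below are just examples):

    # a real delta: the descriptor has a parent reference and a sparse create type
    grep -E "createType|parentFileNameHint" SERVERNAME-000001.vmdk
    #   createType="vmfsSparse"
    #   parentFileNameHint="SERVERNAME.vmdk"
    # a base disk's descriptor has createType="vmfs" and no parentFileNameHint

    # in the VMX, the disk entries we changed back look like this
    grep -i "fileName" SERVERNAME.vmx
    #   scsi0:0.fileName = "SERVERNAME-000001.vmdk"   <- changed back to "SERVERNAME.vmdk"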
This has turned into quite a long answer; I hope you still find the time to read it. I really appreciate your input.
Cheers,
ahatius