

Disk consolidation needed: How to fix it manually?

  • 1.  Disk consolidation needed: How to fix it manually?

    Posted Aug 31, 2020 02:59 PM

    Hi there

    I've got a big (2.5TB) legacy server in an old data center that I need to move to our new data center. I have almost no knowledge about the infrastructure in our old data center and it seems that there is no working backup there (I wasn't working here when that data center was set up and nothing is documented).

    The old data center is running a vCenter 5.5.0 with 2 ESX hosts, the new data center uses vCenter 7 with 4 ESX hosts. I planned on using VEEAM Backup to move the server from the old data center to our new one (back up from the old data center, restore in the new data center). I am using VEEAM because the vCenter in the old data center is so old that I have trouble getting the standalone Converter to both read from the old data center and copy to the new one (SSL issues, for example).

    I've got a 1GBit connection between the 2 data centers. On a weekend I attempted to start a backup while the server was running in the hopes that I could do incremental backups before finally migrating the server to the new data center. That approach failed horribly.

    During the VEEAM backup job the legacy server stopped responding. Upon further inspection I noticed that the HA cluster, for whatever reason, decided to fail over to the other ESX node. After a few minutes, it switched back to the first ESX again. It kept doing that because the disk was now corrupt and the server kept freezing during the boot procedure. I am still not sure what caused the corruption; maybe there was a pre-existing condition with the VMware disk files.

    It took us quite a while to figure out what needed to be done to get the server back running again, but once the server was running again we had the following disk files lying around:

    • SERVERNAME-ctk.vmdk
    • SERVERNAME-flat.vmdk
    • SERVERNAME.vmdk
    • SERVERNAME_1-ctk.vmdk
    • SERVERNAME_1-flat.vmdk
    • SERVERNAME_1.vmdk
    • SERVERNAME_2-000001-ctk.vmdk
    • SERVERNAME_2-000001-flat.vmdk
    • SERVERNAME_2-000001.vmdk
    • SERVERNAME_2-ctk.vmdk
    • SERVERNAME_2-flat.vmdk
    • SERVERNAME_2.vmdk

    What caught my eye is that although vCenter doesn't show any snapshots anymore, there is still a SERVERNAME_2-000001.vmdk, which suggests a snapshot of disk SERVERNAME_2. But that is not the case: it is the actual operating system disk, whereas SERVERNAME_2.vmdk is the disk of an application data partition. It is referenced in the VMX file:

    scsi0.virtualDev = "lsilogic"
    scsi0.present = "TRUE"
    scsi0:0.deviceType = "scsi-hardDisk"
    scsi0:0.fileName = "SERVERNAME_2-000001.vmdk"
    scsi0:0.present = "TRUE"
    scsi0:0.redo = ""
    scsi0.pciSlotNumber = "16"
    scsi0:1.deviceType = "scsi-hardDisk"
    scsi0:1.fileName = "SERVERNAME.vmdk"
    scsi0:1.ctkEnabled = "TRUE"
    scsi0:1.present = "TRUE"
    scsi0:1.redo = ""
    sched.scsi0:1.throughputCap = "off"
    sched.scsi0:1.shares = "normal"
    scsi0:2.deviceType = "scsi-hardDisk"
    scsi0:2.fileName = "SERVERNAME_1.vmdk"
    scsi0:2.ctkEnabled = "TRUE"
    scsi0:2.present = "TRUE"
    scsi0:2.redo = ""
    scsi0:3.deviceType = "scsi-hardDisk"
    scsi0:3.fileName = "SERVERNAME_2.vmdk"
    scsi0:3.ctkEnabled = "TRUE"
    scsi0:3.present = "TRUE"
    scsi0:3.redo = ""

    vCenter tells me that the server needs a disk consolidation, but I'm too afraid to run it, fearing that it might actually try to consolidate disks SERVERNAME_2-000001 and SERVERNAME_2 into one, and that this might have been the issue to begin with (when the VEEAM backup job started).

    I have no idea if the names were already botched up in the beginning or if that is an outcome of the failed backup job. I've been reading a few KB articles, and some seem to suggest creating a new snapshot and then deleting it, though that seems a bit risky to me on this server, as I don't have a backup. For the same reason, I don't want to run the consolidate option in the snapshots menu.

    How does vCenter / ESX actually detect that a consolidation is needed? Is it just based on the file names? If I were to rename SERVERNAME_2-000001 to SERVERNAME_3 and then update the VMX file, would that work, and would the vCenter warning be gone?

    Any help is greatly appreciated.

    Cheers,

    ahatius



  • 2.  RE: Disk consolidation needed: How to fix it manually?

    Posted Aug 31, 2020 04:08 PM

    Welcome to the Community,

    is that really "SERVERNAME_2-000001-flat.vmdk", i.e. not "delta" or "sesparse" in the file name?

    To me it looks like someone "messed" with the virtual disks, and created that one with a file name that just looks like a snapshot file name. Usually you would run hexdump -C -n 4 SERVERNAME_2-000001-flat.vmdk to see what the first 4 bytes look like, and find out what it is (e.g. "COWD" from right to left for ESXi delta snapshots).
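
    A minimal sketch of that check from the ESXi shell (the file name is a placeholder from this thread):

    # Print the first 4 bytes of the file. A delta/sparse snapshot extent
    # starts with a magic number (the "COWD" mentioned above, shown
    # right to left in the dump); a true -flat base extent has no VMDK
    # header at all - it begins with raw guest data such as a boot sector.
    hexdump -C -n 4 SERVERNAME_2-000001-flat.vmdk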

    Since it's a working VM, you may also simply look at the "SERVERNAME_2-000001.vmdk" descriptor file to see whether it contains a "parentFileNameHint" entry (in case it's a snapshot), or the virtual disk's geometry (in case of a base disk).

    You may - as you mentioned - rename SERVERNAME_2-000001 to SERVERNAME_3, if it's simply a base virtual disk with a confusing name.

    The steps to rename the virtual disk are (sketched together below):

    1. cleanly shut down the VM
    2. run: vmkfstools -E SERVERNAME_2-000001.vmdk SERVERNAME_3.vmdk
    3. edit the .vmx file, and replace SERVERNAME_2-000001.vmdk with SERVERNAME_3.vmdk
    4. finally reload the VM (see steps 2+3 in https://kb.vmware.com/s/article/1026043)
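
    Put together, a minimal sketch of those steps from the ESXi shell (the Vmid and the sed-based .vmx edit are illustrative; editing the file by hand works just as well):

    # 1. After cleanly shutting down the VM, rename the disk.
    #    vmkfstools -E renames the descriptor and its extents together.
    vmkfstools -E SERVERNAME_2-000001.vmdk SERVERNAME_3.vmdk

    # 2. Point the .vmx at the new name.
    sed -i 's/SERVERNAME_2-000001.vmdk/SERVERNAME_3.vmdk/' SERVERNAME.vmx

    # 3. Look up the VM's Vmid and reload its configuration (KB 1026043).
    vim-cmd vmsvc/getallvms | grep SERVERNAME
    vim-cmd vmsvc/reload <Vmid>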

    That said, I'm not really sure whether renaming will help. I rather think that there's another issue with the hosts if they stop responding during backups.

    André



  • 3.  RE: Disk consolidation needed: How to fix it manually?

    Posted Aug 31, 2020 04:34 PM

    Hi André

    Thank you for your feedback.

    I checked the VMDK files, and none have a value set for parentFileNameHint. As you mentioned, the server is actually running fine right now, so I'm at least sure that it's not a snapshot file of another disk.

    Maybe someone messed something up when creating the machine (from what I saw, this machine is a clone of another machine, so it's possible that someone copied files around and named them incorrectly).

    I will run the hexdump command tomorrow in the office, just to make sure that the file is indeed a base disk. If all else fails, I will probably have to create a full copy of the machine directory, then attempt to consolidate / add a snapshot, and if that results in a corrupt machine, just move the directory back into place.

    The only problem with that approach is time. I will probably need around 10 hours for that copy job to finish (the storage system in the old data center is terrible).

    Thanks

    ahatius



  • 4.  RE: Disk consolidation needed: How to fix it manually?

    Posted Aug 31, 2020 04:54 PM

    What may be an option - if other things don't work - is to create a manual snapshot, and copy the base virtual disks up front. Once they are copied over, shut down the VM, and copy the remaining files (configuration files, snapshot files).

    André



  • 5.  RE: Disk consolidation needed: How to fix it manually?

    Posted Aug 31, 2020 06:20 PM

    If you have enough storage, one easy fix is to clone the virtual machine. Then delete the old one.



  • 6.  RE: Disk consolidation needed: How to fix it manually?

    Posted Aug 31, 2020 07:29 PM

    Hello Ahatius,

    Welcome to Communities.

    Can you attach (or PM, if you don't want it public for whatever reason) the .vmsd (VM Snapshot Dictionary) file associated with this VM? (it should be found in the VM's folder, if present)

    This can have references to snapshots that either no longer exist or are no longer referenced, which would explain the VM stating that it needs snapshot consolidation.

    In addition to each vmdk descriptor not stating a 'parentFileNameHint', do they also have the 'parentCID' set to the equivalent of none?

    e.g. I have no parent, I am the base-disk - parentCID=ffffffff
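
    A quick way to check all of this at once, from the VM's directory on the ESXi shell (a sketch; run it against the small descriptor .vmdk files only, not the -flat/-ctk extents):

    # A base disk shows parentCID=ffffffff and no parentFileNameHint line.
    grep -E 'parentCID|parentFileNameHint' SERVERNAME.vmdk SERVERNAME_1.vmdk SERVERNAME_2.vmdk SERVERNAME_2-000001.vmdk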

    "During the VEEAM backup job the legacy server stopped responding."

    Did you have quiescing enabled, and/or might the VM have been busy with some job that culminated in the VM becoming non-responsive and attempting to fail over? (which IIRC requires the HA settings for VM Monitoring to be configured)

    Other than that though, the vmdk that is named like a snapshot is either the result of someone messing with names, or of directly attaching a snapshot instead of a base disk. The latter *in theory* works without issues if the VM had been running on the snapshot long enough, and with the right input to change/write every bit of data on the disk - in that case the snapshot actually holds the current state of the data, and the base disk's original data would just be discarded/overwritten on consolidation.

    Bob



  • 7.  RE: Disk consolidation needed: How to fix it manually?

    Posted Sep 01, 2020 12:23 PM

    Hi all

    Thank you all for your replies.

    "What may be an option - if other things don't work - is to create a manual snapshot, and copy the base virtual disks up front. Once they are copied over, shut down the VM, and copy the remaining files (configuration files, snapshot files)."

    As of now I can't try that, because I don't know if creating another snapshot will cause corruption again. My current plan is to shut down the machine this weekend, copy the machine directory on the file system, and then attempt to create a snapshot and see what happens.

    "If you have enough storage, one easy fix is to clone the virtual machine. Then delete the old one."

    As cloning also creates a snapshot, I can't do that right now.

    "Can you attach (or PM, if you don't want it public for whatever reason) the .vmsd (VM Snapshot Dictionary) file associated with this VM? (it should be found in the VM's folder, if present)"

    Doesn't really contain anything interesting. I compared the vmsd file before and after our cleanup; they both look the same (I assume there is only one vmsd file per machine, not one per disk?):

    .encoding = "UTF-8"
    snapshot.lastUID = "1011"
    snapshot.needConsolidate = "TRUE"

    The only interesting thing is that the VMSD file seems to contain a flag about the consolidation state.

    "In addition to each vmdk descriptor not stating a 'parentFileNameHint', do they also have the 'parentCID' set to the equivalent of none? e.g. I have no parent, I am the base-disk - parentCID=ffffffff"

    I checked all 4 VMDK files, they look like this:

    parentCID=ffffffff

    "Did you have quiescing enabled, and/or might the VM have been busy with some job that culminated in the VM becoming non-responsive and attempting to fail over? (which IIRC requires the HA settings for VM Monitoring to be configured)"

    I wouldn't know; I'd have to check the VEEAM server to see if it enables quiescing when creating a snapshot. I didn't create the snapshot myself, the VEEAM server did that when it started the backup job. UPDATE: The job did not have quiescence enabled. At the time of the backup there wasn't any notable load, as it was the weekend.

    "Usually you would run hexdump -C -n 4 SERVERNAME_2-000001-flat.vmdk to see what the first 4 bytes look like, and find out what it is (e.g. "COWD" from right to left for ESXi delta snapshots)."

    I tried running the hexdump command on the running machine's disk files, but that doesn't seem to work as long as the machine is powered on:

    hexdump: SERVERNAME_2-000001-flat.vmdk: Device or resource busy

    I think I will have to try that again when the machine is powered down. Since we cleaned up the snapshots manually, is it possible to run a check on vCenter that would verify the consolidation state without actually modifying anything? Maybe vCenter / ESXi thinks the VM still needs to be consolidated because the snapshot delta files were removed manually on the file system and the VMX file was edited by hand to point to the correct disk files.

    Again, thank you all very much for your feedback.

    Cheers,

    ahatius



  • 8.  RE: Disk consolidation needed: How to fix it manually?

    Posted Sep 01, 2020 06:36 PM

    Hello Ahatius,

    "As of now I can't try that because I don't know if creating another snapshot will cause another corruption again."

    Can you elaborate on what you mean by "corruption" - e.g. was the VM unusable and required some form of file-system check or restore from backup to become functional or do you just mean the VM crashed?

    Are these vmdks by any chance shared with another VM? (and/or have some other configuration which would make them unsuitable/unsupported for snapshot-based backups)

    "As cloning also creates a snapshot, I can't do that right now."

    Not necessarily - if it is tolerable for the business for the VM to be shut down long enough, you can clone the data to new files (e.g. using vmkfstools -i). This would have the added benefit of you having a copy that you can test whatever you want with, while retaining the other until you are sure what the plan is - since you are moving this anyway, you could just clone it to the new data center and leave it at that (is the storage at the new location accessible from the current infrastructure to allow this, and/or can it be attached?).
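
    For reference, a minimal sketch of such a clone from the ESXi shell (the target datastore path and the thin format are placeholders/assumptions):

    # With the VM powered off, clone each base disk; vmkfstools writes a
    # new descriptor + extent pair rather than blindly copying bytes.
    vmkfstools -i SERVERNAME.vmdk /vmfs/volumes/NEW_DATASTORE/SERVERNAME/SERVERNAME.vmdk -d thin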

    "Doesn't really contain anything interesting."

    I wouldn't say that - it states that the VM needs consolidation, and it is likely the source of the alert in vCenter. We have already confirmed that this VM isn't actually running on snapshots, so I don't think it is just picking up on the naming convention of that one 000001-flat.vmdk (but I may test this to confirm).

    If you delete (or move to a temporary folder) the .vmsd file, the consolidation message may go away (a new .vmsd is created automatically the next time the VM is snapshotted).
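
    A sketch of that, including a read-only check of the flag afterwards (the Vmid is a placeholder, and I'm assuming the get.summary output on 5.5 includes the runtime consolidationNeeded field):

    # With the VM powered off, park the .vmsd rather than deleting it.
    mv SERVERNAME.vmsd /tmp/SERVERNAME.vmsd.bak

    # Reload the VM's configuration, then query the consolidation flag.
    vim-cmd vmsvc/getallvms | grep SERVERNAME
    vim-cmd vmsvc/reload <Vmid>
    vim-cmd vmsvc/get.summary <Vmid> | grep -i consolidationNeeded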

    "UPDATE: The job did not have quiescence enabled. At time of the backup, there wasn't any mentionable load as it was the weekend"

    Anything else about this VM (e.g. vm-tools version, VM hardware version, OS, multiple VSS writers configured) or the versions of Veeam/ESXi (e.g. an incompatible combination) that might cause issues during backups?

    "hexdump: SERERNAME_2-000001-flat.vmdk: Device or resource busy"

    Were you attempting this from the ESXi host that the VM is registered and running on?

    "Since we cleaned up the snapshots manually"

    Can you elaborate on this? Did you just revert to base (e.g. skip them) and delete them from the VM folder? This could also be the cause of the .vmsd stating what it does.

    Bob



  • 9.  RE: Disk consolidation needed: How to fix it manually?

    Posted Sep 01, 2020 08:13 PM

    Hi Bob

    Thanks for the detailed response, let me answer your questions in detail.

    "Can you elaborate on what you mean by "corruption" - e.g. was the VM unusable and required some form of file-system check or restore from backup to become functional or do you just mean the VM crashed?"

    I assume that it was a corrupted disk. I am still not sure what exactly happened. The system did not respond anymore because it was offline. From what I could see, it was offline because HA killed the machine and then started moving it to the other host. Upon boot, the machine would prompt me to enter the decryption password for one of the disks. After entering the password, the boot process would freeze, and shortly after, the machine would be offline again and a move to the other ESXi would begin.

    I am assuming that the disk became unreadable for some reason, though I have no evidence to back that up. The issue persisted even after disabling HA on that cluster completely. My gut tells me that HA only kicked in because something happened during snapshotting that led to the system being unresponsive. After that, HA just kept trying to start the machine again on another host.

    "Are these vmdks by any chance shared with another VM? (and/or have some other configuration which would make them unsuitable/unsupported for snapshot-based backups)"

    No, none of the 4 disks that the server used were shared with another VM. But now that you mention it, I just recalled that during my first VEEAM backup attempt I got a message saying that one of the disks, SERVERNAME.vmdk, was skipped because it was in independent mode (sorry I didn't mention this earlier - this happened about 4 weeks ago, so I had completely forgotten about it):

    Disk SERVERNAME.vmdk has been skipped due to an unsupported type (independent disk)

    That disk has a size of 2 TB and stores customer data. I believe the reasoning behind the independent flag was that if, for whatever reason, the machine had to be restored, that data partition would not be touched (though I'm not sure that really was the reason). Once I saw that message, I stopped the backup job - since I wanted the job to stop immediately, I used the "immediate stop" option instead of the "graceful stop" (AFAIK the graceful stop would've just skipped any other machines in the backup queue, but still completed the backup of the already running machine). According to VEEAM, any snapshots already created should have been deleted, even when the "immediate stop" option is used.

    I then edited the virtual machine, removed the independent flag on that disk, and started the backup again. This time I waited about 5 minutes before noticing that another production backup job had started running and was blocking the resources for my data center migration job. I decided to stop the job again (it hadn't even started reading data yet), waited for the other backup job to complete, and then started the job again.

    This time the job ran for about 6 minutes (and had already read 14.5 GB) before I noticed that the server being backed up wasn't responding anymore. I decided to cancel the backup job again, but this time the VEEAM job exited with the following error message in the log:

    Removing VM snapshot Details: The operation cannot be allowed at the current time because the virtual machine has a question pending: 'msg.hbacommon.corruptredo:The redo log of SERVERNAME-000001.vmdk is corrupted. If the problem persists, discard the redo log. '.

    This all happened in a time window of around 35-40 minutes. At that point I was pretty stressed out and panicking, as I feared that I had just destroyed the old production system, so I started creating a backup copy of the machine directory, just to be safe (I still have that copy, if you think there are logs in it that might be helpful).

    "This would have the added benefit of you having a copy that you can test whatever you want with, while retaining the other until you are sure what the plan is - since you are moving this anyway, you could just clone it to the new data center and leave it at that (is the storage at the new location accessible from the current infrastructure to allow this, and/or can it be attached?)."

    I can't attach the storage of the new data center directly, though I could theoretically mount the CIFS share of the VEEAM server on the ESXi directly and just copy the machine directory there. I've never actually attached another share on an ESX host using SSH, and there doesn't seem to be an /etc/fstab file - is there anything I should be aware of when attaching external shares? Any specific command I should use to copy the directory, or plain old cp (there doesn't seem to be an rsync command on ESX)?

    "If you delete (or move to a temporary folder) the .vmsd file, the consolidation message may go away (a new .vmsd is created automatically the next time the VM is snapshotted)."

    Thanks, I will keep that in mind and try it out once I've created a backup copy of the server.

    "Anything else about this VM (e.g. vm-tools version, VM hardware version, OS, multiple VSS writers configured) or the versions of Veeam/ESXi (e.g. an incompatible combination) that might cause issues during backups?"

    It's a legacy system, so it's quite old. It's running SLES 11.4.16 (2015-06-18), and VMware Tools is 10.0.0.50046 build-3000743 (according to the output of vmware-toolbox-cmd -v). The VM hardware version is 8, and the ESXi hosts themselves are 5.5.0 build-3248547. I'm not sure how I'd go about finding out whether anything in this combination would cause backup trouble. I'd rather not migrate this machine itself but instead migrate the application on it to our new application server, but the risk of total loss in that data center is getting bigger every day, so I don't have a lot of time to plan an application migration (the hardware down there has been running uninterrupted for about 2.5 years without any maintenance or monitoring).

    "Were you attempting this from the ESXi host that the VM is registered and running on?"

    I just checked, and yes, the host I ran this command on is the host that is running the virtual machine.

    "Can you elaborate on this? Did you just revert to base (e.g. skip them) and delete them from the VM folder? This could also be the cause of the .vmsd stating what it does."

    Gladly. After we created a copy of the now-corrupted machine, the first thing we tried was to run a disk consolidation, but that threw the following error:

    An error occurred while consolidating disks: 9 (Bad file descriptor)

    Next we tried to get rid of the snapshots, in the hope that the base disks were unaffected. We had a look at the directory, saw that there were snapshot files lying around (based on the 00000X-flat.vmdk suffixes), and tried moving all of them into a temporary directory (including SERVERNAME_2-000001-flat.vmdk). As this was the weekend, and our customer monitoring tool told us that no new data had been delivered in the affected time frame, we decided it would be safe to get rid of the snapshot deltas anyway.

    At this point we didn't yet realize that SERVERNAME_2-000001-flat.vmdk was actually a base disk, so we moved it away too. We then edited the VMX file and pointed the disks to the VMDK files without the snapshot suffixes. It was around this time that we noticed that something didn't add up (4 disks configured, but the numbers in the disk names only went up to 2 instead of 3, like we'd expect). That's when we checked the copied directory and saw that all disks had pointed to a snapshot disk except for SERVERNAME_2-000001-flat.vmdk - that one didn't seem to have changed (SERVERNAME-flat.vmdk, for example, now pointed to SERVERNAME-000001-flat.vmdk).

    Once we realized that there was a strangely named disk, we knew we couldn't trust the file names, and we started checking the VMDK descriptor files to identify the real snapshot disks. We then edited the VMX file to point to the base disks, moved the delta files away, and that's when everything started to work again.

    This has been quite a long answer - I hope you still find the time to read it; I really appreciate your input.

    Cheers,

    ahatius



  • 10.  RE: Disk consolidation needed: How to fix it manually?

    Posted Sep 03, 2020 09:21 AM

    Hi all

    I've thought about this, and I think the safest and most efficient way to move the machine to the new data center would be to shut down the machine and just copy the complete machine directory via SCP from an ESX host in the new data center.

    That way I could make a backup copy on the faster storage in the new data center and not waste as much time on the old storage system. I also won't have to deal with snapshotting. Once copying is done, I thought I would just register the machine on the new vCenter. Things like the MAC address should stay the same, as the application running on the machine is licensed to it.

    Are there any additional steps I should be thinking of when manually migrating the machine and registering it in the new vCenter?

    Thank you & Cheers,

    ahatius



  • 11.  RE: Disk consolidation needed: How to fix it manually?
    Best Answer

    Posted Sep 03, 2020 07:14 PM

    Hello Ahatius,

    Apologies for the late reply - while I am a VMware employee (GSS-EMEA-vSAN), I don't post on here during work hours and only *sometimes* have the available energy to think about snapshot issues (as they remind me more of my older roles than of my present one).

    Thanks for taking the time to write detailed information - there is no such thing as a post/reply being 'too long', as the more detail the better. (For some context, the .txt document I use for Communities is currently at 322844 words :smileygrin: (though only about ~70% of it is my words, as I copy some snippets from responses))

    "I assume it that it was a corrupted disk. I am still not sure what exactly happened. The system did not respond anymore because it was offline. From what I could see, it was offline because HA killed the machine, and then started moving it to the other host."

    Did you check the vmware.log of the VM from that time for possible causes (e.g. a vmdk was locked or inaccessible), and/or the vmkernel.log for anything that would explain the defunct state?

    "The issue persisted event after disabling HA on that cluster completely. "

    It could have just been in a bit of a state from HA trying to register/restart it on multiple hosts and/or the cluster fighting over ownership of it. (I really, really don't miss the older days of such things in HA - it works damn solid in modern versions, and even thinking of the old issues we used to see gives me flashbacks accompanied by Vietnam-style helicopter noises.)

    "I then edited the virtual machine and removed the independent flag on that disk, and started the backup again."

    Did you do this with the VM powered-on or off? Did you reload the vmx?

    "Removing VM snapshot Details: The operation cannot be allowed at the current time because the virtual machine has a question pending: 'msg.hbacommon.corruptredo:The redo log of SERVERNAME-000001.vmdk is corrupted. If the problem persists, discard the redo log. ' "

    While no-one that works with data (or IT in general) is particularly fond of the word "corrupted", the above just indicates that, at worst, the data in the snapshot is potentially unusable - this isn't a particularly big deal if those snapshots were created 5 minutes earlier as part of the backup process.

    "I could theoretically mount the CIFS share of the VEEAM server on the ESXi directly, and just copy the machine directory directly there"

    Nope, ESXi can't attach a CIFS share (though maybe there is some way to wangle it indirectly).

    "Any specific command I should use to copy the directory or plain old cp (there doesn't seem to be a rsync command on ESX)?"

    I would NOT ever advise using cp to copy vmdk data - it doesn't do the necessary checks to ensure the data is sane and/or bail out accordingly. If you don't have good backups of this VM, then what I would advise (assuming you don't want to try snapshot-based backups again, that you have sufficient available storage space, and that the VM can be powered off to do so) is making a copy of all the vmdk data using vmkfstools -i. This will give you a current copy to do whatever you want with (snapshot backup attempt, migrate, download from datastore) without risk.

    "Gladly. After we created a copy of the now-corrupted machine, the first thing we tried is to run disk consolidation"

    Again, I don't think it was corrupted - just that the snapshot(s) were in an unusable state (my gut feeling is that this was likely from the abrupt cancellation of the job and/or the other things done).

    "and tried moving all of them into a temporary directory (including SERVERNAME_2-000001-flat.vmdk)"

    Is there any possibility that the 'named-like-a-snapshot' vmdk got named like this by an accidental overwrite? e.g. mv newname originalfilename

    I know this is highly unlikely, as that wouldn't explain how the flat file ended up named like this as well (unless the same typo was used for both files, or the flat was renamed first and the descriptor then manually edited to point to it).

    "4 disks configured, but numbers in the disk names only went up to 2 instead of 3 like we'd expect it"

    Just a side note on this - disks are named like this automatically when creating/adding new disks to a VM. However, I would always advise caution here, and never make assumptions about which disk is which number or mounted in which order, as disks can be created/migrated with any name/numbering convention and shuffled/moved/removed. So assuming the _2 disk is always the 3rd disk added (e.g. scsi0:3 if using a single controller, or scsi3:0 if each disk is on its own controller) isn't always reliable. It is even feasible to have a VM with 20 different vmdks all with the same name, each residing in a different namespace (obviously this is daft hyperbole, but I used to often see VMs with 2-3 vmdks like this).

    "I've thought about this and I think the safest and most efficient way to move the machine to the new data center would be to shut down the machine and just copy the complete machine directory via SCP from an ESX host on the new data center."

    Again, I really wouldn't advise using cp/scp - more reliable alternatives are: download from the datastore browser to some other storage and upload to the new storage; a shared-nothing Storage vMotion, provided you have the licensing/capabilities/patience for such features; or get a disk/NAS that can be attached to the old ESXi, copy the data over, and then attach it to the new storage.

    Bob



  • 12.  RE: Disk consolidation needed: How to fix it manually?

    Posted Sep 04, 2020 12:54 PM

    Hi all

    Thanks again for the detailed replies, I'm learning quite a lot from this.

    "Did you check the vmware.log of the VM from that time for possible causes (e.g. a vmdk was locked or inaccessible), and/or the vmkernel.log for anything that would explain the defunct state?"

    I have attached the vmware.log of the machine from that time. I cannot check the vmkernel.log because everything before the 18th of August has already been purged (this all happened on August 8th).

    I cannot see anything that might have caused this. I can see that at some point it shuts down because the redo log is corrupted, but the log entries before that seem to suggest that the previously created snapshots were deleted. I cannot find an entry for the snapshot being deleted at the time of the crash, but that might be because VEEAM couldn't have issued the delete, since the redo disk was seemingly already corrupted (no consolidation possible).

    "It could have just been in a bit of a state from HA trying to register/restart it on multiple hosts and/or the cluster fighting over ownership of it. (I really, really don't miss the older days of such things in HA - it works damn solid in modern versions, and even thinking of the old issues we used to see gives me flashbacks accompanied by Vietnam-style helicopter noises.)"

    Haha, I feel you - this wasn't the first time I've had trouble with HA. Up until a few weeks ago we had another old cluster with HA enabled; boy, did that give me some sleepless nights.

    "Did you do this with the VM powered-on or off? Did you reload the vmx?"

    As the machine was unable to boot, we did this in the powered-off state. However, we did not reload the VMX file. We only updated the references to the VMDK files in the VMX file, and vCenter also showed the correct paths in the UI after editing.

    "this isn't a particularly big deal if those snapshots were created 5 minutes earlier as part of the backup process."

    Yes, that's what we decided too. We also had the benefit of an external monitoring system confirming that no new data was delivered during those 10 minutes, so we just got rid of those snapshots.

    "Nope, ESXi can't attach a CIFS share (though maybe there is some way to wangle it indirectly)."

    That's a bummer - the VEEAM server is running on Windows and is the only server with a share big enough to temporarily hold 2.5 TB of data.

    However, I am currently looking into a way to make the storage of the new data center available in the old data center and then attach it using an iSCSI connector. If I then turn off the machine and clone it to the new storage, that shouldn't create a snapshot, right? It would probably only create a snapshot if the machine was powered on.

    "I would NOT ever advise using cp to copy vmdk data"

    Yeah, that's what I thought too. I hope I can get the storage of the new data center connected, so that I can just copy it through the vCenter UI. Is the vCenter clone operation safe against small interruptions of the network connection? It's a Gigabit WAN connection, so I guess it's a realistic possibility that a short timeout might occur during the clone operation.

    "Is there any possibility that the 'named-like-a-snapshot' vmdk got named like this by an accidental overwrite?"

    The vmware.log files go back as far as November 2019, and I can see that the disk was always named like that. During that log analysis I discovered that the old data center has a VEEAM server (even running on a physical machine, not on the ESXi) - however, nobody knows the login credentials for it, so I have no way of accessing those backups (of course, none of the people who set it up work here anymore, and no password is documented). What's interesting is that the VEEAM server of the old data center seems to have stopped backing up after the crash on the 8th of August - since then, no more VEEAM snapshots have been created. Up until that point, a backup snapshot was created and removed again every night around 10 PM.

    "however, I would always advise caution here, and never make assumptions about which disk is which number or mounted in which order"

    Trust me, that won't happen again :smileywink:

    "Again, I really wouldn't advise using cp/scp"

    I hope I won't have to do that, but it really depends on that storage becoming available.

    "run hexdump -C against all the flat.vmdks - it is absolutely possible that a valid flat base disk uses misleading names and descriptor files."

    I will try that, but I'm 100% positive that this disk is not a snapshot disk.

    "next, look into the Bad File Descriptor issue. This will prevent copy or migration attempts later, so check now."

    I'm not sure that this is still an issue. As far as I can tell, that was only a problem because we didn't realize that the -000001.vmdk disk was not a snapshot but a base disk.

    Cheers,

    ahatius



  • 13.  RE: Disk consolidation needed: How to fix it manually?

    Posted Sep 05, 2020 08:20 PM

    Hi all

    Just wanted to let you know that I successfully migrated the machine now.

    I decided against attaching the storage of the new data center to the old data center, as the new data center uses VMFS6 and I suspected that ESXi 5.5 wouldn't be able to handle it (I was unable to find any evidence for that on the internet, but I just assumed it wouldn't work, since VMFS6 was introduced with ESXi 6.5).

    What I did do was copy the files with SCP (which took about 24 hours) and compare the MD5 checksums of the transferred disk files to make sure they weren't corrupted during the transfer. The checksum generation alone took multiple hours on the old storage system. Fortunately, all hashes matched after the transfer, so I was confident enough to register the machine in our new vCenter.
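
    For anyone attempting the same, a rough sketch of that copy-and-verify approach (host names and datastore paths are placeholders; this assumes SSH is enabled on both hosts and the busybox md5sum is available):

    # On the old host, with the VM powered off: record checksums.
    cd /vmfs/volumes/OLD_DATASTORE/SERVERNAME
    md5sum * > /tmp/SERVERNAME.md5

    # Copy the machine directory and the checksum file to the new host.
    scp -r /vmfs/volumes/OLD_DATASTORE/SERVERNAME root@NEW-ESXI:/vmfs/volumes/NEW_DATASTORE/
    scp /tmp/SERVERNAME.md5 root@NEW-ESXI:/tmp/

    # On the new host: verify every file against the recorded sums.
    cd /vmfs/volumes/NEW_DATASTORE/SERVERNAME
    md5sum -c /tmp/SERVERNAME.md5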

    After registering the VM in vCenter 7, it displayed the message again that a disk consolidation was needed. This time we attempted the consolidation (since we now had a working backup copy), and it went through instantly and without errors. This leads me to believe that the warning only appeared because we cleaned up the snapshot files manually but left the vmsd file the way it was.

    I am glad to have this migration finished and would like to thank you all again for your various inputs on this matter :smileyhappy: Hopefully I won't ever have to deal with old legacy systems like this again.

    Cheers,

    ahatius



  • 14.  RE: Disk consolidation needed: How to fix it manually?

    Posted Sep 04, 2020 12:47 AM

    Once you have powered down that VM, there are 2 things to do:

    1. run hexdump -C against all the flat.vmdks - it is absolutely possible that a valid flat base disk uses misleading names and descriptor files.

    Next, check the vmware.log of the last use and verify whether the 000001-flat.vmdk was loaded at the last boot.

    2. next, look into the Bad File Descriptor issue. This will prevent copy or migration attempts later, so check now.

    Now is also a good time to disable all Changed Block Tracking for this VM and get rid of the ctk-vmdks, in case they exist.
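
    A sketch of that cleanup, assuming the VM is powered off (the ctkEnabled keys and ctk file names come from the .vmx excerpt earlier in this thread; editing the .vmx by hand works just as well as sed):

    # Turn off Changed Block Tracking for every disk in the .vmx.
    sed -i 's/ctkEnabled = "TRUE"/ctkEnabled = "FALSE"/' SERVERNAME.vmx

    # Remove the stale change-tracking files.
    rm SERVERNAME-ctk.vmdk SERVERNAME_1-ctk.vmdk SERVERNAME_2-ctk.vmdk SERVERNAME_2-000001-ctk.vmdk

    # Reload the VM so the host picks up the edited configuration.
    vim-cmd vmsvc/getallvms | grep SERVERNAME
    vim-cmd vmsvc/reload <Vmid>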

    Also keep in mind that when ESXi claims a VM requires consolidation, this can mean that there really is a vmdk with too many snapshots, but it can also simply mean that the vmsd is bad and just needs to be deleted!

    A bad vmsd by itself is no reason to move or migrate a VM in the hope that that might help.