VMware vSphere

Consolidation failure

  • 1.  Consolidation failure

    Posted Jul 21, 2016 01:59 PM

    We have been trialing Dell Rapid-Recovery for ESXi backups, and occasionally experience consolidation failures.  Any advice on how to track down why, so that we can fix it?

    This server is running ESXi 5.5 build 2068190.

    Here is some info from hostd.log:

    2016-07-21T09:00:16.847Z [52E80B70 info 'Vimsvc.TaskManager' opID=hostd-854b user=root] Task Created : haTask-4-vim.VirtualMachine.consolidateDisks-344057184

    2016-07-21T09:00:16.848Z [51080B70 info 'Vmsvc.vm:/vmfs/volumes/548c3c98-2367a4b4-9aa2-0025908c25f8/domain1/domain1.vmx' opID=hostd-854b user=root] State Transition (VM_STATE_ON -> VM_STATE_CONSOLIDATE_ALL_DISKS)

    ...

    (Lots of verbose messages that do not appear to have anything to do with consolidation)

    ...

    2016-07-21T09:00:18.740Z [4F5C1B70 verbose 'Vmsvc.vm:/vmfs/volumes/548c3c98-2367a4b4-9aa2-0025908c25f8/domain1/domain1.vmx'] Consolidate Disks translated error to vim.fault.FileLocked

    2016-07-21T09:00:18.740Z [4F5C1B70 info 'Vmsvc.vm:/vmfs/volumes/548c3c98-2367a4b4-9aa2-0025908c25f8/domain1/domain1.vmx'] Consolidate Disks failed: vim.fault.FileLocked

    2016-07-21T09:00:18.740Z [4F5C1B70 verbose 'Vmsvc.vm:/vmfs/volumes/548c3c98-2367a4b4-9aa2-0025908c25f8/domain1/domain1.vmx'] Consolidate Disks message: An error occurred while consolidating disks: Failed to lock the file.

    -->

    2016-07-21T09:00:18.740Z [4F181B70 info 'Vimsvc.ha-eventmgr'] Event 7495 : Virtual machine domain1 disks consolidation failed on vsphere1 in cluster vsphere1 in ha-datacenter.

    2016-07-21T09:00:18.742Z [4F181B70 verbose 'Vmsvc.vm:/vmfs/volumes/548c3c98-2367a4b4-9aa2-0025908c25f8/domain1/domain1.vmx'] Time to gather Snapshot information ( read from disk,  build tree): 1 msecs. needConsolidate is true.

    2016-07-21T09:00:18.742Z [4F181B70 verbose 'Vmsvc.vm:/vmfs/volumes/548c3c98-2367a4b4-9aa2-0025908c25f8/domain1/domain1.vmx'] Snapshot property update: Configure will be invalidated for:

    2016-07-21T09:00:18.758Z [4F181B70 verbose 'Vmsvc.vm:/vmfs/volumes/548c3c98-2367a4b4-9aa2-0025908c25f8/domain1/domain1.vmx'] Time to gather config: 15 (msecs)

    This is the third time it has done this, each time with a different guest / vmdk.  I'm at a loss on how to proceed.  The only fix that has worked on the prior occasions was to reboot the ESXi host and do a manual consolidation.

    The vSphere Client and command-line tools all give similar errors when attempting to consolidate without a reboot, stating that they are unable to access the file because it is locked.  Even restarting the hostd daemon is not sufficient to allow consolidation to proceed -- nothing but a full host reboot.

    Any ideas how to proceed?

    Thanks in advance
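
When the failure is intermittent, it can help to pull every failed consolidation out of hostd.log and see which VMs and fault types recur. The following is a minimal sketch in Python, assuming the log lines follow the same layout as the excerpt in this post. The regular expression is derived only from those sample lines, so other ESXi builds may need adjustments, and the function name is hypothetical.

```python
import re

# Pattern based on the "Consolidate Disks failed" line quoted in the post.
# Assumption: other hostd.log lines of this kind share the same layout.
FAILED_RE = re.compile(
    r"^(?P<ts>\S+) \[\S+ \w+ 'Vmsvc\.vm:(?P<vmx>[^']+)'[^\]]*\] "
    r"Consolidate Disks failed: (?P<fault>\S+)"
)

# Three lines copied from the hostd.log excerpt in the post.
SAMPLE_LOG = """\
2016-07-21T09:00:16.848Z [51080B70 info 'Vmsvc.vm:/vmfs/volumes/548c3c98-2367a4b4-9aa2-0025908c25f8/domain1/domain1.vmx' opID=hostd-854b user=root] State Transition (VM_STATE_ON -> VM_STATE_CONSOLIDATE_ALL_DISKS)
2016-07-21T09:00:18.740Z [4F5C1B70 verbose 'Vmsvc.vm:/vmfs/volumes/548c3c98-2367a4b4-9aa2-0025908c25f8/domain1/domain1.vmx'] Consolidate Disks translated error to vim.fault.FileLocked
2016-07-21T09:00:18.740Z [4F5C1B70 info 'Vmsvc.vm:/vmfs/volumes/548c3c98-2367a4b4-9aa2-0025908c25f8/domain1/domain1.vmx'] Consolidate Disks failed: vim.fault.FileLocked
"""


def find_consolidation_failures(lines):
    """Return a (timestamp, vmx_path, fault) tuple for each failed consolidation."""
    failures = []
    for line in lines:
        match = FAILED_RE.match(line.strip())
        if match:
            failures.append(
                (match.group("ts"), match.group("vmx"), match.group("fault"))
            )
    return failures


if __name__ == "__main__":
    for ts, vmx, fault in find_consolidation_failures(SAMPLE_LOG.splitlines()):
        print(ts, vmx, fault)
```

To run it against a real log, copy hostd.log off the host and pass the open file object in place of `SAMPLE_LOG.splitlines()`.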



  • 2.  RE: Consolidation failure

    Posted Jul 21, 2016 02:17 PM


  • 3.  RE: Consolidation failure

    Posted Jul 21, 2016 02:32 PM

    firestartah:

    Thank you for the post.

    I have looked into that article.  Unfortunately, it largely does not apply.  We are not a large datacenter; our ESXi servers are stand-alone.  So, the first 3/4 of that article, which is focused on determining which vSphere server has the vmdk locked, do not apply.

    After determining which server it is, the advice basically boils down to "restart the host".  I already know to do that...

    My goal is to discover why this is happening and how to fix it so that I can trust Rapid Recovery & ESXi to always successfully consolidate after backups.

    Thanks



  • 4.  RE: Consolidation failure

    Posted Jul 21, 2016 09:53 PM

    Try restarting all management agents instead of rebooting.

    To restart all management agents on the host, run the command:
    services.sh restart

    Restarting the Management agents on an ESXi (1003490) | VMware KB

    Typical troubleshooting steps I try when this happens:

    • Try to vMotion the VM to another host
    • If the backup software uses hot-add (vRanger etc.), look at the settings of the VM doing the backups and see whether the VM with the error has one of its disks attached to the backup VM. Detach it if required and try to consolidate again.
    • Try to create a new snapshot and then delete all snapshots
    • You can try to restart your backup server in case this has somehow locked the VMDKs


  • 5.  RE: Consolidation failure

    Posted Jul 22, 2016 01:27 AM

    grba:

    Thank you for the reply.  services.sh restart is a better method than a full host reboot.

    Unfortunately, this is for a small shop.  The license level is Essentials.  There are not enough servers to have the spare capacity to do vMotion, even if the license level supported;  these guests are stuck where they are.

    I have tried creating other snapshots, and then using Delete All.  It fails without restarting the host (or all the management services, at least).

    I have verified that it is not the backup system locking the files.  It's something in ESXi, itself... although I haven't a clue how to track that one down.

    So, while knowing I don't have to restart the host every time is a positive thing, it still leaves us in the situation where simply using our backup software can leave us in a state where our guests crash as they run out of disk space.  Not good.

    The only real, permanent solution is to find out why ESXi is failing to clean up snapshots when told -- why they're in a locked state -- and fix that, so that we can move forward.

    I'm considering updating the host to the latest build of 5.5... but I hate running host updates on otherwise perfectly functional systems without clearly knowing it's the necessary fix.

    I appreciate your help.

    Thank you



  • 6.  RE: Consolidation failure

    Posted Jul 22, 2016 02:57 PM

    Hi David,

    Please also consider the size of the disk when using VADP mode for backup. If a disk is larger than 1 TB and you use LAN for VADP, the consolidation sometimes fails because of a time-out.

    To clear the locked files, we restart the management agents on the host. Please let us know if upgrading the build fixes the issue.



  • 7.  RE: Consolidation failure

    Posted Jul 25, 2016 01:31 PM

    VMBoy79:

    Thank you for your reply.

    Please also consider the size of the disk when using VADP mode for backup. If a disk is larger than 1 TB and you use LAN for VADP, the consolidation sometimes fails because of a time-out.

    Dell Rapid Recovery is a VADP solution, utilizing CBT.  Some of the vmdk's are more than 1 TB, while others are significantly less.  It fails to consolidate, randomly, on either.  Size does not appear to be an issue.

    Decreasing the number of simultaneous backups being run on a single ESXi host seems to lower the likelihood of a consolidation failure.  However, it has not out-right eliminated this from occurring; even running just a single backup during off-peak hours can sometimes result in a "stuck" snapshot.



  • 8.  RE: Consolidation failure

    Posted Jul 22, 2016 05:23 PM

    Did you see this option:

    • If the backup software uses hot-add (vRanger etc.), look at the settings of the VM doing the backups and see whether the VM with the error has one of its disks attached to the backup VM. Detach it if required and try to consolidate again.

    Hot-add can cause the vmdk to get locked and you will not be able to consolidate if another VM has the disk attached to it. Although it does not explain why a reboot resolved the problem.


    Try the above and let us know how you get on.



  • 9.  RE: Consolidation failure

    Posted Jul 25, 2016 01:34 PM

    grba:

    Thank you for your reply.

    Hot-add can cause the vmdk to get locked and you will not be able to consolidate if another VM has the disk attached to it. Although it does not explain why a reboot resolved the problem.

    Dell Rapid Recovery is a VADP backup solution.  It does not use hot-add.



  • 10.  RE: Consolidation failure

    Posted Aug 22, 2017 05:08 PM

    We have the exact same issue using Rapid Recovery and ESXi. Restarting hostd generally clears the lock, but it also takes the VMPlayer consoles offline, which then have to be restarted as well. Not a great solution. Another issue is that occasionally, after restarting hostd, our vCenter box will not reconnect to the host, and the host has to be removed and added back in.



  • 11.  RE: Consolidation failure

    Posted Nov 15, 2017 05:27 PM

    We had the very same issue.  It was by sheer accident, and out of fear that failed snapshot consolidations would keep building up in the datastore, that I paused protection in our backup solution (Rapid Recovery) to stop it from backing up our database.

    Once I paused protection on the backup server, I then went one more step and rebooted the backup server.  I then kept Rapid Recovery in a paused state overnight.

    Again, I shut down the backup out of fear that the jobs would just continue to fail once VMware attempted consolidations.  The snapshots kept building up, to the point that I was losing about 100 GB per night, and the consolidations failed with error messages indicating that the file was locked.

    Rapid Recovery apparently still had a linkage to the file, causing it to be locked when VMware attempted to consolidate.  The behavior mimics a file being locked while an application has it open.

    Once protection for the database was paused and there were no active jobs running that temporarily used datastore space, VMware ended up triggering its regular maintenance schedule for consolidation.  Several unconsolidated snapshots had built up and were holding hundreds of gigabytes of unrecovered space.

    The VMware consolidation ran on its own, but it took about 5 hours to complete.  When I came into my office the next morning, I found that all of the consolidations had run properly and we had recovered the lost space.  VMware had successfully completed the process of releasing the disk lease, removing the snapshot, and reconfiguring the virtual machine.  I attribute this to just turning off the protection overnight.  Little did I know that it was this file lock that ultimately caused the logjam and kept these jobs from consolidating.

    Why this seems to happen randomly is another question, which I hope the developers can identify.  It is unnerving to have to keep watch on the datastore when this occurs, but pausing the backup software, thus breaking the file lock and allowing VMware to run its normal consolidation process, helped us recover the lost space.



  • 12.  RE: Consolidation failure

    Posted Nov 15, 2017 05:53 PM

    This type of behavior is, unfortunately, an all-too-common problem not with ESXi or VMware's logic, but with backup vendors writing poor software and not properly implementing steps to ensure their proxies are releasing disks when they should. I have experienced this countless times with backup software vendors, and although they're quick to point fingers at VMware, the issue is actually on their side. The only vendor's product I've found to perform due diligence and clean up after itself--even when it has experienced a failure or interruption--is Veeam Backup & Replication with a feature introduced a couple versions ago called Snapshot Hunter.

    What tends to happen in these cases is the backup software requests a snapshot of a VM through their VDDK libraries which each product carries. Once the snapshot is confirmed, the software, via a proxy or directly, adds and mounts the base disk and begins to read the changed blocks via the CBT driver. If something occurs with the software where it is interrupted prematurely, the software aborts but it fails to undo what it last did or even check if that disk is still mounted. This manifests to other systems or attempts to consolidate as a lock held.

    When another system has a lock on a disk, snapshots cannot be removed. When attempts are made to do so, the metadata descriptor is deleted but not the delta files. When this occurs, a consolidation is normally needed, yet because a lock is still held even the consolidation fails. The only way forward is to identify what system holds the lock and remove it, usually by removing the virtual disk from the configuration of a proxy (in the case of hot-add mode being employed).

    Backup vendors need to feel pressure from customers to fix this broken and poor behavior of their products, because it can lead to serious issues including outages due to full datastores, degraded performance, and other unwanted effects. If vendors are loath to comply or continue to insist on finger pointing, it may be time to switch your product for a more reliable one.



  • 13.  RE: Consolidation failure
    Best Answer

    Posted Nov 15, 2017 06:13 PM

    Original poster here.  I had forgotten about this...

    We had to spend several hours across many calls with Dell/Quest tier-2 / tier-3 support.  In the end, a combination of factors seems to have done the trick for us.

    1)  We increased the time-outs in our backup software, giving it more time before it "gave up".

    2)  We decreased the number of backups allowed to run simultaneously, and then scheduled them to run in a staggered fashion so that only one VM per host should be backing up at any one time.

    3)  Updated versions of ESXi and Rapid Recovery.

    Somewhere along the way, between updating everything and fine-tuning the backup software, it seems to have smoothed out.  I believe all three of the above were necessary steps.
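
The staggering described in step 2 can be planned with a few lines of Python. This is only a sketch of the idea, not how Rapid Recovery schedules jobs: it computes start times that would then be entered into the backup software by hand. It assumes every job finishes within one fixed slot, and the host and VM names other than vsphere1 and domain1 are hypothetical.

```python
from datetime import datetime, timedelta


def stagger_backups(vms_by_host, start, slot):
    """Give each VM on a host its own start time, one slot apart.

    Assumption: every backup finishes within `slot`, so no two jobs
    on the same host overlap. VMs on different hosts may share a slot.
    """
    schedule = {}
    for host, vms in vms_by_host.items():
        for index, vm in enumerate(vms):
            schedule[vm] = start + index * slot
    return schedule


if __name__ == "__main__":
    # Hypothetical inventory: two stand-alone hosts.
    inventory = {
        "vsphere1": ["domain1", "files1"],
        "vsphere2": ["mail1"],
    }
    plan = stagger_backups(
        inventory,
        start=datetime(2016, 7, 25, 22, 0),
        slot=timedelta(hours=2),
    )
    for vm, when in sorted(plan.items(), key=lambda item: item[1]):
        print(when.strftime("%H:%M"), vm)
```

The slot length is the value to tune: it should comfortably exceed the longest backup observed on that host, including the snapshot removal that follows it.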

    Hope that helps anyone else.

    Thanks!



  • 14.  RE: Consolidation failure

    Posted Jul 25, 2016 01:36 PM

    I plan on using the ISO to upgrade to the latest version of 5.5 this weekend.

    Before doing so, I'd like to ask:  Has anyone had any trouble with this?  Especially going from an early version of 5.5 all the way to Update 3?

    Thanks



  • 15.  RE: Consolidation failure

    Posted Jul 26, 2016 04:03 PM

    David,

    We have ESXi 5.5 Update 3 in our environment, and we still see issues with disk consolidation sometimes. However, the locked-file issue occurs only about once a week. We fix it by restarting the management agents.



  • 16.  RE: Consolidation failure

    Posted Jul 27, 2016 12:44 PM

    VMBoy79,

    That is unfortunate.  Because of the amount of data some of these VMs write, and that most of them are using thick provisioning, we could easily find ourselves running out of space on our datastores;  this is one bug we cannot leave unfixed.

    If anyone has any ideas for a permanent fix, something where I'm not reacting to the problem, but a solution that will actually stop this consolidation issue from happening, I would appreciate it very very much.

    Thanks



  • 17.  RE: Consolidation failure

    Posted Jan 13, 2017 03:12 PM

    We have changed the backup strategy to work around the above issue. We have shortlisted the VMs that have more than 1 TB of hard disk; for those, we run the VADP backup for only the OS drive and run a file-level backup for all the other drives.

    This has drastically reduced the consolidation issues.



  • 18.  RE: Consolidation failure

    Posted Mar 16, 2017 09:20 PM

    I had this exact same problem today and came across this post.  I was able to resolve the issue however by performing a storage vMotion.  If you have more than one datastore with the space to hold the VM, you can perform a storage vMotion to move the VM to a different datastore, which automatically successfully consolidates your VM.  You can then storage vMotion the VM back to its original location.



  • 19.  RE: Consolidation failure

    Posted Jun 21, 2017 01:13 PM

    We had the same problem. We have NetVault as a backup tool installed on a physical server, and all attempts to consolidate the disks of a VM failed. We restarted the backup server, and the consolidation then ran successfully.



  • 20.  RE: Consolidation failure

    Posted Jun 21, 2017 06:00 PM

    Just to give my 2 cents here: we have seen this issue on many occasions, and it affects even the latest build of ESXi (6.5). One customer in particular still has this issue. Because the snapshots were not being deleted, they suffered severe performance issues in the applications running on those VMs. The customer used Commvault for backup, which uses "proxy" VMs to do the backup. In every case the lock was held by one of the backup servers.

    Did you try to restart the backup server/VM and then try to consolidate again?

    This issue seems to affect most backup solutions out there, and it does not seem to matter which method they use (hot-add or VADP).

    We are still trying to find a permanent solution for the customer I mentioned, and the backup vendor is investigating as well, but so far there is no permanent fix.



  • 21.  RE: Consolidation failure

    Posted Jun 21, 2017 08:13 PM

    Please confirm whether you see snapshot-related .ctk files (for example, vmname-000001-ctk.vmdk) remaining in the VM folder on the datastore after a successful backup or consolidation.
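
One way to run this check is to take a directory listing of the VM folder and filter it for snapshot-style names. The sketch below is in Python and assumes the usual ESXi naming, where snapshot disks carry a six-digit suffix such as -000001, optionally followed by -delta, -sesparse, or -ctk. The function name and the sample listing are hypothetical, and a VM whose own name ends in a dash and six digits would produce false positives.

```python
import re

# Snapshot-related disk files: "<name>-NNNNNN.vmdk" (descriptor) plus the
# "-delta", "-sesparse" and "-ctk" variants. Assumption: standard naming.
SNAPSHOT_FILE_RE = re.compile(
    r"-\d{6}(?:-(?P<kind>delta|sesparse|ctk))?\.vmdk$"
)


def leftover_snapshot_files(filenames):
    """Group snapshot-related .vmdk files by kind (descriptor, delta, ...)."""
    found = {}
    for name in filenames:
        match = SNAPSHOT_FILE_RE.search(name)
        if match:
            kind = match.group("kind") or "descriptor"
            found.setdefault(kind, []).append(name)
    return found


if __name__ == "__main__":
    # Hypothetical listing of a VM folder after a backup.
    listing = [
        "domain1.vmx",
        "domain1.vmdk",
        "domain1-flat.vmdk",
        "domain1-ctk.vmdk",
        "domain1-000001.vmdk",
        "domain1-000001-delta.vmdk",
        "domain1-000001-ctk.vmdk",
    ]
    for kind, names in leftover_snapshot_files(listing).items():
        print(kind, names)
```

The base disk's own change-tracking file (domain1-ctk.vmdk in the listing) is deliberately not reported, since it is expected to exist whenever CBT is enabled.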

    Best Regards,

    Deepak Koshal

    CNE|CLA|CWMA|VCP4|VCP5|CCAH



  • 22.  RE: Consolidation failure

    Posted Nov 15, 2017 05:39 PM

    You need free space for consolidation, at least twice the size of the machine.

    Try doing it with the machine turned off.