VMware vSphere

  • 1.  Semaphore timeout during snapshot removal

    Posted Nov 15, 2011 11:51 AM

    Hello All,

    I'm posting this message in the hope that someone else has run into this issue before and maybe found a solution for it. We have a service provider who provides us with VMs on a VMware platform; I believe it is ESXi 4.1. They also use Veeam Backup to back up our VMs.

    The VMs are Windows Server 2008 x64 machines (SQL Server 2008 and .NET application servers). Everything is fine with the service, except that occasionally we get a message like the one below when running distributed transactions against our SQL Server from our .NET application server.

    Message: A transport-level error has occurred when receiving results from the server. (provider: TCP Provider, error: 0 - The semaphore timeout period has expired.)

    Basically, it seems that no response comes from the SQL Server during this time, and the error is thrown. We have done some troubleshooting with the provider and found that the message occurs at the end of the SQL Server backup, apparently while the snapshot is being removed. It only occurs at that point. Our own theory is that the error is related to the pause that can occur on the SQL Server during snapshot removal.

    Has anyone else experienced issues like this and found a way to resolve them? Is there any way to reduce the pause during snapshot removal without affecting the backups?

    Thanks



  • 2.  RE: Semaphore timeout during snapshot removal

    Posted Nov 15, 2011 12:21 PM

    Hi

    I have never had that error message, but from the information you provided it looks like storage is the bottleneck. While the host commits the snapshot to the VM, there is a very high IOPS load. Add the IOPS from other VMs (I assume the server sits on a datastore shared by multiple VMs) plus your operations on SQL itself, and the underlying storage gets a huge workload.

    In general, you shouldn't notice any pause (at the OS level) during a snapshot commit.

    My suggestions:

    • move the VM to a dedicated datastore and run a test (backup + SQL operations)
    • change the backup window for that SQL server (run it outside office hours, during low workload)
    • maybe the backup is taken too rarely, so when it starts it takes too long and the snapshot file grows to a monstrous size?
    • I would check the VM configuration itself, especially the disk layout: are the log disks separate from the data disks and on faster drives?
    • do you use the PVSCSI adapter for the LOG and DATA drives?
    • do you have a VMXNET3 (preferred) or an E1000 vNIC?

    To be honest, it's quite hard to figure out what is happening there without detailed information about the underlying storage, hardware, etc.



  • 3.  RE: Semaphore timeout during snapshot removal

    Posted Nov 15, 2011 04:56 PM

    Hi Arturka

    Thanks for your post; I've answered as many of your questions as I could below. I understand it's not easy to find out what is happening without further information, and unfortunately I don't have it to hand myself. I really just wanted to know whether snapshot removals were impacting other users in similar ways. I understand that the load increases while a snapshot removal is being performed, but here it seems to be more pronounced and is causing issues on our box. I'll keep digging and see if I can find anyone else who has run into this issue.

    Thanks

    • I assume the server sits on a datastore shared by multiple VMs

    Yes, I believe this is the case (SAN).

    • change the backup window for that SQL server (run it outside office hours, during low workload)
    The backups are performed out of hours, and the machine isn't that busy during this time. However, while the snapshot is being removed, we run into the issue.
    • maybe the backup is taken too rarely, so when it starts it takes too long and the snapshot file grows to a monstrous size?
    I don't believe this to be the case; the backups are performed daily. Backup sizes should be less than 1 TB.
    • I would check the VM configuration itself, especially the disk layout: are the log disks separate from the data disks and on faster drives?
    To be honest, I don't know, but I do know that the drives are spread over a SAN storage device. We normally get good performance on the machine; only during snapshot removal do we have the issue.
    • do you use the PVSCSI adapter for the LOG and DATA drives?
    Again, I'm not sure. I don't see any reference to it in the machine, only "VMware Virtual disk SCSI Disk".
    • do you have a VMXNET3 (preferred) or an E1000 vNIC?

    Yes, we have VMXNET3.