VMware vSphere

HP EVA Storage Failure & VMware Fault Tolerance

  • 1.  HP EVA Storage Failure & VMware Fault Tolerance

    Posted Nov 11, 2010 12:06 AM

    Good Morning VMware Community,

    Our environment consists of VMware vSphere Server 4.0.0 (Build 258672) and VMware ESXi 4.0.0 (Build 261974) hosts. For storage we have two HP EVA4000 SAN storage arrays with Continuous Access licensing. Our operating system environment consists of Windows 2003 servers within a Windows domain. What we want to achieve is to move our second EVA plus a number of ESXi hosts to a different site for disaster recovery.

    In our environment we have a number of mission-critical servers we want to host on VMware. I have been experimenting with the VMware Fault Tolerance feature; it is working fine and will do the job well of keeping these servers running and online should a host fail. My problem, however, is that if we suffer a failure of the storage array, things go downhill. I have experimented with creating a software mirror on the system partition of a virtual machine, with one disk presented from the datastore located on the first storage array and the second disk in the mirror presented from the datastore on the second storage array. I then enabled VMware Fault Tolerance and the virtual machine ran fine; however, when a storage failure occurred, the virtual machine would hang and stop running. My questions are as follows:

    1. Does VMware have a solution that can provide non-stop access to virtual machines during a storage failure?

    2. Is there a virtual RAID controller that I can configure as RAID1 available within VMware?

    3. Is there something I am missing or not configuring to achieve this solution?

    4. Is there a supported configuration I can construct to achieve this high availability and fault tolerance for my virtual machines that require non-stop access during a storage failure?

    5. I have read a white paper about VMware FT using Non-Shared storage. This seems to be exactly what I am looking for. If anyone has any more information on this it would be most appreciated.

    Thank you everyone for your assistance with this.

    John



  • 2.  RE: HP EVA Storage Failure & VMware Fault Tolerance

    Posted Nov 11, 2010 02:15 AM

    Hi John,

    What you are asking for would need to be done at the storage level and is not possible with the EVA4000s.

    Basically you would require a replicated datastore that is R/W at both the source and target array and presented as a path to the ESX server. The target array would have to replicate its cache with the source array, as all writes go through cache and both arrays would need to know about them. Add to this that replication would need to go both ways.

    To answer some of your other questions, there is no virtual RAID controller in VMware. One of the solutions VMware provides for array/site failure is Site Recovery Manager, but this does require an outage. We actually use this to keep the business going while doing site maintenance at our primary DC. It takes about an hour to fail over, most of which is shutting down the virtual machines at the primary site.

    For your information, the reason your software RAID failed is that if you pull the storage that hosts the virtual machine configuration file, the virtual machine stops existing.

    I'm not bagging FT (Fault Tolerance), but be aware there are some limitations around its use, e.g. it is limited to 1 vCPU, it still points at the same vmdk, etc.

    If you are concerned about array failure then I would look at replicating your datastore and potentially having a clone of this at the target site. This will mean that if you have a failure of the source datastore (read: somebody formats it) then you will still have a copy on the destination array, albeit a time-lagged copy.

    May I ask a question: why this level of availability? The EVA4000s are not that unreliable as far as I'm aware, and VMware HA is pretty quick to restart virtual machines. Because they boot so fast, an outage is normally less than a minute.

    Kind regards,

    Glen



  • 3.  RE: HP EVA Storage Failure & VMware Fault Tolerance

    Posted Nov 11, 2010 03:05 AM

    Good Afternoon Glen & Group,

    Thank you very much for getting back to me on this one. You are quite correct, the EVAs are a pretty solid unit. There is a reason behind my madness in wanting this level of availability, which I will explain. What I have in my environment is a number of physical Microsoft clusters attached to these EVAs. Currently we have everything located in the one datacenter (not ideal). Following a constant push to have a second datacenter, approval was given and finally some sort of site-level protection will be available. I now plan to relocate one of the EVAs to the other datacenter and split the clusters up, with one node in one datacenter and the other node in the other datacenter. For EVA storage failover I wish to implement the HP EVA Cluster Extension solution. The HP EVA Cluster Extension solution requires a Majority Node Set (MNS) quorum to be configured, and this requires a member of the MNS cluster to be located in a third site, which will act as the cluster arbitrator. The problem is I have no access to a third site. My intention was to create a virtual server, enable VMware Fault Tolerance on this virtual machine and join it to the cluster purely as the arbitrator of the MNS cluster. I figured that the availability of an FT virtual machine would suit this configuration, acting as a 'virtual' third site. My problem is still the storage. I was reading a paper from VMware (http://www.vmware.com/files/pdf/partners/academic/fttech.pdf) and on page 19 they started talking about FT in a Non-Shared storage configuration, which got me rather excited. I have yet to find any more details on this. If anyone has any advice on a way forward for me it would be most appreciated.

    Cheers,

    John



  • 4.  RE: HP EVA Storage Failure & VMware Fault Tolerance

    Posted Nov 11, 2010 04:01 AM

    Hi John,

    Looking at HP EVA Cluster Extension it appears to do the same as VMware Site Recovery Manager but for application services, i.e. SQL server. This means there will be an outage for the services hosted on this cluster as HP EVA Cluster Extension R/W enables the remotely mirrored storage and then brings resources online.

    As you are probably aware, the reason for the MNS quorum is for the GEO-cluster to determine that a site has actually gone down. If you put the witness in the secondary site (the same site as your DR) then you could be creating a situation where the secondary site gets disconnected and, because two nodes are there, they determine (by majority) that the primary is disconnected and initiate the failover.

    Remember that with an MNS quorum none of the nodes needs to see the others' quorum except through the network in order to do updates. If you did manage to create an FT virtual machine with storage in each location, then again, if one site gets disconnected how do the nodes determine which site is actually down, i.e. 2 against 2!!
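
    The majority arithmetic behind both cases can be sketched in a few lines of Python. This is only an illustration of "strict majority wins", not how MSCS implements MNS; the function name and the vote counts are assumptions made up for the example, and "2 against 2" is read here as an even split of four votes.

```python
def has_quorum(votes_visible: int, votes_total: int) -> bool:
    """A partition keeps quorum only with a strict majority of all votes."""
    return votes_visible > votes_total // 2


# Hypothetical layout: one node at the primary site, one node plus the
# witness at the secondary site, so three votes in total.
TOTAL_VOTES = 3
primary_side = 1    # the primary node on its own after the link drops
secondary_side = 2  # the secondary node plus the witness

primary_keeps_quorum = has_quorum(primary_side, TOTAL_VOTES)
secondary_keeps_quorum = has_quorum(secondary_side, TOTAL_VOTES)

# "2 against 2": four votes split evenly, so neither side has a majority.
tie_keeps_quorum = has_quorum(2, 4)

print(primary_keeps_quorum, secondary_keeps_quorum, tie_keeps_quorum)
```

    With the witness co-located at the secondary site, that site wins the vote whenever the link drops, regardless of which site is really isolated from users.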

    We actually have a SQL cluster on physical servers with 2 nodes at the primary site, replicated storage excluding quorum, and another node at the secondary site. In a DR we failover the storage and bring the cluster online. This is very fast and works well. Obviously the HP EVA Cluster Extension automates a lot of this but it really is aimed at a third site. You could rent virtual space for relatively little cost that could serve as this witness node.

    Kind regards,

    Glen



  • 5.  RE: HP EVA Storage Failure & VMware Fault Tolerance

    Posted Nov 11, 2010 05:10 AM

    G'Day Everyone,

    Thanks for everyone's input. It looks like I am kind of stuck here. I have tried hard to get this third site, however I just don't see it happening. I cannot even rent external virtual space because this network has no connectivity to the internet. You wouldn't believe the effort I had to go through just to get a second site. I may have to bite the bullet and get some other form of storage (e.g. Lefthand SAN) just for these witness nodes. When I saw that configuration of the Non-Shared FT in that paper I got pretty excited. After talking with VMware, it appears that this paper came from the testing team and that is where it ends. I'll continue to look for other alternatives. I'm sure this can be done somehow. Glen, you are correct, with EVA Cluster Extension there will be an outage, which is acceptable for us. I was just seeing whether I could have Non-Shared storage for FT, as this would have been the answer for me. I asked the engineer from VMware today about this concept called 'Long Distance FT' and he couldn't help me. Does anyone have any more information on this other than what is described in that paper I posted the link to previously? Any input is very much appreciated and thank you to all involved.

    Cheers,

    John



  • 6.  RE: HP EVA Storage Failure & VMware Fault Tolerance

    Posted Nov 11, 2010 05:19 AM

    Hi,

    I'm going to extend my last post by asking what Microsoft Clusters you're referring to.

    We've long replaced MSCS with Exchange Database Availability Groups, SQL mirrors and DFS file shares, and with those the need for replicated storage for the majority of MSCS services goes out the window.



  • 7.  RE: HP EVA Storage Failure & VMware Fault Tolerance

    Posted Nov 11, 2010 06:03 AM

    G'Day Josh,

    We use MSCS for Oracle, File, SQL, IIS and a few custom-written applications. While we could phase out the requirement for MSCS, it is the organisation's policy and standard for delivering highly available applications. I would love to start phasing MSCS out, however just getting through the red tape would be a minefield. At the moment I think it is better to follow the rules rather than bypass them. Unfortunately that side of the decision making is out of my control. So, for the moment (and probably for a few more years), MSCS is here to stay for the delivery of these applications.

    Cheers,

    John



  • 8.  RE: HP EVA Storage Failure & VMware Fault Tolerance

    Posted Nov 11, 2010 06:10 AM

    G'Day Josh,

    through the red tape would be a mine field.

    Hi,

    I unfortunately know too well what you mean by that. It's surprising to see Oracle on MSCS though - Data Guard and RAC have been well and truly the standard there for years.

    As for your current situation, the technical option is Lefthand VSA - fire up a VM to operate network RAID1 across the two EVAs, and mount the result as your shared storage.

    In reality though, it's not likely to scale too well to the size of the deployment you're referring to.



  • 9.  RE: HP EVA Storage Failure & VMware Fault Tolerance

    Posted Nov 11, 2010 09:15 AM

    Hi John,

    Sorry if I'm missing something here, but why do you need another storage array (i.e. Lefthand SAN) for this witness node? The witness node will run quite happily with local storage, so it could actually be a virtual machine on the EVA. Please don't think I'm recommending this, because the whole point of the witness node is to see the other nodes from a different location, and since you are trying to provide availability in the event of a storage array failure it makes no sense to put it on the storage array - but you could do it.

    Kind regards,

    Glen

    PS: feel your pain in regards to red tape. My last job was for a place that lived for red tape!!!



  • 10.  RE: HP EVA Storage Failure & VMware Fault Tolerance

    Posted Nov 11, 2010 09:34 PM

    Good Morning Glen,

    I am considering two virtual storage arrays (i.e. Lefthand SAN), one in each site, running from captive storage. Setting up a RAID1 between the two virtual arrays and presenting this to a virtual machine with FT will give me a 'virtual' 3rd site, where the majority will be the witness node (which is now protected) plus one of the nodes of the physical clusters.

    Cheers,

    John



  • 11.  RE: HP EVA Storage Failure & VMware Fault Tolerance

    Posted Nov 12, 2010 01:05 AM

    Hi John,

    But your virtual 3rd site is still dependent on both sites, i.e. it is not actually a witness.

    Let's assume you have a three-node cluster (node_A, node_B and node_C) with node_C being the witness node. You have two sites (site_A and site_B) with node_C spanning both.

    Let's assume all resources (active nodes) are running from site_A. If site_B gets disconnected from site_A then both node_A and node_B will look to the witness node (node_C) to confirm what has happened. In your scenario (assuming you could get node_C to run FT across sites/arrays), node_A will be able to see node_C and therefore assume node_B is gone. On site_B, node_B will also see node_C and therefore assume node_A is offline, taking the resources and bringing them online. Now resources have failed over to the disconnected site and people have lost access until it comes online again... which may not happen, as the cluster could be broken by now.

    Now this scenario was for a site failure. In the event of a storage failure you would be okay - well, not okay, but the cluster would behave as expected.

    A couple of other things to think about - do you have multiple 1GB connections between sites, i.e. one for data replication and one for fault tolerance? While the Lefthand SAN allows for replication, is this R/W in both directions for a LUN?
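
    The site-partition case can be written out as a toy model in Python. The node names follow the scenario above; everything else (the functions, the "who can still reach the witness" set) is an assumption made purely for illustration and is not how the cluster service actually decides ownership.

```python
TOTAL_VOTES = 3  # node_A, node_B and the witness node_C


def claims_quorum(node: str, reaches_witness: set[str]) -> bool:
    """During a site partition a node sees itself plus, possibly, the witness."""
    votes = 1 + (1 if node in reaches_witness else 0)
    return votes > TOTAL_VOTES // 2


def owners_after_partition(reaches_witness: set[str]) -> list[str]:
    """Return the nodes that believe they may run the cluster resources."""
    return [n for n in ("node_A", "node_B") if claims_quorum(n, reaches_witness)]


# Witness spanning both sites: each node still sees "its" copy of node_C.
spanning_witness = owners_after_partition({"node_A", "node_B"})

# Witness at a genuine third site that only site_A can still reach.
third_site_witness = owners_after_partition({"node_A"})

print(spanning_witness)    # ['node_A', 'node_B']
print(third_site_witness)  # ['node_A']
```

    With the spanning witness both nodes count two votes out of three and both claim the resources, which is the split described above; with an independent witness only one side can reach a majority.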

    Sorry not trying to be a brick wall, just looking at all angles.

    Kind regards,

    Glen



  • 12.  RE: HP EVA Storage Failure & VMware Fault Tolerance

    Posted Nov 11, 2010 03:10 AM

    Hi John,

    What you are asking for would need to be done at the storage level and is not possible with the EVA4000s.

    Storage level mirroring is exactly what a Continuous Access license does.

    That said, it's worth looking at the services involved. I've seen people seeking to go to these lengths with HP Continuous Access and Site Recovery Manager before, only to have me point out that an Exchange server has built-in mirroring and can fail itself over faster and at less cost.



  • 13.  RE: HP EVA Storage Failure & VMware Fault Tolerance

    Posted Nov 11, 2010 03:38 AM

    Storage level mirroring is exactly what a Continuous Access license does.

    Yes, but if you read what John is requiring, CA does not do this, and that is what my statement refers to. While the EVA replication does mirror the LUN to another array, it does not allow you to write to that mirror synchronously from the same host.

    Hope this has clarified my statement.

    Thanks,

    Glen