vSphere Storage Appliance

  • 1.  VMware, JBOD and Big Data

    Posted Dec 20, 2014 03:01 PM

    I see a lot of recommendations about using JBOD for big data applications like Hadoop.  The reasoning I hear around it is "you don't need RAID 5 because:

    1. "High availability is provided by the application so we don't need the protection of a RAID 5,6 or 10 array"

    and

    2.  "Performance is better with JBOD on SATA"

    3.  "cost is cheaper"

    So far these points have not made sense to me, because:

    1.  For point one: yes, HA is provided by the application.  Hadoop, for example, can tolerate the loss of a single VM with no problem.  But I don't see why I should make Hadoop do that if I don't need to.  Just because high availability is provided at the application layer, why do I have to go and take it away at the storage layer?  If I can lose a single RAID 5 hard drive without losing a VM, why give that up and use JBOD just because the application has its own HA?

    2.  For point two: it seems to me you can get the performance you want with RAID 5 if the rest of the storage design allows for it.  For example, I can add more cache to the array, or I can use SSD instead of SAS/SATA.  And yet the solution I keep seeing proposed is JBOD on SATA.  Why does performance force me onto JBOD on SATA?

    3.  For point three: if cost is the constraint, then you obviously have to design around it.  But it seems like I need to actually calculate and compare the costs before I accept that cost is a constraint; I've sketched the kind of comparison I mean right after this list.  Has everyone in the industry concluded that big data like Hadoop on EMC is simply too expensive?  Has anyone found that the cost is NOT prohibitive for a normal Fibre Channel EMC array, for example?
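
    Here is the kind of back-of-the-envelope calculation I mean.  Every number in it (disk count, disk size, and especially the $/TB figures) is an assumption I made up purely for illustration; the point is only how HDFS's default 3x replication stacks on top of RAID 5 parity:

    ```python
    # Back-of-the-envelope: cost per usable TB for a RAID 5 array vs. plain
    # JBOD, with HDFS 3x replication layered on top in both cases.
    # Every number below is an illustrative assumption, not a real quote.

    HDFS_REPLICATION = 3            # HDFS default: three copies of each block

    def usable_tb(disks, disk_tb, raid5=False):
        """Capacity left for Hadoop data after RAID parity and HDFS copies."""
        raw = disks * disk_tb
        if raid5:
            raw *= (disks - 1) / disks   # RAID 5 loses one disk's worth to parity
        return raw / HDFS_REPLICATION

    DISKS, DISK_TB = 12, 4               # assumed node: 12 x 4 TB spindles
    SATA_PER_TB = 40.0                   # assumed $/TB, commodity SATA on an HBA
    ARRAY_PER_TB = 250.0                 # assumed $/TB, FC array capacity

    jbod = usable_tb(DISKS, DISK_TB)
    raid5 = usable_tb(DISKS, DISK_TB, raid5=True)
    raw_tb = DISKS * DISK_TB

    print(f"JBOD:   {jbod:.1f} TB usable, ~${raw_tb * SATA_PER_TB / jbod:.0f}/usable TB")
    print(f"RAID 5: {raid5:.1f} TB usable, ~${raw_tb * ARRAY_PER_TB / raid5:.0f}/usable TB")
    ```

    With these made-up prices the array comes out several times more expensive per usable terabyte, which is presumably the argument being made; real quotes could of course change the picture.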

    Please forgive my ignorance and correct my erroneous thinking....



  • 2.  RE: VMware, JBOD and Big Data

    Posted Jan 05, 2015 11:29 PM

    I don't really have any experience with the Big Data Extensions, but in general I think what they mean is:

    1 - You don't need to protect the data on disk with RAID, since the data is already protected at the application layer; HDFS keeps three copies of every block by default, spread across nodes.  The statement only covers the Hadoop data, not the VM itself.  Read that way, it makes sense, at least to me.

    2 - If the application is meant to manage the data on disk itself, like a software RAID solution (a long-shot comparison would be ZFS, which wants direct access to the disks), then performance might be "better" without a RAID card; at the very least you would be removing overhead.  There's a rough model of what I mean just after this list.

    3 - If the two points above hold, then I think it would be hard for, say, EMC to compete on price.
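
    Here's the rough model behind point 2.  Every number in it (the per-disk throughput and the seek penalty) is a guess for illustration, not a benchmark; it just shows why many concurrent streaming readers, which is what Hadoop map tasks are, can favor independent spindles over one shared stripe set:

    ```python
    # Toy model: N concurrent sequential readers (think Hadoop map tasks)
    # over 12 disks, as JBOD vs. one RAID stripe set.  All numbers are
    # illustrative guesses, not benchmarks.

    DISKS = 12
    SEQ_MBPS = 150.0      # assumed sequential throughput of one SATA disk
    SEEK_PENALTY = 0.4    # assumed cost of interleaving extra streams on a stripe

    def jbod_mbps(streams):
        # Each reader owns one disk, so streams don't disturb each other.
        return min(streams, DISKS) * SEQ_MBPS

    def raid_stripe_mbps(streams):
        # Every reader is striped across all disks, so concurrent streams
        # force each disk to seek between them, eroding sequential speed.
        per_disk = SEQ_MBPS / (1 + SEEK_PENALTY * (streams - 1))
        return DISKS * per_disk

    for n in (1, 4, 12):
        print(f"{n:2d} streams: JBOD {jbod_mbps(n):6.0f} MB/s, "
              f"RAID stripe {raid_stripe_mbps(n):6.0f} MB/s")
    ```

    Note the model also shows the flip side: for a single big stream, striping across all the spindles wins easily.  The JBOD argument only kicks in once there are many independent readers.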

    I might be all wrong on this, just my thoughts.



  • 3.  RE: VMware, JBOD and Big Data

    Posted Jan 10, 2015 12:34 AM

    OK, thanks for the input.  The issue here seems to be that JBOD forces fundamental changes to the design of the VMware compute cluster's storage.  Where before I had a consistent design with all VMFS datastores, I'm now throwing in datastores used by only a single VMDK, which belongs to the Hadoop VM; no other VMDKs can live on that datastore.  If I was using Storage DRS across all my datastores before, these datastores are now exceptions.  If I could Storage vMotion from any datastore to any datastore before, I now have to keep track of these exception datastores that run one and only one VM, and can only svMotion to other JBOD-ready datastores.  Before, if I lost a single disk, my RAID group rebuilt and the VM kept running.  Now if I lose a single disk, even though the Hadoop data isn't lost, I have to restore the VM from backup to get it running again.

    Is there a cheap way to do this that still keeps the design consistent, say by using EMC storage pools to define a tier of single-disk "JBOD LUNs" and then using VMware storage profiles to keep the Hadoop VMs on those datastores?  I've sketched the placement rule I'm imagining below.  There has to be some better way....
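
    This is the rule I'd want such a storage profile to enforce.  All the datastore and VM names below are made up, and a real version would pull the inventory from the vSphere API (pyVmomi or PowerCLI) rather than hard-coding it; this is just a sketch of the constraint itself:

    ```python
    # Sketch of the placement rule: a JBOD-tier datastore holds exactly one
    # (Hadoop) VM, and a Storage vMotion of such a VM may only target
    # another JBOD-tier datastore.  All names are hypothetical.

    JBOD_TIER = {"jbod-lun-01", "jbod-lun-02", "jbod-lun-03"}

    placement = {                       # datastore -> VMs currently on it
        "vmfs-gold-01": ["web01", "db01"],
        "jbod-lun-01": ["hadoop-dn-01"],
        "jbod-lun-02": ["hadoop-dn-02"],
    }

    def check_placement(placement):
        for ds, vms in placement.items():
            if ds in JBOD_TIER and len(vms) != 1:
                raise ValueError(f"{ds}: a JBOD datastore must hold exactly one VM")

    def svmotion_allowed(src, dst):
        """Allow a move only within the same tier (JBOD stays on JBOD)."""
        return (src in JBOD_TIER) == (dst in JBOD_TIER)

    check_placement(placement)                              # passes
    print(svmotion_allowed("jbod-lun-01", "jbod-lun-03"))   # True
    print(svmotion_allowed("jbod-lun-01", "vmfs-gold-01"))  # False
    ```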