
Dell PERC H730p / LSI 3108 /Invader implementations


  • 1.  Dell PERC H730p / LSI 3108 /Invader implementations

    Posted Dec 03, 2014 05:18 PM

    Hello, would anybody happen to have any guidance or a proven config for the PERC H730p/LSI 3108/Invader controller (FW 25.2.1.0037) in pass-thru with VSAN (ESXi 5.5 build 2143827)? We are seeing stability issues, in the form of PSODs and intermittent permanent disk failures, on a VSAN platform built on the above in Dell R730 chassis, with Fusion-io ioScale fronting disk groups of Seagate 10k v7 ST1200MM0007 drives.

     

    Common log events include “firmware in fault state” for the HBA and resets and aborts for the individual disks. Errors increment in the individual drive counters correlating with these events.

     

    We have tried different HBA drivers, from the inbox mr3 (0.255.03.01-2) to the latest known PERC9 driver (6.901.55.00.1 - currently evaluating), including some of the mr3/megaraid drivers in between (6.605.10.00-1, 06.803.52.00, 06.803.73.00). The fallback of RAID0 has passed tests so far, but we all know what that means.

     

    We know this configuration is not currently listed on the HCL. We do have cases currently open with VMware and Dell, and are in communication with LSI.

     

     

    Any guidance would be greatly appreciated.





  • 2.  RE: Dell PERC H730p / LSI 3108 /Invader implementations

    Posted Dec 03, 2014 05:24 PM

    Do you have SATA or SAS disks?



  • 3.  RE: Dell PERC H730p / LSI 3108 /Invader implementations

    Posted Dec 03, 2014 05:39 PM

    The Seagate 10k v7 ST1200MM0007 are 1.2TB SAS.



  • 4.  RE: Dell PERC H730p / LSI 3108 /Invader implementations

    Posted Dec 03, 2014 05:44 PM

    What write and read latency do you normally see on those disks in VSAN using pass-thru?



  • 5.  RE: Dell PERC H730p / LSI 3108 /Invader implementations

    Posted Dec 03, 2014 05:36 PM

    Sorry for the multiple threads, all. This was originally posted via the developer forum, and I received a message stating the thread was deleted. I didn't realize the posts were appearing here. This can be deleted or combined with the other 2 similar threads.



  • 6.  RE: Dell PERC H730p / LSI 3108 /Invader implementations

    Posted Dec 03, 2014 06:35 PM

    I haven't had any luck at all setting up H730p controllers in pass-through mode. Everything looks fine and seems to run well on initial setup, but it always ended in PSODs, high latency, and false permanent failures on disks. I spent several weeks tearing down and rebuilding the vSAN cluster, setting the controller to HBA mode, setting the controller to RAID mode with each drive configured as Non-RAID, etc. I finally gave up, set up each disk individually as RAID 0, and specified the SSDs in ESXi. I've been running that setup for a couple of weeks now without issue.

    I know LSI is having problems with pass-through mode even on their supported controllers, so I wouldn't be surprised if it's tied to that in some way. When they fix those issues, or the H730p, I'm going to revisit pass-through mode.



  • 7.  RE: Dell PERC H730p / LSI 3108 /Invader implementations

    Posted Dec 03, 2014 07:11 PM

    Thanks for the reply! The description of what you have tried helps validate what we're going through. While unlikely a fix or temporary workaround, have you also attempted to run pass-thru with the 6.901.55.00.1 driver or something other than the inbox mr3 driver? By default the inbox mr3 drivers will take precedence; I missed that initially.



  • 8.  RE: Dell PERC H730p / LSI 3108 /Invader implementations

    Posted Dec 03, 2014 07:21 PM

    I tried falling back on the old linux shim driver with http://www.virtuallyghetto.com/2013/11/esxi-55-introduces-new-native-device.html

    esxcli system module set --enabled=false --module=lsi_mr3

    esxcli system module set --enabled=false --module=lsi_msgpt3

    but it was too old to recognize these newer cards.

    Other than that I haven't tried any other drivers.



  • 9.  RE: Dell PERC H730p / LSI 3108 /Invader implementations

    Posted Dec 16, 2014 11:31 PM

    I was running into similar issues with the PERC H730p. For me it turned out to be the way ESXi was trying to reset the controller. When the VM that owns the controller via passthru resets abruptly, the host needs to send a reset to the device, and apparently the default 'd3d0' method puts the PERC into a bad state that cannot be recovered without a host reboot. So, in short, I told ESXi to use a different reset method. Take a peek in /etc/vmware/passthru.map, make an entry for the controller, and use the 'link' method for reset. After making the modification, go back to the shell, run 'auto-backup.sh', and reboot the host.

    Snippet from /etc/vmware/passthru.map

    # passthrough attributes for devices

    # file format: vendor-id device-id resetMethod fptShareable

    # vendor/device id: xxxx (in hex) (ffff can be used for wildchar match)

    # reset methods: flr, d3d0, link, bridge, default

    # fptShareable: true/default, false

    .

    .

    .

    # LSI Logic / Symbios Logic MegaRAID SAS-3 3108 [Invader] (rev 02)

    1000  005d  link     default
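
    A minimal shell sketch of the edit described above, for anyone scripting it across hosts. The function name add_passthru_entry and the scratch-file demo are illustrative assumptions, not part of any VMware tooling; the vendor/device IDs and the 'link' method are taken from the snippet above. It appends the entry only if no line for that vendor/device pair exists yet.

```shell
#!/bin/sh
# Hypothetical helper: append "vendor device resetMethod fptShareable"
# to a passthru.map-style file unless that vendor/device pair is already listed.
add_passthru_entry() {
    map_file=$1; vendor=$2; device=$3; method=$4
    if grep -Eq "^[[:space:]]*${vendor}[[:space:]]+${device}[[:space:]]" "$map_file" 2>/dev/null; then
        echo "entry for ${vendor} ${device} already present in ${map_file}"
        return 0
    fi
    printf '%s  %s  %s     default\n' "$vendor" "$device" "$method" >> "$map_file"
}

# Demo against a scratch file rather than the live /etc/vmware/passthru.map.
scratch=$(mktemp)
add_passthru_entry "$scratch" 1000 005d link
add_passthru_entry "$scratch" 1000 005d link   # second call changes nothing
cat "$scratch"
rm -f "$scratch"
```

    On a real host the first argument would be /etc/vmware/passthru.map, followed by auto-backup.sh and a reboot as described above.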



  • 10.  RE: Dell PERC H730p / LSI 3108 /Invader implementations

    Posted Dec 16, 2014 11:38 PM

    I'm also using the PERC in HBA mode, with the controller in RAID mode with each drive configured as Non-RAID so I can control the disks independently.



  • 11.  RE: Dell PERC H730p / LSI 3108 /Invader implementations

    Posted Dec 17, 2014 02:43 PM

    Hi

    Btw, I guess you know that VMware doesn't yet support this card for VSAN.

    /P



  • 12.  RE: Dell PERC H730p / LSI 3108 /Invader implementations

    Posted Dec 28, 2014 03:44 AM

    Any updates on this?

    Will the H730P ever be on the HCL?



  • 13.  RE: Dell PERC H730p / LSI 3108 /Invader implementations

    Posted Jan 21, 2015 03:16 AM

    LSI will be in my office Monday and I'll ask, but I wouldn't hold your breath.

    Here are a few reasons.

    1. VSAN HCL testing is a lot more rigorous now.

    2. A LOT of controllers that you can enable pass-through mode on are NOT supported by LSI in this mode. Expect firmware crashes and data loss if you try.

    Here is LSI's statement on this (SuperMicro's 2308 despite being on the HCL for pass through never should be used).

    The LSI controllers available through distribution channels which support Pass-Through (JBOD) include the following (with those in BOLD RED indicating presence on the VMware VSAN HCL).  Note that there are other “LSI” Branded controllers listed on the HCL supporting Pass-through that are not available through distribution channels, meaning they are OEM only despite the “LSI” name and should be addressed to the OEM marketing it for support related questions:

    · 9211-4i (on VSAN HCL)

    · 9207-4i4e (on VSAN HCL)

    · 9212-4i4e (on VSAN HCL)

    · 9207-8i (on VSAN HCL)

    · 9211-8i (on VSAN HCL) (I understand the Dell H200 is closely aligned with this).

    · 9200-8e

    · 9207-8e

    · 9201-16i (on VSAN HCL)

    · 9201-16e

    · 9206-16e

    Try it on anything that isn't on this list and you may expect data loss, crashing, and a desire to beg Adaptec to make a decent pass-through HBA.



  • 14.  RE: Dell PERC H730p / LSI 3108 /Invader implementations

    Posted Jan 21, 2015 06:00 PM

    Hi,

    I would like to know why you want to use pass-thru instead of raid0?

    Do you have SAS disks?

    Txs

    Ezequiel



  • 15.  RE: Dell PERC H730p / LSI 3108 /Invader implementations

    Posted Jan 21, 2015 06:16 PM

    Pass-through mode allows ESXi to communicate directly with the disks, without the controller interpreting the I/O.

    There are management benefits, such as not having to configure SSDs manually and being able to handle a drive failure with a simple drive swap. With RAID-0, by contrast, you have to tag your SSDs manually, and after a failure you may need to go into the RAID controller by hand to create a new RAID-0 set.

    Depending on your server configuration, with RAID-0 you may be able to make these changes through a DRAC, iLO, etc., or it may require a reboot to get into the controller options. You may want to tell another employee to swap the hard drive with the orange light while you're away, and not have to worry about them getting into the controller interface.

    Performance-wise there shouldn't be much, if any, difference, but the management benefits can understandably be important to some people.



  • 16.  RE: Dell PERC H730p / LSI 3108 /Invader implementations

    Posted Jan 21, 2015 06:30 PM

    Got it,

    We have both scenarios: LSI 3008 in pass-thru and 3108 in raid0.

    We are using SATA disks, so we get a queue length (QLEN) of 32 per disk, versus a QLEN of 128 on the raid0.

    For management in raid0 we use STORCLI, which lets us configure physical disks on the fly with no need to restart servers.

    Txs

    Ezequiel



  • 17.  RE: Dell PERC H730p / LSI 3108 /Invader implementations

    Posted Jan 26, 2015 07:54 PM

    Good news: the PERC H730 has been added to the HCL. Going to start testing with the new firmware.

    Release: ESXi 5.5 U2
    Driver: megaraid_perc9 version 6.901.55.00.1vmw
    Firmware version: 25.2.1.0036

    VMware Compatibility Guide: vsanio



  • 18.  RE: Dell PERC H730p / LSI 3108 /Invader implementations

    Posted Jun 23, 2015 03:10 AM

    Just a note: in my lab cluster I have four hosts with Supermicro 2308 (LSI 9207-8i) controllers on P20 firmware in IT/pass-through mode, with the scsi-mpt2sas vib 20.00.00.00.1vmw-1OEM, running 3x Seagate 1TB 7.2k SAS drives per host. I have yet to see any drops, failures, or sense errors, and the latencies are rock steady. Flash is an Intel 750 Series NVMe 400GB. Firmware P19 with driver 19.00 does not perform correctly, whereas P20 with driver 20.00 exceeds my expectations.



  • 19.  RE: Dell PERC H730p / LSI 3108 /Invader implementations

    Posted Jan 30, 2015 04:41 PM

    Drewdem,

    When you say new firmware for the PERC 730, can you  clarify? Is there a beta firmware that you are using that can be downloaded?

    Also, any luck with the passthrough?

    Thx.



  • 20.  RE: Dell PERC H730p / LSI 3108 /Invader implementations

    Posted May 13, 2015 08:03 PM

    Didn't see this when it was posted unfortunately.

    What I was referring to at the time was actually the driver linked in my post.



  • 21.  RE: Dell PERC H730p / LSI 3108 /Invader implementations

    Posted Jun 22, 2015 08:05 PM

    Anybody out there able to get h730 series raid cards working well in vSphere 6.0 under passthrough? I have a development cluster that went 20 days no issue under raid0 config, but under pass-through/HBA mode, we have hosts randomly PSODing after about 7-10 days. PSOD errors come back with "Megaraid_SAS hardware critical error returning failed". VMWare HCL recommends firmware 25.2.1.0037 and Inbox driver, but that firmware/driver combo isn't even detecting my disks.

    I've tried the following firmware & drivers below. After PSOD a restart fixes it but still obviously a problem to have systems PSOD.

    Firmware: 25.2.1.0037, 25.2.2.004

    Drivers: Inbox, megaraid_perc9 version 6.901.55.00.1vmw, megaraid-perc9 version 6.901.57.00-1OEM, lsi-mr3 version 6.606.12.00-1OEM

    I've seen at least one PSOD on each of the driver versions above except for Inbox. I can't get Inbox driver working because it doesn't even detect the drives I have plugged in on HBA mode. I may have to rebuild with raid0, just seeing if anyone else has a success story with Dell h730 series raid controllers and HBA/pass-through mode.



  • 22.  RE: Dell PERC H730p / LSI 3108 /Invader implementations

    Posted Jun 22, 2015 08:23 PM

    I'm running R730xd servers with H730P mini cards

    Controller Mode - HBA

    Firmware Version - 25.2.2-0004

    lsi-mr3 driver Version  - 6.605.08.00

    I asked VMware support about the lsi-mr3 6.606.12.00 drivers because the release notes were blank on them, and they told me to hold off on installing them but couldn't give me any details.

    I haven't had any PSOD issues.



  • 23.  RE: Dell PERC H730p / LSI 3108 /Invader implementations

    Posted Jun 23, 2015 12:04 AM

    Thanks! I'll give 6.605.08.00 a try before I rebuild everything back to raid0



  • 24.  RE: Dell PERC H730p / LSI 3108 /Invader implementations

    Posted Jun 23, 2015 12:06 AM

    Actually never mind, I found that 6.605.08.00 is the inbox driver that ships with the 6.0 release. These drivers won't detect my disks :( All the other drivers released after this one do detect my drives, so I probably need to go back to raid0.



  • 25.  RE: Dell PERC H730p / LSI 3108 /Invader implementations

    Posted Jun 23, 2015 03:07 PM

    What disk models are you running elerium?



  • 26.  RE: Dell PERC H730p / LSI 3108 /Invader implementations

    Posted Jun 23, 2015 05:51 PM

    I'm using Western Digital RE4 4TB NL-SAS (not on HCL i know), but these are the same disks Dell would have sourced to me if I went with Dell 4TB at double the price which are on HCL.

    I've already begun rebuilding my VSAN to raid0, going to take a while for data evacuation on each host.



  • 27.  RE: Dell PERC H730p / LSI 3108 /Invader implementations

    Posted Jun 25, 2015 03:36 AM

    I don't have vSphere 6.0, I have 5.5, but I went through similar instability. Check my thread:

    https://communities.vmware.com/message/2519190#2519190

    In the end I updated the firmware to 25.2.2.004, used driver

    megaraid-perc9 version 6.901.57.00-1OEM

    and, most importantly, updated the firmware on the SSD drives.




  • 28.  RE: Dell PERC H730p / LSI 3108 /Invader implementations

    Posted Jun 25, 2015 08:49 PM

    In my case, I don't think my SSD is giving me any grief. I'm using a P3700 PCIe (not from Dell) and it's been very solid, with no latency issues. From vsan.observer I usually see 2-4ms client latency, with a rare spike to 10-15ms. This is on a fairly active development cluster.

    All my VSAN drives are NL-SAS, but the scratch/syslog disk I use is a 250GB SATA drive. Based on hill0795's comment regarding HBA reliability/performance and issues with mixing SATA/SAS on this card, I think the best choice for me is to switch back to raid0 for stability and performance.

    I need to build out another VSAN cluster using Dell 13G hardware later in the year, going to see if Dell offers any true LSI HBA cards instead of H730/LSI 3108 based cards.



  • 29.  RE: Dell PERC H730p / LSI 3108 /Invader implementations

    Posted Jul 02, 2015 07:42 AM

    Is adding an HBA post-purchase a possibility? Also, if you're not seeing what you need with Dell, I have had great success with Supermicro (Twin-Pros). People commonly settle on the LSI 9207-8i (2308) in pass-through, and a queue depth of 600 is very adequate for most loads. Personally, I have yet to have a problem with my lab's 2308. The limitation of SAS2 versus SAS3 would only have a real-world effect if you switched to a bad-ass all-flash array with insane SSDs faster than 6 Gb/s.

    Cheers



  • 30.  RE: Dell PERC H730p / LSI 3108 /Invader implementations

    Posted Jul 02, 2015 02:54 PM

    I will second the 9207-8i, rock solid with good performance.  We went with PCIe cards for the SSD layer, so I can't say how it would handle that layer.  Thank you, Zach.



  • 31.  RE: Dell PERC H730p / LSI 3108 /Invader implementations

    Posted Jul 03, 2015 01:48 AM

    FYI

    Since September 25th 2014 we haven’t seen any firmware update from LSI for 9207-8i (20.00.00.00). Yesterday I checked LSI’s websites and found the same P20 firmware package listed. However this time the release date was different, May 21st 2015, but the same P20 versioning. After looking at the archive/rlsnotes the only thing to change was the firmware, 20.00.04.00 – 9207-8.bin. No release notes on changes to the firmware that I could find. So for those on other LSI cards/firmware, don't be quick to dismiss the idea of a sneaky update being available.

    I have installed it on my lab cluster (one day so far), and it's showing major magnetic disk latency improvements during component tasks. :) It seems to behave more consistently across various workloads. The "mpt2sas" 2308 LSI driver for ESXi is still at 20.00.00.00.1, which I'm running; hopefully there will be a corresponding update soon. The performance and stability really aren't there when running P19 firmware and mpt2sas v19, IMHO.

    Cheers.



  • 32.  RE: Dell PERC H730p / LSI 3108 /Invader implementations

    Posted Jul 03, 2015 02:01 AM

    Earlier this year, before the Intel 750 NVMe series, I tried a few flash configurations on the HBA. Nothing functioned :( I tried HCL'd Dell SAS 400GB SSDs, Intel DC S3700s, and non-HCL stuff like Samsung EVO 850 Pros. The drives choked, and so did the adapter. Completely unworkable at nearly every level. All in all, NVMe flash is the only way to fly.



  • 33.  RE: Dell PERC H730p / LSI 3108 /Invader implementations

    Posted Jun 23, 2015 04:29 AM

    In my experience HBA mode really sucks on any of the new-generation LSI/Avago MegaRAID cards. You will never get the kind of performance you would with a true HBA like the LSI 3008 based controllers running IT firmware. If you do use the 3108, make sure you have SAS across the board, even SAS SSDs. Performance is severely penalized when using SATA, since certain SCSI commands have to be translated to their equivalent ATA commands, and some of those turn into non-queueable I/O that can only be serviced one at a time and blocks ALL data I/O for that disk while doing so. If you're curious as to what I'm talking about, you can see it for yourself by booting your box into CentOS, running a continuous 4k random-read workload with FIO on your SATA SSD, and in another terminal issuing a single sg_sync command to the same SSD. My Intel s3700 went from a steady 75k IOPS all the way down to 500 IOPS for ~2 seconds!
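
    For anyone who wants to try that experiment, here is a dry-run sketch that only prints the two commands instead of running them. /dev/sdX is a placeholder, the fio options (queue depth, runtime, libaio engine, read-only flag) are assumptions rather than the settings used above, and sg_sync comes from the sg3_utils package.

```shell
#!/bin/sh
# Dry run: print the commands for the two terminals; nothing is executed.
fio_cmd() {
    # 4k random reads with direct I/O; --readonly guards against writes.
    echo "fio --name=randread --filename=$1 --readonly --rw=randread --bs=4k --direct=1 --ioengine=libaio --iodepth=32 --time_based --runtime=60"
}

sync_cmd() {
    # sg_sync sends a SCSI SYNCHRONIZE CACHE command to the device.
    echo "sg_sync $1"
}

dev=/dev/sdX   # placeholder device path, replace with the SATA SSD under test
echo "terminal 1: $(fio_cmd "$dev")"
echo "terminal 2: $(sync_cmd "$dev")"
```

    Watching the IOPS column of the running fio job while the second command fires should show whether the stall described above occurs on a given controller and drive.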

    The other thing to know about HBA mode on the 3108 is that enabling it takes away the controller's ability to perform certain serviceability functions, such as LED illumination, predictive SMART failure reporting, etc. One last really annoying characteristic of the 3108 in comparison to the 3008 is sequential I/O performance: any I/O larger than 128k given to the 3108 is broken down into 128k segments before it hits the drive. This means that instead of a single 1M write you end up with 8 x 128k writes to disk! Since VMFS by default uses a 1M block size for its filesystem, this will degrade performance, not to mention any application on top of it that uses large sequential I/O patterns, like Hadoop or Splunk.
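
    The arithmetic behind that 8 x 128k figure, as a small sketch. The function name io_segments is made up for illustration, and the 128k limit is the one described above for the 3108, not something measured here.

```shell
#!/bin/sh
# Number of segments an I/O is split into, given a maximum segment size.
# Both sizes are in KiB; the division rounds up.
io_segments() {
    io_kb=$1; max_kb=$2
    echo $(( (io_kb + max_kb - 1) / max_kb ))
}

io_segments 1024 128   # one 1M write with a 128k limit
io_segments 256 128    # a 256k write under the same limit
```

    The first call prints 8 and the second prints 2, so every 1M VMFS block write would reach the drive as eight separate writes under that limit.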

    Anyway, in short I would recommend you consider buying an LSI 9207-8i (2308 chipset) for 12G Dell hardware; it is a Dell-supported HBA. Unfortunately Dell doesn't yet support any LSI 3008 based HBA on their 13G line, but I'm sure they will soon. Good luck!



  • 34.  RE: Dell PERC H730p / LSI 3108 /Invader implementations

    Posted Jun 23, 2015 04:38 AM

    Couldn't agree more!

    LSI 9207-8i (2308) is the way to go. I honestly haven't been able to get SATA disks to perform correctly on anything once real IOPS make it to the scene. It is especially impactful when an abort command is issued to a SATA disk and the disk's controller resets, briefly stopping all I/O, since SATA cannot retain previously issued commands like SCSI does. However, I have yet to have a problem with the 3108 on SAS disks, but I never tried SATA since, well, you know.

    Cheers



  • 35.  RE: Dell PERC H730p / LSI 3108 /Invader implementations

    Posted Jun 25, 2015 08:19 PM

    Regarding the LSI 3108 breaking I/O down into 128k blocks, is this for HBA mode only or also for RAID mode? I'm wondering what the optimal raid0 stripe size would be now that I'm going to migrate everything back to raid0. The default is 64k, and I see the VMware KB "VMware recommended settings for RAID0 logical volumes on certain 6G LSI based RAID VSAN" lists a 256k stripe in its settings, but it's for the h710 series, which is a different chipset.

    Anyone know if I should stick with default 64k stripe, use 256k stripe (from VMWare KB) or some other setting?



  • 36.  RE: Dell PERC H730p / LSI 3108 /Invader implementations

    Posted Jul 05, 2015 06:23 AM

    Missed this -- Best practice would be to definitely stick with 256kb stripes, per the KB.



  • 37.  RE: Dell PERC H730p / LSI 3108 /Invader implementations

    Posted Jul 07, 2015 03:58 PM

    Would this be applicable to the Dell H730P controller? I would think the stripe size would be determined on a per-controller basis, and the article mentioned above was related to the 12G/H710 series.



  • 38.  RE: Dell PERC H730p / LSI 3108 /Invader implementations

    Posted Jul 07, 2015 05:02 PM

    Different HBAs will have different max stripe sizes. If memory serves, the 730P can do 512kb stripes. If your flash disks (PCIe), which handle the small r/w, are up to snuff, then the larger the stripe the better for the magnetics. 256kb is the typical maximum you would find, but 512kb would in theory be more ideal. Don't bank on that.



  • 39.  RE: Dell PERC H730p / LSI 3108 /Invader implementations

    Posted Jul 07, 2015 05:23 PM

    Any thoughts on when SSDs are used as capacity drives? ( In terms of stripe size )

    My setup is using Intel P3700 NVMe as the cache tier and Intel S3610 SSD as capacity.

    I am testing RAID-0 with a 64K stripe size now, as we have seen HBA mode be extremely unstable (with the Dell H730 P Mini Controller), similar to others' experiences in this thread.



  • 40.  RE: Dell PERC H730p / LSI 3108 /Invader implementations

    Posted Jul 07, 2015 06:24 PM

    Even in an all-flash setup, especially then, you will want the highest stripe size possible. I can only assume the component owner will also benefit from a higher stripe size. I can't see how stripe size is relevant to single-disk arrays, though, or how it relates to VSAN. I have never tried VSAN with at least two magnetics in RAID0 per disk group before. Maybe if you had 16 SAS drives, with eight logical RAID0 drives. It would be rather annoying maintenance-wise, but I have wondered about the performance of a pass-through 16-disk VSAN versus 8 logical drives (2x disks in RAID0 each).



  • 41.  RE: Dell PERC H730p / LSI 3108 /Invader implementations

    Posted Jul 29, 2015 08:51 PM

    I too am having VSAN issues and thought I would share my experiences.

    PowerEdge R730xd

    PERC H730 Mini (Embedded) Running in RAID 0

        Firmware: 25.3.0.0016

        Driver: 6.605.08.00

    Dell 400GB SATA SSD drives (Intel S3610)

    Dell 10k RPM SAS Drives as capacity drives

    vSphere 6.0

    Under heavy load (rebuilding an OLAP Cube) our SSD drives would report permanent failures on all disk groups on a host.  This was happening nightly until we stopped processing the data cube.  I have spent numerous hours with both Dell and VMware support.  We are in the process of swapping the SATA drives with SAS drives as Dell stated it was an issue with heavy write load with those drives that the PCB on the SSD drive was issuing a reset command which would then cause VSAN to list drives as permanent failure.

    VMware wants me to update the driver in ESXi on the hosts to: lsi-mr3 version 6.606.12.00-1OEM

    I asked for specific reasons why they think that will fix this issue, as well as any release notes on it, and cited this thread. I have heard nothing back on the support ticket, now going on day 2. I just don't want to cause more harm at this point. VMware hasn't been very responsive throughout this whole process. Dell hasn't been great either, but better than VMware, I must say.

    I will try and post back after I replace those SATA drives with the SAS drives.



  • 42.  RE: Dell PERC H730p / LSI 3108 /Invader implementations

    Posted Jul 29, 2015 10:57 PM

    I would update the driver since VMware's standard (non VSAN) HCL matches driver with raid firmware version.


    You can see in this link here: VMware Compatibility Guide: I/O Device Search that the corresponding driver for firmware 25.3.0.0016 matches up with driver lsi-mr3 6.606.12.00-1OEM

    I updated to the firmware/driver combo shown above and am also on R730xds and it's working well for the last 2 weeks. I am not using SATA SSDs though so that may be a whole other issue.



  • 43.  RE: Dell PERC H730p / LSI 3108 /Invader implementations

    Posted Jul 30, 2015 02:38 PM

    For the H730 Series I'm hearing that this is the pending update...

    H730 controller series with ESXi 5.5u2
    New recommended firmware version: 25.3.0.00016
    New recommended driver: megaraid_perc9 version 6.902.73.00

    H730 controller series with ESXi 6.0
    New recommended firmware version: 25.3.0.00016
    Recommended driver: continue using lsi_mr3 version 6.605.08.00-6vmw.600.0.0.2494585



  • 44.  RE: Dell PERC H730p / LSI 3108 /Invader implementations

    Posted Jul 30, 2015 02:56 PM

    Have you heard any word on the Back plane firmware? 3.03 is working stable for us.

    I will push through my avenues to try and make sure that is added. It is important to note that the backplanes need to be flashed as well.



  • 45.  RE: Dell PERC H730p / LSI 3108 /Invader implementations

    Posted Jul 30, 2015 03:10 PM

    Just confirmed we are using the following driver and firmware versions:

    H730 controller series with ESXi 6.0

    New recommended firmware version: 25.3.0.00016

    Recommended driver: continue using lsi_mr3 version 6.605.08.00-6vmw.600.0.0.2494585

    BackPlane: 3.03

    We are using Intel DC 3610 SATA SSDs as our flash tier. Just had another instance where permanent failure showed up on the SSDs and I needed to reboot the host. Dell better get those SAS drives here soon.



  • 46.  RE: Dell PERC H730p / LSI 3108 /Invader implementations

    Posted Jul 30, 2015 06:15 PM

    You might want to try changing these settings and doing a host reboot.

    VMware support had us try this first - and these settings have been applied throughout all of our testing.

    ( So it might be in addition to the firmware levels, you also need to apply these settings. )

    esxcfg-advcfg -s 40000 /LSOM/diskIoTimeout

    esxcfg-advcfg -s 5 /LSOM/diskIoRetryFactor




  • 47.  RE: Dell PERC H730p / LSI 3108 /Invader implementations

    Posted Jul 30, 2015 06:21 PM

    Interesting, there are similar disk IO timeout settings mentioned in the following VMware document for the HP P440/P440ar/H240/H240ar controllers.

    https://partnerweb.vmware.com/programs/vsan/KB_P440_H240_Controller_Advanced_Settings.pdf



  • 48.  RE: Dell PERC H730p / LSI 3108 /Invader implementations

    Posted Jul 30, 2015 08:09 PM

    Interesting.  I ran:

    esxcfg-advcfg -g /LSOM/diskIoTimeout

    Result was:

    Value of diskIoTimeout is 20000  (guessing this is a time threshold before a retry happens)

    esxcfg-advcfg -g /LSOM/diskIoRetryFactor

    Result was:

    Value of diskIoRetryFactor is 3  (guessing once this hits a value of 3 my disks report permanent failure)

    It would make sense to bump these up perhaps if this is the case but seeing as I can find no documentation on these other than that link I will probably hold off. 
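
    A small sketch for comparing the current values against the ones suggested earlier in the thread (40000 and 5). The sample lines are copied from the output quoted above; on a host the output of esxcfg-advcfg -g would be fed in instead. The helper names are hypothetical, and the script only reports differences, it changes nothing.

```shell
#!/bin/sh
# Hypothetical helper: pull the trailing number out of a line such as
# "Value of diskIoTimeout is 20000".
advcfg_value() {
    printf '%s\n' "${1##* }"
}

# Report whether a setting already matches the proposed value.
compare_setting() {
    name=$1; current=$2; proposed=$3
    if [ "$current" -eq "$proposed" ]; then
        echo "$name: already $proposed"
    else
        echo "$name: $current -> $proposed"
    fi
}

timeout_now=$(advcfg_value "Value of diskIoTimeout is 20000")
retry_now=$(advcfg_value "Value of diskIoRetryFactor is 3")
compare_setting diskIoTimeout "$timeout_now" 40000
compare_setting diskIoRetryFactor "$retry_now" 5
```

    With the defaults quoted above, this prints "diskIoTimeout: 20000 -> 40000" and "diskIoRetryFactor: 3 -> 5", which makes it easy to record what was changed on each host.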

    My drives did come so I am in the process of evacuating my data off of the disk groups before I swap the drives on a host by host basis.  The Dell rep I have been speaking with this morning also mentioned that with firmware version: v25.3.0.0016 I should be able to run disks in HBA mode.  I am currently running driver version: 6.605.08.00.  Trying to confirm with Dell if I should update driver version or not.



  • 49.  RE: Dell PERC H730p / LSI 3108 /Invader implementations

    Posted Jul 30, 2015 08:25 PM

    Do you know what SATA drives you're replacing?



  • 50.  RE: Dell PERC H730p / LSI 3108 /Invader implementations

    Posted Jul 31, 2015 12:40 AM

    SATA drives we are replacing were Intel SSD DC 3610's and they were used as flash tier drives.  New ones are Dell branded but show as SanDisk LT0400MO in iDRAC.



  • 51.  RE: Dell PERC H730p / LSI 3108 /Invader implementations

    Posted Jul 31, 2015 12:37 AM

    Got confirmation that for the H730 controller we should run driver version lsi_mr3 v6.606.12 with firmware version v25.3.0.0016. VSAN Health Check flags the driver version, but I'm guessing this is best since the HCL doesn't seem to be keeping up very well with all this. The Dell tech said he ran it in a lab just to confirm.

    Migrated one host so far with the new Sandisk SAS SSD caching drives and throughput appears to have doubled using HBA mode as well. I also disabled the caching on the card just on a hunch since the cards seem problematic. Would be interested to see what other people have that setting on.



  • 52.  RE: Dell PERC H730p / LSI 3108 /Invader implementations

    Posted Jul 31, 2015 12:48 AM

    Your hunch was correct. All caching (disk, controller, array) should be disabled in pass-through mode. I don't recall whether you are on RAID0; if you are, some people have enabled caching and seen performance benefits. However, I lean heavily toward the view that they did not have a SAS/PCIe performance tier, and the cache mitigated the effects of a single symptom. As you begin testing various things, be sure to run a vSphere Data Protection perf test, re-sync some large objects, and get some client-end benchmarks while those run. You should also be safe running the Multicast Perf test in VSAN health; VMware recently confirmed they throttle the test, probably to avoid contention. I am especially interested in your results.  Best, -Jon



  • 53.  RE: Dell PERC H730p / LSI 3108 /Invader implementations

    Posted Jul 31, 2015 07:11 PM

    Well it was a late night but finished moving the new drives onto the hosts.  This would have been a lot less risky with 4 hosts.  Certainly gonna request another one for next year.

    Here is a before and after 10min Proactive Virtual SAN Storage Performance test sample (Going from SSD SATA to higher write IO SAS Drives) running the Performance characterization - 70/30 read/write mix, realistic, optimal flash cache usage:

       

    Before:  

    VMDK Disk  Duration (sec)  IOPS  Throughput (MB/s)  Avg Latency (ms)  Max Latency (ms)
    0          600             1051  4.11               1.71              154.51
    1          600             1082  4.23               1.59              157.63
    2          600             1069  4.18               1.67              161.6
    3          600             1234  4.82               1.52              67.94
    4          600             1110  4.34               1.57              166.43
    5          600             1231  4.81               1.46              79.99
    6          600             1055  4.12               1.68              157.41
    7          600             1057  4.13               1.69              158.72
    8          600             1303  5.09               1.43              152.87
    9          0               1075  4.2                1.64              160.87

    After:

       

    VMDK Disk Number  Duration (sec)  IOPS  Throughput (MB/s)  Average Latency (ms)  Maximum Latency (ms)
    0                 600             1668  6.52               0.9                   40.82
    1                 600             1653  6.46               0.91                  40.64
    2                 600             1814  7.09               0.8                   52.16
    3                 600             1913  7.47               0.76                  37.66
    4                 600             1921  7.5                0.76                  43.25
    5                 600             1821  7.11               0.8                   51.66
    6                 600             1677  6.55               0.9                   39.64
    7                 600             1668  6.52               0.9                   45.17
    8                 600             1831  7.15               0.79                  51.99
    9                 600             1649  6.44               0.91                  43.26
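
    For anyone who wants a single number out of the two runs above, here is a minimal sketch that totals the IOPS column of each table and reports the relative gain. Only the IOPS values come from the tables; the helper name summarize and the choice to aggregate per-VMDK figures by simple sum and mean are illustrative assumptions, not part of the test tool.

```python
from statistics import mean

# IOPS per VMDK (disks 0-9), copied from the Before and After tables above.
iops_before = [1051, 1082, 1069, 1234, 1110, 1231, 1055, 1057, 1303, 1075]
iops_after = [1668, 1653, 1814, 1913, 1921, 1821, 1677, 1668, 1831, 1649]


def summarize(before, after):
    """Return total and mean IOPS for both runs plus the relative gain in percent.

    Hypothetical helper for illustration only.
    """
    total_before, total_after = sum(before), sum(after)
    return {
        "total_before": total_before,
        "total_after": total_after,
        "mean_before": mean(before),
        "mean_after": mean(after),
        "gain_pct": (total_after / total_before - 1) * 100,
    }


summary = summarize(iops_before, iops_after)
print(f"aggregate IOPS: {summary['total_before']} -> {summary['total_after']}")
print(f"mean IOPS per VMDK: {summary['mean_before']:.1f} -> {summary['mean_after']:.1f}")
print(f"relative gain: {summary['gain_pct']:.1f}%")
```

    On these samples the aggregate goes from 11267 to 17615 IOPS, a gain of roughly 56% for this particular 70/30 test profile.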


  • 54.  RE: Dell PERC H730p / LSI 3108 /Invader implementations

    Posted Jul 31, 2015 09:46 PM

    Which firmware/driver combination are you using now, and which did you use for the benchmarks? That is a serious latency improvement, sub-ms it looks like... and your max latency is magnitudes faster. Yay SAS. Was anything else going on in the VSAN while you were benchmarking? Try a benchmark while objects are being synced or policies are being changed. Keep up the good work. Thanks, -Jon



  • 55.  RE: Dell PERC H730p / LSI 3108 /Invader implementations

    Posted Jul 30, 2015 12:35 AM

    "We are in the process of swapping the SATA drives with SAS drives as Dell stated it was an issue with heavy write load with those drives that the PCB on the SSD drive was issuing a reset command which would then cause VSAN to list drives as permanent failure." -madnote

    Yes! SATA in any disk controller form is inadequate for VSAN use. SATA has no way of knowing previous commands issued, so there is zero way to cancel an active/queued cmd. The moment the system calls for a cmd to be cancelled, SATA has no way to deal with it, and in order to fulfill the request it will reset the disk. During the reset the drives are inaccessible, and crazy latencies can be seen in the event viewer. SATA really is a "legacy" controller, and has no business whatsoever interfacing with crazy low latency flash (especially in a r/w cache use case). You can never go wrong if you use PCIe NVMe for your performance tier, and in my experience it increases data storage stability many fold over SAS SSD. There is just no way to cheap out on your VSAN performance tier.

    Thanks,

    -Jon



  • 56.  RE: Dell PERC H730p / LSI 3108 /Invader implementations

    Posted Jul 30, 2015 02:36 PM

    Uhhhhhh, if SATA doesn't track commands that have been queued, then what does Tagged Command Queuing do?

    Not going to argue that NCQ and NVMe don't have deeper queues, but we use Intel S3700s in production with VSAN just fine.  The issue with the H730P is that there are firmware/driver problems that are about to be resolved (HCL update is pending).  Dell has been working on this for months, and if they told you to swap drives because of this, it's likely because you had drives that were not on the HCL (like some of the cheap LiteON drives they will sell, which IMHO are grossly inadequate for any server usage: they have terrible performance consistency and are well below the 10 DWPD that the VSAN HCL mandates).



  • 57.  RE: Dell PERC H730p / LSI 3108 /Invader implementations

    Posted Jul 30, 2015 08:35 PM

    When the command is running, there are no take-backs (as I recall). NVMe has a theoretical max of 65536 queues and 65536 cmds per queue. I have personally tested the S3700 in my VSAN extensively over a couple of months, and I can say without a shadow of a doubt they should not be on the HCL. They have the same problems as other AHCI drives, just far less frequently, and are extremely slow compared to SAS/PCIe. Moreover, they are a huge bottleneck all around: just run a Data Protection performance test and watch your infrastructure crumble. If you have a demanding environment where you need to be certain about latency and throughput, queue depth matters. My consumer Intel 750 Series NVMe drives perform many times faster than my SAS SSDs. Obviously throughput is faster, given the larger 20GB two-way PCIe x4, but where things shine is latency and queues. Re-syncing all your VMs' storage policies, running a VDA Perf test, while simultaneously running benchmarks on the client, one never sees a degradation of the user experience or >2ms latencies. In your case it would seem the hardware is meeting the needs of the tasks. IMHO  Thanks, -Jon
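
    To put the queueing figures in this post side by side, here is a rough sketch that multiplies queue count by queue depth for each interface. The NVMe numbers are the theoretical maximums quoted above; the AHCI entry assumes the single 32-command NCQ queue that SATA exposes. The names INTERFACES and max_outstanding are made up for illustration, and real devices and drivers typically expose far fewer queues than the theoretical limit.

```python
# Nominal per-device queueing limits (illustrative, not measured values).
# AHCI/NCQ: one queue of 32 commands (assumed SATA NCQ limit).
# NVMe: theoretical maximums as quoted in the post above.
INTERFACES = {
    "AHCI (SATA NCQ)": {"queues": 1, "depth": 32},
    "NVMe (theoretical max)": {"queues": 65536, "depth": 65536},
}


def max_outstanding(queues, depth):
    """Upper bound on commands that can be in flight to one device."""
    return queues * depth


for name, limits in INTERFACES.items():
    total = max_outstanding(limits["queues"], limits["depth"])
    print(f"{name}: {limits['queues']} queue(s) x {limits['depth']} = {total} commands")
```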



  • 58.  RE: Dell PERC H730p / LSI 3108 /Invader implementations

    Posted Jul 30, 2015 09:00 PM

    Not disagreeing that our needs are modest for the most part (few thousand IOPS internally) but I've seen some steady bursts without issues on faster drives.

    Why would you constantly be trying to cancel commands though?  That just seems like an odd thing to do. Generally, once the guest VM has sent a SCSI packet, if it wants it replaced it just waits on the ACK and sends another one.

    We did some benchmarks (as did others) with the drives and got pretty good numbers (20-30K with SCSI vtraces of our screwball workload).

    For midsized deployments where you're replacing a traditional spinning disk array (AMS2500/EqualLogic/EVA), the S3700s are quite good in a hybrid configuration, and "32" is often good enough.  The giant enterprise HDS array I'm working on today (G400) has a maximum LUN queue depth of 32 as it is, and if I want to send more than that to a VM then I would need to stripe across VMDKs or RDMs (although this is FC, so I do get the ability to cancel commands, I guess).  You can shove a lot of IOPS down a 32-command pathway.
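
    As a back-of-the-envelope check on the 32-command point, Little's Law says sustained throughput is bounded by outstanding commands divided by per-command latency. The sketch below applies that with a queue depth of 32; the latency values are hypothetical inputs chosen for illustration, not measurements from this thread.

```python
def max_iops(queue_depth, latency_ms):
    """Little's Law bound: IOPS = outstanding commands / latency per command."""
    if latency_ms <= 0:
        raise ValueError("latency_ms must be positive")
    return queue_depth * 1000.0 / latency_ms


# Hypothetical average service latencies, in milliseconds.
for latency_ms in (0.5, 1.0, 5.0):
    print(f"QD=32 at {latency_ms} ms -> up to {max_iops(32, latency_ms):.0f} IOPS")
```

    With the queue kept full, 32 outstanding commands at 1 ms each works out to a ceiling of 32,000 IOPS, which is why a depth of 32 only becomes the limit once the device latency is low enough to need more parallelism.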

    Now if you're in all-flash array territory, or this is replacing a modern-era hybrid array or something, you're right.  NVMe/PCI-E is likely warranted (and with the newest generation servers actually an option; a lot of our clusters are over a year old).

    I've got a friend who has an Intel 750 drive, and it was fun to point out to him that it's actually fighting for PCIe throughput with his pair of Titans.  That said, it still played games very well and made him happy, so mileage will always vary.  For the mid-market it's not about having the fastest workload, as once you hit acceptable performance, product selection becomes about TCO.



  • 59.  RE: Dell PERC H730p / LSI 3108 /Invader implementations

    Posted Jul 30, 2015 09:34 PM

    We are on the same page, no worries. In order to answer the command question I would have to dive into AHCI controller stuff, and I'm not that into this subject to do that again for exacting details ;)

    I got those same numbers, "20-30k"; however, they varied drastically when object/component operations were taking place. I also had a hard time with stability when, say, applying a real workload to a client file server. Granted, mine could have been a fluke, but I was sure to spend a couple of months tinkering with it. The only other AHCI drive that performed better was an M.2 Samsung, but it didn't have the enterprise queue resilience* of the S3700s (I couldn't rule out drivers for the M.2). Personally I always try to go overboard, especially when implementing a VSAN production system. With more capabilities come varying degrees of new workloads.

    Utilizing NVMe and/or PCIe really isn't just applicable to, say, all-flash or VDI. It really is a godsend with magnetics; the difference was much like getting my first SSD in '07. NVMe costs have come down considerably this year, and as you mentioned, near-hotswap NVMe backplanes and servers are everywhere.

    Thanks,

    -Jon



  • 60.  RE: Dell PERC H730p / LSI 3108 /Invader implementations

    Posted Jul 30, 2015 10:26 PM

    We had issues initially, but they were all tied to LSI/Avago code (LSI 2008s in the private beta proved hilariously unreliable on writes; LSI 2208s had stability issues in pass-through, so we worked to get them revoked from the HCL and had to move to RAID 0).

    I'm curious if your problems were actually related to LSI, and moving to PCI-Express/NVMe just freed you from trying to get their silicon to do something they didn't want it to do.



  • 61.  RE: Dell PERC H730p / LSI 3108 /Invader implementations

    Posted Jul 30, 2015 11:41 PM

    Well, the tests initially started on the LSI 2308, and I went through all the firmwares and all the drivers. As you state, the controllers may just not have liked it, so then I borrowed four Dell H330s and had the same result. The only thing missing from my tests was DP hosts. Yet I can't see how that would have improved things. You are totally correct, as PCIe/NVMe allows your HBAs to do what they do best. Cheers, -Jon



  • 62.  RE: Dell PERC H730p / LSI 3108 /Invader implementations

    Posted Jul 30, 2015 02:38 PM

    Madnote:

    Just got confirmation from Dell, VMware, and our internal validation that the following driver/firmware combinations seem to be stable now when using SATA SSDs as capacity drives. Specifically, we saw massive issues when we started stressing our systems under excessive IO patterns (drives falsely reported as offline, PSODs, and sporadic latency issues).

    Please see below for the following stable config:

    We are using Intel DC S3610s in HBA mode for capacity and Intel DC P3700 NVMe AICs for flash cache.

    Have you updated your backplanes? Going to backplane version 3.03 and PERC firmware 25.3.0.00016 seems to have cleared things up for us.

    I know it's not the same setup as yours, but I thought I would update you with our latest findings.




  • 63.  RE: Dell PERC H730p / LSI 3108 /Invader implementations

    Broadcom Employee
    Posted Jul 31, 2015 07:11 PM

    Hi everyone,

    I understand the frustration that has been felt given the seriousness of the problem. Please be assured that engineering at the highest level has been engaged in resolving this issue: we have been working closely with Dell these past several days to come up with a resolution plan for the issues that have been reported in this thread with using the H730 series controllers with VSAN in pass-through mode. I have also personally tried to ensure that anyone with a support ticket open having these symptoms had the very latest information straight from engineering.

    I am pleased to report that as of today the VSAN VCG has been updated to reflect new recommended driver and firmware versions for these controllers, which should resolve the symptoms reported in this thread: VMware Compatibility Guide: vsan


    Here are the new recommended versions:

    H730 controller series with ESXi 5.5u2
    New recommended firmware version: 25.3.0.00016
    New recommended driver: megaraid_perc9 version 6.902.73.00-1OEM

    H730 controller series with ESXi 6.0
    New recommended firmware version: 25.3.0.00016
    New recommended driver: lsi_mr3 version 6.606.12.00-1OEM
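
    As a rough aid for auditing hosts against the list above, here is a minimal sketch that compares a host's reported values with the recommended combination. The table contents are copied from this post; the names RECOMMENDED, firmware_key and check_host are hypothetical, and the inputs are assumed to be bare version strings gathered by hand. Because the firmware is written both as 25.3.0.00016 and 25.3.0.0016 in this thread, the sketch compares firmware fields numerically so the zero padding does not matter.

```python
# Recommended combinations as listed in the post above.
RECOMMENDED = {
    "5.5u2": {
        "firmware": "25.3.0.00016",
        "driver": "megaraid_perc9",
        "driver_version": "6.902.73.00-1OEM",
    },
    "6.0": {
        "firmware": "25.3.0.00016",
        "driver": "lsi_mr3",
        "driver_version": "6.606.12.00-1OEM",
    },
}


def firmware_key(version):
    """Turn '25.3.0.0016' into (25, 3, 0, 16) so zero padding is ignored."""
    return tuple(int(part) for part in version.split("."))


def check_host(esxi, firmware, driver, driver_version):
    """Return a list of mismatches against the recommended combination."""
    expected = RECOMMENDED.get(esxi)
    if expected is None:
        return [f"no recommendation recorded for ESXi {esxi}"]
    problems = []
    if firmware_key(firmware) != firmware_key(expected["firmware"]):
        problems.append(f"firmware: found {firmware}, expected {expected['firmware']}")
    if driver != expected["driver"]:
        problems.append(f"driver: found {driver}, expected {expected['driver']}")
    if driver_version != expected["driver_version"]:
        problems.append(
            f"driver version: found {driver_version}, expected {expected['driver_version']}"
        )
    return problems


# Example: an ESXi 6.0 host still on the older 25.2.2.0004 firmware.
for issue in check_host("6.0", "25.2.2.0004", "lsi_mr3", "6.606.12.00-1OEM"):
    print(issue)
```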



  • 64.  RE: Dell PERC H730p / LSI 3108 /Invader implementations

    Posted Jul 31, 2015 07:42 PM

    FYI.  It is still being flagged in the VSAN Health Check.



  • 65.  RE: Dell PERC H730p / LSI 3108 /Invader implementations

    Broadcom Employee
    Posted Jul 31, 2015 08:34 PM

    Thank you for letting me know. The Health Check's internal database should update automatically very soon. I'll keep an eye on it and check with the Health Check team to ensure everything updates correctly.



  • 66.  RE: Dell PERC H730p / LSI 3108 /Invader implementations

    Posted Jul 31, 2015 10:11 PM

    I am pleased to report that as of today the VSAN VCG has been updated to reflect new recommended driver and firmware versions for these controllers, which should resolve the symptoms reported in this thread: VMware Compatibility Guide: vsan


    Here are the new recommended versions:

    H730 controller series with ESXi 5.5u2
    New recommended firmware version: 25.3.0.00016
    New recommended driver: megaraid_perc9 version 6.902.73.00-1OEM

    H730 controller series with ESXi 6.0
    New recommended firmware version: 25.3.0.00016
    New recommended driver: lsi_mr3 version 6.606.12.00-1OEM

    That is very good news, can you share which issues were resolved specifically?

    I had RAID controller resets and stalling occurring from the use of a single SATA drive (used only for ESXi scratch and ISO storage) in HBA mode. This stalling would hang or crash hosts, in addition to the poorer observed performance while in HBA mode. Are these issues resolved in the driver combo above? I currently run this combo in RAID0, but I would of course be interested in running in HBA mode for future ease of disk replacement/maintenance.



  • 67.  RE: Dell PERC H730p / LSI 3108 /Invader implementations

    Posted Jul 31, 2015 10:18 PM

    Sorry if I don't remember, but you are talking about SATA for your storage tier, correct? Thanks, -Jon



  • 68.  RE: Dell PERC H730p / LSI 3108 /Invader implementations

    Posted Jul 31, 2015 10:27 PM

    In my case, I've never used SATA for the storage tier; I am using SAS. However, there is a SATA drive (Dell 500GB magnetic) connected for use as the log/scratch disk, since ESXi doesn't want that on my boot SD card. I was encountering hangs, PSODs, and instability from having this single SATA drive in the HBA config. A few weeks ago I rebuilt it all as RAID0, since I was also noticing that HBA mode was noticeably slower in benchmarking than RAID0, although this may have been fixed, since firmware 25.3.0.0016 wasn't released yet at the time.

    Just a pain since I've rebuilt the VSAN twice going from RAID0 to HBA and back and shuffling so many combos of HBA/RAID/driver/firmware settings. Still a great product, the only thing nagging me is not running in HBA which is why I'm interested if this is all fixed now. If so I'd rebuild again to HBA.



  • 69.  RE: Dell PERC H730p / LSI 3108 /Invader implementations

    Posted Jul 31, 2015 10:38 PM

    That's interesting... Obviously you don't get the PSODs/hangs when running just the SD card and no scratch?  Is there an onboard SATA controller you could use for both ESXi and scratch? My reasoning here is to rule an assortment of things out. Also, I have occasionally run into issues when booting certain machines into ESXi via UEFI. Personally I ditched the USB/SD card method a while ago in favor of onboard high-temp SLC SATA DOMs. Thanks, -Jon



  • 70.  RE: Dell PERC H730p / LSI 3108 /Invader implementations

    Posted Jul 31, 2015 10:51 PM

    For the Dell r730xd, the H730 controller would be the onboard controller and any disk being plugged in would need to go through this controller.

    I didn't have time to test running with just the SD card and no scratch. I also didn't have a spare SAS drive to swap in. Based on all the logs and data I've collected, the older firmware 25.2.2.0004 probably would not have hung/crashed with all SAS disks connected in HBA mode; however, HBA mode was still noticeably slower. VSAN Observer in RAID0 would show drives maxing out at 350-400 IOPS, whereas in HBA mode the max was 275-300 IOPS, in addition to controller hangs and all the other bad stuff. Other benchmarking that I did between RAID0 and HBA also showed that HBA was 20-25% slower on this older firmware.
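
    The per-drive figures above line up with the 20-25% estimate. As a quick sketch of the arithmetic, pairing the low ends and the high ends of the two ranges (that pairing is an assumption; the helper name pct_slower is made up for illustration):

```python
def pct_slower(raid0_iops, hba_iops):
    """Percentage by which HBA mode trails RAID0 for the same drive."""
    return (raid0_iops - hba_iops) / raid0_iops * 100


# Per-drive maximums from VSAN Observer as reported above.
low_end = pct_slower(350, 275)   # low end of both ranges
high_end = pct_slower(400, 300)  # high end of both ranges
print(f"HBA slower by {low_end:.1f}% to {high_end:.1f}%")
```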

    The original firmware 25.2.1.0037 wouldn't even detect my SAS storage drives in HBA so I did very little testing on this.



  • 71.  RE: Dell PERC H730p / LSI 3108 /Invader implementations

    Posted Aug 01, 2015 01:14 AM

    You wouldn't happen to have a spare AHCI disk controller you could plop in for testing? Tall order, I know. That way your ESXi/scratch SATA drive would be on that controller, hopefully eliminating something from the equation. Thanks, -Jon



  • 72.  RE: Dell PERC H730p / LSI 3108 /Invader implementations

    Posted Aug 03, 2015 05:21 PM

    Unfortunately don't have a spare disk controller and the cluster is already in use. Am building out another cluster in a month or so with identical hardware but with all SAS drives, will probably test HBA on that buildout.



  • 73.  RE: Dell PERC H730p / LSI 3108 /Invader implementations

    Posted Aug 04, 2015 07:41 AM

    Hmm... Just another idea, what about attaching an iSCSI disk or PXE into ESXi? Thanks, -Jon



  • 74.  RE: Dell PERC H730p / LSI 3108 /Invader implementations

    Posted Aug 03, 2015 07:40 PM

    As per release_note_lsi-mr3_6.606.12.00-1OEM.600.0.0.2159203.txt:

    Bugs fixed (compared to earlier release of driver): None
    Known Issues and Workarounds: None
    Additional configuration options supported by the driver: None
    ...



  • 75.  RE: Dell PERC H730p / LSI 3108 /Invader implementations

    Broadcom Employee
    Posted Aug 04, 2015 08:17 PM

    Hi Elerium,

    The updates should resolve issues that manifest as the controller firmware entering a 'fault state', various IO command aborts, and disks being marked as permanently lost. I'm not aware of a fix for the specific issue you mentioned. However, I should stress that VMware strongly recommends running the controller in the configuration specified on the VSAN HCL - whether that be RAID-0 or HBA mode. For the H730 series, we require that these be configured in HBA mode across the board as this was the mode used to do certification testing for these controllers.



  • 76.  RE: Dell PERC H730p / LSI 3108 /Invader implementations

    Posted Jun 01, 2016 07:19 PM

    cdekter, with 6u2 what is the intended behaviour of a guest-issued SCSI device reset command?

    It seems that at least with H730 controller, these commands are making it to the physical controller. I wondered if this might be an unintended side effect of something that has been changed because of VSAN.



  • 77.  RE: Dell PERC H730p / LSI 3108 /Invader implementations

    Broadcom Employee
    Posted Jun 01, 2016 10:00 PM

    VSAN will not pass through any device reset commands originating from VMs. What you are observing is most likely originating from other VMFS volumes (e.g. ones used for storing ESX logs) on disks attached to the H730 controller. At this date it is not supported to run any virtual machines on VMFS volumes alongside VSAN on the same controller.



  • 78.  RE: Dell PERC H730p / LSI 3108 /Invader implementations

    Posted Jun 01, 2016 10:10 PM

    Thanks. Correct, it is in connection with VMFS volumes (on their own; no VSAN running), but the resets do seem to be passed from guest to hardware.

    E.g., running sg_reset -d /dev/sda within a Linux guest on an H730-provided internal datastore (or just rebooting a Linux guest) whilst a competing workload from another VM is running on that same datastore will cause the array IOPS to drop to zero for 5-20 seconds on that host. This is 100% repeatable with this controller.

    I just wondered if this might be related to the work that has been going on with this controller in connection with VSAN - quite a flurry of firmware and driver updates for it recently.