Intel DC P3700 Firmware

  • 1.  Intel DC P3700 Firmware

    Posted Mar 24, 2016 01:44 PM

    We are attempting to build a new 4-node VSAN hybrid cluster using Intel P3700 SSDs.  After getting everything together, the health checks fail saying the driver is not on the HCL, even though we are using the correct driver.  Opened an SR, and support is telling me that the problem is actually the firmware on the SSDs.  The drives shipped with FW0171, but FW0131 is required.  We are trying to get a response from Intel to see if the firmware can be downgraded, but so far, no luck.

    As far as I can tell, the Intel P3700 and P3600 are the only 2.5" NVMe drives on the HCL, and unless you have old drives, they can't be used.  Has anyone had any experience with this drive in VSAN? 



  • 2.  RE: Intel DC P3700 Firmware

    Posted Mar 26, 2016 12:39 AM

    I am running VSAN 6.1 hybrid with P3700s and have been since before they were officially supported on the HCL. I have not run into any VSAN issues related to the P3700. I'm using the HHHL form factor rather than 2.5", though, so I'm not sure if that makes a difference. Although I generally agree with recommendations to stick firmly to the VSAN HCL, I haven't run into any problems at all with P3700s on any firmware version. With no negative impact, I see no reason to downgrade the stock 8DV10171 firmware that ships with these disks. That said, I see no reason to upgrade to newer firmware when it is released either, since it probably won't be qualified on the VSAN HCL yet.

    I'm also not aware of any way to downgrade the firmware from a higher version to the 8DV10131 that's on the HCL.



  • 3.  RE: Intel DC P3700 Firmware

    Posted Mar 28, 2016 10:13 PM

    We're having the same problem with the HHHL P3700 drives.  We're not experiencing any issues other than the Health service saying the drives aren't on the HCL.

    I've opened a case with VMware, who told me it was Intel's responsibility to make sure the HCL is correct.  I've had a case open with Intel (Case 00288969) for a couple of weeks now with no progress.  They've had me upgrade to the latest driver (1.0e.2.0-1OEM.550.0.0.1391871) and upgrade the firmware from 8DV10131 to 8DV10171.  No luck.

    Maybe it'd be good for you to also open a case with Intel and reference my case so they can see it's affecting more than one person.



  • 4.  RE: Intel DC P3700 Firmware

    Posted Apr 02, 2016 02:20 PM

    Same problem with the DC P3600.

    It is on the HCL, but the health check reports that it isn't.

    Latest firmware, latest drivers ...

    The LSI 3008 is also on the HCL but is reported as not being on it.

    Latest firmware, latest drivers ...



  • 5.  RE: Intel DC P3700 Firmware

    Broadcom Employee
    Posted Apr 06, 2016 08:20 AM

    Actually, with regard to flash devices and drives, the statement is that the HCL lists a minimum firmware level; anything higher is supported, as far as I know. I will ask the engineering team to bake this logic into the HCL health check.

    EDIT: Apparently this does not apply to the Intel P3700 devices. What is listed on the HCL is a hard requirement, so please do not use a higher version!



  • 6.  RE: Intel DC P3700 Firmware

    Posted Apr 21, 2016 04:14 PM

    The driver we are using is 1.2.0.27-4vmw.550.0.0.1331820, and we have had no problems as far as I can tell.  We started off with 1.0e.1.1-1OEM.550.0.0.1391871, which is the driver listed on the HCL, and had all sorts of problems.  It is my understanding that Intel is in the process of recertifying the P3700/P3600 with the updated firmware.

    I've been told by VMware support that they will support these drives with this driver/firmware combination, so we have moved the cluster into production.  I would love to see the HCL warning go away soon though :)



  • 7.  RE: Intel DC P3700 Firmware

    Posted Apr 22, 2016 06:17 AM

    Out of curiosity, what kind of problems were you seeing before updating the driver?

    Also what version of VSAN?



  • 8.  RE: Intel DC P3700 Firmware

    Posted Apr 22, 2016 02:45 PM

    We were seeing congestion errors on the SSDs while running stress tests for any more than a couple minutes.  High latency and just crappy performance in general.  Intel told us it was due to the driver not matching the 8DV10171 firmware.

    We're running 6.2, and performance is looking really good at this point.



  • 9.  RE: Intel DC P3700 Firmware

    Posted Apr 22, 2016 06:26 PM

    I am actually seeing the same performance/congestion related issues in my 6.2 lab, specifically with write performance, while everything is working perfectly in 6.1.  When I disabled the new 6.2 checksum feature in storage policies it went away, but I'd rather have that option enabled on my clusters.

    I'll give the driver update a try!  Thanks for sharing.



  • 10.  RE: Intel DC P3700 Firmware

    Posted Apr 26, 2016 12:35 AM

    The 1.2.0.27-4vmw.550.0.0.1331820 driver did significantly reduce congestion and improve latency in general over the intel-nvme drivers. I'm still seeing an issue where sequential writes from VM guests are limited to no more than 250MB/s, but only with checksum enabled (with it disabled I get 800MB/s+ write speed). It may be the RAID controller or its driver, as I'm using a Dell H730, which isn't on the HCL for 6.2 yet. The latest I heard from support is that Dell/VMware may have my RAID controller added to the 6.2 HCL by the end of May.





  • 11.  RE: Intel DC P3700 Firmware

    Posted Apr 23, 2016 08:37 AM

    A quick question ...

    How do I get the firmware version of the P3600 800GB SSD?

    We also see a lot of latency, sometimes 350ms or more.

    We are using driver version 1.0e.0.35-1vmw.

    But to move to version 1.2.0.27-4vmw, you have to be on firmware version 8DV10171.

    So I would like to check the firmware version of the SSD so I can upgrade that first.

    Here are the warnings I get from VMware, although the devices are on the HCL:

       

    Device | Driver in use | Driver health
    vmhba2: Intel Corporation DC P3600 SSD [2.5" SFF] | nvme (1.0e.0.35-1vmw.600.2.34.3620759) | Warning
    vmhba3: LSI LSI Logic Fusion-MPT 12GSAS SAS3008 PCI-Express | lsi_msgpt3 (06.255.12.00-8vmw.600.1.17.3029758) | Warning
    vmhba2: Intel Corporation DC P3600 SSD [2.5" SFF] | nvme (1.0e.0.35-1vmw.600.2.34.3620759) | Warning
    vmhba3: Avago (LSI Logic) / Symbios Logic Avago (LSI)3008 | lsi_msgpt3 (12.00.00.00-1OEM.600.0.0.2768847) | Warning
    vmhba2: Intel Corporation DC P3600 SSD [2.5" SFF] | nvme (1.0e.0.35-1vmw.600.2.34.3620759) | Warning
    vmhba3: LSI LSI Logic Fusion-MPT 12GSAS SAS3008 PCI-Express | lsi_msgpt3 (06.255.12.00-8vmw.600.1.17.3029758) | Warning

    Thanks in advance



  • 12.  RE: Intel DC P3700 Firmware

    Posted Apr 25, 2016 02:59 PM

    You can install the SSD Data Center Tool VIB and use it to find the firmware version.  The easiest way, though, would be to pull the drive: the FW version is printed on the drive (at least it is on our P3700s).



  • 13.  RE: Intel DC P3700 Firmware

    Posted Jul 26, 2017 03:30 PM

    We have the exact same issue: P3700, same driver (1.2.0.27-4vmw..), and P3700 NVMe FW 8DV10171, yet the health check comes back with "Warning".

    When will the HCL be updated to not point us to use an obscure version of fw for a Cisco rebranded card?



  • 14.  RE: Intel DC P3700 Firmware

    Posted Aug 30, 2017 08:06 PM

    vSphere is now telling us that it wants the drives to be on the FJP7 firmware.  We are currently on 8DV10171, and the Intel tool tells us it would upgrade it to 8DV101H0?!?  What firmware are people here having the most success with ATM?  We are on the 1.2.0.27-4vmw driver.

    thanks,

    -ed



  • 15.  RE: Intel DC P3700 Firmware

    Posted Oct 31, 2017 01:57 PM

    We're currently on 8DV10171 with 1.2.0.32-4vmw on 6.5U1.  I don't know where FJP7 comes from; it looks like a Fujitsu firmware, but we are using Intel-branded P3700s.  Everything is working fine for us and everything is on the HCL.  It seems that the health check is never going to work with these drives.

    One thing to note, at one point we started seeing slow cloning performance after an update while we were on 6.0U3.  Just found this: https://kb.vmware.com/kb/2149876.  Doesn't seem to affect normal operations, but very disappointing as we have invested heavily in the P3700s as cache drives in all of our hybrid and all-flash clusters.



  • 16.  RE: Intel DC P3700 Firmware

    Posted Nov 14, 2017 09:16 PM

    We are testing some P3700s for vFlash Read Cache and the performance is shit there as well.  I asked support if the issue in the KB you mentioned could also cause problems with vFlash Read Cache, and I was told:

    "we do know that the NVMe device in question (I.E. your HP "MO1600KEFHQ"  also known as "1.6TB NVMe PCIe Write Intensive SFF 2.5-in SC2 764892-B21" which is basically a re-branded Intel P3xxx ) is affected by serious performance issues when dealing with continuous writes in the same blocks or block range.  Although this was noticed specifically with vSAN environments, it is reasonable to state that the bad performance "scenario" (it's actually not a bug, I believe, of the device itself but rather a design flaw) of the P3xxx would be exploited with any intense use of the device itself, like for example, vFlash cache in conjunction with write intensive applications like DB servers."



  • 17.  RE: Intel DC P3700 Firmware

    Posted Nov 22, 2017 07:50 PM

    We were directed by GSS to disable the firmware version health check. We are running retail Intel P3700s, and the DID, VID, SDID and SVID values are identical for Intel, Fujitsu and at least one other brand.

    We were also instructed to disable log compaction with regard to KB2149876. If log compaction isn't disabled, you'll notice very high latency values in certain scenarios. (I uncovered this while watching esxtop after completion of a proactive stress test. During the deletion process the latency would spike for 15-20 minutes.)

    After these two changes our vSAN environments have been performing flawlessly on retail Intel P3700s, with S3520s for capacity.



  • 18.  RE: Intel DC P3700 Firmware

    Posted Dec 01, 2017 02:10 AM

    I have a ton of Intel P3700s in production and have seen the high latency values on the P3700 only infrequently, but unfortunately at some critical times (during resync from dying hardware or during a host failure). How do you disable log compaction in VSAN? And if disabled, can it be safely re-enabled at a later time?



  • 19.  RE: Intel DC P3700 Firmware

    Posted May 24, 2016 03:54 PM

    Has there been any traction on this? I'm also hitting up Intel on their end (Firmware Downgrade |Intel Communities) with the same issue to see if we can push this along. According to Intel, the certification validation lies with VMware at this point. From what a VMware Federal Escalation Engineer told me during a call for an unrelated service request, VMware can either certify in-house OR request results from the hardware company to analyze for certification.



  • 20.  RE: Intel DC P3700 Firmware

    Posted May 24, 2016 08:27 PM

    I have a mix of VSAN 6.1/6.2 hybrid and all-flash clusters, all using P3700 or P3600 for cache. I can tell you that in VSAN 6.2, using the HCL firmware/driver combo, you will see very poor performance, congestion and latency problems. I recently built a new 6.2 VSAN all-flash cluster that happened to ship with 8DV10131 (HCL) firmware. Using the 1.0e.1.1-1OEM.550.0.0.1391871 HCL driver, there are severe write performance issues. The result is the same after upgrading to firmware 8DV10171. Between VMware and Intel, or whoever is responsible for updating the HCL, I don't think any real testing went into it before it got 6.2 qualified. Testing the HCL combo for even 5 minutes, one would immediately notice a major latency/congestion issue, even under light stress testing. I believe it also has something to do with the new checksum functionality added in 6.2: if it is disabled in the storage profile, all performance returns to normal levels.

    Personally, I don't think the answer is downgrading firmware but for Intel to release a new intel-nvme driver (and/or firmware update) that resolves the issues discovered on 6.2.  Also, none of the issues exist on 6.1 (probably because the checksum feature isn't in 6.1).

    Here are my findings from 6.2 AF VSAN using Intel P3700 400GB for write cache and 4x Intel S3510 800GB for capacity:

    P3700 400GB, firmware 8DV10131, intel-nvme 1.0e.1.1-1OEM.550.0.0.1391871 driver - severe latency/congestion issues from disk writes, no issues if disabling checksum

    P3700 400GB, firmware 8DV10131, intel-nvme 1.0e.2.0-1OEM.550.0.0.1391871 driver - severe latency/congestion issues from disk writes, no issues if disabling checksum

    P3700 400GB, firmware 8DV10171, intel-nvme 1.0e.1.1-1OEM.550.0.0.1391871 driver - severe latency/congestion issues from disk writes, no issues if disabling checksum

    P3700 400GB, firmware 8DV10171, intel-nvme 1.0e.2.0-1OEM.550.0.0.1391871 driver - severe latency/congestion issues from disk writes, no issues if disabling checksum

    P3700 400GB, firmware 8DV10171, nvme 1.2.0.27-4vmw.550.0.0.1331820 driver - no latency/congestion problems, sequential writes limited to 250MB/s, no issues if disabling checksum



  • 21.  RE: Intel DC P3700 Firmware

    Broadcom Employee
    Posted May 25, 2016 08:34 AM

    Are you running benchmarks or is this during normal operations? Also, have you opened up a VMware support ticket for this issue? If so, what is the SR number?



  • 22.  RE: Intel DC P3700 Firmware

    Posted May 25, 2016 08:09 PM

    Thanks Duncan for checking my SR. I received your PM and don't have other recommendations yet, will wait and see if support has additional suggestions. The congestion/latency issues I describe are for normal operations, if running a benchmark, the issues appear quickly within 5-10 minutes.



  • 23.  RE: Intel DC P3700 Firmware

    Posted May 25, 2016 07:13 PM

    Elerium,

    Can you confirm that your S3510 SSDs are not contributing to the problem by swapping them for something else? My cluster consists of 4 x Dell PowerEdge R730 servers each with 2 disk groups running VSAN 6.1 hybrid. On three of the nodes, the disk groups consist of an Intel S3700 800GB for write cache and 6 x Seagate Constellation.2 (ST91000640SS) 1TB hard drives for capacity. The remaining node has Intel P3700 800GB PCIe drives instead of using the Intel S3700. I'm currently running 8DV10171 with nvme 1.2.0.27-4vmw.550.0.0.1331820, so I don't believe I'm seeing any of the issues you've described.



  • 24.  RE: Intel DC P3700 Firmware

    Posted May 25, 2016 08:27 PM

    I am sure the S3510s are not the issue; I have the same thing happening on my hybrid cluster that uses WD RE4s as capacity drives.

    You mention you are on VSAN 6.1. The issue I'm describing occurs only with VSAN 6.2 (probably related to the checksum feature). In VSAN 6.1 or VSAN 6.0 I didn't experience any issues with P3700/P3600 on any cluster.



  • 25.  RE: Intel DC P3700 Firmware

    Posted May 24, 2016 08:41 PM

    I posted the same response to the Intel forums. I'll probably open a case with Intel in the next day or two, which hopefully will give this more visibility at Intel as well.



  • 26.  RE: Intel DC P3700 Firmware

    Posted May 26, 2016 12:48 PM

    May I ask which model/type of the Intel DC P3700 you are running? While searching through the VSAN HCL DB the "cool way" (the JSON file directly: http://partnerweb.vmware.com/service/vsan/all.json), I found that the SSDPE2MD800G4 (800GB, 2.5-inch) is listed with SSID 3703, and the SSDPEDMD800G4 (800GB, HHHL AIC) is also listed with SSID 3703. Our DC P3700, 800GB, HHHL AIC has an SSID of 3702, not 3703, which is probably why the Health Check gives us a "Warning": the device does not match any SSID. A different driver or firmware would in that case give the same result, as the device still wouldn't match any SSID.

    Regarding identical SSID in the HCL

    A quick search in the JSON-file, and you'll find the following relevant IDs (output from today - this may change):

    "id": 39653,
    "model": "Intel SSD DC P3700 Series SSDPE2MD800G4 (800 GB, 2.5-inch)",
    "vid": "8086",
    "did": "0953",
    "svid": "8086",
    "ssid": "3703",


    "id": 39659,
    "model": "Intel SSD DC P3700 Series SSDPEDMD800G4 (800 GB, HHHL AIC)",
    "vid": "8086",
    "did": "0953",
    "svid": "8086",
    "ssid": "3703",

    Checking Vendor ID (VID), Device ID (DID), Sub-Vendor ID (SVID) and Sub-Device ID (SSID)

    vmkchdev -l |grep vmhba4

    0000:84:00.0 8086:0953 8086:3702 vmkernel vmhba4
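
    The comparison described in this post can be sketched in Python. This is a minimal illustration, not how the health check is actually implemented: it assumes the check needs all four PCI IDs to match exactly, and it assumes the five-field line layout seen in the sample output above. The function names are hypothetical.

```python
# HCL entries copied from the JSON excerpt quoted above (IDs only).
HCL_ENTRIES = [
    {"model": "SSDPE2MD800G4 (800 GB, 2.5-inch)",
     "vid": "8086", "did": "0953", "svid": "8086", "ssid": "3703"},
    {"model": "SSDPEDMD800G4 (800 GB, HHHL AIC)",
     "vid": "8086", "did": "0953", "svid": "8086", "ssid": "3703"},
]

ID_FIELDS = ("vid", "did", "svid", "ssid")


def parse_vmkchdev_line(line):
    """Split one `vmkchdev -l` line into its PCI IDs and device alias.

    Assumes the five-field layout of the sample line in this post:
    address, VID:DID, SVID:SSID, owner, alias.
    """
    address, vid_did, svid_ssid, owner, alias = line.split()
    vid, did = vid_did.lower().split(":")
    svid, ssid = svid_ssid.lower().split(":")
    return {"address": address, "vid": vid, "did": did,
            "svid": svid, "ssid": ssid, "owner": owner, "alias": alias}


def matching_hcl_entries(device, entries):
    """Return the entries whose four PCI IDs all equal the device's.

    Exact four-way matching is an assumption about the health check.
    """
    return [e for e in entries
            if all(e[f].lower() == device[f] for f in ID_FIELDS)]


if __name__ == "__main__":
    dev = parse_vmkchdev_line(
        "0000:84:00.0 8086:0953 8086:3702 vmkernel vmhba4")
    hits = matching_hcl_entries(dev, HCL_ENTRIES)
    print(dev["alias"], "matches", len(hits), "HCL entries")
```

    For the sample line, no entry matches, because the device reports SSID 3702 while both listed entries use 3703.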

    Regarding driver & firmware-versions

    Our NVMe-device is also shipped with FW 8DV10171 (verified with the Intel DCT).

    Based on our VID, DID, SVID and SSID for the device, the following HCL entries are available:

    From the "General HCL"-list (cat=io): FW 8DV10171 & nvme version 1.2.0.27-4vmw (VMware Async)

    From the "VSAN HCL" (cat=ssd): FW 8DV10131 & nvme 1.0e.1.1-1OEM.550.0.0.1391871 (which actually is "intel-nvme", as this is only available as Partner Async-driver, as far as I know).

    Still waiting for a response on our SR - just wanted to let you know our findings (in case it helps).

    Best regards,

    Espen Ødegaard



  • 27.  RE: Intel DC P3700 Firmware

    Posted May 26, 2016 03:09 PM

    I've always had the Intel P3700 give a warning on healthcheck for the HCL category since VSAN 6. I more or less ignored that since I know the part really is on HCL.

    I use these models of P3700 (VID,DID, SVID, SSID):

    Intel DC P3700 400GB HHHL, SSDPEDMD400G4, 8086:0953 8086:3702

    Intel DC P3700 1.6TB HHHL, SSDPEDMD016T4, 8086:0953 8086:3702

    Intel DC P3700 2.0TB HHHL, SSDPEDMD020T4, 8086:0953 8086:3702

    Looking at some other PCIe databases (https://pci-ids.ucw.cz/read/PC/8086/0953), maybe SSID 3703 only refers to the 2.5" SFF version and HHHL is SSID 3702? If so, the HCL has the SSID for the HHHL version entered incorrectly.
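
    The suspicion can be checked mechanically with a short Python sketch. It uses only the two HCL entries quoted earlier in the thread and the three HHHL cards listed in this post; the form-factor labels and function names are hypothetical, and the data is far too small to prove anything about the full HCL.

```python
from collections import defaultdict

# (model, form factor, SSID) as listed on the HCL, from the JSON excerpt
# quoted earlier in the thread.
HCL_ROWS = [
    ("SSDPE2MD800G4", "2.5-inch", "3703"),
    ("SSDPEDMD800G4", "HHHL", "3703"),
]

# (model, form factor, SSID) as reported by the cards listed in this post.
OBSERVED_ROWS = [
    ("SSDPEDMD400G4", "HHHL", "3702"),
    ("SSDPEDMD016T4", "HHHL", "3702"),
    ("SSDPEDMD020T4", "HHHL", "3702"),
]


def ssids_by_form_factor(rows):
    """Collect the set of SSIDs seen for each form factor."""
    result = defaultdict(set)
    for _model, form_factor, ssid in rows:
        result[form_factor].add(ssid)
    return dict(result)


def mismatched_form_factors(hcl, observed):
    """Form factors whose observed SSIDs never appear in the HCL rows."""
    return sorted(ff for ff, ssids in observed.items()
                  if ff in hcl and not ssids & hcl[ff])


if __name__ == "__main__":
    hcl = ssids_by_form_factor(HCL_ROWS)
    observed = ssids_by_form_factor(OBSERVED_ROWS)
    print("Form factors with no SSID overlap:",
          mismatched_form_factors(hcl, observed))
```

    With this data, HHHL is the only form factor flagged, which is consistent with the idea that the HHHL entry carries the 2.5" SSID.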



  • 28.  RE: Intel DC P3700 Firmware

    Posted May 27, 2016 04:04 AM

    Yes, those are my thoughts as well. I've noted this in my SR with VMware. With luck, the VSAN HCL "DB" (JSON file) will be corrected (unless I'm misunderstanding the logic of the HCL check).



  • 29.  RE: Intel DC P3700 Firmware

    Posted Jun 15, 2016 12:00 PM

    Quick update, regarding VSAN HCL DB:

    Got confirmed in our SR that the "Warning on the P3700 HHHL AIC" was due to a VSAN Health-plugin issue, and could be ignored (will be fixed in the following health releases).

    Regarding firmware-version:

    Was also told to downgrade the stock firmware (8DV10171) to 8DV10131. Hopefully the VSAN HCL DB will be updated shortly (based on Intel's response from yesterday, regarding the *171-firmware, it should be verified/added by VMware next week). Wondering about VMware's updated recommendation on driver (with the new firmware).



  • 30.  RE: Intel DC P3700 Firmware

    Posted Jun 21, 2016 11:23 AM

    Any update? I have the exact same problem with Intel P3700 SSDs on *171 firmware and VSAN.



  • 31.  RE: Intel DC P3700 Firmware

    Posted Jun 22, 2016 11:29 AM

    Great news, just got word from our vendor that the P3700 w/ FW0171 is now certified for VSAN:

    VMware Compatibility Guide - ssd

    We've been using this with the 1.2.0.27-4vmw driver for a while with no problems, but I know Elerium was reporting some write speed limitations with checksum enabled using the 1.2.0.27-4vmw driver.  I see that the intel-nvme 1.0e.2.0 driver is also listed on the HCL, I'm curious if there might be an improvement using this driver.



  • 32.  RE: Intel DC P3700 Firmware

    Posted Jun 24, 2016 09:47 PM

    I've tried all available drivers (intel-nvme 1.0e.1.1, intel-nvme 1.0e.2.0, nvme 1.0e.0.35-1vmw (inbox) and nvme 1.2.0.27-4vmw); all work at different degrees of poor on VSAN 6.2 if you leave checksum enabled. If you MUST use checksum, use nvme 1.2.0.27-4vmw; the two intel-nvme drivers are pretty much unusable with checksum enabled. Using nvme 1.2.0.27-4vmw with checksum on will work fine for most workloads that don't involve constant large sequential writes. However, if your VSAN goes into resync in this setup, you can expect serious performance problems and horrible latencies (500ms+) during resync. I have 4 different clusters, hybrid and AF, all using Intel P3600 or P3700, where I have been able to reproduce this behavior consistently. I don't have any other NVMe SSDs to test with, so there is still a possibility that something else is the cause, but my best guess is that it's the nvme driver, the SSD firmware or maybe a bug in the checksum implementation.

    If you are using P3700 or P3600 and VSAN 6.2, I would recommend disabling checksum in all your storage policies to avoid the problems above. If you're using VSAN 6.1 or VSAN 6.0 (which doesn't support checksum), you won't see any issues whatsoever. I still have an SR open with VMware about the checksum issue, their development is still looking at it and there's no ETA for a fix yet.

    I raised this issue with Intel; you can read the thread here (Firmware Downgrade |Intel Communities). Their reply is below:

    We have already worked with VMware* to have our FW171 added into the HCL and the expectation is for VMWare* to have it updated by next week, you can keep an eye on their website.

    On the other hand, the FW added into the HCL has no relationship with the SW Checksum added by VMWare* (as an option) to their VSAN 6.2; therefore, the latest FW will not fix the latency issues associated with the SW Checksum when enabled, as this is not really related to our drives, since our drives were designed and tested for high integrity and therefore non-validated or intended to operate with VSAN's SW Checksum feature.

    At the end it is up to everyone whether to rely on our high integrity SSD's or enable a SW Checksum which will add latency and therefore sacrifice performance.

    Let us know if you need more information.



  • 33.  RE: Intel DC P3700 Firmware

    Posted Oct 25, 2016 12:37 AM

    After opening an unrelated case regarding poor write latency and failed drives and being given a command to change LSOM congestion limits by support, I've found that the following settings (run on each host) increased overall write performance for me by at least 50% (my resync speeds are now 100% faster too)!

    esxcfg-advcfg -s 24 /LSOM/lsomLogCongestionLowLimitGB

    esxcfg-advcfg -s 48 /LSOM/lsomLogCongestionHighLimitGB

    (defaults are 16 and 24 respectively if you need to revert)

    From what I have figured out, the values above configure the minimum and maximum write log size, which directly determines congestion levels: congestion starts when the LowLimitGB value is reached, and the HighLimitGB value sets the point of maximum congestion. The write log acts as a buffer, stored on the cache SSD, that holds writes destined for the capacity disks. I have found that increasing this buffer dramatically increases performance when using P3700 + magnetic capacity disks. In IOPS benchmark testing I see a 50% write improvement, and in resync operations I see a 100% throughput improvement! I also previously had to disable checksum because write latencies were poor with it enabled while VSAN was resyncing; this change has fixed that for me as well, and I can finally enable checksum!
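
    The relationship described above can be sketched as a toy model in Python. It assumes congestion ramps linearly from zero at the low limit to its maximum at the high limit. The thread does not say what shape the ramp actually has inside VSAN, so the linear form, the 0.0 to 1.0 scale and the function name are illustrative assumptions only.

```python
DEFAULT_LIMITS_GB = (16, 24)  # defaults quoted in this post
TUNED_LIMITS_GB = (24, 48)    # values set with esxcfg-advcfg in this post


def congestion_level(log_used_gb, low_limit_gb, high_limit_gb):
    """Toy model: 0.0 below the low limit, 1.0 at or above the high limit.

    The linear ramp in between is an assumption made for illustration.
    """
    if high_limit_gb <= low_limit_gb:
        raise ValueError("high limit must be greater than low limit")
    if log_used_gb <= low_limit_gb:
        return 0.0
    if log_used_gb >= high_limit_gb:
        return 1.0
    return (log_used_gb - low_limit_gb) / (high_limit_gb - low_limit_gb)


if __name__ == "__main__":
    # Compare how full the write log can get before throttling kicks in.
    for used_gb in (12, 20, 30, 48):
        default = congestion_level(used_gb, *DEFAULT_LIMITS_GB)
        tuned = congestion_level(used_gb, *TUNED_LIMITS_GB)
        print(f"{used_gb:>2} GB used: default={default:.2f} tuned={tuned:.2f}")
```

    In this model, a write log holding 20 GB is already halfway to maximum congestion with the default limits but not congested at all with the tuned ones, which matches the direction of the improvement reported here.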

    As far as I can tell, there are two downsides. First, this buffer increases capacity use on your cache SSD (by the GB amount specified); I really think this drawback is minimal unless you are using a tiny SSD cache. Second, if you increase the values too much (for me it was past 64GB for the HighLimit), VSAN won't throttle VM latencies, so it will put higher I/O strain on the capacity layer, which may ultimately affect VM latency. You should do your own testing to see what works in your environment, but I just wanted to put this out there, as I practically got a free 50-100% write boost by changing these settings.



  • 34.  RE: Intel DC P3700 Firmware

    Posted Jul 11, 2016 09:34 AM

    P3700 HHHL AIC is on HCL now
    VMware Compatibility Guide - vfrc



  • 35.  RE: Intel DC P3700 Firmware

    Posted Dec 15, 2016 01:40 AM

    Espen-

    Did you ever get this figured out?   We are still seeing health warnings and have the 3702 SSID listed as well.

    Thoughts?

    Rick