Brocade Fibre Channel Networking Community


DEV_LATENCY_IMPACT rule - defALL_PORTS_IO_PERF_IM

  • 1.  DEV_LATENCY_IMPACT rule - defALL_PORTS_IO_PERF_IM

    Posted 03-09-2020 06:26 AM
    Friends,

    One of our fabric switches (the CORE) is triggering a lot of FPI alerts (DEV_LATENCY_IMPACT) every 5 minutes on its ISL links.
    Affected Entity:  E-Port 3
    Rule Name: defALL_PORTS_IO_PERF_IMPACT
    Condition: ALL_PORTS(DEV_LATENCY_IMPACT==IO_PERF_IMPACT)
    Current Value: [DEV_LATENCY_IMPACT,IO_PERF_IMPACT, 10 ms Frame Delay]
    Dashboard Category: Fabric Performance Impact
    Switch Name: SAN-1
    Switch WWN: 10:00:50:eb:1a:4f:fe:86
    Switch IP: 10.1.1.3
    Fabric Name: uninitialized
    VFID: 128

    I have checked for physical errors and don't see any port flaps or other issues.

    What else could be the issue? How can I check whether the ISL bandwidth is sufficient or whether we need to add more ISLs? Right now there are 8 ISLs at 16Gb each (3 trunks).



  • 2.  RE: DEV_LATENCY_IMPACT rule - defALL_PORTS_IO_PERF_IM

    Posted 03-09-2020 09:14 AM
    To check if ISL bandwidth is sufficient, I'd use the performance graphing functionality in BNA. Generally, though, I'd say that 128Gb of ISL bandwidth per fabric is sufficient for most standard enterprise workloads (although you'll be in a better position to know what traffic flows over the ISLs and what drove the sizing decision of 8 x 16Gb).

    My guess would be that bandwidth isn't going to be your limitation. I'd be looking for evidence of buffer credit exhaustion on the ISL interfaces.


  • 3.  RE: DEV_LATENCY_IMPACT rule - defALL_PORTS_IO_PERF_IM

    Posted 03-09-2020 09:24 AM
    Hi Calvin,

    Many thanks for the quick answer. I will look at the bandwidth graph. How would I look into B2B credit exhaustion on the ISLs?

    regards,
    Kishore Ram.


  • 4.  RE: DEV_LATENCY_IMPACT rule - defALL_PORTS_IO_PERF_IM

    Posted 03-09-2020 10:50 AM
    You need to look for instances where 'TX BB credit 0' events are being registered (i.e., moments when the port has 0 BB credits).

    You can find this either by adding the 'BB Credit 0' metric to your BNA performance graph or by looking at the 'tim_txcrd_z' (Time TX Credit Zero) numbers in portstatsshow.
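
    For a quick spot-check from the CLI, something like the following is enough (the port index and values here are purely illustrative, and the exact output layout varies by FOS release):

    portstatsshow 3 | grep tim_txcrd_z
    tim_txcrd_z           8678122
    tim_txcrd_z_vc  4- 7: 8678122

    The absolute number matters less than how it compares to other ports over the same interval.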


  • 5.  RE: DEV_LATENCY_IMPACT rule - defALL_PORTS_IO_PERF_IM

    Posted 03-09-2020 11:19 AM
    Edited by kishore ram 03-09-2020 11:20 AM
    Hi Calvin,

    Unfortunately, we don't have the BNA tool to monitor historic port counters. In addition to these alerts, we get similar latency alerts on HPE VTL drive ports (NPIV), which are marked SDDQ once every 3-4 days. The ports that get quarantined are always the same VTL drive ports across the 4 VTL libraries we have. I release the SDDQ ports, because while they are quarantined the host (TSM server) loses visibility of the drives.

    But the backup server (TSM) and the VTL tape library drives are connected to the same CORE SAN switch in this case. I assumed that the latency alerts on the ISLs were caused by latency on the VTL tape drive ports, but I don't think that is the case, for two reasons:
    1) The latency/SDDQ alerts on the VTL drives do not occur at the same time as the ISL latency alerts.
    2) The VTL tape ports and the TSM HBA ports are connected to the same SAN switch, so they should not be related to the ISL latency alerts.

    Since we do not have BNA, I am thinking of putting together a script which runs twice a day and does the following (a rough sketch is attached at the end of this post):
    1. Clear all the port stats on both the core and edge switches.
    2. Wait 30 minutes to 1 hour.
    3. Capture the tim_txcrd_z (latency) value for each port, e.g.:
    tim_txcrd_z_vc 4- 7: 8678122
    4. If the value is above 1000 (or whatever threshold you can suggest), print that port in the output.

    From the two runs per day I could then build a list of the ports that show up in both.

    Sorry for the long, detailed explanation; I hope I haven't confused you. What do you think of this approach?
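
    Something along these lines is what I have in mind. This is only a rough Python sketch; the switch names, the SSH access, the port range, and the 1000 threshold are all placeholders I still need to validate, not tested values:

    #!/usr/bin/env python3
    # Rough sketch of the twice-a-day check described above.
    # Assumptions (placeholders, adjust for the real environment):
    #   - key-based SSH to each switch as user "admin"
    #   - FOS commands statsclear / portstatsshow available on the switches
    #   - THRESHOLD of 1000 is a guess, not a Brocade recommendation
    import re
    import subprocess
    import time

    SWITCHES = ["SAN-1", "SAN-EDGE-1"]   # core and edge switches (placeholders)
    PORT_RANGE = range(0, 48)            # adjust to the real port count
    WAIT_SECONDS = 30 * 60               # step 2: wait 30 minutes
    THRESHOLD = 1000                     # step 4: flag ports above this value

    def fos(switch, command):
        """Run one FOS CLI command over SSH and return its output as text."""
        return subprocess.run(
            ["ssh", f"admin@{switch}", command],
            capture_output=True, text=True, check=True,
        ).stdout

    # Step 1: clear the port statistics (statsclear is assumed to clear all ports).
    for sw in SWITCHES:
        fos(sw, "statsclear")

    # Step 2: let the counters accumulate.
    time.sleep(WAIT_SECONDS)

    # Steps 3 and 4: read tim_txcrd_z per port and report ports above the threshold.
    for sw in SWITCHES:
        for port in PORT_RANGE:
            out = fos(sw, f"portstatsshow {port}")
            m = re.search(r"^tim_txcrd_z\s+(\d+)", out, re.MULTILINE)
            if m and int(m.group(1)) > THRESHOLD:
                print(f"{sw} port {port}: tim_txcrd_z = {m.group(1)}")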




  • 6.  RE: DEV_LATENCY_IMPACT rule - defALL_PORTS_IO_PERF_IM

    Posted 03-10-2020 03:20 AM

    Hi Kishore,

     

    A couple of things to keep in mind:

    Ref. SDDQ

    The SDDQ feature (Slow Drain Device Quarantine) is designed to help alleviate a latency issue by putting a slow-drain device on a lower-priority virtual channel (VC) until the administrator can assess the cause of the slow drain and remediate it. So if your HPE VTLs end up in SDDQ every 3-4 days, then you have problems with those ports which need to be sorted out first, before you even start looking at the ISLs. I would expect the messages on the ISL to be a symptom rather than a cause.

     

    Ref. the tim_txcrd_z counter.
    I received this from Brocade while working a case, and I think it sums up everything regarding the tim_txcrd_z counter.

    There are some things to understand about the tim_txcrd_z counter to make effective use of it:

    1. This counter does not provide any indication of how long a frame was delayed due to zero BBC_TRC. It only tells us that at the instant the poll looked at the BBC_TRC it was zero and there was a frame queued for transmission. A nanosecond later an R_RDY may have been received, allowing that frame to be transmitted.
    2. This counter can increase on F ports during normal operations.
    3. This counter can/will increase very rapidly on E ports during normal operations.
    4. In mixed speed environments, this counter will typically increment at a higher rate on lower speed devices than higher speed devices.
    5. This counter can increment at roughly 380,000 per second in the case of a port with a total loss of transmit credits.
    6. At that rate the 32-bit counter may wrap in approximately 3 hours if the condition on the port is severe enough (2^32 / ~380,000 per second is roughly 11,300 seconds, i.e. about 3 hours).
    7. In low/moderate latency scenarios, there isn't an objective way to quantify what counts as "too many" increments of this counter. It's best to identify statistical outliers: look at all the F-Ports in the fabric and focus the investigation on those with the highest counters within a 3-hour window.
    8. By dividing the number of tim_txcrd_z increments by the number of frames transmitted, we can get a rough indication of the severity of the delay on the port. Be sure neither counter has wrapped, or the result will be invalid. A ratio of tim_txcrd_z / frames transmitted approaching or exceeding 1 should be considered a red flag and warrants further investigation (see the queue depth / fan-in / fan-out section).

     

    Remember that the original message on the ISL means that a frame saw a 10 ms delay on the ISL, i.e. the frame could not be accepted on the other side. Therefore you need to start looking at the F-Ports on the other switch, and if those are the SDDQ ports, then that is your first port of call.


    Regards,

    Ed




  • 7.  RE: DEV_LATENCY_IMPACT rule - defALL_PORTS_IO_PERF_IM

    Posted 03-10-2020 06:10 AM
    Thanks Ed for the detailed explanation.

    1) Regarding the VTL SDDQ, we did engage the backup team and a specialist to check whether the VTL library and its firmware were in order (the firmware was even upgraded). Lots of workarounds were tried (zone changes, etc.) but nothing helped to reduce the SDDQ alerts.

    2) Regarding the ISLs, we receive the latency alerts on the CORE switch side of the ISLs, which connect to one EDGE switch. As per your advice, I need to look into the F-Ports of the EDGE switch. However, the SDDQ (VTL) ports are connected to the CORE switch, not the EDGE switch, so it is now a little more complex to find the F-Port(s) causing the latency alerts on the CORE ISLs.

    Let me check the EDGE switch F-Ports for tim_txcrd_z issues (if any) and get back to you.


  • 8.  RE: DEV_LATENCY_IMPACT rule - defALL_PORTS_IO_PERF_IM

    Posted 03-10-2020 07:06 AM
    Hi Kishore,

    Your response to the VTL SDDQ problem suggests that you don't understand the issues involved. Upgrading firmware and changing zones won't make the slightest bit of difference to slow-drain devices (as you have discovered).

    In general, your SAN environment sounds to be in poor condition, with some pretty fundamental issues. You sound ill-equipped (both from a tools and a knowledge perspective) to resolve them, and even with a small glimpse into your world it's apparent that some pretty big decisions will need to be made and implemented (possible hardware upgrades, fabric segregation/re-architecture, re-cabling work, etc.). These are all beyond the abilities of a community board.

    At this point, I would strongly suggest getting in a specialist company who can do a ground-up review of your estate and propose appropriate measures to resolve your existing issues and put you in a solid position for the future.


  • 9.  RE: DEV_LATENCY_IMPACT rule - defALL_PORTS_IO_PERF_IM

    Posted 03-10-2020 07:59 AM
    Thanks Ed for the suggestion.

    We have an ongoing case with HPE (the vendor) regarding SDDQ, and they suggested the zoning changes etc. Unfortunately, none of them worked. They did point to the high number of NPIV logins on a single physical port, which could cause delays due to the multiple logins. These VTLs are attached to the CORE switch.

    Their assessment: the switch ports are experiencing high latency from the StoreOnce and they do not find any issues on the SAN end; the customer has to check the StoreOnce configuration and the I/O traffic load balancing from the StoreOnce to the SAN ports. As each SAN port connected to the StoreOnce is an NPIV port, a minimum of 30 to 50 devices are logged in, as shown below. This can cause high overhead on the ports.

    Index Port Address Media Speed State Proto
    ==================================================
    62 62 0b3e00 id 8G Online FC F-Port 1 N Port + 49 NPIV public // 49 devices logged in
    64 64 0b4000 id N8 Online FC F-Port 1 N Port + 42 NPIV public // 42 devices logged in
    65 65 0b4100 id 8G Online FC F-Port 1 N Port + 29 NPIV public // 29 devices
    79 79 0b4f00 id N8 Online FC F-Port 1 N Port + 29 NPIV public // 29 devices.



    Each device in a Fibre Channel fabric has a unique World Wide Name (WWN). In terms of zoning, this means we can identify devices in the fabric by WWNN or WWPN. Brocade's recommended practice is to bind the WWPNs of the intended devices (ports) together. This binding is called zoning, and it enables those devices to communicate with each other.




  • 10.  RE: DEV_LATENCY_IMPACT rule - defALL_PORTS_IO_PERF_IM

    Posted 03-11-2020 03:35 AM
    Hi Kishore,
    I would agree with Calvin: if all the hardware checks out, you need a specialist company or Professional Services to do a performance review of your SAN and see what needs to be done to get you back to normal production. That goes beyond this community.
    Regards,
    Ed


  • 11.  RE: DEV_LATENCY_IMPACT rule - defALL_PORTS_IO_PERF_IM

    Posted 03-11-2020 10:35 AM
    Edited by D Tocci 03-11-2020 10:35 AM
    I had a very similar issue, which required analyzing the ratio of tim_txcrd_z vs. frames transmitted, as mentioned above.

    In my case a Brocade 8510-8 (Switch 2) had ICLs to another 8510-8 (Switch 1), with 512Gb of ICL capacity.
    Switch 2's ICLs frequently posted MAPS IO_Latency messages toward Switch 1. This implies Switch 1 has a slow-drain
    condition; however, Switch 1's MAPS had no latency alerts for any end devices. The same thing was happening
    on the 2nd fabric in the Switch 4 -> Switch 3 direction. The lack of MAPS IO_Latency alerts only means that none of the
    switch ports exceeded 10 ms of transmit latency to attached devices. There could be 1 to 9 ms of latency continually,
    but MAPS will not fire an alert until 10 ms is breached.

    What I ended up doing was calculating the percentage of the tim_txcrd_z count vs. frames transmitted: (tim_txcrd_z / frames transmitted) * 100.
    That revealed two servers with a high ratio of tim_txcrd_z vs. frames transmitted, which turned out to be the culprits.
    I was using a threshold of 20%; the servers were above 30%.

    Before doing such an exercise it is advisable to baseline the FC port counters on each switch by clearing the interface counters.
    The frames-transmitted counter and the tim_txcrd_z counter displayed under portstatsshow x/y use 32-bit registers, which
    can wrap quickly on a busy port and throw off your calculations. I would suggest using the portstats64show counters instead,
    which include an upper word that increments every time the 32-bit counter wraps; each wrap adds 4,294,967,296 to the count.

    Example of calculating the percentage of tim_txcrd_z vs. total frames transmitted using portstats64show, i.e. (tim_txcrd_z / frames transmitted) * 100:

    portstats64show 0
    stat64_ftx       2           top_int    : Frames transmitted
                     162277409   bottom_int : Frames transmitted

    tim64_txcrd_z    0           top_int    : Time BB_credit zero
                     245772137   bottom_int : Time BB_credit zero

    [245,772,137 / ((2 * 4,294,967,296) + 162,277,409)] * 100 =
    [245,772,137 / 8,752,212,001] * 100 ≈ 2.81%
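
    If you want to automate that arithmetic, here is a minimal Python sketch of the same calculation. It only reconstructs the 64-bit values from the top_int/bottom_int pairs and prints the percentage; parsing the portstats64show output is left out, and the numbers are the ones from the example above:

    # Rebuild the 64-bit counters from portstats64show's two halves:
    #   64-bit value = top_int * 2**32 + bottom_int
    def combine(top_int: int, bottom_int: int) -> int:
        return top_int * 2**32 + bottom_int

    # Values taken from the "portstats64show 0" example above.
    frames_tx  = combine(top_int=2, bottom_int=162277409)   # stat64_ftx
    txcrd_zero = combine(top_int=0, bottom_int=245772137)   # tim64_txcrd_z

    ratio_pct = txcrd_zero / frames_tx * 100
    print(f"tim_txcrd_z / frames transmitted = {ratio_pct:.2f}%")   # ~2.81%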