Brocade Fibre Channel Networking Community


We get latency issues which then cause Oracle RAC to force its disks offline

  • 1.  We get latency issues which then cause Oracle RAC to force its disks offline

    Posted 12-28-2014 02:20 PM

    We are using Fabric Watch and bottleneckmon to highlight issues, but this is all after the event. We can address the hardware issues that cause the latency afterwards, but in the meantime the SAN-wide issue causes Oracle RAC to drop disks.  How can we prevent the impact in the first place?  We are running FOS 7.1.0c across the fabrics.




  • 2.  Re: We get latency issues which then cause Oracle RAC to force its disks offline

    Posted 12-29-2014 07:22 AM

    Hi,

     

    You can configure port fencing to automatically disable an F_Port that misbehaves and starts causing frame discards on the switch port.
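
    For anyone who wants to see the idea in action before committing to Fabric Watch fencing, here is a rough Python sketch of the same behaviour: poll the Class 3 discard counters and disable a port once they spike. This is an illustration only, not Brocade's own mechanism; the switch address, credentials, threshold, polling interval and the position of the "disc c3" column in porterrshow output are assumptions you would need to verify against your own FOS release.

    import re
    import subprocess
    import time

    SWITCH = "admin@san-switch-01"   # hypothetical switch and user
    DISC_C3_THRESHOLD = 20           # discards per polling interval (assumption)
    POLL_SECONDS = 10                # sub-minute polling interval

    def run_fos(command: str) -> str:
        """Run a FOS CLI command over SSH and return its output."""
        return subprocess.run(
            ["ssh", SWITCH, command],
            capture_output=True, text=True, check=True,
        ).stdout

    def to_int(token: str) -> int:
        """porterrshow abbreviates large counts (e.g. '1.2k', '3.4m')."""
        scale = {"k": 1_000, "m": 1_000_000, "g": 1_000_000_000}
        if token and token[-1] in scale:
            return int(float(token[:-1]) * scale[token[-1]])
        return int(token)

    def parse_disc_c3(output: str) -> dict:
        """Rough parse of porterrshow: return {port_index: disc_c3}.
        Column layout differs between FOS releases, so the field
        index below (9) is an assumption -- check your own output."""
        counters = {}
        for line in output.splitlines():
            match = re.match(r"\s*(\d+):\s+(.*)", line)
            if not match:
                continue
            fields = match.group(2).split()
            if len(fields) > 9:
                counters[int(match.group(1))] = to_int(fields[9])
        return counters

    def main() -> None:
        previous = parse_disc_c3(run_fos("porterrshow"))
        while True:
            time.sleep(POLL_SECONDS)
            current = parse_disc_c3(run_fos("porterrshow"))
            for port, value in current.items():
                delta = value - previous.get(port, value)
                if delta >= DISC_C3_THRESHOLD:
                    print(f"port {port}: {delta} C3 discards in {POLL_SECONDS}s, disabling")
                    run_fos(f"portdisable {port}")
            previous = current

    if __name__ == "__main__":
        main()

    Fabric Watch (and MAPS in later releases) does this natively and more safely; the sketch is only meant to make the fencing behaviour concrete.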

     

     

    Rgds,

    Felipon




  • 3.  Re: We get latency issues which then cause Oracle RAC to force its disks offline

    Posted 12-30-2014 02:53 AM

    Felipon,

     

    Thanks for your response.  I am already tracking C3 discards, and we tend to see isolated events within a single minute.  In a recent example a port logged 41 C3 discard errors in one minute.  The Fabric Watch timebase only seems to go as granular as one minute, so I suspect that by the time the port was fenced the damage causing the latency event would already have been done.  We then see no further errors on the port, although we still have the link checked out.

    I am not sure if MAPS in the next version will allow greater granularity than one minute.
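
    As a stop-gap for the one-minute timebase, the counters can also be sampled out-of-band at whatever interval you like, which at least timestamps the spike more precisely. Below is a minimal Python sketch, assuming SSH access to the switch; the switch address, port number and counter names (er_tx_c3_timeout / er_rx_c3_timeout) are assumptions, so check them against your own portstatsshow output first.

    import re
    import subprocess
    import time
    from datetime import datetime

    SWITCH = "admin@san-switch-01"                       # hypothetical switch and user
    PORT = 17                                            # hypothetical suspect port
    COUNTERS = ("er_tx_c3_timeout", "er_rx_c3_timeout")  # assumed counter names
    POLL_SECONDS = 5                                     # finer than the 1-minute timebase

    def read_counters() -> dict:
        """Fetch portstatsshow output and pull out the counters of interest."""
        out = subprocess.run(
            ["ssh", SWITCH, f"portstatsshow {PORT}"],
            capture_output=True, text=True, check=True,
        ).stdout
        values = {}
        for name in COUNTERS:
            match = re.search(rf"^{name}\s+(\d+)", out, re.MULTILINE)
            if match:
                values[name] = int(match.group(1))
        return values

    def main() -> None:
        previous = read_counters()
        while True:
            time.sleep(POLL_SECONDS)
            current = read_counters()
            for name, value in current.items():
                delta = value - previous.get(name, value)
                if delta:
                    stamp = datetime.now().isoformat(timespec="seconds")
                    print(f"{stamp} port {PORT}: {name} +{delta} in {POLL_SECONDS}s")
            previous = current

    if __name__ == "__main__":
        main()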




  • 4.  Re: We get latency issues which then cause Oracle RAC to force its disks offline

    Posted 01-01-2015 02:14 PM
    In order to avoid this, all of your top-critical systems (some people call this "tier 0", and I believe RAC qualifies) should be connected locally, i.e. host and storage on the same switch, and ideally even on the same ASIC if you are using multi-ASIC switches.
    For "tier 1" systems, you are better off using QoS "high" zoning. This isolates them in dedicated virtual circuits and so minimizes buffer-to-buffer credit issues coming from the lower QoS zones.
    All test/dev/temporary systems should be pushed down into QoS "low" zones; if they do something unexpected, that is their own problem and it will not affect the production environments. A zoning sketch along these lines follows below.
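
    For illustration, the tiering above can be expressed with the QOSH_/QOSL_ zone-name prefixes used to map a zone to high- or low-priority virtual circuits. The Python sketch below only prints candidate zoning commands; the tier map, aliases and configuration name are hypothetical, and QoS zoning support and licensing should be confirmed for your platform before enabling anything.

    # Hypothetical tier map: zone base name -> (priority prefix, member aliases).
    TIERS = {
        "rac_prod": ("QOSH_", ["rac_node1_hba1", "array1_port0"]),
        "app_tier1": ("QOSH_", ["app01_hba1", "array1_port1"]),
        "test_dev": ("QOSL_", ["devbox_hba1", "array2_port3"]),
    }

    CFG_NAME = "prod_cfg"  # hypothetical existing zone configuration

    def emit_commands() -> list:
        """Build zonecreate/cfgadd commands for each tier; review before use."""
        commands = []
        for base, (prefix, members) in TIERS.items():
            zone = f"{prefix}{base}"
            commands.append(f'zonecreate "{zone}", "{"; ".join(members)}"')
            commands.append(f'cfgadd "{CFG_NAME}", "{zone}"')
        commands.append("cfgsave")
        commands.append(f'cfgenable "{CFG_NAME}"')
        return commands

    if __name__ == "__main__":
        for command in emit_commands():
            print(command)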


  • 5.  Re: We get latency issues which then cause Oracle RAC to force its disks offline

    Posted 01-04-2015 07:24 AM

    Thanks Alexey,

     

     

    Unfortunately we are just as likely to see the latency caused by top-tier servers; the issues we see are not driven by spikes in workload but by random link errors. As you suggest, isolating the important hosts to the same ASIC would be good, but unfortunately we run our critical RAC clusters across two sites, which means we have to use a lot of shared infrastructure.

     

     

    Regards

    Tony

