We are using Fabric Watch and Bottleneckmon to highlight issues, but this is all after the event. We can address the hardware issues that cause the latency afterwards, but meanwhile the SAN-wide issue causes Oracle RAC to drop disks. How can we prevent the impact in the first place? We are running FOS 7.1.0c across the fabrics.
You can configure port fencing to disable an F-port that misbehaves and starts discarding frames on the switch port.
Thanks for your response. I am already tracking C3 discards; we seem to see isolated events within the same minute. In a recent example we had 41 C3 discard errors from a port in one minute. The Fabric Watch timebase only seems to go as granular as one minute, so I suspect that by the time the port was fenced the damage (the latency event) would already have been done. We then seem to see no further errors on the port, but we still get the link checked out.
I am not sure if MAPS in the next version will allow greater granularity than one minute.
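To make the granularity concern concrete, here is a rough Python sketch (not FOS, Fabric Watch, or MAPS code) comparing when a threshold trips if it is only evaluated at the end of a one-minute timebase versus on a short sliding window. The function names, the threshold of 20, the 5-second window, and the burst timing are all assumptions for illustration; only the figure of 41 discards within a minute comes from the post above.

```python
from collections import deque


def fixed_window_trip_time(event_times, threshold, window=60.0):
    """Time at which a once-per-timebase check would trip, or None.

    Events are counted per fixed window and compared against the
    threshold only when that window ends.
    """
    counts = {}
    for t in event_times:
        bucket = int(t // window)
        counts[bucket] = counts.get(bucket, 0) + 1
    for bucket in sorted(counts):
        if counts[bucket] >= threshold:
            return (bucket + 1) * window  # evaluated at end of the window
    return None


def sliding_window_trip_time(event_times, threshold, window=5.0):
    """Time of the event that brings the trailing-window count to threshold."""
    recent = deque()
    for t in sorted(event_times):
        recent.append(t)
        # Drop events that have aged out of the trailing window.
        while recent and recent[0] <= t - window:
            recent.popleft()
        if len(recent) >= threshold:
            return t
    return None


# Hypothetical burst: 41 discards spread over ~2 s, starting 10 s into the minute.
burst = [10 + i * 0.05 for i in range(41)]
print(fixed_window_trip_time(burst, threshold=20))
print(sliding_window_trip_time(burst, threshold=20))
```

With these assumed numbers, the one-minute check does not fire until the minute ends, while the sliding window fires about a second into the burst. Whether that would be fast enough to avoid the latency event depends on how quickly the fabric backs up, which this sketch does not model.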
Unfortunately we are as likely to see the latency caused by top-tier servers; the issues we see are not caused by spikes in workload but by random link errors. As you suggest, isolating the important stuff to the same ASIC would be good, but unfortunately we are running our critical RAC clusters across two sites, which means that we have to use a lot of shared infrastructure.