08-03-2010 08:28 AM
I am having pretty much exactly the same issue as described here. Some differences:
Switches are Brocade 5000s running FOS 6.1.1d, the HBAs are dual-port QLogic cards, and the OS is CentOS 5.x using device-mapper multipath for MPIO.
Storage array is an AMS 2300.
We are seeing huge numbers of notxcredits on pretty much any port that carries traffic; all ports are at 4Gb. The servers (approx. 12 of them, web/email) push around 4 MB/s, spiking to 10 MB/s on occasion, and generate a constant stream of notxcredits, with the count sometimes jumping up quite a bit. Backups push the total to the disk array to about 60 MB/s across its two ports. There is no ISL traffic at the moment; servers and storage are on the same switch.
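In case it helps anyone digging into the same counters: on FOS you can watch `portstatsshow` per port and see whether `tim_txcrd_z` (time at zero transmit credit) keeps climbing, which points at the ports being starved. A rough sketch of the kind of comparison I mean, in Python against two saved captures - note the capture format below is a simplified stand-in, not real FOS output:

```python
# Diff two timed captures of (simplified) `portstatsshow` output and rank
# ports by growth in tim_txcrd_z, the time-at-zero-transmit-credit counter.
import re

def txcrd_z(capture):
    """Map port number -> tim_txcrd_z value from a simplified capture."""
    stats, port = {}, None
    for line in capture.splitlines():
        m = re.match(r"port:\s*(\d+)", line)
        if m:
            port = int(m.group(1))
            continue
        m = re.match(r"tim_txcrd_z\s+(\d+)", line.strip())
        if m and port is not None:
            stats[port] = int(m.group(1))
    return stats

def rising_ports(before, after, threshold=0):
    """Ports whose counter grew more than threshold between samples, worst first."""
    b, a = txcrd_z(before), txcrd_z(after)
    deltas = {p: a[p] - b.get(p, 0) for p in a}
    return sorted((p for p, d in deltas.items() if d > threshold),
                  key=lambda p: -deltas[p])

# Invented sample captures, taken some minutes apart:
sample_1 = "port: 0\n  tim_txcrd_z  1000\nport: 1\n  tim_txcrd_z  50"
sample_2 = "port: 0\n  tim_txcrd_z  904500\nport: 1\n  tim_txcrd_z  60"
print(rising_ports(sample_1, sample_2))  # -> [0, 1]: port 0 is the worst offender
```

Ports whose counter barely moves are fine; a port racking up millions between samples is the one being starved (or doing the starving).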
No fibre lengths past about 15m, all in the same room.
We have seen occasional path failures, and recently the array seemed to vanish entirely!
Problems seemed to worsen when we tidied up our fabric (moving all servers onto the switches with their associated storage ports).
Any ideas as to how we reduce these errors? Did the original poster resolve his issue? I have also wondered about dropping the port speed to 2Gb or 1Gb.
Cheers in advance.
10-07-2010 09:45 AM
This issue has been identified and, for the most part, contained. I'm not sure there is much more I can do about it.
I thought the issue was fabric-wide. It ended up being between initiator and target ports only: only initiators that shared target ports with the slow-drain device were affected.
I still believe it to be a host issue. The host isn't returning R_RDYs fast enough, so the target port gets robbed of its buffer credits, and everything else using those target ports sees a huge increase in response times.
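A back-of-envelope sketch of why the robbed credits inflate response times for everyone on the port - all the numbers here are invented, and "hold time" just means how long a credit stays outstanding before the host sends R_RDY back:

```python
# Toy model: a target port's max frame rate is roughly its buffer-to-buffer
# credits divided by the average time each credit is held before R_RDY returns.
credits = 8          # B2B credits on the shared target port (invented)
hold_fast = 0.001    # seconds a healthy host holds a credit (invented)
hold_slow = 0.050    # seconds the slow-drain host holds a credit (invented)

# All credits cycling quickly:
rate_alone = credits / hold_fast
# Slow-drain host sitting on half the credits; only 4 cycle quickly:
rate_shared = 4 / hold_fast + 4 / hold_slow
print(rate_alone, rate_shared)  # 8000.0 vs 4080.0 frames/sec
```

So even though the slow host only holds half the credits, the port's effective frame rate roughly halves, and every other initiator queuing behind that port feels it as latency.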
The source CLARiiON is the one suffering. The backup CLARiiON isn't affected.
It's one thing specifically that causes the issue.
Our DBAs were running 4-stream backups. With 4 streams, response times go up to 300-400ms. With 2 streams, response times drop to about 80ms on everything. With 1 stream, response times are unaffected. I see the same amount of bandwidth used no matter how many streams they use.
Is there an easy way to contain an issue like this? I'd love to be able to limit the amount of buffer credits an initiator can take from my target ports. The host is sucking up all the buffer credits and not returning any R_RDYs. Is there a way to tell the target to only extend so many buffer credits to a specific port, but treat every other port normally? I only have a few hosts with this problem - only old servers. New servers running 4-stream backups just cause response times for that particular LUN to go up, not for anything sharing the same target ports.
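For what it's worth, some FOS levels do offer knobs in this direction - ingress rate limiting on the slow host's F_Port (Adaptive Networking) and slow-drain detection (bottleneckmon, FOS 6.3+). Availability and exact syntax vary by release and license, so treat these as pointers to check against the FOS Command Reference rather than a recipe; the port numbers below are made up:

```
switch:admin> portcfgqos --setratelimit 4/15 2000
switch:admin> bottleneckmon --enable 4/15
```

Rate limiting the initiator's ingress won't give credits back directly, but it caps how much the slow host can have in flight, which limits how many credits it can hold hostage at once.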
I got them (the DBAs) to go down to 2 streams, and moved the slow-drain device to its own set of target ports. I'd love to get them to go to 1 stream for RMAN backups, but I'll take what I can get.