I would like to understand the impact that traffic from one server in our environment has on another, so that I can find an appropriate solution to remediate it.
What we have:
The two servers in question are connected to an edge switch.
Server1's application is very sensitive to response time, which should stay below 3 ms. It uses a flash storage array connected to the core.
Server1 is connected to the edge DCX with 3 ports, each going to a different slot.
Server2 is connected with one port to another blade. Its read/write activity goes to another storage array located somewhere beyond the core.
When Server2 saturates all of its available port throughput (4 Gb/s), Server1 sees higher latency: 4-5 ms.
There are 8 trunk groups of 4 ISLs each to the core switch; each ISL is 16 Gb/s. They are utilized at 10% maximum.
The tim_txcrd_z counter ticks around 7000 times on one of Server1's ports during a 30-minute interval, 5000 times on another, and 4000 times on the third. That does not seem like enough to affect response time so significantly.
On 4 ISLs I see tim_txcrd_z increase by 80000, but I cannot determine whether these ISLs are used for transmitting Server1's frames.
So what causes the increase in Server1's response time?
A shortage of which resource, exactly? How could I determine it?
In the tim_txcrd_z output, have you checked the virtual circuits (VCs) within the ISLs? Is there a specific VC that is clocking most of the wait counts?
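For example, a minimal sketch of picking out the busiest VC from per-VC counter output. The sample text below is invented, in the style of a Brocade FOS `portstatsshow` excerpt (exact column layout is an assumption, so adjust the regex to your firmware's output):

```python
import re

# Hypothetical per-VC tx-credit-zero counters, styled after `portstatsshow`
# output (format assumed, values invented for illustration).
sample = """\
tim_txcrd_z_vc  0- 3:  0        81234    112      0
tim_txcrd_z_vc  4- 7:  0        0        56       0
tim_txcrd_z_vc  8-11:  0        0        0        0
tim_txcrd_z_vc 12-15:  0        0        0        0
"""

def busiest_vc(text):
    """Return (vc_index, ticks) for the VC with the most tx-credit-zero ticks."""
    best = (None, -1)
    for line in text.splitlines():
        m = re.match(r"tim_txcrd_z_vc\s+(\d+)-\s*\d+:\s+(.*)", line)
        if not m:
            continue
        base = int(m.group(1))                      # first VC index on this line
        counts = [int(c) for c in m.group(2).split()]
        for offset, count in enumerate(counts):
            if count > best[1]:
                best = (base + offset, count)
    return best

vc, ticks = busiest_vc(sample)
print(vc, ticks)  # -> 1 81234: VC 1 is clocking almost all the wait counts
```

If one VC dominates, that points at a particular destination/priority lane being back-pressured rather than the whole ISL.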
1. How long has it been since statsclear?
2. After statsclear, how long did Server2's saturation of its available port throughput last?
3. Have you checked the trunk masters of all 8 trunks for the tim_txcrd_z counter?
Points from my side:
1. A tim_txcrd_z increase of 80000 means the total latency cost is 80000 * 2.5 us = 200 ms, so the length of time it lasted is important.
2. Server2's speed is 4 Gb/s, which makes it a lower-performance device compared with the 16 Gb/s ISLs, so it can behave as a slow-drain device and add some latency on the ISLs.
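The arithmetic above can be checked quickly. This sketch assumes the usual 2.5 us granularity of the tim_txcrd_z counter and uses the tick counts and the 30-minute window quoted in the thread:

```python
TICK_US = 2.5  # tim_txcrd_z ticks once per 2.5 us spent at zero tx credit

def credit_zero_time_ms(ticks):
    """Total time (ms) a port sat with zero transmit credits."""
    return ticks * TICK_US / 1000.0

# ISL counter from the thread: an increase of 80000 ticks
print(credit_zero_time_ms(80000))  # -> 200.0 ms, matching point 1 above

# Server1's edge ports: 7000 / 5000 / 4000 ticks over a 30-minute window
interval_ms = 30 * 60 * 1000
for t in (7000, 5000, 4000):
    ms = credit_zero_time_ms(t)
    print(f"{t} ticks -> {ms:.1f} ms ({ms / interval_ms * 100:.4f}% of interval)")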