Extreme Networks

Rahman_Duran1 · ‎10-21-2015

Hi,

I am trying to find the cause of a network problem in one of our buildings. We have one C5 and six 48-port A4 switches there. The C5 is configured as the router, and all the A4s are access switches connected directly to the C5.

The problem is that on 4 of the A4 switches there is very high packet loss when I ping them. The clients connected to these switches also can't reach the internet because of the high packet loss and high latency. When I ping one of these switches from another building, it shows ~50000 ms and ~20-30% packet loss. If I disable all access ports on that switch, ping times drop to 1-2 ms instantly. When I enable them, the problem reappears.

I looked at the logs on the switch and saw only 1-2 messages that look abnormal:
TXQMONITO[217085048]: txq_monitor.c(507) 2097 This is from manager 1 %% Tx queue for interface ge.1.50 is in stalled mode

ge.1.50 is the uplink port to C5. So I found this KB entry: https://extremeportal.force.com/ExtrArticleDetail?an=000077973
I disabled flowcontrol as it suggests, and the problem seems to be resolved.

So why did we encounter this problem? There are only 4-5 active clients per switch. And why did we see it on only 4 of the 6 A4H switches, not all of them? Does anyone have an idea? We have over 30 A4H deployed and have never seen a problem like this.

Paul_Poyant · ‎10-22-2015

The RX Pause symptom on A4 port fe.1.37 suggests that the attached client is the apparent origination of this issue. The TX Pause symptom on A4 port ge.1.50 suggests that the A4 is passing the issue along to the core C5, which in turn may well be passing it along to the other A4s to which it is connected.

Yes, any device may send an unusually large number of Pause frames to its attached peer device. And, to make things more interesting, frequently upon reboot of such a device all is well - for a while - until it starts up again later.

You may already have determined that the Pause count statistics only have meaning when assessed against the uptime of the unit in question. Typically for the most-affected port(s) I calculate down to Rx Pause frames received per second, averaged over the uptime period. What's the actionable limit? Hard to say. One received Pause frame per second on a given port over an extended period of time would likely be affecting the equipment in some undesirable way. As seen in your case, such activity on one port has the potential to affect multiple switches in a flowcontrol domain. This is a good thing to be aware of, whenever a complaint of poor performance arises. [Consider that to be a suggested Note to Self!]
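The per-second averaging described above can be sketched in a few lines of Python. This is only an illustration: the function name and the six-hour uptime are assumptions made for the example (the thread does not give the switch's uptime), the 21320 count is the Rx Pause value reported in this thread for fe.1.37 on 10.141.1.4, and the one-frame-per-second figure is the rough rule of thumb from this post, not a documented limit.

```python
def rx_pause_rate(rx_pause_count: int, uptime_seconds: float) -> float:
    """Average Rx Pause frames per second over the unit's uptime."""
    if uptime_seconds <= 0:
        raise ValueError("uptime_seconds must be positive")
    return rx_pause_count / uptime_seconds


# Rough rule of thumb from the post above, not a vendor-documented threshold.
SUSPECT_RATE_PER_SECOND = 1.0

# 21320 is the Rx Pause count reported for fe.1.37 in this thread.
# The uptime is a made-up value; substitute the unit's real uptime.
assumed_uptime_seconds = 6 * 3600
rate = rx_pause_rate(21320, assumed_uptime_seconds)

print(f"fe.1.37: {rate:.2f} Rx Pause frames/s")
if rate >= SUSPECT_RATE_PER_SECOND:
    print("sustained rate is at or above the rule-of-thumb level")
else:
    print("sustained rate is below the rule-of-thumb level")
```

Because the counters accumulate from boot, the same count means very different things on a unit that has been up for hours versus one that has been up for months, which is why the rate rather than the raw count is compared.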

Rahman_Duran1 · ‎10-22-2015

Update: If I enable flowcontrol on all the switches and disable it only on 10.141.1.6, the problem disappears.

Rahman_Duran1 · ‎10-22-2015

Hi Paul,

I am trying to find the real culprit. I enabled flow control on all switches again, and the problem instantly reappeared on switches 10.141.1.4/10.141.1.5/10.141.1.6/10.141.1.7.

The weird thing is that on the 10.141.1.4 switch there are Rx Pause values on an access port:

fe.1.37 0 21320
ge.1.50 65858 0

fe.1.37 is an access port that is supposed to be connected to a client device. ge.1.50 is the uplink to the C5.

On 10.141.1.7 one of the client ports also has a Tx Pause value:

fe.1.48 158 0
ge.1.50 0 3738

So is it normal to receive flow control packets on a client access port? The building is 25 km away from us, and I want to find the culprit before going on site to troubleshoot.

Thanks.

PS: here is the network topology:

[Image: network topology diagram (borcka-topology)]

Paul_Poyant · ‎10-21-2015

Typically a flowcontrol issue will arise in the presence of one or more connected nodes which are originating "excessive" quantities of flowcontrol packets directed at the affected switch. Just what might be considered excessive, and just how affected the switch might be as a result, would depend on a number of factors such as overall switch load, total available port buffering, stacking topology, and the precise timing of both flowcontrol packet and switched/routed packet activity.

Because of that "precise timing" element, it is possible that the txqmonitor feature will kick in to diminish any overall negative effect before anything is noticed by network users, and it is equally possible that network latency could result without txqmonitor being triggered to intervene. Note that though txqmonitor settings can be tuned, I haven't found great value in doing that.

As to why it's happening in the first place, the output of 'show port flowcontrol' or 'show txqmonitor flowcontrol' (these show essentially the same data) should go a long way toward directing further investigation, since for each port it identifies the count of Rx Pause packets received from directly attached peer devices and the count of Tx Pause packets transmitted to directly attached peer devices. Target the Rx Pause activity first, since this starts the process. These counts are since system boot, and they should not continue to accumulate while flowcontrol is locally disabled ('set flowcontrol disable') as you have configured here.
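As a rough aid for "target the Rx Pause activity first", the counters can be pasted into a small Python script that lists the ports with Rx Pause counts, highest first. This is a sketch under assumptions: it expects bare three-column lines of the form "port tx_pause rx_pause", which matches the values quoted elsewhere in this thread but is not a documented description of the real CLI output, so the pattern may need adjusting. All function and class names are made up for the example.

```python
import re
from typing import List, NamedTuple


class PauseCounters(NamedTuple):
    port: str
    tx_pause: int
    rx_pause: int


# Assumed line layout: "<port> <Tx Pause> <Rx Pause>", e.g. "fe.1.37 0 21320".
# Header lines or anything else that does not match are skipped.
LINE_RE = re.compile(r"^\s*((?:fe|ge|tg)\.\d+\.\d+)\s+(\d+)\s+(\d+)\s*$")


def parse_pause_counters(text: str) -> List[PauseCounters]:
    """Extract per-port pause counters from pasted CLI text."""
    rows = []
    for line in text.splitlines():
        match = LINE_RE.match(line)
        if match:
            port, tx, rx = match.groups()
            rows.append(PauseCounters(port, int(tx), int(rx)))
    return rows


def ports_by_rx_pause(rows: List[PauseCounters]) -> List[PauseCounters]:
    """Ports that received Pause frames, highest Rx Pause count first."""
    receiving = [row for row in rows if row.rx_pause > 0]
    return sorted(receiving, key=lambda row: row.rx_pause, reverse=True)


# Counter values quoted in this thread for switch 10.141.1.4.
sample = """\
fe.1.37 0 21320
ge.1.50 65858 0
"""

for row in ports_by_rx_pause(parse_pause_counters(sample)):
    print(f"{row.port}: Rx Pause {row.rx_pause}")
```

For the sample above only fe.1.37 is listed, since ge.1.50 has transmitted Pause frames but received none. A port that shows up here points at its directly attached peer as the next thing to examine.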

SecureStack A4H124-48 network problem
