Question

[LAG/LACP] Link Up Transitions counters shown different on two peers


Hello Team,

At a customer site, we have two BD-X8 switches directly connected with LAG/LACP configured. However, when we monitor them, the Link Up Transitions counters shown on each switch are different. Could you please explain how this is possible? Thank you in advance.

[on M3_MES_DR_Svr_4]
M3_MES_DR_Svr_4.1 # sho ports sharing
Load Sharing Monitor
Config    Current Agg     Min    Ld Share  Ld Share  Agg  Link   Link Up
Master    Master  Control Active Algorithm Group     Mbr  State  Transitions
================================================================================
1:24      1:24    LACP    1      port      1:24      Y    A      4
                                 port      1:25      Y    A      3
                                 port      2:24      Y    A      3
                                 port      2:25      Y    A      2

M3_MES_DR_Svr_4.2 # sho lacp member-port 1:24 detail
Member     Port      Rx           Sel          Mux            Actor     Partner
Port       Priority  State        Logic        State          Flags     Port
--------------------------------------------------------------------------------
1:24       0         Current      Selected     Collect-Dist   ATGSCD--  1024
Up : Yes
Enabled : Yes
Link State : Up
Actor Churn : False
Partner Churn : False
Ready_N : Yes
Wait pending : No
Ack pending : No
LAG Id:
S.pri:0 , S.id:00:04:96:9a:ca:c0, K:0x0400, P.pri:0 , P.num:1024
T.pri:0 , T.id:00:04:96:9c:c4:40, L:0x0400, Q.pri:0 , Q.num:1024
Stats:
Rx - Accepted : 17170402
Rx - Dropped due to error in verifying PDU : 0
Rx - Dropped due to LACP not being up on this port : 0
Rx - Dropped due to matching own MAC : 0
Tx - Sent successfully : 17170491
Tx - Transmit error : 0
================================================================================
Actor Flags: A-Activity, T-Timeout, G-Aggregation, S-Synchronization
C-Collecting, D-Distributing, F-Defaulted, E-Expired

M3_MES_DR_Svr_4.3 # sho edp ports 1:24
Port  Neighbor          Neighbor-ID              Remote  Age  Num
                                                 Port         Vlans
=============================================================================
1:24  M3_MES_DR_Svr_3   00:00:00:04:96:9a:ca:c0  1:24    39   2
=============================================================================

[On M3_MES_DR_Svr_3]

M3_MES_DR_Svr_3.1 # sho sharing
Load Sharing Monitor
Config    Current Agg     Min    Ld Share  Ld Share  Agg  Link   Link Up
Master    Master  Control Active Algorithm Group     Mbr  State  Transitions
================================================================================
1:24      1:24    LACP    1      port      1:24      Y    A      0
                                 port      1:25      Y    A      0
                                 port      2:24      Y    A      0
                                 port      2:25      Y    A      0

M3_MES_DR_Svr_3.2 # sho lacp member-port 1:24 detail
Member     Port      Rx           Sel          Mux            Actor     Partner
Port       Priority  State        Logic        State          Flags     Port
--------------------------------------------------------------------------------
1:24       0         Current      Selected     Collect-Dist   ATGSCD--  1024
Up : Yes
Enabled : Yes
Link State : Up
Actor Churn : False
Partner Churn : False
Ready_N : Yes
Wait pending : No
Ack pending : No
LAG Id:
S.pri:0 , S.id:00:04:96:9a:ca:c0, K:0x0400, P.pri:0 , P.num:1024
T.pri:0 , T.id:00:04:96:9c:c4:40, L:0x0400, Q.pri:0 , Q.num:1024
Stats:
Rx - Accepted : 17170655
Rx - Dropped due to error in verifying PDU : 0
Rx - Dropped due to LACP not being up on this port : 0
Rx - Dropped due to matching own MAC : 0
Tx - Sent successfully : 17170567
Tx - Transmit error : 0
================================================================================
Actor Flags: A-Activity, T-Timeout, G-Aggregation, S-Synchronization
C-Collecting, D-Distributing, F-Defaulted, E-Expired

6 replies

The port transition counters can be somewhat unreliable for troubleshooting, for a couple of reasons:

  • The counters can be cleared manually or by a reboot of the switch.
  • The counters only accumulate; they do not record when each event happened. If a switch has been up for 2 years and shows 2 transitions, it is impossible to know when those transitions occurred (other than the last one).
Based on the above, it is possible that the counters were cleared on one of the switches during troubleshooting, or that one of the switches was rebooted, which would have reset its transition counters. Running "show switch" on both switches and comparing the uptime can confirm or rule out a reboot.

If you run "show port info detail" on the individual ports in the LAG, you can see the last time each port went up or down, provided the counters weren't cleared.

My guess is that one of these scenarios happened, but it will be difficult to determine which one.
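
To make the side-by-side comparison less error-prone, here is a minimal Python sketch that pulls the Link Up Transitions column out of pasted "show ports sharing" output from both peers and lists the ports where the values differ. It only relies on the row layout shown in this thread (slot:port numbering, with the last four fields being group port, Agg Mbr, Link State, Link Up Transitions). The function names are made up for illustration, and it assumes each member port connects to the same port number on the peer, which the EDP output confirms only for 1:24.

```python
import re

# Matches the tail of one member row as pasted in this thread, e.g.
#   "1:24 1:24 LACP 1 port 1:24 Y A 4"   or   "port 1:25 Y A 3"
# Captured fields: group port, Agg Mbr (Y or -), Link State, Link Up Transitions.
# Assumption: ports use slot:port notation, as on the BD-X8 output above.
ROW = re.compile(r"(\d+:\d+)\s+([Y-])\s+(\S+)\s+(\d+)\s*$")


def link_up_transitions(output):
    """Return {member_port: link_up_transitions} parsed from pasted CLI output."""
    counters = {}
    for line in output.splitlines():
        match = ROW.search(line)
        if match:
            port, _agg_member, _link_state, transitions = match.groups()
            counters[port] = int(transitions)
    return counters


def compare_peers(local_output, remote_output):
    """Return {port: (local, remote)} for ports whose counters differ.

    Assumption: the same port number is used on both ends of each link.
    """
    local = link_up_transitions(local_output)
    remote = link_up_transitions(remote_output)
    return {
        port: (local[port], remote[port])
        for port in sorted(local.keys() & remote.keys())
        if local[port] != remote[port]
    }


# Member rows copied from the two outputs in the question.
SVR_4 = """\
1:24 1:24 LACP 1 port 1:24 Y A 4
port 1:25 Y A 3
port 2:24 Y A 3
port 2:25 Y A 2
"""

SVR_3 = """\
1:24 1:24 LACP 1 port 1:24 Y A 0
port 1:25 Y A 0
port 2:24 Y A 0
port 2:25 Y A 0
"""

if __name__ == "__main__":
    for port, (svr4, svr3) in compare_peers(SVR_4, SVR_3).items():
        print(f"{port}: Svr_4={svr4} Svr_3={svr3}")
```

A mismatch on every member port, with one side at exactly 0, points more toward a clear or reboot on that side than toward four independent link problems, but the script itself cannot tell those cases apart.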
Hello Team,

May I have any feedback? Thank you in advance.
Hello SteveNguyen,

It is impossible to tell why these counters are off.

I have some theories but there is no way to prove them conclusively:

  • Counters were cleared on one switch but not the other.
  • One of the switches has been rebooted after the counters increased.
You can use the "show port info detail" command on each individual port to determine the last time the port went down and came up.
Hi!

It seems odd, and the first thing you should look at is when the counters for those interfaces were last cleared. If you had a reboot or cleared the interface counters on M3_MES_DR_Svr_3, that would explain it. If not, it seems strange, and the only thing I can come up with is that you've had single-ended problems, like fibers that are poorly connected on one end but not the other. It would be quite a coincidence for all four links to have that same problem in one and the same direction, though.

Are these optical links? If so, look at the optical RX values to see if they are stable and well above the warning limits (show ports transceiver information detail). Are the SFPs identical on both sides?

/Fredrik
Hi FredrikB,

Thanks for your feedback. Looking at the switch uptime and transceiver information, we suspect that the counters were cleared at some point, which is the only explanation that fits.
I found a pretty recent reply in a TAC case I have with a customer's X460-G2 pair:

"CR# xos0073360 - “X460G2 10G ports are flapping while there is egress congestion on the peer end”
In this situation only one side will declare the link as down. But both ends will see the LACP flap.

This link flap is caused by some changes made in the SDK in the Broadcom chip. (by Broadcom)

As a workaround (until this SDK is fixed) the suggestion from engineering is to change the physical port debounce timers:

- configure port 30 debounce time 150
- configure port 32 debounce time 150
This setting may be increased a bit more if you still see LACP flaps."


I have no idea if this applies to your X8s, but there may be a relation as they're both EXOS platforms. Have a look at the LACP stats and see if both units have the same LACP flaps but different counters for interfaces. If so, your X8 may have the same issue in EXOS/Broadcom.
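
As a starting point for that check, a small Python sketch like the one below can read the "Stats:" block from pasted "show lacp member-port <port> detail" output and flag any drop or error counter that is not zero. It is based only on the line format shown in the question, and the function names are hypothetical. Note that these are LACPDU counters, not flap counters, so clean values here do not rule out LACP flaps on their own.

```python
import re

# Matches counter lines under "Stats:" as pasted in this thread, e.g.
#   "Rx - Accepted : 17170402"
STAT = re.compile(r"^\s*((?:Rx|Tx) - .+?)\s*:\s*(\d+)\s*$")


def lacp_stats(output):
    """Return {counter_name: value} for every Rx/Tx counter line found."""
    stats = {}
    for line in output.splitlines():
        match = STAT.match(line)
        if match:
            stats[match.group(1)] = int(match.group(2))
    return stats


def nonzero_errors(stats):
    """Return only the drop/error counters whose value is not zero."""
    return {
        name: value
        for name, value in stats.items()
        if ("Dropped" in name or "error" in name) and value != 0
    }


# Stats block copied from the M3_MES_DR_Svr_4 output in the question.
SVR_4_STATS = """\
Stats:
Rx - Accepted : 17170402
Rx - Dropped due to error in verifying PDU : 0
Rx - Dropped due to LACP not being up on this port : 0
Rx - Dropped due to matching own MAC : 0
Tx - Sent successfully : 17170491
Tx - Transmit error : 0
"""

if __name__ == "__main__":
    stats = lacp_stats(SVR_4_STATS)
    problems = nonzero_errors(stats)
    print(problems if problems else "no LACPDU drops or transmit errors")
```

Running the same parse on the output from both switches gives two dictionaries that can be compared directly; in the outputs posted above, all drop and error counters are 0 on both sides.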
