C5210 HA pair: APs disassociate from one controller and randomly reattach to the backup wireless controller


Userlevel 3
C5210: We recently upgraded to 09.21.11.0004 code, which we hoped would resolve this issue.

This system has nearly 1000 APs spread across the two controllers.
We are seeing APs swap from their primary controller to the backup. This is totally random and unpredictable, so there is no packet capture to sniff so far (180/500 swap).

So far we have been advised to increase the poll timers for the APs (WASSP/CAPWAP): AP > Global Settings > AP Registration > discovery timers.

There do not seem to be any underlying networking issues, as we have no other reported issues or concerns.

Is there a known issue?
Has anybody else seen this issue, and how was it resolved?
Can I prioritise the WASSP traffic through the network (DSCP?)

Regards

17 replies

Userlevel 7
Hi Rod,

I have no answers for you, but a general question: why use fast failover at all?

Does the network design require fast failover instead of legacy failover?
I'm a big fan of legacy failover; I use it for all my customer installations and don't see a problem with it.
How often does a controller become defective and unreachable? In that rare case I assume it doesn't matter whether you lose one ping or two until the APs switch to the second controller.
Userlevel 1
In reply to Ron:

I also have a C5210 controller and my APs fail over to the other controller. What is the difference between fast and legacy failover? Is there a way I can turn fast off?
Userlevel 3
In reply to Ron:

Hi
We inherited this installation and are therefore reluctant to make many changes. I used to install the previous version of this, before Enterasys bought it from Siemens.
Userlevel 7
To enable legacy failover, just remove the checkmark for fast failover.

Legacy failover is slower because the AP doesn't already have a tunnel established to the second controller. In my experience, "slower" means that you'd lose 1-2 pings during failover.

The difference is that legacy failover has two requirements that MUST both be fulfilled before the AP is allowed to authenticate/switch to the second controller:
1) the AP loses its connection to the home controller
2) the controllers lose the connection to each other (= availability tunnel down)

Let's talk about a case in which legacy failover does not work.
Say the APs connect via e.g. ESA0 and the availability tunnel is configured on e.g. ESA1.
If ESA0 goes down (e.g. a broken cable) on the home controller, the AP is no longer able to communicate with the controller, but as ESA1 is still up (= availability tunnel is still up) the AP is not allowed to authenticate/switch to the second controller.

If you use legacy failover, it's very important to use the same interface for AP registration and for the availability tunnel configuration.
In a "normal" setup, with both controllers in the same room and set up for the same subnets, that shouldn't be a problem and you can use legacy failover.

So the one thing you need to make sure of in the network design is that there is no case where the AP cannot reach the AP registration interface but the controllers can still reach each other via the availability interface.
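
The two legacy failover conditions described above can be sketched as a small truth-table check. This is only an illustration of the rule as stated in this post, not controller code; the names `LinkState` and `legacy_failover_allowed` are made up for the example.

```python
from dataclasses import dataclass


@dataclass
class LinkState:
    """Hypothetical snapshot of the two links that matter for legacy failover."""
    ap_reaches_home: bool         # AP can still reach its home controller
    availability_tunnel_up: bool  # controllers can still reach each other


def legacy_failover_allowed(state: LinkState) -> bool:
    """Return True only when BOTH conditions from the post are met:
    the AP has lost its home controller AND the availability tunnel is down."""
    return (not state.ap_reaches_home) and (not state.availability_tunnel_up)


# The ESA0/ESA1 example: the AP link is broken but the tunnel is still up,
# so the AP is stranded and may not move to the second controller.
stranded = LinkState(ap_reaches_home=False, availability_tunnel_up=True)
print(legacy_failover_allowed(stranded))
```

Running the AP registration and the availability tunnel over the same interface removes the "stranded" combination, because both links then fail together.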
Userlevel 3
Thanks for the reply. I'm not sure that my customer will accept that as a "solution"; as a workaround, yes.

I have been looking at changing the AP timers. Is there a difference between version 9 and 10?

I am also looking at an ACL policy to put the UDP AP WASSP traffic into QP8.

I will talk to my customer about removing the "fast failover" option.
Userlevel 3
In reply to Rod Robertson:

Hi

We are on version 09.21. Are the various timers different in version 10?

James A:

What are your timers set to currently? I have had this problem in the past, but it doesn't happen any more. My AP poll timeout is set to 4 seconds, discovery timeout is 3 seconds, and detect link failure is 2 seconds.

Just to be sure, the APs aren't rebooting are they? What topology are your clients in, B@AP, B@EWC or routed?
Userlevel 3
In reply to James A:

We are using bridge at EWC (B@EWC) for all APs (approx. 1000). I have a meeting with the customer next week to come up with a plan for how we are going to try to resolve the issue.
Userlevel 3
In reply to James A:

Hi

Our timers were set to default. We had been advised by GTAC to extend the timer to 60; we have done this for a group of APs and are now waiting to see what happens.
Userlevel 2
Hi Rod, I ran into the same situation a few weeks ago. I have a pair of C5210s in HA with 1200+ APs on them. We broadcast a few SSIDs via both B@AP and B@EWC. Things were stable for a very long time. A few weeks ago we started seeing the APs bouncing between controllers. After spending some time looking at and adjusting the timers, we contacted GTAC and were instructed to upgrade from 09.21.07 to 09.21.12. That seems to have resolved the issue.

We are still not sure why it started happening. We were on the 09.21.07 code for a very long time without issue.

Good Luck
Userlevel 3
In reply to Rich Pacheco:

Many thanks. We recently upgraded to 09.21.11.0004, which was an Extreme recommendation. Going back to the customer and arranging another upgrade is something I do not look forward to without an explicit statement from Extreme.

Can somebody from Extreme comment on this: does upgrading the controllers to 09.21.12.X resolve this issue?
Userlevel 1
@Rich - If your Wireless network was running on 9.21.07 for a long time, then recently Access Points started moving, was there some other change in the network that could have altered the traffic dynamics?

@Rod - We are actively working on all reported cases of APs timing out or moving between their respective controllers. If you haven't already looked in the Knowledgebase for Poll Timeout articles, you can try this article:

https://gtacknowledge.extremenetworks.com/articles/Solution/IdentiFi-Access-Points-reboot-due-to-Pol...

However, as you noted above, if there are no outstanding problems in the network and the AP moves are random, then getting a good packet capture from either the AP Ethernet port or the controller port where the APs register could be difficult. It is nevertheless a necessary piece to help us understand why the APs are moving.

There are no differences in the timers between version 9 and version 10 firmware.

WASSP packets are already sent with a high priority.
Userlevel 2
In reply to Scott Whall:

Hi Scott,

Working at a university, all of our upgrades/changes were completed before the start of the semester (9/1). We have been in monitor/fix mode since then without any major issues. We really try not to make any significant changes (wired or wireless) during the semester unless it's absolutely necessary.
Userlevel 3
In reply to Scott Whall:

Hi
Thanks for this info. How is the WASSP traffic prioritized: by DSCP? If so, what value?
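
The thread does not state which DSCP value (if any) the controller puts on WASSP packets. One way to check would be to read the IP ToS / Traffic Class byte from a packet capture: DSCP is the upper six bits of that byte. A minimal sketch of the conversion follows; the function names are made up for the example and the value used below (EF, 46) is only a familiar test value, not a claim about WASSP.

```python
def dscp_from_tos(tos: int) -> int:
    """Extract the 6-bit DSCP value from an 8-bit IP ToS / Traffic Class byte."""
    if not 0 <= tos <= 0xFF:
        raise ValueError("ToS byte must be in the range 0..255")
    return tos >> 2  # drop the two ECN bits


def tos_from_dscp(dscp: int, ecn: int = 0) -> int:
    """Build the 8-bit ToS byte from a DSCP value and optional 2-bit ECN field."""
    if not 0 <= dscp <= 63:
        raise ValueError("DSCP must be in the range 0..63")
    if not 0 <= ecn <= 3:
        raise ValueError("ECN must be in the range 0..3")
    return (dscp << 2) | ecn


# Example with a well-known value: a ToS byte of 0xB8 carries DSCP 46 (EF).
print(dscp_from_tos(0xB8))
```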
Userlevel 4
After one of our minor controller upgrades on 10.x, we were seeing about 10% of our APs continuously move between controllers. I worked with GTAC and the only fix we could come up with was to factory reset each AP, using the 'cset factory' command via SSH. I monitored the failover events using tunnel activation messages from syslog and reset each AP that generated an alarm over the course of a week, after which there were no further failovers. I did not get an explanation as to why this behavior occurs. We are upgrading our controllers again later this week and I will post if the issue recurs.
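
For anyone wanting to track failovers the same way, here is a rough sketch of counting tunnel activation messages per AP from syslog lines. The message format and the regular expression are assumptions for illustration only; the thread does not show the real log text, so the pattern must be adjusted to whatever your controller actually logs.

```python
import re
from collections import Counter
from typing import Iterable

# HYPOTHETICAL message format, e.g. "... tunnel activated for AP=AP-0001".
# Replace this pattern with one that matches your controller's real syslog output.
EXAMPLE_PATTERN = re.compile(
    r"tunnel activat\w*.*?\bAP[= ](?P<ap>[\w.:-]+)", re.IGNORECASE
)


def count_failovers(lines: Iterable[str],
                    pattern: "re.Pattern[str]" = EXAMPLE_PATTERN) -> Counter:
    """Count matching tunnel activation messages per AP identifier."""
    counts: Counter = Counter()
    for line in lines:
        match = pattern.search(line)
        if match:
            counts[match.group("ap")] += 1
    return counts


# Made-up sample lines, only to show how the function is called.
sample = [
    "Oct  1 10:00:00 ewc1 tunnel activated for AP=AP-0001",
    "Oct  1 10:05:00 ewc1 tunnel activated for AP=AP-0002",
    "Oct  1 10:07:00 ewc1 some unrelated message",
    "Oct  1 11:00:00 ewc1 tunnel activated for AP=AP-0001",
]
for ap, n in count_failovers(sample).most_common():
    print(ap, n)
```

APs that appear repeatedly in the result are the candidates for the factory reset described above.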
Userlevel 2
In reply to Joshua Puusep:

Thanks for the update. I'm curious to see if it starts happening again.
Userlevel 4
In reply to Joshua Puusep:

We rolled out 10.11.04.0008 this morning and the APs have remained stable so far.
