Extreme 3825i APs keep "downgrading" to older firmware

We have a mixture of 3825 and 3715 APs. After upgrading to v10 software we updated the AP firmware as usual. The trouble is that we can get all 1,037 APs up to v10 firmware, wait a few days, and about 4 APs will show up with v9 firmware. Wait another day and the number jumps to 7. It is always the same 7. I can set the controller to Controlled Upgrade, upgrade these 7, SSH into each AP, see that it has v10 software, reboot it, and log back in to verify v10 again. I can also set the controller to "Always Upgrade APs to default image", which is v10. No matter what, within a couple of days an AP or two shows up with the old v9 firmware. These are APs I have SSH'd into and seen running v10, and which the controller also listed as v10. This cycle has repeated 4 times in the last 10 days.

It is as if they are factory resetting themselves for some reason. Does anyone know what is wrong? Are they just bad APs that need to be RMA'd?

8 replies

Userlevel 3
Are many/any of those APs connecting/registering with their controller(s) over a WAN/VPN?
They are all at schools, which the ISP routes back to the controller site as "local" traffic. But the traffic does have to leave the site and be routed. The problem APs are at 3 different locations. The two controllers are also at two geographically separated locations. We are not having this problem with any APs homed to the second controller.
Userlevel 3
I would question whether the upgrade is ever truly finishing "successfully", even though you report that you SSH to the affected APs and see the v10 version at the CLI. In my experience, if you leave the MTU at the default of 1500 for APs that connect back across a WAN, VPN, MetroE, dark fiber, extended LAN or similar link, you get a lot of packet fragmentation. That can cause instability in the tunnels the AP uses back to both local and foreign controllers, unreliable passing of the configuration from controller to AP, and AP firmware upgrades that take too long to complete, complete with a corrupted image, or never finish at all. This is usually because fragmented packets have to be resent, and when resent they sometimes arrive out of order. Can you please tell me whether the MTU for the APs in question is still at the default of 1500? If so, resize it appropriately using the following KCS article as a guideline: https://gtacknowledge.extremenetworks.com/articles/Solution/IdentiFi-Wireless-AP-s-do-not-have-backu...
I have now set the problem APs to 1400. Trouble is the primary and backup tunnels were fine. There are over two dozen APs at each of these locations and no more than 2 failing at any location.

I have changed the MTU settings and I am going through the upgrade process again. It will be a few days before the problem creeps back up. Fingers crossed.
Userlevel 3
Did you ping test with progressively smaller sizes, using the -f and -l parameters, until you reached a threshold where the pings went through without being fragmented? Or did you just take the 1400 value given as an example in the KCS article and apply it without testing? Your network may require a lower MTU, or may even allow a higher MTU, than the example given in the article. Just trying to make sure you get the best result possible. There may be something else afoot here, but getting the MTU set correctly can only improve controller <> AP communication and increase the chances that the upgrade goes smoothly and correctly.
1500 failed. 1400 passed. (actually 1472 is the magic number here.)
Userlevel 7
Joshua Beddingfield wrote:

1500 failed. 1400 passed. (actually 1472 is the magic number here.)

If you used Windows, specifying a size of 1472 bytes for the ping payload actually results in a 1500-byte IP packet, because the IP and ICMP headers are not taken into account by the -l option. Thus the MTU does not seem to be a problem in this case.

Whether the IP and ICMP headers are excluded from the size option of a given ping program depends on the implementation.
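
To make the arithmetic explicit, here is a minimal sketch in Python. It assumes plain IPv4 with a 20-byte header (no IP options) and the 8-byte ICMP echo header; the function names are made up for illustration.

```python
# Header sizes assumed here: IPv4 without options, ICMP echo request.
IPV4_HEADER_BYTES = 20
ICMP_HEADER_BYTES = 8


def ip_packet_size(ping_payload: int) -> int:
    """Size of the IP packet on the wire for a given ping payload size."""
    return ping_payload + IPV4_HEADER_BYTES + ICMP_HEADER_BYTES


def max_ping_payload(mtu: int) -> int:
    """Largest ping payload that fits in one unfragmented packet for an MTU."""
    return mtu - IPV4_HEADER_BYTES - ICMP_HEADER_BYTES


print(ip_packet_size(1472))    # 1500
print(max_ping_payload(1500))  # 1472
print(ip_packet_size(1400))    # 1428
```

So a 1472-byte payload passing with "don't fragment" set corresponds to a full 1500-byte path MTU.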
Userlevel 3
Awesome. I usually test 1450 if 1400 works ... and then 1475 if 1450 works ... and if 1475 fails ... just go with 1450. No need to cut it too fine. I have seen certain environments that needed the MTU to be set below 1000 though to work properly. Glad you tested for your specific environment from the controller's location to the AP's location or vice versa. After you've performed the Controlled Upgrade to those 7 APs ... please let us know if the firmware version "sticks" better or not.
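
The stepwise testing described above could also be automated. The following is only a rough sketch in Python, not an Extreme tool: it binary-searches for the largest ping payload that gets through with "don't fragment" set. The helper names are hypothetical, `windows_df_ping` assumes the Windows `ping` options `-f`, `-l` and `-n`, and treating "TTL=" in the output as success is a heuristic that may not hold on every system.

```python
import subprocess
from typing import Callable


def windows_df_ping(host: str, payload: int) -> bool:
    """Send one ping with DF set (-f) and the given payload size (-l).

    Assumption: an echo reply line contains "TTL=" in the command output.
    """
    result = subprocess.run(
        ["ping", "-f", "-l", str(payload), "-n", "1", host],
        capture_output=True,
        text=True,
    )
    return "TTL=" in result.stdout


def largest_unfragmented_payload(
    probe: Callable[[int], bool], low: int = 0, high: int = 1472
) -> int:
    """Return the largest payload in [low, high] for which probe() succeeds.

    Assumes that if a size passes, every smaller size passes too.
    Returns -1 if even the smallest size fails.
    """
    if not probe(low):
        return -1
    while low < high:
        mid = (low + high + 1) // 2  # round up so the loop always shrinks
        if probe(mid):
            low = mid       # mid got through, look for something larger
        else:
            high = mid - 1  # mid was too big
    return low
```

Usage would look something like `largest_unfragmented_payload(lambda n: windows_df_ping("192.0.2.10", n))`, where the address is a placeholder for the AP or controller. Adding 28 bytes of IP and ICMP headers to the result gives the path MTU, and leaving some margin below that, as suggested above, is still sensible.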