adoption lost after 20 seconds. Layer 3 connectivity normal


RWCampbell
New Contributor
Seemingly at random, about a third of our APs have become un-adopted and no longer function. We have found that when we restart the controller, all the APs reconnect, but a subset of them un-adopt again after 20 seconds or so. Perhaps notably, the controller is running v5.5. We want to upgrade this to the latest possible release, but we're not sure how to get the software.

As far as we can tell, the main symptom on the un-adopted APs is that we cannot MiNT ping the other device, and presumably we can't reach it at the MAC layer. We've compared configs of working and non-working APs, and they're identical save for the normal variables like names and IPs (minor variations). To our knowledge nothing changed to precipitate this. The system was used normally over the weekend, and the specific APs were not working this morning.

Any idea what would make the layer 2 / MiNT communication stop working?

-----

Below is a CLI story of the main points for one of the affected APs, followed by that AP's config. Any help would be greatly appreciated.


Controller: RFS-6010-1000-WR
Base ethernet MAC address is B4-C7-99-6D-B7-76
Mint ID: 19.6D.B7.76
IP Address: 10.200.17.10

AP: AP-6532-66040-US
Base ethernet MAC address is 84-24-8D-81-9C-88
Mint ID: 4D.81.9C.88
IP Address: 10.200.17.33

# debugs (from controller)

RFS-SW01# sh mint mlcp his

2018-10-25 11:54:15:cfgd unadopted 4D.81.9C.88
2018-10-25 11:54:15:Unadopted 84-24-8D-81-9C-88 (4D.81.9C.88), cfgd not notified
2018-10-25 11:54:15:Unadopting 84-24-8D-81-9C-88 (4D.81.9C.88) because it is unreachable
2018-10-25 11:53:59:Adopted 84-24-8D-81-9C-88 (4D.81.9C.88), cfgd notified

RFS-SW01#ping 10.200.17.33
PING 10.200.17.33 (10.200.17.33) 100(128) bytes of data.
108 bytes from 10.200.17.33: icmp_seq=1 ttl=64 time=3.99 ms
108 bytes from 10.200.17.33: icmp_seq=2 ttl=64 time=0.410 ms
108 bytes from 10.200.17.33: icmp_seq=3 ttl=64 time=0.359 ms
108 bytes from 10.200.17.33: icmp_seq=4 ttl=64 time=0.363 ms

--- 10.200.17.33 ping statistics ---
4 packets transmitted, 4 received, 0% packet loss, time 3004ms
rtt min/avg/max/mdev = 0.359/1.281/3.995/1.567 ms
RFS-SW01#mint ping 4D.81.9C.88
MiNT ping 4D.81.9C.88 with 64 bytes of data.
Ping request 1 timed out. No response from 4D.81.9C.88
Ping request 2 timed out. No response from 4D.81.9C.88
Ping request 3 timed out. No response from 4D.81.9C.88

--- 4D.81.9C.88 ping statistics ---
3 packets transmitted, 0 packets received, 100% packet loss
RFS-SW01#

RFS-SW01#show adoption offline
-----------------------------------------------------------------------------------------------------------------------------
MAC HOST-NAME TYPE RF-DOMAIN TIME OFFLINE CONNECTED-TO
-----------------------------------------------------------------------------------------------------------------------------
84-24-8D-81-9C-88 AP23 ap6532 TEMP DC 0:05:27
-----------------------------------------------------------------------------------------------------------------------------

# debugs (from ap)

AP23#show adoption status
Adopted by:
Type : RFS6000
System Name : RFS-SW01
MAC address : B4-C7-99-6D-B7-76
MiNT address : 19.6D.B7.76
Time : 0 days 00:03:07 ago

AP23#show mint mlcp history
2018-10-25 11:53:58:Received 0 hostnames through option 191
2018-10-25 11:53:57:Received OK from cfgd, adoption complete to 19.6D.B7.76
2018-10-25 11:53:56:Waiting for cfgd OK, adopter should be 19.6D.B7.76
2018-10-25 11:53:56:Adoption state change: 'Connecting to adopter' to 'Waiting for Adoption OK'
2018-10-25 11:53:53:Adoption state change: 'No adopters found' to 'Connecting to adopter'
2018-10-25 11:53:53:Try to adopt to 19.6D.B7.76 (cluster master 00.00.00.00 in adopters)
2018-10-25 11:53:52:Received 0 hostnames through option 191
2018-10-25 11:53:52:Adoption state change: 'Disabled' to 'No adopters found'
2018-10-25 11:53:52:DNS resolution completed, starting MLCP
2018-10-25 11:53:52:Adoption enabled due to configuration

AP23#ping 10.200.17.10
PING 10.200.17.10 (10.200.17.10) 100(128) bytes of data.
108 bytes from 10.200.17.10: icmp_seq=1 ttl=64 time=4.53 ms
108 bytes from 10.200.17.10: icmp_seq=2 ttl=64 time=0.355 ms
^C
--- 10.200.17.10 ping statistics ---
2 packets transmitted, 2 received, 0% packet loss, time 1001ms
rtt min/avg/max/mdev = 0.355/2.443/4.531/2.088 ms
AP23#mint ping 19.6D.B7.76
MiNT ping 19.6D.B7.76 with 64 bytes of data.
Ping request 1 timed out. No response from 19.6D.B7.76
Ping request 2 timed out. No response from 19.6D.B7.76
Ping request 3 timed out. No response from 19.6D.B7.76

--- 19.6D.B7.76 ping statistics ---
3 packets transmitted, 0 packets received, 100% packet loss
AP23#
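For anyone wanting to run this check across many APs at once, the symptom above (IP ping succeeds but MiNT ping times out) is easy to script. Here's a minimal Python sketch that just parses the statistics lines shown above; the function names are illustrative, and collecting the CLI output is left to whatever SSH tooling you already use:

```python
import re

def packet_loss(ping_output: str) -> float:
    """Extract the packet-loss percentage from ping statistics output."""
    m = re.search(r"(\d+(?:\.\d+)?)% packet loss", ping_output)
    if m is None:
        raise ValueError("no packet-loss line found")
    return float(m.group(1))

def diagnose(ip_ping: str, mint_ping: str) -> str:
    """Classify an AP from its IP-ping and MiNT-ping statistics output."""
    ip_ok = packet_loss(ip_ping) < 100.0
    mint_ok = packet_loss(mint_ping) < 100.0
    if ip_ok and not mint_ok:
        return "L3 OK, MiNT (L2) broken"   # the symptom in this thread
    if ip_ok and mint_ok:
        return "fully reachable"
    return "unreachable at L3"
```

Feeding it the two statistics lines captured above for AP23 classifies it as "L3 OK, MiNT (L2) broken".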

-----
code:
version 2.3
!
!
ip snmp-access-list default
permit any
!
firewall-policy default
no ip dos tcp-sequence-past-window
alg sip
!
!
mint-policy global-default
!
wlan-qos-policy default
qos trust dscp
qos trust wmm
!
radio-qos-policy default
!
wlan "WMS SSID"
description WMS RF Environment
ssid TEMP-WMS-RF
vlan 1
bridging-mode tunnel
encryption-type tkip-ccmp
authentication-type none
wpa-wpa2 psk 0 XXXXXXXXXX
service wpa-wpa2 exclude-ccmp
!
smart-rf-policy "TEMP DC Smart RF"
sensitivity custom
assignable-power 2.4GHz max 14
assignable-power 2.4GHz min 11
smart-ocs-monitoring client-aware 2.4GHz 1
!
!
management-policy default
no http server
https server
ssh
user admin password 1 XXXXXX role superuser access all
snmp-server community 0 private rw
snmp-server community 0 public ro
snmp-server user snmptrap v3 encrypted des auth md5 0 motorola
snmp-server user snmpmanager v3 encrypted des auth md5 0 motorola
!
profile ap6532 default-ap6532
ip name-server 10.200.16.12
ip name-server 10.200.16.11
ip domain-name TEMP.com
autoinstall configuration
autoinstall firmware
crypto ikev1 policy ikev1-default
isakmp-proposal default encryption aes-256 group 2 hash sha
crypto ikev2 policy ikev2-default
isakmp-proposal default encryption aes-256 group 2 hash sha
crypto ipsec transform-set default esp-aes-256 esp-sha-hmac
crypto ikev1 remote-vpn
crypto ikev2 remote-vpn
crypto auto-ipsec-secure
crypto load-management
crypto remote-vpn-client
interface radio1
wlan "WMS SSID" bss 1 primary
interface radio2
shutdown
interface ge1
ip dhcp trust
qos trust dscp
qos trust 802.1p
interface vlan1
ip address dhcp
ip address zeroconf secondary
ip dhcp client request options all
interface pppoe1
use firewall-policy default
rf-domain-manager capable
logging on
service pm sys-restart
router ospf
!
rf-domain "TEMP DC"
location "TEMP DC"
contact "Velociti Inc."
timezone America/Chicago
country-code us
use smart-rf-policy "TEMP DC Smart RF"
channel-list dynamic
channel-list 2.4GHz 1,6,11
control-vlan 1
!
ap6532 84-24-8D-81-9C-88
use profile default-ap6532
use rf-domain "TEMP DC"
hostname AP23
interface radio1
power 8
interface vlan1
ip address 10.200.17.33/21
!
!
end
17 REPLIES

ckelly
Extreme Employee
Robert,

I'm curious about the DDOS'ing of the handheld. So you're intentionally DDOS'ing the device? Curious as to why. Testing?
In any case though, the DDOS ends up triggering what you see as a MINT BC storm and switches are reacting due to this?

When the customer reported that the wifi was down, was the WLAN itself actually no longer being seen by the clients? Or were they still associated but just not able to pass traffic?

So you had about 5 APs that you noticed would not respond to MINT pings...but they were still adopted? This would appear to be contradictory in nature. If the APs are adopted, then MINT between the devices is working...and therefore a MINT ping should work. The caveat here is that there is a definable adjacency hold timer. But if this condition exists for a minute or longer and the APs are still shown to be adopted, then I'd have to say that MINT comms are still functioning between the AP and controller.
When this happens, try running show mint links on the controller. This will show all of the active AP MINT link connections that exist. If those 5 or so APs really are unadopted, you won't see a listing for MINT links for them.

When you say that you 'reset' the APs, do you simply mean that you power cycled them or something else?

Regarding the MINT BC, you will have LSP's flooded from each WING device. Each WING device will receive and build their LSP-DB based on the flooded LSPs. With MINT level-1 adoptions though the LSP-DB size can start to get rather large because of this and therefore doesn't scale well for very large AP deployments....which is then where MINT level-2 adoptions come into play. But, your deployment is nowhere near that large so this shouldn't be an issue.

If you run 'show mint info' on any of the APs, you should see that it has some number (all the APs) as the LSP DB size. This number will include all the APs and the controller itself.
If you run show mint lsp-db on a device, you can then see EACH of the WING devices that exist within the LSP-DB. Each listed AP will have just a single adjacency formed (between itself and the controller), and the controller will have as many adjacencies as there are APs.

So then the LSP-DB is used to create the MINT routes, which if you want to see can be viewed by running the command show mint route.
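That adjacency structure (each AP with one adjacency, the controller with one per AP) can be sketched as a toy model; this is just an illustration of the counts described above, not anything WiNG actually outputs:

```python
def adjacency_counts(num_aps: int) -> dict:
    """MiNT level-1 star topology: each AP forms one adjacency (to the
    controller); the controller forms one adjacency per AP."""
    counts = {"controller": num_aps}
    for i in range(1, num_aps + 1):
        counts[f"ap{i}"] = 1   # hypothetical AP labels, for illustration only
    return counts
```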

In addition to all this so far, each WiNG device will also transmit a MINT HELLO packet every 4 seconds for MINT level-1. So you can expect to see that MINT traffic too.
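As a rough sanity check on what that baseline looks like, one hello per device every 4 seconds works out to only a handful of frames per second even for this deployment's size (the device count below is an assumption for illustration):

```python
def hello_rate(num_devices: int, interval_s: float = 4.0) -> float:
    """Expected MiNT level-1 hello frames per second on the wire:
    one hello per device per interval."""
    return num_devices / interval_s

# e.g. a controller plus 30 APs -> 31 devices, 7.75 frames/sec:
# nowhere near enough traffic to constitute a broadcast storm on its own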

So where am I going with all this? I'm not seeing where a MINT BC storm is going to occur. This is assuming no loops, which could potentially do it though.

If this happens again, see if can SSH into one of the APs that is 'down'.

Given all this, you can certainly go ahead and isolate the MINT traffic by setting up an additional VLAN that is separate from the user-traffic. This is certainly not something that is normally done though with other deployments. My suspicion though is that whatever is happening here would continue to happen, but it would just be isolated to that VLAN.

I don't have the info, but I'm wondering if this is possibly related to an issue that exists on your version of WING code. But if it's only happening to some subset of APs, that wouldn't seem likely. I would expect the issue to affect all the APs.

RWCampbell
New Contributor
Well, in the end, the DHCP theory I mentioned was wrong, and we resorted to DDOSing a handheld, which triggered the MiNT broadcast situation.

Here's a question that actually was brought to the fore today. After I sent this reply, the customer reported that the wifi had stopped working in some areas. We noticed that adoption looked fine if you checked the status, but 5 or so of the APs were not able to do a mint ping. So layer 3 comms were good, layer 2 was not. We ended up resetting the individual APs in this situation, and they were then able to communicate again and came back online to pass traffic.

Doing packet captures on the switch, I saw lots of blocks of mint broadcast traffic before and after the resets of the APs. Is all mint traffic broadcast based?

Our suspicion is that this wifi outage today is another one of the events that induced a broadcast storm on the rest of the network in the past, but because we removed the network loops, this time it didn't affect the rest of the network. Not really sure, but it makes a good case for your point that this traffic should be VLAN'ed off.

ckelly
Extreme Employee
Mmmm....not sure I can see the correlation between the DHCP lease attempts and the MINT traffic. I've seen MINT do some 'interesting' things over the years...but this isn't one of them. 🙂

In any case though...storms can be a real PITA.

RWCampbell
New Contributor
We definitely learned a lot. Thanks.

Just as an interesting aside, our original theory with the whole broadcast storm of MiNT traffic was that if the DHCP server was unavailable (for various reasons; the lease time at the time was 1 hr.), the DHCP broadcasts attempting to get IPs would trigger the MiNT broadcast storm.

The night we were trying to put the whole broadcast storm thing to bed, we found that disabling DHCP did not precipitate a storm event. After an hour and a half, we did a DDOS targeted at one of the handhelds from a server, and very quickly that storm of traffic rose up and choked the network. From there we were able to identify the loops that had been introduced to the network.

~Robert

ckelly
Extreme Employee
Robert - so glad to hear that you got it all setup and working!
Hopefully you learned a little in the process too. Glad to help out!