Header Only - DO NOT REMOVE - Extreme Networks
Question

Adoption lost after 20 seconds; Layer 3 connectivity normal

  • 17 September 2019
  • 17 replies
  • 354 views

It seems that roughly a third of our APs have randomly become un-adopted and no longer function. We have found that when we restart the controller, all the APs reconnect, but a subset of them un-adopt again after 20 seconds or so. Perhaps notably, the controller is running v5.5. We want to upgrade it to the latest possible release, but we are not sure how to get the software.

As far as we can tell, the main symptom is that we cannot MiNT ping the other device, so presumably the devices cannot reach each other at layer 2. We've compared the configs of working and non-working APs, and they are identical apart from the normal per-device variables like names and IPs. To our knowledge nothing changed to precipitate this: the system was used normally over the weekend and these specific APs were simply not working this morning.

Any idea what would make the layer 2 / MiNT communication stop working?

-----

Below is a CLI walkthrough of the main symptoms as seen on one of the affected APs, followed by that AP's config. Any help would be greatly appreciated.


Controller: RFS-6010-1000-WR
Base ethernet MAC address is B4-C7-99-6D-B7-76
Mint ID: 19.6D.B7.76
IP Address: 10.200.17.10

AP: AP-6532-66040-US
Base ethernet MAC address is 84-24-8D-81-9C-88
Mint ID: 4D.81.9C.88
IP Address: 10.200.17.33

# debugs (from controller)

RFS-SW01# sh mint mlcp his

2018-10-25 11:54:15:cfgd unadopted 4D.81.9C.88
2018-10-25 11:54:15:Unadopted 84-24-8D-81-9C-88 (4D.81.9C.88), cfgd not notified
2018-10-25 11:54:15:Unadopting 84-24-8D-81-9C-88 (4D.81.9C.88) because it is unreachable
2018-10-25 11:53:59:Adopted 84-24-8D-81-9C-88 (4D.81.9C.88), cfgd notified

RFS-SW01#ping 10.200.17.33
PING 10.200.17.33 (10.200.17.33) 100(128) bytes of data.
108 bytes from 10.200.17.33: icmp_seq=1 ttl=64 time=3.99 ms
108 bytes from 10.200.17.33: icmp_seq=2 ttl=64 time=0.410 ms
108 bytes from 10.200.17.33: icmp_seq=3 ttl=64 time=0.359 ms
108 bytes from 10.200.17.33: icmp_seq=4 ttl=64 time=0.363 ms

--- 10.200.17.33 ping statistics ---
4 packets transmitted, 4 received, 0% packet loss, time 3004ms
rtt min/avg/max/mdev = 0.359/1.281/3.995/1.567 ms
RFS-SW01#mint ping 4D.81.9C.88
MiNT ping 4D.81.9C.88 with 64 bytes of data.
Ping request 1 timed out. No response from 4D.81.9C.88
Ping request 2 timed out. No response from 4D.81.9C.88
Ping request 3 timed out. No response from 4D.81.9C.88

--- 4D.81.9C.88 ping statistics ---
3 packets transmitted, 0 packets received, 100% packet loss
RFS-SW01#

RFS-SW01#show adoption offline
-----------------------------------------------------------------------------------------------------------------------------
MAC HOST-NAME TYPE RF-DOMAIN TIME OFFLINE CONNECTED-TO
-----------------------------------------------------------------------------------------------------------------------------
84-24-8D-81-9C-88 AP23 ap6532 TEMP DC 0:05:27
-----------------------------------------------------------------------------------------------------------------------------

# debugs (from ap)

AP23#show adoption status
Adopted by:
Type : RFS6000
System Name : RFS-SW01
MAC address : B4-C7-99-6D-B7-76
MiNT address : 19.6D.B7.76
Time : 0 days 00:03:07 ago

AP23#show mint mlcp history
2018-10-25 11:53:58:Received 0 hostnames through option 191
2018-10-25 11:53:57:Received OK from cfgd, adoption complete to 19.6D.B7.76
2018-10-25 11:53:56:Waiting for cfgd OK, adopter should be 19.6D.B7.76
2018-10-25 11:53:56:Adoption state change: 'Connecting to adopter' to 'Waiting for Adoption OK'
2018-10-25 11:53:53:Adoption state change: 'No adopters found' to 'Connecting to adopter'
2018-10-25 11:53:53:Try to adopt to 19.6D.B7.76 (cluster master 00.00.00.00 in adopters)
2018-10-25 11:53:52:Received 0 hostnames through option 191
2018-10-25 11:53:52:Adoption state change: 'Disabled' to 'No adopters found'
2018-10-25 11:53:52:DNS resolution completed, starting MLCP
2018-10-25 11:53:52:Adoption enabled due to configuration

AP23#ping 10.200.17.10
PING 10.200.17.10 (10.200.17.10) 100(128) bytes of data.
108 bytes from 10.200.17.10: icmp_seq=1 ttl=64 time=4.53 ms
108 bytes from 10.200.17.10: icmp_seq=2 ttl=64 time=0.355 ms
^C
--- 10.200.17.10 ping statistics ---
2 packets transmitted, 2 received, 0% packet loss, time 1001ms
rtt min/avg/max/mdev = 0.355/2.443/4.531/2.088 ms
AP23#mint ping 19.6D.B7.76
MiNT ping 19.6D.B7.76 with 64 bytes of data.
Ping request 1 timed out. No response from 19.6D.B7.76
Ping request 2 timed out. No response from 19.6D.B7.76
Ping request 3 timed out. No response from 19.6D.B7.76

--- 19.6D.B7.76 ping statistics ---
3 packets transmitted, 0 packets received, 100% packet loss
AP23#

-----
code:
version 2.3
!
!
ip snmp-access-list default
permit any
!
firewall-policy default
no ip dos tcp-sequence-past-window
alg sip
!
!
mint-policy global-default
!
wlan-qos-policy default
qos trust dscp
qos trust wmm
!
radio-qos-policy default
!
wlan "WMS SSID"
description WMS RF Environment
ssid TEMP-WMS-RF
vlan 1
bridging-mode tunnel
encryption-type tkip-ccmp
authentication-type none
wpa-wpa2 psk 0 XXXXXXXXXX
service wpa-wpa2 exclude-ccmp
!
smart-rf-policy "TEMP DC Smart RF"
sensitivity custom
assignable-power 2.4GHz max 14
assignable-power 2.4GHz min 11
smart-ocs-monitoring client-aware 2.4GHz 1
!
!
management-policy default
no http server
https server
ssh
user admin password 1 XXXXXX role superuser access all
snmp-server community 0 private rw
snmp-server community 0 public ro
snmp-server user snmptrap v3 encrypted des auth md5 0 motorola
snmp-server user snmpmanager v3 encrypted des auth md5 0 motorola
!
profile ap6532 default-ap6532
ip name-server 10.200.16.12
ip name-server 10.200.16.11
ip domain-name TEMP.com
autoinstall configuration
autoinstall firmware
crypto ikev1 policy ikev1-default
isakmp-proposal default encryption aes-256 group 2 hash sha
crypto ikev2 policy ikev2-default
isakmp-proposal default encryption aes-256 group 2 hash sha
crypto ipsec transform-set default esp-aes-256 esp-sha-hmac
crypto ikev1 remote-vpn
crypto ikev2 remote-vpn
crypto auto-ipsec-secure
crypto load-management
crypto remote-vpn-client
interface radio1
wlan "WMS SSID" bss 1 primary
interface radio2
shutdown
interface ge1
ip dhcp trust
qos trust dscp
qos trust 802.1p
interface vlan1
ip address dhcp
ip address zeroconf secondary
ip dhcp client request options all
interface pppoe1
use firewall-policy default
rf-domain-manager capable
logging on
service pm sys-restart
router ospf
!
rf-domain "TEMP DC"
location "TEMP DC"
contact "Velociti Inc."
timezone America/Chicago
country-code us
use smart-rf-policy "TEMP DC Smart RF"
channel-list dynamic
channel-list 2.4GHz 1,6,11
control-vlan 1
!
ap6532 84-24-8D-81-9C-88
use profile default-ap6532
use rf-domain "TEMP DC"
hostname AP23
interface radio1
power 8
interface vlan1
ip address 10.200.17.33/21
!
!
end

17 replies

Userlevel 5
Since the rf-domain has control-vlan defined, I assume that the APs are adopting over L3? If adopting over L3, are you using MiNT level 2 with DHCP option 191 or a statically assigned controller host entry?

If the APs are local to the RFS and adopting over L2, I would remove the control-vlan parameter from the rf-domain.

If the APs are remote to the RFS and adopting over L3, I would have the RFS in its own/separate rf-domain and leave the APs in the current rf-domain with the control VLAN (the remote VLAN that the APs use to get out). Remote site APs should be using MiNT level 2.

We have many useful documents for different deployments and you may want to get a support case generated if under entitlement. If not currently under entitlement, please reach out to your re-seller for entitlement options.

RFS6000/AP6532s have been EOSL for some time, and the last supported firmware release for these models is v5.9.1.5. You must be under entitlement to have access to the firmware.
It's defined, but it's defined as VLAN 1, so it's not logically separated. To my understanding it's adopting over L2, and that's the entire problem (so far as we can tell...). The APs are statically assigned L3 addresses, and I assume the controller host entry is statically assigned as well.

The APs are local and we can do that, but the working APs have control-vlan defined as well, and VLAN 1 is the default VLAN, so I'm not sure it will have an effect. We can try it, though.

Velociti was the original installer, and looking over how this was set up, I do think there were a lot of problems with how they configured things. We recently discovered that two loops had been introduced to this client's network. The switches coped with ordinary traffic, but they could not cope when MiNT broadcast traffic came through. For the last month we've been dealing with periodic episodes where the network gets saturated with broadcast packets from this proprietary protocol. We're not sure of the specific circumstances that trigger these intermittent events, which is why we wanted to get the latest firmware we could for these devices.

I'm not a fan of the idea that I need to purchase support to get firmware updates; maybe for major releases with major feature additions. It wouldn't seem appropriate to pay for Windows updates, and this feels like the same thing. Not trying to argue, but that strikes me as off, especially in a day and age when wireless compromises are found so often. Not really relevant to this inquiry, but it is something I noted.

Finally, what are the implications of EOSL? Presumably the firmware is no longer under development even if new vulnerabilities are discovered? Is it still supportable under entitlement at all? The end user no longer has a relationship with Velociti; is support something I can resell to them as their current support consultant? I recently registered with the Extreme partner portal.

~Robert
Userlevel 6
Robert,
Quickly, regarding the EOSL: there is a phased approach to the products and how engineering resources are applied. At some number of years after a new product is deployed, firmware stops including new FEATURES...but firmware continues to be released for it, as needed. This includes things like major bugs discovered or security-related issues found.
The last phase is where all engineering development stops. It's at this point that bugs are no longer fixed...and as far as I know, even security-related issues are not addressed.

Regarding what sounds like a MINT traffic loop, it would seem that the MINT levels are misconfigured for the network topology in place.
We don't know what your physical/logical deployment looks like, but Chris Frazee made some excellent suggestions about how to possibly set things up correctly. It just depends on how the controller and APs are actually deployed on the network.

To maybe help you see how your system might be mis-configured and causing loops, here are some quick details of the three ways that systems should be set up.
1) Distributed
2) Centralized
3) Centralized with controller managed RF-Domains

In your config, we see a control-vlan defined in the RF-Domain that the AP is assigned to use. As Chris F. alluded to, this would normally only be defined in a Distributed style deployment (#1) (WiNG controller in the NOC somewhere assigned to its own RF-Domain all by itself - and APs then placed into their own RF-Domains that represent remote stores/sites over a WAN connection).

The control-vlan that you assign to an RF-Domain is the VLAN that the APs use specifically for their MINT traffic, so they can talk to each other and pass information to each other. It's not uncommon, nor technically disallowed, for customers to use a regular user-data VLAN for the control-vlan (as long as that data VLAN is not a common broadcast domain across multiple sites). But if you want to do things 'right', you create and assign a separate VLAN at a site and then make that the control-vlan in the RF-Domain config where those APs exist. And to be clear, you can create and use the same VLAN for the control-vlan at multiple remote sites, as long as that VLAN cannot reach between the remote sites (is not common to multiple sites). The whole idea is to keep MINT traffic from one remote site from intermingling with another remote site's MINT traffic (it ends up creating very large LSP-DB tables!).

This remote-site MINT traffic that the APs use to talk to each other is referred to as MINT level-1 traffic. It's VERY chatty - LSPs are exchanged between WiNG devices. One of the APs at the remote site will be elected the RF-Domain manager and will be the single AP that forms a connection back to the NOC controller; it forms a single level-2 MINT connection to the NOC controller. All of that chatty level-1 MINT traffic cannot pass beyond that RF-Domain/site over the level-2 MINT connection back to the NOC controller - the site's LSPs cannot pass over this level-2 MINT connection. So you end up with just a SINGLE level-2 MINT connection from each remote site back to the controller. That's it. All of the other APs at a site get their instructions and pass their data back to the controller via the elected RF-Domain manager AP, which acts as a sort of proxy for the site. This all creates MINT isolation between the remote sites.
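
To make that concrete, here's a rough sketch of the Distributed (#1) pieces in WiNG CLI. The domain names, VLAN number, and controller IP below are hypothetical placeholders, not values from your config:

code:
! Sketch only - placeholder names, VLAN and IP
! NOC controller sits by itself in its own RF-Domain
rf-domain NOC-DC
 country-code us
!
! Remote-site RF-Domain: APs only, with a site-local control VLAN carrying the chatty level-1 MiNT
rf-domain STORE-01
 country-code us
 control-vlan 100
!
! Remote-site AP profile: the elected RF-Domain manager forms the single level-2 link back to the NOC
profile ap6532 store-ap6532
 controller host 192.0.2.10 pool 1 level 2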

Now....all of this is describing how things work for a Distributed deployment.

If you have a Centralized deployment (#2), that's for situations like a campus or a single building (no remote over-the-WAN AP connections). In this setup, you have just a single RF-Domain defined, and it contains the controller(s) and the APs. That's it. In that case, there's no need for a control-vlan. Why? Because each AP is going to form its own individual MINT level-1 connection back to the controller. This also means that the controller becomes the RF-Domain manager, versus an AP winning the election and being the site's RF-Domain manager. There's no level-2 MINT involved in this scenario. In this case, though, it's strongly recommended that if there are more than 100 APs involved, you use IP-based level-1 MINT links vs VLAN-based level-1 MINT.

You can also have a Campus style deployment where you NEED to have multiple RF-Domains representing different buildings on the campus (#3). This is possible too. In this scenario, each AP once again has its own level-1 MINT link back to the controller...but it's always IP-based. Also, unlike the Distributed architecture (#1) where each remote site has an elected RF-Domain manager AP, in this scenario you configure the controller to be the RF-Domain manager for the different RF-Domains you've created to represent the buildings. This puts more burden on the controller, though, because now it's having to do all the work/calculations for the different RF-Domains (which is normally done by the elected RF-Domain manager AP at each site), so there is a limit to the number of RF-Domains a controller can 'manage', which depends on the controller's hardware level. And again, no control-vlan is used in this scenario either, even though the APs are operating in their own separate RF-Domains (like #1), because those RF-Domains have been configured to be 'controller-managed'.
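
As a config-level illustration of #3 (hypothetical building names; this assumes the controller-managed keyword is available under rf-domain on your code version):

code:
! Sketch: campus-style RF-Domains managed by the controller itself - no control-vlan
rf-domain BUILDING-A
 country-code us
 controller-managed
!
rf-domain BUILDING-B
 country-code us
 controller-managed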

So based on all of this, which one seems to fit your actual deployment?
The fact that the RF-Domain in the config you supplied has a defined control-vlan suggests that you have a Distributed architecture (#1). Does that sound correct based on the description, though?
From what you are saying, this is set up as #1 Distributed when it should be set up as #2 Centralized. There is one controller in a single building with no remote over-the-WAN APs, about 30 APs in total, and 4 to 5 switches connected with fiber. As I mentioned, we did not set this up originally and kind of took it over. The issue could well be related to a switch loop, as we have already found a few network loops. If we were to switch from #1 to #2 Centralized, would all we need to do be to remove control-vlan 1 (maybe replace it with controller-managed) from all the APs (default profile) and the controller? What else needs to change on the controller or APs? Are there any specific switch configurations that would need to be made? Should we still add a different VLAN on the switches to isolate the noisy MiNT traffic? Thanks for all your help!
Userlevel 6
Okay...so it does sound like #2 is where you want to be.

So at the very least, this is what you want to ensure is setup:
1) Have only one RF-Domain created
2) The controller(s) and APs are all assigned to this one RF-Domain
3) The setup for the RF-Domain does NOT have a defined control-vlan.
4) APs and controller need a common management VLAN. This will allow the APs to automatically discover the controller on the VLAN and adopt (this would be VLAN-based layer-2 discovery and adoption). If you cannot have a common management VLAN, then you have no choice but to implement IP-based adoption (see #5 below).
5) Based on the number of APs (40), it's okay to simply have the APs adopted via layer 2 (VLAN) vs IP-based (you don't need the controller-managed setup either). The better option, though, is IP-based adoption (the APs will obviously need IP addresses), and you'd need to either manually configure each AP with the controller's IP address so it knows where to go to adopt...or set up DHCP Option 191...or, if the APs are already adopted now, you can simply modify the AP Profile to include the controller's IP address and let that config get pushed out to the APs as part of a regular AP config update. At this point, also disable 'MINT MLCP VLAN' on the AP Profiles, since they won't be using it any longer with IP-based adoption. (A config sketch follows at the end of this reply.)

Nothing needed on the switches other than the management VLAN and any data VLANs that are required by the WLANs that are operating on the APs.

With this Option #2 Centralized setup, the MINT traffic issue goes away. Each AP reports back directly to the controller itself, so the control-vlan is no longer needed.
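
As a rough sketch of those points against the names in your posted config (a starting point only - verify against the running config, and note the no-forms assume the usual WiNG negation syntax):

code:
! Sketch of the #2 Centralized pieces
! One RF-Domain containing the controller and all APs - no control-vlan
rf-domain "TEMP DC"
 no control-vlan
!
! AP profile: adopt to the controller by IP over MiNT level 1
profile ap6532 default-ap6532
 controller host 10.200.17.10 pool 1 level 1
 no mint mlcp vlan
!
commit write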
So when you say "Better option though is IP-based adoption", are you saying that we'd be sticking with our current setup of #1 Distributed? Right now all devices have a static IP, so if that is okay, we are fine communicating that way. We want the easiest path to get these adopted.
1) Under the default profile would we add
code:
controller host 10.200.17.10
(the IP is the controller IP)
2) What is the CLI to manually configure the controller's IP address on each AP that is not adopted? A reset might work, as it seems the config is being pushed successfully.
3) What is the CLI to disable MINT MLCP VLAN?
Userlevel 6
The adoption MINT level is controlled primarily by the 'controller host' statement.
So that statement would look like this for the level-1 or 2 setup:

controller host 10.200.17.10 pool 1 level 1 (if you omit the "pool 1 level 1" those values are assumed)
commit write

It's that "level 1" that indicates the MINT level that should be used and how the AP is going to adopt*. This entry could be placed into the AP's Profile or in the AP's 'override' section. Either is fine. Just make sure you understand the difference.

Simply having this controller host statement automatically means that you're indicating IP-based adoption (which could also be the case if you've set up DHCP Option 191 or DNS-based adoption). If it were VLAN-based adoption, you wouldn't even need to include the controller host statement. The APs would simply locate the controller on their own on the same management VLAN and attempt to adopt.


*To disable layer-2 discovery for the APs (because, in order of preference, the APs will first look for a controller using layer-2...so if they can find one - even though they have a layer-3 controller host entry - they'll still go ahead and adopt via layer-2) and force them to only adopt via layer-3, go into the AP's Profile or its override section and issue the command:
no mint mlcp vlan (MLCP is 'MiNT Link Creation Protocol')
commit write

So the two things you need are the controller host statement and the negation of mint mlcp vlan.
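
And if you'd rather apply them per-AP instead of in the profile, the same two statements can go into the AP's override section - sketched here using the AP device entry from your posted config:

code:
! Sketch: per-device override (device MAC taken from the posted config; repeat per AP)
ap6532 84-24-8D-81-9C-88
 controller host 10.200.17.10 pool 1 level 1
 no mint mlcp vlan
!
commit write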

...and before I forget, make sure that if you have a controller cluster that it's also setup using MINT level-1. In addition, you don't want to have (it's not supported) any sort of mixture of APs or controller cluster MINT levels. If you have APs adopted MINT level-1, then EVERYTHING everywhere should be using MINT level-1. Same with MINT level-2.

(If you do have a controller cluster, you can verify the MINT level used for the cluster by running the command:
show cluster status
Look at the first output labeled: "Protocol Version". It should be "1", meaning cluster is formed using MINT level-1.

Also, ensure that the "controller vlan" option is not being used. This is NOT the same thing as the "control vlan" setting. The "controller vlan" setting is only used in certain situations (the APs and controllers share multiple common VLANs and the APs are adopted using layer 2).

Also, when doing layer-3 adoption (I should've specified this bit in the earlier post, sorry), make sure that there are no ACLs between the APs and controller that would block UDP port 24576. This is what MINT will be using. If it's blocked, the APs won't be able to adopt.
Gentlemen,

Thank you much for engaging so thoroughly on this issue. We were able to get the profile changed and reset the devices so that they'd be adopted via layer 3 communication and they came back online and started functioning for the client.

They're quite thankful that they will not have to use the slow processes in the freezer anymore.

We're going to continue the conversation with the sales support people about the possibility of getting a support entitlement set up for this system. We shall see. Thanks again!

~Robert
Userlevel 6
Robert - so glad to hear that you got it all setup and working!
Hopefully learned a little in the process too. 🙂 Glad to help out!
We definitely learned a lot. Thanks.

Just as an interesting aside, our original theory about the whole MiNT broadcast storm was that if the DHCP server became unavailable (for various reasons; the lease time at the time was 1 hr.), the DHCP broadcasts from clients trying to get IPs would trigger the MiNT broadcast storm.

The night we were trying to put the whole broadcast storm thing to bed, we found that disabling DHCP did not precipitate a storm event. After an hour and a half, we did a DDoS targeted at one of the handhelds from a server, and very quickly the storm of that traffic rose up and choked the network. From there we were able to identify the loops that had been introduced to the network.

~Robert
Userlevel 6
Mmmm....not sure I can see the correlation between the DHCP lease attempts and the MINT traffic. I've seen MINT do some 'interesting' things over the years...but this isn't one of them. :)

In any case though...storms can be a real PITA.
Well, in the end, the DHCP theory I mentioned was wrong, and we resorted to DDoSing a handheld, which triggered the MiNT broadcast situation.

Here's a question that was actually brought to the fore today. After I sent this reply, the customer reported that the wifi had stopped working in some areas. We noticed that adoption was fine if you looked at that status, but 5 or so of the APs were not able to do a MiNT ping. So layer 3 comms were good, layer 2 was not. We ended up resetting those individual APs, and they were then able to communicate again and came back online to pass traffic.

Doing packet captures on the switch, I saw lots of blocks of MiNT broadcast traffic before and after the resets of the APs. Is all MiNT traffic broadcast-based?

Our suspicion is that this wifi outage today is another instance of the kind of event that induced broadcast storms on the rest of the network in the past, but because we removed the network loops, this time it didn't affect the rest of the network. Not really sure, but it makes a good case for your point that this traffic should be VLAN'ed off.
Userlevel 6
Robert,

I'm curious about the DDOS'ing of the handheld. So you're intentionally DDOS'ing the device? Curious as to why. Testing?
In any case though, the DDOS ends up triggering what you see as a MINT BC storm and switches are reacting due to this?

When the customer reported that the wifi was down, was the WLAN itself actually no longer being seen by the clients? Or were they still associated but just not able to pass traffic?

So you had about 5 APs that you noticed would not respond to MINT PINGs...but they were still adopted? This would appear to be contradictory in nature. If the APs are adopted, then MINT between the devices is working...and therefore a MINT PING should work. The caveat here is that there is a definable adjacency hold timer. But if this condition exists for a minute or longer and the APs are still shown to be adopted, then I'd have to say that MINT comms are still functioning between the AP and controller.
When this happens, try running show mint links on the controller. This will show all of the active AP MINT link connections that exist. If those 5 or so APs really are unadopted, you won't see a listing for MINT links for them.

When you say that you 'reset' the APs, do you simply mean that you power cycled them or something else?

Regarding the MINT BC, you will have LSPs flooded from each WiNG device. Each WiNG device will receive them and build its LSP-DB based on the flooded LSPs. With MINT level-1 adoptions, though, the LSP-DB size can start to get rather large because of this and therefore doesn't scale well for very large AP deployments....which is where MINT level-2 adoptions come into play. But your deployment is nowhere near that large, so this shouldn't be an issue.

If you run 'show mint info' on any of the APs, you should see that it has some number (all the APs) as the LSP DB size. This number will include all the APs and the controller itself.
If you run show mint lsp-db on a device, you can then see EACH of the WiNG devices that exist within the LSP-DB. Each listed AP will have just a single adjacency formed (between itself and the controller), and the controller will have as many adjacencies as there are APs.

So then the LSP-DB is used to create the MINT routes, which if you want to see can be viewed by running the command show mint route.

In addition to all this so far, each WiNG device will also transmit a MINT HELLO packet every 4 seconds for MINT level-1. So you can expect to see that MINT traffic too.

So where am I going with all this? I'm not seeing where a MINT BC storm is going to occur. This is assuming no loops, which could potentially do it though.

If this happens again, see if you can SSH into one of the APs that is 'down'.

Given all this, you can certainly go ahead and isolate the MINT traffic by setting up an additional VLAN that is separate from the user-traffic. This is certainly not something that is normally done though with other deployments. My suspicion though is that whatever is happening here would continue to happen, but it would just be isolated to that VLAN.

I don't have the info, but I'm wondering if this is possibly related to an issue that exists in your version of WiNG code. But if it's only happening to some subset of APs, that wouldn't seem likely. I would expect the issue to affect all the APs.
As for why: our first theory for inducing the storm was to pull the DHCP server offline and let the level of DHCP broadcast traffic increase. That didn't work for an hour and a half, but we had said we were going to solve this issue that night, so we just started trying things. We started DDoSing the DHCP server, then the wifi controller, and finally a handheld device (there might be a few other things in there). Sending all that traffic to a device had the effect we wanted, for whatever reason, and from there we got a pretty good sense of the source and started homing in on it. We eventually found an RJ45 cable bridging two switches that also had fiber uplinks to the fiber switch. The next morning we noticed a port in one of the switches that spanning tree had shut down. Thinking we had removed the loop, we enabled that port and everything went down, so we knew that port was also part of a loop. We figured out where it went and removed it as well. The switches were not really 'reacting'; they just had loops, which amplified the broadcast traffic until they were removed.

So here's my understanding (which could be wrong): from the comments above, adoption can happen at the MiNT level or at the 'layer 3'/TCP/IP level. In the prior event, where we noticed that these devices were not adopted, the fix we came up with was to change the profile so that adoption would happen over TCP/IP instead, and reboot. That made adoption happen and communication resumed. With this more recent event (where the same APs stopped communicating), adoption was fine, but whatever fundamental issue is occurring with the MiNT traffic is still occurring and blocking traffic despite the fact that they're still adopted.

In the first incident, the dashboard's online/offline graph showed that 12 APs were online when there should have been 17. Changing the adoption setting and resetting the APs brought it back up to 17, and the devices in question repopulated the left panel when you expand the RF domain. What was interesting in the second incident was that even though the same APs were not communicating, it still showed all 17 online, but I noticed that the 5 in question had disappeared from the left panel. So this made me think that the online/offline dashboard widget just shows devices that are adopted, but if the MiNT traffic can't get through, the AP won't show in the left panel and, for whatever reason, wifi traffic will not pass. This made me wonder if the TCP/IP traffic from the wifi clients ends up being sent to the controller via the MiNT protocol? I wouldn't think so. But for whatever reason, if the MiNT ping is not able to get through, then traffic will not pass even if TCP/IP traffic has no issues (I realize this seems logically impossible). With the second event, all we did to get traffic moving on those APs again was power cycle them individually; MiNT pings started working and they popped back into the left panel on the controller interface...

To address your more direct questions: we don't think there is a broadcast storm occurring now, and there wouldn't have been one before absent the loops we found. I think spanning tree should have arrested those anyway, but the firmware on the switches with the loops is ancient and MiNT is proprietary. Since we removed the loops, and as I alluded to above, in neither case was TCP/IP communication down. We were able to SSH into the APs without issue at all times, whether they were adopted or not and whether MiNT traffic was passing or not.

It definitely crossed my mind that there might be a software bug of some sort causing this strange behavior. I guess I thought you had said MiNT traffic was normally tagged and separated from the TCP/IP traffic. I must have misinterpreted something.

I have updated the customer on where everything stands and suggested that they get support established so that we can get the firmware. I'm not a fan of the idea that I need support to get software updates that could include bug fixes, but the other things that go along with it are worth it, especially the replacement plan. It wouldn't be an easy thing to just replace the controller, so I think they'd be well served to maintain support as long as they keep the controller and it remains supportable, especially at the reasonable price you are offering for that service.

We're not familiar with this system; I'm just continuing the conversation so we can both maybe learn something. I certainly may be misinterpreting the behavior of the system or your comments somewhere. You've been more than helpful in responding here and pointing us in the right direction.

~Robert
Userlevel 6
This is an important point here to expand on:
"With this more recent event (where same APs stopped communicating), is that adoption is fine, but whatever fundamental issue with the mint traffic is occurring still is occurring and blocking traffic despite the fact that they're still adopted."

Just to be clear, MINT is not normally responsible for carrying user traffic, but in your setup this comment just set off a big red flag for me. I checked your config listing and sure enough.... (I should've noticed this earlier!)

code:
wlan "WMS SSID"
description WMS RF Environment
ssid TEMP-WMS-RF
vlan 1
bridging-mode tunnel <--------- THIS "tunnel"
encryption-type tkip-ccmp
authentication-type none
wpa-wpa2 psk 0 XXXXXXXXXX
service wpa-wpa2 exclude-ccmp


The bridging mode you have configured in this WLAN profile is set to tunnel. What this does is encapsulate the client traffic within the MINT tunnels that are formed between the APs and the controller. Once the traffic reaches the controller, the controller strips the user traffic out of the tunnel and drops it onto the LAN. MOST of the deployments we have are not using this mode, except for special cases, like a Guest WLAN or something like that.

I'd highly recommend changing this setting over to "local". With local bridging, the AP takes the client traffic and simply bridges it onto the assigned VLAN (from the WLAN Profile), and it goes straight out the AP interface, tagged, to the switch. It's just treated like any other traffic. In your case, the config indicates that the WLAN is assigned to VLAN 1...which I'm assuming is the untagged native VLAN? If so, no big deal...the same applies.

The point to understand here, though, is that in "tunnel" mode, since the user traffic is placed inside a MINT tunnel and brought BACK to the controller inside the MINT tunnel, if there are any issues with MINT between the AP and controller, the user traffic is ALSO affected, right? So using local bridging is a much more resilient mode of operation.
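
As a sketch of that change, using the WLAN name from your posted config (do it in a maintenance window, since it may briefly disrupt clients on that WLAN):

code:
! Sketch: switch the WLAN from tunnel bridging to local bridging at the AP
wlan "WMS SSID"
 bridging-mode local
!
commit write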

"What was interesting was in the second incedent, even though the same AP's were not communicating, it still showed all 17 online. But I noticed that the 5 in question had disappeared from the left panel."

This comment is contradictory. What you see in the tree on the left panel will only be APs that are in fact adopted/online. So WHERE in the UI were you seeing that all 17 were still online while 5 had disappeared from the tree? I'm guessing that you were simply looking in the Devices tab and seeing all the APs listed. If that's the case, seeing the APs listed there is not an indicator of their adoption status. That's just a listing of what APs the controller knows about...whether they're adopted currently or not.

And yes, if MINT traffic can't get through, that will prevent adoption and the AP icons should disappear from the tree on the left. The Dashboard widget with the pie chart will also show you the number of devices online/offline....as well as other widgets.
And....since you have the WLAN setup in tunnel mode...losing MINT comms would then also affect the user traffic from making it back to the controller and being dropped onto the LAN.

With the APs setup in 'local' mode, you can have APs lose adoption but the user traffic is completely unaffected! The APs are just simply un-managed from the controller at that point. They're operating like fat APs and just keep chugging along. :)

Interesting that you still have layer-3 access to the APs when this happens and can SSH in. If this ever happens again, SSH in and you can start doing some interesting diagnostics directly from the AP to determine what's wrong. It has a pretty powerful CLI toolset.

I think the magic bullet here may very likely be switching the WLAN over to 'local' bridge mode. Unless there's a specific reason for it being set up this way, I'd change this immediately and then sit back and see how the system operates.
For illustrative purposes, here's the CLI history showing what we saw during the second event, when APs were 'online' but had disappeared from the left panel. AP25 is the device we started troubleshooting with:

Controller ssh history from yesterday during the outage:
show adoption status

Adopted Devices:
---------------------------------------------------------------------------------------------------------------
DEVICE-NAME VERSION CFG-STAT MSGS ADOPTED-BY LAST-ADOPTION UPTIME
---------------------------------------------------------------------------------------------------------------
AP24 5.5.5.0-018R configured No RFS-SW01 0 days 00:48:51 0 days 00:51:45
AP20 5.5.5.0-018R configured No RFS-SW01 1 days 20:45:03 9 days 03:24:36
AP21 5.5.5.0-018R configured No RFS-SW01 1 days 20:45:04 2 days 02:32:49
AP23 5.5.5.0-018R configured No RFS-SW01 0 days 00:48:51 0 days 00:51:45
AP25 5.5.5.0-018R configured No RFS-SW01 0 days 00:48:50 0 days 00:51:45
AP10 5.5.5.0-018R configured No RFS-SW01 1 days 20:45:05 13 days 09:30:21
AP04 5.5.5.0-018R configured No RFS-SW01 1 days 20:45:04 13 days 08:06:49
AP02 5.5.5.0-018R configured No RFS-SW01 1 days 20:45:03 13 days 10:58:02
AP05 5.5.5.0-018R configured No RFS-SW01 1 days 20:45:03 13 days 10:58:27
AP06 5.5.5.0-018R configured No RFS-SW01 0 days 01:13:54 13 days 09:04:14
AP14 5.5.5.0-018R configured No RFS-SW01 1 days 20:45:04 13 days 10:04:16
AP11 5.5.5.0-018R configured No RFS-SW01 1 days 20:45:04 13 days 09:33:47
AP15 5.5.5.0-018R *configured No RFS-SW01 0 days 00:48:50 0 days 00:51:45
AP16 5.5.5.0-018R configured No RFS-SW01 0 days 00:48:50 0 days 00:51:45
AP07 5.5.5.0-018R configured No RFS-SW01 1 days 20:45:04 20 days 04:31:37
AP08 5.5.5.0-018R configured No RFS-SW01 1 days 20:45:04 20 days 04:31:38
----------------------------------------------------------------------------------------------------------------
Total number of devices displayed: 16

RFS-SW01#mint ping 4D.81.9D.AC
MiNT ping 4D.81.9D.AC with 64 bytes of data.
Ping request 1 timed out. No response from 4D.81.9D.AC
Ping request 2 timed out. No response from 4D.81.9D.AC
Ping request 3 timed out. No response from 4D.81.9D.AC

--- 4D.81.9D.AC ping statistics ---
3 packets transmitted, 0 packets received, 100% packet loss
RFS-SW01#ping 10.200.17.35
PING 10.200.17.35 (10.200.17.35) 100(128) bytes of data.
108 bytes from 10.200.17.35: icmp_seq=1 ttl=64 time=6.74 ms
108 bytes from 10.200.17.35: icmp_seq=2 ttl=64 time=0.314 ms


AP25 Today
AP25#show mint id
Mint ID: 4D.81.9D.AC
AP25#show mint info
Mint ID: 4D.81.9D.AC
16 Level-1 neighbors
Level-1 LSP DB size 17 LSPs (3 KB)
0 Level-2 neighbors
Level-2 LSP DB size 0 LSPs (0 KB)
Level-2 gateway is unreachable
Reachable adopters: 19.6D.B7.76
Max level-1 path cost: 30 (to 19.6D.B7.76)
AP25#show mint lsp-db
17 LSPs in LSP-db of 4D.81.9D.AC:
LSP 19.6D.B7.76 at level 1, hostname "RFS-SW01", 16 adjacencies, seqnum 285509
LSP 1A.4C.44.B0 at level 1, hostname "AP10", 11 adjacencies, seqnum 280965
LSP 1A.4C.44.D8 at level 1, hostname "AP04", 11 adjacencies, seqnum 280135
LSP 1A.4C.45.2C at level 1, hostname "AP02", 11 adjacencies, seqnum 279898
LSP 1A.4C.45.A4 at level 1, hostname "AP05", 11 adjacencies, seqnum 294789
LSP 1A.4C.46.30 at level 1, hostname "AP06", 11 adjacencies, seqnum 273896
LSP 1A.7C.51.30 at level 1, hostname "AP14", 11 adjacencies, seqnum 280856
LSP 1A.7C.53.5C at level 1, hostname "AP11", 11 adjacencies, seqnum 280674
LSP 1A.7C.53.A4 at level 1, hostname "AP15", 6 adjacencies, seqnum 276869
LSP 1A.7C.53.BC at level 1, hostname "AP16", 7 adjacencies, seqnum 697568
LSP 1A.7C.71.98 at level 1, hostname "AP07", 11 adjacencies, seqnum 280219
LSP 1A.7C.71.D8 at level 1, hostname "AP08", 11 adjacencies, seqnum 280893
LSP 4D.18.47.28 at level 1, hostname "AP24", 7 adjacencies, seqnum 249678
LSP 4D.80.BD.B8 at level 1, hostname "AP20", 11 adjacencies, seqnum 223532
LSP 4D.80.BE.70 at level 1, hostname "AP21", 11 adjacencies, seqnum 547615
LSP 4D.81.9C.88 at level 1, hostname "AP23", 7 adjacencies, seqnum 247070
LSP 4D.81.9D.AC at level 1, hostname "AP25", 5 adjacencies, seqnum 354676
AP25#show mint route
Destination : Next-Hop(s)
4D.80.BE.70 : 4D.81.9C.88 via vlan-1
4D.81.9C.88 : 4D.81.9C.88 via vlan-1
1A.7C.53.A4 : 1A.7C.53.A4 via vlan-1
19.6D.B7.76 : 1A.7C.53.BC via vlan-1, 4D.18.47.28 via vlan-1
1A.7C.53.BC : 1A.7C.53.BC via vlan-1
1A.7C.71.D8 : 1A.7C.53.BC via vlan-1, 4D.18.47.28 via vlan-1
1A.4C.45.A4 : 1A.7C.53.BC via vlan-1, 4D.18.47.28 via vlan-1
1A.7C.53.5C : 4D.18.47.28 via vlan-1
1A.4C.45.2C : 1A.7C.53.A4 via vlan-1
4D.81.9D.AC : 4D.81.9D.AC via self
1A.4C.44.D8 : 1A.7C.53.BC via vlan-1
1A.7C.51.30 : 4D.18.47.28 via vlan-1, 1A.7C.53.BC via vlan-1
1A.4C.44.B0 : 4D.18.47.28 via vlan-1, 1A.7C.53.BC via vlan-1
4D.18.47.28 : 4D.18.47.28 via vlan-1
4D.80.BD.B8 : 1A.7C.53.BC via vlan-1
1A.7C.71.98 : 4D.18.47.28 via vlan-1
1A.4C.46.30 : 4D.81.9C.88 via vlan-1
AP25#
Userlevel 6
Robert, another thing I thought of that MIGHT possibly be affecting reliable MINT traffic is the MINT MTU value. Based on your posted config:
code:
mint-policy global-default
!


It's using the default value, which is 1460 bytes.
No idea what the network looks like, but just to be safe, I'd recommend bumping that value down to something like 1400. But if the local infrastructure is fragmenting even lower than 1400, use THAT value. Otherwise, the MINT traffic possibly undergoes double fragmentation (MINT doesn't like being fragmented). This is more a best-practices type of thing.
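
If you go that route, here's a sketch of the change (assuming the mtu setting is available under mint-policy on your code version - verify the keyword and range with the CLI's ? help):

code:
! Sketch: lower the MiNT MTU from the 1460-byte default
mint-policy global-default
 mtu 1400
!
commit write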

Now THIS is interesting:
code:
AP25#show mint route
Destination : Next-Hop(s)
4D.80.BE.70 : 4D.81.9C.88 via vlan-1
4D.81.9C.88 : 4D.81.9C.88 via vlan-1
1A.7C.53.A4 : 1A.7C.53.A4 via vlan-1
19.6D.B7.76 : 1A.7C.53.BC via vlan-1, 4D.18.47.28 via vlan-1
1A.7C.53.BC : 1A.7C.53.BC via vlan-1
1A.7C.71.D8 : 1A.7C.53.BC via vlan-1, 4D.18.47.28 via vlan-1
1A.4C.45.A4 : 1A.7C.53.BC via vlan-1, 4D.18.47.28 via vlan-1
1A.7C.53.5C : 4D.18.47.28 via vlan-1
1A.4C.45.2C : 1A.7C.53.A4 via vlan-1
4D.81.9D.AC : 4D.81.9D.AC via self
1A.4C.44.D8 : 1A.7C.53.BC via vlan-1
1A.7C.51.30 : 4D.18.47.28 via vlan-1, 1A.7C.53.BC via vlan-1
1A.4C.44.B0 : 4D.18.47.28 via vlan-1, 1A.7C.53.BC via vlan-1
4D.18.47.28 : 4D.18.47.28 via vlan-1
4D.80.BD.B8 : 1A.7C.53.BC via vlan-1
1A.7C.71.98 : 4D.18.47.28 via vlan-1
1A.4C.46.30 : 4D.81.9C.88 via vlan-1




Controller: RFS-6010-1000-WR
Mint ID: 19.6D.B7.76

This MINT route listing is from the **perspective** of AP25.
There are 5 entries here where in order for AP25 to get to the WING device in the first column, two hops are needed. Does this sound right?

Example: For AP25 to get to 19.6D.B7.76, it has to go through 1A.7C.53.BC via vlan-1 and then through 4D.18.47.28 via vlan-1
code:
19.6D.B7.76 : 1A.7C.53.BC via vlan-1, 4D.18.47.28 via vlan-1



You can confirm using mint traceroute 19.6D.B7.76
What's interesting here is that there are **5** entries with 2 hops....and you seem to indicate that the issue with the APs dropping off is with 5 APs. Coincidence?


The other interesting thing here:
code:
AP25#show mint lsp-db
17 LSPs in LSP-db of 4D.81.9D.AC:
LSP 19.6D.B7.76 at level 1, hostname "RFS-SW01", 16 adjacencies, seqnum 285509
LSP 1A.4C.44.B0 at level 1, hostname "AP10", 11 adjacencies, seqnum 280965
LSP 1A.4C.44.D8 at level 1, hostname "AP04", 11 adjacencies, seqnum 280135
LSP 1A.4C.45.2C at level 1, hostname "AP02", 11 adjacencies, seqnum 279898
LSP 1A.4C.45.A4 at level 1, hostname "AP05", 11 adjacencies, seqnum 294789
LSP 1A.4C.46.30 at level 1, hostname "AP06", 11 adjacencies, seqnum 273896
LSP 1A.7C.51.30 at level 1, hostname "AP14", 11 adjacencies, seqnum 280856
LSP 1A.7C.53.5C at level 1, hostname "AP11", 11 adjacencies, seqnum 280674
LSP 1A.7C.53.A4 at level 1, hostname "AP15", 6 adjacencies, seqnum 276869
LSP 1A.7C.53.BC at level 1, hostname "AP16", 7 adjacencies, seqnum 697568
LSP 1A.7C.71.98 at level 1, hostname "AP07", 11 adjacencies, seqnum 280219
LSP 1A.7C.71.D8 at level 1, hostname "AP08", 11 adjacencies, seqnum 280893
LSP 4D.18.47.28 at level 1, hostname "AP24", 7 adjacencies, seqnum 249678
LSP 4D.80.BD.B8 at level 1, hostname "AP20", 11 adjacencies, seqnum 223532
LSP 4D.80.BE.70 at level 1, hostname "AP21", 11 adjacencies, seqnum 547615
LSP 4D.81.9C.88 at level 1, hostname "AP23", 7 adjacencies, seqnum 247070
LSP 4D.81.9D.AC at level 1, hostname "AP25", 5 adjacencies, seqnum 354676



.....is the varying number of adjacencies for each entry. The only one that makes sense is the first one, for the controller: it has 16 adjacencies, each representing a connection to one of the 16 APs. (Is it 16 or 17?)
Interesting that many of them show only 11. Again, 16-11=5. There's that **5** number again. More coincidence? Not sure what to make of the ones that show 5, 6, and 7, though. For each value, there should be THAT many entries (there should be 5 entries that show 5 adjacencies, 6 entries that show 6 adjacencies, etc). The differences in adjacency values would be due to APs sitting behind a router. So it's fine that there's a difference, but the resultant output still doesn't add up.
I'm thinking there's something going on with 5 APs that all sit on the same segment behind a router...and they, for whatever reason, are the ones having an issue.

I'm starting to think there's some inconsistency in how these APs are configured on the controller...whereas what I would expect is that all of these APs are set up cookie-cutter style. If you want to send me a copy of the config from the controller to look at and verify, feel free (ckelly@extremenetworks.com)
