adoption lost after 20 seconds. Layer 3 connectivity normal

RWCampbell
New Contributor
Seemingly at random, about a third of our APs have become un-adopted and no longer function. We have found that when we restart the controller, all the APs reconnect, but a subset of them un-adopt after 20 seconds or so. Perhaps notably, the controller is running v5.5. We want to upgrade it to the latest possible release, but we're not sure how to get the software.

As far as we can tell, the reason they are un-adopted is that we cannot MiNT ping the other device; presumably we can't reach it at the MAC layer. We've compared the configs of working and non-working APs, and they're identical aside from the expected variables like names and IPs (minor variations). To our knowledge nothing changed to precipitate this. The system was used normally over the weekend, and the specific APs were not working this morning.

Any idea what would make the layer 2/MiNT communication stop working?

-----

Below is a CLI transcript of the main points that seem to be occurring with one of the APs. Below that is that AP's config. Any help would be greatly appreciated.


Controller: RFS-6010-1000-WR
Base ethernet MAC address is B4-C7-99-6D-B7-76
Mint ID: 19.6D.B7.76
IP Address: 10.200.17.10

AP: AP-6532-66040-US
Base ethernet MAC address is 84-24-8D-81-9C-88
Mint ID: 4D.81.9C.88
IP Address: 10.200.17.33

# debugs (from controller)

RFS-SW01# sh mint mlcp his

2018-10-25 11:54:15:cfgd unadopted 4D.81.9C.88
2018-10-25 11:54:15:Unadopted 84-24-8D-81-9C-88 (4D.81.9C.88), cfgd not notified
2018-10-25 11:54:15:Unadopting 84-24-8D-81-9C-88 (4D.81.9C.88) because it is unreachable
2018-10-25 11:53:59:Adopted 84-24-8D-81-9C-88 (4D.81.9C.88), cfgd notified

RFS-SW01#ping 10.200.17.33
PING 10.200.17.33 (10.200.17.33) 100(128) bytes of data.
108 bytes from 10.200.17.33: icmp_seq=1 ttl=64 time=3.99 ms
108 bytes from 10.200.17.33: icmp_seq=2 ttl=64 time=0.410 ms
108 bytes from 10.200.17.33: icmp_seq=3 ttl=64 time=0.359 ms
108 bytes from 10.200.17.33: icmp_seq=4 ttl=64 time=0.363 ms

--- 10.200.17.33 ping statistics ---
4 packets transmitted, 4 received, 0% packet loss, time 3004ms
rtt min/avg/max/mdev = 0.359/1.281/3.995/1.567 ms
RFS-SW01#mint ping 4D.81.9C.88
MiNT ping 4D.81.9C.88 with 64 bytes of data.
Ping request 1 timed out. No response from 4D.81.9C.88
Ping request 2 timed out. No response from 4D.81.9C.88
Ping request 3 timed out. No response from 4D.81.9C.88

--- 4D.81.9C.88 ping statistics ---
3 packets transmitted, 0 packets received, 100% packet loss
RFS-SW01#

RFS-SW01#show adoption offline
-----------------------------------------------------------------------------------------------------------------------------
MAC HOST-NAME TYPE RF-DOMAIN TIME OFFLINE CONNECTED-TO
-----------------------------------------------------------------------------------------------------------------------------
84-24-8D-81-9C-88 AP23 ap6532 TEMP DC 0:05:27
-----------------------------------------------------------------------------------------------------------------------------

# debugs (from ap)

AP23#show adoption status
Adopted by:
Type : RFS6000
System Name : RFS-SW01
MAC address : B4-C7-99-6D-B7-76
MiNT address : 19.6D.B7.76
Time : 0 days 00:03:07 ago

AP23#show mint mlcp history
2018-10-25 11:53:58:Received 0 hostnames through option 191
2018-10-25 11:53:57:Received OK from cfgd, adoption complete to 19.6D.B7.76
2018-10-25 11:53:56:Waiting for cfgd OK, adopter should be 19.6D.B7.76
2018-10-25 11:53:56:Adoption state change: 'Connecting to adopter' to 'Waiting for Adoption OK'
2018-10-25 11:53:53:Adoption state change: 'No adopters found' to 'Connecting to adopter'
2018-10-25 11:53:53:Try to adopt to 19.6D.B7.76 (cluster master 00.00.00.00 in adopters)
2018-10-25 11:53:52:Received 0 hostnames through option 191
2018-10-25 11:53:52:Adoption state change: 'Disabled' to 'No adopters found'
2018-10-25 11:53:52:DNS resolution completed, starting MLCP
2018-10-25 11:53:52:Adoption enabled due to configuration

AP23#ping 10.200.17.10
PING 10.200.17.10 (10.200.17.10) 100(128) bytes of data.
108 bytes from 10.200.17.10: icmp_seq=1 ttl=64 time=4.53 ms
108 bytes from 10.200.17.10: icmp_seq=2 ttl=64 time=0.355 ms
^C
--- 10.200.17.10 ping statistics ---
2 packets transmitted, 2 received, 0% packet loss, time 1001ms
rtt min/avg/max/mdev = 0.355/2.443/4.531/2.088 ms
AP23#mint ping 19.6D.B7.76
MiNT ping 19.6D.B7.76 with 64 bytes of data.
Ping request 1 timed out. No response from 19.6D.B7.76
Ping request 2 timed out. No response from 19.6D.B7.76
Ping request 3 timed out. No response from 19.6D.B7.76

--- 19.6D.B7.76 ping statistics ---
3 packets transmitted, 0 packets received, 100% packet loss
AP23#

-----
code:
version 2.3
!
!
ip snmp-access-list default
permit any
!
firewall-policy default
no ip dos tcp-sequence-past-window
alg sip
!
!
mint-policy global-default
!
wlan-qos-policy default
qos trust dscp
qos trust wmm
!
radio-qos-policy default
!
wlan "WMS SSID"
description WMS RF Environment
ssid TEMP-WMS-RF
vlan 1
bridging-mode tunnel
encryption-type tkip-ccmp
authentication-type none
wpa-wpa2 psk 0 XXXXXXXXXX
service wpa-wpa2 exclude-ccmp
!
smart-rf-policy "TEMP DC Smart RF"
sensitivity custom
assignable-power 2.4GHz max 14
assignable-power 2.4GHz min 11
smart-ocs-monitoring client-aware 2.4GHz 1
!
!
management-policy default
no http server
https server
ssh
user admin password 1 XXXXXX role superuser access all
snmp-server community 0 private rw
snmp-server community 0 public ro
snmp-server user snmptrap v3 encrypted des auth md5 0 motorola
snmp-server user snmpmanager v3 encrypted des auth md5 0 motorola
!
profile ap6532 default-ap6532
ip name-server 10.200.16.12
ip name-server 10.200.16.11
ip domain-name TEMP.com
autoinstall configuration
autoinstall firmware
crypto ikev1 policy ikev1-default
isakmp-proposal default encryption aes-256 group 2 hash sha
crypto ikev2 policy ikev2-default
isakmp-proposal default encryption aes-256 group 2 hash sha
crypto ipsec transform-set default esp-aes-256 esp-sha-hmac
crypto ikev1 remote-vpn
crypto ikev2 remote-vpn
crypto auto-ipsec-secure
crypto load-management
crypto remote-vpn-client
interface radio1
wlan "WMS SSID" bss 1 primary
interface radio2
shutdown
interface ge1
ip dhcp trust
qos trust dscp
qos trust 802.1p
interface vlan1
ip address dhcp
ip address zeroconf secondary
ip dhcp client request options all
interface pppoe1
use firewall-policy default
rf-domain-manager capable
logging on
service pm sys-restart
router ospf
!
rf-domain "TEMP DC"
location "TEMP DC"
contact "Velociti Inc."
timezone America/Chicago
country-code us
use smart-rf-policy "TEMP DC Smart RF"
channel-list dynamic
channel-list 2.4GHz 1,6,11
control-vlan 1
!
ap6532 84-24-8D-81-9C-88
use profile default-ap6532
use rf-domain "TEMP DC"
hostname AP23
interface radio1
power 8
interface vlan1
ip address 10.200.17.33/21
!
!
end
17 Replies

ckelly
Extreme Employee
Robert, another thing I thought of that MIGHT be affecting reliable MINT traffic is the MINT MTU value. Based on your posted config:
code:
mint-policy global-default
!


It's using the default value, which is 1460 bytes.
No idea what the network looks like, but just to be safe, I'd recommend bumping that value down to something like 1400. If the local infrastructure is fragmenting at a value even lower than 1400, use THAT value instead. Otherwise the MINT traffic can undergo double fragmentation, and MINT doesn't like being fragmented. This is more of a best-practices thing.
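For reference, the MTU change would go under the global MiNT policy. This is a sketch only; it assumes the WiNG 5.x `mtu` command under `mint-policy` (verify the exact keyword and allowed range against the CLI reference for your release):

code:
mint-policy global-default
 mtu 1400
!

The value would need to be committed and written to persist, and it applies to every device using the global MiNT policy, so the controller and APs should all settle on the new size.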

Now THIS is interesting:
code:
AP25#show mint route
Destination : Next-Hop(s)
4D.80.BE.70 : 4D.81.9C.88 via vlan-1
4D.81.9C.88 : 4D.81.9C.88 via vlan-1
1A.7C.53.A4 : 1A.7C.53.A4 via vlan-1
19.6D.B7.76 : 1A.7C.53.BC via vlan-1, 4D.18.47.28 via vlan-1
1A.7C.53.BC : 1A.7C.53.BC via vlan-1
1A.7C.71.D8 : 1A.7C.53.BC via vlan-1, 4D.18.47.28 via vlan-1
1A.4C.45.A4 : 1A.7C.53.BC via vlan-1, 4D.18.47.28 via vlan-1
1A.7C.53.5C : 4D.18.47.28 via vlan-1
1A.4C.45.2C : 1A.7C.53.A4 via vlan-1
4D.81.9D.AC : 4D.81.9D.AC via self
1A.4C.44.D8 : 1A.7C.53.BC via vlan-1
1A.7C.51.30 : 4D.18.47.28 via vlan-1, 1A.7C.53.BC via vlan-1
1A.4C.44.B0 : 4D.18.47.28 via vlan-1, 1A.7C.53.BC via vlan-1
4D.18.47.28 : 4D.18.47.28 via vlan-1
4D.80.BD.B8 : 1A.7C.53.BC via vlan-1
1A.7C.71.98 : 4D.18.47.28 via vlan-1
1A.4C.46.30 : 4D.81.9C.88 via vlan-1




Controller: RFS-6010-1000-WR
Mint ID: 19.6D.B7.76

This MINT route listing is from the **perspective** of AP25.
There are 5 entries here where, for AP25 to get to the WiNG device in the first column, two hops are needed. Does this sound right?

Example: For AP25 to get to 19.6D.B7.76, it has to go through 1A.7C.53.BC via vlan-1 and then through 4D.18.47.28 via vlan-1
code:
19.6D.B7.76 : 1A.7C.53.BC via vlan-1, 4D.18.47.28 via vlan-1



You can confirm using mint traceroute 19.6D.B7.76
What's interesting here is that there are **5** entries with 2 hops....and you seem to indicate that the issue with the APs dropping off is with 5 APs. Coincidence?


The other interesting thing here:
code:
AP25#show mint lsp-db
17 LSPs in LSP-db of 4D.81.9D.AC:
LSP 19.6D.B7.76 at level 1, hostname "RFS-SW01", 16 adjacencies, seqnum 285509
LSP 1A.4C.44.B0 at level 1, hostname "AP10", 11 adjacencies, seqnum 280965
LSP 1A.4C.44.D8 at level 1, hostname "AP04", 11 adjacencies, seqnum 280135
LSP 1A.4C.45.2C at level 1, hostname "AP02", 11 adjacencies, seqnum 279898
LSP 1A.4C.45.A4 at level 1, hostname "AP05", 11 adjacencies, seqnum 294789
LSP 1A.4C.46.30 at level 1, hostname "AP06", 11 adjacencies, seqnum 273896
LSP 1A.7C.51.30 at level 1, hostname "AP14", 11 adjacencies, seqnum 280856
LSP 1A.7C.53.5C at level 1, hostname "AP11", 11 adjacencies, seqnum 280674
LSP 1A.7C.53.A4 at level 1, hostname "AP15", 6 adjacencies, seqnum 276869
LSP 1A.7C.53.BC at level 1, hostname "AP16", 7 adjacencies, seqnum 697568
LSP 1A.7C.71.98 at level 1, hostname "AP07", 11 adjacencies, seqnum 280219
LSP 1A.7C.71.D8 at level 1, hostname "AP08", 11 adjacencies, seqnum 280893
LSP 4D.18.47.28 at level 1, hostname "AP24", 7 adjacencies, seqnum 249678
LSP 4D.80.BD.B8 at level 1, hostname "AP20", 11 adjacencies, seqnum 223532
LSP 4D.80.BE.70 at level 1, hostname "AP21", 11 adjacencies, seqnum 547615
LSP 4D.81.9C.88 at level 1, hostname "AP23", 7 adjacencies, seqnum 247070
LSP 4D.81.9D.AC at level 1, hostname "AP25", 5 adjacencies, seqnum 354676



.....is the varying number of adjacencies for each entry. The only one that makes sense is the first one, for the controller: it has 16 adjacencies, each representing a connection to one of the 16 APs. (Is it 16 or 17?)
Interesting that many of them show only 11. Again, 16-11=5. There's that **5** number again. More coincidence? I'm not sure what to make of the ones that show 5, 6, and 7, though. For each value there should be THAT many entries (5 entries showing 5 adjacencies, 6 entries showing 6 adjacencies, etc.). Differences in adjacency counts would be expected for APs sitting behind a router, so a difference is fine in itself, but the resulting output still doesn't add up.
I'm thinking there's something going on with 5 APs that all sit on the same segment behind a router...and they, for whatever reason, are the ones having an issue.

I'm starting to think there's some inconsistency in how these APs are configured on the controller, whereas what I would expect is that all of these APs are set up cookie-cutter style. If you want to send me a copy of the controller config to look at and verify, feel free (ckelly@extremenetworks.com)

RWCampbell
New Contributor
For illustrative purposes, here are the switch logs showing what we saw during the second event, when the APs were online but had disappeared from the left panel. AP25 is the device we started troubleshooting with:

Controller ssh history from yesterday during the outage:
show adoption status

Adopted Devices:
---------------------------------------------------------------------------------------------------------------
DEVICE-NAME VERSION CFG-STAT MSGS ADOPTED-BY LAST-ADOPTION UPTIME
---------------------------------------------------------------------------------------------------------------
AP24 5.5.5.0-018R configured No RFS-SW01 0 days 00:48:51 0 days 00:51:45
AP20 5.5.5.0-018R configured No RFS-SW01 1 days 20:45:03 9 days 03:24:36
AP21 5.5.5.0-018R configured No RFS-SW01 1 days 20:45:04 2 days 02:32:49
AP23 5.5.5.0-018R configured No RFS-SW01 0 days 00:48:51 0 days 00:51:45
AP25 5.5.5.0-018R configured No RFS-SW01 0 days 00:48:50 0 days 00:51:45
AP10 5.5.5.0-018R configured No RFS-SW01 1 days 20:45:05 13 days 09:30:21
AP04 5.5.5.0-018R configured No RFS-SW01 1 days 20:45:04 13 days 08:06:49
AP02 5.5.5.0-018R configured No RFS-SW01 1 days 20:45:03 13 days 10:58:02
AP05 5.5.5.0-018R configured No RFS-SW01 1 days 20:45:03 13 days 10:58:27
AP06 5.5.5.0-018R configured No RFS-SW01 0 days 01:13:54 13 days 09:04:14
AP14 5.5.5.0-018R configured No RFS-SW01 1 days 20:45:04 13 days 10:04:16
AP11 5.5.5.0-018R configured No RFS-SW01 1 days 20:45:04 13 days 09:33:47
AP15 5.5.5.0-018R *configured No RFS-SW01 0 days 00:48:50 0 days 00:51:45
AP16 5.5.5.0-018R configured No RFS-SW01 0 days 00:48:50 0 days 00:51:45
AP07 5.5.5.0-018R configured No RFS-SW01 1 days 20:45:04 20 days 04:31:37
AP08 5.5.5.0-018R configured No RFS-SW01 1 days 20:45:04 20 days 04:31:38
----------------------------------------------------------------------------------------------------------------
Total number of devices displayed: 16

RFS-SW01#mint ping 4D.81.9D.AC
MiNT ping 4D.81.9D.AC with 64 bytes of data.
Ping request 1 timed out. No response from 4D.81.9D.AC
Ping request 2 timed out. No response from 4D.81.9D.AC
Ping request 3 timed out. No response from 4D.81.9D.AC

--- 4D.81.9D.AC ping statistics ---
3 packets transmitted, 0 packets received, 100% packet loss
RFS-SW01#ping 10.200.17.35
PING 10.200.17.35 (10.200.17.35) 100(128) bytes of data.
108 bytes from 10.200.17.35: icmp_seq=1 ttl=64 time=6.74 ms
108 bytes from 10.200.17.35: icmp_seq=2 ttl=64 time=0.314 ms


AP25, today:
AP25#show mint id
Mint ID: 4D.81.9D.AC
AP25#show mint info
Mint ID: 4D.81.9D.AC
16 Level-1 neighbors
Level-1 LSP DB size 17 LSPs (3 KB)
0 Level-2 neighbors
Level-2 LSP DB size 0 LSPs (0 KB)
Level-2 gateway is unreachable
Reachable adopters: 19.6D.B7.76
Max level-1 path cost: 30 (to 19.6D.B7.76)
AP25#show mint lsp-db
17 LSPs in LSP-db of 4D.81.9D.AC:
LSP 19.6D.B7.76 at level 1, hostname "RFS-SW01", 16 adjacencies, seqnum 285509
LSP 1A.4C.44.B0 at level 1, hostname "AP10", 11 adjacencies, seqnum 280965
LSP 1A.4C.44.D8 at level 1, hostname "AP04", 11 adjacencies, seqnum 280135
LSP 1A.4C.45.2C at level 1, hostname "AP02", 11 adjacencies, seqnum 279898
LSP 1A.4C.45.A4 at level 1, hostname "AP05", 11 adjacencies, seqnum 294789
LSP 1A.4C.46.30 at level 1, hostname "AP06", 11 adjacencies, seqnum 273896
LSP 1A.7C.51.30 at level 1, hostname "AP14", 11 adjacencies, seqnum 280856
LSP 1A.7C.53.5C at level 1, hostname "AP11", 11 adjacencies, seqnum 280674
LSP 1A.7C.53.A4 at level 1, hostname "AP15", 6 adjacencies, seqnum 276869
LSP 1A.7C.53.BC at level 1, hostname "AP16", 7 adjacencies, seqnum 697568
LSP 1A.7C.71.98 at level 1, hostname "AP07", 11 adjacencies, seqnum 280219
LSP 1A.7C.71.D8 at level 1, hostname "AP08", 11 adjacencies, seqnum 280893
LSP 4D.18.47.28 at level 1, hostname "AP24", 7 adjacencies, seqnum 249678
LSP 4D.80.BD.B8 at level 1, hostname "AP20", 11 adjacencies, seqnum 223532
LSP 4D.80.BE.70 at level 1, hostname "AP21", 11 adjacencies, seqnum 547615
LSP 4D.81.9C.88 at level 1, hostname "AP23", 7 adjacencies, seqnum 247070
LSP 4D.81.9D.AC at level 1, hostname "AP25", 5 adjacencies, seqnum 354676
AP25#show mint route
Destination : Next-Hop(s)
4D.80.BE.70 : 4D.81.9C.88 via vlan-1
4D.81.9C.88 : 4D.81.9C.88 via vlan-1
1A.7C.53.A4 : 1A.7C.53.A4 via vlan-1
19.6D.B7.76 : 1A.7C.53.BC via vlan-1, 4D.18.47.28 via vlan-1
1A.7C.53.BC : 1A.7C.53.BC via vlan-1
1A.7C.71.D8 : 1A.7C.53.BC via vlan-1, 4D.18.47.28 via vlan-1
1A.4C.45.A4 : 1A.7C.53.BC via vlan-1, 4D.18.47.28 via vlan-1
1A.7C.53.5C : 4D.18.47.28 via vlan-1
1A.4C.45.2C : 1A.7C.53.A4 via vlan-1
4D.81.9D.AC : 4D.81.9D.AC via self
1A.4C.44.D8 : 1A.7C.53.BC via vlan-1
1A.7C.51.30 : 4D.18.47.28 via vlan-1, 1A.7C.53.BC via vlan-1
1A.4C.44.B0 : 4D.18.47.28 via vlan-1, 1A.7C.53.BC via vlan-1
4D.18.47.28 : 4D.18.47.28 via vlan-1
4D.80.BD.B8 : 1A.7C.53.BC via vlan-1
1A.7C.71.98 : 4D.18.47.28 via vlan-1
1A.4C.46.30 : 4D.81.9C.88 via vlan-1
AP25#

ckelly
Extreme Employee
This is an important point here to expand on:
"With this more recent event (where the same APs stopped communicating), adoption was fine, but whatever fundamental issue with the MiNT traffic was occurring was still occurring and blocking traffic despite the fact that they were still adopted."

Just to be clear, MINT is not normally responsible for carrying user traffic, but in your setup this comment set off a big red flag for me. I checked your config listing and sure enough.... (I should've noticed this earlier!)

code:
wlan "WMS SSID"
description WMS RF Environment
ssid TEMP-WMS-RF
vlan 1
bridging-mode tunnel <--------- THIS "tunnel"
encryption-type tkip-ccmp
authentication-type none
wpa-wpa2 psk 0 XXXXXXXXXX
service wpa-wpa2 exclude-ccmp


The bridging mode you have configured in this WLAN profile is set to tunnel. This encapsulates the client traffic within the MINT tunnels formed between the APs and the controller. Once the traffic reaches the controller, the controller strips the user traffic out of the tunnel and drops it onto the LAN. MOST of our deployments do not use this mode, except for special cases like a Guest WLAN.

I'd highly recommend changing this setting to "local". With local bridging, the AP takes the client traffic and simply bridges it onto the assigned VLAN (from the WLAN profile), and it goes straight out the AP's interface, tagged, to the switch. It's treated like any other traffic. In your case, the config indicates the WLAN is assigned to VLAN 1, which I'm assuming is the untagged native VLAN? If so, no big deal; the same applies.

The point to understand is that in "tunnel" mode, since the user traffic is placed inside a MINT tunnel and carried BACK to the controller inside that tunnel, any issue with MINT between the AP and the controller ALSO affects the user traffic, right? So local bridging is a much more resilient mode of operation.
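Concretely, based on the WLAN shown in the posted config, the switch to local bridging is a one-line change (a sketch; the exact commit/save syntax depends on your WiNG release):

code:
wlan "WMS SSID"
 bridging-mode local
!

Since that WLAN is on VLAN 1, the APs would then bridge client frames directly onto their ge1 uplinks instead of tunneling them back to the RFS.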

"What was interesting was that in the second incident, even though the same APs were not communicating, it still showed all 17 online. But I noticed that the 5 in question had disappeared from the left panel."

This comment is contradictory. The tree in the left panel only shows APs that are in fact adopted/online. So WHERE in the UI were you seeing that all 17 were still online while 5 had disappeared from the tree? I'm guessing you were simply looking at the Devices tab and seeing all the APs listed. If so, seeing the APs listed there is not an indicator of their adoption status; that's just a listing of the APs the controller knows about, whether they're currently adopted or not.

And yes, if MINT traffic can't get through, that will prevent adoption and the AP icons should disappear from the tree on the left. The Dashboard widget with the pie chart will also show you the number of devices online/offline....as well as other widgets.
And....since you have the WLAN setup in tunnel mode...losing MINT comms would then also affect the user traffic from making it back to the controller and being dropped onto the LAN.

With the APs setup in 'local' mode, you can have APs lose adoption but the user traffic is completely unaffected! The APs are just simply un-managed from the controller at that point. They're operating like fat APs and just keep chugging along. 🙂

Interesting that you still have layer-3 access to the APs when this happens and can SSH in. If this ever happens again, SSH in and you can start doing some interesting diagnostics directly from the AP to determine what's wrong. It has a pretty powerful CLI toolset.

I think the magic bullet here may very likely be switching the WLAN over to 'local' bridge mode. Unless there's a specific reason for it being set up this way, I'd change this immediately and then sit back and see how the system operates.

RWCampbell
New Contributor
As for the why: our first theory for how to induce the storm was to pull the DHCP server offline and let the level of DHCP broadcast traffic build up. That didn't work for an hour and a half, but we had said we were going to solve this issue that night, so we just started trying things. We started flooding the DHCP server, then the wifi controller, and finally an end device (there might be a few other things in there). Sending all that traffic to a device had the effect we wanted, for whatever reason, and from there we got a pretty good sense of the source and started honing in on it. Eventually we found an RJ45 cable bridging two switches that also had fiber uplinks to the fiber switch. The next morning we noticed a port in one of the switches that spanning tree had shut down. Thinking we had removed the loop, we enabled that port and everything went down, so we knew that port was part of a loop as well. We figured out where it went and removed it too. The switches were not really 'reacting'; they just had loops which amplified the broadcast traffic until the loops were removed.

So here's my understanding (which could be wrong). From the comments above, I understood that adoption can happen at the MiNT level or at the 'layer 3'/TCP-IP level. In the prior event, where we noticed these devices were not adopted, the fix we came up with was to change the profile so that adoption would happen over TCP/IP instead, and reboot. This made adoption happen and communication resumed. With this more recent event (where the same APs stopped communicating), adoption was fine, but whatever fundamental issue with the MiNT traffic was occurring was still occurring and blocking traffic despite the fact that they were still adopted.

In the first incident, the dashboard's online/offline graph showed 12 APs online when there should have been 17. Changing the adoption setting and resetting the APs brought it back up to 17, and the devices in question repopulated the left panel when you expanded the RF domain. What was interesting in the second incident was that even though the same APs were not communicating, it still showed all 17 online, yet the 5 in question had disappeared from the left panel. This made me think the online/offline dashboard widget just shows devices that are adopted, but if the MiNT traffic can't get through, they won't show in the left panel, and for whatever reason, when that's the case, wifi traffic will not pass. It made me wonder whether the TCP/IP traffic from the wifi clients ends up being sent to the controller via the MiNT protocol? I wouldn't have thought so, but for whatever reason, if a MiNT ping can't get through, then traffic will not pass even when TCP/IP traffic has no issues (I realize this seems logically impossible). With the second event, all we did to get traffic moving on those APs again was power cycle them individually; MiNT pings started working and they popped back into the left panel of the controller interface...

To address your more direct questions: we don't think there is a broadcast storm occurring now, and there wouldn't have been one before absent the loops we found. I'd think spanning tree should have arrested those anyway, but the firmware on the switches with the loops is ancient, and MiNT is proprietary, so it's hard to say. We have since removed the loops. And as I alluded to above, in neither case was TCP/IP communication down: we were able to SSH into the APs without issue at all times, whether they were adopted or not and whether MiNT traffic was passing or not.

It definitely crossed my mind that there might be a software bug of some sort causing this strange behavior. I guess I thought you had said MiNT traffic was normally tagged and separated from the TCP/IP traffic; I must have misinterpreted something.

I have updated the customer on where everything stands and suggested they get support established so that we can get the firmware. I'm not a fan of the idea that I need a support contract to get software updates that could include bug fixes, but the other things that come with it are worth it, especially the replacement plan. It wouldn't be easy to just replace the controller, so I think they'd be well served to maintain support for as long as they keep the controller and it's supportable, especially at the reasonable price you're offering that service.

We're not familiar with this system; I'm just continuing the conversation so we can both maybe learn something. I certainly may be misinterpreting the behavior of the system or your comments somewhere. You've been more than helpful in responding here and pointing us in the right direction.

~Robert