Apple Devices Causing Intermittent Network Outages

  • 5 November 2015
  • 18 replies
  • 610 views

We have recently started rolling out Apple devices at one of our locations: approximately 1200 iPads. They connect via an Aerohive wireless solution, which has a single point of entry to my network at a B5 switch that also does layer 3 for that site to our WAN. When the iPads started ramping up, it seemed to cause frequent intermittent network outages. Devices on the rest of the LAN, in different VLANs, stop communicating and then come back within 5-10 seconds. This happens anywhere from 1-50 minutes during every hour of the day, including overnight when no users are onsite.

After extensive troubleshooting, we still have not identified the cause of the problem, but we believe it may be something with the B5G124-48P2. From one of my switches at the local site, when the problem happens, I cannot reach my default gateway, which is the IP address of that VLAN interface configured on the B5. Since this affects all VLANs, we suspect the B5. The routing is all static and RIP is disabled. When this is happening, the system CPU is not excessive: it is usually around 26-45%, but I have seen it spike very briefly up to 60-80%.

I have debug logging enabled and there are no logs indicating that system resources are taxed. The only recurring log entry I get is "DHCPRELAY[265701448]: relay_main.c(315) 568089 This is from manager 1 %% Request could not be relayed to Server". I have also checked my DHCP server; it supports all 14 of my sites with no issues anywhere else. I checked the MAC tables and we usually sit at just under 4000 entries, but according to the specs the B5 can support up to 24,000. I appreciate any help or suggestions.
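For anyone trying to pin down when the gateway drops out, a small logger can help line up outage times with switch logs. The following is only a sketch, not something from the original post: it assumes a Linux host with the standard `ping` utility (`-c` and `-W` flags), and the gateway address `192.0.2.1` is a placeholder for the VLAN interface IP on the B5. The function names are made up for illustration.

```python
import subprocess
import time
from datetime import datetime

GATEWAY = "192.0.2.1"  # placeholder (assumption): use the VLAN interface IP on the B5


def ping_once(host, timeout_s=1):
    """Send one ICMP echo with the system ping (Linux-style flags assumed)."""
    result = subprocess.run(
        ["ping", "-c", "1", "-W", str(timeout_s), host],
        stdout=subprocess.DEVNULL,
        stderr=subprocess.DEVNULL,
    )
    return result.returncode == 0


def outage_windows(samples):
    """Collapse (timestamp, reachable) samples into (start, end) outage windows.

    A window ends at the first successful sample after a failure. If the
    samples end while still failing, the window ends at the last failure.
    """
    windows = []
    start = None
    last_fail = None
    for ts, ok in samples:
        if not ok:
            if start is None:
                start = ts
            last_fail = ts
        elif start is not None:
            windows.append((start, ts))
            start = None
    if start is not None:
        windows.append((start, last_fail))
    return windows


def monitor(host, interval_s=1.0):
    """Ping forever and print one line per state change."""
    was_up = True
    down_since = None
    while True:
        now = time.time()
        is_up = ping_once(host)
        stamp = datetime.fromtimestamp(now).isoformat()
        if was_up and not is_up:
            down_since = now
            print(f"{stamp} gateway {host} unreachable")
        elif not was_up and is_up:
            print(f"{stamp} gateway {host} back after {now - down_since:.1f}s")
        was_up = is_up
        time.sleep(interval_s)
```

Calling `monitor(GATEWAY)` from a wired host in an affected VLAN would print a timestamp at the start and end of each outage. With 5-10 second outages, a one-second interval should be fine enough to catch them.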

18 replies

Userlevel 4
It must be AeroHive ... it wouldn't happen with ExtremeWireless 🙂. Just kidding. Ultimately the best thing to do is to work with GTAC but one area you could look at is the effect that Apple Bonjour is having on the network. Bonjour is a zero-touch discovery protocol used by Apple devices which is great for the home but it doesn't scale in enterprises. It uses multicast as a discovery mechanism which is very expensive on a wireless network and also puts strain on switches because it has to be processed in software in the CPU (= high CPU utilization). AH has some controls for Bonjour and one thing you could try to do is block all Bonjour at the APs on one of the sites to see if it addresses the issue. Or you can try to drop the multicast at the switch before it is processed in the CPU to see if that helps. Again, the best thing to do is to work with GTAC to isolate the issue but it is something to consider.

Good luck with AH and keep an eye out for what we are doing with ExtremeWireless next time you are looking to upgrade your Wi-Fi network.

Thanks,

Will
Userlevel 3
How about port utilisation?
Maybe it's a broadcast or multicast issue?
We have run multiple packet captures with nothing standing out. Port utilization is very low. We were told that Bonjour Gateway was disabled with Aerohive. The thing is that I get reports of their switches going down too. Down is relative; they basically cannot get to their default gateway for a short period of time.
Userlevel 7
I have recently seen high switch CPU usage (100% for about half an hour) on EXOS based switches after rebooting connected Aerohive APs. Blocking mDNS from reaching the switch CPU dropped the CPU usage below 100%, but it was still quite high.

Packet captures showed a significant increase in the following three frame / packet types affecting switch CPUs sent by the Aerohive access points:
  1. mDNS requests
  2. Some layer 2 broadcasts probably used for Aerohive AP discovery
  3. Gratuitous ARP replies
Strangely the access points rejected every received mDNS answer with an ICMP Port Unreachable message, but continued sending requests.

The Bonjour gateway was disabled on the Aerohive APs, but the access points generate their own mDNS requests.

It usually takes hours for the switch CPU usage to drop to the normal values observed in the steady state network.

See the GTAC Knowledge article "How can I block mDNS with an ACL using MAC addresses" for info on an ACL to mitigate mDNS impacts on EXOS switches.
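To get a rough idea of which devices are sending the most mDNS, a host in the same VLAN can join the mDNS multicast group (224.0.0.251, UDP port 5353) and count packets per source. This is a minimal sketch using only the Python standard library; the function names are illustrative, and it shows what one host in the VLAN receives, not what actually reaches the switch CPU.

```python
import socket
import struct
import time
from collections import Counter

MDNS_GROUP = "224.0.0.251"  # well-known IPv4 mDNS multicast group
MDNS_PORT = 5353            # well-known mDNS UDP port


def open_mdns_socket():
    """Open a UDP socket joined to the mDNS multicast group."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM, socket.IPPROTO_UDP)
    # Allow sharing the port with a local mDNS responder if one is running.
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
    sock.bind(("", MDNS_PORT))
    mreq = struct.pack(
        "4s4s", socket.inet_aton(MDNS_GROUP), socket.inet_aton("0.0.0.0")
    )
    sock.setsockopt(socket.IPPROTO_IP, socket.IP_ADD_MEMBERSHIP, mreq)
    return sock


def sample_sources(duration_s=10.0):
    """Collect the source IP of every mDNS packet seen for duration_s seconds."""
    sock = open_mdns_socket()
    sock.settimeout(1.0)
    sources = []
    deadline = time.monotonic() + duration_s
    try:
        while time.monotonic() < deadline:
            try:
                _data, (addr, _port) = sock.recvfrom(9000)
            except socket.timeout:
                continue
            sources.append(addr)
    finally:
        sock.close()
    return sources


def top_talkers(sources, n=5):
    """Return the n most frequent source addresses with their packet counts."""
    return Counter(sources).most_common(n)
```

Running `top_talkers(sample_sources(60))` before and after rebooting an AP would give a crude comparison of mDNS volume per sender.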
Userlevel 4
Could you tell us what REV of code is on the B5?
Hw:BCM56514 REV 1
Bp:02.02.51
Fw:06.81.05.0003
BuFw:06.81.01.0027
PoE:2_1
That's the version of code we are running.. It's been rock solid so far.
Userlevel 4
How does the AeroHive handle all of the mDNS from the iPads? Sounds like the multicast is over running the switch. Are the AP's and the wireless networks on different VLANs from the workstations?
Have you checked port counters? Do a show rmon stats ge.1.9 for example to look at that port. However, I would take a look at your uplink first. Also, have you checked spanning-tree? Run show spantree stats active and show spantree stats. Do you see any topology changes when this is happening? If you are using 802.1d and the network has to reconverge because of some spantree event, it could take several seconds up to 40-50 seconds to stabilize.
The rmon stats are clean, with 0 errors or drops. We did check spantree too. We actually disabled it as a test, but that did not change the result.
Userlevel 4
I agree that you may want to reach out to GTAC and open a ticket. I would be glad to assist if needed.

Let's start with a quick question: does the AP support pause frames? To find out, run show port advertise from the switch CLI and look for pause under the remote section. If you see a Yes, the AP supports pause.

If so, run show port flowControl port# or show txqmonitor flowControl port#. If the ports show rx numbers, it is more likely that the AP is overwhelmed by the load along with some unwanted traffic it is processing (possibly multicast), or that a small change to the switch's config is needed. GTAC can assist.

If the numbers increment, one option as a workaround is the command below. Warning: the command will drop the link, clear the option to advertise that the port supports pause (the default on all ports), and then renegotiate speed/duplex.

clear port advertise port# pause

Now check your connection.

Here is an article that may help you: https://gtacknowledge.extremenetworks.com/pkb_mobile#/articles/en_US/Solution/Slow-Performance-on-SecureStack-or-random-loss-of-connection-to-neighboring-switch

Jason Parker
Thanks for the info. The Aerohive APs are not attached to the B5; Aerohive just hands off to us on a copper gig port from their switch. It is a managed wireless solution. I did check the advertise status and it shows no for everything on all the 1000Base-T ports.
Userlevel 3
Just to chime in on the information already provided with an additional possibility/question...

Is it possible that there is an IP conflict? We chased an issue at a customer site for months. Our situation was slightly different, but had a similar impact.

There was a BYOD device (an iPad) that would come onto the network and keep its DHCP address from the owner's home network. The device would not relinquish the IP for a few cycles, even though it was connected to the wireless with a different IP space.

As an unfortunate coincidence, the IP that this device would not relinquish happened to be the IP address of the gateway. As a result, this poisoned the ARP tables on a few devices and caused devices to not be able to reach the gateway...
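One way to check for this kind of conflict is to collect (IP, MAC) pairs over time, for example from repeated ARP table readings on a few hosts, and flag any IP that shows up with more than one MAC. The sketch below assumes the pairs have already been gathered somehow. The function name and the sample addresses are made up for illustration.

```python
from collections import defaultdict


def find_ip_conflicts(observations):
    """Return {ip: sorted list of MACs} for every IP seen with more than one MAC.

    observations is an iterable of (ip, mac) string pairs. MACs are
    lower-cased so the same address in different case is not counted twice.
    """
    seen = defaultdict(set)
    for ip, mac in observations:
        seen[ip].add(mac.lower())
    return {ip: sorted(macs) for ip, macs in seen.items() if len(macs) > 1}


# Hypothetical readings: the gateway IP answers from two different MACs.
readings = [
    ("10.0.0.1", "00:11:22:33:44:55"),
    ("10.0.0.1", "AA:BB:CC:DD:EE:FF"),
    ("10.0.0.20", "00:11:22:33:44:66"),
    ("10.0.0.1", "00:11:22:33:44:55"),
]
conflicts = find_ip_conflicts(readings)
```

If the gateway's IP ever appears in the result, some other device is answering ARP for it, which would match the symptom of hosts losing the gateway for short periods.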

That is definitely a weird one; however, our problem persists through the night with no users onsite. Our users are also not allowed to take iPads offsite. I do have a GTAC case open, but wanted to cover my bases with the community's help. Thanks.
Userlevel 4
Another thought: what MGBIC are you using? Try:

set logging local console enable file enable
set logging default severity 7

Remove and reinsert the MGBIC. Are there any messages in show logging buffer?

How about egress? Run show vlan portinfo port port#. Also try it on the uplink to the DHCP and RADIUS servers.
The ports we are using are copper Ethernet. No GBICs involved.
Current update: we found out that for some reason the layer 3 ARP cache is not holding records. For example, the default ARP cache timeout is 4 hours. We have network switches that are showing anywhere from 0-5 minutes of age on their ARP entries. No ARP entry is older than 10 minutes for anything. Also, when an entry drops out of the list, it can take from a few seconds up to an hour to repopulate. The number of entries in the cache is just over 2000, so not even close to the max. Same thing with the MAC address tables. We are waiting for a replacement device from Extreme support to see if that fixes the issue.
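The premature aging described here can be checked mechanically by comparing two ARP cache snapshots taken a known interval apart. Any entry that disappears although its age could not yet have reached the 4-hour timeout was evicted early. The sketch below assumes each snapshot has already been parsed into a dict of IP to age in seconds; how the cache is read from the switch is not covered, and the function name is illustrative.

```python
def premature_evictions(prev, curr, poll_interval_s, timeout_s=4 * 3600):
    """Return IPs that vanished between two snapshots before they could time out.

    prev and curr map IP address -> entry age in seconds, taken
    poll_interval_s seconds apart. An entry counts as prematurely evicted
    when it is missing from curr even though its age plus the polling
    interval is still below the configured timeout.
    """
    dropped = []
    for ip, age in prev.items():
        if ip not in curr and age + poll_interval_s < timeout_s:
            dropped.append(ip)
    return sorted(dropped)


# Hypothetical snapshots taken 60 seconds apart.
before = {"10.0.0.5": 120, "10.0.0.6": 14390, "10.0.0.7": 30}
after = {"10.0.0.7": 90}
early = premature_evictions(before, after, poll_interval_s=60)
```

Here 10.0.0.5 is flagged because it vanished at roughly 3 minutes of age, while 10.0.0.6 is not, since it was close enough to the 4-hour limit to have aged out normally.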
Hi Thomas,

Any luck on this? I've been battling this for over a year and it seems to get worse and worse. And Yes, I do have random clients that connect to the wireless but cannot reach their gateway and that doesn't just happen on Apple devices.
