Okay, I don’t know what’s happening here but this isn’t looking pretty. All of our AP650s are losing connectivity multiple times throughout the day.
Look at this! Some APs will only do this once or twice and others are doing it often, like this. There’s nothing indicating any type of error or issue. We didn’t have this problem on firmware 10.0r9b.
I have a case open with GTAC, but are we the only ones seeing this issue?
All 145 APs are showing data like this. I’m waiting to see what GTAC says today, but I may be rolling back to 10.0r9b tonight.
Are you using 10.2r4? I know you were using 10.2r3 previously; have you had this issue?
I’m seeing the same thing on all our devices; I went to 10.2r4 on our entire fleet after having it at 2 locations without any reported issues.
This is a 630, but I am seeing this on any randomly selected 650s and 630s alike. I haven’t heard any issues with anyone’s connectivity the way I did with 10.0r10 though… Is this an actual issue with connectivity or is there something wrong with reporting?
Can you let me know what GTAC says?
You most certainly are having the same issue. In the moments when this happens, reporting goes down and all of the clients are disconnected. Sometimes they are able to reconnect immediately and sometimes it can take a couple of minutes. For us, this disconnects teachers and students from their video conferencing sessions. I’m rolling back all of our APs to 10.0r9b right now.
Yes, I’ll most certainly share with you what the GTAC engineer says. However, even with high priority support, this is taking way too long. I sent over tech data from two APs at 9AM and never heard a thing all day.
I think I’m experiencing the same issue with my test AP410C on 10.2r4. It dropped out about 3 times in 2 hours without me moving the MacBook Pro or reaching the macOS roam threshold of -75...
Okay, I’m glad I’m not alone. We have Apple, Windows, and Chromebook devices; they were all dropping every couple of hours. I reverted all of the APs back to 10.0r9b last night. I’ll report back with any new info.
I’ve got a few dozen schools with separate ECIQ instances and multiple admins working on the WiFi platform. I’ve got no way of knowing who has updated the firmware. By the time the customers complain, the reputation damage is done. That means now checking all instances. Really looking forward to getting an API where I can automate the deployment of firmware and central monitoring. There are a few solutions out there which offer better help in situations like these. The amount of unpaid break fix hours has been way too high with Extreme in the past 12 months. Let’s hope it gets better soon ;)
Oh wow, separate instances… that must be SOOOO much fun! I 100% agree, the issues with CloudIQ and the firmware updates have taken their toll. There are so many features that don’t even work or that we’ve been advised to turn off because they cause more issues. The AP650s are powerful units, and I’ve been instructed by GTAC to disable 75-80% of the advanced features over the past 6 months. I honestly just want our services and products to work as advertised. Truth be told, I wish Extreme would start a buyback program.
Do any EN techs look at this forum?
I seriously think it’s only Sam Pirok helping everyone on the forums. I never see any other engineers/support staff replying or helping her.
I’m calling in Gandolf
Hey guys, we do have a bunch of our engineers on here helping out, I’m just the most obsessed =) I also have to go bug other engineers for a lot of my answers, so there’s a lot of team contributions behind the scenes.
I have been following this thread, and I’m looking into what we can do for you all. I didn’t want to jump in before I had something useful to contribute here, but I’m definitely working on getting some help for you all!
I really appreciate you guys bringing attention to these kinds of issues. Please don’t hesitate to keep letting us know about these pain points.
You’re the best, hence why you’re our only beacon of hope. There are too many issues, Sam. CloudIQ monitoring going nuts/dropping reporting, GTAC not responding quickly, firmware updates that keep causing too many issues. We just want stable, reliable WiFi, and I know that you know this. When K-12 educators lose their connection multiple times a day because of buggy AP firmware, along with CloudIQ having its own issues, it’s not acceptable.
As was said above, “The amount of unpaid break-fix hours has been way too high with Extreme in the past 12 months.”
We’re all worn out by this cycle of things not working as advertised. Okay, lunchtime. ✌
I rolled back one of our 630s from 10.2r4 to 10.0r9b and it seems the reporting shows the same type of connectivity as before. (8 hour time range; it’s been on 10.0r9b since last reconnect ~12 hours previously)
Is there some way I can actively monitor the connected clients so I can see exactly what’s happening at each of those points when connectivity/memory/CPU all show 0? Is there an active log monitor via CLI I can keep up possibly?
I’d recommend auth debugs for CLI monitoring. That should show any disconnection messages to help narrow down what exactly is causing the loss of signal. This guide reviews how to enable auth debugs: https://extremeportal.force.com/ExtrArticleDetail?an=000065975&q=Auth%20debug
If you can record a couple MAC addresses of clients having issues during the down times, that will help us sort through the auth debug logs.
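As a rough illustration of sorting through a saved log by client MAC, here is a minimal Python sketch. It assumes only that the log is plain text and that client MACs appear somewhere in each relevant line; it does not assume any particular auth debug format. The function names and the sample lines are made up for the example, and the pattern accepts the common notations (`aa:bb:cc:dd:ee:ff`, `aa-bb-cc-dd-ee-ff`, `aabb:ccdd:eeff`, `aabb.ccdd.eeff`).

```python
import re

# Matches MACs written as six 2-digit groups or three 4-digit groups.
MAC_PATTERN = re.compile(
    r"\b(?:[0-9a-fA-F]{2}[:-]){5}[0-9a-fA-F]{2}\b"
    r"|\b(?:[0-9a-fA-F]{4}[:.]){2}[0-9a-fA-F]{4}\b"
)

def normalize_mac(text):
    """Reduce a MAC in any of the notations above to 12 lowercase hex digits."""
    digits = re.sub(r"[^0-9a-fA-F]", "", text).lower()
    if len(digits) != 12:
        raise ValueError(f"not a MAC address: {text!r}")
    return digits

def lines_for_clients(lines, client_macs):
    """Yield only the lines that mention one of the given client MACs."""
    wanted = {normalize_mac(m) for m in client_macs}
    for line in lines:
        found = {normalize_mac(m) for m in MAC_PATTERN.findall(line)}
        if found & wanted:
            yield line.rstrip("\n")

# Made-up sample lines, not real auth debug output.
sample = [
    "10:00:01 station 0011:2233:4455 disassociated\n",
    "10:00:02 station aa:bb:cc:dd:ee:ff associated\n",
    "10:00:03 unrelated message\n",
]

for hit in lines_for_clients(sample, ["00-11-22-33-44-55"]):
    print(hit)
```

To run it against a saved file instead of the sample list, pass an open file object as `lines`, since iterating a text file yields one line at a time.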
You can also try setting up a client monitor in the XIQ GUI to see if that gives us any insights. This guide reviews how to set up a client monitor (apologies for the outdated screenshots, I’ll update that guide soon): https://extremeportal.force.com/ExtrArticleDetail?an=000056843&q=Client%20monitor
I’m trying to get auth debugs from two APs for you, along with some client monitoring info. It hasn’t been a pretty morning with WiFi.
Thanks Kevin, sorry to hear it’s been a rough morning. It would be good to attach all that to your case so our engineers can all see it, but please also let me know when that’s available and I’ll take a look too to see if anything jumps out at me. Good luck, and please let me know if I can help with anything. I’m pretty open today if you need help and want to jump on a call.
Thank you very much for your assistance. I’ve uploaded auth debug tech data from two APs, along with a GUI diag for one of the devices. I hope we can have a better idea of what’s going on ASAP.
Thank you for getting that data together for us! I took a look at the client monitor, but it’s all normal connection messages and some generic disconnection messages. I’m going through the tech data now and also talking to the tech on your case, who is reviewing the same data. So far we haven’t found anything helpful, but we are still looking.
I heard from the engineering team looking into this for us, and they’d like to compare data from before you rolled back the firmware. Do you remember approximately what time you rolled the firmware back to 10.0r9 on your APs?
Unfortunately, the rollback to 10.0r9b didn’t correct the lag and random disconnections. After I rolled out 10.2r4 on 1/10/2021, things were quiet overall until maybe the 15th. Then more issues crept up with random disconnects last week. This week it’s worse, and reintroducing 10.0r9b just added an extra layer of divine icing on top. I’m just trying to provide a timeline based on incidents and teacher feedback.
It’s so bad we’re having to hardwire teachers’ laptops.
I only rolled back one device so far: GW-MDF-2
This was at about 10am CST yesterday, 1/27.
This is from AP NCPS-AP650-RM305, one of the APs I sent tech data for. Clients were being disconnected and having performance issues during the times of the dips.
Thanks very much for the extra details guys! I hear the engineering team has found some “interesting things”, still waiting on details on that but progress is being made!
Of note, these graphs change drastically depending on which “Time Range” is selected. I’m seeing this across all of my models and firmware versions.
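One possible reason a graph can look different per Time Range, offered purely as an assumption since nothing in this thread says how XIQ builds these charts, is that longer ranges plot fewer points and each point summarizes more samples. The Python sketch below uses made-up readings and a hypothetical `bucket_averages` helper to show how a single reading of 0 stays visible at fine granularity and gets softened when samples are averaged in larger buckets.

```python
def bucket_averages(samples, bucket_size):
    """Average consecutive samples in groups of bucket_size (last group may be shorter)."""
    averages = []
    for start in range(0, len(samples), bucket_size):
        bucket = samples[start:start + bucket_size]
        averages.append(sum(bucket) / len(bucket))
    return averages

# Hypothetical memory-usage readings (percent), with one reading of 0.
readings = [40, 40, 40, 0, 40, 40, 40, 40]

print(bucket_averages(readings, 1))  # one point per sample: the 0 is plotted as-is
print(bucket_averages(readings, 4))  # four samples per point: the dip shows as 30.0
```

If the charts are built some other way (for example by taking the minimum or the last sample per bucket), the effect would differ, so this is only a way to reason about the symptom, not a diagnosis.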
I’ve not been able to correlate the issue with these dips, but we are having many reports of disconnects across multiple sites, models, and firmware versions, all with the symptom of being connected with “no internet”. Most of these started toward the beginning of January.
I’ve been delaying working with support until I had more info on my end as I wanted to ensure we weren’t having some other internal issue. Also it’s extremely difficult to catch such an intermittent issue in the act.
Looking just now I’ve confirmed this type of graph behavior (I pulled up all my stragglers that aren’t on latest FW to give a broader picture)
AP550 - 126.96.36.199, 10.0.8.1, 10.0.9.2
AP250 - 10.0.9.2
AP230 - 188.8.131.52, 10.0.8.1, 10.0.9.2
AP330 - 184.108.40.206
AP121 - 220.127.116.11
Good morning, I hope you’re doing well this Friday. How are things looking?
How about on this Monday instead?
Good morning all. My apologies, I was talking directly with Kevin on Friday, I meant to update this thread as well but obviously missed that goal. We do still have an engineering team actively working on this, unfortunately they had no significant updates to share on Friday. I’m still waiting for most of the team to come online this morning (they are based in California for the most part), and I will share an update here with what I find out.
Hey everyone, I really appreciate your patience, I know this is critical. Our engineers are still looking into this, but they are recommending that we try HiveOS 10.2r3. Has anyone already tried that version? I see a few other versions listed here but not that one. If no one has tried it yet, could you please try that version on a few APs to see if that helps things stabilize?
I am rolling back one of my APs to 10.2r3 and will update tomorrow if the reporting still shows any dips in memory use.
Hey Sam, any progress? My GTAC ticket hasn’t updated since the 28th of January, and that didn’t provide any additional information either.
Here’s the device I rolled back to 10.2r3 last night:
Looks like no change on the memory dump reporting.
As a note, I still haven’t had any significant issues with the vast majority of my fleet on 10.2r4. I see the same reporting above on all of them, but I’m just not sure what’s actually happening at those points and it’s clearly so random that I’m not sure how to actively monitor for that effectively.
Hey guys, good morning. Thank you for testing that for us, I will pass your results and notes on to the engineering team working on this.
I haven’t heard anything since the suggestion to try 10.2r3 yesterday, but it looks like that likely won’t help here based on javabomberman’s results. I will start pinging that team as soon as I see them sign on for the day and I’ll let you all know what they have to say.
Oops, bit braindead this morning. That AP wasn’t actually a rollback, as I had moved it from 10.2r4 to 10.0r9b previously. So, that graph shows a move from 10.0r9b to 10.2r3. Not sure that really makes a difference, but just FYI so all the info is correct.
I will find one on 10.2r4 to rollback to 10.2r3 and give it 8+ hours and see how it looks.
Thanks for clarifying, but that shouldn’t make a difference here. They just wanted to see if 10.2r3 was stable, regardless of what version you moved from. I appreciate the thought though!
I mean, 10.0r9b was stable for months, but now even that version shows the same reporting after rolling back from 10.2r4.
On my test AP (AP410C) it looks really good since the downgrade to 10.2r3 last night.
That is interesting, thank you for letting me know! Would you be able to send me your VIQ name? You can find that in Global Settings > VIQ Management. If you’d rather send that directly to me please feel free to email me at firstname.lastname@example.org.
I see in that graph that you have it on 24 hours. What does it look like with 1, 2, or 4-hour intervals?
I don’t know if I’m just not able to or if it’s not possible, but I can’t go back in time in the 1, 2, 4, or 8 hour views. The recent ones look OK, though. I have had one hiccup which shows as a roaming event but doesn’t show in the graphs (it was around 20:04), which is weird because it’s a single AP, and the Apple TV which is my test client is in plain sight of the AP and has excellent connection stats. Compared to 10.2r4, where the connection was dropping every ~45 minutes, it looks better, but…
Are you guys able to reproduce the problems in your labs? Did you get any news from the engineers?
Thank you for the screenshots! Yes, we are able to reproduce this. They were originally thinking it was an issue with 10.2r4, but I’ve passed along all of the findings you all have shared with me, and they are looking elsewhere.
Unfortunately I don’t have any details beyond that yet. I imagine a bug report is in our future. If you haven’t filed a case already, I’d encourage you to do so. I’ll keep updating you here as I learn more, but once we do have a bug open, it’s good to have several cases attached to the bug so we can keep momentum up on the resolution.
Open cases everyone, we need this fixed ASAP!
Hey everyone, I have an update! The team is seeing some oddities in the auth debugs/tech data Kevin was kind enough to get for us, and they asked if we could get a couple more examples to compare. If possible, could you please enable auth debugs on an AP having these issues, wait for another memory drop or client issue (preferably both at one time), then pull tech data so we can take a look? If you are able, please open a case and attach the data there (and let me know the case number so I can keep an eye on it for you). If you can’t open a case, please feel free to send the data to me at email@example.com and I’ll get it to the engineering team for you.
This guide reviews how to enable auth debugs: https://extremeportal.force.com/ExtrArticleDetail?an=000065975&q=auth%20debug
If anyone would like assistance enabling those debugs, please let me know and we’ll set up a call.
Good morning everyone! I sent over two packet captures from two APs; I ran the captures on both radios for 1 minute. Please let me know if you need anything else. Here’s some current wave action.
Not so far, but they are reviewing the packet captures you got for us.
Good morning! Were you guys able to send over your auth debugs/tech data? I love roller coasters, but I need this one to stay flat. Are any of you still seeing this in XIQ?
Here’s one AP that was rolled back from 10.2r4 to 10.2r3 ~2 days ago: (the other two show similar reports)
I’m going to enable auth debugs on these 3 and then check tomorrow for the points of memory dump and try to pull tech data to provide.
I don’t wanna muddy the waters with what might be an unrelated issue, but I’m getting good reports regarding no ongoing client disconnects from multiple sites after rolling back 10.0r9b to 10.0r8a (still keeping my fingers crossed and putting feelers out to ensure the issue is really resolved).
I am however still seeing the memory dump in the graphs (even on 10.0r8a). I’m working on grabbing some techdata with auth debug from my lab this morning. Hopefully I’ll catch something. Happy unicorn hunting everyone!
Thank you for helping! You went back to 10.0r8a… OH that’s not pretty. Happy unicorn hunting is an accurate statement LOL!
So again, I updated to 10.2r4 ~a month ago, and we still haven’t had any reports of disconnects or dropped devices at any sites. Comparing this to the 2 days of issues at all sites when we went from 10.0r9b to 10.0r10, I am skeptical that the reporting of memory dumps is actually causing the same issues that it was previously for us.
Just another fun bit for this whole scenario.