08-12-2019 08:10 PM
After moving from Classic to NG in our on-prem environment (about a month ago), we started seeing issues: unable to push out configs, errors in data reporting (number of clients connected, etc.). A call to support brought up an obscure and rarely mentioned hardware requirement for NG on-prem: no SAN support; storage must be a directly connected drive, preferably SSD.
My colleagues are skeptical; in our experience this is a highly unusual requirement for a VM environment. How did any of you address this specifically?
Thanks,
02-06-2020 08:04 AM
On 19.5.1.7-NGVA on-prem (VM on SAN SSD) with 300 APs, we hit this problem for the first time in December.
The Elasticsearch service kept stopping; after a reboot it would stay up for two hours at most.
We got the same response from support: SAN is not supported (we were never given this information when we installed HiveManager NG).
After a long negotiation, the support tech agreed to connect to the console and run this command: curl -X DELETE "localhost:9200/hm-*?pretty".
He said the problem would come back, and he was right; it returned this morning, after 50 days.
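For anyone else who hits this before support does, it may be worth seeing how big those indexes have grown before jumping straight to the delete. This is only a sketch against the standard Elasticsearch REST API on the same localhost:9200 endpoint the tech used; whether these calls are reachable on your appliance, and the exact index names, are assumptions to verify:

    # Cluster health (a red status usually means Elasticsearch is in trouble)
    curl -X GET "localhost:9200/_cluster/health?pretty"

    # List indexes with their on-disk sizes, largest first
    curl -X GET "localhost:9200/_cat/indices?v&s=store.size:desc"

    # The purge support ran, once you accept losing the hm-* data
    curl -X DELETE "localhost:9200/hm-*?pretty"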
12-11-2019 02:27 AM
Hi Abe.
Depending on how far through this whole thread you've read, and in case your issue returns: we had somewhat similar issues, and the fix ended up being a combination of about four things. They all got buried in my earlier paragraphs...
And who knows, maybe this topic will help the next person who finds themselves stuck on 19.5.1.7.
After all of this, we have what seems to be a stable, happy HMVA again. I am planning to purge the indexes quarterly, or sooner if I start to notice problems.
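In case it helps anyone, here is the rough cron entry I have in mind for that quarterly purge. Treat it as a sketch only: the hm-* index pattern, the availability of cron and curl on the appliance, and the log path are all assumptions to verify on your own VA first:

    # Purge the hm-* Elasticsearch indexes at 03:00 on the 1st of Jan/Apr/Jul/Oct
    0 3 1 1,4,7,10 * curl -s -X DELETE "http://localhost:9200/hm-*?pretty" >> /var/log/hm-index-purge.log 2>&1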
Recently, I turned 10-minute client stats back on (down from the 60-minute interval in Kevin's post) because I didn't like the massive gaps and spikes in my data. Things seem to be running fine. I left application stats and KDDR logs disabled. I hope to bring application stats back eventually, since it was neat to know what was going on and it makes the dashboard prettier. But "pretty" is not operationally important, so I'm not in a rush to test fate again.
Alan
12-09-2019 06:45 PM
I totally agree with that last paragraph! We're also on 19.5.1.7 (which I believe is the first release with Client/Network 360 integrated into on-prem) with 282 devices.
I just got off the phone with support (one hour on hold waiting for a pickup and another hour troubleshooting... wait times seem longer since the Extreme purchase). We had an issue months ago where a bunch of APs got a config containing an email address my browser had accidentally autofilled. This caused "The CLI 'ssid [email address] qos-classifier [email address] execute failed, cause by: Unknown error".
Even though that config was nowhere in HiveManager, a CLI reset of the AP followed by a complete config update would bring the misconfiguration right back. The only thing that fixes it is an HM GUI "reset to default", which clears everything so you have to reassign the policy, locations, etc., and then push a complete config update.
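If anyone wants to confirm the stray lines are really sitting on the AP (as opposed to being re-pushed each time), something like the following from the AP's console should show it. This is a sketch only: the show filtering syntax may vary by HiveOS version, and the reset shown is the CLI reset that did not clear it for us:

    show running-config | include qos-classifier   (look for the bogus SSID / qos-classifier lines)
    reset config                                    (CLI factory reset; in our case the bad config came right back afterwards)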
To me, this very much seems like a bug in HiveManager where it's holding onto data somewhere and not setting configs properly.
To ATAC, this is magically caused by HiveManager running on a SAN. What kind of logic is that? We run our VM infrastructure from a SAN with SSD caching, and that thing is not a slouch. We have database applications with just as much I/O that run perfectly well. It seems completely backwards to slap a blanket "must be your SAN" explanation on anything they can't figure out.
I've checked HiveManager's IOPS, and the majority of the hits are SSD cache hits. It's also not our heaviest VM in terms of storage activity.
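If anyone wants a second data point from inside the appliance rather than from the array, something like this from the VA console would show whether the disk is actually struggling. It assumes the sysstat/iostat tools are present on the appliance, which I have not verified:

    # Extended per-device disk stats every 5 seconds, 3 samples; watch the await and %util columns
    iostat -dx 5 3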
This really does seem like an attempt to eliminate on-prem and push everyone to the cloud, as you mentioned. We too love the Aerohive product and have been running it since 2013/2014, but this kind of thing does make me question what to do at our next refresh cycle. If they as a company want to go that route, fine, but tell customers that that is the reason rather than quietly withdrawing support for a product they've shipped. I understand it's hard to develop an on-prem product that matches the cloud offering, but I would rather hear that instead of this cop-out on real issues that might be happening.
11-28-2019 12:56 AM
Thanks Kevin. That's a helpful comparison.
We're all AP230s, which were on 8.4r11. I reverted to 8.2r6 to try to get rid of another portion of the problems we're having (better, but not fixed, so far). We have the default VA configuration running on Cisco UCS and Nimble storage. That had been fine up until the "not supported" change, and was possibly exacerbated by the problems introduced during our last VA upgrade.
We also have 1 PPSK SSID, 1 802.1X SSID with RADIUS/NPS, and an open guest network with speed and port limits. It looks like we're a bit heavier on clients, but we're also a single-campus college with fewer spaces and APs.
We'll see how these latest changes play out, what's next for the "Cloud IQ" VA (since HM is apparently gone), and when some of these HiveOS problems get fixed.
11-19-2019 07:34 PM
Alan, per your query...
We're running mostly AP230s at 8.2r6, plus a mix of 350s (6.5r12), 120s (6.5r10), 121s (6.5r10), and 130s (8.2r6), for a total of 591 APs on on-prem 12.8.3.3-NGVAFEB19. We run this on a Nutanix virtual system with RAID SSD and spindle. The HM VM has 8 cores and 40 GB RAM. The HM Virtual Appliance Management System shows 60% memory utilization, and 4% when idle (i.e., not pushing updates or doing other admin-driven activity).
Our school district is 6K students across 11 schools. We're about at the point where mobile (Wi-Fi only) devices outnumber traditional ones, so perhaps 3K wireless devices, plus staff and HS student BYOD (2K high school students). Our municipal Wi-Fi usage is minor in comparison.
We use 1 PPSK SSID, 1 RADIUS SSID (NPS on 2 separate domains), and the odd open guest SSID in specific locations or at specific times.