Best practice for XMC alarms (EXOS)

Tomasz
Valued Contributor II
Hello Community,

I've been asked to improve the alarm configuration in XMC for a customer whose network is EXOS-based (currently 22.6).
They need their support team to react ASAP to any issues, outages etc. related to network services and connectivity.
As this is my first serious work with alarms in XMC, I'd like to ask you for some advice. Perhaps some of you already have a best-practice config for the Alarms section?

The customer uses EXOS for a 2-tier network with stacks at the edge and a firewall performing routing, DHCP relay etc. The switches mainly provide PoE, AAA (Policy + RFC 3580 from EAC), telemetry and RSTP. All kinds of typical devices are connected: PCs, APs, phones, servers, cameras, even some physical access control devices.
The customer wishes to have reasonable alarms configured with urgent ones also sent via e-mail.
I've tried to walk through the EMS Messages Catalog, but unfortunately it doesn't contain all the possible message strings that I could use as alarm criteria in XMC, just the event types.
Here's what I thought of, and I would really appreciate it if you helped me sort this out:
  1. All syslogs with Warning, Error or Critical severity raising an alarm.
  2. All syslogs with Critical severity sending an e-mail additionally.
  3. NAC-related alarms for AAA status awareness.
  4. Particular types of syslogs that raise an alarm with e-mail as an action:
  • temperature: HAL.Sys.ShutDwnTempRangeExcd, HAL.Sys.TempWarning, HAL.Sys.TempCritical, HAL.Sys.FanTrayFail,
  • general HW failures: HAL.Msg.Critical/Error/Warning, POE.Critical/Error/Warning, DM.DsblSlotShutDown, ds.oom (?), ds.pcfg_init_fail (?),
  • STP loop detected: STP.DsblPortLoopDtect,
  • interface errors: DM.SensorAlarmDtect (regarding transceiver operation; Rx/Tx errors don't have any 'excessive rate' logs, do they?),
  • high resource utilization: EPM.cpu? (or rather stats-based alarms in XMC itself); HAL.Card.HwTblThrshldExcd, HAL.Card.L2L3HwTblThrshldExcd, HAL.Card.AclHwTblThrshldExcd,
  • stack topology errors: HAL.Stacking.Critical/Error/Warning, NM.NodeStateFail,
  • STP errors/events: STP.DsblPortBrdgDtect, STP.InBPDU.DropRxNonSTPPort, STP.System.InitFail/AllocMemFail/InsNodeFail.
  5. Particular alarm-raising events without e-mail as an action:
  • STP.SendClntTopoChgMsg,
  • vlan.msgs.PortLinkFlapActLogEvent - too many from some endsystems at the moment,
  • I also thought of vlan.msgs.FldRateOutActLogEvent with 10 kpps as a threshold to flag an excessive rate of BUM traffic.
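
To make the intended policy easier to review, here is a minimal Python sketch of the rules listed above. It is only a model of the decision logic, not XMC configuration, and it uses no XMC API. The function name `classify`, the helper `expand` and the set names are hypothetical. It assumes that the severity rules and the event lists combine with a simple "or", so a Critical syslog always sends an e-mail even if its event is in the alarm-only group. The two `ds.*` events are left out because they are marked with "(?)" in the list.

```python
# Hypothetical model of the alarm policy from the list above.
# Not XMC configuration; no XMC API is used.

ALARM_SEVERITIES = {"Warning", "Error", "Critical"}  # point 1
EMAIL_SEVERITIES = {"Critical"}                       # point 2

LEVELS = ("Critical", "Error", "Warning")


def expand(prefix, suffixes):
    """Expand shorthand such as HAL.Msg.Critical/Error/Warning into full names."""
    return {f"{prefix}.{suffix}" for suffix in suffixes}


# Point 4: events that raise an alarm and also send an e-mail.
EMAIL_EVENTS = (
    {
        "HAL.Sys.ShutDwnTempRangeExcd", "HAL.Sys.TempWarning",
        "HAL.Sys.TempCritical", "HAL.Sys.FanTrayFail",
        "DM.DsblSlotShutDown",
        "STP.DsblPortLoopDtect",
        "DM.SensorAlarmDtect",
        "HAL.Card.HwTblThrshldExcd", "HAL.Card.L2L3HwTblThrshldExcd",
        "HAL.Card.AclHwTblThrshldExcd",
        "NM.NodeStateFail",
        "STP.DsblPortBrdgDtect", "STP.InBPDU.DropRxNonSTPPort",
    }
    | expand("HAL.Msg", LEVELS)
    | expand("POE", LEVELS)
    | expand("HAL.Stacking", LEVELS)
    | expand("STP.System", ("InitFail", "AllocMemFail", "InsNodeFail"))
)

# Point 5: events that raise an alarm without an e-mail.
ALARM_ONLY_EVENTS = {
    "STP.SendClntTopoChgMsg",
    "vlan.msgs.PortLinkFlapActLogEvent",
    "vlan.msgs.FldRateOutActLogEvent",
}


def classify(event_name, severity):
    """Return (raise_alarm, send_email) for one syslog event."""
    send_email = event_name in EMAIL_EVENTS or severity in EMAIL_SEVERITIES
    raise_alarm = (
        send_email
        or event_name in ALARM_ONLY_EVENTS
        or severity in ALARM_SEVERITIES
    )
    return raise_alarm, send_email


if __name__ == "__main__":
    for name, sev in [
        ("HAL.Sys.TempCritical", "Warning"),
        ("STP.SendClntTopoChgMsg", "Info"),
        ("Some.Other.Event", "Error"),
    ]:
        print(name, sev, classify(name, sev))
```

Writing the rules down this way exposes one open question in the list: whether an alarm-only event logged at Critical severity should still trigger an e-mail. The sketch says yes, but that precedence is an assumption.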
BTW1 I couldn't find an option to e-mail an active alarms report/history other than the digest feature (Alarm->Consolidate Email) in the suite-wide options. Is it possible to have critical alarms sent immediately and the others sent as a scheduled report?
BTW2 For some reason their 8.2.4.55 XMC only has 'Workflow Dashboard' in Tasks section, what is wrong here?

Kind regards,
Tomasz
3 REPLIES

Anonymous
Not applicable

Hi Tomasz,

No worries, thanks for getting back.

I ended up going through the EMS catalogue myself and adding a whole bunch that I thought would be useful. I probably went a little overboard, but I thought better safe than sorry.

Here are some screenshots of what I added.

Cheers,

Martin

Tomasz
Valued Contributor II

Hi Martin,

I apologize for the delay.

Unfortunately we had to step away for a while, and in the end I didn't manage to try these out at the customer. Today I'd also consider including port utilization stats. I hope to get back to the topic soon, at least in my lab, but I'll have to buy a new server first…

Thanks,

Tomasz

Anonymous
Not applicable

Hi Tomasz,

Out of interest, I'm going through a similar exercise.

What did you end up doing? Did you stick with the ones you detailed?

Many thanks in advance
