a week ago
Environment
— Core: Extreme Networks 7520-48Y-8C running EXOS 33.5.2.118-patch3-1
— Access: Extreme Networks 5520 / 5720 series
— Link: 25G SFP28 fiber between core and access switches
Problem
After any power outage (accidental or planned maintenance), the 25G SFP28 ports on the 7520 core switch that connect to 5520/5720 access switches via fiber go down and never come back up. No CLI command recovers them. The only workaround is physically moving the fiber cable to a different port on the 7520.
We opened a TAC case and received patch 33.5.2.118-patch3-1 but the issue persists. We reverse-engineered the firmware binary (.xos) and found two critical bugs.
Root cause — what we found inside the firmware
Bug 1: tx_disable asserted on every cold boot
Inside extr/hw-config/Extreme/7520/ports.yaml (extracted from rootfs.xos), all 48 SFP28 ports have this in their module_init_ops:
    module_init_ops:
      - device: port_cpld_1
        register: 0x1
        bitmask: 0x0040   # tx_disable bit
        value: 1          # ← asserts TX_DISABLE on every cold boot

On cold boot EXOS applies these init ops first, asserting tx_disable=1 on all 48 SFP28 ports via the PORT CPLDs. EXOS is supposed to clear this bit during port init. If there is a race condition (remote switch still booting) or EXOS crashes mid-init, the bit stays latched and the port is physically dead. No CLI can recover it because the bit is set at CPLD hardware level.
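To make the bit arithmetic concrete, here is a minimal Python sketch of the set-then-clear sequence described above. It only manipulates a simulated register value in memory. The function names and the 16-bit register width are assumptions for illustration, and nothing here talks to a real CPLD or reflects how EXOS implements port init.

```python
TX_DISABLE_MASK = 0x0040  # bitmask taken from the ports.yaml entry (bit 6)
REG_WIDTH_MASK = 0xFFFF   # assumption: 16-bit register, inferred from the 4-digit mask


def bit_position(mask: int) -> int:
    """Return the index of the single set bit in mask."""
    if mask <= 0 or mask & (mask - 1):
        raise ValueError("mask must have exactly one bit set")
    return mask.bit_length() - 1


def assert_tx_disable(reg: int) -> int:
    """What the init op does: set the tx_disable bit, leave other bits alone."""
    return (reg | TX_DISABLE_MASK) & REG_WIDTH_MASK


def clear_tx_disable(reg: int) -> int:
    """What port init is expected to do afterwards: clear only that bit."""
    return reg & ~TX_DISABLE_MASK & REG_WIDTH_MASK


def tx_disabled(reg: int) -> bool:
    """True while the transmitter is held off."""
    return bool(reg & TX_DISABLE_MASK)


if __name__ == "__main__":
    reg = 0x0000                  # simulated register contents
    reg = assert_tx_disable(reg)  # cold boot: transmitter held off
    print(hex(reg), tx_disabled(reg))
    reg = clear_tx_disable(reg)   # the step that appears not to happen on affected ports
    print(hex(reg), tx_disabled(reg))
```

If the clear step is skipped for any reason, the simulated register keeps bit 6 set, which matches the symptom of a port that stays dark regardless of admin state.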
Bug 2: FEC mode not specified (CL74 vs CL91)
The 25G_PORT definition has Fec: true but no explicit FEC type. EXOS negotiates at runtime, causing mismatches between the 7520 and 5520/5720 access switches.
Field test results
We changed FEC configuration (CL74 / CL91) explicitly on the affected ports:
— Most 25G SFP28 ports came back up immediately → FEC mismatch confirmed (Bug 2).
— Ports 1 and 5 remained down even after the FEC fix. Both use bitmask 0x0040 (bit 6 of the low byte) for tx_disable in PORT_CPLD_1 registers 0x1 and 0x2 respectively. This specific bit position appears to latch high and does not clear through normal EXOS port enable/disable cycles → Bug 1 confirmed as an independent failure.
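The two results above can be summarised in a small model: a link needs the transmitter enabled on both ends and the same FEC mode on both ends. The following Python sketch captures only that reasoning. The PortEnd type and the function names are hypothetical, and the logic is a simplification, not a description of EXOS internals.

```python
from dataclasses import dataclass


@dataclass
class PortEnd:
    fec: str          # "cl74", "cl91" or "off"
    tx_disable: bool  # True means the transmitter is held off in hardware


def diagnose(a: PortEnd, b: PortEnd) -> str:
    """Return the first condition that would keep the link down, or 'ok'."""
    if a.tx_disable or b.tx_disable:
        return "tx_disable asserted"
    if a.fec != b.fec:
        return "FEC mismatch"
    return "ok"


def link_can_come_up(a: PortEnd, b: PortEnd) -> bool:
    return diagnose(a, b) == "ok"


if __name__ == "__main__":
    access = PortEnd(fec="cl91", tx_disable=False)

    # Typical port: down only because of FEC, recovers once FEC is pinned.
    core = PortEnd(fec="cl74", tx_disable=False)
    print(diagnose(core, access))
    core.fec = "cl91"
    print(diagnose(core, access))

    # Ports 1 and 5: matching FEC does not help while tx_disable stays latched.
    stuck = PortEnd(fec="cl91", tx_disable=True)
    print(diagnose(stuck, access))
```

In this model the FEC fix and the tx_disable fix are independent, which is why correcting FEC alone recovered most ports but not ports 1 and 5.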
Questions
1. Has anyone else hit this on 7520 + 55xx / 57xx topologies with 25G fiber after a power cycle?
2. Is there a way to write directly to PORT_CPLD registers from the EXOS CLI to manually clear the tx_disable bit without rebooting? Something like:
    # Looking for something equivalent to:
    debug hal port 1 write cpld tx_disable 0
    # or any undocumented EXOS debug command to clear CPLD bits directly
3. Any experience with FEC auto-negotiation issues between 7520 and 5520/5720? Which FEC mode (CL74 or CL91) should be set on both sides for 25G SR fiber?
4. Is anyone aware of a fixed firmware version that addresses the tx_disable module_init_ops bug in ports.yaml?
Tuesday
Kudos for the complex troubleshooting you’ve carried out.
I can add that these bugs have been recurring for quite some time. I’ve encountered similar issues with DAC cables—both supported and unsupported—when rebooting X690s and X870s.
I haven’t gone as deep into the root cause, but I’ve been able to mitigate the problem by changing the FEC mode and, in some cases, moving the connection to a different port. This suggests that something has not been working correctly for quite a while.
Thursday
We have been doing the same, and your answer now makes me doubt that a quick fix is coming.
Monday
Wow, that's quite a bit of work you've put in there! Regarding your questions, the only one I can answer is #3b: "Which FEC mode (CL74 or CL91) should be set on both sides for 25G SR fiber?". This is described in IEEE 802.3 clause 108. For 25GBASE-SR and 25GBASE-LR, RS-FEC is optional but must be supported. In EXOS/SwitchEngine, the name used is CL91, because clause 91 also describes RS-FEC, although for 100G use cases; clauses 108 and 91 are very similar. Clause 74 (CL74) means FC-FEC, which was defined for 10G/40G backplane and copper links rather than for fibre. It is not meant to be used for 25G fibre links, but most vendors seem to let users choose between FC-FEC (Fire code FEC) and RS-FEC (Reed-Solomon FEC) more or less freely.
Historically, we've seen issues with both 3rd party optics and Extreme optics where links failed to come up after a reboot. The explanation then was that the SFP initialization sequence was wrong: transceivers that were not listed as supported were reset and initialized correctly, but were then reset a second time. This was fixed only after the same issue happened to "original" plugs from a new vendor that Extreme labeled as their product.
You should of course open a TAC case for this issue.