I have a wave transport vendor that suffered issues twice, about ten days apart, causing my link to flap a bunch. I put in a ticket on the second set of occurrences. I was told that a card issue had been identified and that I would be notified when the replacement happened. Ticket closed.
Three weeks later, I opened a new ticket asking for the status. The new card arrived the next day, but since no more flaps were happening, the card would not be replaced. Ticket closed.
A) It doesn’t seem like they actually did anything to fix the circuit.
B) They admitted a problem and sent a new card.
C) They later decided to not do anything.
Is that normal?
Is that acceptable?
To avoid the problems that flapping causes, I disabled that circuit until it was repaired. But it seems like they're not going to do anything, and I only know that because I asked.
Ask them for an RFO report. They should be able to explain the diagnostics and the reason for the original flapping. It's quite possible they found a problem somewhere on the path, with a DWDM multiplexer or something; that wouldn't require any card replacement at your prem. The same goes for passive components, amplifiers, and such. Or they might have had someone do work that they don't want to fess up to (which is kinda silly, but I get it).
I have our Junipers configured with a 5-second up timer, e.g. "hold-time up 5000". This way a flapping circuit must be stable for at least a few seconds before it is placed back into service; otherwise a prefix that comes from connected/direct/static/qualified-next-hop can repeatedly leak into another protocol and possibly cause a globally visible BGP event.
Some providers have a much more disruptive layer-1 infrastructure and will ask you to configure a 1s+ up timer. I think there's an interesting question here that could go either way: do you want transport-side faults to be exposed to you, or should the client interface in a system be held up so that fault condition isn't forwarded (sometimes called FDI) to the client interface?
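For anyone who hasn't set this up before, here's a minimal sketch of what that hold-time config looks like in Junos stanza form. The interface name is just an example; values are in milliseconds, and the idea is to dampen only the up transition while still reacting to down events immediately:

```
interfaces {
    xe-0/0/0 {
        /* require 5 seconds of stability before declaring the link up;
           react to a down event with no added delay */
        hold-time up 5000 down 0;
    }
}
```

Or as a set command: `set interfaces xe-0/0/0 hold-time up 5000 down 0`.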
They may have had the system misconfigured, so you saw a fault on a protected path when there was a switch. It can take some time for the transponder to re-tune if the path timing differs, e.g. if your A path is 25 km and your B path is 5 km behind an optical switch; at the higher PHY rates it takes some extra time.
I know that Cisco also has these interface timers, but some of the others may not (e.g. I don't know if Mikrotik has them, but cue the wiki in a reply).
If it's stable for 48 hours, I would place it back into service, but you should escalate at the same time and determine if they were truly hands-off. It may be that a fiber was bent and is now fixed, and that actually was the root cause.
Hope this helps you and a few others.
At a previous $dayjob, we employed a guy that was a bona fide optical guru. He had effectively memorized the 400+ page Nortel 6500 operating guide, and some of the hardware vendors would call him for advice when their TACs couldn’t figure a problem out. Allegedly, he was the person who discovered that the early generations of OTU-4 line deployments were susceptible to problems across cable in OPGW space because of the Faraday Effect. On the rare occasion when he couldn’t diagnose a problem he’d respond with something like “voodoo doesn’t always work”.
To your question: it isn't acceptable, but it is likely pretty normal. Flapping often isn't a particularly straightforward issue to diagnose and/or resolve in optical networks (especially ones with regen or in-line amplification), and most transport providers don't employ guys like that who can figure it out. And even then, voodoo doesn't always work.
Your hope is that whatever the "card issue" was, it was a localized event rather than something that's now systemic. I don't really understand why they wouldn't take a maintenance window to replace the card anyway (aside from being cheap, which is almost certainly the reason), but if they're not seeing continued issues (and of course you'd have to trust them on that), it's as likely as not that the problem has in fact resolved.
This is what they sent after my first ticket:
We found this circuit was on a higher level DWDM circuit that was bouncing in and out of service until it self corrected and cleared on 7/5/2023 at 09:31:21pm EDT. The [redacted] transport team has identified a high speed circuit card failure in a transport node near Berea, IL (which doesn’t exist) and has ordered a new one. Once it arrives, they will send out a MOP request via the [redacted] system to replace it during our maintenance window of 12am to 6am local time.
Knowing my luck, within a week of re-enabling this circuit, it’ll start flapping again.
nods I know optical transport can be difficult to track down, but they had admitted to a faulty card, then said they weren’t going to do anything because it hadn’t faulted again.
Yeah, it’s probably just being cheap. Well, kinda. I mean they’ve also said the card was already delivered, just needed installation.
In general, I think operators want to see the link go down if there is an issue with the transport network.
This also helps with features like BGP Next Hop Tracking.
From a wider backbone view, BGP PIC would be helpful in such cases, especially when combined with Best External or Add-Paths.
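To make that concrete, here is a hedged IOS-style sketch of what enabling PIC with Best External might look like; the AS number is hypothetical, and exact knobs vary by platform and software release, so treat this as a starting point rather than a verified config:

```
router bgp 64500
 address-family ipv4
  bgp additional-paths install      ! pre-install a backup path for fast repair (PIC)
  bgp advertise-best-external      ! advertise the best external path as the backup
  bgp nexthop trigger delay 0     ! react to next-hop changes without added delay
```

The point is that when the primary path's next hop is invalidated by the link going down, the router can switch to the pre-programmed backup locally instead of waiting for a full BGP reconvergence.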
It's not normal, nor should it be acceptable. They have probably gotten away with this sort of behaviour long enough that it is entrenched. But I suspect they will jump if you scream loud enough, or even threaten cancellation.
It comes down to how much time and energy you have to chase this.
I would recommend insisting on an RFO and pushing for extreme comfort on your side, even if it is at their expense.
They will try to get away with whatever they can.
If they are tight on sparing, they won’t replace it unless it really becomes a headache. Mark.