Yesterday, on 04/16/07, between 3:00 and 3:45 PM, we had sporadic Internet problems. Our ISPs are Sprint and Qwest.
thanks,
Audie Onibala
703-292-5316
Audie Onibala wrote:
Yesterday, on 04/16/07, between 3:00 and 3:45 PM, we had sporadic Internet problems. Our ISPs are Sprint and Qwest.
Around that time there was quite a bit of sunspot activity, and the moon
was in an unusual position too. The NOC contacts at your ISPs may be able
to offer more specific help. But make sure to ask them for their networks'
SPF (sunspot protection factor). That's an important metric for qualifying
their network reliability.
Andre Oppermann wrote:
Audie Onibala wrote:
Yesterday, on 04/16/07, between 3:00 and 3:45 PM, we had sporadic Internet problems. Our ISPs are Sprint and Qwest.
Around that time there was quite a bit of sunspot activity, and the moon
was in an unusual position too. The NOC contacts at your ISPs may be able
to offer more specific help. But make sure to ask them for their networks'
SPF (sunspot protection factor). That's an important metric for qualifying
their network reliability.
Are you sure it was sunspots? My NOC contacts were seeing substantial memory corruption due to cosmic rays.
Somebody from a certain large network vendor actually blamed problems
with their kit on cosmic rays causing memory corruption...
Remember that cosmic rays are very selective, they always seem to pick
boxes from this specific vendor.
/Tony
With certain susceptible Sun CPUs which were popular during the last
sunspot maximum, this was actually demonstrably true (and acknowledged
by Sun), so don't laugh too hard.
---rob
Leigh Porter <leigh.porter@ukbroadband.com> writes:
With certain susceptible Sun CPUs which were popular during the last
sunspot maximum, this was actually demonstrably true (and acknowledged
by Sun), so don't laugh too hard.
Yup, Sandia National Labs made a radiation hardened Pentium and, as far as I remember, was working on a hardened SPARC -- there was also some work done (AFAIR on PPC) whereby 3 processors would run the same instructions and vote on the output...
---rob
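(For anyone who hasn't seen the trick: the voting Rob describes is just a bitwise majority across three redundant copies, so a single upset bit in one copy gets masked by the other two. A minimal sketch in C -- the function name and the single 32-bit result word are illustrative assumptions, not how Sandia's or anyone else's hardware actually did it:)

    #include <stdint.h>
    #include <stdio.h>

    /* Bitwise majority of three redundant results: any single corrupted
     * copy (even a single flipped bit) is outvoted by the other two. */
    static uint32_t tmr_vote(uint32_t a, uint32_t b, uint32_t c)
    {
        return (a & b) | (a & c) | (b & c);
    }

    int main(void)
    {
        uint32_t good  = 0xDEADBEEF;
        uint32_t upset = good ^ (1u << 7);   /* simulate a single-event bit flip */

        /* The two healthy copies outvote the corrupted one. */
        printf("voted result: 0x%08X\n", tmr_vote(good, upset, good));
        return 0;
    }

The same bitwise-majority idea works at whatever word width the lockstep hardware happens to compare.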
Leigh Porter <leigh.porter@ukbroadband.com> writes:
Somebody form a certain large network vendor actually blamed problems
with their kit on cosmic rays causing memory corruption...
Oh, not just "somebody" -- a certain large vendor has many, many references to it -- and I have received it as an explanation for random reloads -- believe me, trying to tell an irate customer / PHB that the reason his "mission critical" circuit bounced was cosmic rays is No Fun(tm). Hmmm... Isn't this the same vendor that now has a router sitting on a satellite?!
There was also an issue where one of the large manufacturers of (binary) CAMs received a batch of polyimide that was contaminated with an alpha emitter (for some reason thorium oxide springs to mind) and their quality control didn't catch it... As far as I know the problem was identified before any products with the CAMs were shipped, but I had an order held up while the vendor tried to source alternate parts...
Thinking of perhaps Resilience? http://www.resilience.com/
God, those things were horrid before they realized that the business
model of assuming "The app will always be OK, the issue will be the
hardware" was completely misguided. I forget what the product was named
at the time, but I'll never forget what a piece of crap it was.
Contamination by alpha emitters was a major problem some years ago.
Manufacturers had to change their formulations to avoid the problem.
--Steve Bellovin, http://www.cs.columbia.edu/~smb
From: owner-nanog@merit.edu [mailto:owner-nanog@merit.edu] On
Behalf Of Warren Kumari
Sent: Thursday, April 19, 2007 12:01 PM
To: Robert E. Seastrom
Cc: Leigh Porter; Jay Hennigan; Andre Oppermann; nanog@merit.edu
Subject: Re: BGP Problem on 04/16/2007

With certain susceptible Sun CPUs which were popular during the last
sunspot maximum, this was actually demonstrably true (and acknowledged
by Sun), so don't laugh too hard.

Yup, Sandia National Labs made a radiation hardened Pentium and, as far
as I remember, was working on a hardened SPARC -- there was also some
work done (AFAIR on PPC) whereby 3 processors would run the same
instructions and vote on the output...
There is a radiation-hardened PowerPC -
http://www.klabs.org/DEI/Processor/PowerPC/index.htm
You need this for space flight qualified hardware. Up there, cosmic ray bit flips
and stuck bits are a common occurrence.
Regards
Marshall
"David Temkin" <dave@rightmedia.com> writes:
From: owner-nanog@merit.edu [mailto:owner-nanog@merit.edu] On
Behalf Of Warren Kumari
Yup, Sandia National Labs made a radiation hardened Pentium
and, as far as I remember, was working on a hardened SPARC --
there was also some work done (AFAIR on PPC) whereby 3
processors would run the same instructions and vote on the output...

Thinking of perhaps Resilience? http://www.resilience.com/
God, those things were horrid before they realized that the business
model of assuming "The app will always be OK, the issue will be the
hardware" was completely misguided. I forget what the product was named
at the time, but I'll never forget what a piece of crap it was.
Eh, they're not the only folks to have had
voting-multi-cpu-lockstep-execution hardware platforms. Stratus did it
for years; the Tandem Integrity S2 (to which I ported Emacs 18.55 many
moons ago) was similar.
---Rob
Nah, I wasn't thinking of them -- post-traumatic memory loss allowed me to forget them... There was someone else, whose name I have managed to forget, who tried to do the same thing through 4 parallel SCSI connectors and fancy OS software -- it was horrendous. There were 2 motherboards in a case (driven by the same, non-redundant, non-swappable PSU!) and each motherboard had 2 dual-channel SCSI cards with cables stretched between the cards. Fancy drivers exposed each board's RAM to the other machine -- there was also a 10Base-2 cable (I'm dating myself here) between the motherboards for coordination and communication. Every now and then your application was supposed to make a system call that would cause the machines to grind to a halt and compare their memory -- if there was a difference, the syscall would return non-zero and leave you to figure out what to do about it. Unfortunately, because there were only 2 machines voting, there was no way to know who was right and who was wrong -- the vendor's suggestion was to a) reboot or b) "just choose one and hope you guessed right". Wildly broken system...
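(To make the "only two voters" problem concrete: with two copies all you can do is detect a mismatch; nothing tells you which copy to trust. A toy sketch in C with made-up names -- not anything from that product:)

    #include <stddef.h>
    #include <string.h>

    /* Hypothetical two-board lockstep check: returns 1 on a mismatch.
     * Detection only -- unlike a three-way vote, a disagreement gives
     * you no basis for deciding which board's memory is the good one. */
    int boards_disagree(const void *board_a_ram, const void *board_b_ram, size_t len)
    {
        return memcmp(board_a_ram, board_b_ram, len) != 0;
    }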
I cannot find any of my docs on the system that I was originally talking about, but it was 3 PPC cores in a single package -- there was built-in hardware to keep them synchronized and voting. AFAIR, it was a drop-in replacement for the "normal" version of the same device, modulo the power draw.
Maxwell Technologies makes a triple modular redundant cPCI board with an SOI processor and rad-tolerant FPGAs that is really nice -- somewhere I think I still have a stash of them...
NB: The above mentions 10BASE-2 and cPCI (which will fit in certain vendors' hardware), which *just* managed to keep this on-topic -- hopefully.
W
Right. I get that answer quite often. We've made a little spinner that
has "Upgrade software", "Random radiation", and "We've never supported
that feature". It's proven to be fairly accurate when opening cases
with this vendor's tech-support organization.
I helped develop a digital communication system for the Navy at Hughes back in the early 80s. We could only use fusible ROMs and rad-hard 8080s. (No breakpoints.) Crystals were nudged into lock for three-way synchronous voting on defective systems/hardware. Mechanical inputs were also redundant, and of course a bear to resync. This led to a snafu during war games with an aircraft carrier, where the air controller panel's Gray-code rotor switches were erroneously flagged as defective during peak use. Luckily everyone lived.
-Doug
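(Aside, since Gray-code switches may be unfamiliar: they encode position so that adjacent values differ in exactly one bit, so a reading taken mid-rotation is off by at most one step. The standard binary/Gray conversion, sketched in C -- nothing specific to that Navy panel:)

    #include <stdint.h>
    #include <stdio.h>

    /* Binary -> reflected Gray code: adjacent values differ in one bit. */
    static uint32_t bin_to_gray(uint32_t b) { return b ^ (b >> 1); }

    /* Gray -> binary: fold the higher bits back down. */
    static uint32_t gray_to_bin(uint32_t g)
    {
        for (uint32_t shift = 1; shift < 32; shift <<= 1)
            g ^= g >> shift;
        return g;
    }

    int main(void)
    {
        for (uint32_t i = 0; i < 8; i++)
            printf("%u -> gray %u -> back to %u\n",
                   i, bin_to_gray(i), gray_to_bin(bin_to_gray(i)));
        return 0;
    }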
In point of fact, it seems like it's the fall-through for their technical
assistance center's answer tree if all else fails. Quite funny.
I don't have the reference to hand, but with Cisco the crash reason hinted at something very odd which was either a hardware failure or a cosmic ray -- I think it was a parity error or something similar.
I remember this because I had such a reload, and it was during a period of heavy cosmic activity.. as the hardware had always been reliable before and was reliable afterwards, this was believed to be the cause.
Steve
Hi Steve,
steve@telecomplete.co.uk (Stephen Wilcox) wrote:
I remember this because I had such a reload, and it was during a period of heavy cosmic activity.. as the hardware had always been reliable before and was reliable afterwards, this was believed to be the cause.
We have also started to use this as the standard excuse.
Up to now, people believe us...
Cheers,
Elmi.
> I remember this because I had such a reload, and it was during a period of heavy cosmic activity.. as the hardware had always been reliable before and was reliable afterwards, this was believed to be the cause.
We have also started to use this as the standard excuse.
Up to now, people believe us...
Well, there is some Cisco documentation containing references to
cosmic rays and parity errors:
http://www.cisco.com/en/US/products/hw/routers/ps341/products_tech_note09186a00800942e0.shtml
Cisco 7200 Parity Error Fault Tree
"As with all computer and networking devices, the NPE is susceptible
to the rare occurrence of parity errors in processor memory. Parity
errors may cause the system to reset and can be a transient Single
Event Upset (SEU or soft error) or can occur multiple times (often
referred to as hard errors) due to damaged hardware. SEUs or soft
errors are caused by "noise" most frequently due to high-energy
neutrons generated in the atmosphere by cosmic rays. For more
information on SEUs, refer to the Increasing Network Availability
page.
[...]
Even if systems use Error Code Correction (ECC), it is still possible
to see an occasional parity error when more than a single error has
occurred in the 64 bits of data due to cosmic rays affecting more than
one memory cell, or a hard error in the cache."
Regards,
Daniele.
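(The quoted point is easy to demonstrate: a parity bit catches an odd number of flipped bits in a word but is blind to an even number, which is why a multi-bit upset from a single cosmic-ray event can still slip past it, and why SECDED ECC corrects only one flip and merely detects two. A small C illustration of the parity half of that -- not Cisco's implementation:)

    #include <stdint.h>
    #include <stdio.h>

    /* Parity of a 64-bit word: 1 if an odd number of bits are set. */
    static int parity64(uint64_t x)
    {
        x ^= x >> 32; x ^= x >> 16; x ^= x >> 8;
        x ^= x >> 4;  x ^= x >> 2;  x ^= x >> 1;
        return (int)(x & 1);
    }

    int main(void)
    {
        uint64_t word   = 0x0123456789ABCDEFULL;
        int      stored = parity64(word);                /* parity bit kept alongside the data */

        uint64_t one_flip  = word ^ (1ULL << 13);                  /* single-event upset          */
        uint64_t two_flips = word ^ (1ULL << 13) ^ (1ULL << 41);   /* two upsets in the same word */

        printf("single-bit flip detected: %s\n", parity64(one_flip)  != stored ? "yes" : "no");
        printf("double-bit flip detected: %s\n", parity64(two_flips) != stored ? "yes" : "no");
        return 0;
    }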
Yup, that's the reference I was referring to.. we indeed had a single-event upset on an NPE.
Steve