Linux 5.19 Lands New Intel IFS Driver For Helping To Detect Faulty Silicon

  • #1
    Phoronix: Linux 5.19 Lands New Intel IFS Driver For Helping To Detect Faulty Silicon

    Among many Intel driver improvements in Linux 5.19, Intel's new "In-Field Scan" (IFS) driver has now premiered in the mainline kernel for testing future processors against any silicon issues prior to deployment or as the processors age...


  • #2
    Hmm. So... -ERAT_CHEWED_CPU_WIRES ?


    • #3
      Someone help me here: What type of problem would one discover that would not be discoverable by self-tests executing regular instructions?


      • #4
        Originally posted by Draget View Post
        Someone help me here: What type of problem would one discover that would not be discoverable by self-tests executing regular instructions?
        While newer (server-class) processors have internal parity/ECC support, there are extremely rare cases where a "glitch" occurs. These are sometimes called "soft errors" because they are not repeatable, and while the causes are not always fully understood, sometimes "physics happens" (small features, external events such as the infamous cosmic rays). The large cloud providers have found that if you run the exact same calculation across a large number of processors (even in the same server), one of the CPUs *occasionally* produces a different result while otherwise reporting no error. And sometimes one CPU is more likely to see soft errors than another, even in the same package. Newer processors have "In-Field Scan" (IFS) capability to run tests that exercise various parts of the processor internals (not the full instruction pipeline), which may help identify individual CPUs/processors/servers that can be isolated, disabled, or replaced, per the provider's replacement/repair strategy.

        And while running self-tests 100% of the time might eventually catch the same soft errors, all you would learn (after millions of hours) is that your real-world processor is probably not 100% perfect 100% of the time.

        IFS is another tool in the toolbox to raise the confidence that at least the processor is not observed to be completely broken.
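        The cross-CPU comparison trick the cloud providers use can be sketched in a few lines of Python. This is a toy illustration of the idea described above, not Intel's methodology: pin a deterministic workload to each logical CPU in turn and flag any CPU whose answer disagrees with the majority. It is Linux-only, since it relies on `os.sched_setaffinity`.

```python
import hashlib
import os

def checksum_workload(rounds=20_000):
    """Deterministic CPU-bound workload: chained SHA-256 hashing.
    On healthy silicon every run must produce the same digest."""
    h = hashlib.sha256(b"soft-error-probe").digest()
    for _ in range(rounds):
        h = hashlib.sha256(h).digest()
    return h.hex()

def run_on_cpu(cpu, func):
    """Pin this process to one logical CPU (Linux-only), run func,
    then restore the original affinity mask."""
    original = os.sched_getaffinity(0)
    try:
        os.sched_setaffinity(0, {cpu})
        return func()
    finally:
        os.sched_setaffinity(0, original)

def suspect_cpus():
    """Run the identical computation on every available CPU and
    return those whose result disagrees with the majority answer."""
    results = {cpu: run_on_cpu(cpu, checksum_workload)
               for cpu in sorted(os.sched_getaffinity(0))}
    majority = max(results.values(), key=list(results.values()).count)
    return [cpu for cpu, digest in results.items() if digest != majority]

if __name__ == "__main__":
    bad = suspect_cpus()
    print("suspect CPUs:", bad if bad else "none")
```

        On a healthy machine this prints "none"; a real deployment would repeat the comparison over months, since soft errors are by definition occasional.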


        • #5
          Oh lord, this brings back nightmares of many years past.

          In 1993 I was hired by Amdahl as part of a 60+ member team of hardware/firmware/software consultants (yes, actual consultants; back in the day they weren't just temporary employees), to rescue them from some looming company catastrophe.

          To make a long story short, they had bought a fault-tolerant mainframe design from a failed company named Key Supercomputer, and for some mysterious reason their own formidable team of engineers couldn't make it work. And when I say mysterious, I mean the whole project was shrouded in secrecy from the beginning, with engineering teams and tasks compartmentalized and all of us asked not to share information unless it was authorized. I know this seems crazy and implausible, but it really is true.

          The reasoning stated was that there was some secret method Key Supercomputer had developed to sense and recover from real-time circuit level errors, and it was critical that Amdahl's competitors not discover it.

          However I was assigned the task of creating the C++ diagnostic and recovery code for the Central Bus Controller (CBC) board, so as time progressed I began to deduce that there was some type of massively parallel bidirectional system, communicating via serial data, that was supposed to achieve the goal of 50ms detection and recovery. Which, at the time, would have been a stunning achievement.

          Simultaneously other engineers with peripheral board assignments also noticed the queued serial data interface, and the chatter grew among us on exactly how such a system could be implemented at the hardware level. And we always talked of it as some type of massively advanced JTAG type system, which later proved simultaneously prescient and fateful. By the way, for those who don't know what JTAG is, it's a way of testing integrated circuits at the gate level after production, but only one gate can be tested at a time so it's extremely slow.

          So there we were, slaving day and night for eight months, excitedly watching as the physical computer was being constructed in its environmentally controlled room down the hall, while perfecting our code, breathlessly waiting for the first critical boards to be installed and tested.

          And then the great day came.

          The first reports were incredibly encouraging, with the system successfully booting and completing initial operating tests. However after a few days all information from the computer room fell silent, and we all became increasingly concerned about what was going on. But none of us were receiving bug reports, so we just assumed it was more secrecy.

          Until, almost nine months from the day we were hired, we were called into a large conference room, which still required a few of us to stand, and told the project had been terminated.

          As it turns out, Key Supercomputer had indeed attempted to take JTAG and turn it into some type of massively parallel real-time diagnostic and recovery system. But because of the compartmentalization of the hardware/firmware/software designers, both at Key and Amdahl, there were uncounted blocking conditions that prevented most parallel operations: the computer took many minutes just to boot, and uncounted minutes to detect and recover from errors, with complete lockup failures a common occurrence.

          I know this has nothing to do with Intel's system, but, as I said in the beginning, it sure did resurrect the nightmare.


          • #6
            Originally posted by muncrief View Post
            By the way, for those who don't know what JTAG is, it's a way of testing integrated circuits at the gate level after production, but only one gate can be tested at a time so it's extremely slow.
            Ah yes, JTAG / boundary scan. I have a number of JTAG (and equivalent) programmers I typically use for embedded solutions. Intel has (as I recall) a 60-pin eXtended Debug Port which supports JTAG and extensions (and there is a very expensive toolkit to use it) for their existing Core and Xeon (and probably other) processors.

            As I understand it, IFS is, in really vague hand-waving terms, sort of the JTAG equivalent (the ability to test internal functionality) from within the processor complex itself, without external connections.
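            For a rough idea of what driving IFS from userspace looks like: the kernel documents a sysfs interface with `intel_ifs_*` devices exposing `run_test` and `status` files. A minimal Python probe, assuming that interface; the driver is new and the exact paths and file names have shifted between kernel releases, so treat them as an assumption taken from the in-tree ABI notes rather than a stable contract:

```python
import glob
from pathlib import Path

# Location per the kernel's sysfs-platform-intel-ifs ABI notes; only
# present with IFS-capable hardware, microcode, and a loaded scan blob.
# ASSUMPTION: exact names have changed across kernel releases.
IFS_GLOB = "/sys/devices/virtual/misc/intel_ifs_*"

def ifs_devices():
    """Return IFS sysfs device directories (empty list on most machines)."""
    return [Path(p) for p in sorted(glob.glob(IFS_GLOB))]

def scan_cpu(device, cpu):
    """Trigger a scan of one logical CPU (requires root and IFS-capable
    hardware) and return the reported result string."""
    (device / "run_test").write_text(f"{cpu}\n")
    return (device / "status").read_text().strip()  # e.g. pass / fail / untested

if __name__ == "__main__":
    devs = ifs_devices()
    if not devs:
        print("no IFS sysfs devices found on this machine")
    else:
        for dev in devs:
            print(dev.name, "cpu0 ->", scan_cpu(dev, 0))
```

            The point of the sketch is the shape of the thing: no external debug pod, just the OS asking the processor to scan itself.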


            • #7
              Originally posted by CommunityMember View Post

              Ah yes, JTAG / boundary scan. I have a number of JTAG (and equivalent) programmers I typically use for embedded solutions. Intel has (as I recall) a 60-pin eXtended Debug Port which supports JTAG and extensions (and there is a very expensive toolkit to use it) for their existing Core and Xeon (and probably other) processors.

              As I understand it, IFS is, in really vague hand-waving terms, sort of the JTAG equivalent (the ability to test internal functionality) from within the processor complex itself, without external connections.
              Oh god no.

              Not again.


              • #8
                More power to Intel for their attempts to figure out the more obscure and rare errors.

                -- BUT --
                Is it really justified to have that kernel driver mainlined, given that Intel seems reluctant to release the other parts of the testing tools (software, manuals, microcode) to the broader public? Is it too much to ask the rare few folks who actually get their hands on the holy grail of these tests to also compile their own kernel?


                • #9
                  Maybe that's too ambitious. But I do remember the BadRAM patch from my old days. It was surprisingly useful for saving bad memory modules from getting discarded, proving that some level of deliberate fault tolerance is not only possible but sometimes even trivial.
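                  For reference, the modern descendant of that idea is GRUB's `badram` command (and, more coarsely, the kernel's `memmap=` parameter), which mask known-bad physical memory before the kernel ever sees it. The addresses below are placeholders, not real fault addresses:

```
# GRUB command (grub.cfg / 40_custom), run before booting the kernel.
# Pairs are <address>,<mask>; the values here are placeholders only.
badram 0x01234567,0xfefefefe

# Rough kernel-command-line equivalent: reserve a region so the
# allocator never hands it out (again, placeholder values).
# memmap=64K$0x01230000
```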
