AMD Confirms Linux Performance Marginality Problem Affecting Some, Doesn't Affect Epyc / TR
-
Love it when "my answer is better than your answer" flares up.
Time for someone to print up T-shirts that say "I was there when the Ryzen had segmentation faults in 2017". Then, 10 years later when "Pyzen" comes out, you can claim to anyone that you had the answer way back when.
- Likes 2
Comment
-
Originally posted by puleglot View Post
This FreeBSD bug is a completely different story.
After a while I split the compilation failure issue out into a separate bug report...
Comment
-
Originally posted by soulsource View Post
May I ask: What are those reasonable voltages you are using? It'd be nice to use them as starting values for my own experiments.
VCore = 1.35v
SoC = 1.2v
DRAM = 1.375v
Oddly enough, it looks like a power outage the other day knocked out the other settings when it triggered the BIOS setup screen. Everything still works, so that stuff must not have been important.
Originally posted by k1e0x View Post
Negative.
Seriously just don't. It's great that you want to help but this is an extremely hard problem and smart people are on it. Ya know.. about 10% of the computer industry has smart people that actually do the bulk of the work.. the rest are just there.
- Likes 9
Comment
-
I think it would really help the whole bug investigation process if people would provide observations only from non-overclocked, default-setting systems.
All this talk about this-and-that-tweak of BIOS settings is just a distraction, and it makes some people (both at AMD and elsewhere) believe that all the weird symptoms might just be caused by people running weird settings.
The fact is: people experience the bug on completely default-configured, non-overclocked systems. So it makes no sense at all to try all kinds of weird voltage or clock settings in the hope of making it go away.
- Likes 3
Comment
-
Thanks Michael, for raising awareness of this issue.
As evident in the bug discussion over at AMD dating back to 2017-05-08, other bug reports, discussions over at FreeBSD and several threads here in the forums, there are two well documented symptoms:
1 - "Segfaults"
This has been well researched by several users. Under load, pointers may get corrupted, which results in undefined behavior. This is why compilation fails "randomly".
2 - uOP Cache
The kernel detects errors (and sometimes corrections) in the uOP Cache.
Some examples I've grabbed from the thread over at AMD:
Example Linux:
"mce: [Hardware Error]: Machine check events logged"
"mce: [Hardware Error]: CPU 2: Machine Check: 0 Bank 5: bea0000000000108"
"mce: [Hardware Error]: TSC 0 ADDR 1ffffa94be452 MISC d012000101000000 SYND 4d000000 IPID 500b000000000"
"mce: [Hardware Error]: PROCESSOR 2:800f11 TIME 1500732880 SOCKET 0 APIC 2 microcode 8001126"
Example BSD:
MCA: Bank 1, Status 0x90200000000b0151
MCA: Global Cap 0x0000000000000117, Status 0x0000000000000000
MCA: Vendor "AuthenticAMD", ID 0x800f11, APIC ID 14
MCA: CPU 14 COR ICACHE L1 IRD error
-----
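For anyone who wants to read the status values in the two logs above, here is a minimal sketch (not taken from the AMD thread) that decodes the flag bits of an x86 MCA status word. The function name `decode_status` and the `FLAGS` table are my own; only the top flag bits and the low 16-bit error code are decoded, and vendor-specific fields are deliberately left out.

```python
# Flag bits in the upper part of an x86 MCA status value (MCi_STATUS).
# Only the generic flag bits are listed; vendor-specific fields are skipped.
FLAGS = {
    "VAL": 63,    # register holds a valid error
    "OVER": 62,   # an earlier error was overwritten
    "UC": 61,     # error was NOT corrected by hardware
    "EN": 60,     # reporting was enabled for this error
    "MISCV": 59,  # MISC register holds extra information
    "ADDRV": 58,  # ADDR register holds a valid address
    "PCC": 57,    # processor context may be corrupt
}


def decode_status(status: int) -> dict:
    """Return the flag bits and the low 16-bit error code of a status word."""
    decoded = {name: bool((status >> bit) & 1) for name, bit in FLAGS.items()}
    decoded["error_code"] = status & 0xFFFF
    return decoded


if __name__ == "__main__":
    samples = (
        ("Linux, bank 5", 0xBEA0000000000108),
        ("BSD, bank 1", 0x90200000000B0151),
    )
    for label, status in samples:
        d = decode_status(status)
        set_flags = [name for name in FLAGS if d[name]]
        print(f"{label}: flags={set_flags} error_code={d['error_code']:#06x}")
```

Decoded this way, the Linux sample has UC set (an uncorrected error), while the BSD sample has UC clear, which matches the "COR" (corrected) in the BSD log line.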
Our friends over there have done extensive testing and identified these symptoms, which are clear evidence of a hardware defect. But it remains unclear if these are symptoms of the same problem, or two unrelated problems. This will be up to AMD's engineers to research.
As for anyone claiming this is a software bug: when you have a piece of code that's proven to be correct, and then you run it repeatedly and it causes pointer corruption and segfaults, and eventually freezes the operating system, you have a hardware bug.
So far, their testing has eliminated OS kernels, since it's reproduced on Linux, BSD and Windows Subsystem for Linux (WSL). Both gcc and llvm have been tested, and the problems have been reproduced during compilation of gcc, mesa, chromium, thunderbird, libreoffice, ffmpeg, linux kernel, bsd kernel and more. Memory configurations and timings have been eliminated as a cause.
As stated in the article:
AMD engineers found the problem to be very complex and characterize it as a performance marginality problem exclusive to certain workloads on Linux
The hardware defect is not related to compilation. It just happens that the problem is most easily reproduced with compilation, since it stresses the right parts of the core. This is why these chips have no problem with Prime95 or Cinebench: the FPUs and ALUs work just fine. And who does a lot of heavy compilation? Linux developers, especially Gentoo users.
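The "run a known-good workload repeatedly, in parallel, and count failures" idea can be sketched roughly as below. This is not the script used in the AMD thread; `classify`, `run_once` and `stress` are hypothetical names, and the workload shown is only a placeholder that you would replace with a real build command. It assumes a POSIX system, where a negative return code means the process was killed by a signal.

```python
import signal
import subprocess
import sys
from concurrent.futures import ThreadPoolExecutor


def classify(returncode: int) -> str:
    """Turn a process return code into 'ok', a signal name, or 'exit N'."""
    if returncode == 0:
        return "ok"
    if returncode < 0:
        # On POSIX, a negative code means "killed by signal -returncode".
        try:
            return signal.Signals(-returncode).name
        except ValueError:
            return f"signal {-returncode}"
    return f"exit {returncode}"


def run_once(cmd):
    """Run the command once, discarding its output, and classify the result."""
    proc = subprocess.run(cmd, stdout=subprocess.DEVNULL, stderr=subprocess.DEVNULL)
    return classify(proc.returncode)


def stress(cmd, runs=8, jobs=4):
    """Run cmd `runs` times, `jobs` at a time, and count each outcome."""
    counts = {}
    with ThreadPoolExecutor(max_workers=jobs) as pool:
        for outcome in pool.map(run_once, [cmd] * runs):
            counts[outcome] = counts.get(outcome, 0) + 1
    return counts


if __name__ == "__main__":
    # Placeholder workload (an assumption): substitute a real compile job here.
    workload = [sys.executable, "-c", "sum(range(100000))"]
    print(stress(workload))
```

On a healthy machine every run of a correct workload should come back "ok". Note that a compiler driver may report a crashed child process as a plain nonzero exit rather than a signal, so any outcome other than "ok" on a known-good build is worth looking at.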
The defect is present in at least the B1 stepping of Ryzen, but as with all microprocessors, the risk of the bug occurring is dependent on the quality of each sample. A proper solution would probably require a new stepping, but hopefully AMD can manage to either tweak some parameters or disable some features in firmware to make these systems reliable (likely at a performance penalty). Anyone wanting to buy these for development or other productive work should hold off until the situation has been resolved.
Edit:
Improved sentence.
Last edited by efikkan; 07 August 2017, 06:56 PM.
- Likes 10
Comment
-
Originally posted by ZombieNo7 View Post
The voltages are: ...
The whole voltage and clock setting voodoo is just a distraction, which makes some people at AMD and elsewhere think that the whole issue might be caused just by some weird settings.
But the fact is: the bug also occurs on default-configured, non-overclocked systems. It will not go away by just tweaking some voltage levels or clocks.
- Likes 2
Comment
-
Originally posted by ZombieNo7 View Post
The voltages are:
VCore = 1.35v
SoC = 1.2v
DRAM = 1.375v
Oddly enough, it looks like a power outage the other day knocked out the other settings when it triggered the BIOS setup screen. Everything still works, so that stuff must not have been important.
I get that I'm not an engineer. I'm just a guy who has built and overclocked plenty of computers, so yeah, maybe I'm seeing what I expect to see, but it's information, and it could be useful to someone who does have more in-depth technical knowledge than myself.
Keep in mind this could still be a case where AMD got it right and the compilers have just been doing the wrong thing for the past 20 years.
Last edited by k1e0x; 07 August 2017, 06:44 PM.
Comment
-
Originally posted by k1e0x View Post
Someone is going to have to help me here but I think it's the process of the execution of the non-executable memory pages using GCC trampolines that it's showing up in.
GCC is just a kind of software that executes a lot of complex (compiler-)code in parallel where any kind of "wrong intermediate result" is much more likely to result in a (visible) segmentation fault than most other massively parallel software. If you run ffmpeg in 16 threads, chances are that any kind of wrong intermediate result just means some distorted pixels in a frame.
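A toy illustration of that point, under the assumption that the "wrong intermediate result" is a single flipped bit: the same flip in plain data gives a slightly wrong value, while in a pointer-like value it sends the program somewhere invalid. Everything here (`flip_bit`, `corrupt_pixel`, `follow_links`) is made up for the example, and a Python `IndexError` merely stands in for a segfault.

```python
def flip_bit(value: int, bit: int) -> int:
    """Return value with one bit inverted."""
    return value ^ (1 << bit)


def corrupt_pixel(frame, pos, bit):
    """Copy the frame and flip one bit of one 8-bit pixel."""
    out = list(frame)
    out[pos] = flip_bit(out[pos], bit) & 0xFF
    return out


def follow_links(next_index, start, steps):
    """Walk a linked structure stored as a list of 'pointers' (indices)."""
    i = start
    for _ in range(steps):
        i = next_index[i]  # an out-of-range index here plays the role of a segfault
    return i


if __name__ == "__main__":
    # Data case: the result is wrong, but the program carries on.
    frame = [128] * 8
    print(corrupt_pixel(frame, 3, 2))

    # Pointer case: the same kind of flip makes the next dereference fail.
    links = [1, 2, 3, 4, 5, 6, 7, 0]
    bad_links = list(links)
    bad_links[2] = flip_bit(bad_links[2], 20)
    try:
        follow_links(bad_links, 0, 8)
    except IndexError:
        print("walk crashed on a corrupted link")
```

A compiler spends most of its time chasing pointers through trees and tables, so it is closer to the second case; a video encoder is closer to the first.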
Comment
-
Originally posted by ZombieNo7 View Post
I get that I'm not an engineer. I'm just a guy who has built and overclocked plenty of computers, so yeah, maybe I'm seeing what I expect to see, but it's information, and it could be useful to someone who does have more in-depth technical knowledge than myself.
- Likes 6
Comment