AMD Confirms Linux Performance Marginality Problem Affecting Some, Doesn't Affect Epyc / TR

janweb replied

13 August 2017, 12:02 PM
Originally posted by Funks

How did AMD come to the conclusion that ThreadRipper isn't affected by these issues, given that it's running the same stepping as the Ryzen chips (B1)? Does it just have better QA in general? A manufacturing fix on 2017 Week 25+ chips? A microcode fix?

Sounds to me like there's no microcode fix (if ThreadRipper had one, I'm pretty sure Ryzen would have it by now as well). From a die perspective, there's no new stepping (TR is supposedly using the same B1 Zeppelin dies used in Ryzen). Sounds like a lot of voodoo BS coming out of AMD. How can they claim that TR is unaffected if Linux is still being looked at?

Will the announced Ryzen Pro desktop processors have the same level of QA-binning as TR? Will the consumer line as well? Or will we play ongoing SEGV Silicon Lottery going forward on the consumer line?

AMD needs to give us a better explanation and a fast-track RMA process for affected people, instead of making us take pictures of our BIOS settings and our case. (I'm a bit pissed, as I have two affected systems and AMD support takes several days to respond in an ongoing support conversation.)

A fast-track path, IMO, would be: have us set the BIOS to defaults (to rule out overclocking), provide a testing tool (something we can execute), and if it fails, generate an RMA number. The way it's happening now, it takes over a week to get an RMA number because they want pictures of the BIOS and of the case, and AMD support doesn't respond quickly enough. Take shipping into account and the process ends up taking a couple of weeks (crossing one's fingers that the first CPU sent back will be a good copy).

They also asked me to take pictures of my mainboard and settings. This is all very sad.
tetsuos replied

13 August 2017, 10:18 AM
I've signed up because this is the only place where AMD employees get involved in the conversation about Ryzen.

When will AMD make a statement about these segfaults? What is AMD investigating? How can AMD be sure the B1 Threadripper is unaffected? What's wrong with the chips which fail with SEGV? Does AMD believe we are going to buy any Ryzen chip after this fiasco?

When can we expect an actual answer? When can we expect AMD to get involved to answer questions?
Funks replied

13 August 2017, 02:40 AM
Originally posted by bridgman

There is an errata document but I'm not sure if it has been published yet. I asked about status last week.

IIRC a typical modern CPU has somewhere between 10 and 100 errata (check the revision guides for any recent CPU).

I do not expect this specific one to cause problems on Linux or Windows but as I said we are checking to make sure that transparent huge page logic (THP) in Linux does not need an additional tweak.

How did AMD come to the conclusion that ThreadRipper isn't affected by these issues, given that it's running the same stepping as the Ryzen chips (B1)? Does it just have better QA in general? A manufacturing fix on 2017 Week 25+ chips? A microcode fix?

Sounds to me like there's no microcode fix (if ThreadRipper had one, I'm pretty sure Ryzen would have it by now as well). From a die perspective, there's no new stepping (TR is supposedly using the same B1 Zeppelin dies used in Ryzen). Sounds like a lot of voodoo BS coming out of AMD. How can they claim that TR is unaffected if Linux is still being looked at?

Will the announced Ryzen Pro desktop processors have the same level of QA-binning as TR? Will the consumer line as well? Or will we play ongoing SEGV Silicon Lottery going forward on the consumer line?

AMD needs to give us a better explanation and a fast-track RMA process for affected people, instead of making us take pictures of our BIOS settings and our case. (I'm a bit pissed, as I have two affected systems and AMD support takes several days to respond in an ongoing support conversation.)

A fast-track path, IMO, would be: have us set the BIOS to defaults (to rule out overclocking), provide a testing tool (something we can execute), and if it fails, generate an RMA number. The way it's happening now, it takes over a week to get an RMA number because they want pictures of the BIOS and of the case, and AMD support doesn't respond quickly enough. Take shipping into account and the process ends up taking a couple of weeks (crossing one's fingers that the first CPU sent back will be a good copy).

Last edited by Funks; 13 August 2017, 03:30 AM.
Funks replied

12 August 2017, 11:36 PM
Originally posted by Khudsa

Same for me: when I disable the opcache option it works without errors. I have never done an RMA. How does the process work? Do I first contact AMD (already contacted, waiting for an answer) and then contact the store (pccomponentes, official store in Spain for the Ryzen release) with AMD's reply?

Thanks

At this point, nobody knows whether disabling opcache truly fixes the problem or just delays it. It may be best to RMA your chip; people seem to be having better luck with the 2017 Week 25+ chips they get back from RMA. That said, one person reported that the UA1725 chip they got back on the first RMA had Machine Check Exception issues, while the second UA1725 chip was okay. It seems like AMD's QA process needs more work.
scorpio810 replied

12 August 2017, 04:56 PM
Originally posted by kaseki

I can confirm that disabling opcache eliminates segfault reports generated by kill-ryzen when opcache is set at Auto (at least for the period of 2 hours before I stopped testing). Impact of the opcache disable varies. Unigine Valley and Superposition scores and average frames per second are effectively unchanged. Latency and read rate per Intel MLC are effectively unchanged. Blender rendering, however, does take a few percent longer for the Ryzen logo and the Blender home page 'Classroom' renders. My measurements are not statistically significant, however, so YMMV.

Thank you for your benchmarks.
kaseki replied

12 August 2017, 03:55 PM
Originally posted by Khudsa

Same for me, when I disable the opcache option it works without errors. ...

I can confirm that disabling opcache eliminates segfault reports generated by kill-ryzen when opcache is set at Auto (at least for the period of 2 hours before I stopped testing). Impact of the opcache disable varies. Unigine Valley and Superposition scores and average frames per second are effectively unchanged. Latency and read rate per Intel MLC are effectively unchanged. Blender rendering, however, does take a few percent longer for the Ryzen logo and the Blender home page 'Classroom' renders. My measurements are not statistically significant, however, so YMMV.
bridgman replied

12 August 2017, 03:31 PM
Originally posted by drSeehas

Was this ever documented/communicated and when?

There is an errata document but I'm not sure if it has been published yet. I asked about status last week.

Originally posted by drSeehas

So there are two "bugs", but only one of them shows up in Linux and Windows?

IIRC a typical modern CPU has somewhere between 10 and 100 errata (check the revision guides for any recent CPU).

I do not expect this specific one to cause problems on Linux or Windows but as I said we are checking to make sure that transparent huge page logic (THP) in Linux does not need an additional tweak.

Last edited by bridgman; 12 August 2017, 03:37 PM.
drSeehas replied

12 August 2017, 03:02 PM
Originally posted by bridgman

... Ryzen needs a full guard page at the top rather than just a guard region, ...

Was this ever documented/communicated and when?

... and so the BSD devs have updated their code accordingly. Linux and Windows "got lucky" in this case because the guard page added for errata in previous CPUs also worked for Ryzen. ...

So there are two "bugs", but only one of them shows up in Linux and Windows?
bridgman replied

12 August 2017, 02:16 PM
Originally posted by drSeehas

This is not a fix. It is a workaround. Only AMD can fix this bug. But AMD says it is a Linux-only bug ...

You guys may be mixing issues here. Linux (and Windows AFAIK) have had similar code for years (leaving an unused "guard page" at the top of user process address space) to deal with errata in other processors... BSD had workarounds in place as well, but they used a smaller guard region (less than a page) which was sufficient for previous processors. Not a "better or worse" thing, just a different design decision.

Ryzen needs a full guard page at the top rather than just a guard region, and so the BSD devs have updated their code accordingly. Linux and Windows "got lucky" in this case because the guard page added for errata in previous CPUs also worked for Ryzen.

We are checking to make sure that the combination of address space randomization and transparent huge page migration will never replace a collection of 4K pages at the top of user memory with a single 2M page (which would effectively remove the guard page). We don't think it will happen because of the unused guard page and the OS-managed write protection of the vsyscall page but need to be sure.
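To make the huge-page concern concrete, here is a minimal arithmetic sketch. It assumes a 47-bit user address space with a single 4K guard page at the very top, which is the usual x86-64 Linux layout but is not stated in this thread; the constant and function names are made up for illustration. It only shows why a 2M page at the top of user memory would necessarily swallow the guard page.

```python
# Assumption (not from the thread): x86-64 with a 47-bit user address space
# and one unused 4K guard page at the very top of it.
PAGE_4K = 4 * 1024
PAGE_2M = 2 * 1024 * 1024
USER_SPACE_END = 1 << 47
GUARD_PAGE_START = USER_SPACE_END - PAGE_4K


def huge_page_base(addr):
    """Start address of the 2M-aligned region that contains addr."""
    return addr & ~(PAGE_2M - 1)


def huge_page_covers_guard(addr):
    """True if the 2M-aligned region containing addr also contains the guard page."""
    base = huge_page_base(addr)
    return base <= GUARD_PAGE_START < base + PAGE_2M


# The last usable 4K page sits directly below the guard page.
last_usable_page = GUARD_PAGE_START - PAGE_4K
print(huge_page_covers_guard(last_usable_page))            # True
print(huge_page_covers_guard(GUARD_PAGE_START - PAGE_2M))  # False
```

Under these assumptions, any 4K page in the topmost 2M-aligned region shares that region with the guard page, so collapsing the region into one 2M page would leave no guard page at all.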

Last edited by bridgman; 12 August 2017, 02:25 PM.

scorpio810 replied

12 August 2017, 09:47 AM

I think my pre-ordered 1700X, bought on 2 April 2017, segfaults only on CPU core 5? The script has continued without errors for the moment ...

Edit: I spoke too soon ... krkrkr

Code:

[août12 12:35] logitech-hidpp-device 0003:046D:400A.0007: HID++ 2.0 device connected.
[août12 13:38] bash[27367]: segfault at 7f814d3857e8 ip 00007f814d0a1330 sp 00007ffeb781b898 error 4 in libc-2.24.so[7f814cf78000+193000]
[août12 15:44] bash[10807]: segfault at 7fba26df87e8 ip 00007fba26b14330 sp 00007ffd63026c28 error 4 in libc-2.24.so[7fba269eb000+193000]
[août12 15:45] bash[30145]: segfault at 12 ip 0000000000435d7e sp 00007ffcdc40fc30 error 6 in bash[400000+100000]
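For anyone comparing their own kernel log lines with the ones above, a rough Python sketch for pulling them apart follows. The regular expression is written against the line format shown in this post only, and the helper names are invented here. The error decoding uses the standard x86 page-fault error bits (bit 0: protection violation vs. page not present, bit 1: write, bit 2: user mode).

```python
import re

# Pattern inferred from the three log lines quoted above; other kernels may format differently.
SEGFAULT_RE = re.compile(
    r"(?P<comm>\S+)\[(?P<pid>\d+)\]: segfault at (?P<addr>[0-9a-f]+) "
    r"ip (?P<ip>[0-9a-f]+) sp (?P<sp>[0-9a-f]+) error (?P<err>\d+)"
    r"(?: in (?P<obj>[^\[\s]+)\[(?P<base>[0-9a-f]+)\+(?P<size>[0-9a-f]+)\])?"
)


def decode_error(code):
    """Decode the low three bits of an x86 page-fault error code."""
    mode = "user" if code & 4 else "kernel"
    access = "write" if code & 2 else "read"
    cause = "protection violation" if code & 1 else "page not present"
    return f"{mode}-mode {access}, {cause}"


def parse_segfault(line):
    """Return a dict describing one segfault line, or None if the line does not match."""
    m = SEGFAULT_RE.search(line)
    if m is None:
        return None
    ip = int(m["ip"], 16)
    base = int(m["base"], 16) if m["base"] else None
    return {
        "comm": m["comm"],
        "pid": int(m["pid"]),
        "object": m["obj"],
        # Offset of the faulting instruction inside the mapped object.
        "ip_offset": None if base is None else ip - base,
        "error": decode_error(int(m["err"])),
    }


line = ("[août12 13:38] bash[27367]: segfault at 7f814d3857e8 ip 00007f814d0a1330 "
        "sp 00007ffeb781b898 error 4 in libc-2.24.so[7f814cf78000+193000]")
info = parse_segfault(line)
print(info["object"], hex(info["ip_offset"]), info["error"])
```

Run on the two libc lines above, this gives the same instruction offset (0x129330) both times, even though the absolute addresses differ between processes because of address randomization.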

Code:

[loop-0] Sat Aug 12 12:48:36 CEST 2017 start 0
[loop-1] Sat Aug 12 12:48:37 CEST 2017 start 0
[loop-2] Sat Aug 12 12:48:38 CEST 2017 start 0
[loop-3] Sat Aug 12 12:48:39 CEST 2017 start 0
[loop-4] Sat Aug 12 12:48:40 CEST 2017 start 0
[loop-5] Sat Aug 12 12:48:41 CEST 2017 start 0
[loop-6] Sat Aug 12 12:48:42 CEST 2017 start 0
[loop-7] Sat Aug 12 12:48:43 CEST 2017 start 0
[loop-8] Sat Aug 12 12:48:44 CEST 2017 start 0
[loop-9] Sat Aug 12 12:48:45 CEST 2017 start 0
[loop-10] Sat Aug 12 12:48:46 CEST 2017 start 0
[loop-11] Sat Aug 12 12:48:47 CEST 2017 start 0
[loop-12] Sat Aug 12 12:48:48 CEST 2017 start 0
[loop-13] Sat Aug 12 12:48:49 CEST 2017 start 0
[loop-14] Sat Aug 12 12:48:50 CEST 2017 start 0
[loop-15] Sat Aug 12 12:48:51 CEST 2017 start 0
[loop-2] Sat Aug 12 13:17:40 CEST 2017 build failed
[loop-2] TIME TO FAIL: 1744 s
[loop-4] Sat Aug 12 13:38:28 CEST 2017 build failed
[loop-4] TIME TO FAIL: 2992 s
[loop-10] Sat Aug 12 15:44:59 CEST 2017 build failed
[loop-10] TIME TO FAIL: 10583 s
[loop-1] Sat Aug 12 15:45:08 CEST 2017 build failed
[loop-1] TIME TO FAIL: 10592 s
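If you want to compare runs (for example before and after an RMA, or with opcache on and off), a small sketch like the following can summarize a log in the format above. The line format is assumed from this excerpt only, and the function names are hypothetical, not part of the kill-ryzen script.

```python
import re

# Matches lines such as "[loop-2] TIME TO FAIL: 1744 s" as seen in the excerpt above.
FAIL_RE = re.compile(r"\[loop-(\d+)\] TIME TO FAIL: (\d+) s")


def times_to_fail(log_text):
    """Map loop number -> seconds until that loop's build failed."""
    return {int(loop): int(seconds) for loop, seconds in FAIL_RE.findall(log_text)}


def summarize(log_text, total_loops):
    """Count failed and surviving loops and report the earliest failure."""
    fails = times_to_fail(log_text)
    return {
        "failed_loops": sorted(fails),
        "surviving_loops": total_loops - len(fails),
        "first_failure_s": min(fails.values()) if fails else None,
    }


sample = """\
[loop-2] Sat Aug 12 13:17:40 CEST 2017 build failed
[loop-2] TIME TO FAIL: 1744 s
[loop-4] Sat Aug 12 13:38:28 CEST 2017 build failed
[loop-4] TIME TO FAIL: 2992 s
[loop-10] Sat Aug 12 15:44:59 CEST 2017 build failed
[loop-10] TIME TO FAIL: 10583 s
[loop-1] Sat Aug 12 15:45:08 CEST 2017 build failed
[loop-1] TIME TO FAIL: 10592 s
"""
print(summarize(sample, total_loops=16))
```

For the run above, 4 of the 16 loops had failed by the time of posting, the first after 1744 s (about 29 minutes).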

Last edited by scorpio810; 12 August 2017, 09:50 AM.
