
GeForce GTX 1080 Ti: Core i7 7700K vs. Ryzen 7 1800X Linux Gaming Performance


  • #81
    Originally posted by efikkan View Post
    You are completely missing the point.
    Of course threads can communicate at any time, but it comes at a huge performance penalty. The sole purpose of utilizing multiple threads is to divide the workload, but if you spend a lot of time on overhead communicating between them, it defeats the purpose. During the rendering of a single frame (usually <16.67 ms) you can't afford a lot of thread communication. You simply can't do thousands of such synchronizations, otherwise you'll have to start measuring the frame rate in seconds per frame…

    Whenever a program successfully scales with multithreading, it does so by letting the threads work on their tasks independently, usually syncing only after a task is complete. Some applications keep handing out new chunks of work to each thread, but each chunk is always a significant amount of work, otherwise the overhead would surpass the gains. For games, timing is critical, which makes any synchronization very expensive.
    I don't know how you got the impression that I'm advocating frequently alternating write access to shared memory, but that is of course not the case.
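    To spell out the pattern: each thread gets its own slice of the frame's work up front, and the only synchronization is a single join at the end of the task. A minimal sketch (the update_entity() workload here is hypothetical, not from any real engine):

    #include <algorithm>
    #include <cstddef>
    #include <thread>
    #include <vector>

    struct Entity { float x = 0, y = 0, vx = 1, vy = 1; };

    void update_entity(Entity& e, float dt) { e.x += e.vx * dt; e.y += e.vy * dt; }

    void simulate_frame(std::vector<Entity>& entities, float dt, unsigned workers)
    {
        std::vector<std::thread> pool;
        const std::size_t chunk = (entities.size() + workers - 1) / workers;
        for (unsigned w = 0; w < workers; ++w) {
            pool.emplace_back([&, w] {
                const std::size_t begin = w * chunk;
                const std::size_t end = std::min(entities.size(), begin + chunk);
                for (std::size_t i = begin; i < end; ++i)  // each thread owns its slice,
                    update_entity(entities[i], dt);        // no communication mid-frame
            });
        }
        for (auto& t : pool) t.join();                     // single sync point per frame
    }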

    Originally posted by efikkan View Post
    For all multithreaded work, sharing the L3 cache for data is something to avoid.
    Are there any exceptions to the rule that L3 cache access is better than main memory access? I wouldn't completely rule it out, but in this context I can't think of any.

    Originally posted by efikkan View Post
    I suggest you watch Scott Meyers: CPU Caches and Why You Care, which also touches on the subject of sharing and even false sharing.
    That's a good talk; however, I was already familiar with 97% of its content.
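    For anyone who hasn't watched it, the false-sharing part boils down to a toy case like the following: two threads each increment "their own" counter, but with both counters on one cache line the line still ping-pongs between the cores, while padding each counter to its own 64-byte line avoids that. (Illustrative names only.)

    #include <atomic>
    #include <thread>

    // Both counters live on the same cache line: writes by one thread keep
    // invalidating the other core's copy even though no data is logically shared.
    struct Unpadded {
        std::atomic<long> a{0};
        std::atomic<long> b{0};
    };

    // Each counter gets its own 64-byte cache line, so the two threads
    // no longer interfere with each other.
    struct Padded {
        alignas(64) std::atomic<long> a{0};
        alignas(64) std::atomic<long> b{0};
    };

    template <typename Counters>
    void hammer(Counters& c)
    {
        std::thread t1([&] { for (long i = 0; i < 10000000; ++i) c.a.fetch_add(1, std::memory_order_relaxed); });
        std::thread t2([&] { for (long i = 0; i < 10000000; ++i) c.b.fetch_add(1, std::memory_order_relaxed); });
        t1.join();
        t2.join();
        // Timing hammer() on a Padded vs. an Unpadded instance shows the difference.
    }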



    • #82
      Originally posted by dungeon View Post

      I don't think there were misunderstandings. AdoredTV took random results from some site, interpreted them the way they wanted, and others found that these and other results were mostly wrong, to claim that in 2017 the FX is now +10% faster than SB.

      In Hardware Unboxed's video you can see (at about the 15:00 mark) that they actually got a +20% average across 16 games, but with SB being the faster one, a roughly 30% gap between the two stories overall, which is really too much disparity.
      Maybe you didn't, but from what I see the point is: in properly coded games, the FX is faster, because it is the faster CPU in general. Of course, we can always take 16 games heavily optimized for the SB architecture and almost unoptimized for the BD architecture (i.e. poorly coded games), pair them with a GPU whose drivers are optimized exclusively for the SB architecture (who in their right mind buys the last-gen highest-end GPU and pairs it with an FX chip? more likely they'll pair it with a 7700K at minimum) and "prove" that "SB is faster". What does that prove? It proves that if a tyre is made for 15" rims it will fit them best; you can put it on a 16" rim with some modification, but it will hardly be functional, let alone hit 100 mph or anything like that.

      bakgwailo
      Read my comment above; he used this as the source:
      AMD Ryzen 7 1800X, 1700X, 1700 im Test: König in Anwendungen, Prinz in Spielen / Spiele-Benchmarks im Detail


      Bulldozer was definitely not a flop; the game industry, on the other hand, is...

      @efikkan
      If you read my whole post you'll see that gaming usually is not superscalar, and is generally very inefficient CPU code with loads of branch mispredictions and cache misses. This is also why Bulldozer back in the day sucked at gaming, even though Ryzen is a bit better.
      OK, so what you are saying (I still don't get how it is connected with being superscalar, but nvm) is that when you code a game, you have to optimize it for architecture X, so you have a choice: do it for X, do it for Y, or do twice the work for both architectures? If that is the case, then the whole way games are coded is completely wrong, and we can only thank Intel for that (since most game devs use Intel-based compilers and systems).

      From what I understand, game code uses the SSEx instruction sets (depending on the game), AMD's 3DNow!+ is long gone..., and it relies on the CPU's FPU, so there's no logical reason why a game should be architecture dependent. All that superscalar execution means is issuing multiple instructions per cycle, so that should really not have any influence here.
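      For example, as far as I know nothing stops a game from picking its SIMD path at runtime instead of being compiled for one vendor; GCC and Clang even expose a builtin for the CPUID check. A rough sketch (the mix_audio_* kernels are hypothetical placeholders):

      #include <cstdio>

      // Hypothetical per-ISA implementations of the same hot routine.
      void mix_audio_sse2(float*, int) { /* SSE2 code path */ }
      void mix_audio_avx2(float*, int) { /* AVX2 code path */ }

      using mix_fn = void (*)(float*, int);

      mix_fn pick_mixer()
      {
          // GCC/Clang builtin: queries CPUID at runtime, vendor-neutral,
          // so the same binary runs the best path on Intel and AMD alike.
          __builtin_cpu_init();  // only strictly required before static constructors run
          if (__builtin_cpu_supports("avx2"))
              return mix_audio_avx2;
          return mix_audio_sse2;
      }

      int main()
      {
          float buffer[256] = {};
          mix_fn mix = pick_mixer();
          mix(buffer, 256);
          std::puts(mix == mix_audio_avx2 ? "AVX2 path selected" : "SSE2 path selected");
      }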

      TL;DR: All of the posters above are barking up the wrong tree; what you are suggesting is more corruption in the software (particularly gaming) industry. I really don't see how that would solve anything, or how it would allow competition, let alone the entry of third companies into the market. I think you should think about what you write; it is understandable that reviewers on YouTube don't think about it, since they are partners with companies... they get paid to do that, not to think.



      • #83
        Originally posted by indepe View Post

        Exactly, even Ryzen's pure single-thread performance isn't bad enough to explain these results. Something else may be going wrong. And, on the other side, it shouldn't be that much better with Tomb Raider at 4K either. So this potential problem may be affecting the 7700K side as well.
        @Michael:
        Could you redo the runs with thread count and clock frequency logging (plotting), please?
        Analysing these GREAT Tomb Raider numbers should tell us something, hopefully.
        Maybe this could be a starting point.



        • #84
          Originally posted by nuetzel View Post

          @Michael:
          Could you redo the runs with thread count and clock frequency logging (plotting), please?
          Analysing these GREAT Tomb Raider numbers should tell us something, hopefully.
          Maybe this could be a starting point.
          I would love to put my theory to the test: isolate potential problems with the scheduler, use OSS games as part of the test (like TuxRacing or something), use OSS drivers, and see if the pattern changes in percentage. Sure, you can't get rid of code bias even with those games and conditions, but it would be interesting (at least for me) to see whether the percentage differs as a pattern (e.g. closed-source games are faster by X% on the Intel/NVIDIA blob vs. OSS games with the NVIDIA blob vs. OSS drivers... you get the picture). It will not be easy to find a neutral game (i.e. one whose drivers are not optimized for platform X), and it is a ton of work, so I keep my hopes low. If I could, I would do it.



          • #85
            Originally posted by efikkan View Post
            Netburst wasn't beaten due to a lack of "optimized code"; it had terrible penalties for branch mispredictions, which caused it to be beaten by a much simpler CPU from AMD.
            Yes, but if you used Intel's compiler (or some specific flags), you could get code that was friendly towards that long pipeline. The argument coming from Intel was that code wasn't optimized properly for Netburst. As it turned out, Netburst wasn't properly optimized for the world.


            Originally posted by efikkan View Post
            Why will it be better? Please elaborate.
            Well, I expect it will support faster memory at some point. That should up the performance somewhat. And given enough time, motherboard makers should be able to squeeze a few more MHz out of it when overclocking. Beyond that, I don't know.



            • #86
              Has anyone tried MuQSS with Ryzen?



              • #87
                Originally posted by indepe View Post
                Are there any exceptions to the rule that L3 cache access is better than main memory access? I wouldn't completely rule it out, but in this context I can't think of any.
                Reading from L3 cache is always faster than reading from main memory.
                But you are missing the point; increasing the L3 cache sharing between cores is not going to help Ryzen in gaming, which was implied here earlier in the thread.

                As I've mentioned, if multiple threads are going to scale, they can't write to data shared between threads, and since they are obviously going to work on different chunks of data, the only shared cache contents will be cached code. Threads crunching a lot of data are not going to process the same data, at least not at the same time, so there is not going to be any substantial cached data shared between them.

                Originally posted by bug77 View Post
                Yes, but if you used Intel's compiler (or some specific flags), you could get code that was friendly towards that long pipeline. The argument coming from Intel was that code wasn't optimized properly for Netburst. As it turned out, Netburst wasn't properly optimized for the world.
                There are very few ways to optimize for a long pipeline. BTW, "no" software uses Intel's compiler.

                Originally posted by bug77 View Post
                Well, I expect it will support faster memory at some point. That should up the performance somewhat. And given enough time, motherboard makers should be able to squeeze a few more MHz out of it when overclocking. Beyond that, I don't know.
                Ryzen scales well in workloads that are cache friendly, but scales terribly in workloads that are riddled with branch mispredictions and cache misses, like gaming. Bumping CPU or memory clock speeds won't help when the problem is the prefetcher.
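                To make that concrete, the usual textbook contrast between the two kinds of workload looks something like this (nothing Ryzen specific, just the access patterns a prefetcher can and cannot help with):

                #include <vector>

                // Contiguous traversal: the hardware prefetcher sees a linear access
                // pattern and streams the data in ahead of the loop.
                long sum_array(const std::vector<int>& v)
                {
                    long s = 0;
                    for (int x : v) s += x;
                    return s;
                }

                // Pointer chasing: each node's address depends on the previous load, so the
                // prefetcher cannot run ahead and every miss stalls until main memory answers.
                struct Node { int value; Node* next; };

                long sum_list(const Node* n)
                {
                    long s = 0;
                    for (; n != nullptr; n = n->next) s += n->value;
                    return s;
                }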



                • #88
                  Originally posted by efikkan View Post
                  Reading from L3 cache is always faster than reading from main memory.
                  But you are missing the point; increasing the L3 cache sharing between cores is not going to help Ryzen in gaming, which was implied here earlier in the thread.
                  I'm not missing your "whole point" at all; it is actually irrelevant to the main point I was making. As I already said, even in the case where threads are not sharing memory, it may still be an advantage to assign threads to specific core groups, but then in order to ensure that *both* L3 caches are used optimally, meaning that *not* all memory-intensive threads end up on the same L3 cache.
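                  For what it's worth, that kind of assignment is something a game (or a test) could even force by hand on Linux with thread affinity. A rough sketch, assuming the first CCX is logical CPUs 0-3 and the second is CPUs 4-7 (the real layout should be checked with lscpu or hwloc):

                  // Build with: g++ -pthread (g++ defines _GNU_SOURCE by default, which these calls need).
                  #include <initializer_list>
                  #include <pthread.h>
                  #include <sched.h>
                  #include <thread>

                  // Pin an std::thread to a set of logical CPUs (Linux/glibc only).
                  void pin_to_cpus(std::thread& t, std::initializer_list<int> cpus)
                  {
                      cpu_set_t set;
                      CPU_ZERO(&set);
                      for (int c : cpus) CPU_SET(c, &set);
                      pthread_setaffinity_np(t.native_handle(), sizeof(set), &set);
                  }

                  int main()
                  {
                      std::thread group_a([] { /* memory-intensive work set A */ });
                      std::thread group_b([] { /* memory-intensive work set B */ });

                      pin_to_cpus(group_a, {0, 1, 2, 3});  // keeps A's working set in the first L3
                      pin_to_cpus(group_b, {4, 5, 6, 7});  // keeps B's working set in the second L3

                      group_a.join();
                      group_b.join();
                  }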

                  However, I do think that in reality there is always some shared memory, and at least when it is mostly read-only, that is not always a bad thing. So taking that into account may or may not yield additional optimizations, depending on the game. It certainly doesn't have to rule out scaling beyond 4 cores.

                  Originally posted by efikkan View Post
                  As I've mentioned, if multiple threads are going to scale, they can't write to data shared between threads, and since they are obviously going to work on different chunks of data, the only shared cache contents will be cached code. Threads crunching a lot of data are not going to process the same data, at least not at the same time, so there is not going to be any substantial cached data shared between them.
                  Wishful thinking? Why shouldn't a game have a lot of data that is updated at the beginning of a frame's processing time and then treated as read-only for the rest of the frame, shared by several threads? That might be how I would try to do it.

                  In such a case all your assumptions would be completely wrong.
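                  Concretely, I mean something along these lines: the shared per-frame state is written once, single-threaded, at the top of the frame, and the workers only ever read it afterwards, so their cached copies stay clean for the whole frame. Just a sketch with made-up names:

                  #include <functional>
                  #include <thread>

                  // Shared, read-mostly per-frame data: camera, time step, culling planes, ...
                  struct FrameState {
                      float camera[16];
                      float dt;
                  };

                  void worker(const FrameState& fs)
                  {
                      // Reads only: every core keeps a clean cached copy of fs, nothing is
                      // invalidated mid-frame, and no locking is needed until the join below.
                      (void)fs.dt;  // ... per-object work using fs ...
                  }

                  void run_frame(FrameState& fs, float dt)
                  {
                      fs.dt = dt;   // write phase: main thread only, before the workers start
                      // ... update camera matrix etc. here ...

                      std::thread a(worker, std::cref(fs));
                      std::thread b(worker, std::cref(fs));
                      a.join();     // both workers are done before the next frame's write phase
                      b.join();
                  }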
                  Last edited by indepe; 14 March 2017, 07:50 PM.



                  • #89
                    Originally posted by indepe View Post
                    Why shouldn't a game have a lot of data that is updated at the beginning of a frame's processing time and then treated as read-only for the rest of the frame, shared by several threads? That might be how I would try to do it.
                    ... just to chime in:
                    I guess, if you want to optimize games for multithreading, one of the things I'd do is decouple the threads and the game logic as far as possible and have as many processes as possible running simultaneously. Once a process needs data from another process, it would just access whatever value is currently available and be done with it. You may get small sync issues, as one variable might be from a different frame than another, but I'd try to code it in a way that it doesn't matter much.

                    So, for example, I have a radar system trying to show all the objects in the universe on my radar, and I have all the objects in the universe. It doesn't matter if the radar position of a certain object is off by one frame, as the radar is a) low resolution and b) one frame later it's correct again. Thus I have a decoupled system that is still performant, as it doesn't have to wait on other threads to "refresh their latest variable". So it doesn't even need to be read-only.

                    But... I'm not a coder... it's just how I'd imagine things would be "properly" done for games. On the other hand, if we do hit-reg with an octree, we have dependent layers built on top of each other that can't be running amok in the background, so one layer has to wait for the previous one, etc... *confused*
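                    Something like the following is roughly what I picture for the radar (pieced together from examples, so take it with a grain of salt): the simulation publishes the newest position buffer with one atomic store, and the radar just reads whichever buffer is currently published, possibly one frame old, without ever waiting:

                    #include <atomic>

                    struct Positions { float x[1024]; float y[1024]; };

                    // Two buffers: the simulation writes into the one that is NOT currently
                    // published, then publishes it with a single atomic pointer store.
                    Positions buffers[2];
                    std::atomic<const Positions*> published{&buffers[0]};

                    void simulation_frame(int frame)
                    {
                        Positions* back = &buffers[(frame + 1) & 1];       // the unpublished buffer
                        // ... write all object positions for this frame into *back ...
                        published.store(back, std::memory_order_release);  // publish the new frame
                    }

                    void radar_draw()
                    {
                        // Never blocks: at worst the radar draws positions that are one frame old.
                        const Positions* snap = published.load(std::memory_order_acquire);
                        // ... draw from *snap ...
                        (void)snap;
                        // (Assumes the radar finishes reading within a frame; otherwise more
                        // buffers or a different scheme would be needed.)
                    }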



                    • #90
                      1. Tomb Raider is the one game here that was optimized for game consoles with low memory...
                      2. Maybe the L-cache per core is a cause of the low performance.

