GeForce GTX 1080 Ti: Core i7 7700K vs. Ryzen 7 1800X Linux Gaming Performance


  • #71
    Originally posted by efikkan View Post
    Synchronization of threads is not the problem.
    I'm not talking about synchronization here, but about the assignment of threads to physical and/or HT cores (and about which cores share which L3 cache).

    Originally posted by efikkan View Post
    "Multithreading" has been supported by all APIs since the late 90s.
    I think DX 11 was the first "DX" to support MT. What do you mean by "all" APIs, other than DX 11 and Vulkan?
    (Of course, newer versions of OpenGL don't support multithreading either.)

    Comment


    • #72
      Originally posted by gurv View Post
      There's something wrong with the Tomb Raider Benchmark.
      No way the 1080 Ti is pushing 160fps in 4k Ultra.
      And no way the fps stays exactly the same between 4k normal and 4k ultra.

      Other than that, the performance difference between the 1800X and the 7700K seems consistent with what I've seen elsewhere.
      In gaming, Ryzen is hamstrung by its memory controller and has an IPC of ~85% of Sandy Bridge, which means ~70% of Skylake/Kaby Lake.
      Given that the 7700K runs at 4.2 GHz (all core) vs. 3.7 GHz (all core) for the 1800X, that gives Skylake a theoretical advantage of ~65% in gaming workloads that don't benefit from more than 4 cores / 8 threads.
      And that's pretty much what we see with Deus Ex MD (in 1080p), Dota 2, Civ 6 and Unigine Valley.
      The gap is smaller with Unigine Heaven, probably because it taxes the GPU more.
      It's also smaller for Metro Last Light because this game can benefit from more than 4 cores, as shown in a previous Phoronix article.
      Ryzen's IPC is at Broadwell-E level, which is only 5-6% lower than Skylake/Kaby Lake. Ryzen is only behind Intel in clock speed ATM.

      phoronix It would be an interesting test to see how the R7 fares against the 6900K (or the comparable 5960X) in gaming on Linux, since they have equivalent IPC and similar clock speeds. This would help find out whether there is some problem to be fixed with Ryzen right now (as on Windows).

      Comment


      • #73
        Originally posted by indepe View Post
        A number of optimizations will come very easily, as shown by the fact that several games have a significant performance increase (for example more than 10% for Deus Ex in the test I referenced above) just by disabling SMT. Games should be able to easily gain further by using finer-grained thread control.

        Also, my understanding is that DX 11 already has multi-threading; if so, games (or game engines) don't have to be converted to DX 12 in order to improve further on Ryzen 7.
        That is assuming that all of the SMT penalty can actually be fixed. Even Intel's initial Hyper-Threading had penalties. Being AMD's first attempt, one might need to wait until Zen+ for it to be fully fixed/carry no performance hit.

        Comment


        • #74
          Originally posted by bakgwailo View Post

          That is assuming that all of the SMT penalty can actually be fixed. Even Intel's initial Hyper-Threading had penalties. Being AMD's first attempt, one might need to wait until Zen+ for it to be fully fixed/carry no performance hit.
          No, it is simply assuming that a game can use more fine-grained thread affinities, to achieve *at least* the same effect as you can by turning SMT completely off (which can be more than 10% depending on the game).

          For example, if a game has 2 or 3 threads on a critical path, it can assign a complete physical core (without SMT) to each of these threads, and use SMT for all others. That may result in a larger benefit, and be applicable to more games.

          Furthermore, a game can ensure that thread groups sharing memory use the same L3 cache, while threads not sharing that memory run on the other L3 cache.

          Both of these steps should be quite easy to accomplish, just by assigning existing threads to specific cores, without having to re-write game code.
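          As a rough sketch of what I mean (the CPU numbers and the helper below are hypothetical; on Linux a thread's allowed CPUs can be set with pthread_setaffinity_np):

          #define _GNU_SOURCE
          #include <pthread.h>
          #include <sched.h>
          #include <initializer_list>
          #include <thread>

          // Restrict a std::thread to the given logical CPUs (Linux-specific).
          static void pin_to_cpus(std::thread& t, std::initializer_list<int> cpus)
          {
              cpu_set_t set;
              CPU_ZERO(&set);
              for (int cpu : cpus)
                  CPU_SET(cpu, &set);
              pthread_setaffinity_np(t.native_handle(), sizeof(set), &set);
          }

          int main()
          {
              std::thread critical([] { /* e.g. the render/submit thread */ });
              std::thread helper([] { /* e.g. asset streaming, audio, ... */ });

              pin_to_cpus(critical, {0});    // hypothetical: critical thread alone on logical CPU 0
              pin_to_cpus(helper, {2, 3});   // hypothetical: two logical CPUs assumed to share a CCX/L3

              critical.join();
              helper.join();
          }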

          Comment


          • #75
            Originally posted by indepe View Post
            I'm not talking about synchronization here, but about the assignment of threads to physical and/or HT cores (and about which cores share which L3 cache).
            You are talking about scheduling, which is done by the kernel in the OS. Userland software doesn't work by knowing about L3, etc.

            Originally posted by indepe View Post
            I think DX 11 was the first "DX" to support MT. What do you mean by "all" APIs, other than DX 11 and Vulkan?
            (Of course, newer versions of OpenGL don't support multithreading either.)
            That's completely false.
            Both Direct3D and OpenGL have allowed you to use multiple contexts from multiple threads for ~20 years. As someone who has used OpenGL for a decade and a half, I can assure you it's supported there as well.

            Multithreading is not new, but there are enhancements in Direct3D 12. Many of you seem to think that you can just take a rendering thread and split it into two or more threads to create a single queue. Granted, Direct3D 12 will allow this, but you'll still have to avoid synchronization problems and data hazards, so there is really no point to it. You can build different queues in different threads, but not all games will gain a lot from that.

            Any decent programmer knows that you can't scale just anything with multithreading. It only works when you have multiple chunks of work which can be processed independently of each other. A rendering queue is a pipeline consisting of separate stages, so if you wanted to use multiple threads to build a queue, you'd have to spend precious time synchronizing them in order to get the desired result.

            The problem with rendering is usually what the engine does besides building the queue. Most games traverse a list of objects to render in-game. Doing so will cause at least one cache miss per iteration, causing significant stalls for the CPU. Combine this with a rendering function call for each object, and some layers of abstraction, and you'll have a significant amount of overhead. Intel's prefetcher is much better at branch prediction and prefetching of data, which helps mitigate these problems, but only to some extent, of course.
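            To make that last point concrete, here is a small illustrative comparison (not from any real engine): the first loop chases a pointer per object and will tend to miss cache on every iteration, while the second walks a contiguous array that the hardware prefetcher can stream:

            #include <memory>
            #include <vector>

            struct RenderObject {
                float transform[16];
                int   mesh_id;
                int   material_id;
            };

            // Scattered layout: one heap allocation per object, a pointer chase per draw.
            void submit_scattered(const std::vector<std::unique_ptr<RenderObject>>& objects)
            {
                for (const auto& obj : objects)
                    /* issue_draw(*obj); */ (void)obj->mesh_id;
            }

            // Contiguous layout: objects packed back to back, so most iterations hit the cache.
            void submit_contiguous(const std::vector<RenderObject>& objects)
            {
                for (const auto& obj : objects)
                    /* issue_draw(obj); */ (void)obj.mesh_id;
            }

            int main()
            {
                std::vector<std::unique_ptr<RenderObject>> scattered;
                std::vector<RenderObject> contiguous(100000);
                for (int i = 0; i < 100000; ++i)
                    scattered.push_back(std::make_unique<RenderObject>());

                submit_scattered(scattered);
                submit_contiguous(contiguous);
            }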

            Originally posted by indepe View Post
            Furthermore, a game can ensure that thread groups sharing memory use the same L3 cache, while threads not sharing that memory run on the other L3 cache.

            Both of these steps should be quite easy to accomplish, just by assigning existing threads to specific cores, without having to re-write game code.
            What?
            First of all, L3 cache is shared among all cores, not selectively. The cache hierarchy works in 64-byte cache lines. Whenever a thread requests a memory address, the whole 64-byte block is cached. If two or more threads request something inside the same cache line, each gets a local copy in its L2, but only a single copy exists in L3. Whenever a thread writes to a memory address, that cache line is invalidated in the other cores. So sharing of cache lines, even in the case of false sharing, would have to be read-only; otherwise performance slows down to a level below that of a single thread.
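            If you want to see this cache-line ping-pong for yourself, here is a minimal illustration (not tied to any game): two threads incrementing counters that sit in the same 64-byte line run far slower than the same threads with the counters padded onto separate lines:

            #include <atomic>
            #include <thread>

            struct SharedLine {
                std::atomic<long> a{0};
                std::atomic<long> b{0};              // same cache line as 'a' -> false sharing
            };

            struct PaddedLines {
                alignas(64) std::atomic<long> a{0};
                alignas(64) std::atomic<long> b{0};  // own cache line, no ping-pong
            };

            template <typename Counters>
            void run(Counters& c)
            {
                std::thread t1([&] { for (long i = 0; i < 10000000; ++i) c.a.fetch_add(1); });
                std::thread t2([&] { for (long i = 0; i < 10000000; ++i) c.b.fetch_add(1); });
                t1.join();
                t2.join();
            }

            int main()
            {
                SharedLine  s; run(s);   // measurably slower on typical x86 hardware
                PaddedLines p; run(p);
            }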

            Since the whole point of multiple threads is to work on different sets of data, there is usually very little sharing going on. As mentioned, any kind of data containing state or similar can't be shared due to data hazards, and the processed data is obviously not shared. In fact, what the L3 will to a large extent share is cached code, which may be used by a number of threads.

            What you are talking about, allocating L3, is nonsense though; that's not how L3 works.

            Comment


            • #76
              Originally posted by indepe View Post

              No, it is simply assuming that a game can use more fine-grained thread affinities, to achieve *at least* the same effect as you can by turning SMT completely off (which can be more than 10% depending on the game).

              For example, if a game has 2 or 3 threads on a critical path, it can assign a complete physical core (without SMT) to each of these threads, and use SMT for all others. That may result in a larger benefit, and be applicable to more games.

              Furthermore, a game can ensure that thread groups sharing memory use the same L3 cache, while threads not sharing that memory run on the other L3 cache.

              Both of these steps should be quite easy to accomplish, just by assigning existing threads to specific cores, without having to re-write game code.
              Games should not manually be programming which threads go where for a specific CPU architecture. That is the job of the OS/Kernel and its scheduler. The issue is also making sure the scheduler takes the dual CCXs into account, which I believe, unlike Windows 10, the Linux kernel already has patches for in 4.10, and that they were backported to some older kernels. SMT/Hyper-Threading, as I said, has overhead, and I doubt they can reduce the performance penalty to 0 purely in software. It has a performance hit that perhaps AMD, like Intel, can get around/minimize in later silicon revisions. AMD's first foray into SMT as it is actually isn't that bad.
              Last edited by bakgwailo; 13 March 2017, 06:52 PM. Reason: grammers

              Comment


              • #77
                At least AMD blogged something about Ryzen today

                Comment


                • #78
                  Originally posted by efikkan View Post
                  You are talking about scheduling, which is done by the kernel in the OS. Userland software doesn't work by knowing about L3, etc.
                  No, the OS does it by default, but an application can use explicit thread affinity as an alternative or in combination.

                  Originally posted by efikkan View Post
                  That's completely false.
                  Both Direct3D and OpenGL have allowed you to use multiple contexts from multiple threads for ~20 years. As someone who has used OpenGL for a decade and a half, I can assure you it's supported there as well.
                  ...
                  No, that sounds wrong to me. I always needed to call "makeCurrentContext" on an OpenGL context before using it in a thread, and during that time no other thread could use OpenGL, as the API calls themselves don't even specify which context they are meant to use. Furthermore, makeCurrentContext is said to be an expensive call. And this is the reason OpenGL is generally not considered a multi-threaded API.
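                  For reference, this is roughly what it looks like with GLX on Linux (a sketch; dpy, win, vis and main_ctx are assumed to come from the usual window setup): each thread has to bind a context with glXMakeCurrent before issuing GL calls, because the calls themselves never name a context.

                  #include <GL/glx.h>
                  #include <thread>

                  void upload_on_worker(Display* dpy, GLXDrawable win, XVisualInfo* vis,
                                        GLXContext main_ctx)
                  {
                      // Second context that shares textures/buffers with the main one.
                      GLXContext worker_ctx = glXCreateContext(dpy, vis, main_ctx, True);

                      std::thread uploader([=] {
                          // A drawable can only be current in one thread at a time, so real
                          // code would typically bind a small GLXPbuffer here instead of win.
                          glXMakeCurrent(dpy, win, worker_ctx);  // GL calls now target worker_ctx
                          // ... glGenTextures / glTexImage2D uploads ...
                          glXMakeCurrent(dpy, None, nullptr);    // unbind before the thread exits
                      });
                      uploader.join();
                  }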


                  Originally posted by efikkan View Post
                  What?
                  First of all, L3 cache is shared among all cores, not selectively.
                  ...
                  No, not on the Ryzen 7. The Ryzen 7 has two groups of 4 physical cores that each have their own L3 cache.
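                  (For what it's worth, the split is visible from userspace: on Linux each logical CPU reports which CPUs share its last-level cache. A quick sketch, assuming index3 is the L3 as it normally is on x86:)

                  #include <fstream>
                  #include <iostream>
                  #include <string>

                  int main()
                  {
                      for (int cpu = 0; ; ++cpu) {
                          std::ifstream f("/sys/devices/system/cpu/cpu" + std::to_string(cpu) +
                                          "/cache/index3/shared_cpu_list");
                          if (!f)
                              break;                 // no more CPUs (or no exposed L3 info)
                          std::string sharers;
                          std::getline(f, sharers);
                          std::cout << "cpu" << cpu << " shares its L3 with: " << sharers << "\n";
                      }
                  }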

                  Originally posted by efikkan View Post
                  Since the whole point of multiple threads is to work on different sets of data, there is usually very little sharing going on. As mentioned, any kind of data containing state or similar can't be shared due to data hazards, and the processed data is obviously not shared. In fact, what the L3 will to a large extent share is cached code, which may be used by a number of threads.
                  ...
                  No, and even if that were true, on a Ryzen 7 (which has two L3 caches) you would still want the threads with the heaviest data usage to each use a different L3 cache, and therefore to be assigned to different core groups.

                  Although of course each thread has some data of its own, having only that would be pointless to the outside world.

                  For example you may have multiple data structures, and pass access to one structure from one thread to another. How else are threads to communicate with each other, and eventually with the GPU, if not by sharing data at one point or another?
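                  A rough sketch of that kind of hand-off (FrameData is just a made-up example structure): one thread builds the data, then moves ownership to another thread through a small queue, so the hand-off itself is the only synchronization point.

                  #include <condition_variable>
                  #include <memory>
                  #include <mutex>
                  #include <queue>
                  #include <thread>
                  #include <vector>

                  struct FrameData {                       // hypothetical per-frame structure
                      std::vector<float> transforms;
                  };

                  std::mutex m;
                  std::condition_variable cv;
                  std::queue<std::unique_ptr<FrameData>> handoff;

                  int main()
                  {
                      std::thread producer([] {
                          auto frame = std::make_unique<FrameData>();
                          frame->transforms.assign(1024, 0.0f);   // simulate building the frame
                          {
                              std::lock_guard<std::mutex> lock(m);
                              handoff.push(std::move(frame));     // ownership moves to the queue
                          }
                          cv.notify_one();
                      });

                      std::thread consumer([] {
                          std::unique_lock<std::mutex> lock(m);
                          cv.wait(lock, [] { return !handoff.empty(); });
                          auto frame = std::move(handoff.front());
                          handoff.pop();                          // consumer now owns the data
                          // ... hand frame->transforms to the renderer ...
                      });

                      producer.join();
                      consumer.join();
                  }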

                  Comment


                  • #79
                    Originally posted by bakgwailo View Post

                    Games should not manually be programming which threads go where for a specific CPU architecture. That is the job of the OS/Kernel and its scheduler.
                    That's great as a general guideline. Please follow it.

                    Originally posted by bakgwailo View Post
                    ...
                    SMT/Hyper-Threading, as I said, has overhead, and I doubt they can reduce the performance penalty to 0 purely in software.
                    ...
                    Yes, and you don't want that overhead on a thread that is on the critical path, especially if you have enough cores to give you the flexibility to optimize and an application that can be expected to run more or less on its own. However, with just 4 cores that is perhaps not to be attempted, even by specialists.

                    Comment


                    • #80
                      Originally posted by indepe View Post
                      For example you may have multiple data structures, and pass access to one structure from one thread to another. How else are threads to communicate with each other, and eventually with the GPU, if not by sharing data at one point or another?
                      You are completely missing the point.
                      Of course threads can communicate at any time, but it comes at a huge performance penalty. The sole purpose of utilizing multiple threads is to divide the workload, but if you spend a lot of time on overhead communicating between them, it defeats the purpose. During the rendering of a single frame (usually <16.67 ms) you can't afford a lot of thread communication. You simply can't do thousands of such exchanges; otherwise you'll have to start measuring the frame rate in seconds per frame...

                      Whenever a program successfully scales with multithreading, it does so by letting the threads work on their tasks independently, usually syncing only after the task is complete. Some applications keep handing out new chunks of work to each thread, but it's always a significant amount of work; otherwise the overhead would surpass the gains. For games, timing is critical, which means any synchronization is very expensive.
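                      In code, that pattern looks roughly like this (an illustrative sketch, not from any particular engine): each thread gets one large, independent slice of the work, and the threads synchronize exactly once, at the join.

                      #include <thread>
                      #include <vector>

                      // Split 'data' into one contiguous slice per thread; each slice is
                      // processed without locking, and the final join is the only sync point.
                      void process_all(std::vector<float>& data, unsigned nthreads)
                      {
                          std::vector<std::thread> pool;
                          const std::size_t chunk = data.size() / nthreads;

                          for (unsigned t = 0; t < nthreads; ++t) {
                              std::size_t begin = t * chunk;
                              std::size_t end   = (t + 1 == nthreads) ? data.size() : begin + chunk;
                              pool.emplace_back([&data, begin, end] {
                                  for (std::size_t i = begin; i < end; ++i)
                                      data[i] *= 2.0f;            // independent work on a private range
                              });
                          }
                          for (auto& th : pool)
                              th.join();
                      }

                      int main()
                      {
                          std::vector<float> data(1 << 20, 1.0f);
                          process_all(data, 4);                   // e.g. one worker per physical core
                      }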

                      For all multithreaded work, sharing the L3 cache for data is something to avoid.
                      I suggest you watch Scott Meyers' talk "CPU Caches and Why You Care", which also touches on the subject of sharing and even false sharing.

                      Comment
