Patches Revised For Supporting OpenGL 4.0 On Intel Haswell With Mesa

  • suberimakuri
    replied
    Re: HSW
    It looks like the patch series has mostly been reviewed.
    The only open point seems to be a couple of ideas about how to check whether it is okay to enable GL 4.0 on HSW.
    A third or fourth alternative method was suggested, and the thread seems to stop there.

    Hope it gets merged soon!

  • AdamOne
    replied
    Testing these patches now; so far, no ARMA 3 on Haswell.

    Also, a YouTube video stream lost its video and played audio only.

  • Xelix
    replied
    Originally posted by schmidtbag View Post
    I'm aware they're independent. As I said before, each CPU core is paired up in a module. When you have 2 independent processes, you get half the performance because one core just sits there doing pretty much nothing. That core isn't doing anything because the other core is grabbing all the attention of the fetch, decode, and L2 cache.
    This is incorrect. Having shared resources doesn't mean that execution is completely serialized; that would completely defeat the purpose of having modules/SMT in the first place.

    They are indeed competing for shared resources, but the two threads are load-balanced based on various criteria. The obvious one is long-latency events (e.g. if one thread is stalled due to a miss, you might as well start executing the other thread and overlap their execution).

    One final note on caches. You mentioned the L2 cache being a serialization point. Modern caches are banked and can sustain several parallel requests.
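    A quick way to see this on Linux is to pin the same busy loop either to the two SMT siblings of one physical core or to two separate physical cores, and compare the reported IPC. This is only a rough sketch: the CPU numbering below (logical CPUs 0 and 4 as siblings, 0 and 1 as separate cores) is an assumption you should verify against the sysfs topology first.

    Code:
    # which logical CPUs are SMT siblings of CPU 0? (often "0,4" on Intel desktop parts)
    cat /sys/devices/system/cpu/cpu0/topology/thread_siblings_list

    # two busy loops pinned to the two siblings of one physical core
    taskset -c 0 perf stat timeout 10 bash -c "while true; do true; done" &
    taskset -c 4 perf stat timeout 10 bash -c "while true; do true; done" &
    wait

    # the same two loops on two different physical cores, for comparison
    taskset -c 0 perf stat timeout 10 bash -c "while true; do true; done" &
    taskset -c 1 perf stat timeout 10 bash -c "while true; do true; done" &
    wait

    Comparing the "insn per cycle" figures from the two runs shows how much (or how little) the shared resources actually cost for this particular workload; if execution were truly serialized, the sibling case would cut each process's throughput in half.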

  • schmidtbag
    replied
    Originally posted by Xelix View Post
    You lost me once more. These 4 instances that atomsymbol talks about are independent, why would they have to wait on each other?
    I'm aware they're independent. As I said before, each CPU core is paired up in a module. When you have 2 independent processes, you get half the performance because one core just sits there doing pretty much nothing. That core isn't doing anything because the other core is grabbing all the attention of the fetch, decode, and L2 cache.

  • Xelix
    replied
    Originally posted by schmidtbag View Post
    That is true - this is because a single-threaded process locks up 2 cores, so if you are running 2 processes in parallel, the other one has to wait for the first to finish. So yes you're right, it is similar to the limitation of HT but as you said, to a lesser extent.
    You lost me once more. These 4 instances that atomsymbol talks about are independent, why would they have to wait on each other?

  • schmidtbag
    replied
    Originally posted by atomsymbol
    This, to a lesser extent, also happens on Bulldozer CPUs. For example, on Kaveri, running 2 instances of the code displayed below yields IPC=1.93 while running 4 instances of the code yields IPC=1.79. With other codes the slowdown can be much bigger.

    Code:
    perf stat bash -c "while true; do true; done"
    That is true - this is because a single-threaded process locks up 2 cores, so if you are running 2 processes in parallel, the other one has to wait for the first to finish. So yes, you're right: it is similar to the limitation of HT but, as you said, to a lesser extent.
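    One way to take the guesswork out of those IPC numbers is to pin the instances explicitly, so you know whether they share a module or not. A rough sketch, assuming the usual Bulldozer-family numbering where logical CPUs 0 and 1 are the two cores of one module and CPU 2 sits in the next module (check with lscpu -e or lstopo before trusting the numbering):

    Code:
    # two instances confined to the two cores of a single module
    taskset -c 0 perf stat timeout 10 bash -c "while true; do true; done" &
    taskset -c 1 perf stat timeout 10 bash -c "while true; do true; done" &
    wait

    # two instances spread across different modules
    taskset -c 0 perf stat timeout 10 bash -c "while true; do true; done" &
    taskset -c 2 perf stat timeout 10 bash -c "while true; do true; done" &
    wait

    The gap between the two "insn per cycle" readings is the actual cost of the shared fetch/decode front end for this loop.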

  • Xelix
    replied
    You completely lost me with your car analogy.

    Originally posted by schmidtbag View Post
    Hyper Threading, meanwhile, is almost the exact opposite. When looking at a single physical core, it can't process in parallel. This is why HT is known to sometimes make performance worse than having it completely disabled. For example, sometimes a process needs to wait for a HT thread to catch up with a non-HT thread.
    This is flat-out wrong. Intel's HyperThreading DOES process instructions in parallel, even within a single physical core. These processors are superscalar, and are thus already processing several instructions in parallel, even without HT. The benefit of HT is that it allows in-flight instructions from different threads to coexist within the same core at the same time, thus (theoretically) maximizing pipeline occupancy. This is especially good if one of the threads has a high CPI due to long-latency events (cache misses, etc.); meanwhile, the other thread can keep working.

    As for the cases where HT makes things worse: this happens due to the nature of SMT, where the two hardware threads share almost all the resources within a core. Notably, Hyperthreads share the private L1 and L2 caches, leading to possible cache thrashing.

    Note: I have no idea what you mean when you say "For example, sometimes a process needs to wait for a HT thread to catch up with a non-HT thread.".
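    If you want to check on your own box whether SMT is enabled and which logical CPUs are Hyperthreading siblings (and therefore share those private caches), reasonably recent kernels expose this directly; a small sketch using standard tools:

    Code:
    # "1" if SMT/Hyperthreading is currently active (kernels 4.19+)
    cat /sys/devices/system/cpu/smt/active

    # logical CPUs that map to the same CORE value are siblings sharing L1/L2
    lscpu -e=CPU,CORE,SOCKET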

    Originally posted by schmidtbag View Post
    As I stated in my original post, it is good and very fast in theory. In practice and reality, it's pretty bad in most situations. But even on Phoronix, every once in a while you'll find tests where Piledriver (I think the 2nd generation of BD?) competes with modern i7s. It's rare, and it usually only wins in very obscure tests, but it does happen.

    It really comes down to understanding how the architecture works.
    [...]

    So, all that being said, you have to code software to take advantage of the CPU's features. AMD lost out since the module system is not only unpopular but also very picky. HT will never offer the performance benefit of AMD's modules, but its flexibility will always make it a winner.
    I understand well how the architecture works, and I think your analysis is wrong. When it comes to single-threaded performance, there are many reasons why Bulldozer cores are slow compared to Intel's:

    * It has a deep pipeline in order to support running at high frequencies. This means that 1) it will use a lot more power and 2) branch mispredictions will be very, very expensive.
    * Slow caches and a small L1D.
    * And probably a lot of other small details that don't come to mind right now.

    Bulldozer could (sometimes) shine in applications that are multi-threaded and integer-heavy (because each module has two integer pipelines but only one floating-point pipeline), but that's about it.
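    For anyone who wants to put numbers on these points, perf can report the branch-misprediction and cache-miss rates that make a deep pipeline and slow caches hurt. A minimal sketch using standard perf events; swap in whatever workload you actually care about in place of the busy loop:

    Code:
    perf stat -e cycles,instructions,branches,branch-misses,cache-references,cache-misses timeout 10 bash -c "while true; do true; done"

    The deeper the pipeline, the more cycles each reported branch miss costs, and the cache-miss rate is where cache latency and L1D size show up.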

    PS: This was my attempt at an objective analysis; I have nothing personal against AMD. I would love to see them perform better, and in fact I even own 3 or 4 of their products.

  • schmidtbag
    replied
    Originally posted by Xelix View Post
    Do you mind explaining why Bulldozer's architecture was, in your opinion, "good and very fast"?
    As I stated in my original post, it is good and very fast in theory. In practice and reality, it's pretty bad in most situations. But even on Phoronix, every once in a while you'll find tests where Piledriver (I think the 2nd generation of BD?) competes with modern i7s. It's rare, and it usually only wins in very obscure tests, but it does happen.

    It really comes down to understanding how the architecture works. They use modules, each of which contains 2 cores that share a few things like fetch and decode. I like to think of it as a car with 2 engines inside the same vehicle. If this vehicle is making multiple deliveries to a single destination, it will do a great job. But in most situations, you'd be better off putting one of those engines in a completely separate vehicle, which then has the choice of going to the same destination or a completely different one. This also implies that 1 big powerful vehicle can't effectively perform multiple different deliveries in parallel, which is why single-threaded performance is so bad on Bulldozer.

    Hyper Threading, meanwhile, is almost the exact opposite. When looking at a single physical core, it can't process in parallel. This is why HT is known to sometimes make performance worse than having it completely disabled. For example, sometimes a process needs to wait for a HT thread to catch up with a non-HT thread.

    So, all that being said, you have to code software to take advantage of the CPU's features. AMD lost out since the module system is not only unpopular but also very picky. HT will never offer the performance benefit of AMD's modules, but its flexibility will always make it a winner.

  • Xelix
    replied
    Originally posted by caligula View Post

    There are other platforms that perform quite nicely, but the mainstream is stuck with Intel. Even the PlayStation had a special-purpose Cell CPU. There are Kalray, Tilera, Parallela, ... Even ARM is becoming really strong with respect to performance. For gaming, the GPUs already play a major role. IBM has really powerful CPUs that support much higher memory bandwidth. What's preventing Intel from doing the same?
    Let's not forget that the Cell was a nightmare to program. Kalray, Tilera, Parallela, etc are all great ideas but they target specific niches, typically everything that is massively parallel (supercomputing, video, some specific datacenter workloads such as websearch, etc).

    As for ARM, yeah, they are making huge improvements in performance. But in my opinion, the main reason is that modern ARM cores are getting closer to "fat" desktop cores, such as the ones produced by Intel and AMD. In fact, most of the key techniques used by desktop processors for high-IPC are present in recent high-performance ARM cores (wide issue, out-of-order execution, speculative execution, large caches, etc).

  • Xelix
    replied
    Originally posted by schmidtbag View Post
    I've thought of that, but Intel must have something planned in the event AMD outdoes them. As stated before, I think the issue is more that they can't leave AMD too far in the dust or else they'll be a monopoly, and they're already dangerously close to that.

    The market is an issue, but I think your points can be further elaborated in a different direction. The problem is that there's no flexibility for architectural changes. Take AMD's Bulldozer, for example - in theory, the architecture was good and very fast, but software isn't optimized to run the way it was intended to. x86 software revolves around CPUs with high IPC and doesn't scale across multiple threads very well. CPUs are pretty much required to follow this path. So, as you said, maybe Intel really can't improve upon the architecture. But they have the money and technology to come up with a different approach.
    Do you mind explaining why Bulldozer's architecture was, in your opinion, "good and very fast"?
