An Introduction To Intel's Tremont Microarchitecture


  • #51
    Originally posted by atomsymbol

    Just some notes:

    A naturally parallel task is, for example, 2D rendering, assuming the display device (such as a 4K monitor) is incapable of on-the-fly/realtime data decompression of 1D lines or 2D rectangles.

    Efficient (that is: incremental) compilation of C/C++ code is not a naturally parallel task.
    That's correct, but it's kind of tangential to the point I was making.

    I'm saying that the average OS will have dozens of processes by default and you can run any number of applications at once (which may or may not multithread themselves). So even on multicore systems each CPU core will be running (much) more than a single process.

    Unless you are using a huge server CPU in a desktop, but that is not a very efficient use of resources.



    • #52
      Originally posted by Space Heater View Post
      It's not a zero-sum game: we can't ignore single-threaded performance, and likewise we can't ignore thread-level parallelism.
      Not true. Assume you have a fixed chip size and manufacturing process technology. You then have a fixed budget of space for logic gates, and you can't exceed it. Which processor design is best depends on the task at hand, but I'm pretty sure designs like GPUs and DSPs already show that there are more efficient ways of solving tasks than monolithic CPUs that focus on single-thread performance. Sure, CPUs are better general-purpose platforms, but you have to make some strong assumptions about the target audience and their use cases to be able to argue which design is the best.

      By the way, the original paper by Amdahl was an argument for focusing on single-threaded performance in processor designs; it argued that it is essentially impossible to get linear speedups as you increase the core count for almost all real-world workloads. Did you actually read the paper and believe I'm missing something, or did you just want to misconstrue what I'm saying?
      Of course sequential performance is important, since it also translates into faster multi-core processing. What you're missing is this: you're claiming that multi-threading sucks because it won't scale without overhead, that is, adding 100% more processing units produces less than 100% more actual processing power. But you don't need anywhere near perfect scaling to exceed the performance improvements Intel achieves with faster single-core performance. You only need the doubled core count to yield a bit more than a 5% improvement, which is perfectly doable. That's the main reason people started buying Ryzen: Intel offered its standard ~5% annual improvement and AMD offered 100% more cores. Apparently people had computational tasks that utilized the extra cores well enough to beat that 5%, getting maybe 5.5% or more.

      If you look at the single-thread optimizations, the gains are pretty modest. Typically Intel CPUs only gain a few percent of additional processing power per generation, and a majority of the speedup can be attributed to higher frequencies. I'm not arguing that low-level optimizations enabling higher clocks are bad. I'm arguing that the IPC optimizations are more expensive in terms of chip space than adding more cores and threads. Later, if you happen to need space for more cores, it's already been spent on huge speculative-execution engines. It's funny that you're dismissing multi-threading when multi-threading is a much more flexible way of adding computational power than vector instructions (which, along with higher frequencies, largely produce the perceived IPC improvements).
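      To put rough numbers on that trade-off, here is a minimal sketch of Amdahl's law; the 5% single-thread baseline and the parallel fractions are illustrative assumptions, not measurements.

      // Amdahl's law: speedup(p, n) = 1 / ((1 - p) + p / n),
      // where p is the parallel fraction of the work and n the number of cores.
      #include <cstdio>

      static double amdahl(double p, double n) {
          return 1.0 / ((1.0 - p) + p / n);
      }

      int main() {
          // How parallel does a workload have to be before doubling the core
          // count beats a hypothetical 5% single-thread improvement?
          for (double p : {0.05, 0.10, 0.25, 0.50, 0.90}) {
              double s = amdahl(p, 2.0); // 2x cores, same per-core speed
              std::printf("parallel fraction %2.0f%%: 2x cores -> +%.1f%%%s\n",
                          p * 100.0, (s - 1.0) * 100.0,
                          s > 1.05 ? " (beats the +5% single-thread gain)" : "");
          }
          return 0;
      }

      With these assumptions, a workload that is only ~10% parallel already breaks even with a 5% single-thread gain when the core count doubles; anything more parallel than that favours the extra cores.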



      • #53
        Originally posted by duby229 View Post

        I guess it's you who doesn't know what Amdahl's law means. In actual fact it's a statement that you can't parallelize loads indefinitely. If you take almost any single-threaded x86 load and try to parallelize it, you get vastly diminishing returns after 4 threads, and 16 threads is the very most that doesn't look ridiculous.
        You can cherry-pick whatever benchmarks you want. Ever looked at, e.g., the Phoronix benchmarks? Many of them utilize multiple cores, and more importantly, those tasks are the ones that matter. Of course there are millions of 100% sequential tasks, but how many of them are truly slow even on a legacy system? Please, if you want to say something relevant, use examples that aren't already fast enough on an 80486.

        Amdahl's law is -exactly- why things like gcc are still single threaded. It makes more sense to run many single threads in parallel.
        LoL, wtf are you smoking? I don't know any developer who doesn't use gcc in a multi-threaded way in 2019. It's also one of the best examples in the Phoronix Test Suite of close-to-linear scalability, even on huge EPYC systems. Check the numbers, man. If you're arguing that 'make' doesn't really use threads but processes, that's 100% irrelevant: the processes are threads from the CPU's point of view. https://github.com/yrnkrn/zapcc also shows that you're totally clueless.

        Amdahl's law -is- a good reason why single threaded performance is so important
        No, it shows that thread performance is important, not single threaded performance. A fast single thread might be better than multiple slow threads, but multiple fast threads are better than a fast single thread.



        • #54
          Originally posted by atomsymbol

          Efficient (that is: incremental) compilation of C/C++ code is not a naturally parallel task.
          Compilation consists of multiple phases, and some of them are embarrassingly parallel (the best class of parallelism). For example, if your changeset contains two independent translation units, both can be lexed and parsed 100% in parallel. Assuming they don't depend on each other, and only on a previously known set of dependencies, you can check syntax and semantics 100% in parallel and resolve references 100% in parallel. Look at what zapcc does. Assuming you don't do LTO and have proper modules, you can carry on with further synthesis stages and even full code generation in parallel. The only thing that isn't 100% parallel is the linker. So compilation is a lousy example of a non-parallel task. But it's true that most modern compilers don't exploit parallelism here; they don't even do real incremental compilation the way zapcc does.
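          As a rough sketch of that per-translation-unit parallelism (my own toy example, not zapcc's design; the file names and the g++ invocation are assumptions), each independent .cpp file can be handed to its own compiler process, with only the link step left sequential:

          // Compile independent translation units concurrently, then link sequentially.
          // Assumes g++ is on PATH and a.cpp/b.cpp/c.cpp are independent sources.
          #include <cstdlib>
          #include <future>
          #include <string>
          #include <vector>

          int main() {
              std::vector<std::string> units = {"a.cpp", "b.cpp", "c.cpp"};
              std::vector<std::future<int>> jobs;

              // Embarrassingly parallel phase: one compiler process per unit.
              for (const auto& tu : units)
                  jobs.push_back(std::async(std::launch::async, [tu] {
                      return std::system(("g++ -c " + tu).c_str());
                  }));

              for (auto& j : jobs)
                  if (j.get() != 0) return 1; // a unit failed to compile

              // Sequential phase: the link needs all object files at once.
              return std::system("g++ a.o b.o c.o -o app");
          }

          Driving the compiler from std::async here is just for illustration; make -j achieves the same thing with plain child processes, which, as noted earlier in the thread, makes no difference from the CPU's point of view.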



          • #55
            Originally posted by caligula View Post
            Not true. Assume you have a fixed chip size and manufacturing process technology. You then have a fixed budget of space for logic gates, and you can't exceed it. Which processor design is best depends on the task at hand, but I'm pretty sure designs like GPUs and DSPs already show that there are more efficient ways of solving tasks than monolithic CPUs that focus on single-thread performance. Sure, CPUs are better general-purpose platforms, but you have to make some strong assumptions about the target audience and their use cases to be able to argue which design is the best.
            You can improve single-threaded performance while increasing core count, clearly Zen 2 is an exemplar of that. Having to choose between only increasing core count and only improving single threaded performance is a classic false dichotomy.

            For a concrete example, imagine you improve the branch predictor by moving from perceptron to TAGE (or vice versa). That will improve single-threaded performance, but it does not at all impinge on increasing the core count, nor is it a net drain on the transistor budget.

            Originally posted by caligula View Post
            Of course the sequential performance is important since it also translates into faster multi-core processing. The fact you're missing is that you're claiming that multi-threading sucks because it won't scale without any overhead.
            But I don't think multi-core processing is bad and I never said it was bad. This whole time I have been saying that I don't think we should totally ignore single-threaded performance, as the original comment I quoted was strongly implying.

            I'm not sure why you continue to claim that I think multi-threading is bad, I don't think that at all. The rest of your argument is against a fictitious stance.

            Originally posted by caligula View Post
            No, it shows that thread performance is important, not single threaded performance.
            Yes, thread (singular) performance is still important for general purpose workloads. When people say single-threaded performance they are talking about per-thread performance, no one sane is suggesting we go back to single core processors.



            • #56
              Originally posted by uid313 View Post

              Most ARM processors don't run at such high frequencies because they are aimed at mobile devices, but I guess it would be possible to make a 4 GHz ARM processor if it were designed for workstations and servers.
              Maybe...

              Originally posted by uid313 View Post
              I don't know about compiling, but aren't ARM processors really good for video decoding considering all phones and tablets that are used for video decoding with very little power usage?
              They aren't. Mobile devices have dedicated decoding blocks that assist with the process, hence the low power usage.
              But when I want to watch some format that isn't supported by the hardware, the CPU has to do the decoding, and that is where the high power usage kicks in: ~90% usage on a 2.5GHz quad-core ARM CPU to decode 1080p60 4:4:4 H.264 (for comparison, a Skylake Intel CPU at 4.0GHz only uses ~30% of a single core!).



              • #57
                Originally posted by Space Heater View Post
                You can improve single-threaded performance while increasing core count, clearly Zen 2 is an exemplar of that. Having to choose between only increasing core count and only improving single threaded performance is a classic false dichotomy.
                This is not necessarily the case on mobile devices. On desktop workstations it's OK to waste power as long as the heatsink can dissipate all the heat; on mobile devices it's much easier to shut down whole cores when they're not in use. There are also space constraints. The ARM Cortex-M and Cortex-A series have also shown that simple cores can be ridiculously small and power efficient. Sadly, the latest A7x cores aren't that efficient anymore.
                Yes, thread (singular) performance is still important for general purpose workloads. When people say single-threaded performance they are talking about per-thread performance, no one sane is suggesting we go back to single core processors.
                They claim that most workloads don't scale, so it's better to compute them using just one core and traditional programming methods. There are examples of that in this thread as well. They're not suggesting a switch to single-core CPUs, but they often advocate low-core-count CPUs where all the R&D is spent on making a single core fast in turbo mode. For example, one of the fastest Intel Core i7s (the 8086K) runs at 5.0 GHz, but you only get 6 cores. I'm pretty sure that outside the domain of hardcore FPS gaming, a 16-core Zen 2 Threadripper hands down beats that 5.0GHz chip.



                • #58
                  Originally posted by sandy8925 View Post

                  Actually, it is. When you have multiple cores/processors, you're actually running things in parallel, not just providing the appearance of running things in parallel. It does make a big difference as far as responsiveness goes.
                  Which still has nothing whatsoever to do with "high single core performance" in the context I was speaking of.
                  Last edited by F.Ultra; 25 October 2019, 08:23 PM.



                  • #59
                    Originally posted by caligula View Post
                    This is not necessarily the case on mobile devices. On desktop workstations it's OK to waste power as long as the heatsink can dissipate all the heat; on mobile devices it's much easier to shut down whole cores when they're not in use. There are also space constraints. The ARM Cortex-M and Cortex-A series have also shown that simple cores can be ridiculously small and power efficient. Sadly, the latest A7x cores aren't that efficient anymore.

                    They claim that most workloads don't scale, so it's better to compute them using just one core and traditional programming methods. There are examples of that in this thread as well. They're not suggesting a switch to single-core CPUs, but they often advocate low-core-count CPUs where all the R&D is spent on making a single core fast in turbo mode. For example, one of the fastest Intel Core i7s (the 8086K) runs at 5.0 GHz, but you only get 6 cores. I'm pretty sure that outside the domain of hardcore FPS gaming, a 16-core Zen 2 Threadripper hands down beats that 5.0GHz chip.
                    That 5GHz chip will beat the Zen 2 in far more domains than just "hardcore FPS gaming". It all depends on what you measure and what your load is. E.g. we can take a simple HTTP server: the 5GHz chip will provide far lower latency and higher throughput per connection until you get enough simultaneous connections that the higher core count of the Zen 2 starts to pay off.

                    So if your number of connections (and we also have to consider the length of each connection here) is below that threshold, the 5GHz chip is much faster; if you are beyond it, the Zen 2 is much faster. So even with the same type of software, it all depends on the load and the nature of the load.
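                    A back-of-the-envelope sketch of that threshold (my own toy model, not a benchmark; the clock speeds, core counts and the one-connection-per-core assumption are all illustrative):

                    // Toy model: each connection occupies one core, and a core's request
                    // rate is proportional to its clock. Real servers are far messier.
                    #include <algorithm>
                    #include <cstdio>

                    struct Chip { const char* name; int cores; double ghz; };

                    static double throughput(const Chip& c, int connections) {
                        return std::min(connections, c.cores) * c.ghz;
                    }

                    int main() {
                        Chip fast_few  {"6 cores @ 5.0 GHz",  6, 5.0};
                        Chip slow_many {"16 cores @ 4.0 GHz", 16, 4.0};

                        for (int conns : {1, 4, 6, 8, 16, 32}) {
                            double a = throughput(fast_few, conns);
                            double b = throughput(slow_many, conns);
                            std::printf("%2d connections: %.1f vs %.1f -> %s wins\n",
                                        conns, a, b,
                                        a >= b ? fast_few.name : slow_many.name);
                        }
                        return 0;
                    }

                    With these made-up numbers the high-clock chip wins up to its 6 busy cores and the crossover lands at around 8 simultaneous connections; below it, per-connection latency favours the 5GHz chip, above it, the extra cores dominate.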



                    • #60
                      Originally posted by Space Heater View Post
                      Single-thread performance still matters in 2019; it's unfortunately not the case that every task can be parallelized to the nth degree.
                      I think he's just trolling; nobody could be this dumb twice.

