An Introduction To Intel's Tremont Microarchitecture


  • #61
    Originally posted by c117152 View Post

    6-wide x86 instruction decode
    * Dual 3-wide clusters, out of order
    * Wide decode without the area of a uop cache
    * Optional single cluster mode based on product targets

    The clusters and the removal of the micro-op cache seem ARM-like, real-time oriented. Then the optional single-cluster mode... I'm just not sure what product target they have in mind. Robotics maybe? Weird.
    I think you should view Tremont as an update to the Atom microarch, not to the Core branch.
    Atom has several embedded uses, and most embedded products don't technically need multicore; it doesn't have to be about RTOS tasks.
    So a sharp single core is what some customers actually want.
    A lot of code in the industry is just rubbish: code that won't do well with multicore solutions, and it's not about to change.



    • #62
      Originally posted by milkylainen View Post

      I think you should view Tremont as an update to the Atom microarch, not to the Core branch.
      Atom has several embedded uses, and most embedded products don't technically need multicore; it doesn't have to be about RTOS tasks.
      So a sharp single core is what some customers actually want.
      A lot of code in the industry is just rubbish: code that won't do well with multicore solutions, and it's not about to change.
      You're thinking purely in terms of compute workloads, but GUIs and most interactive systems work differently. Your cores don't need to be the fastest possible, they just need to be pretty fast, and many mobile CPUs are. No matter how fast your single-core performance is, some workload will be too heavy and will cause stutter and lack of responsiveness if it runs on the same threads that process input and issue draw commands. You can significantly boost responsiveness just by offloading that work to a different thread. When you actually have multiple cores, the work really does run in parallel, so you can load stuff from the network/disk while still being very responsive to the user. Add to that the multiple apps running on a device, plus multiple OS/framework processes and threads, and the system can still be super responsive.
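
      As a minimal sketch of what "offloading the work to a different thread" looks like (illustrative only, not from the original post; heavy_work() and poll_input() are hypothetical stand-ins), in C with POSIX threads, built with something like gcc -pthread:

      /* Hand heavy work to a worker thread so the loop that handles input
         and drawing never blocks on it. */
      #include <pthread.h>
      #include <stdbool.h>
      #include <stdio.h>
      #include <unistd.h>

      static void heavy_work(void) { sleep(2); }      /* e.g. decode, load from disk */
      static bool poll_input(void) { return true; }   /* pretend input keeps arriving */

      static void *worker(void *arg)
      {
          (void)arg;
          heavy_work();                               /* runs on another core, in parallel */
          return NULL;
      }

      int main(void)
      {
          pthread_t t;
          pthread_create(&t, NULL, worker, NULL);     /* offload instead of blocking */

          for (int frame = 0; frame < 3; frame++) {   /* the "UI" loop keeps ticking */
              if (poll_input())
                  printf("frame %d: still responsive\n", frame);
              usleep(16000);                          /* roughly a 60 Hz tick */
          }
          pthread_join(t, NULL);
          return 0;
      }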



      • #63
        Originally posted by sandy8925 View Post

        You're thinking purely in terms of compute workloads, but GUIs and most interactive systems work differently. Your cores don't need to be the fastest possible, they just need to be pretty fast, and many mobile CPUs are. No matter how fast your single-core performance is, some workload will be too heavy and will cause stutter and lack of responsiveness if it runs on the same threads that process input and issue draw commands. You can significantly boost responsiveness just by offloading that work to a different thread. When you actually have multiple cores, the work really does run in parallel, so you can load stuff from the network/disk while still being very responsive to the user. Add to that the multiple apps running on a device, plus multiple OS/framework processes and threads, and the system can still be super responsive.
        I'm not sure what point you're trying to make, or whether you're replying to something else I said. Embedded RTOS, or just embedded, is in a lot of cases non-GUI. Threading contexts and draw contexts are functions of the OS and toolkits; how they work is down to (often stupid) software design, not hardware. The discussion was around single-threaded performance, and it is as important as ever. But you're right in the sense that single-context performance won't scale forever, and multiple execution contexts are a boon for a lot of workloads. Still, given a single core with the same performance and power envelope as a dual core, I'd go for the single core. All day, every day. That is what the discussion is about. You can't remedy all problems with more contexts, but we sure try, since a single core will never be as fast as multiple cores.

        Intel correctly decided that they wanted to eke out maximal performance from every core for as little power as possible with the Atom microarch.



        • #64
          Originally posted by caligula View Post

          You can cherry-pick whatever benchmarks you want. Ever looked at e.g. the Phoronix benchmarks? Many of them utilize multiple cores, and what's more important is that those tasks are the ones that matter. Of course there are millions of 100% sequential tasks, but how many of them are truly that slow even on a legacy system? Please, if you want to say something relevant, use examples that aren't already fast enough on an 80486.



          LoL, wtf are you smoking? I don't know any developer who doesn't use gcc in a multi-threaded way in 2019. It's also one of the best examples in the Phoronix Test Suite, showing close to linear scalability even on huge EPYC systems. Check the numbers, man. If you're arguing that 'make' doesn't really use threads but processes, that's 100% irrelevant. The processes are threads from the CPU's point of view. https://github.com/yrnkrn/zapcc also shows that you're totally clueless.


          No, it shows that thread performance is important, not single-threaded performance. A fast single thread might be better than multiple slow threads, but multiple fast threads are better than a fast single thread.
          Multiprocessing is not multithreading. It's not irrelevant; it's the -entire- point.
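
          As a minimal sketch of that distinction (illustrative only, not from the thread; work() is a hypothetical stand-in for a compile job), in C on POSIX: a fork()ed child gets its own copy of the address space, the way 'make -jN' spawns gcc processes, while a pthread shares it. To the scheduler both occupy a core, but the memory model differs.

          #include <pthread.h>
          #include <stdio.h>
          #include <sys/wait.h>
          #include <unistd.h>

          static int shared_counter = 0;         /* shared by threads, copied by fork() */

          static void work(void) { shared_counter++; }

          static void *thread_entry(void *arg) { (void)arg; work(); return NULL; }

          int main(void)
          {
              /* Process-based parallelism, as 'make -jN' does: the child works on
                 its own copy of memory, so its increment is invisible here. */
              pid_t pid = fork();
              if (pid == 0) { work(); _exit(0); }
              waitpid(pid, NULL, 0);
              printf("after fork:   shared_counter = %d\n", shared_counter);  /* still 0 */

              /* Thread-based parallelism: the thread shares this address space,
                 so its increment is visible after the join. */
              pthread_t t;
              pthread_create(&t, NULL, thread_entry, NULL);
              pthread_join(t, NULL);
              printf("after thread: shared_counter = %d\n", shared_counter);  /* now 1 */
              return 0;
          }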



          • #65
            Originally posted by sandy8925 View Post

            Actually, it is. When you have multiple cores/processors, you're actually running things in parallel, not just providing the appearance of running things in parallel. It does make a big difference as far as responsiveness goes.
            Yes, but not in the way you imagine. In terms of pure processing power you wouldn't notice the 300-1000 context switches per second. The problem, though, arises with all the various CPU caches, which end up being cleared or made irrelevant whenever a CPU changes task, and that causes hiccups in performance (even more so with the workarounds for Spectre, which clear many caches previously thought safe to carry over). The same can happen when you have too many tasks running in parallel on a multi-core CPU, but the more cores you have, the more pressure the machine needs to be under for this to happen, and good schedulers can minimize the effect.



            • #66
              Originally posted by duby229 View Post

              I guess it's you who doesn't know what Amdahl's law means. In actual fact it's a statement that you can't parallelize loads indefinitely.
              No, it says no such thing, though your conclusion is closer. Amdahl's law is just a way to calculate the speed-up if you only speed up a certain part of the execution (for instance, speeding up half the running time 2x gives a total speed-up of 1/0.75 = 1.33x). By adding more cores you only speed up CPU-dependent code that can be parallelized; by increasing single-thread performance you speed up anything CPU-dependent, so a larger section of the code, but by a smaller margin. Still, a lot of execution time is spent waiting on memory or disk, and Amdahl's law would still give you the diminished returns even for improvements to single-thread performance.

              But Amdahl's law says nothing about not being able to parallelize indefinitely, just that the returns are limited by the share of the execution time that you parallelize. For instance, speeding up half the execution time infinitely with infinite cores gives a total speed-up of 1/0.5 = 2x, since the rest of the execution time is unchanged.
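
              In the standard formulation (added here for reference, not part of the original post), with parallelizable fraction p sped up by a factor s:

              S = \frac{1}{(1 - p) + p/s}

              so p = 0.5, s = 2 gives S = 1/(0.5 + 0.25) \approx 1.33, and s \to \infty gives S \to 1/(1 - p) = 2.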
              Last edited by carewolf; 28 October 2019, 04:35 AM.



              • #67
                Originally posted by carewolf View Post

                No, it says no such thing, though your conclusion is closer. Amdahl's law is just a way to calculate the speed-up if you only speed up a certain part of the execution (for instance, speeding up half the running time 2x gives a total speed-up of 1/0.75 = 1.33x). By adding more cores you only speed up CPU-dependent code that can be parallelized; by increasing single-thread performance you speed up anything CPU-dependent, so a larger section of the code, but by a smaller margin. Still, a lot of execution time is spent waiting on memory or disk, and Amdahl's law would still give you the diminished returns even for improvements to single-thread performance.

                But Amdahl's law says nothing about not being able to parallelize indefinitely, just that the returns are limited by the share of the execution time that you parallelize. For instance, speeding up half the execution time infinitely with infinite cores gives a total speed-up of 1/0.5 = 2x, since the rest of the execution time is unchanged.
                You're definitely wrong; although your math is right, your conclusion doesn't make sense. For example, most x86 processors are OoO superscalar pipelined architectures. Amdahl's law is the reason why none of them have more than 4 execution units per pipeline; they are internally parallelized.

                The math itself -is- the point that you can't parallelize indefinitely, although you can extract parallelism at various different scales: at the instruction level, the thread level, or the process level. Amdahl's law applies independently at each scale.



                • #68
                  Originally posted by DavidC1 View Post

                  What? No they don't. The RTX 2080 Ti has 4352 CUDA cores. The Vega VII has 3840 CUs. GFLOPS-wise they are the same. Each CU or CUDA core is capable of 2 FLOPs per cycle.
                  Except that 1 AMD compute unit is not equivalent to 1 nVidia CUDA core. At the execution-unit level, AMD's architecture has a lot more than nVidia's.

                  AMD's architecture favors compute capacity over compute efficiency, while nVidia's favors compute efficiency over compute capacity. One CUDA core is essentially a complete processor while one CU is not...
                  Last edited by duby229; 29 October 2019, 06:50 PM.



                  • #69
                    Originally posted by duby229 View Post

                    You're definitely wrong; although your math is right, your conclusion doesn't make sense. For example, most x86 processors are OoO superscalar pipelined architectures. Amdahl's law is the reason why none of them have more than 4 execution units per pipeline; they are internally parallelized.
                    No it isn't. The reason is that there are very rarely situations where you can reliably look far enough into the future to execute more in parallel, and power efficiency is more important now. Besides, you can evaluate 8 32-bit integer operations in parallel with a single AVX2 instruction, and you can execute two AVX2 instructions in parallel on most Intel CPUs, so 16 operations in parallel is quite common; and when you offload code to the GPU you move to 100s if not 1000s of operations in parallel, which is quite worthwhile, as you'll know if you have ever seen benchmarks of hardware-accelerated "whatever".
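
                    As a minimal sketch of the "8 32-bit operations per AVX2 instruction" point (illustrative only, not from the post; needs an AVX2-capable CPU and something like gcc -mavx2):

                    #include <immintrin.h>
                    #include <stdio.h>

                    int main(void)
                    {
                        int a[8] = {1, 2, 3, 4, 5, 6, 7, 8};
                        int b[8] = {10, 20, 30, 40, 50, 60, 70, 80};
                        int c[8];

                        __m256i va = _mm256_loadu_si256((const __m256i *)a);
                        __m256i vb = _mm256_loadu_si256((const __m256i *)b);
                        __m256i vc = _mm256_add_epi32(va, vb);   /* eight 32-bit adds in one instruction */
                        _mm256_storeu_si256((__m256i *)c, vc);

                        for (int i = 0; i < 8; i++)
                            printf("%d ", c[i]);                 /* prints 11 22 33 44 55 66 77 88 */
                        printf("\n");
                        return 0;
                    }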



                    • #70
                      Originally posted by carewolf View Post
                      No it isn't. The reason is that there are very rarely situations where you can reliably look far enough into the future to execute more in parallel, and power efficiency is more important now. Besides, you can evaluate 8 32-bit integer operations in parallel with a single AVX2 instruction, and you can execute two AVX2 instructions in parallel on most Intel CPUs, so 16 operations in parallel is quite common; and when you offload code to the GPU you move to 100s if not 1000s of operations in parallel, which is quite worthwhile, as you'll know if you have ever seen benchmarks of hardware-accelerated "whatever".
                      You're not making any sense. AVX2 is an instruction set extension whose instructions are specifically designed for scenarios where data-oriented processing is important, something x86 in general isn't well suited for... You can't just accelerate "whatever"... you can only accelerate specifically designed workloads that -can- be data oriented. Most workloads can't be, exactly -because- of Amdahl's law.

                      EDIT: I have to ask, you do know what vectors and scalars are, right? You do know the differences in design choices between scalar and vector pipelines, right? Integer units and floating point units and memory management units, right? If you don't, then this conversation is a dead end.

                      Because your responses to mine make it seem like you don't understand how ISA instructions are processed in different types of execution units depending on the type of data being processed. Bottom line: bit depth for scalar pipelines, or bit precision for vector pipelines, is not the same thing as execution parallelism. Nor is bit banging or bit trunking the same thing as execution parallelism.
                      Last edited by duby229; 31 October 2019, 12:09 AM.

