
An Introduction To Intel's Tremont Microarchitecture


  • caligula
    replied
    Originally posted by atomsymbol

    Efficient (that is: incremental) compilation of C/C++ code is not a naturally parallel task.
    Compilation consists of multiple phases, and some of them are embarrassingly parallel (the best class of parallelism). For example, if your changeset contains two independent translation units, both can be lexed and parsed 100% in parallel. Assuming they don't depend on each other and only on a previously known set of dependencies, you can check syntax and semantics 100% in parallel and resolve references 100% in parallel. Look at what zapcc does. Assuming you don't do LTO and have proper modules, you can carry on with the later synthesis stages and even full code generation. The only thing that isn't 100% parallel is the linker. So it's a lousy example. But it's true that most modern compilers don't use parallelism here; they don't even do real incremental compilation the way zapcc does.
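    To make that concrete, here is a minimal sketch of driving the compiler over independent translation units in parallel (assumptions: gcc is on the PATH, and foo.c/bar.c/baz.c are just placeholder file names):

        import subprocess
        from concurrent.futures import ThreadPoolExecutor

        # Hypothetical, mutually independent translation units.
        translation_units = ["foo.c", "bar.c", "baz.c"]

        def compile_tu(tu):
            # Each gcc invocation is its own process, so this phase
            # is embarrassingly parallel.
            return subprocess.run(["gcc", "-c", tu]).returncode

        with ThreadPoolExecutor() as pool:
            results = list(pool.map(compile_tu, translation_units))

        # Linking is the sequential part: it needs every object file.
        if all(rc == 0 for rc in results):
            subprocess.run(["gcc", "foo.o", "bar.o", "baz.o", "-o", "app"])

    This is essentially what make -jN already does for you: the compiles fan out across the cores and only the link serializes.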

    Leave a comment:


  • caligula
    replied
    Originally posted by duby229 View Post

    I guess it's you who doesn't know what Amdahl's law means. In actual fact it's a statement that you can't parallelize loads indefinitely. If you take almost any single-threaded x86 load and try to parallelize it, you get vastly diminishing returns after 4 threads, and 16 threads is the very most that doesn't look retarded.
    You can cherry-pick whatever benchmarks you want. Ever looked at e.g. the Phoronix benchmarks? Many of them utilize multiple cores, and what's more important is that those tasks are the ones that matter. Of course there are millions of 100% sequential tasks, but how many of them are truly slow even on a legacy system? Please, if you want to say something relevant, use examples that aren't already fast enough on an 80486.

    Amdahl's law is -exactly- why things like gcc are still single-threaded. It makes more sense to run many single threads in parallel.
    LoL, wtf are you smoking? I don't know any developer who doesn't use gcc in a multi-threaded way in 2019. It's also one of the best examples in the Phoronix Test Suite of close-to-linear scalability even on huge EPYC systems. Check the numbers, man. If you're arguing that 'make' doesn't really use threads but processes, that's 100% irrelevant: the processes are threads from the CPU's point of view. https://github.com/yrnkrn/zapcc also shows that you're totally clueless.

    Amdahl's law -is- a good reason why single-threaded performance is so important
    No, it shows that per-thread performance is important, not single-threaded performance. A fast single thread might be better than multiple slow threads, but multiple fast threads are better than a single fast thread.
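    For reference, Amdahl's law for a workload with parallel fraction p on n cores is just

        S(n) = \frac{1}{(1 - p) + p/n}

    and the whole argument hinges on p. A worked example: with p = 0.95 you still get roughly 9x on 16 cores, so the "diminishing returns after 4 threads" claim only holds for workloads where p is small.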

    Leave a comment:


  • caligula
    replied
    Originally posted by Space Heater View Post
    It's not a zero-sum game, we can't ignore single-threaded performance and likewise we can't ignore thread-level parallelism.
    Not true. Assume you have a fixed chip size and a fixed manufacturing process: you have a constant amount of space for logic gates, and you can't exceed that limit. Which processor design is best then depends on the task at hand, but I'm pretty sure designs like GPUs and DSPs already show that there are more efficient ways to spend that area than monolithic CPUs focused on single-thread performance. Sure, CPUs are better general-purpose platforms, but you have to make some strong assumptions about the target audience and their use cases to argue which design is best.

    By the way, the original paper by Amdahl argued for focusing on single-threaded performance in processor designs, on the grounds that it is essentially impossible to get linear speedups as you increase the core count on almost all real-world workloads. Did you actually read the paper and believe I'm missing something, or did you just want to misconstrue what I'm saying?
    Of course sequential performance is important, since it also translates into faster multi-core processing. What you're missing is that you're dismissing multi-threading because it won't scale without any overhead; that is, adding 100% more processing units produces less than 100% more actual processing power. But you don't need to scale anywhere near perfectly to exceed the improvements Intel achieves with faster single-core performance. You only need something like a 5% improvement out of 100% more cores, and that's perfectly doable. That's the main reason people started buying Ryzen: Intel offered the standard ~5% annual improvement and AMD offered 100% more cores. Apparently people had computational tasks that were able to utilize those cores well enough to beat the 5%, maybe 5.5% or more.
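    To put rough numbers on that break-even point (purely illustrative, using Amdahl's formula): going from 1 core to 2 cores gives

        S(2) = \frac{1}{(1 - p) + p/2} > 1.05 \quad \Leftrightarrow \quad p \gtrsim 0.095

    so less than 10% of the workload has to be parallelizable for the doubled core count to beat a 5% single-thread bump, and real build/encode/render jobs are far above that.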

    If you look at the single-thread optimizations, they are pretty modest: Intel CPUs typically gain only a few percent of additional per-core throughput per generation, and a majority of the speedup can be attributed to higher frequencies. I'm not arguing that low-level optimizations enabling higher clocks are bad. I'm arguing that the IPC optimizations are more expensive in terms of chip area than adding more cores and threads. Later, if you happen to need area for cores, it's already spent on huge engines for speculative execution. It's funny that you're dismissing multi-threading when multi-threading is a much more flexible way of adding computational power than vector instructions (which, along with higher frequencies, produce much of the perceived IPC gain).

    Leave a comment:


  • starshipeleven
    replied
    Originally posted by atomsymbol

    Just some notes:

    A naturally parallel task is, for example, 2D rendering, assuming the display device (such as a 4K monitor) is incapable of on-the-fly/realtime data decompression of 1D lines or 2D rectangles.

    Efficient (that is: incremental) compilation of C/C++ code is not a naturally parallel task.
    That's correct, but it's kind of tangential to the point I was making.

    I'm saying that the average OS will have dozens of processes running by default, and you can run any number of applications at once (which may or may not be multithreaded themselves). So even on multicore systems, each CPU core will be running (much) more than a single process.

    Unless you are using a huge server CPU in a desktop, but that's not a very efficient use of resources.
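    A quick way to see the processes-per-core ratio on a Linux box (stdlib only; it reads /proc, so Linux-specific):

        import os

        # Logical CPUs the OS sees.
        cores = os.cpu_count()

        # Every numeric entry under /proc is a live process.
        processes = sum(1 for name in os.listdir("/proc") if name.isdigit())

        print(f"{processes} processes sharing {cores} cores "
              f"(~{processes / cores:.0f} per core)")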

    Leave a comment:


  • starshipeleven
    replied
    Originally posted by sandy8925 View Post
    Realtime isn't actually useful there, except potentially for phone calls, and other types of audio and video calls. It's not some kind of nuclear reactor safety and monitoring usecase.
    Realtime is not safety.

    Safety is reliability, and for that you need realtime AND something else. While you can't have a non-realtime system that is certified as "safe", being realtime does not automatically make the system "safe".

    Realtime per se is just a more extreme form of process scheduling: the hardware raises interrupts that tell the CPU some input signal must be processed RIGHT NOW, and the OS blocks execution of everything else to handle it.
    Windows claims to be realtime

    Soft-realtime in Linux is, again, process scheduling (https://people.mpi-sws.org/~bbb/pape...s/ospert13.pdf), and it is commonly used to increase user responsiveness or to run the JACK audio server with minimum jitter.
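    As an illustration of the "it's just scheduling" point, this is roughly all it takes to move a process into the Linux soft-realtime class (sketch; Linux-only, needs root or CAP_SYS_NICE, and priority 50 is an arbitrary choice):

        import os

        # Schedule the calling process (pid 0) under SCHED_FIFO at priority 50.
        # It now preempts every normal SCHED_OTHER task whenever it is runnable;
        # nothing about the code itself became "safe" or more correct.
        os.sched_setscheduler(0, os.SCHED_FIFO, os.sched_param(50))

    The PREEMPT_RT patchset mainly makes the kernel itself more preemptible so priorities like this are honoured with lower latency; it still isn't a safety certification.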
    Last edited by starshipeleven; 25 October 2019, 11:05 AM.

    Leave a comment:


  • Guest
    Guest replied
    Originally posted by archsway View Post

    It turns out that Android is extremely bloated with far too many background processes doing nothing useful and needs many cores to be responsive - who knew?

    That's called the RT patchset.

    Have you heard of zram?

    "That chip over there is the OTG USB controller (with buggy drivers that cause kernel panics), and right next to it is what we call the 'Ok' chip. We plan to put an 'Alexa' chip on next year's version."

    Because even the app launcher uses 300MB of RAM.

    But obviously only the ones not sending tracking data to Google.

    Idle processes != bloat. The laptop on which I'm typing this has 239 processes - most of the CPU usage is probably from Chromium here.

    Also, it's not that Android itself is bloated - it's apps that love running code in the background all the time for no reason, thanks to really bad code. (I know because I'm an Android app developer - a lot of apps are pretty badly written.) Unfortunately, Android only started clamping down on this with Android 8.0, so before that, as long as there was enough memory available, apps were running rogue in the background and using up CPU time that should have gone to the foreground apps. Of course, Android does share some blame for not reducing the process and I/O priority of background work.

    The RT patchset isn't really that important here - it's only in the context of processing touch input or phone calls that realtime is relevant.

    Yes, ZRAM is nice, but if apps keep asking for more and more memory, ZRAM just won't be enough at that point. And no matter how fast and fancy you get with storage, it just isn't fast enough. Android and iOS don't use swap, and will kill background apps, which is just a sensible approach. Also, there's no separate "Ok Google" chip and "Alexa" chip. They'd all use the same chip for whatever hotword needs to be detected.

    "Because even the app launcher uses 300MB of RAM." Where did you get that figure from? And what do you mean by "the app launcher" ? Google stopped caring about the AOSP launcher long ago, they just ship their Google/Pixel launcher on their devices (and with select partners). Others like Samsung, OnePlus make their own launchers. And yeah, Samsung launchers suck (just like the rest of their crap custom changes to Android).

    Hahaha - as if there are no buggy drivers or software on desktop Linux, on servers, and many other critical devices that we depend on. Yeah, the drivers suck, they're mostly closed source, and we can't do shit about them. Which is why open source drivers and freely available hardware documentation are important.

    Leave a comment:


  • Guest
    Guest replied
    Originally posted by starshipeleven View Post
    FYI: Android is not a real-time OS by any stretch of the imagination, and it usually does not even use the soft-realtime features from Linux kernel (so it won't just interrupt its processing when a high-priority input arrives), it's running 90% bloat, the CPU schedulers in the default firmware were written by hitting the keyboard with a fist multiple times without looking at the screen, and so on and so forth.

    Really you can't use that as a reason to "add moar cores".
    Realtime isn't actually useful there, except potentially for phone calls, and other types of audio and video calls. It's not some kind of nuclear reactor safety and monitoring usecase.

    Leave a comment:


  • starshipeleven
    replied
    Originally posted by uid313 View Post
    I don't know about compiling, but aren't ARM processors really good for video decoding considering all phones and tablets that are used for video decoding with very little power usage?
    No, they are not. Phones and tablets have dedicated decoding hardware and offload media decoding to it.

    Without hardware decoding, most ARM devices can't show more than 720p video.
    Last edited by starshipeleven; 25 October 2019, 12:07 PM.

    Leave a comment:


  • starshipeleven
    replied
    Originally posted by sandy8925 View Post
    Actually, it is. When you have multiple cores/processors, you're actually running things in parallel. Not just providing the appearance of running things in parallel. It does make a big difference as far as responsiveness.
    Responsiveness is a matter of effective process scheduling. Also note that on most multicore systems you are still running far more processes than you have cores, so there is still a BIG component of process scheduling and "appearance of running in parallel".

    Leave a comment:


  • starshipeleven
    replied
    Originally posted by Alex/AT View Post
    Fortunately, the typical number of different tasks running on a modern general-purpose CPU is more than one.
    It's 2019. DOS and the likes are way in the past.
    You are confusing multithreading with multiprogramming.

    DOS is a monoprogramming system, so it runs a SINGLE process until it has finished and releases control back to the "OS".

    This hardware is most likely going to run a multiprogramming OS of some kind, where multiple processes are allocated time slices so they can run "together" without any of them taking exclusive control of the CPU.
    Last edited by starshipeleven; 25 October 2019, 09:14 AM.

    Leave a comment:
