An Introduction To Intel's Tremont Microarchitecture

  • #51
    Originally posted by starshipeleven View Post
    Responsiveness is a matter of effective process scheduling. Also note that on most multicore systems you are still running far more processes than you have cores, so there is still a BIG component of process scheduling and the "appearance of running in parallel".
    Just some notes:

    A naturally parallel task is for example 2D rendering, assuming the display device (such as a 4K monitor) is incapable of on-the-fly/realtime data decompression of 1D lines or 2D rectangles.

    Efficient (that is: incremental) compilation of C/C++ code is not a naturally parallel task.
    Last edited by atomsymbol; 10-25-2019, 10:46 AM. Reason: Add the word "efficient"

    Comment


    • #52
      Originally posted by archsway View Post

      It turns out, that Android is extremely bloated with far too many background processes doing nothing useful and needs many cores to be responsive - who knew?



      That's called the RT patchset.



      Have you heard of zram?



      "That chip over there is the OTG USB controller (with buggy drivers that cause kernel panics), and right next to it is what we call the 'Ok' chip. We plan to put an 'Alexa' chip on next year's version."



      Because even the app launcher uses 300MB of RAM.



      But obviously only the ones not sending tracking data to Google.


      Idle processes != bloat. The laptop on which I'm typing this has 239 processes - most of the CPU usage is probably from Chromium here.

      Also, it's not that Android itself is bloated - the problem is apps that love running code in the background all the time for no reason, due to really bad code. (I know because I'm an Android app developer - a lot of apps are pretty badly written.) Unfortunately, Android only started clamping down on this with Android 8.0, so before that, as long as there was enough memory available, apps ran rogue in the background, using up CPU time that should have gone to the foreground apps. Of course, Android does share some blame for not reducing the process and I/O priority of background work.

      The RT patchset isn't really that important here - it's only in the context of processing touch input or phone calls that realtime is relevant.

      Yes, ZRAM is nice, but if apps keep asking for more and more memory, ZRAM just won't be enough at that point. And no matter how fast and fancy you get with storage, it just isn't fast enough. Android and iOS don't use swap, and will kill background apps, which is just a sensible approach. Also, there's no separate "Ok Google" chip and "Alexa" chip. They'd all use the same chip for whatever hotword needs to be detected.

      "Because even the app launcher uses 300MB of RAM." Where did you get that figure from? And what do you mean by "the app launcher"? Google stopped caring about the AOSP launcher long ago; they just ship their Google/Pixel launcher on their own devices (and with select partners). Others, like Samsung and OnePlus, make their own launchers. And yeah, Samsung's launchers suck (just like the rest of their crap custom changes to Android).

      Hahaha - as if there are no buggy drivers or software on desktop Linux, on servers, and many other critical devices that we depend on. Yeah, the drivers suck, they're mostly closed source, and we can't do shit about them. Which is why open source drivers and freely available hardware documentation are important.

      Comment


      • #53
        Originally posted by sandy8925 View Post
        Realtime isn't actually useful there, except potentially for phone calls, and other types of audio and video calls. It's not some kind of nuclear reactor safety and monitoring usecase.
        Realtime is not safety.

        Safety is reliability, and for that you need realtime AND something else. While you can't have a non-realtime system that is certified as "safe", being realtime does not automatically make the system "safe".

        Realtime per se is just a more extreme form of process scheduling, where the CPU receives hardware interrupts telling it that some input signals must be processed RIGHT NOW, and the OS blocks execution of everything else to do what it should do with them.
        Windows claims to be realtime.

        Soft realtime in Linux is again process scheduling (https://people.mpi-sws.org/~bbb/pape...s/ospert13.pdf), and it is commonly used to increase user responsiveness or to run the JACK audio server with minimal jitter.
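        As a small illustration of "soft realtime is just scheduling": on Linux the realtime policies are selected through the same scheduler API as normal time-sharing. A minimal Python sketch (Linux-only; the priority value 10 is an arbitrary choice for illustration, and switching policies is the same thing `chrt` does):

        ```python
        import os

        # Map the Linux scheduling policies exposed by the os module to names.
        POLICIES = {
            os.SCHED_OTHER: "SCHED_OTHER (normal time-sharing)",
            os.SCHED_FIFO: "SCHED_FIFO (soft-realtime, fixed priority)",
            os.SCHED_RR: "SCHED_RR (soft-realtime, round-robin)",
        }

        policy = os.sched_getscheduler(0)  # 0 = the calling process
        print("current policy:", POLICIES.get(policy, str(policy)))

        # Requesting a realtime policy is the same API call, but it needs
        # CAP_SYS_NICE (usually root) - there is no separate "realtime OS" mode.
        try:
            os.sched_setscheduler(0, os.SCHED_FIFO, os.sched_param(10))
            print("now SCHED_FIFO: this process preempts all normal tasks")
        except PermissionError:
            print("not privileged: staying under the normal scheduler")
        ```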
        Last edited by starshipeleven; 10-25-2019, 11:05 AM.

        Comment


        • #54
          Originally posted by atomsymbol View Post

          Just some notes:

          A naturally parallel task is for example 2D rendering, assuming the display device (such as a 4K monitor) is incapable of on-the-fly/realtime data decompression of 1D lines or 2D rectangles.

          Efficient (that is: incremental) compilation of C/C++ code is not a naturally parallel task.
          That's correct, but it's kind of tangential to the point I was making.

          I'm saying that the average OS will have dozens of processes by default and you can run any number of applications at once (which may or may not multithread themselves). So even on multicore systems each CPU core will be running (much) more than a single process.

          Unless you are using a huge server CPU in a desktop - but that's not a very efficient use of resources.
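          The processes-per-core ratio is easy to check. A quick sketch (Linux-only, counting /proc entries; the numbers will obviously vary by system):

          ```python
          import os

          def processes_per_core():
              """Count running processes via /proc and compare to logical cores."""
              procs = [p for p in os.listdir("/proc") if p.isdigit()]  # one dir per PID
              cores = os.cpu_count() or 1
              return len(procs), cores

          if os.path.isdir("/proc"):  # Linux only
              nprocs, ncores = processes_per_core()
              print(f"{nprocs} processes on {ncores} logical cores "
                    f"(~{nprocs / ncores:.1f} per core)")
          ```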

          Comment


          • #55
            Originally posted by Space Heater View Post
            It's not a zero-sum game, we can't ignore single-threaded performance and likewise we can't ignore thread-level parallelism.
            Not true. Assume you have a fixed chip size and manufacturing process: you have a constant budget of space for logic gates and you can't exceed it. Which processor design is best then depends on the task at hand, but I'm pretty sure designs like GPUs and DSPs already show that there are more efficient ways to solve many tasks than monolithic CPUs that focus on single-thread performance. Sure, CPUs are better general-purpose platforms, but you have to make some strong assumptions about the target audience and their use cases to argue which design is best.

            By the way, the original paper by Amdahl was an argument for focusing on single-threaded performance in processor designs, and that it is essentially impossible to get linear speed increases as you increase the core count on almost all real-world workloads. Did you actually read the paper and believe I'm missing something or did you just want to misconstrue what I'm saying?
            Of course sequential performance is important, since it also translates into faster multi-core processing. What you're missing is that multi-threading doesn't need to scale without overhead to win: adding 100% more processing units produces less than 100% more effective throughput, but you don't need anywhere near perfect scaling to beat the gains Intel achieves in single-core performance. With 100% more cores you only need to come out ahead of a ~5% overall improvement, and that is perfectly doable. That's the main reason people started buying Ryzen: Intel offered its standard ~5% annual improvement and AMD offered 100% more cores, and apparently people had computational tasks that utilized those cores well enough to beat the 5%, or 5.5%, or whatever it was.

            If you look at single-thread optimizations, the gains are pretty modest: Intel CPUs typically gain only a few percent of additional processing power per generation, and a majority of the speedup can be attributed to higher frequencies. I'm not arguing that low-level optimizations enabling higher clocks are bad. I'm arguing that IPC optimizations are more expensive in terms of chip space than adding more cores and threads - and later, when you need space for cores, it's already spent on huge speculative-execution engines. It's funny that you're dismissing multi-threading when it is a much more flexible way of adding computational power than vector instructions (which, along with higher frequencies, produce much of the perceived IPC gain).
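            The "5% vs. 100% more cores" claim can be sanity-checked with Amdahl's law directly. A small sketch (the 10% parallel fraction is an arbitrary, deliberately pessimistic assumption for illustration):

            ```python
            def amdahl_speedup(parallel_fraction: float, cores: int) -> float:
                """Amdahl's law: overall speedup when only part of the work scales."""
                serial = 1.0 - parallel_fraction
                return 1.0 / (serial + parallel_fraction / cores)

            # Even a workload that is only 10% parallel beats a 5% single-core
            # bump once you double the core count:
            doubled = amdahl_speedup(0.10, 2)   # 1 / 0.95, about 1.053x
            print(f"2x cores, 10% parallel: {doubled:.3f}x vs 1.050x from +5% IPC")

            # A build-like workload (90% parallel) scales far better:
            print(f"2x cores, 90% parallel: {amdahl_speedup(0.90, 2):.3f}x")
            ```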

            Comment


            • #56
              Originally posted by duby229 View Post

              I guess it's you who doesn't know what Amdahl's law means. In actual fact it's a statement that you can't parallelize loads indefinitely. If you take almost any single-threaded x86 load and try to parallelize it, you get vastly diminishing returns after 4 threads, and 16 threads is the very most that doesn't look absurd.
              You can cherry-pick whatever benchmarks you want. Ever looked at, e.g., the Phoronix benchmarks? Many of them utilize multiple cores, and more importantly, those tasks are the ones that matter. Of course there are millions of 100% sequential tasks, but how many of them are truly slow even on a legacy system? Please, if you want to say something relevant, use examples that aren't already fast enough on an 80486.

              Amdahl's law is -exactly- why things like gcc are still single-threaded. It makes more sense to run many single threads in parallel.
              LoL, wtf are you smoking? I don't know any developer who doesn't use gcc in a multi-threaded way in 2019. It's also one of the best examples in the Phoronix Test Suite of close-to-linear scalability, even on huge EPYC systems. Check the numbers, man. If you're arguing that 'make' uses processes rather than threads, that's 100% irrelevant - processes are threads from the CPU's point of view. https://github.com/yrnkrn/zapcc also shows that you're totally clueless.

              Amdahl's law -is- a good reason why single-threaded performance is so important
              No, it shows that per-thread performance is important, not single-threaded performance. A fast single thread might be better than multiple slow threads, but multiple fast threads are better than a single fast thread.

              Comment


              • #57
                Originally posted by atomsymbol View Post

                Efficient (that is: incremental) compilation of C/C++ code is not a naturally parallel task.
                Compilation consists of multiple phases, some of which are embarrassingly parallel (the best class of parallelism). For example, if your changeset contains two independent translation units, both can be lexed and parsed fully in parallel. Assuming they don't depend on each other, only on a previously known set of dependencies, you can check syntax and semantics in parallel and resolve references in parallel. Look at what zapcc does. Assuming you don't do LTO and have proper modules, you can carry on with further synthesis stages and even full code generation in parallel. The only phase that isn't parallel is linking. So this is a lousy example. It's true, though, that most modern compilers don't exploit parallelism here - they don't even do real incremental compilation like zapcc does.
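                The phase structure above is easy to sketch: compile independent translation units in parallel, then link serially - essentially what `make -jN` drives gcc to do. A minimal, hypothetical driver (the `src/` layout, gcc flags, and output name `app` are assumptions for illustration):

                ```python
                import concurrent.futures
                import pathlib
                import shutil
                import subprocess

                def compile_one(src: pathlib.Path) -> pathlib.Path:
                    """Compile one translation unit - independent units never contend."""
                    obj = src.with_suffix(".o")
                    subprocess.run(["gcc", "-c", str(src), "-o", str(obj)], check=True)
                    return obj

                def build(src_dir: str = "src", out: str = "app") -> list:
                    sources = sorted(pathlib.Path(src_dir).glob("*.c"))
                    # Embarrassingly parallel phase: one compile job per unit.
                    with concurrent.futures.ThreadPoolExecutor() as pool:
                        objects = list(pool.map(compile_one, sources))
                    # The link is the one inherently serial phase.
                    if objects:
                        subprocess.run(["gcc", *map(str, objects), "-o", out],
                                       check=True)
                    return objects

                if shutil.which("gcc"):
                    build()
                ```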

                Comment


                • #58
                  Originally posted by caligula View Post
                  Not true. Assume you have a fixed chip size and manufacturing process: you have a constant budget of space for logic gates and you can't exceed it. Which processor design is best then depends on the task at hand, but I'm pretty sure designs like GPUs and DSPs already show that there are more efficient ways to solve many tasks than monolithic CPUs that focus on single-thread performance. Sure, CPUs are better general-purpose platforms, but you have to make some strong assumptions about the target audience and their use cases to argue which design is best.
                  You can improve single-threaded performance while increasing core count, clearly Zen 2 is an exemplar of that. Having to choose between only increasing core count and only improving single threaded performance is a classic false dichotomy.

                  For a concrete example, imagine you improve the branch predictor by moving from a perceptron predictor to TAGE (or vice versa). That improves single-threaded performance, but it does not prevent increasing the core count, nor is it a significant drain on the transistor budget.

                  Originally posted by caligula View Post
                  Of course the sequential performance is important since it also translates into faster multi-core processing. The fact you're missing is that you're claiming that multi-threading sucks because it won't scale without any overhead.
                  But I don't think multi-core processing is bad, and I never said it was. All along I have been saying that we shouldn't totally ignore single-threaded performance, as the comment I originally quoted strongly implied.

                  I'm not sure why you continue to claim that I think multi-threading is bad, I don't think that at all. The rest of your argument is against a fictitious stance.

                  Originally posted by caligula View Post
                  No, it shows that thread performance is important, not single threaded performance.
                  Yes, thread (singular) performance is still important for general purpose workloads. When people say single-threaded performance they are talking about per-thread performance, no one sane is suggesting we go back to single core processors.

                  Comment


                  • #59
                    Originally posted by uid313 View Post

                    Most ARM processors are not at so high frequencies because they are aimed at mobile devices, but I guess it would be possible to make a 4 GHz ARM processor if it was designed for workstations and servers.
                    Maybe...

                    Originally posted by uid313 View Post
                    I don't know about compiling, but aren't ARM processors really good for video decoding considering all phones and tablets that are used for video decoding with very little power usage?
                    They aren't. Mobile devices have dedicated decoding blocks that assist with the process; hence the low power usage.
                    But when I want to watch some format the hardware doesn't support, the CPU must do the decoding, and that's where the high power usage kicks in: ~90% usage on a 2.5GHz quad-core ARM CPU to decode 1080p60 4:4:4 H.264, whereas for comparison a Skylake Intel CPU at 4.0GHz only uses ~30% of a single core.

                    Comment


                    • #60
                      Originally posted by Space Heater View Post
                      You can improve single-threaded performance while increasing core count, clearly Zen 2 is an exemplar of that. Having to choose between only increasing core count and only improving single threaded performance is a classic false dichotomy.
                      This is not necessarily the case on mobile devices. On desktop workstations it's OK to spend power as long as the heatsink can dissipate the heat; on mobile devices it's much easier to shut down whole cores when they're not in use, and there are also space constraints. The ARM Cortex-M and Cortex-A series have shown that simple cores can be ridiculously small and power efficient. Sadly, the latest A7x cores aren't that efficient anymore.
                      Yes, thread (singular) performance is still important for general purpose workloads. When people say single-threaded performance they are talking about per-thread performance, no one sane is suggesting we go back to single core processors.
                      They claim that most workloads don't scale, so it's better to compute them using just one core and traditional programming methods - there are examples in this very thread. They're not suggesting a switch to single-core CPUs, but they often advocate low-core-count CPUs where all the R&D is spent on making a single core fast in turbo mode. For example, one of the fastest Intel Core i7s (the 8086K) runs at 5.0 GHz, but you only get 6 cores. I'm pretty sure that outside the domain of hardcore FPS gaming, a 16-core Zen 2 Threadripper hands-down beats that 5.0GHz chip.

                      Comment
