An Introduction To Intel's Tremont Microarchitecture


  • #61
    Originally posted by caligula View Post
    Of course sequential performance is important, since it also translates into faster multi-core processing. What you're missing is that you're claiming multi-threading sucks because it won't scale without overhead; that is, adding 100% more processing units produces less than 100% more actual processing power. But you don't even need to scale that well to exceed the improvements Intel achieves with faster single-core performance. You only need to achieve something like a 5% improvement with 100% more cores, which is perfectly doable. That's the main reason people started buying Ryzen: Intel offered its standard 5% annual improvements and AMD offered 100% more cores. Apparently people had computational tasks that were able to utilize the extra cores well enough to beat the 5%, maybe 5.5% or more.
    Note that Ryzen 3000 has about 20-30% better performance in CPU-bound games than Ryzen 1000/2000 at the same core count. In Middle-earth: Shadow of Mordor (Linux port, Very High in-game quality setting, RX 570), the average FPS of the in-game benchmark improves by 32% when upgrading from Ryzen 1000 to Ryzen 3000. After accounting for the clock-speed increase, Ryzen's IPC in these cases improved by about 10-15%.

    Originally posted by caligula View Post
    If you look at the single-thread optimizations, they are pretty modest. Intel CPUs typically gain only a few percent of additional processing power per generation, and a majority of the speedup can be attributed to higher frequencies. I'm not arguing that low-level optimizations enabling higher clocks are bad. I'm arguing that the IPC optimizations are more expensive in terms of chip space than adding more cores and threads. Later, if you happen to need space for cores, it's already spent on huge engines for speculative execution. It's funny that you're dismissing multi-threading when multi-threading is a much more flexible way of adding computational power than vector instructions (which, along with higher frequencies, largely produce the perceived IPC improvements).
    In my experience, Ryzen 3000 has much higher IPC than Ryzen 1000 in a number of cases. For example, switching between tabs in the NetBeans IDE (a Java application) is noticeably faster. In another case, running an application test suite used to take 36 seconds and now takes about 18; my quick estimate is that IPC improved by about 20-25% in this particular instance.
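    As a sketch of the arithmetic behind IPC estimates like the ones above: performance scales as IPC times clock, so dividing out the clock increase leaves the IPC gain. The 1.32 performance ratio is from the post; the clock figures below are placeholder assumptions, not numbers measured on the systems in question.

```python
# Performance = IPC x clock, so the IPC gain is whatever remains after
# dividing out the clock increase. The 1.32 perf ratio is from the post;
# the clock figures are placeholder assumptions, not measured values.

perf_ratio = 1.32        # Ryzen 3000 vs Ryzen 1000 average FPS
clock_ratio = 4.3 / 3.7  # assumed effective clocks (hypothetical)

ipc_gain = perf_ratio / clock_ratio - 1.0
print(f"IPC gain ~ {ipc_gain:.1%}")   # lands in the 10-15% range
```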



    • #62
      Originally posted by sandy8925 View Post

      Actually, it is. When you have multiple cores/processors, you're actually running things in parallel, not just providing the appearance of running things in parallel. It makes a big difference as far as responsiveness goes.
      Which still has nothing whatsoever to do with "high single core performance" in the context I was speaking of.
      Last edited by F.Ultra; 10-25-2019, 08:23 PM.



      • #63
        Originally posted by caligula View Post
        This is not necessarily the case on mobile devices. On desktop workstations it's OK to waste power as long as the heatsink can dissipate all the heat. On mobile devices it's much easier to shut down whole cores when they're not in use, and there are also space constraints. The ARM Cortex-M and Cortex-A series have also shown that simple cores can be ridiculously small and power efficient. Sadly, the latest A7x cores aren't that efficient anymore.

        They claim that most workloads don't scale, so it's better to compute them using just one core and traditional programming methods. I found examples in this thread as well. They're not suggesting a switch to single-core CPUs, but they often advocate low-core-count CPUs where all the R&D is spent on making a single core fast in a turbo mode. For example, one of the fastest Intel Core i7s (the 8086K) runs at 5.0 GHz, but you only get 6 cores. I'm pretty sure that outside the domain of hardcore FPS gaming, a 16-core Zen 2 Threadripper beats that 5.0 GHz chip hands down.
        That 5 GHz chip will beat the Zen 2 in far more domains than just "hardcore FPS gaming". It all depends on what you measure and what your load is. Take a simple HTTP server: the 5 GHz chip will provide far lower latency and higher throughput per connection until you get enough simultaneous connections that the higher core count of the Zen 2 starts to pay off.

        So if your number of connections (and we also have to consider the length of each connection) is below that threshold, the 5 GHz chip is much faster; if you are beyond it, the Zen 2 is much faster. So even with the same type of software, it all depends on the load and the nature of the load.
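        The threshold argument above can be sketched with a toy model. All the core counts and per-request times below are made-up illustrative numbers, not benchmarks of the actual chips: per-request latency favours the faster cores, while aggregate throughput favours the many-core part once enough connections arrive.

```python
# Toy numbers: fewer-but-faster cores vs more-but-slower cores. None of
# these figures are measured; they only illustrate the trade-off.

fast_cores, fast_ms_per_req = 6, 1.0    # hypothetical 5 GHz part
many_cores, many_ms_per_req = 16, 1.4   # hypothetical 16-core part

def max_throughput(cores, ms_per_req):
    """Requests per second with every core busy."""
    return cores * 1000.0 / ms_per_req

# With few concurrent connections, latency (ms_per_req) decides and the
# fast part answers each request sooner. At saturation, throughput decides:
print(max_throughput(fast_cores, fast_ms_per_req))   # 6000.0 req/s
print(max_throughput(many_cores, many_ms_per_req))   # ~11428.6 req/s
```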



        • #64
          Originally posted by Space Heater View Post
          Single-thread performance still matters in 2019; it's unfortunately not the case that every task can be parallelized to the nth degree.
          I think he's just trolling; nobody could be this dumb twice.



          • #65
            Originally posted by c117152 View Post

            6-wide x86 instruction decode:
            * Dual 3-wide out-of-order clusters
            * Wide decode without the area of a uop cache
            * Optional single-cluster mode based on product targets

            The clusters and the removal of the micro-op cache seem ARM-like, real-time oriented. Then there's the optional single-cluster mode... I'm just not sure what product target they have in mind. Robotics maybe? Weird.
            I think you should view Tremont as an update to the Atom microarch, not to the Core branch.
            Atom has several embedded uses, and most embedded products don't technically need multicore. It doesn't have to be RTOS tasks.
            So a sharp single core is what some customers actually want.
            A lot of code in the industry is just rubbish: code that won't do well with multicore solutions, and that's not looking to change.



            • #66
              Originally posted by milkylainen View Post

              I think you should view Tremont as an update to the Atom microarch, not to the Core branch.
              Atom has several embedded uses, and most embedded products don't technically need multicore. It doesn't have to be RTOS tasks.
              So a sharp single core is what some customers actually want.
              A lot of code in the industry is just rubbish: code that won't do well with multicore solutions, and that's not looking to change.
              You're thinking purely in terms of compute workloads, but GUIs and most interactive systems work differently. Your cores don't need to be the fastest possible, they just need to be pretty fast, and many mobile CPUs are. No matter how fast your single-core performance is, some workload will be too heavy and will cause stutter and lack of responsiveness if run on the same threads that process input and issue draw commands. You can significantly boost responsiveness just by offloading that work to a different thread, and it makes a huge difference. When you actually have multiple cores, work really does run in parallel, so you can load data from the network or disk while still being very responsive to the user. Add to that the multiple apps running on a device, and the multiple OS/framework processes and threads, and the system can still be super responsive.
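              A minimal sketch of the offloading pattern described above, using Python threads: the main ("UI") loop keeps drawing frames while a worker thread does the slow loading. The names (heavy_work, results) are made up for the sketch.

```python
# The "UI" thread keeps handling frames while a worker loads data in the
# background instead of blocking on the load.
import threading
import time

results = []

def heavy_work():
    time.sleep(0.2)          # stand-in for a slow network/disk load
    results.append("data loaded")

worker = threading.Thread(target=heavy_work)
worker.start()

# The main ("UI") thread stays responsive while the load is in flight.
for frame in range(3):
    print(f"frame {frame} drawn")
    time.sleep(0.05)

worker.join()
print(results[0])
```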



              • #67
                Originally posted by sandy8925 View Post

                You're thinking purely in terms of compute workloads, but GUIs and most interactive systems work differently. Your cores don't need to be the fastest possible, they just need to be pretty fast, and many mobile CPUs are. No matter how fast your single-core performance is, some workload will be too heavy and will cause stutter and lack of responsiveness if run on the same threads that process input and issue draw commands. You can significantly boost responsiveness just by offloading that work to a different thread, and it makes a huge difference. When you actually have multiple cores, work really does run in parallel, so you can load data from the network or disk while still being very responsive to the user. Add to that the multiple apps running on a device, and the multiple OS/framework processes and threads, and the system can still be super responsive.
                I'm not sure what point you're trying to make, or whether you're replying to something else I said. Embedded RTOS, or just embedded, systems are in a lot of cases non-GUI. Threading contexts and draw contexts are functions of the OS and toolkits; how they work is a matter of (often stupid) software design, not hardware. The discussion was about single-threaded performance, and it is as important as ever. But you're right in the sense that single-context performance won't scale forever, and multiple execution contexts are a boon for a lot of workloads. Still, given a single core with the same performance and power envelope as a dual core, I'd take the single core all day, every day. That is what this discussion is about. You can't remedy every problem with more contexts, but we sure try, since a single core will never be as fast as multiple cores.

                Intel correctly decided that it wanted to eke maximal performance out of every core for as little power as possible with the Atom microarch.



                • #68
                  Originally posted by caligula View Post

                  You can cherry-pick whatever benchmarks you want. Ever looked at, e.g., the Phoronix benchmarks? Many of them utilize multiple cores, and what's more important is that those tasks are the ones that matter. Of course there are millions of 100% sequential tasks, but how many of them are truly slow even on a legacy system? Please, if you want to say something relevant, use examples that aren't already fast enough on an 80486.



                  LoL, wtf are you smoking? I don't know any developer who doesn't use gcc in a multi-threaded way in 2019. It's also one of the best examples in the Phoronix Test Suite of close to linear scalability, even on huge EPYC systems. Check the numbers, man. If you're arguing that 'make' doesn't really use threads but processes, that's 100% irrelevant: the processes are threads from the CPU's point of view. https://github.com/yrnkrn/zapcc also shows that you're totally clueless.


                  No, it shows that threaded performance is important, not single-threaded performance. A fast single thread might be better than multiple slow threads, but multiple fast threads are better than a fast single thread.
                  Multiprocessing is not multithreading. It's not irrelevant; it's the entire point.



                  • #69
                    Originally posted by sandy8925 View Post

                    Actually, it is. When you have multiple cores/processors, you're actually running things in parallel, not just providing the appearance of running things in parallel. It makes a big difference as far as responsiveness goes.
                    Yes, but not in the way you imagine. In pure processing power you wouldn't notice 300-1000 context switches per second. The problem arises with all the various CPU caches, which end up being flushed or made irrelevant whenever a CPU changes task; that causes hiccups in performance (even more so with the workarounds for Spectre, which clear many caches previously thought safe to carry over). The same can happen when you have too many tasks running in parallel on a multi-core CPU, but the more cores you have, the more pressure the machine needs to be under for this to happen, and good schedulers can minimize the effect.
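                    On Unix systems the context-switch counts mentioned above can be observed directly via getrusage(); a quick sketch (Unix-only, assumed environment):

```python
# Unix-only sketch: getrusage() reports voluntary context switches (the
# process yielded the CPU, e.g. by sleeping) and involuntary ones (the
# scheduler preempted it).
import resource
import time

before = resource.getrusage(resource.RUSAGE_SELF)

time.sleep(0.05)  # sleeping yields the CPU: a voluntary switch

after = resource.getrusage(resource.RUSAGE_SELF)
print("voluntary:", after.ru_nvcsw, "involuntary:", after.ru_nivcsw)
```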



                    • #70
                      Originally posted by duby229 View Post

                      I guess it's you who doesn't know what Amdahl's law means. In actual fact it's a statement that you can't parallelize loads indefinitely.
                      No, it says no such thing, though your conclusion is closer. Amdahl's law is just a way to calculate the speedup when you only speed up a certain part of the execution (for instance, speeding up half the running time 2x gives a total speedup of 1/0.75 ≈ 1.33x). By adding more cores you only speed up CPU-bound code that can be parallelized; by increasing single-thread performance you speed up anything CPU-bound, so a larger section of the code, but by a smaller margin. Still, a lot of execution time is spent waiting on memory or disk, and Amdahl's law gives you the same diminishing returns even for improvements to single-thread performance.

                      But Amdahl's law says nothing about not being able to parallelize indefinitely, just that the returns are limited by the share of the execution time that you parallelize. For instance, speeding up half the execution time infinitely (with infinite cores) gives a total speedup of 1/0.5 = 2x, since the rest of the execution time is unchanged.
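                      The formula, with both of the worked numbers from this post:

```python
# Amdahl's law: speeding up a fraction p of the execution time by a
# factor s gives an overall speedup of 1 / ((1 - p) + p / s).

def amdahl(p, s):
    """Overall speedup when fraction p of the runtime is sped up s times."""
    return 1.0 / ((1.0 - p) + p / s)

print(amdahl(0.5, 2.0))           # half the runtime made 2x faster -> 1.333...
print(amdahl(0.5, float("inf")))  # infinite cores on half the runtime -> 2.0
```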
                      Last edited by carewolf; 10-28-2019, 04:35 AM.

