How A Raspberry Pi 4 Performs Against Intel's Latest Celeron, Pentium CPUs


  • #81
    Originally posted by schmidtbag View Post
    They're doing poorly because you're judging it for something it wasn't meant to do. As the saying goes, "judge a fish by its ability to climb a tree and it will think it's stupid". ARM isn't built to compete with desktop performance.

    Yes, I basically said that myself but in fewer words.
    IPC can matter a lot. It isn't the only thing that matters.

    I don't know enough about that to make a worthwhile comment, which is why I didn't comment on it in the first place. What I do know is that it clearly isn't a necessity for making ARM usable on a day-to-day basis, assuming, of course, you're OK with the level of performance (which plenty of people are). I feel like if it were such an obvious thing to add, they would have done so. After all, the NEON instructions, for example, aren't exactly a simple addition.

    Poorly compared to what? You have to make comparisons, and the context of the article included Intel. ARM is a hell of a lot better than most, if not all, other RISC architectures (POWER is faster, but it is a lot more power hungry too). Apple's CPU might be better, but it's not going to be cheap; this Broadcom chip offers fantastic performance[-per-watt] for the price. Any CPU can be made better if you just cram more instructions into it, but then it becomes expensive and inefficient for more basic tasks. Like I said multiple times already: you're expecting this CPU to be something it's not. It does what it was built to do very well.

    Me too. Though, kinda the point of these CPUs is they don't have a lot of instructions. My server uses A53 cores (not Broadcom) and this is all it shows for features in cpuinfo:
    half thumb fastmult vfp edsp neon vfpv3 tls vfpv4 idiva idivt lpae evtstrm aes pmull sha1 sha2 crc32
    Sure isn't much to look at, eh? But it's fast enough for what I need it to do.
    The OS and benchmarks are pretty much the same for all these processors, and they work just fine. That in itself is a huge accomplishment (a collective accomplishment, if you wish). Your analogy is faulty, because they can really perform all the same tasks.

    In this day and age IPC matters a lot more than you think. Clock speeds are limited upwards by power consumption, and the differences between instruction sets are smaller than you'd think (and yes, that includes the venerable AMD64 instruction set).

    The problem is that IPC isn't a spec (anyone can quote a spec number), and understanding why it fluctuates so much requires understanding the microarchitecture. It's easy to find two benchmarks that behave very differently on exactly the same processor with the same configuration (a database can easily have 10 times lower IPC than a media benchmark).

    As far as instruction sets are concerned: the usefulness of additional instructions is limited by the compiler and by Amdahl's law. The bigger performance problem is still regular, generic code, not hand-coded assembly.

    Please read the following. The first is a paper by one of AMD's leading researchers; while its main message is something else, a secondary message is that load-store speculation is totally worth it:

    https://pdfs.semanticscholar.org/fae...982.1596905037

    And then textbooks:

    https://www.amazon.com/Computer-Arch...dp_ob_title_bk

    https://www.amazon.com/Modern-Proces.../dp/1478607831

    Then we can talk about microarchitecture.
    Last edited by vladpetric; 08 August 2020, 01:12 PM.



    • #82
      Originally posted by hotaru View Post

      no, that doesn't "even things out". performance is generally much better with 64-bit code. the increase in pointer size doesn't make anywhere near as much difference as doubling the number of registers.
      You're totally wasting your time trying to explain fundamental trade-offs to some of the geniuses here.
      Last edited by vladpetric; 08 August 2020, 01:09 PM.



      • #83
        Originally posted by vladpetric View Post
        The OS and benchmarks are pretty much the same for all these processors, and they work just fine. That in itself is a huge accomplishment (a collective accomplishment, if you wish). Your analogy is faulty, because they can really perform all the same tasks.
        Huh? That's not even slightly true. Whether you're comparing ARM to x86 or an RPi to Apple's ARM, there can be significant differences. There are also basic tasks where, regardless of architecture, you'll get similar results. The same can be said of my analogy: for example, an economy car and a sports car will still get you from point A to point B in the same amount of time if you're obeying speed limits.
        In this day and age IPC matters a lot more than you think. Clock speeds are limited upwards by power consumption, and the differences between instruction sets are smaller than you'd think (and yes, that includes the venerable AMD64 instruction set).
        IPC is incredibly important, but not for the target devices these Broadcom CPUs are meant for. If IPC were so unanimously important, we wouldn't be having this discussion.
        As far as instruction sets are concerned: the usefulness of additional instructions is limited by the compiler and Amdahl's law. The bigger problem with performance is still regular generic code, and not handcoded assembly.
        I completely agree.



        • #84
          Originally posted by schmidtbag View Post
          Huh? That's not even slightly true. Whether you're comparing ARM to x86 or an RPi to Apple's ARM, there can be significant differences. There are also basic tasks where, regardless of architecture, you'll get similar results. The same can be said of my analogy: for example, an economy car and a sports car will still get you from point A to point B in the same amount of time if you're obeying speed limits.
          All the benchmarks in question are compiled from the same source, with gcc, and run on the Linux OS, on both x86-64 and aarch64 (ARM 64). That's all I meant.

          Yes, there are differences, of course. But the benchmarks do the same thing. Take a 24-bit PIC microcontroller - it's really not meant for running Linux. The RPi 4? That's exactly what it's meant for.

          Well, performance depends linearly on IPC (just as it depends on total instruction count and clock frequency, but as I said earlier, those are more or less fixed). The problem is that you can easily end up with sub-1 IPC these days with a poorly designed chip. All I'm trying to say is that IPC matters a lot more than you think for these mobile chips.

          Please see the materials I recommended.



          • #85
            Originally posted by Raka555 View Post
            And the fact that 64bit pointers "waste" 1/2 the cache they are loaded in, even things out. So more or less the same performance, dependent on workload.
            Originally posted by hotaru View Post
            no, that doesn't "even things out". performance is generally much better with 64-bit code. the increase in pointer size doesn't make anywhere near as much difference as doubling the number of registers.
            Originally posted by vladpetric View Post
            You're totally wasting your time trying to explain fundamental trade-offs [cut impoliteness]
            What fundamental trade-offs do you mean exactly?

            The fact is: As the number of transistors in a CPU increases the fundamental trade-offs are shifting to something else over time. Year 1980 fundamental trade-offs aren't year 2020 fundamental trade-offs. A change in quantity sometimes does mean/cause a change in quality, and/or a change in viewpoint.

            Both arguments, (1) 64-bit pointers "waste" 50% of L1D cache and (2) the increase in pointer size doesn't make as much difference as doubling the number of registers, are valid - they are valid in different contexts/situations. (See also: https://en.wikipedia.org/wiki/Multi-...e_optimization)

            vladpetric The fact is: 32-bit register-register instructions on AMD64 CPUs run no faster than their 64-bit counterparts (with exceptions such as DIV and IDIV), because all current x86-64 CPUs are optimized for 64-bit code at the expense of 32-bit code. But just because this is true in 2020 does not mean it will be true in 2030 - so it isn't a fundamental one.

            vladpetric If you do not agree (99% or 100%) with the previous paragraph, then I am afraid our worldviews are so different that we have nothing further to talk about. If you publish a scientific article containing an idea that originated in this discussion, you are of course bound to cite it properly, using the real name of the person who mentioned the idea here. (You can get the real name(s) by sending a private message via the person's user page on the Phoronix forums.)



            • #86
              Originally posted by atomsymbol View Post
              What fundamental trade-offs do you mean exactly?

              The fact is: As the number of transistors in a CPU increases the fundamental trade-offs are shifting to something else over time. Year 1980 fundamental trade-offs aren't year 2020 fundamental trade-offs. A change in quantity sometimes does mean/cause a change in quality, and/or a change in viewpoint.

              Both arguments, (1) 64-bit pointers "waste" 50% of L1D cache and (2) the increase in pointer size doesn't make as much difference as doubling the number of registers, are valid - they are valid in different contexts/situations. (See also: https://en.wikipedia.org/wiki/Multi-...e_optimization)

              vladpetric The fact is: 32-bit register-register instructions on AMD64 CPUs run no faster than their 64-bit counterparts (with exceptions such as DIV and IDIV), because all current x86-64 CPUs are optimized for 64-bit code at the expense of 32-bit code. But just because this is true in 2020 does not mean it will be true in 2030 - so it isn't a fundamental one.

              vladpetric If you do not agree (99% or 100%) with the previous paragraph, then I am afraid our worldviews are so different that we have nothing further to talk about. If you publish a scientific article containing an idea that originated in this discussion you are of course bound to use a proper citation in the article using the real name of the person who mentioned the idea here. (You can get the real name(s) by posting a private message via the person's userpage on phoronix forums.)
              Well, my main comment was targeted at the dubious claim that somehow doubling the register count and cost of pointers in the L1D cache cancel each other out. No, not even close, that really means that the person who wrote that doesn't have a clue about low level optimizations.

              While I generally agree with what you're saying, pointers don't waste 50% of the L1D cache, because the L1D cache doesn't hold just pointers. If you were walking linked lists all the time, with no data attached to them, then maybe you could fill the L1D cache with pointers. But what kind of program would that be?

              Also, the relationship between effective L1 cache size and hit rate is not linear. And a good out-of-order processor hides the latency of an L1 miss/L2 hit quite well.

              As for clock speed - yes, in the '80s, bit widths affected your clock cycle considerably. But these days the main limitation on clock speed comes from dynamic power consumption (essentially C * V^2 * f). The pipelines could run at higher speeds, but the heat output would be insane.



              • #87
                Originally posted by vladpetric View Post

                Well, my main comment was targeted at the dubious claim that somehow doubling the register count and cost of pointers in the L1D cache cancel each other out. No, not even close, that really means that the person who wrote that doesn't have a clue about low level optimizations.

                While I generally agree with what you're saying, pointers don't waste 50% of the L1D cache, because you don't just have pointers in the L1D cache. If you're walking linked lists all the time, and don't have any data with them - then maybe you could fill the L1D cache with pointers. But what kind of program would be that???

                Also, the relationship between effective L1 cache size and hit rate is not linear. And a good out-of-order processor hides the latency of an L1 miss/L2 hit quite well.

                As for clock speed - yes, in the 80s widths affected your clock cycle considerably. But these days the main limitation of clock speed comes from dynamic power consumption (C * V ^ 2 * f, essentially). The pipelines could run at higher speed, but the heat production would be insane.
                Err, I don't understand why you posted this, mainly considering that the post contains zero bits of new information from my perspective.

                Originally posted by vladpetric View Post
                ... But what kind of program would be that???
                The propensity (natural tendency) to use linked lists is greater in C code than in C++ code, because implementing a type-safe array-list in C is much harder (much more repetitive) than with C++ templates. So, from a probabilistic viewpoint, the answer to your question "But what kind of program would be that???" is simple: C programs more so, C++ programs less so.

                Originally posted by vladpetric View Post
                As for clock speed - yes, in the 80s widths affected your clock cycle considerably. But these days the main limitation of clock speed comes from dynamic power consumption (C * V ^ 2 * f, essentially). The pipelines could run at higher speed, but the heat production would be insane.
                In the distant future (i.e: after our deaths), FPGAs will be moving matter (atoms and molecules) when the FPGA is told to transform into a particular configuration.
                Last edited by atomsymbol; 08 August 2020, 03:32 PM. Reason: Fix grammar



                • #88
                  Ran the tests on Raspberry Pi OS 64-bit using my Pi 4 4GB (which is in a Flirc case):

                  https://openbenchmarking.org/result/...NE-2007316NE53

                  For some reason, some tests wouldn't run; I haven't looked into why. The results are a bit all over the place...
                  Last edited by Brunnis; 08 August 2020, 03:37 PM.



                  • #89
                    Originally posted by atomsymbol View Post

                    Err, I don't understand why you posted this, mainly considering that the post contains zero bits of new information from my perspective.



                    The propensity (natural tendency) to use linked lists in C code is larger than in C++ code, because implementing a type-safe array-list in C code is much harder (much more repetitive) than with C++ templates. So, the answer to your question "But what kind of program would be that???" from a probabilistic viewpoint is simple: C programs more so, C++ programs less so.



                    In the distant future (i.e: after our deaths), FPGAs will be moving matter (atoms and molecules) when the FPGA is told to transform into a particular configuration.
                    I don't know what you know ...

                    Yes, you're right about C and I do almost all my work with highly optimized C++.

                    Well, there's little point in arguing if essentially we're agreeing.



                    • #90
                      This beating on the Raspberry Pi never gets old, and it brings in all the Intel fanboys and Raspberry Pi haters.
                      For the price, what you get is more than great. I run various Raspberry Pi 1, 2, 3, 4, and mini boards as file servers, SIP servers, web servers, and Docker hosts, some with well over 200 days of uptime, and they work reliably. Of course, they are for smaller sites and applications - you can't compare them to high-end servers.

                      Where I find this platform fails is desktop applications. Now that the RPi has 4 GB and 8 GB models it's a bit better, because a base XFCE desktop eats that 1 GB of RAM up, and if you load Firefox or Chrome next to it, it starts swapping, which is just a killer when your storage is MMC.

                      Where the Pi also fails is today's web applications - YouTube, stock trackers, crypto exchange apps - which put a heavy JavaScript load on the client. Even just keeping a bunch of sites open will OOM-kill your browser. No, the Pi is not usable for today's web, and that's a pity, since the RPi 4 has two HDMI outputs and would be great for displaying pages in kiosk mode.

                      The XU4, the Orange Pis, and other such products, even though some are faster than the Pi, have bad support and bad OS images. They are nowhere near the Pi: some models, like the R1, constantly segfault/freeze with every kernel, the boards have no BSD support at all, the Wi-Fi is broken on them, and so on. They're made by small Chinese companies whose user base and support are nowhere near the Pi's; I would remove them from this discussion entirely. For those who are into these mini PCs, these boards count as "stay away" junk.

