How A Raspberry Pi 4 Performs Against Intel's Latest Celeron, Pentium CPUs


  • Originally posted by ldesnogu View Post
    Sorry, I was on holiday.

    I rely on Anandtech results for SPECCPU 2006/2017: https://www.anandtech.com/show/15603...ania-devices/6


    Cortex-A77 has about the performance of the Apple A11. The iPhone with the A11 was released in September 2017, while the Qualcomm Snapdragon 865 was released in early 2020. So it's more than the 1 to 1.5 years I previously claimed; it's about 2 years.
    I like this, thanks! Which processor is the A77 though?



    • Originally posted by vladpetric View Post
      I like this, thanks! which processor is the A77 though?
      The Cortex-A77 is found in the Snapdragon 865.



      • Originally posted by RussianNeuroMancer View Post
        What happened? And why haven't you requested an RMA yet?

        Also, some of your N2 benchmark results look anomalous, such as the LibreOffice-to-PDF conversion. Any idea what may have caused this?
        Well, the N2 did work fine for over a year, and it is likely my fault that it broke (it died while I was re-flashing the eMMC to the mainline kernel, and I was not being careful that day). TBH I still have a use for it with BOINC (right now it is doing some [email protected] work).

        I did retry it now, and the LibreOffice test is even worse today (I guess because it is so hot), but it is a single-threaded run, and I would guess it is maybe not optimized for aarch64 yet. The test run done on the RPi in 64-bit mode (https://openbenchmarking.org/result/...NE-2007316NE53) does not have a LibreOffice result, but my ODROID-C4 with 4x ARMv8 Cortex-A55 @ 1.91GHz has a result of 54.088 (31.154 was the original test on the N2; 38.108 on the N2 today).

        The eMMC speed looks fine, so that should not be the reason. So I guess the RPi/ARMv7 32-bit community optimized some things that have not yet gotten attention on arm64/aarch64.



        • Originally posted by starshipeleven View Post
          ARM is an architecture that is supposed to scale up into the high-performance too.
          ARM is not exactly an architecture; it's an instruction set, or the name of the company. That company also designs Cortex-Ax, a FAMILY of microarchitectures. Apple tried to use one design to rule them all but soon realized that's not viable.

          Originally posted by justwhatever View Post
          No, actually Michael tests the Raspi with the old, outdated 32-bit ARMv6 Raspbian.
          That's the official OS; 64-bit is still "EXPERIMENTAL", and we should respect that.

          The horrifying situation now is that high-performance ARM processors like the Amazon Graviton2 and Apple Ax are not for sale, which is worse than the situation with x86: yes, there are only two competitors, but at least they sell bare processors.



          • Originally posted by hotaru View Post
            Why run 32-bit on the Pi 4 instead of 64-bit? 64-bit is faster for a lot of workloads due to having more registers.
            I am curious how much of an improvement Ubuntu Server for the Pi 4 (compiled for the newer ARM instruction set and not backwards compatible with the Pi 3) would provide in these benchmarks.



            • I think it would be more interesting to compare with, say, an Atom-based SBC, if there is one in a similar price range.

              The A72 is a cheap-to-license core, as it's fairly old. The A77 and A78 are far more performant if you want an ARM-designed ARM core. Apple's cores are a step or two ahead, of course, and rumours put the A14 significantly faster still than the A13.

              The SoC in the RPi is cheap ($5 to $10 at most), and that's why it includes a cheap core and is made on a cheap process (28nm) that limits achievable clock speeds. On the other hand, it still runs at a very low TDP, and the cost is tiny.

              I have no idea when the RPi 5 will be out: maybe 2021, possibly 2022. I'd expect it to bump the core to an A75 (which should be cheap by then), and maybe jump to a better process again (the original Pi stayed on 40nm until the RPi3) if the costs work out. But they want to keep the cheapest Pi at $35, and that puts fundamental limits on what they can achieve.

              I think the Pi4 had a good showing in this article, given its circumstances.



              • Originally posted by vladpetric View Post
                Do read the paper though.
                I have read both of the suggested papers (1: RENO, 2: Memory Bypassing). A few notes of my own about the paper "Loh, Sami, Friendly: Memory Bypassing: Not Worth the Effort":
                • I don't trust this paper because it doesn't contain essential information that I would like to see in a paper about this particular topic
                • If a loop executes 100+ iterations then exactly one of the following options is true:
                  1. Either, a variable's value depends on the value (1..100+) of the loop iteration,
                  2. or, a variable's value does not depend on the value (1..100+) of the loop iteration (loop-invariant variables)
                • The address used in a particular load-xor-store instruction:
                  1. Either, the address depends on which loop iteration is being executed,
                    • The probability of memory bypassing being applicable (in real-world code) is low (but of course, one can invent contrived non-real-world examples where this probability is high)
                  2. or, the address (=X) is loop-invariant:
                    • Considering the loop as a whole:
                      1. Either, X.num_loads = 0 and X.num_stores > 0:
                        • This is an output variable
                        • No memory bypassing required
                        • Applicable optimization: discard all writes to X except the last write to X
                          • The probability of this optimization being applicable in real-world highly optimized code is very low, but not-highly-optimized code might be amenable to this optimization
                      2. or, X.num_loads > 0 and X.num_stores = 0:
                        • This is an input constant (so it isn't actually a variable, it is a constant)
                        • No memory bypassing required
                      3. or, X.num_loads > 0 and X.num_stores > 0:
                        • Memory bypassing will lead to speedups - but there will be speedup only if load(X) and store(X) are executing in the same clock
                        • A necessary precondition for load(X) and store(X) to be executable in the same clock is the following:
                Last edited by atomsymbol; 08-11-2020, 03:32 PM. Reason: Add "of my own" to avoid confusion



                • Originally posted by atomsymbol View Post

                  I have read both of the suggested papers (1: RENO, 2: Memory Bypassing). A few notes of my own about the paper "Loh, Sami, Friendly: Memory Bypassing: Not Worth the Effort": [...]
                  1. Thank you! (I mean it.)

                  2. It may take me a bit of time to fully respond, but I wanted to make a few quick points:

                  The latency of the register file is for the most part completely hidden. I.e., if you have two instructions, the first producing register A and the second consuming register A, they can issue back-to-back (no additional delay). If the first instruction takes 1 cycle to execute, the second can issue in the next cycle (the bypass networks and the OoO scheduler take care of that).

                  However, the same is not true for loads. When a load issues, it takes 2 to 4 cycles if it hits in the L1. That latency isn't hidden; an instruction depending on the load needs to wait those 2 to 4 cycles.

                  Speculative bypassing means that if you somehow know (or predict) that the value of the load is already in a specific physical register in the physical register file, you just rename the load to that physical register and don't have to execute it anymore. Essentially, you have converted a memory operation into a register read, which, as I said, has no "visible" latency. So you saved the 2 to 4 cycles it takes to access (in parallel) the L1 cache and the store queue.

                  In other words, you don't need the store and load to execute in the same cycle for speculative memory bypassing to produce a benefit.

                  The Loh et al. paper is pretty influential. I do realize that computer architecture papers are hard to read... And the fact that it's a negative paper (something is not worth doing) means that one needs to understand that "something" first.
                  Last edited by vladpetric; 08-11-2020, 04:41 PM.



                  • Originally posted by Syfer View Post
                    I am curious how much of an improvement Ubuntu Server for the Pi 4 (compiled for the newer ARM instruction set and not backwards compatible with the Pi 3) would provide in these benchmarks.
                    The Pi 3 and Pi 4 have the exact same instruction set.



                    • Originally posted by vladpetric View Post
                      In other words, you don't need the store and load to execute in the same cycle for speculative memory bypassing to produce a benefit.
                      Just a quick note:

                      Given a memory address X, store(X) executes in 1 clock because L1D cache stores are pipelined, i.e. the L1D can sustain 1 store every cycle even though each store takes 4 cycles to finish. Similarly, load(Y), where Y is different from X, takes 1 cycle to execute because the L1D can sustain 1 load every cycle even though each load takes 4 cycles to complete. So in fact (assuming one L1D load port and one L1D store port), load(Y) followed by load(Z) takes 1+4=5 cycles to finish. From this we can derive that, assuming the load pipeline is kept 100% busy, even without any memory bypassing an L1D load effectively takes just 1 cycle to execute, as long as the load address differs from the store addresses of the previous 4 store instructions.

                      So, I really think that for memory bypassing to have a real performance impact (not a 1% impact like in the paper, but 20+%), you actually need to target/find the situations where the store(X)/load(X) pair is seen by the instruction window at the same time (aka "in the same clock").

                      Originally posted by vladpetric View Post
                      The Loh et al. paper is pretty influential. I do realize that computer architecture papers are hard to read ... And the fact that it's a negative paper - something is not worth doing - means that one needs to understand that "something" first.
                      I am not so sure that the paper is about a negation, at least not a fundamental one. Given the technological constraints (max number of transistors in a single chip) and algorithmic constraints (what the state of the art is in a given year), the paper is right. The paper essentially says: "we didn't find the circumstances in which memory bypassing is required for performance to improve by 20+%, and we do not intend to search further".
                      Last edited by atomsymbol; 08-11-2020, 05:13 PM.

