How A Raspberry Pi 4 Performs Against Intel's Latest Celeron, Pentium CPUs


  • #91
    Originally posted by vladpetric View Post
    While I generally agree with what you're saying, pointers don't waste 50% of the L1D cache, because you don't just have pointers in the L1D cache. If you're walking linked lists all the time and don't have any data with them, then maybe you could fill the L1D cache with pointers. But what kind of program would that be?

    Also, the relationship between effective L1 cache size and hit rate is not linear. And a good out-of-order processor hides the latency of an L1 miss/L2 hit quite well.
    On second thought, you [temporarily] forgot about hash tables larger than the L1D cache.
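
    To put rough numbers on the pointer-density point, here is a minimal C++ sketch; the node layouts are hypothetical, made up for this comment rather than taken from any benchmark:

    Code:
    #include <cstdio>

    // Hypothetical node layouts, just to make the pointer-density argument concrete.
    struct ListNode {        // singly linked list of longs
        long value;          // 8 bytes of payload
        ListNode* next;      // 8 bytes of pointer
    };

    struct ChainedEntry {    // chained hash-table entry
        long key;            // 8 bytes
        long value;          // 8 bytes
        ChainedEntry* next;  // 8 bytes of pointer
    };

    int main() {
        // ListNode is 50% pointer, ChainedEntry only ~33%. Real working sets
        // also hold arrays, strings, scalars and stack data with no pointers
        // at all, so the overall pointer share of the L1D is lower still.
        std::printf("ListNode: %zu bytes (8 of them pointer)\n", sizeof(ListNode));
        std::printf("ChainedEntry: %zu bytes (8 of them pointer)\n", sizeof(ChainedEntry));
        return 0;
    }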



    • #92
      Originally posted by atomsymbol View Post
      I believe more in the potential usefulness of JIT compilation (in a CPU) than in value prediction.
      Originally posted by vladpetric View Post
      About JITs - I attended a talk in 1997 in which Java people claimed that they were going to beat C++ speed by doing clever things with their JITs. Well, that didn't work at all ...

      In order for JITs to take over the world, I think we'd need a paradigm shift of sorts (an Einstein-like person to figure out how to make JIT compilation much much better than it is today).
      Well, except that the µOP cache is part of a small in-CPU JIT engine, and it is going to get larger in the future. There is no need to gather Einstein-like people in order to make better use of a µOP cache in future CPUs.

      Sometime in the future, the µOP cache might get so large (multiple gigabytes) that it might become beneficial to download optimized versions of commonly used binaries from trusted sources on the Internet (unless an application opts out for security reasons).
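
      As a sketch of what treating the µOP cache as a tiny in-CPU JIT could mean, consider this toy model; all names and structure are invented for illustration, and no real CPU exposes anything like it:

      Code:
      #include <cstdint>
      #include <unordered_map>
      #include <vector>

      struct MicroOp { std::uint64_t bits; };

      std::vector<MicroOp> decode(std::uint64_t pc) { return {MicroOp{pc}}; }  // stub x86 decoder
      std::vector<MicroOp> optimize(std::vector<MicroOp> u) { return u; }      // stub: would fuse/fold

      class UopCache {
          std::unordered_map<std::uint64_t, std::vector<MicroOp>> lines_;
      public:
          const std::vector<MicroOp>& fetch(std::uint64_t pc) {
              auto it = lines_.find(pc);
              if (it == lines_.end())  // decode + "JIT" work happens only on a miss
                  it = lines_.emplace(pc, optimize(decode(pc))).first;
              return it->second;
          }
      };

      int main() {
          UopCache cache;
          cache.fetch(0x401000);  // miss: decode + optimize
          cache.fetch(0x401000);  // hit: served without re-decoding
          return 0;
      }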



      • #93
        Originally posted by vladpetric View Post

        All the benchmarks in question are compiled from the same source, with gcc, and run on the Linux OS, on both x86-64 and aarch64 (ARM 64). That's all I meant.

        No, actually Michael tested the Raspberry Pi with the old, outdated 32-bit ARMv6 Raspbian. My 64-bit AArch64 Gentoo setup was up to twice as fast in some tests: https://openbenchmarking.org/result/...NE-GEGLCROPR28



        • #94
          People really don't have anything to do any more. Look no further than...

          all this time and energy wasted on commenting on nothing but a toy--what THEY call a--"computer" (with sincerest apologies to anyone and everyone who deals with real computing machines).



          • #95
            Originally posted by atomsymbol View Post
            Well, except that the µOP cache is part of a small in-CPU JIT engine, and it is going to get larger in the future. There is no need to gather Einstein-like people in order to make better use of a µOP cache in future CPUs.

            Sometime in the future, the µOP cache might get so large (multiple gigabytes) that it might become beneficial to download optimized versions of commonly used binaries from trusted sources on the Internet (unless an application opts out for security reasons).
            There are no real JIT engines in the processor at this time. As I mentioned earlier, please do read the following paper about implementing optimizations in the processor, though: https://iscaconf.org/isca2005/papers/02B-03.PDF (whether it can be called a JIT, that's a different story).

            A trace cache simply keeps a sequence of potentially discontiguous (i.e., from multiple basic blocks) micro-ops. I'm sure you know that. That's not a JIT though.
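
            For concreteness, a toy C++ model of what a single trace-cache entry holds; the field names are made up for illustration:

            Code:
            #include <cstdint>
            #include <vector>

            struct MicroOp { std::uint64_t bits; };

            // One trace-cache entry: micro-ops from several, possibly discontiguous,
            // basic blocks stitched together along one predicted path. A plain (non-trace)
            // µop-cache line, by contrast, maps one contiguous fetch window to its µops.
            struct TraceEntry {
                std::uint64_t start_pc;                // fetch address the trace is indexed by
                std::vector<std::uint64_t> block_pcs;  // the basic blocks folded into the trace
                std::vector<MicroOp> uops;             // decoded micro-ops along that path
            };

            int main() {
                TraceEntry t{0x401000, {0x401000, 0x401080}, {}};  // trace spanning two blocks
                return static_cast<int>(t.block_pcs.size());
            }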

            As for downloading such codes from the Internet ... well, you need to look at the current geopolitical situation as to why it's unlikely to happen (the Internet is fracturing, and for security reasons everyone's going to be suspicious of downloading binary blobs).



            • #96
              Originally posted by atomsymbol View Post

              On second thought, you [temporarily] forgot about hash tables larger than the L1D cache.
              A hashtable entry has a key, a value, and then optionally a pointer for linking multiple entries that collide. The ratio of pointers to non-pointers tends to be higher for linked lists than for hashtables. I was simply mentioning a linked list because those have higher pointer density, but even then you won't fill up your L1D with pointers.

              Why optionally? If you have a good hash function, it's actually better not to use linking but stepping to another slot (open addressing). I'm generally careful about the hash functions I use, and I use hashtable template classes that do stepping, not linking.
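
              A minimal linear-probing sketch of that stepping scheme, for illustration only (no resizing or deletion, a power-of-two table is assumed, and this is not any particular template class):

              Code:
              #include <cstddef>
              #include <cstdio>
              #include <vector>

              // Entry layout for open addressing: no link pointer at all, so cached
              // entries are pure key/value data.
              struct OpenEntry {
                  long key = 0;
                  long value = 0;
                  bool occupied = false;
              };

              void insert(std::vector<OpenEntry>& table, long key, long value) {
                  const std::size_t mask = table.size() - 1;
                  std::size_t i = static_cast<std::size_t>(key) & mask;
                  while (table[i].occupied && table[i].key != key)
                      i = (i + 1) & mask;  // step to the next slot instead of chaining
                  table[i] = {key, value, true};
              }

              int main() {
                  std::vector<OpenEntry> table(8);  // power-of-two capacity
                  insert(table, 42, 1);
                  insert(table, 50, 2);  // 50 & 7 == 42 & 7 == 2: collides, steps to slot 3
                  std::printf("slot 2 key=%ld, slot 3 key=%ld\n", table[2].key, table[3].key);
                  return 0;
              }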



              • #97
                Originally posted by vladpetric View Post
                There are no real JIT engines in the processor at this time. As I mentioned earlier, please do read the following paper about implementing optimizations in the processor, though: https://iscaconf.org/isca2005/papers/02B-03.PDF (whether it can be called a JIT, that's a different story).

                A trace cache simply keeps a sequence of potentially discontiguous (i.e., from multiple basic blocks) micro-ops. I'm sure you know that. That's not a JIT though.

                As for downloading such codes from the Internet ... well, you need to look at the current geopolitical situation as to why it's unlikely to happen (the Internet is fracturing, and for security reasons everyone's going to be suspicious of downloading binary blobs).
                Just some notes:

                The Skylake/Zen µOP cache isn't a trace cache in the sense of the Pentium 4's trace cache. A disadvantage of a trace cache is that strict limits have to be put on the tracing; otherwise it can end up consuming exponential amounts of bits.

                There is no public information about the instruction format Skylake/Zen use in their µOP caches, so I can only speculate. However, it is probable that the µOP instruction encoding will slowly diverge from the programmer-visible x86 instruction encoding as time goes on; why wouldn't it?

                Maybe the downloaded binary blobs can be verified against the operational semantics of the original code. The download is a lesser security issue than the upload, in my opinion. Even if the machine uploads the code to trusted providers of the optimization service, the trust is limited. For common Linux apps it wouldn't be an issue because the source code is already open.

                Macro-op fusion (CMP + Jcc) is a minuscule JIT optimization.
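
                For concreteness, an ordinary C++ loop whose compiled form typically ends each iteration with a CMP immediately followed by a Jcc; where macro-op fusion is supported, the decoder turns that adjacent pair into a single compare-and-branch µop, invisibly to the source code:

                Code:
                long count_below(const long* a, long n, long limit) {
                    long count = 0;
                    for (long i = 0; i < n; ++i) {  // cmp i,n + jl: a fusion candidate
                        if (a[i] < limit)           // cmp a[i],limit + jge: likewise
                            ++count;
                    }
                    return count;
                }

                int main() {
                    const long data[] = {1, 5, 9};
                    return static_cast<int>(count_below(data, 3, 6));  // returns 2
                }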



                • #98
                  Originally posted by atomsymbol View Post

                  Just some notes:

                  The Skylake/Zen µOP cache isn't a trace cache in the sense of the Pentium 4's trace cache. A disadvantage of a trace cache is that strict limits have to be put on the tracing; otherwise it can end up consuming exponential amounts of bits.

                  There is no public information about the instruction format Skylake/Zen use in their µOP caches, so I can only speculate. However, it is probable that the µOP instruction encoding will slowly diverge from the programmer-visible x86 instruction encoding as time goes on; why wouldn't it?

                  Maybe the downloaded binary blobs can be verified against the operational semantics of the original code. The download is a lesser security issue than the upload, in my opinion. Even if the machine uploads the code to trusted providers of the optimization service, the trust is limited. For common Linux apps it wouldn't be an issue because the source code is already open.

                  Macro-op fusion (CMP + Jcc) is a minuscule JIT optimization.
                  As I said earlier, I'm not entirely sure what we're arguing about, as it seems that for the most part we're agreeing.

                  That divergence has been happening for a while now (e.g., to the best of my knowledge, there are internal registers that a micro-op can use but that you don't see at the instruction level; in other words, they are micro-architectural, not architectural).

                  As far as I'm concerned, a proper JIT would do some constant folding (e.g., chained adds get collapsed) and maybe some register reallocation as well, utilizing micro-architectural resources of course.
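
                  A toy illustration of that chained-add collapsing, over a made-up µop format (nothing here reflects a real internal encoding):

                  Code:
                  #include <cstdint>
                  #include <cstdio>
                  #include <vector>

                  // Invented µop format, just for the example: "reg += imm".
                  struct AddImm {
                      int reg;
                      std::int64_t imm;
                  };

                  // Collapse runs of add-immediates to the same register into one µop.
                  // In this toy IR each op reads and writes only its own register, so
                  // merging adjacent same-register adds is always sound.
                  std::vector<AddImm> fold_chained_adds(const std::vector<AddImm>& in) {
                      std::vector<AddImm> out;
                      for (const AddImm& op : in) {
                          if (!out.empty() && out.back().reg == op.reg)
                              out.back().imm += op.imm;  // fold into the previous add
                          else
                              out.push_back(op);
                      }
                      return out;
                  }

                  int main() {
                      // Three dependent adds to r1 become one, shortening the chain.
                      auto folded = fold_chained_adds({{1, 2}, {1, 3}, {1, 5}});
                      std::printf("%zu op(s), imm=%lld\n", folded.size(),
                                  static_cast<long long>(folded.back().imm));
                      return 0;
                  }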

                  I don't know if it's worth quibbling over whether macro-op fusion, which combines two instructions into a more complex one that the execution engine can handle as such, makes a JIT or not. If it were one of a dozen optimization techniques employed, sure.

                  Does the fact that Intel and AMD completely control the internal micro-op representation allow them to potentially do such optimizations? Yes, absolutely. But this doesn't mean that they actually do it, or that they will in the future. And these things are measurable, after all (Dr. Agner Fog uses synthetic code and the micro-op counters in his measurements).

                  Do read the paper though.



                  • #99
                    Originally posted by Toggleton View Post
                    Did a run on my N2 (nearly broken: USB hub dead + SD card does not work anymore)
                    What happened? And why haven't you requested an RMA yet?

                    Also, some of your N2 benchmark results look anomalous, such as the LibreOffice-to-PDF conversion. Any idea what may have caused this?



                    • Originally posted by vladpetric View Post
                      Hoping that you're right (honestly!). Any benchmarks though?
                      Sorry, I was on holiday.

                      I rely on AnandTech's results for SPEC CPU 2006/2017: https://www.anandtech.com/show/15603...ania-devices/6


                      The Cortex-A77 has roughly the performance of the Apple A11. The iPhone with the A11 was released in September 2017, while the Qualcomm Snapdragon 865 was released in early 2020. So that's more than the 1-1.5 years I previously claimed; it's about two and a half years.

