Further Investigating The Raspberry Pi 32-bit vs. 64-bit Performance


  • #61
    Originally posted by vladpetric View Post
    Most of the time my surprise was quite negative, as daxpy loops are pretty rare. Though feel free to give me an example of successful vectorization of ARM benchmark code.
    I've mostly worked in the realm of x86, and its optimization behavior doesn't change that much based on ISA. So if the compiler vectorizes my code aggressively when compiled for x86 with the -O3 flag, it's unlikely to behave that much differently when compiled for ARM. Admittedly, in the field of HPC (i.e. where I wrote my Master's thesis) auto-vectorization is usually a stopgap while you rewrite the most performance-critical sections to explicitly use vector instructions, but the primary reason it's not used in a lot of projects like the Linux kernel is not performance. It's needing to maintain support for very old versions of GCC that produced buggy binaries at that optimization level.
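
    To give you a concrete example (a minimal sketch of mine, assuming GCC 9 or later so that -fopt-info-vec is available):

    ```cpp
    // daxpy.cpp - minimal auto-vectorization check.
    //   x86-64:  g++ -O3 -fopt-info-vec -c daxpy.cpp                    (SSE/AVX)
    //   AArch64: aarch64-linux-gnu-g++ -O3 -fopt-info-vec -c daxpy.cpp  (NEON)
    // Both report "loop vectorized": the vectorizer pass itself is
    // ISA-independent, only the instruction selection differs.
    #include <cstddef>

    void daxpy(double a, const double* __restrict x,
               double* __restrict y, std::size_t n) {
        for (std::size_t i = 0; i < n; ++i)
            y[i] += a * x[i];  // no aliasing, countable trip count: vectorizes
    }
    ```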

    Originally posted by vladpetric View Post
    But most ARM designs are toy-ish (as in, not high-performance computing; the main reason ARM Holdings is on sale for the second time in half a decade), and I doubt they would go that far in trying to extract performance.
    Low-power parts specifically designed for mobile, embedded and otherwise power-constrained applications aren't automatically "toy-ish". We've already seen ARM parts that give mobile i5s and i7s more than a run for their money, and we're only going to see more of that class of parts now that ARM has shaken off the IoT mandate Softbank pushed on them.

    As for why ARM is "on sale" again: the sale was stopped on antitrust grounds, and the whole sale was primarily done to cover Softbank's investment losses elsewhere. The only difference now is that Softbank is divesting from ARM via an IPO, with ARM's client companies buying stakes in it.
    Last edited by L_A_G; 18 February 2022, 11:34 AM.



    • #62
      Originally posted by vladpetric View Post
      The purpose of register renaming is to get rid of WAR and WAW false dependencies, so that you can effectively use a larger instruction window for dynamic (out-of-order) scheduling. Register renaming by itself does not address spills. That's an extension to register renaming which I proposed a while back (https://repository.upenn.edu/cis_papers/217/) and which the AMD Ryzen 3000 series seems to implement (https://www.agner.org/forum/viewtopic.php?t=41).

      Of course, it's more important to address spills for an architecture like AMD64 or ARM 32, with 15 GPRs, than for ARM 64, with 30 GPRs. But most ARM designs are toy-ish (as in, not high-performance computing; the main reason ARM Holdings is on sale for the second time in half a decade), and I doubt they would go that far in trying to extract performance.
      Arm has far more registers and doesn't use push/pop and load-op from memory like x86, so there is far less gain to be had from doing something like that. Also there are many other things that are both easier and give more gain. Arm designs are 8-wide already and likely will go wider in the next few years.

      And I don't think you can call any Arm design in the last 5 years toy-ish. Mobile phones already match typical desktops - as an example, my S21 has the same single-threaded performance as my Ryzen 3700X desktop.



      • #63
        Originally posted by L_A_G View Post

        I've mostly worked in the realm of x86, and its optimization behavior doesn't change that much based on ISA. So if the compiler vectorizes my code aggressively when compiled for x86 with the -O3 flag, it's unlikely to behave that much differently when compiled for ARM. Admittedly, in the field of HPC (i.e. where I wrote my Master's thesis) auto-vectorization is usually a stopgap while you rewrite the most performance-critical sections to explicitly use vector instructions, but the primary reason it's not used in a lot of projects like the Linux kernel is not performance. It's needing to maintain support for very old versions of GCC that produced buggy binaries at that optimization level.

        Low-power parts specifically designed for mobile, embedded and otherwise power-constrained applications aren't automatically "toy-ish". We've already seen ARM parts that give mobile i5s and i7s more than a run for their money, and we're only going to see more of that class of parts now that ARM has shaken off the IoT mandate Softbank pushed on them.

        As for why ARM is "on sale" again: the sale was stopped on antitrust grounds, and the whole sale was primarily done to cover Softbank's investment losses elsewhere. The only difference now is that Softbank is divesting from ARM via an IPO, with ARM's client companies buying stakes in it.
        A lot of vectorization examples are unfortunately cherry-picked. Yes, they look great on paper, but when you get unrelated code written by someone else (out of sample vs. in sample, as scientists would say), it all falls apart.

        Apple has shown pretty clearly that you can totally get high-performance computing in a mobile chip. The M1s (and successors) beat the crap out of ARM's own offerings on a performance-per-watt basis (by far the most important factor for mobile chip design). The instruction set is the same, but microarchitecture matters hugely. The distinction you mentioned is more historical than actual.

        Why the sale was stopped is irrelevant here - if they were successful they wouldn't have been on sale to begin with. Accounting for inflation, the sale would have barely turned a profit for Softbank. https://www.in2013dollars.com/us/inflation/2015



        • #64
          Originally posted by vladpetric View Post
          A lot of vectorization examples are unfortunately cherry-picked. Yes, they look great on paper, but when you get unrelated code written by someone else (out of sample vs. in sample, as scientists would say), it all falls apart.
          Similarly, a lot of the people who argue against bothering with it typically only do so at GCC's -O2 optimization level, which is the lowest level that even tries to do it. If that (i.e. putting in the minimum effort possible) isn't cherry-picking, I don't know what is. There's also the fact that how you structure your code has a significant impact on how effective the vectorization is, and most people don't write their code with this in mind. I'm something of a neat freak who does things like loop unrolling without even thinking about it, and my code vectorizes pretty well.
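
          To illustrate what I mean by structure (a hand-rolled sketch of mine, not from any real project), the same arithmetic can vectorize or not depending on how the loop is written:

          ```cpp
          #include <cstddef>

          // Vectorizes at -O3: no aliasing, trip count known at loop entry.
          // For the floating-point reduction GCC additionally wants -ffast-math
          // (or at least -fassociative-math) before it reorders the additions.
          double dot(const double* __restrict a, const double* __restrict b,
                     std::size_t n) {
              double s = 0.0;
              for (std::size_t i = 0; i < n; ++i)
                  s += a[i] * b[i];
              return s;
          }

          // Same arithmetic, but the data-dependent break makes the trip count
          // unknowable in advance, so -O3 gives up on the loop;
          // -fopt-info-vec-missed reports why.
          double dot_until_negative(const double* a, const double* b,
                                    std::size_t n) {
              double s = 0.0;
              for (std::size_t i = 0; i < n; ++i) {
                  if (a[i] < 0.0) break;
                  s += a[i] * b[i];
              }
              return s;
          }
          ```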

          Apple has shown pretty clearly that you can totally get high-performance computing in a mobile chip. The M1s (and successors) beat the crap out of ARM's own offerings on a performance-per-watt basis (by far the most important factor for mobile chip design). The instruction set is the same, but microarchitecture matters hugely. The distinction you mentioned is more historical than actual.
          We are talking about an ISA here, not an implementation of it. Apple building its own custom cores would only be non-applicable if they were building non-conformant cores, which they aren't. I mentioned them specifically because they're the ones who have put very recent ARM cores in power envelopes that allow them to compete directly with x86. There are also ARM's own "big" cores, which have been achieving very similar kinds of performance, albeit in lower-power applications, which puts them at a clear disadvantage in direct comparisons with most x86 parts.

          Why the sale was stopped is irrelevant here - if they were successful they wouldn't have been on sale to begin with. Accounting for inflation, the sale would have barely turned a profit for Softbank. https://www.in2013dollars.com/us/inflation/2015
          As I already pointed out, they're not for sale because of their own actions, but rather due to disastrous investments in companies like WeWork and Uber. We're talking losses of $17.7 billion for 2020 and $3.5 billion for 2021 in their "Vision Fund" investment business. Your response suggests that you didn't know the scale of their other investment losses.

          Let's also not forget that you can't compare sale prices when the situations are totally different. When Softbank bought ARM, it was a publicly traded company and hence had to be bought at a premium over the then-current stock price. Now that it's a privately held company being sold to raise much-needed capital, Softbank is in a much worse negotiating position, and that's going to be reflected in the sale price. The real comparison point is going to be their market cap once Softbank has successfully re-floated them on the stock market later this year.



          • #65
            Originally posted by L_A_G View Post

            Similarly, a lot of the people who argue against bothering with it typically only do so at GCC's -O2 optimization level, which is the lowest level that even tries to do it. If that (i.e. putting in the minimum effort possible) isn't cherry-picking, I don't know what is. There's also the fact that how you structure your code has a significant impact on how effective the vectorization is, and most people don't write their code with this in mind. I'm something of a neat freak who does things like loop unrolling without even thinking about it, and my code vectorizes pretty well.

            We are talking about an ISA here, not an implementation of it. Apple building its own custom cores would only be non-applicable if they were building non-conformant cores, which they aren't. I mentioned them specifically because they're the ones who have put very recent ARM cores in power envelopes that allow them to compete directly with x86. There are also ARM's own "big" cores, which have been achieving very similar kinds of performance, albeit in lower-power applications, which puts them at a clear disadvantage in direct comparisons with most x86 parts.

            As I already pointed out, they're not for sale because of their own actions, but rather due to disastrous investments in companies like WeWork and Uber. We're talking losses of $17.7 billion for 2020 and $3.5 billion for 2021 in their "Vision Fund" investment business. Your response suggests that you didn't know the scale of their other investment losses.

            Let's also not forget that you can't compare sale prices when the situations are totally different. When Softbank bought ARM, it was a publicly traded company and hence had to be bought at a premium over the then-current stock price. Now that it's a privately held company being sold to raise much-needed capital, Softbank is in a much worse negotiating position, and that's going to be reflected in the sale price. The real comparison point is going to be their market cap once Softbank has successfully re-floated them on the stock market later this year.
            I never build high-performance production code at -O2, not sure why you keep bringing it up. Still, the cherry-picking/overfitting problem applies. Compiler writers use some benchmark code to tune their heuristics, and because they use the same code to train and to report their findings (which is super unscientific), everything looks great. But then you try with completely different stuff, and no vectorization happens. I've seen that happen many, many times with loopy floating-point C++ code. This is much less of a problem for other optimizations done by GCC and LLVM (and yes, I've read metric sh*ttons of amd64 disassembly in my career).

            Regarding Softbank: if you have a link about their need for money in the summer/fall of 2020 (when they started the process), please do share it! Yes, they lost money with WeWork, but they also made good investments. This article is from before the proposed merger was announced: https://www.cnbc.com/2020/08/11/soft...0-results.html

            It appears that they were doing quite OK then, actually. It's a bit hard to say, because this is a fund that doesn't seem to have, TTBOMK, strong reporting requirements (like Form 13F in the US). So it's entirely possible that they needed a lot of money, but I would like some evidence. Yes, if ARM is floated again, we will see.

            An ISA is just a spec. Yes, it's an important spec because all the software depends on it, and the spec can be limiting in many ways. However, the implementation matters hugely.

            One of the first things mentioned in Hennessy and Patterson (which I assume you've seen) is that total end-to-end time depends on three things: instruction count, clock cycle time, and instructions per cycle (IPC). Instruction count is the same for the same binary, of course. Clock speeds - not that much variation there, not to mention that dynamic power consumption (as opposed to static leakage) has clock speed as a multiplicative factor (while not perfect, this is still a good approximation). So IPC is what makes all the difference in the world when it comes to performance and performance per watt. I have in fact seen IPC differences as high as 9x for a benchmark (e.g., a good ARM processor sustaining close to 3 retired instructions per cycle, and another one doing 0.33). And yes, it all comes down to the microarchitecture. The problem is, computer microarchitecture is really hard, and that number is not a constant. Thus many people choose to completely ignore it.
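
            To spell out the arithmetic (just the textbook identities, with the numbers from above plugged in):

            ```latex
            % Iron law of processor performance (Hennessy & Patterson):
            \[
              T_{\mathrm{exec}} = N_{\mathrm{instr}} \times \mathrm{CPI} \times T_{\mathrm{cycle}}
                                = \frac{N_{\mathrm{instr}}}{\mathrm{IPC} \times f_{\mathrm{clock}}}
            \]
            % Same binary => same N_instr; comparable clocks => similar f_clock,
            % so a 9x IPC gap (3 vs 0.33 retired instructions/cycle) is a ~9x
            % gap in execution time. And since dynamic power scales roughly as
            \[
              P_{\mathrm{dyn}} \approx \alpha\, C\, V^{2} f_{\mathrm{clock}},
            \]
            % IPC gains at a fixed clock translate almost directly into
            % performance-per-watt gains.
            ```
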
            Last edited by vladpetric; 18 February 2022, 02:03 PM.



            • #66
              Originally posted by PerformanceExpert View Post

              Arm has far more registers and doesn't use push/pop and load-op from memory like x86, so there is far less gain to be had from doing something like that. Also there are many other things that are both easier and give more gain. Arm designs are 8-wide already and likely will go wider in the next few years.

              And I don't think you can call any Arm design in the last 5 years toy-ish. Mobile phones already match typical desktops - as an example, my S21 has the same single-threaded performance as my Ryzen 3700X desktop.
              X86-32 - 7 GPRs
              AMD64 aka X86-64 - 15 GPRs
              ARM 32 - 15 GPRs
              ARM 64 - 30 GPRs

              Yes, there will be considerably more spills for ARM 32 and AMD64 versus ARM 64. IMO it'd be totally worth it to do spill optimization for ARM 32 code.
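
              A quick way to see this for yourself (my own illustration, nothing authoritative): keep more values live across a loop back-edge than the target has GPRs, and the spills show up in the generated assembly.

              ```cpp
              // spills.cpp - ten accumulators are live across every iteration.
              // A 7-GPR target (x86-32) has to spill several of them to the
              // stack inside the loop; a 15- or 30-GPR target keeps them all
              // in registers.
              // Compare:  g++ -O2 -m32 -S spills.cpp   vs.   g++ -O2 -S spills.cpp
              long sum10(const long* p, long n) {
                  long s0 = 0, s1 = 0, s2 = 0, s3 = 0, s4 = 0,
                       s5 = 0, s6 = 0, s7 = 0, s8 = 0, s9 = 0;
                  for (long i = 0; i + 9 < n; i += 10) {
                      s0 += p[i];     s1 += p[i + 1]; s2 += p[i + 2];
                      s3 += p[i + 3]; s4 += p[i + 4]; s5 += p[i + 5];
                      s6 += p[i + 6]; s7 += p[i + 7]; s8 += p[i + 8];
                      s9 += p[i + 9];
                  }
                  return s0 + s1 + s2 + s3 + s4 + s5 + s6 + s7 + s8 + s9;
              }
              ```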

              Please show me some numbers for the chip in the S21 versus the Ryzen 3700X.
              Last edited by vladpetric; 18 February 2022, 02:00 PM.



              • #67
                Originally posted by vladpetric View Post

                X86-32 - 7 GPRs
                AMD64 aka X86-64 - 15 GPRs
                ARM 32 - 15 GPRs
                ARM 64 - 30 GPRs

                Yes, there will be considerably more spills for ARM 32 and AMD64 versus ARM 64. IMO it'd be totally worth it to do spill optimization for ARM 32 code.

                Please show me some numbers for the chip in the S21 versus the Ryzen 3700X.
                Since AArch64 is much faster already, why would a sane CPU designer even consider optimizing spilling for 32-bit ARM? On x86 I could see the point given it has a large 32-bit legacy, and it will still help 64-bit, but 32-bit has been on minimal life support on Arm for years now. Btw note 32-bit ARM has 14 GPRs.

                I did do a quick run of Geekbench, which got 1165 on my 3700X and 1135 on my S21.



                • #68
                  Originally posted by PerformanceExpert View Post

                  Since AArch64 is much faster already, why would a sane CPU designer even consider optimizing spilling for 32-bit ARM? On x86 I could see the point given it has a large 32-bit legacy, and it will still help 64-bit, but 32-bit has been on minimal life support on Arm for years now. Btw note 32-bit ARM has 14 GPRs.

                  I did do a quick run of Geekbench, which got 1165 on my 3700X and 1135 on my S21.
                  I don't have a problem with that at all, assuming that you are correct about ARM 32 being obsolete. Are you implying that most of the apps on my phone and the Android 10 system code are all 64-bit? It's a recently bought Android 10 device.

                  Spill optimization still helps quite a bit with only 14-15 GPRs, though (and yes, it's much, much more of a win with only 7 GPRs).

                  As for Geekbench - well, there are issues with the benchmark suite itself, but there also seem to be differences between what you measured and what's published.

                  Still, even as such - not too shabby. Point taken.



                  • #69
                    Originally posted by vladpetric View Post
                    I don't have a problem with that at all, assuming that you are correct about ARM 32 being obsolete. Are you implying that most of the apps on my phone and the Android 10 system code are all 64-bit? It's a recently bought Android 10 device.
                    Yes, Android has mandated 64-bit versions of all apps for some years. Since last year, 32-bit apps are no longer supported on the app store, so you can only download 64-bit apps today. Phones will no longer run 32-bit apps from 2024. This was originally announced in 2017, so it has taken a long time (much slower than Apple), but it means nobody has put much effort into optimizing 32-bit since then.

                    As for Geekbench - well, there are issues with the benchmark suite itself, but there also seem to be differences between what you measured and what's published.

                    Still, even as such - not too shabby. Point taken.
                    I think the biggest issue with Geekbench is that the scores vary too much. The average scores listed for phones are a huge underestimate; maybe it runs on a slow core due to power-saving mode. However, if you look at other published scores, they get the same results as I do (first bar is S21).

                    The other interesting thing is that my PC burns ~150W to get a multithreaded score that is about twice that of my phone (which barely gets warm). If somebody made a Windows PC that is even faster, silent and uses <30W, I'd say goodbye to the AMD and Intel dinosaurs.



                    • #70
                      Originally posted by PerformanceExpert View Post

                      Yes, Android has mandated 64-bit versions of all apps for some years. Since last year, 32-bit apps are no longer supported on the app store, so you can only download 64-bit apps today. Phones will no longer run 32-bit apps from 2024. This was originally announced in 2017, so it has taken a long time (much slower than Apple), but it means nobody has put much effort into optimizing 32-bit since then.

                      I think the biggest issue with Geekbench is that the scores vary too much. The average scores listed for phones are a huge underestimate; maybe it runs on a slow core due to power-saving mode. However, if you look at other published scores, they get the same results as I do (first bar is S21).

                      The other interesting thing is that my PC burns ~150W to get a multithreaded score that is about twice that of my phone (which barely gets warm). If somebody made a Windows PC that is even faster, silent and uses <30W, I'd say goodbye to the AMD and Intel dinosaurs.
                      Yeah, so Geekbench is only good as a first-order approximation. But for that, it's fine.

                      I don't mean to split hairs, but the aggressive power-saving tradeoffs in a mobile chip are probably deemed not worth the effort for a desktop (no battery) application. After all, when the fridges and ACs consume thousands of watts, whether your CPU idles at 30 W or sub-watt is not going to make a difference. In addition, higher power allows for sustained high peak performance (the variation in mobile CPU performance is not typically caught by something like Geekbench, IIUC).

