Arm Details The Cortex-X4 With +15% Performance, Armv9.2 ISA


  • #31
    Originally posted by ryao View Post

    SVE2 is not similar to AVX2 in the way NEON is similar to SSE. It is a different approach to SIMD that is more similar to how SIMD works in a GPU. It is possible to convert from one to the other, but it is not a nearly 1-to-1 translation like SSE to NEON. In particular, SVE2 is a variable-length SIMD format, while NEON and AVX are fixed length. To give a simple explanation of why this is not as easy as SSE to NEON: the features needed to support variable length keep it from being a 1-to-1 translation.
    It shouldn't be hard to "fix" SVE2 at 128 bits (the minimum required length) and then have a fixed ratio; it should be pretty easy to generalize and produce fitting code (instead of a dynamic loop) for the maximum width of the CPU in use. Not really a hard problem.

    Missing or mismatched operations are likely more of a concern (in terms of performance).



    • #32
      Originally posted by brad0 View Post

      That's not the RPi market. Would never happen.
      What do you mean? RPi users want a 40-pin GPIO, all peripherals connected to a single USB host, the fastest possible CPU, and dozens of gigabytes of RAM.



      • #33
        Originally posted by ryao View Post
        Rockchip has better availability than the Raspberry Pi, but its documentation is not as good. I know someone who tried using a Radxa CM3 in place of the Raspberry Pi CM4 and could not figure out how to use the GPIO because the documentation was so poor. I looked through the documentation and thought I saw how, but he was no longer willing to try after that.
        A major benefit of the Raspberry Pi is that there is documentation on how to use it for just about anything. The performance / price ratio is better with the Raspberry Pi. The A72 cores in the Raspberry Pi 4, while old, are more performant than anything else in the same price range. Raspberry Pi competitors use the same A53 cores used in the Raspberry Pi 3 when reaching the same pricing. At that point, they are really competing with the Raspberry Pi Zero 2 W, which costs only $15.
        A major reason why the Raspberry Pi is so cost effective is that it is made by a non-profit and much of the engineering time spent on it is donated. It is nearly impossible to compete with that. Another reason would be that the Raspberry Pi 4 is using the 28nm process, which is the most cost effective node right now. That is according to public charts that I have seen, although I do not remember where. A quick search finds this article from 2014 that explains that below 28nm, costs go up rather than down, which is consistent with what we have been seeing in the industry with leading edge silicon becoming increasingly more expensive:

        I also found this article from 2016 showing that the cost per transistor was flat below 28nm:

        Unfortunately, I cannot find the more recent source I read that confirmed that this trend has continued to the present day.
        The only place where I have seen others be competitive with the Raspberry Pi is at the very low end where the Raspberry Pi Pico operates. There are significantly cheaper options in that area. That said, I doubt their documentation is as good as the Raspberry Pi's, although to be honest, I have not checked, since I am not interested in anything much lower end than the Raspberry Pi 4 / CM4.
        But didn't AMD prove with the Radeon 7600 that it can produce a newer chip at a lower price per video card with higher performance?
        The reason this is possible even though the node is more expensive is that the new chip is only 83% the size of the Radeon 6600, yet thanks to higher clock frequencies it is faster than even a Radeon 6600 XT.
        This means that over the last 2-3 years the situation has changed: the 28nm node is no longer the most cost-efficient one.
        That is because on a smaller node you can increase the clock speed and build a smaller chip, and the performance is higher even if you put fewer transistors in it.

        Right now it looks like the 8-12nm range is the most cost-efficient one.

        The Rockchip RK3588 is already on an 8nm node. In my view, you cannot improve the situation by going from 8nm to 12nm, because yes, the chip would be cheaper, but the clock speeds would be lower too, and that annihilates any benefit.

        If they went from 8nm to TSMC's 6nm node, the result would be that they could go from a 2.4 GHz CPU clock speed to maybe 3 GHz. We will of course see exactly this 6-12 months from now.
        Last edited by qarium; 30 May 2023, 08:13 AM.



        • #34
          Originally posted by qarium View Post
          Right now it looks like the 8-12nm range is the most cost-efficient one.
          This is always a matter of demand. From a production-cost perspective, it has been a few years since smaller nodes meant cheaper transistors. I don't think this has changed; what has changed is that the "premium" chips have moved further down, and 28nm is still in high demand ("good enough" for a lot of purposes). The manufacturer's markup differs based on that.



          • #35
            Originally posted by discordian View Post
            This is always a matter of demand. From a production-cost perspective, it has been a few years since smaller nodes meant cheaper transistors. I don't think this has changed; what has changed is that the "premium" chips have moved further down, and 28nm is still in high demand ("good enough" for a lot of purposes). The manufacturer's markup differs based on that.
            Of course 28nm is good enough for many use cases... but he claims that Rockchip, who are on 8nm, would need to produce the chip on 28nm to become an option for the performance-per-dollar crowd...

            And that, my friend, is not true. As you say, the premium chip makers are all on 4nm/5nm/6nm/7nm now, and fewer and fewer customers want these older 8nm nodes... because of this, the 8nm node gets cheaper.

            If they still produce the Rockchip RK3588 on 8nm in 12 months, you can expect a significant price reduction simply because the 8nm node becomes cheaper and cheaper over time.

            Having it 20-40% cheaper in a year is what you can expect.

            I don't know what he wants, but even if the product is 20-40% cheaper in a year, he may still not be happy.



            • #36
              Michael

              Please see post number 2 above by tildearrow for three typos (THe, cace, larget).



              • #37
                Originally posted by discordian View Post

                It shouldn't be hard to "fix" SVE2 at 128 bits (the minimum required length) and then have a fixed ratio; it should be pretty easy to generalize and produce fitting code (instead of a dynamic loop) for the maximum width of the CPU in use. Not really a hard problem.

                Missing or mismatched operations are likely more of a concern (in terms of performance).
                While you can do that, we already have NEON for 128-bit; there is no need to use SVE2 for 128-bit. 256-bit and 512-bit SIMD operations are where the question of how to efficiently translate SIMD arises. You can have instructions operating on 128-bit vectors mixed with ones operating on 256-bit vectors, so trivially fixing the width is not as simple as it sounds. Translation is still doable, but it is easier to translate 128-bit to NEON and ignore larger vector widths, as Apple did. That was probably done in part because Apple hardware does not support SVE/SVE2; most ARM hardware does not support it either. Implementing larger widths using NEON instructions is problematic because you do not have enough registers to hold all of the state plus temporaries without register spills, which are slow.

                That said, I am still studying SVE2 (and also RISC-V's Vector extension, on which it is based). I know that it is advertised as allowing hardware SIMD support to be narrower than what the software targets, but GCC and Clang require you to specify a width for SVE2, and when you do, the assembly, when run on hardware with a different width, will simply read or write at the native width, such that it could read invalid memory by overcomputing when the hardware vector size is larger, and undercompute when the hardware vector size is smaller. They have a scalable setting meant to avoid that problem, but neither emits SVE2 instructions when it is set. I am currently not sure what code that supports an arbitrary hardware width looks like, and assuming I knew, I would want to know how the hardware deals with the intermediate values from the larger-width registers that the software is written to expect. :/
                Last edited by ryao; 30 May 2023, 03:40 PM.



                • #38
                  Originally posted by qarium View Post

                  But didn't AMD prove with the Radeon 7600 that it can produce a newer chip at a lower price per video card with higher performance?
                  The reason this is possible even though the node is more expensive is that the new chip is only 83% the size of the Radeon 6600, yet thanks to higher clock frequencies it is faster than even a Radeon 6600 XT.
                  Less die area means better yields, which can help to mitigate the cost increase from the newer process. In addition to the newer process, they have also made architectural improvements, which is likely where most of the performance gains originate.

                  Originally posted by qarium View Post
                  This means that over the last 2-3 years the situation has changed: the 28nm node is no longer the most cost-efficient one.
                  That is because on a smaller node you can increase the clock speed and build a smaller chip, and the performance is higher even if you put fewer transistors in it.

                  Right now it looks like the 8-12nm range is the most cost-efficient one.

                  The Rockchip RK3588 is already on an 8nm node. In my view, you cannot improve the situation by going from 8nm to 12nm, because yes, the chip would be cheaper, but the clock speeds would be lower too, and that annihilates any benefit.
                  Shrinking things does mean clock speeds can rise from things being closer together, provided you do not add transistors that lengthen the longest signal path and a bunch of other factors do not inhibit it. It also improves yields if the die size shrinks. These are factors not included in per-transistor cost calculations. However, there is definite upward pressure on prices from increased fabrication costs at smaller node sizes.

                  That said, I have not read anything that says that smaller process nodes are more cost effective. TSMC is currently trying to get customers to migrate to 28nm from older nodes while citing cost savings:



                  Unless something has changed, this argument would not work as well with newer nodes. The ratio of how much things shrink with each new node versus how much it costs to fabricate them has been getting worse with each new process. Things are so bad now that people are beginning to question whether investments into newer nodes still make sense:



                  Originally posted by qarium View Post
                  If they went from 8nm to TSMC's 6nm node, the result would be that they could go from a 2.4 GHz CPU clock speed to maybe 3 GHz. We will of course see exactly this 6-12 months from now.
                  Die shrinks do not give that much leeway. A 25% increase in clock speed (2.4 GHz to 3 GHz) from a die shrink alone sounds unrealistic to me. The logic is not shrinking that much between die shrinks anymore, and even when it did, we never saw such clock speed increases from it. Other design changes would likely be needed to support that, unless they are simply running things slower than they need to be right now.
                  Last edited by ryao; 30 May 2023, 04:05 PM.



                  • #39
                    Originally posted by qarium View Post

                    And that, my friend, is not true. As you say, the premium chip makers are all on 4nm/5nm/6nm/7nm now, and fewer and fewer customers want these older 8nm nodes... because of this, the 8nm node gets cheaper.
                    The argument is that producing X transistors is more expensive at 8nm than at 28nm (or wherever the cutoff point is) and will stay that way.
                    How much customers pay the foundry is subject to more variables than that, but it is no longer a no-brainer that smaller structures will get you cheaper transistors in the long run (after market prices have moved toward slim markups on the product).



                    • #40
                      Originally posted by ryao View Post

                      While you can do that, we already have NEON for 128-bit; there is no need to use SVE2 for 128-bit. 256-bit and 512-bit SIMD operations are where the question of how to efficiently translate SIMD arises. You can have instructions operating on 128-bit vectors mixed with ones operating on 256-bit vectors, so trivially fixing the width is not as simple as it sounds. Translation is still doable, but it is easier to translate 128-bit to NEON and ignore larger vector widths, as Apple did. That was probably done in part because Apple hardware does not support SVE/SVE2; most ARM hardware does not support it either. Implementing larger widths using NEON instructions is problematic because you do not have enough registers to hold all of the state plus temporaries without register spills, which are slow.

                      That said, I am still studying SVE2 (and also RISC-V's Vector extension, on which it is based). I know that it is advertised as allowing hardware SIMD support to be narrower than what the software targets, but GCC and Clang require you to specify a width for SVE2, and when you do, the assembly, when run on hardware with a different width, will simply read or write at the native width, such that it could read invalid memory by overcomputing when the hardware vector size is larger, and undercompute when the hardware vector size is smaller. They have a scalable setting meant to avoid that problem, but neither emits SVE2 instructions when it is set. I am currently not sure what code that supports an arbitrary hardware width looks like, and assuming I knew, I would want to know how the hardware deals with the intermediate values from the larger-width registers that the software is written to expect. :/
                      Sure, NEON is a better target than SVE2 if the CPU doesn't support it, but that wasn't the argument I was responding to.
                      Instead, you can easily drop the "dynamic size" and work with a fixed width (for the vast majority of instructions), so translation would not be more complicated.

                      The more expensive approach, of course, would be to translate whole loops into optimal SVE2 loops with a dynamic iteration count. But that's not within the scope of Rosetta 2.

