AMD Talks Up Zen 4 AVX-512, Genoa, Siena & More At Financial Analyst Day


  • #31
    Originally posted by birdie View Post
    Most likely not going to happen ever. The attached memory is very special, limited (in terms of space) and very expensive. Apple promises 24GB of it as far as I remember, but on PC you can have 128GB, or is it 256 with DDR5 now? I don't remember. Servers can support a lot more of course, literally terabytes of RAM for multi-socket configurations.
    Some customers will always demand more memory, but on the consumer side at least, RAM requirements have plateaued. It's 2022 and we're still buying budget devices with 4 GiB and paying too much for 8 GiB soldered.

    tildearrow said "processor cache" so technically L4 cache would count. Let's see AMD or Intel put 8-16 GiB of that on a CPU.

    Originally posted by shmerl View Post
    What about 16-core CPUs with 3D V-cache?
    Nobody knows for sure, but there's an expectation that they will launch more than just an 8-core next time around. 8-core and 16-core parts would make sense, as both use good chiplets with no disabled cores, unless they can do something like putting partially disabled, lower-capacity 3D V-Cache chiplets on a 12-core.

    Originally posted by ms178 View Post
    As long as AMD cannot provide any volume to the market, having the best CPU on the planet won't help them much to gain market share (by volume). Rembrandt notebooks are still not widely available in Europe, and as we have seen with Zen 3, they now charge a premium wherever they can, which means nothing good for consumers until Intel gets competitive again.
    It's too bad they couldn't use GlobalFoundries more effectively, although there's still time to do something with 12LP+.

    And there's always Sammy.



    • #32
      Originally posted by Virtus View Post

      If you want higher level functions over intrinsics, like SVML, and do not want to depend on a specific compiler, you may use libraries like Sleef https://sleef.org/.
      Compilers are not able, and will not be able, to deal with all the complexity of AVX-512 flavors, so the only way to take full advantage of it is to use intrinsics... and to know what you do. I have seen completely esoteric and inefficient codes with intrinsics...
      I agree, however you need to be careful with libraries as well and know what they do, to avoid copying your data to other registers, executing some non-AVX instructions, and copying the results back into the AVX registers.

      We ended up implementing the exponential function we needed by hand fully with AVX.

      We actually got a nice performance improvement with our hand-written 256-bit AVX implementation. Our algorithm needed around 50 AVX instructions per processed vector of 32-bit floats, so we handled 8 elements per instruction instead of one. Obviously it did not improve 8-fold, but it was a good bit faster than the non-vectorized implementation.

      We came from a Java implementation, which ran for ~45 minutes for the task at hand, and reduced that to ~12 seconds.
      Other optimizations of the algorithm itself were done as well to achieve that.

      It was our first endeavour into really using vector extensions at that level, and we noticed pretty quickly that even small changes, like non-memory-aligned data or a non-vectorized step in the algorithm, made huge performance differences. Worst case scenario: switching between SSE2 and AVX2 instructions, which tanked performance, as the CPU has to save all vector registers to (main?) memory, since the xmm registers (128-bit wide, used in SSE2) and the ymm registers (256-bit wide, for AVX2) are the same in hardware, as far as I understand it.



      • #33
        Originally posted by Spacefish View Post
        Obviously it did not improve 8-fold, but it was a good bit faster than the non-vectorized implementation.
        Sounds like the compiler was able to auto-vectorize a lot of the stuff. Typically you should get far more than "a good bit faster".

        I remember writing my first AVX intrinsic function and expecting massive perf from it. Instead I got results within the margin of error. It turned out the compiler had been able to fully vectorize the plain scalar code.




        • #34
          Originally posted by brad0 View Post

          Intel does not even implement all instructions in any processor, so the question doesn't make sense.
          Your reply makes even less sense, 'Brad':

          Do you want to imply that AMD is incapable of overtaking Intel by offering what they can't/won't?

          Zero respect for you BTW, since you're known to insult Michael...



          • #35
            Originally posted by AdrianBc View Post


            According to WikiChip, Zen 4 will support the same AVX-512 instruction subsets as Ice Lake, together with the extra AVX-512 instructions of Cooper Lake.

            It will not support the AVX-512 instructions added by Tiger Lake or Sapphire Rapids.

             About the speed, nothing is certain, but it is likely that Zen 4 will do only one 512-bit FMA per cycle, though it will probably also be able to do a 512-bit FADD simultaneously with the FMA. At equal clock frequency, this would result in a speed intermediate between the Intel models with one 512-bit FMA per cycle and those with two.
            Thanks for the info!

            So it looks like it will still be inferior to my 11th-gen Intel Rocket Lake's AVX-512 implementation (a Tiger Lake backport to 14nm), but definitely better than 12th-gen Alder Lake, with its official support removed because of the E-cores.



            • #36
              Originally posted by Linuxxx View Post

              Thanks for the info!

              So it looks like it will still be inferior to my 11th-gen Intel Rocket Lake's AVX-512 implementation (a Tiger Lake backport to 14nm), but definitely better than 12th-gen Alder Lake, with its official support removed because of the E-cores.

              No, it would not be inferior in any way to Rocket Lake.

              Rocket Lake is not Tiger Lake backported to 14 nm, it is Ice Lake backported to 14 nm. Rocket Lake lacks the extra instructions of Tiger Lake, like VP2INTERSECT (which will also be absent in Zen 4).

              Zen 4 will support exactly the same AVX-512 instructions as Rocket Lake, plus the additional instructions for machine learning of Cooper Lake.

              Meanwhile, a couple of pages from the AMD PPR manual for Zen 4 have been leaked, and they confirm what was said on WikiChip about which AVX-512 subsets are supported.

              Moreover, in single-thread Zen 4 will have a slightly higher clock frequency than Rocket Lake and in multi-thread a much higher clock frequency, while also having a slightly higher IPC.

              In multi-thread benchmarks, Zen 4 will beat any Intel CPU, but in single-thread it will be slower than the new Intel Raptor Lake, expected in October, and also slightly slower than Alder Lake at equal clock frequency; however, the top Zen 4 models will have a higher clock frequency, which might compensate for the lower IPC.

              The funny thing is that, while for many years the Intel CPUs were saved in many benchmarks by AVX-512, it may now happen that Zen 4 will win some benchmarks against Alder Lake and Raptor Lake precisely when they take advantage of AVX-512.


              In any case, your Rocket Lake is a very fast CPU which supports a very good set of AVX-512 instructions (which, now that AVX-512 support will become much more widespread thanks to AMD, will be used by an increasing number of programs), so you do not have to worry about an upgrade for a few more years, certainly not before Zen 5 and Intel Meteor Lake or Lunar Lake.


              • #37
                Originally posted by AdrianBc View Post


                No, it would not be inferior in any way to Rocket Lake.

                Rocket Lake is not Tiger Lake backported to 14 nm, it is Ice Lake backported to 14 nm. Rocket Lake lacks the extra instructions of Tiger Lake, like VP2INTERSECT (which will also be absent in Zen 4).

                Zen 4 will support exactly the same AVX-512 instructions as Rocket Lake, plus the additional instructions for machine learning of Cooper Lake.

                Meanwhile, a couple of pages from the AMD PPR manual for Zen 4 have been leaked, and they confirm what was said on WikiChip about which AVX-512 subsets are supported.

                Moreover, in single-thread Zen 4 will have a slightly higher clock frequency than Rocket Lake and in multi-thread a much higher clock frequency, while also having a slightly higher IPC.

                In multi-thread benchmarks, Zen 4 will beat any Intel CPU, but in single-thread it will be slower than the new Intel Raptor Lake, expected in October, and also slightly slower than Alder Lake at equal clock frequency; however, the top Zen 4 models will have a higher clock frequency, which might compensate for the lower IPC.

                The funny thing is that, while for many years the Intel CPUs were saved in many benchmarks by AVX-512, it may now happen that Zen 4 will win some benchmarks against Alder Lake and Raptor Lake precisely when they take advantage of AVX-512.


                In any case, your Rocket Lake is a very fast CPU which supports a very good set of AVX-512 instructions (which, now that AVX-512 support will become much more widespread thanks to AMD, will be used by an increasing number of programs), so you do not have to worry about an upgrade for a few more years, certainly not before Zen 5 and Intel Meteor Lake or Lunar Lake.
                Thanks for sharing your knowledge about CPUs, I definitely wasn't aware that Rocket Lake was a 14nm backport of Ice Lake.

                I guess I was confused by the lspci output of my PC, which contains several references to Tiger Lake:

                00:00.0 Host bridge: Intel Corporation Device 4c43 (rev 01)
                00:01.0 PCI bridge: Intel Corporation Device 4c01 (rev 01)
                00:06.0 PCI bridge: Intel Corporation Device 4c09 (rev 01)
                00:08.0 System peripheral: Intel Corporation Device 4c11 (rev 01)
                00:14.0 USB controller: Intel Corporation Tiger Lake-H USB 3.2 Gen 2x1 xHCI Host Controller (rev 11)
                00:14.2 RAM memory: Intel Corporation Tiger Lake-H Shared SRAM (rev 11)
                00:14.3 Network controller: Intel Corporation Tiger Lake PCH CNVi WiFi (rev 11)
                00:15.0 Serial bus controller: Intel Corporation Tiger Lake-H Serial IO I2C Controller #0 (rev 11)
                00:15.2 Serial bus controller: Intel Corporation Device 43ea (rev 11)
                00:16.0 Communication controller: Intel Corporation Tiger Lake-H Management Engine Interface (rev 11)
                00:1f.0 ISA bridge: Intel Corporation Device 4387 (rev 11)
                00:1f.3 Audio device: Intel Corporation Tiger Lake-H HD Audio Controller (rev 11)
                00:1f.4 SMBus: Intel Corporation Tiger Lake-H SMBus Controller (rev 11)
                00:1f.5 Serial bus controller: Intel Corporation Tiger Lake-H SPI Controller (rev 11)
                01:00.0 VGA compatible controller: NVIDIA Corporation GA104 [GeForce RTX 3060 Ti] (rev a1)
                01:00.1 Audio device: NVIDIA Corporation GA104 High Definition Audio Controller (rev a1)
                02:00.0 Non-Volatile memory controller: Sandisk Corp WD Blue SN550 NVMe SSD (rev 01)
                Therefore I just kind of expected Rocket Lake to be a backport of Tiger Lake...

                Thanks again for clarifying the situation!



                • #38
                  Originally posted by Spacefish View Post
                  It was our first endeavour into really using vector extensions at that level, and we noticed pretty quickly that even small changes, like non-memory-aligned data or a non-vectorized step in the algorithm, made huge performance differences. Worst case scenario: switching between SSE2 and AVX2 instructions, which tanked performance, as the CPU has to save all vector registers to (main?) memory, since the xmm registers (128-bit wide, used in SSE2) and the ymm registers (256-bit wide, for AVX2) are the same in hardware, as far as I understand it.
                  The CPU never saves the SSE/AVX registers to the caches or to memory by itself. Maybe you were instead experiencing the following penalty associated with switching between SSE and AVX code:

                  "The Zen 3 has penalties similar to the Intel Sandy Bridge processor when mixing 256-bit VEX instructions with 128-bit non-VEX code as explained on page 132. The transitions from modified to saved state, saved to modified state, or modified to clean state, take approximately 130 clock cycles each. It is important to obey the rules for using VZEROUPPER or VZEROALL to avoid these penalties."

                  Source of the citation: https://www.agner.org/optimize/microarchitecture.pdf
                  The cause of the penalty is that legacy SSE code does not zero the upper 128 bits of the 256-bit AVX registers (with AVX-512, likewise the upper 384 bits of the 512-bit registers).



                  • #39
                    Originally posted by Linuxxx View Post

                    Thanks for sharing your knowledge about CPUs, I definitely wasn't aware that Rocket Lake was a 14nm backport of Ice Lake.

                    I guess I was confused by the lspci output of my PC, which contains several references to Tiger Lake:



                    Therefore I just kind of expected Rocket Lake to be a backport of Tiger Lake...

                    Thanks again for clarifying the situation!

                    "lspci" describes the PCI devices based on their PCI identifiers, which in your case are the identifiers that had been used for the first time when Tiger Lake was launched, earlier than the launch of Rocket Lake.

                    All those PCI devices are not located on the CPU die, with the CPU cores, but they are located in a separate chip soldered on the motherboard, usually referred to as the south-bridge or the motherboard chipset. The south-bridge is not made using the same 14-nm CMOS process as the Rocket Lake CPU, but it is made using an older process, to reduce the manufacturing cost.

                    From your lspci output, it appears that Rocket Lake has reused the south-bridge of Tiger Lake, which makes sense. There was no reason to design another different south-bridge chip for Rocket Lake.




                    • #40
                      Originally posted by phoronix View Post
                      Phoronix: AMD Talks Up Zen 4 AVX-512, Genoa, Siena & More At Financial Analyst Day

                      AMD today hosted their 2022 Financial Analyst Day where they made some new disclosures and firmed up past product road-map plans...

                      https://www.phoronix.com/scan.php?pa...ncial-Day-2022
                      From angstronomics.com/p/ryzen-7000-desktop-preview:
                      • AVX-512 is [supposedly] implemented using 256-bit datapaths (but the webpage doesn't say whether that means "256-bit datapaths and 256-bit ALUs" or "256-bit datapaths and 512-bit ALUs")
                      • µop cache [supposedly] has 6144 entries (the Zen 3 µop cache has 4096 entries)

