AMD Talks Up Zen 4 AVX-512, Genoa, Siena & More At Financial Analyst Day


  • #31
    Originally posted by Virtus

    If you want higher level functions over intrinsics, like SVML, and do not want to depend on a specific compiler, you may use libraries like Sleef https://sleef.org/.
    Compilers are not able, and will not be able, to deal with all the complexity of AVX-512 flavors, so the only way to take full advantage of it is to use intrinsics... and to know what you do. I have seen completely esoteric and inefficient codes with intrinsics...
    I agree, however you need to be careful with libraries as well and know what they do, to avoid copying your data to other registers, running some non-AVX instructions on it and copying it back to the AVX registers.

    We ended up implementing the exponential function we needed by hand fully with AVX.

    We actually got a nice performance improvement with our hand-written 256-bit AVX implementation. Our algorithm had about 50 AVX instructions per processed vector of 32-bit floats, so we handled 8 elements per instruction instead of one. Obviously it did not improve 8-fold, but it was a good bit faster than the non-vectorized implementation.

    We came from a Java implementation, which ran for ~45 minutes for the task at hand, and reduced that to ~12 seconds.
    Other optimizations of the algorithm itself were done as well to achieve that.

    It was our first endeavour into really using vector extensions at that level, and we noticed pretty quickly that even small changes, like having data that is not memory-aligned or having a non-vectorized step in the algorithm, made huge performance differences. Worst-case scenario: switching between SSE2 and AVX2 instructions. That tanked performance, as the CPU has to save all vector registers (to main memory?), since the xmm registers (128 bits wide, used by SSE2) and the ymm registers (256 bits wide, used by AVX2) are the same in hardware, as far as I understand it.
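
To make the alignment point concrete, here is a minimal sketch (not the code from the project described above) of allocating float buffers on a 32-byte boundary, which is what aligned 256-bit loads such as _mm256_load_ps require. The helper names alloc_floats_aligned and is_vec_aligned are made up for illustration, and the sketch assumes a C11 library that provides aligned_alloc.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

/* Bytes in one 256-bit ymm register. */
#define VEC_ALIGN 32

/* Hypothetical helper: allocate n floats on a 32-byte boundary.
 * aligned_alloc (C11) wants the size to be a multiple of the
 * alignment, so the byte count is rounded up first. */
float *alloc_floats_aligned(size_t n)
{
    size_t bytes = n * sizeof(float);
    size_t rounded = (bytes + VEC_ALIGN - 1) / VEC_ALIGN * VEC_ALIGN;
    if (rounded == 0)
        rounded = VEC_ALIGN;
    float *p = aligned_alloc(VEC_ALIGN, rounded);
    /* Sanity check: a successful allocation is a multiple of 32. */
    assert(p == NULL || (uintptr_t)p % VEC_ALIGN == 0);
    return p;
}

/* Returns 1 if p is suitable for aligned 256-bit loads and stores. */
int is_vec_aligned(const void *p)
{
    return (uintptr_t)p % VEC_ALIGN == 0;
}
```

A buffer obtained this way is released with free() as usual. For data that cannot be guaranteed to be aligned, the unaligned load variants are the safe fallback.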



    • #32
      Originally posted by Spacefish
      Obviously did not improve 8 fold, but was a good bit faster than the non-vectorized implementation.
      Sounds like the compiler was able to auto-vectorize a lot of the stuff. You should typically get far more than "a good bit faster".

      I remember writing my first AVX intrinsics function and expecting a massive perf gain. I got results within the margin of error instead. Turns out the compiler was able to fully vectorize the plain scalar code.
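
For illustration, this is the kind of plain scalar loop that compilers can often vectorize on their own. It is a generic sketch, not the function from the anecdote above, and the name saxpy is just the conventional one for this operation.

```c
#include <stddef.h>

/* Plain scalar loop: y[i] = a * x[i] + y[i].
 * "restrict" promises the compiler that x and y do not overlap, which
 * removes one common obstacle to auto-vectorization. */
void saxpy(size_t n, float a, const float *restrict x, float *restrict y)
{
    for (size_t i = 0; i < n; i++)
        y[i] = a * x[i] + y[i];
}
```

Whether the loop really gets vectorized depends on the compiler and flags. GCC can report it with -O3 -fopt-info-vec and Clang with -O3 -Rpass=loop-vectorize, so it is worth checking before reaching for intrinsics.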




      • #33
        Originally posted by brad0

        Intel does not even implement all instructions in any processor, so the question doesn't make sense.
        Your reply makes even less sense, 'Brad':

        Do you want to imply that AMD is incapable of overtaking Intel by offering what they can't/won't?

        Zero respect for you BTW, since you're known to insult Michael...



        • #34
          Originally posted by AdrianBc


          According to WikiChip, Zen 4 will support the same AVX-512 instruction subsets as Ice Lake, together with the extra AVX-512 instructions of Cooper Lake.

          It will not support the AVX-512 instructions added by Tiger Lake or Sapphire Rapids.

          About the speed, nothing is certain, but it is likely that Zen 4 will do only one 512-bit FMA per cycle, but probably it will be able to also do a 512-bit FADD simultaneously with the FMA. This will result in a speed at equal clock frequency that is intermediate between the Intel models with one 512-bit FMA per cycle and the Intel models with two 512-bit FMA per cycle.
          Thanks for the info!

          So it looks like it will still be inferior to my 11th gen Intel Rocket Lake's AVX-512 implementation (Tiger Lake backport to 14nm), but definitely better than 12th gen Alder Lake, where official support was removed because of the E-cores.
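
The "intermediate" speed in the quoted estimate can be worked through with a bit of arithmetic. This is only a sketch of peak per-core numbers under the assumptions stated in the quote (one 512-bit FMA plus one 512-bit FADD per cycle was speculation at the time), and peak_flops_per_cycle is a made-up helper name.

```c
/* FP32 lanes in one 512-bit vector: 512 / 32 = 16. */
enum { LANES_FP32 = 512 / 32 };

/* Peak FP32 operations per cycle per core, counting an FMA as two
 * operations per lane (multiply + add) and an FADD as one per lane. */
int peak_flops_per_cycle(int fma_units, int fadd_units)
{
    return LANES_FP32 * (2 * fma_units + fadd_units);
}
```

With one FMA unit this gives 32 operations per cycle, with one FMA plus one FADD it gives 48, and with two FMA units it gives 64. The middle figure is only reached by code that has separate additions to issue alongside the FMAs.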



          • #35
            Originally posted by Linuxxx

            Thanks for the info!

            So it looks like it will still be inferior to my 11th gen Intel Rocket Lake's AVX-512 implementation (Tiger Lake backport to 14nm), but definitely better than 12th gen Alder Lake, where official support was removed because of the E-cores.

            No, it would not be inferior in any way to Rocket Lake.

            Rocket Lake is not Tiger Lake backported to 14 nm, it is Ice Lake backported to 14 nm. Rocket Lake lacks the extra instructions of Tiger Lake, like VP2INTERSECT (which will also be absent in Zen 4).

            Zen 4 will support exactly the same AVX-512 instructions as Rocket Lake, plus the additional instructions for machine learning of Cooper Lake.

            Meanwhile, a couple of pages from the AMD PPR manual for Zen 4 have been leaked, and they confirm what was said on WikiChip about which AVX-512 subsets are supported.

            Moreover, in single-thread Zen 4 will have a slightly higher clock frequency than Rocket Lake and in multi-thread a much higher clock frequency, while also having a slightly higher IPC.

            Zen 4 will beat any Intel CPU in multi-thread benchmarks, but in single-thread it will be slower than the new Intel Raptor Lake, expected in October, and also slightly slower than Alder Lake at equal clock frequency. However, the top Zen 4 models will have a higher clock frequency, which might compensate for the lower IPC.

            The funny thing is that while for many years the Intel CPUs were saved in many benchmarks by AVX-512, now it may happen that Zen 4 will win some benchmarks against Alder Lake and Raptor Lake when they take advantage of AVX-512.


            In any case, your Rocket Lake is a very fast CPU, which supports a very good set of AVX-512 instructions (which, now that AVX-512 support will become much more widespread thanks to AMD, will be used by an increasing number of programs), so you do not have to worry about an upgrade for a few more years, certainly not before Zen 5 and Intel Meteor Lake or Lunar Lake.


            • #36
              Originally posted by AdrianBc


              No, it would not be inferior in any way to Rocket Lake.

              Rocket Lake is not Tiger Lake backported to 14 nm, it is Ice Lake backported to 14 nm. Rocket Lake lacks the extra instructions of Tiger Lake, like VP2INTERSECT (which will also be absent in Zen 4).

              Zen 4 will support exactly the same AVX-512 instructions as Rocket Lake, plus the additional instructions for machine learning of Cooper Lake.

              Meanwhile, a couple of pages from the AMD PPR manual for Zen 4 have been leaked, and they confirm what was said on WikiChip about which AVX-512 subsets are supported.

              Moreover, in single-thread Zen 4 will have a slightly higher clock frequency than Rocket Lake and in multi-thread a much higher clock frequency, while also having a slightly higher IPC.

              Zen 4 will beat any Intel CPU in multi-thread benchmarks, but in single-thread it will be slower than the new Intel Raptor Lake, expected in October, and also slightly slower than Alder Lake at equal clock frequency. However, the top Zen 4 models will have a higher clock frequency, which might compensate for the lower IPC.

              The funny thing is that while for many years the Intel CPUs were saved in many benchmarks by AVX-512, now it may happen that Zen 4 will win some benchmarks against Alder Lake and Raptor Lake when they take advantage of AVX-512.


              In any case, your Rocket Lake is a very fast CPU, which supports a very good set of AVX-512 instructions (which, now that AVX-512 support will become much more widespread thanks to AMD, will be used by an increasing number of programs), so you do not have to worry about an upgrade for a few more years, certainly not before Zen 5 and Intel Meteor Lake or Lunar Lake.
              Thanks for sharing your knowledge about CPUs, I definitely wasn't aware that Rocket Lake was a 14nm backport of Ice Lake.

              I guess I was confused by the lspci output of my PC, which contains several references to Tiger Lake:

              00:00.0 Host bridge: Intel Corporation Device 4c43 (rev 01)
              00:01.0 PCI bridge: Intel Corporation Device 4c01 (rev 01)
              00:06.0 PCI bridge: Intel Corporation Device 4c09 (rev 01)
              00:08.0 System peripheral: Intel Corporation Device 4c11 (rev 01)
              00:14.0 USB controller: Intel Corporation Tiger Lake-H USB 3.2 Gen 2x1 xHCI Host Controller (rev 11)
              00:14.2 RAM memory: Intel Corporation Tiger Lake-H Shared SRAM (rev 11)
              00:14.3 Network controller: Intel Corporation Tiger Lake PCH CNVi WiFi (rev 11)
              00:15.0 Serial bus controller: Intel Corporation Tiger Lake-H Serial IO I2C Controller #0 (rev 11)
              00:15.2 Serial bus controller: Intel Corporation Device 43ea (rev 11)
              00:16.0 Communication controller: Intel Corporation Tiger Lake-H Management Engine Interface (rev 11)
              00:1f.0 ISA bridge: Intel Corporation Device 4387 (rev 11)
              00:1f.3 Audio device: Intel Corporation Tiger Lake-H HD Audio Controller (rev 11)
              00:1f.4 SMBus: Intel Corporation Tiger Lake-H SMBus Controller (rev 11)
              00:1f.5 Serial bus controller: Intel Corporation Tiger Lake-H SPI Controller (rev 11)
              01:00.0 VGA compatible controller: NVIDIA Corporation GA104 [GeForce RTX 3060 Ti] (rev a1)
              01:00.1 Audio device: NVIDIA Corporation GA104 High Definition Audio Controller (rev a1)
              02:00.0 Non-Volatile memory controller: Sandisk Corp WD Blue SN550 NVMe SSD (rev 01)
              Therefore I just kind of expected Rocket Lake to be a backport of Tiger Lake...

              Thanks again for clarifying the situation!



              • #37
                Originally posted by Linuxxx

                Thanks for sharing your knowledge about CPUs, I definitely wasn't aware that Rocket Lake was a 14nm backport of Ice Lake.

                I guess I was confused by the lspci output of my PC, which contains several references to Tiger Lake:



                Therefore I just kind of expected Rocket Lake to be a backport of Tiger Lake...

                Thanks again for clarifying the situation!

                "lspci" describes the PCI devices based on their PCI identifiers, which in your case are the identifiers that had been used for the first time when Tiger Lake was launched, earlier than the launch of Rocket Lake.

                The PCI devices labeled Tiger Lake are not located on the CPU die with the CPU cores; they are located in a separate chip soldered on the motherboard, usually referred to as the south-bridge or the motherboard chipset. The south-bridge is not made using the same 14-nm CMOS process as the Rocket Lake CPU, but it is made using an older process, to reduce the manufacturing cost.

                From your lspci output, it appears that Rocket Lake has reused the south-bridge of Tiger Lake, which makes sense. There was no reason to design another different south-bridge chip for Rocket Lake.
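
As a rough sketch of why the names say nothing about the CPU itself: the description is looked up from the vendor and device ID pair in a name database (lspci uses the pci.ids file), and an ID that is missing from the database is printed as "Device xxxx", as in the first lines of your output. The table and the describe_device function below are hypothetical stand-ins, and the single table entry uses a deliberately invalid vendor ID so that it cannot be mistaken for a real device.

```c
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>

struct pci_name {
    uint16_t vendor;
    uint16_t device;
    const char *name;
};

/* Stand-in for the pci.ids database. The entry is a made-up
 * placeholder (0xffff is not a valid vendor ID), not real data. */
static const struct pci_name id_table[] = {
    { 0xffff, 0x0001, "Example Controller" },
};

/* Writes a description of the device into buf: the database name if
 * the ID pair is known, otherwise the generic "Device xxxx" form. */
void describe_device(uint16_t vendor, uint16_t device, char *buf, size_t len)
{
    for (size_t i = 0; i < sizeof id_table / sizeof id_table[0]; i++) {
        if (id_table[i].vendor == vendor && id_table[i].device == device) {
            snprintf(buf, len, "%s", id_table[i].name);
            return;
        }
    }
    snprintf(buf, len, "Device %04x", (unsigned)device);
}
```

Calling describe_device(0x8086, 0x4c43, ...) with this table yields "Device 4c43", matching the host bridge line in the output above. A name only appears once the database has an entry for that ID, and the entry carries whatever product name the ID was first registered under.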

