AMD Talks Up Zen 4 AVX-512, Genoa, Siena & More At Financial Analyst Day


  • zamroni111
    replied
    Originally posted by theriddick View Post
    The CPUs sound great to me; using a 5700G atm, which I kind of regret buying due to its stripped-down cache and the fact I haven't really made good use of the iGPU (plus the iGPU can't do video encoding correctly!).

    I think this time around I'll be skipping the 7000-series AMD GPU cards however (70% certain), and going with a 4080 or something.
    I know it sounds lame but I want functional upscaling (FSR1 looks bad and FSR2 is in a single game) and ray-tracing support.

    My 6800XT has served me well, but I've had just as many if not more problems than with the 1080Ti I had previously!
    For reliability, use the Radeon Pro driver instead of the Adrenalin driver.



  • ddriver
    replied
    Originally posted by Linuxxx View Post
    So, will AVX-512 support mean all instructions are going to be implemented or only the ones benefitting AI/ML workloads?

    And will it be similar to first gen Zen 1 where AVX2 was realized as 2 × 128 bits wide registers, so maybe AVX-512 as 2 × 256 bits?
    There are flavors upon flavors of AVX-512; it is really a mess, thanks Intel. Some of the instructions are rather exotic and not that useful for general use. Intel itself most likely doesn't have, and will not have, a CPU that supports "all AVX-512 instructions". In addition to widening the vectors to 512 bits, AVX-512 offers quite a lot of purpose-specific instructions. AMD is pragmatic: it will support the instructions that have a high transistor ROI. They also have the option to emulate some; that way the CPU can qualify for the higher-tier SIMD level by signalling "support" and emulating the corner cases, which will perform better than having an incomplete instruction set and falling back to a lower SIMD version.

    And yes, Zen 4 will implement it as 2x256-bit registers. This worked out great for Zen 1, and seeing how Intel CPUs struggle to stay below 200 watts and cannot even sustain their clocks during AVX workloads, there is really no need for AMD to get ahead of itself here.

    Zen 4's ace appears to be its clocks: not cramming the chiplet full of additional circuitry and keeping it small appears to have achieved an unprecedented clock-speed boost for AMD. And it is not just about overall transistor count or power usage; there is an advantage to having the chip physically smaller, since that raises the top clock it can synchronize at. When Zen 2 switched to 7 nm, AMD realized the chiplet had become so small it could unite the two CCX modules into one. At 12 nm AMD just couldn't synchronize across the entire chip, but once they got the Zen 2 numbers they estimated the two CCXs could be united and a whole additional level of synchronization removed outright.

    Which is why they didn't hurry to boost the chiplet core count: technically they could build 16-core chiplets smaller than Zen 1's, but there is no way they could hit the claimed 5.8 GHz with a chip that big. That is also why they are reserving Zen 4c for servers, which do not chase top clocks anyway. 5 nm is expensive, so it is understandable that AMD is adopting a design that leverages increased clocks and a smaller chip, holding the low-hanging IPC gains in reserve for Intel's future contenders. They could have achieved larger IPC gains with a bigger chip, but would lose frequency, ending up costing more per FLOP than a more conservative and snappy design. And generally speaking, it is not a good idea to get ahead of yourself just because you can.

    Zen 5 will extend on that: it will likely remove the L3 cache from the chiplets altogether, removing another level of synchronization, as the CCX will sync at the L2 cache, and they will stack the chiplets onto a fabric substrate that carries the L3. Having no L3 on die will allow cramming far more circuitry into the same footprint, so with Zen 5 AMD can offer a more significant IPC increase without sacrificing clocks.



  • smitty3268
    replied
    Originally posted by theriddick View Post
    I know it sounds lame but I want functional up-scaling (FSR1 looks bad and FSR2 is in a single game) and ray-tracing support.
    Pretty sure it's in 3 now, but yeah, it needs to be a lot more widespread.

    Deathloop, God of War, and Farming Simulator 22, for some reason.



  • brad0
    replied
    Originally posted by Linuxxx View Post
    So, will AVX-512 support mean all instructions are going to be implemented or only the ones benefitting AI/ML workloads?
    Intel does not even implement all of the instructions in any one processor, so the question doesn't quite make sense.



  • WorBlux
    replied
    Originally posted by Spacefish View Post

    I am dreaming of an FPGA "co-processor" integrated into the CPU. On the software side you would have some sort of "compiler" which translates your algorithms/kernels into netlists for the FPGA.
    The FPGA nodes would be connected to CPU registers, and you would have x86 instructions to load FPGA netlists from memory and other ones to "trigger" them.
    That FPGA net could do things which you would typically do in multiple normal instructions.

    This could accelerate algorithms where a later stage depends on data from an earlier stage, so no real vectorization/SIMD is possible, or where there are a lot of branches. These branches could just run in parallel on the FPGA.

    For example, if you have something like:
    y[i] = y[i-1] * x[i]

    or

    if (x > 3) x += 2; else x += 3;

    you could just have two binary full adders which compute x+2 and x+3 in parallel, and another logic LUT which looks at the relevant bits of x and generates a 1 if x > 3 or a 0 otherwise. The output of one of the binary full adders is connected to the result register depending on the 1/0 signal of the condition-evaluation net.
    That's quite power-inefficient, but depending on transistor speed, you could do all of that in one clock cycle.

    With their recent Xilinx acquisition, AMD has FPGA-based accelerator cards in their portfolio (the Alveo lineup). Guess the "Smart NIC" mentioned in the picture is one of them, maybe combined with a Zen 4 in one package / on the same card?
    The latter example can already be done with cmov or a vector predicate.

    And in the first example there is necessarily one multiply's latency between stores. Even doing it in an FPGA you have to wait out that latency, and it's probably worse there, as LUTs run slower than your ALU; the main gain is that the decoder doesn't have to keep shoving instructions into the pipeline.

    But it would be more beneficial to mix vector and scalar registers, keep a scalar as an intermediate, and let hardware counters issue the loop, Libre-SOC Simple-V style.

    And there are issues with connecting the FPGA directly to the architectural registers; a lot of potential hazards and bottlenecks. It would be better to use FPGA blocks as asynchronous accelerators operating on memory rather than getting so close to the CPU's critical path.
    Last edited by WorBlux; 09 June 2022, 10:00 PM.



  • rclark
    replied
    Sounds great! Thanks for the news! But I wonder how many of us will make the 'jump', having to buy at least a motherboard/memory/CPU out of the gate. I mean those of us already on 5000-series CPUs. I can certainly see data centers, for example, wanting the extra performance. I did jump on the first Ryzen 1600 when it was introduced and then just upgraded CPUs (which was a 'very' nice trick with the AM4 socket), but back then I actually 'needed' better performance. Now on a 5000-series CPU, VMs just 'fly', all the software I use is very responsive, and compiles are quick. So, as for myself, I will initially just watch how this plays out, as I am swimming in an ocean of unused performance... and loving all that elbow room, so to speak.
    Last edited by rclark; 09 June 2022, 09:28 PM.



  • theriddick
    replied
    The CPUs sound great to me; using a 5700G atm, which I kind of regret buying due to its stripped-down cache and the fact I haven't really made good use of the iGPU (plus the iGPU can't do video encoding correctly!).

    I think this time around I'll be skipping the 7000-series AMD GPU cards however (70% certain), and going with a 4080 or something.
    I know it sounds lame but I want functional upscaling (FSR1 looks bad and FSR2 is in a single game) and ray-tracing support.

    My 6800XT has served me well, but I've had just as many if not more problems than with the 1080Ti I had previously!



  • shmerl
    replied
    What about 16-core CPUs with 3D V-cache?



  • Spacefish
    replied
    Originally posted by EvilHowl View Post
    I'm really interested on the AMD Instinct MI300, which is basically an incredibly big and powerful APU with lots of memory bandwidth. Just imagine having that as a consumer product.
    I am dreaming of an FPGA "co-processor" integrated into the CPU. On the software side you would have some sort of "compiler" which translates your algorithms/kernels into netlists for the FPGA.
    The FPGA nodes would be connected to CPU registers, and you would have x86 instructions to load FPGA netlists from memory and other ones to "trigger" them.
    That FPGA net could do things which you would typically do in multiple normal instructions.

    This could accelerate algorithms where a later stage depends on data from an earlier stage, so no real vectorization/SIMD is possible, or where there are a lot of branches. These branches could just run in parallel on the FPGA.

    For example, if you have something like:
    y[i] = y[i-1] * x[i]

    or

    if (x > 3) x += 2; else x += 3;

    you could just have two binary full adders which compute x+2 and x+3 in parallel, and another logic LUT which looks at the relevant bits of x and generates a 1 if x > 3 or a 0 otherwise. The output of one of the binary full adders is connected to the result register depending on the 1/0 signal of the condition-evaluation net.
    That's quite power-inefficient, but depending on transistor speed, you could do all of that in one clock cycle.

    With their recent Xilinx acquisition, AMD has FPGA-based accelerator cards in their portfolio (the Alveo lineup). Guess the "Smart NIC" mentioned in the picture is one of them, maybe combined with a Zen 4 in one package / on the same card?



  • zamroni111
    replied
    Siena cost cutting could mean DDR4.
    AMD would just need to make a different I/O chiplet for that.

