Announcement

**pegasus** · 31 October 2020, 03:40 PM

This is a wonderful opportunity to throw a bunch of OpenMP enabled codes at it and see what it brings ...

The question now is this: is there any code out there that is using OpenMP and is able to scale to hundreds of threads? Maybe something from the world of big numa systems ... but those usually also want huge memories so might not fit on a gpu ... Anyway, numbers will be interesting.

**Jumbotron** · 01 November 2020, 01:13 PM

Oooooo....this gets my Spidey Senses tingling < and no my name is NOT Peter so it's not my "Peter Tingle" >.

Here's why. With this plumbing now in place and only to get better still, this will play VERY nicely to AMD's next gen CPU/GPU lineup and Interconnect architecture. Namely ZEN 4 / Genoa / 3rd Generation Infinity Architecture. Pretty much "Everything Connected to Everything".

Bridgman recently confirmed that with ZEN 4 / Genoa / 3rd Gen IA, the dream of AMD's FUSION initiative, started around 2011 and first implemented with HSA APUs, will come to fruition for non-APU CPUs and discrete RDNA GPUs.

Timeframe.....late 2021 into 2022. < giggling like a little kid >

Ubuntu 22.04 and onward or Fedora/RHEL 3x.x with OpenMP and also rocking out on ROCm on a four ZEN 4 CPU box with each CPU sporting 32-64 cores tied to 8 RDNA/CDNA GPUs across a 3rd Gen Infinity Architecture interconnect along with a Xilinx FPGA SmartNIC and the whole box stuffed to the brim with DDR 5..maybe 6.

Oh yeah....** Peter Tingle **

**elldekaa** · 02 November 2020, 08:58 AM

Originally posted by pegasus

This is a wonderful opportunity to throw a bunch of OpenMP enabled codes at it and see what it brings ...

The question now is this: is there any code out there that is using OpenMP and is able to scale to hundreds of threads? Maybe something from the world of big numa systems ... but those usually also want huge memories so might not fit on a gpu ... Anyway, numbers will be interesting.

It's not that simple: you need to add a few pragmas to handle data movement between CPU and GPU memory if you want to program GPUs with OpenMP.
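To make the point about pragmas concrete, here is a minimal sketch in C of what that data movement looks like with OpenMP target offloading. The function name `saxpy_offload` and the kernel itself are made up for illustration and are not taken from AOMP or from anyone's code in this thread. The `map` clauses are the part in question: they tell the runtime which arrays to copy to the device and which to copy back.

```c
#include <assert.h>

/* Hypothetical example kernel: y[i] = a * x[i] + y[i].
 * The name and signature are assumptions made for this sketch. */
void saxpy_offload(int n, float a, const float *x, float *y)
{
    assert(n >= 0);

    /* map(to: x[0:n])     : copy x from host to device before the region.
     * map(tofrom: y[0:n]) : copy y to the device, then back to the host
     *                       once the region has finished. */
    #pragma omp target teams distribute parallel for \
        map(to: x[0:n]) map(tofrom: y[0:n])
    for (int i = 0; i < n; ++i) {
        y[i] = a * x[i] + y[i];
    }
}
```

If the compiler is run without OpenMP enabled, the pragma is ignored and the loop simply runs on the CPU, so the result is the same either way. With `-fopenmp` and an offload target configured (the target flags are toolchain-specific, so check your compiler's documentation), the loop body runs on the GPU and the two `map` clauses account for all of the host/device traffic.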

**pegasus** · 02 November 2020, 09:06 AM

Yes, I expected data movement to still be an issue ... Sigh, when will AMD's promise of HSA finally materialize?

**Jumbotron** · 02 November 2020, 05:51 PM

Originally posted by pegasus

Yes, I expected data movement to still be an issue ... Sigh, when will AMD's promise of HSA finally materialize?

So let me take a crack at that.

For background, the cut and paste below is from the Phoronix article on AMD Navi Blockchain support being added to the Linux Kernel 5.11.
During that thread I engaged in a VERY enlightening back and forth with Bridgman from AMD. He dropped something in one of his threads that blew me away. Simply that in terms of "HSA" today's Zen CPUs and GPUs are not where AMD was back in 2014 with the Kaveri / Carrizo / Bristol Ridge APUs. Have a look.....

Originally posted by Jumbotron
Ok....so. A: When do you anticipate that AMD dGPUs will ACTUALLY catch up to your APUs of 2014 and in what way?

And.........B: What does this entail for AMD's APUs going forward, particularly the upcoming Zen3 / RDNA 2 based "Cezanne" APUs?

Oh...let me add C: Does Zen 4 / Genoa / 3rd Gen Infinity Architecture herald this "catching up with the APU capabilities we had in 2014" ?

( from Bridgman at AMD )

A: probably mid-2021 as we finish integrating HMM into our compute products and get the same kind of GPU access to unpinned OS-allocated memory

B: we actually took iGPUs a step back in the short term, using GPUVM code paths rather than ATC/IOMMUv2 to let them coexist with and be compatible with our dGPUs. I think we started this with Picasso and may do the same with Raven Ridge. Should make it easier to maintain consistent support across the entire stack and allow a developer working on a laptop to run the same code on a server without issues

C: yes, that should be the final missing piece of the solution

That from Bridgman REALLY connected a LOT of dots for me.

HSA has fired my imagination since the advent of the FUSION efforts of AMD going all the way back to 2011. FUSION and HSA were the efforts to get zero copy, everything connected to everything, first in APUs in wafer but then branching out over the motherboard itself to non-APU AMD CPUs and discrete Radeon GPUs and even out to racks of AMD CPUs and GPUs.

But the whole HSA movement and the subsequent HSA Consortium seemed to stagnate and become very moribund after Lisa Su and company came aboard to try to save AMD from what had become a period of time where AMD was just hemorrhaging money hand over fist. Hence the new vision of ZEN and RDNA/CDNA to get back into the more lucrative HPC / Cloud / Supercomputer markets. Which they have with tremendous success.

But the more I read over the years about AMD's new Infinity Fabric, which is a superset of HyperTransport and which turned into Infinity Architecture, the more I became convinced that Infinity Fabric / Architecture was going to be the HSA of the ZEN / RDNA-CDNA Era at AMD. Hence my question to Bridgman about that theory of mine.

And as you can see...he agreed !!

But here is what it is taking to FINALLY bring HSA, as it was envisioned all those years ago, to fruition in the near future.

1: HMM ( Heterogeneous Memory Management in the Linux Kernel )
2: Full OpenMP support finally worked out for both AMDGPU and ROCm
3: ZEN 4 / Genoa / 3rd Generation Infinity Architecture.

Numbers 1 and 2 are almost plumbed completely in.
Number 3 will have to await the rollout of ZEN 4 and its enterprise version, code-named Genoa, which will replace Milan in servers and HPC/Supercomputers. And 3rd Generation Infinity Architecture is simply Infinity Fabric, which once again is the superset of HyperTransport, in place not just from CPU to CPU but CPU to GPU to FPGA to DSP......EVERYTHING....with full cache coherency and zero copy of memory.

That....in a nutshell....is HSA. But HSA that's not JUST for APUs anymore.

Timeline.....late 2021 into 2022.

Hang on pegasus. The Dream is still alive. It just took a little time and it changed form a bit. But Lisa Su and company had a plan. And they are indeed executing that plan with purpose.

**pegasus** · 02 November 2020, 06:12 PM

Now you've opened an interesting can of worms ... You know, I still have an A12-9800 pc around with an old ubuntu installation and proprietary amd driver just so I can tinker with it and poke at HSA things there when I find some time and motivation. I'm very aware that this is still at the base of AMD's plan forward, just by glancing at their proposed exascale hpc architectures. I just have my fingers crossed that they manage to bring their HIP compiler (or is it transpiler?) to a level where porting cuda code to run on such an architecture is just a line or two in the Makefile of a largeish cuda codebase ... I feel that this is the key issue for them going forward. Once this is solved, the flood gates open and intel+nvidia in hpc is history.

**Jumbotron** · 02 November 2020, 09:23 PM

Originally posted by pegasus

Now you've opened an interesting can of worms ... You know, I still have an A12-9800 pc around with an old ubuntu installation and proprietary amd driver just so I can tinker with it and poke at HSA things there when I find some time and motivation. I'm very aware that this is still at the base of AMD's plan forward, just by glancing at their proposed exascale hpc architectures. I just have my fingers crossed that they manage to bring their HIP compiler (or is it transpiler?) to a level where porting cuda code to run on such an architecture is just a line or two in the Makefile of a largeish cuda codebase ... I feel that this is the key issue for them going forward. Once this is solved, the flood gates open and intel+nvidia in hpc is history.

LOL....I am typing this on a Lenovo laptop with an A12-9700 and my desktop is an HP with an A12-9800.

I was needing a new laptop and desktop but was disappointed in Carrizo and was still hoping that AMD was going to extend HSA to non APU parts and GPUs. After Lisa Su came on board as CEO but before ZEN was rolled out, AMD launched one final revision of Carrizo, namely Bristol Ridge, for BOTH laptops and desktops, whereas Carrizo was just for laptops. That's when I decided to get both a desktop and a laptop with Bristol Ridge as it was already a known fact that AMD under Lisa Su was putting HSA on the back burner with ZEN and Vega. ( BTW...I got a great deal on Ebay for both as new open boxed items, $250.00 for each. )

After Zen and Vega came out I saw how HSA along with HSAIL just seemed to get ignored, at least by the marketing crew at AMD and the press. What came out instead was ROCm and HIP. I can understand why. Nvidia rules HPC and Supercomputing through CUDA. It's no longer just about having the best CPUs, which AMD now has by a mile over Intel. AMD knew they were leaving a LOT of money on the table by equipping Supercomputers with AMD CPUs but with Nvidia GPUs and CUDA. So...let's make this translation layer through HIP so we can credibly market a computer with ALL AMD from CPUs to GPUs and now FPGAs with Xilinx.

But HIP, in my humble and flawed opinion, is a transitional technology until AMD has a credible story and offering with ROCm and the market decides that between OpenMP, HMM, ROCm and OpenCL, AMD has a credible alternative to CUDA. And that's BEFORE you add the new offering from Intel with oneAPI.

I've lived too long and have had too much of my prognostication blown up in my face to say that Intel or Nvidia in HPC could be history with a resurgent AMD post 2022 with their ZEN 4 / Genoa / Infinity Architecture. But what I think will happen, with the combo of OpenMP, HMM, ROCm, and AMD Infinity Architecture tying everything together in cache coherency and zero copy to and from everything, is that it will truly give Intel and Nvidia the competition they so richly deserve.

And it WILL unleash a Price War the likes of which we haven't seen, at least in the HPC space. It will turn out to be the silicon version of the War of the Roses. And I think it will brutalize Intel the worst. Nvidia will still have a virtual lock with CUDA and they will have a cheaper alternative with ARM. But when the HPC market continues to buy ALL AMD kit.....then even Nvidia will have to bow to market conditions and drop prices. And that may hurt them later on this decade since they will still be digesting their purchase of ARM.

Make no mistake though. AMD has BY A MILE the most interesting hardware story. 3rd Generation Infinity Architecture is a marvel. That architecture will be baked into every ZEN 4 product and every RDNA and CDNA discrete GPU AMD produces and EVERY AMD MOTHERBOARD. 3rd Gen IA IS...HSA...for the entire, freaking computer.

Now...let me bake your noodle a bit before I wrap this up. Tying in Gen-Z, the interconnect protocol that ties racks together and whose consortium AMD is part of, could extend HSA / Infinity Architecture out from each cabinet and across racks. You could have full cache coherency and zero copy across the entire Data Center.

<giggling like a little kid>

And think....all that I mentioned above...is pretty much modeled in wafer inside your Bristol Ridge APU. I have always thought that Bristol Ridge, being the pinnacle of APU tech and, sadly, possibly the last until a ZEN 4 based APU, is the logical answer to a hypothetical question in a Comp. Sci class where the professor says to the class....

"So, class, I have a special assignment for you. You all know how a typical Supercomputer installation is designed and installed. You know the typical components. You have a cabinet. Inside that cabinet are multiple racks of boards with CPUs. These boards are connected to additional racks of GPUs and additional racks of storage. They are all tied together with high speed interconnects. Some even have specialized DSPs and ASICs as well. Some have high speed RAM on each respective board and some have pools of RAM once again tied together with high speed interconnects. These high speed interconnects and their firmware also allow Cache Coherency between the different tech nodes....CPU, GPU, Storage, DSPs and ASICs.

Ok....so you know that at least in the aggregate. You can visualize that.

Now...here is your assignment.

Recreate a typical Supercomputer design.....in a wafer.

And to make it more interesting. Make it a working, general purpose CPU that can work with the three major OS's....Windows, MacOS and Linux."

And as the semester comes to an end....one student comes forward with that very design. That student's name is....AMD. And that student's design was called "Bristol Ridge".

**pegasus** · 03 November 2020, 04:11 AM

It's nice to dream big

But then you wake up and reality strikes you ... physics dictates that you don't want to move your data over large distances if you want to stay energy efficient. So HSA at datacenter scale, while tempting, doesn't really make sense. Ask HPE and their The Machine. Some type of ram integrated onto the cpu is a more sensible solution; this is what Fujitsu is doing for the Fugaku. Times of cache coherent interconnects dangling between racks are long gone, cpus are too fast these days.

**Jumbotron** · 03 November 2020, 06:28 PM

Originally posted by pegasus

It's nice to dream big

But then you wake up and reality strikes you ... physics dictates that you don't want to move your data over large distances if you want to stay energy efficient. So HSA at datacenter scale, while tempting, doesn't really make sense. Ask HPE and their The Machine. Some type of ram integrated onto the cpu is a more sensible solution; this is what Fujitsu is doing for the Fugaku. Times of cache coherent interconnects dangling between racks are long gone, cpus are too fast these days.

Funny you mentioned HPE and The Machine. Gen-Z actually arose from some of the work done by HPE on The Machine. At first Gen-Z was going to be a catch all interconnect for inside the rack and outside the rack. But it looks like now, in the wake of all the different interconnect protocols that sprang up from the delay of PCIe 4, what with CXL from Intel, OpenCAPI from IBM, CCIX from ARM and Infinity Fabric, now Architecture, from AMD, the market has forced a truce between Intel's CXL for the internal interconnects and Gen-Z for interconnects outside the rack.

It seems that even though AMD's Infinity Architecture, being a superset of HyperTransport, which was already superior to Intel's Omnipath, is superior still to CXL, no one gets fired for buying...or buying into....Intel's marketing. Plus, CXL is a superset of PCIe 5 so that helps as well when it comes to board manufacturers, so there is always that.

But CXL as well as PCIe 5 still have issues going outside the rack. And that's where Gen-Z comes in. Although just as capable as CXL for interior interconnects, if not more so, it was chosen to be the interconnect protocol going outside the rack after it became clear that Intel and CXL would win the day inside the rack. It didn't hurt that both Dell and HP along with AMD were behind Gen-Z.

Plus, there are enough architectural and thematic similarities between the two so that the two camps could sign a memorandum of agreement to combine forces and make sure that CXL and Gen-Z work together seamlessly in common cause.

Now...to your argument that you don't want to move your data over large distances. You are most correct on this. However, as we have seen at the wafer level, the era of the pure SoC on x86 is over. At least HSA based SoCs. Intel could never do it, choosing to instead go the SiP route where you have a tiny board with both the CPU and the GPU tightly coupled on this tiny square but not integrated together inside the die. Of course, AMD accomplished this with FUSION and the APUs. But now even AMD is going the chiplet route as we butt up against the very physics you speak of, which is making it impossible, at least on the x86 side, to cram even more billions of transistors and various parts that used to reside on the motherboard or a discrete card onto the very die itself in one cohesive whole. ARM based chips can still do this because...well...RISC. In large measure anyway.

In the HPC / Supercomputing realm we have been seeing disaggregation of parts for some time and that is only accelerating with AI as we need GOBS and GOBS more memory, not to mention storage, to hold the huge datasets and feed the various bits of chippery. You can't really put enough RAM closely around or even on each CPU because A: that makes each CPU bigger, hotter, more complex and more expensive and B: you open yourself up to wafer failures at manufacturing and lower yields, which can have the knock on effect of forcing you to charge more for each CPU and delaying a rollout of the next gen product. ( cough cough, Intel, cough cough ). Yes, theoretically, it meets the requirement of putting your memory as close as possible. But now that's actually becoming impossible at scale. Hence the Memory Pool. Racks and racks of nothing but memory to feed these Super monsters.

And what is going to be used to hook CPU and GPU cabinets up to these Memory Pools? Gen-Z, which has the atomics to keep up with CXL or even Infinity Fabric, and which PCIe lacks.

And what chip will facilitate that data flow between racks of Supers connected to Memory Pools via Gen-Z?

Xilinx FPGAs

Oh...lookee. AMD just bought Xilinx. Funny that.

So AMD has their own in house interconnect Fabric called Infinity which is superior to Intel's CXL. They were a founding member of Gen-Z. They were a founding member of CCIX which includes ARM and Xilinx. And they just bought Xilinx.

You are correct. You want to keep everything as close as possible for HPC. But Moore's Law is running straight up against Quantum Physics. SiPs and Chiplets are increasingly becoming the order of the day. Disaggregation of components, particularly Data Storage and RAM pools, is also becoming the order of the day. This is why Gen-Z is so important to mitigate that growing distance between data and compute.

But I still hope like hell that AMD revisits FUSION and HSA as they ultimately did with Bristol Ridge where everything is on the die. It's such a design marvel. As far as I know, as Bridgman said, it is the only CPU with an integrated GPU that shares a 48 bit memory address between the CPU and GPU for zero copy data transfers. Not even the ZEN based APUs have that anymore. He actually mentions some kind of GPU VM that is incorporated in the ZEN APUs between its CPU cores and its GPU so that code for the GPU can be used either with the APU's GPU or a discrete GPU.

It would be wonderful to see, hopefully around the 2024 timeframe a ZEN 4 / RDNA 3 based APU with Infinity Fabric / Architecture built in the die itself, along with an ARM DSP for real time audio effects and object oriented surround sound and an integrated Xilinx FPGA for real time video encode and decode. And hell...while I'm wishing let's throw in some HBM 3+ memory on that thing as well just to show off.

Hmmmm.....did I just describe the APU that will power the Sony PS-6 ??

For further reading, if you are interested, I have provided links below to articles on Gen-Z from a site called The Next Platform. It's from the same publishers as The Register, which is one of my go-to tech info sites. The Next Platform is my go-to site for HPC and Supers. It's a great site.

I also have provided a link below to an old video from Server and HPC provider Penguin Computing. This was right after AMD's Kaveri came out, which was the first HSA compliant APU. Penguin was building and marketing high density servers using Kaveri APUs and HSA. Of course...things have changed for AMD since that time as well as for Penguin. But it is a taste of where the vision of HSA was taking some folks. Perhaps we will get there again someday.

Gen-Z Memory Servers Loom On The Horizon

https://www.nextplatform.com/2020/01/09/gen-z-memory-servers-loom-on-the-horizon/

We have been waiting for a long, long time for the ionic bond between compute and main memory to be softened to something a little more covalent and

APU Clusters and "Watts" Powering the Future of HPC

https://www.youtube.com/watch?v=yifsXMkCQz4

In this video from the AMD booth at SC14, Phil Pokorny from Penguin Computing presents: APU Clusters and "Watts" Powering the Future of HPC. Learn more: htt...


AOMP 11.11 Released For LLVM Clang OpenMP Offloading To Radeon GPUs
