RADV Vulkan Driver Finally Adds VK_KHR_synchronization2 Support


  • RADV Vulkan Driver Finally Adds VK_KHR_synchronization2 Support

    Phoronix: RADV Vulkan Driver Finally Adds VK_KHR_synchronization2 Support

    The Mesa Radeon Vulkan driver "RADV" has added support for the prominent VK_KHR_synchronization2 extension introduced earlier this year...

  • #2
    Finally? That makes it sound like some urgent thing that's been delayed forever.

    Comment


    • #3
      I'm not an advanced user; I just completed "the Vulkan tutorial" a few times, and by far the worst thing I found about Vulkan is the over-engineered descriptor API. I was constantly asking myself: why is such a tiny piece of GPU state (descriptors take up very little memory) so overly complicated in Vulkan?


      PS: Descriptors are roughly the equivalent of uniforms in OpenGL.
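
      For anyone who hasn't been through it, the ceremony being complained about looks roughly like this. It is only a sketch following the usual tutorial pattern for binding a single uniform buffer; the device and buffer handles are assumed to come from the normal setup code:

      Code:
      #include <vulkan/vulkan.h>

      /* Sketch: the layout -> pool -> allocate -> write dance needed to bind
       * one uniform buffer. `device`, `buffer` and `size` are assumed to have
       * been created by the usual instance/device/buffer setup. */
      void bind_one_uniform_buffer(VkDevice device, VkBuffer buffer, VkDeviceSize size)
      {
          VkDescriptorSetLayoutBinding binding = {
              .binding = 0,
              .descriptorType = VK_DESCRIPTOR_TYPE_UNIFORM_BUFFER,
              .descriptorCount = 1,
              .stageFlags = VK_SHADER_STAGE_VERTEX_BIT,
          };
          VkDescriptorSetLayoutCreateInfo layoutInfo = {
              .sType = VK_STRUCTURE_TYPE_DESCRIPTOR_SET_LAYOUT_CREATE_INFO,
              .bindingCount = 1,
              .pBindings = &binding,
          };
          VkDescriptorSetLayout setLayout;
          vkCreateDescriptorSetLayout(device, &layoutInfo, NULL, &setLayout);

          VkDescriptorPoolSize poolSize = {
              .type = VK_DESCRIPTOR_TYPE_UNIFORM_BUFFER,
              .descriptorCount = 1,
          };
          VkDescriptorPoolCreateInfo poolInfo = {
              .sType = VK_STRUCTURE_TYPE_DESCRIPTOR_POOL_CREATE_INFO,
              .maxSets = 1,
              .poolSizeCount = 1,
              .pPoolSizes = &poolSize,
          };
          VkDescriptorPool pool;
          vkCreateDescriptorPool(device, &poolInfo, NULL, &pool);

          VkDescriptorSetAllocateInfo allocInfo = {
              .sType = VK_STRUCTURE_TYPE_DESCRIPTOR_SET_ALLOCATE_INFO,
              .descriptorPool = pool,
              .descriptorSetCount = 1,
              .pSetLayouts = &setLayout,
          };
          VkDescriptorSet set;
          vkAllocateDescriptorSets(device, &allocInfo, &set);

          VkDescriptorBufferInfo bufferInfo = { .buffer = buffer, .offset = 0, .range = size };
          VkWriteDescriptorSet write = {
              .sType = VK_STRUCTURE_TYPE_WRITE_DESCRIPTOR_SET,
              .dstSet = set,
              .dstBinding = 0,
              .descriptorType = VK_DESCRIPTOR_TYPE_UNIFORM_BUFFER,
              .descriptorCount = 1,
              .pBufferInfo = &bufferInfo,
          };
          vkUpdateDescriptorSets(device, 1, &write, 0, NULL);
      }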

      Comment


      • #4
        Originally posted by cl333r View Post
        I'm not an advanced user; I just completed "the Vulkan tutorial" a few times, and by far the worst thing I found about Vulkan is the over-engineered descriptor API. I was constantly asking myself: why is such a tiny piece of GPU state (descriptors take up very little memory) so overly complicated in Vulkan?

        PS: Descriptors are roughly the equivalent of uniforms in OpenGL.
        Because it is more complex than it first appears.

        Practical guide to vulkan graphics programming

        A descriptor set allocation will typically be allocated in a section of GPU VRAM.
        You see this kind of statement in documentation around descriptors, but it is a rabbit-hole statement. It is true that descriptors take up very little memory. What makes it tricky is the "typically be allocated in a section of GPU VRAM" part, which is not well understood.

        Take a GPU with something like GPUDirect Storage, i.e. one that is directly able to request data from an NVMe drive or hard drive. So you ask for a descriptor and it might be stored on the NVMe instead of in GPU VRAM. Of course there is going to be a major performance difference between something stored on an NVMe/hard drive and something in VRAM.

        And there is more to it than just this.
        The system world would have been a simpler place if InfiniBand had fulfilled its original promise as a universal fabric interconnect for linking all

        CXL memory, where it could be possible to have multiple CXL devices in a system providing memory to the GPU.

        Yes, it used to be system RAM and VRAM and that was it.
        Going forward:
        1) We have system memory that may or may not be one piece. This is the NUMA problem.
        2) We have VRAM that, inside a GPU, may not be a single allocation system. So a GPU might have, say, two descriptor sets because it has two completely different memory management units. Yes, that is now a NUMA problem inside GPU cards. We had this problem a long time ago with multi-GPU-on-a-single-card designs, but we never fixed that problem in OpenGL back then.
        3) Going forward, totally VRAM-less GPU devices will be possible, as in all RAM provided by CXL modules, and of course the system could have many CXL modules.

        cl333r, the reality is that going forward memory has become a lot more complicated, so it is no longer just "allocate a tiny field". You have to have enough structure to say where you really want that tiny field allocated, because there are many more places where a GPU could be allocating that tiny field in the future. The sketch below shows how much of that choice Vulkan already exposes.
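
        To make that concrete, here is a sketch (nothing roadmap-specific, just plain Vulkan 1.x as it exists today) of how a device already advertises several memory heaps and memory types that the application has to choose between; `physicalDevice` is assumed to come from vkEnumeratePhysicalDevices():

        Code:
        #include <stdio.h>
        #include <vulkan/vulkan.h>

        /* Print every memory heap and memory type the device exposes.
         * On a discrete GPU this typically shows a device-local heap (VRAM)
         * plus one or more host-visible heaps backed by system RAM. */
        void print_memory_layout(VkPhysicalDevice physicalDevice)
        {
            VkPhysicalDeviceMemoryProperties props;
            vkGetPhysicalDeviceMemoryProperties(physicalDevice, &props);

            for (uint32_t i = 0; i < props.memoryHeapCount; i++) {
                printf("heap %u: %llu MiB%s\n", i,
                       (unsigned long long)(props.memoryHeaps[i].size >> 20),
                       (props.memoryHeaps[i].flags & VK_MEMORY_HEAP_DEVICE_LOCAL_BIT)
                           ? " (device local)" : "");
            }
            for (uint32_t i = 0; i < props.memoryTypeCount; i++) {
                printf("type %u -> heap %u, property flags 0x%x\n", i,
                       props.memoryTypes[i].heapIndex,
                       props.memoryTypes[i].propertyFlags);
            }
        }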

        Comment


        • #5
          Originally posted by oiaohm View Post
          Take a GPU with something like GPUDirect Storage, i.e. one that is directly able to request data from an NVMe drive or hard drive. So you ask for a descriptor and it might be stored on the NVMe instead of in GPU VRAM. Of course there is going to be a major performance difference between something stored on an NVMe/hard drive and something in VRAM.
          Can you point to any sort of API reference which says it's even possible to do specifically that, or are you just making shit up?

          It seems to me that GPUDirect is just about enabling DMA transfers between GPU and storage - not actually putting GPU runtime state in nonvolatile memory on the NVMe drive. If you know differently, let's see some proof.

          Originally posted by oiaohm View Post
          https://www.nextplatform.com/2021/09...ory-hierarchy/
          CXL memory, where it could be possible to have multiple CXL devices in a system providing memory to the GPU.
          A memory hierarchy doesn't mean the memory will all be accessed in exactly the same way. Even when using directly-connected NVDIMMs, you don't treat them as if they're just more DRAM. Optane is the best NVDIMM tech we currently have, yet it's still an order of magnitude slower and has multiple orders of magnitude worse endurance than DRAM. If you blindly treated them like DRAM, system performance would crater and you'd be replacing the NVDIMMs of a system like every couple weeks.

          Originally posted by oiaohm View Post
          3) Going forward, totally VRAM-less GPU devices will be possible, as in all RAM provided by CXL modules, and of course the system could have many CXL modules.
          That will never happen. GPUs are incredibly bandwidth-hungry. Even CXL 2.0 is down by an order of magnitude from what the latest GPU-like accelerators are getting from their HBM configurations. The point of CXL memory devices is to have a large pool of shared memory, but not to completely replace locally-attached DRAM.

          Originally posted by oiaohm View Post
          cl333r, the reality is that going forward memory has become a lot more complicated, so it is no longer just "allocate a tiny field". You have to have enough structure to say where you really want that tiny field allocated, because there are many more places where a GPU could be allocating that tiny field in the future.
          I'm pretty sure what actually changed is that OpenGL tried to hide the details of GPUs' NUMA from the programmer, whereas Vulkan took a different approach of just exposing all the gory complexity for the programmer to manage.
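
          For a concrete picture of "exposing the gory complexity": in Vulkan the application itself picks which memory type each allocation comes from. A rough sketch of the usual selection helper pattern (as seen in the Vulkan tutorial); `typeBits` would come from vkGetBufferMemoryRequirements():

          Code:
          #include <stdint.h>
          #include <vulkan/vulkan.h>

          /* Pick a memory type index that is allowed by the resource (typeBits)
           * and has the properties the application wants, e.g. DEVICE_LOCAL for
           * VRAM or HOST_VISIBLE | HOST_COHERENT for a staging buffer. */
          uint32_t find_memory_type(VkPhysicalDevice phys, uint32_t typeBits,
                                    VkMemoryPropertyFlags wanted)
          {
              VkPhysicalDeviceMemoryProperties props;
              vkGetPhysicalDeviceMemoryProperties(phys, &props);

              for (uint32_t i = 0; i < props.memoryTypeCount; i++) {
                  if ((typeBits & (1u << i)) &&
                      (props.memoryTypes[i].propertyFlags & wanted) == wanted)
                      return i;
              }
              return UINT32_MAX; /* no suitable type; the caller must handle this */
          }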

          Comment


          • #6
            Now if only it could support alpha-to-one in Vulkan, it'd be complete... anyhoo. Great job!
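
            For reference, alpha-to-one is an optional Vulkan device feature (VkPhysicalDeviceFeatures::alphaToOne) rather than core behaviour; where a driver exposes it, it is switched on per pipeline in the multisample state. Sketch only, with the rest of the pipeline setup omitted:

            Code:
            #include <vulkan/vulkan.h>

            /* Only valid if the optional alphaToOne feature was enabled at
             * device creation; the rest of the pipeline create info is omitted. */
            VkPipelineMultisampleStateCreateInfo multisample = {
                .sType = VK_STRUCTURE_TYPE_PIPELINE_MULTISAMPLE_STATE_CREATE_INFO,
                .rasterizationSamples = VK_SAMPLE_COUNT_4_BIT,
                .alphaToCoverageEnable = VK_TRUE,
                .alphaToOneEnable = VK_TRUE,
            };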

            Comment


            • #7
              Originally posted by coder View Post
              Can you point to any sort of API reference which says it's even possible to do specifically that, or are you just making shit up?

              It seems to me that GPUDirect is just about enabling DMA transfers between GPU and storage - not actually putting GPU runtime state in nonvolatile memory on the NVMe drive. If you know differently, let's see some proof.
              SC19 -- NVIDIA today introduced NVIDIA Magnum IO, a suite of software to help data scientists and AI and high performance computing researchers process massive amounts of data in minutes, rather than hours.

              GPUDirect is used in the Magnum IO stuff from Nvidia. Yes, state shared between multiple GPUs ends up pushed out to the NVMe drives with the Magnum IO stuff, with the result that multiple GPUs can process more data because they are not getting bound up in the CPU.

              Yes, GPUDirect is between GPU and storage. But this can be multiple GPUs and multiple storage devices, with a storage device (or devices) holding operational state that the GPUs in a cluster can be referring to.

              So as Vulkan supports more multi-GPU options, we can expect the descriptor model to expand as well.

              Also note that the first experiments with storage directly connected to a GPU being used to share state go back to before Vulkan was even an idea. Nvidia Magnum IO is just a recycled idea.

              Originally posted by coder View Post
              A memory hierarchy doesn't mean the memory will all be accessed in exactly the same way. Even when using directly-connected NVDIMMs, you don't treat them as if they're just more DRAM. Optane is the best NVDIMM tech we currently have, yet it's still an order of magnitude slower and has multiple orders of magnitude worse endurance than DRAM. If you blindly treated them like DRAM, system performance would crater and you'd be replacing the NVDIMMs of a system like every couple weeks.
              It depends on what state you are putting on the NVDIMMs. Vulkan has the means of making a memory hierarchy viable in its design for good reasons. Magnum IO pushing different GPU cluster state out via GPUDirect to NVMe is many times faster than having to use the CPU in the middle.

              Originally posted by coder View Post
              That will never happen. GPUs are incredibly bandwidth-hungry. Even CXL 2.0 is down by an order of magnitude from what the latest GPU-like accelerators are getting from their HBM configurations. The point of CXL memory devices is to have a large pool of shared memory, but not to completely replace locally-attached DRAM.
              Note I said "VRAM-less GPU items"; did I say they would have to be the latest high-performing GPUs? No, I did not. Think about wanting to provide a basic GPU to a virtual machine as cheaply as possible.

              This is the same thing we are going to see with RAID controllers as well. Yes, it is warped that we will see lower-performing GPUs that don't have RAM, so that when those GPUs do not need it, the RAM that would have gone to them can be used elsewhere.

              Originally posted by coder View Post
              I'm pretty sure what actually changed is that OpenGL tried to hide the details of GPUs' NUMA from the programmer, whereas Vulkan took a different approach of just exposing all the gory complexity for the programmer to manage.
              The horrible reality here is that the OpenGL standard itself does not support NUMA or multithreading. The performance boosts Mesa, Nvidia and others get by enabling multithreading are outside the OpenGL specification, which is why enabling multithreading for applications comes down to per-application quirks.

              So it's not that OpenGL tries to hide the details of GPUs' NUMA; it's that the OpenGL standard says you must hide those details, which results in some horribly bad performance.

              Vulkan is designed for NUMA, multithreading and other upcoming features. Yes, this does mean application developers need to be told more; a sketch of the multithreaded recording pattern is below.
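
              To illustrate the multithreading point with plain, current Vulkan (nothing roadmap-specific): command buffers can be recorded concurrently as long as each thread uses its own VkCommandPool, because a pool and the buffers allocated from it may only be touched by one thread at a time. A minimal sketch, with `device` and `queueFamilyIndex` assumed to come from the usual setup:

              Code:
              #include <vulkan/vulkan.h>

              /* One command pool per worker thread; submission still happens from
               * a single thread after all workers have finished recording. */
              typedef struct {
                  VkCommandPool   pool;
                  VkCommandBuffer cmd;
              } ThreadRecorder;

              void init_thread_recorder(VkDevice device, uint32_t queueFamilyIndex,
                                        ThreadRecorder *r)
              {
                  VkCommandPoolCreateInfo poolInfo = {
                      .sType = VK_STRUCTURE_TYPE_COMMAND_POOL_CREATE_INFO,
                      .flags = VK_COMMAND_POOL_CREATE_RESET_COMMAND_BUFFER_BIT,
                      .queueFamilyIndex = queueFamilyIndex,
                  };
                  vkCreateCommandPool(device, &poolInfo, NULL, &r->pool);

                  VkCommandBufferAllocateInfo allocInfo = {
                      .sType = VK_STRUCTURE_TYPE_COMMAND_BUFFER_ALLOCATE_INFO,
                      .commandPool = r->pool,
                      .level = VK_COMMAND_BUFFER_LEVEL_PRIMARY,
                      .commandBufferCount = 1,
                  };
                  vkAllocateCommandBuffers(device, &allocInfo, &r->cmd);
              }

              /* Runs on a worker thread: records this thread's share of the work. */
              void record_on_worker_thread(ThreadRecorder *r)
              {
                  VkCommandBufferBeginInfo begin = {
                      .sType = VK_STRUCTURE_TYPE_COMMAND_BUFFER_BEGIN_INFO,
                  };
                  vkBeginCommandBuffer(r->cmd, &begin);
                  /* ... vkCmd* calls for this thread's work ... */
                  vkEndCommandBuffer(r->cmd);
              }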

              Comment


              • #8
                Originally posted by oiaohm View Post
                https://nvidianews.nvidia.com/news/n...pc-researchers
                GPUDirect is used in the Magnum IO stuff from Nvidia. Yes, state shared between multiple GPUs ends up pushed out to the NVMe drives with the Magnum IO stuff, with the result that multiple GPUs can process more data because they are not getting bound up in the CPU.
                Saying "yes" makes it sound like that supports your original statement about GPU runtime state being stored directly on flash memory via Vulkan Descriptors, but it says nothing of the sort!

                Originally posted by oiaohm View Post
                So as Vulkan supports more multi-GPU options, we can expect the descriptor model to expand as well.
                When you're speculating, you need to be clear about it. Your original post sounded like a statement of fact.

                Originally posted by oiaohm View Post
                It depends on what state you are putting on the NVDIMMs.
                No, it doesn't. You don't use NVDIMMs like DRAM, period.

                Originally posted by oiaohm View Post
                Note I said "VRAM-less GPU items"; did I say they would have to be the latest high-performing GPUs? No, I did not. Think about wanting to provide a basic GPU to a virtual machine as cheaply as possible.
                It still doesn't make sense. Ever since AGP was introduced, there was some notion that graphics cards could avoid having their own memory and simply use system DRAM. However, this never happened because even low-end GPUs would be terribly bottlenecked by it.

                The only case where unified memory is actually used is in CPU-integrated GPUs (and before them, chipset-integrated graphics), where there's basically no alternative. And, even in those cases, it's still largely bottlenecked by system DRAM.

                Comment


                • #9
                  Originally posted by coder View Post
                  Saying "yes" makes it sound like that supports your original statement about GPU runtime state being stored directly on flash memory via Vulkan Descriptors, but it says nothing of the sort!

                  https://media.contentapi.ea.com/cont...cyonvulkan.pdf
                  Vulkan is a similar story to DirectX 12, except we haven’t implemented multi-GPU or ray tracing support at this time, but it is planned
                  Page 3, and that is from 2018.

                  Originally posted by coder View Post
                  When you're speculating, you need to be clear about it. Your original post sounded like a statement of fact.
                  Because it was a statement of fact. I have read the Vulkan roadmap. The reason some areas of Vulkan appear over-complex is that it is not complete. There are bits that are there for multi-GPU.

                  Originally posted by coder View Post
                  No, it doesn't. You don't use NVDIMMs like DRAM, period.
                  Vulkan does not say that descriptors have to point to writeable storage either. The Vulkan standard calls them "resource descriptors" for a reason. The current Vulkan standard never once states that resources have to be stored in DRAM.

                  Yes, current versions of the released Vulkan standard only have host memory and device memory. But the multi-GPU roadmap for Vulkan includes the likes of GPUDirect and CXL as well.

                  Originally posted by coder View Post
                  The only case where unified memory is actually used is in CPU-integrated GPUs (and before them, chipset-integrated graphics), where there's basically no alternative. And, even in those cases, it's still largely bottlenecked by system DRAM.
                  CXL is not unified memory. I said VRAM-less GPU because it is something different from your CPU-integrated GPUs. I never said this would be without a major bottleneck. Your server CPUs like EPYC, Xeon and so on don't have integrated GPUs. So what I am talking about with VRAM-less is chipset/iGPU-like graphics for virtual machines. These are not going to be the best GPUs in the world.

                  Look at the complexity of MxGPU (AMD), GVT-g (Intel) and vGPU (Nvidia); remember, virtual machines need to be secure. A VRAM-less GPU would basically put the security of the memory allocated to it back on CXL.

                  There is another cost here. Think about having a limited number of GPUs, really only enough for the currently active VMs. As a VM stops, you have to suspend the GPU's state to storage somewhere and bring up the GPU state for the VM you want to start. So there is a stack of costs here.

                  VRAM on a GPU in a system running massive numbers of virtual machines can be its own form of nightmare: when you try to get the most allocation out of the GPU, the most expensive thing becomes saving and restoring GPU state. Which is worse, being performance-bottlenecked on the GPU by CXL, or being bottlenecked because you have to save and restore the GPU's VRAM all the time?

                  coder, things are a little more complex here than you are thinking. VRAM-less GPUs are not meant to be the best-performing GPU for a desktop computer, or the best-performing fully dedicated GPU. What VRAM-less GPUs will be for is the mass virtual machine setup, where not needing to copy the GPU state information around actually removes a problem, while still providing GPU-optimised instructions to reduce CPU usage.

                  coder, there are gains to a VRAM-less GPU in a particular segment of the market.

                  Please note that integrated GPUs in desktop CPUs, or historic chipset graphics, are not VRAM-less; they share a RAM controller with the host system, so they still in fact have VRAM, just not good VRAM.

                  A VRAM-less GPU can be completely missing its own MMU, having just the GPU caches. These are like the upcoming RAM-less RAID controllers that can use CXL RAM. Yes, the RAM-less RAID controllers generally perform worse than the versions that contain RAM, but again, in mass virtual machine setups RAM-less RAID controllers can provide a set of security advantages.

                  The mass virtual machine market is a very different beast. This different beast is going to result in particular bits of hardware existing that make no sense in a desktop computer, a dedicated server, or as an individual VM's dedicated part, because they only make sense for the mass virtual machine setup.

                  Comment


                  • #10
                    Originally posted by oiaohm View Post
                    https://media.contentapi.ea.com/cont...cyonvulkan.pdf
                    Vulkan is a similar story to DirectX 12, except we haven’t implemented multi-GPU or ray tracing support at this time, but it is planned
                    Page 3, and that is from 2018.
                    I don't know why I even follow you down these rabbit holes, but that's a statement about their Vulkan backend, rather than about Vulkan itself.

                    Originally posted by oiaohm View Post
                    Because it was a statement of fact. I have read the Vulkan roadmap.
                    Then whatever you read about the original point, you also misinterpreted. I guarantee nobody put "using Vulkan Descriptors to allocate storage on an NVMe SSD" on the original Vulkan roadmap.

                    Originally posted by oiaohm View Post
                    So what I am talking about with VRAM-less is chipset/iGPU-like graphics for virtual machines. These are not going to be the best GPUs in the world.
                    So far, BMCs use their own dedicated DRAM.

                    Originally posted by oiaohm View Post
                    There is another cost here. Think about having a limited number of GPUs, really only enough for the currently active VMs. As a VM stops, you have to suspend the GPU's state to storage somewhere and bring up the GPU state for the VM you want to start. So there is a stack of costs here.
                    That makes about as much sense as building a CPU without L1/L2 cache, "... because it would just get swapped out when there's a context-switch". The fact that nobody does it should tell you something. The cost of evicting it after the context-switch turns out to be a lot less than the impact of not having the cache.

                    Originally posted by oiaohm View Post
                    VRAM on a GPU in a system running massive numbers of virtual machines can be its own form of nightmare: when you try to get the most allocation out of the GPU, the most expensive thing becomes saving and restoring GPU state.
                    They can page it in/out, just like a CPU would.

                    Comment
