Does anyone else pronounce "rusticl" like "testicle"?
Rusticl OpenCL Driver Nearing Cross-Vendor Shared Virtual Memory Support
Originally posted by ultimA View Post
It makes developing with OpenCL easier, but actively using it comes at a great performance cost. So demand isn't that great, which is why other stacks prioritized other more useful features.
I'm pretty sure SVM has solid use cases, like when you want the GPU to have random-access to more data than it has available memory onboard. Or, maybe it needs only like 1% from a large pool of data and you'd rather just have it request what it needs than always send it all over. Yes, it's going to be slow, but think about how much slower it would be to build in an additional round-trip at the command layer, in order for the GPU to be able to request the data it wants!
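The "only needs 1% of a large pool" argument can be put into a rough back-of-envelope model. All the numbers below are assumptions for illustration (pool size, bus bandwidth, and the pessimistic effective bandwidth of demand-paged SVM access), not measurements:

```python
# Sketch of the trade-off described above, with assumed numbers:
# copy-everything pays for the whole pool up front; SVM-style demand
# access pays only for the data actually touched, but at a (assumed)
# much lower effective bandwidth.

POOL_GB = 32.0           # total data pool (assumed)
TOUCHED_FRACTION = 0.01  # the kernel reads ~1% of it
BUS_GBPS = 16.0          # bulk host->device copy bandwidth (assumed)
SVM_GBPS = 4.0           # effective demand-paged bandwidth (assumed, 4x worse)

# Time to copy the whole pool over the bus once.
copy_all_s = POOL_GB / BUS_GBPS

# Time for the GPU to pull in only the 1% it touches, at SVM speed.
svm_touched_s = (POOL_GB * TOUCHED_FRACTION) / SVM_GBPS

print(f"copy whole pool: {copy_all_s:.2f} s")
print(f"SVM, touch 1%:   {svm_touched_s:.2f} s")
```

Even with SVM assumed four times slower per byte, touching 1% of the pool on demand comes out well ahead of shipping all of it over, which is the point being made: per-byte slowness doesn't matter when it saves you 99% of the bytes.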
-
Last edited by nuetzel; 05 January 2025, 12:43 AM.
-
Originally posted by ultimA View Post
It makes developing with OpenCL easier, but actively using it comes at a great performance cost. So demand isn't that great, which is why other stacks prioritized other more useful features. But it makes a nice marketing headline: "Hey, we are the first to implement this thing that most people do not want to use."
I once did a demonstrator for an industrial image-processing unit using an AMD Kaveri, and OpenCL with fine-grained SVM worked great. Sure, you lose a great deal of memory bandwidth to SVM, but for this use case copying the data was far more expensive.
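That trade-off (lose bandwidth, skip the copy) can be sketched numerically. The figures below are assumptions chosen to resemble an APU like the Kaveri, not measurements from that demonstrator:

```python
# Sketch of copy-vs-zero-copy, with assumed numbers:
# explicit-copy path: copy the frame to the device, then stream it at
# full device bandwidth; fine-grained SVM path: no copy, but the kernel
# streams the frame at a (assumed) much lower effective bandwidth.

FRAME_GB = 0.5       # one image frame (assumed)
COPY_GBPS = 6.0      # host->device copy bandwidth (assumed)
FULL_GBPS = 200.0    # device-local streaming bandwidth (assumed)
SVM_GBPS = 20.0      # SVM effective bandwidth (assumed, 10x worse)
PASSES = 1           # how many times the kernel streams the frame

copy_path_s = FRAME_GB / COPY_GBPS + PASSES * FRAME_GB / FULL_GBPS
svm_path_s = PASSES * FRAME_GB / SVM_GBPS

print(f"copy+compute: {copy_path_s * 1e3:.1f} ms")
print(f"SVM zero-copy: {svm_path_s * 1e3:.1f} ms")
```

With a single pass over each frame, the copy dominates and SVM wins despite the bandwidth hit; crank `PASSES` up and the explicit-copy path pulls ahead, which is exactly the "lose bandwidth vs. copying was more expensive" trade-off described above.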
-
Originally posted by nuetzel View Post
Global memory bandwidth (GBPS)
Data Type   Clover   Rusticl   Ratio
float         2.64    184.33   6982.2%
float2        2.65    180.36   6806.0%
float4        2.65    186.15   7024.5%
float8        2.17    174.00   8018.4%
float16       2.05    181.05   8831.7%

Single-precision compute (GFLOPS)
Data Type   Clover    Rusticl   Ratio
float       3209.49   6193.49   193.0%
float2      3208.70   6173.33   192.4%
float4      3205.00   5958.12   185.9%
float8      3193.70   5918.83   185.3%
float16     3158.80   5827.80   184.5%

Double-precision compute (GFLOPS)
Data Type   Clover   Rusticl   Ratio
double      403.84   401.21    99.3%
double2     403.80   401.20    99.4%
double4     403.25   400.05    99.2%
double8     401.71   398.89    99.3%
double16    390.15   397.50   101.9%

Integer compute (GIOPS)
Data Type   Clover    Rusticl    Ratio
int         1260.00   1249.90    99.2%
int2        1236.25   1243.64   100.6%
int4        1253.34   1242.55    99.1%
int8        1251.42   1241.08    99.2%
int16       1250.63   1240.84    99.2%

Integer compute Fast 24bit (GIOPS)
Data Type   Clover    Rusticl   Ratio
int         5529.18   1246.70   22.5%
int2        5352.50   1240.93   23.2%
int4        5265.23   1240.65   23.6%
int8        5216.86   1239.58   23.8%
int16       5109.00   1241.88   24.3%

Integer char (8bit) compute (GIOPS)
Data Type   Clover    Rusticl    Ratio
char        6093.27   1028.33    16.9%
char2       3527.38   5739.72   162.7%
char4       3490.04   5444.48   156.0%
char8       3268.79   5432.31   166.2%
char16      3262.62   5397.22   165.4%

Integer short (16bit) compute (GIOPS)
Data Type   Clover    Rusticl    Ratio
short       6000.48   1009.90    16.8%
short2      3774.82   5577.88   147.8%
short4      3531.09   5304.85   150.2%
short8      3488.43   5393.45   154.6%
short16     3497.31   5353.42   153.1%

Transfer bandwidth (GBPS)
Operation                         Clover    Rusticl   Ratio
enqueueWriteBuffer                   5.04      4.68   92.86%
enqueueReadBuffer                    5.07      4.76   93.89%
enqueueWriteBuffer non-blocking      5.04      4.73   93.85%
enqueueReadBuffer non-blocking       5.07      4.79   94.48%
enqueueMapBuffer (for read)       3154.82      3.45    0.11%
memcpy from mapped ptr               5.05      4.89   96.83%
enqueueUnmap (after write)        3852.68      4.85    0.13%
memcpy to mapped ptr                 5.03      4.95   98.41%

Kernel Launch Latency (usec)
Clover   Rusticl   Ratio (lower is better)
240.69    61.91    25.72%
Some big gains, but also notable regressions. I wouldn't break out the champagne just yet.
Last edited by coder; 05 January 2025, 04:18 AM.
-
Originally posted by coder View Post
Sounds more like an excuse than a real reason.
I'm pretty sure SVM has solid use cases, like when you want the GPU to have random-access to more data than it has available memory onboard. Or, maybe it needs only like 1% from a large pool of data and you'd rather just have it request what it needs than always send it all over. Yes, it's going to be slow, but think about how much slower it would be to build in an additional round-trip at the command layer, in order for the GPU to be able to request the data it wants!
Last edited by ultimA; 05 January 2025, 11:01 AM.
-
Originally posted by ultimA View Post
It makes developing with OpenCL easier, but actively using it comes at a great performance cost. So demand isn't that great, which is why other stacks prioritized other more useful features. But it makes a nice marketing headline: "Hey, we are the first to implement this thing that most people do not want to use."
-
Originally posted by ultimA View Post
But that is exactly the reason why it makes programs slower. It's not that the bus bandwidth is lower when SVM is in use, or that it needs a lot of extra processing power. The problem with SVM is that even though the virtual memory space gets shared between GPU and host, their memories are still distinct. So what ends up happening is a lot of round-trips of data back and forth, or trips of small chunks of data instead of more optimized large batches, because the programmer treats the memories as if they were unified, but they are not. The GPU cannot predict what data the algorithm will need in the future, and it certainly cannot rewrite an algorithm to transfer data more optimally.

Without SVM, the programmer is forced to think about transfer sizes, shoving data at the right times between algorithmic stages, overlapping data transfers with processing, and so on. With SVM, a lot of this happens in the background automatically, but very suboptimally. The absence of a shared memory space basically forces you to think about these problems (though of course you can still choose sub-optimal solutions), whereas SVM lets the programmer get away without thinking at all. And if you are targeting efficient and fast code, not just code offloaded to the GPU, then you'll need to come up with similarly complex schemes with SVM as without it, save for a few cases.
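The "small chunks instead of large batches" cost is easy to quantify with a simple latency-plus-bandwidth model. The per-transfer overhead and bus bandwidth below are assumptions for illustration, not measured figures:

```python
# Sketch of why many small transfers lose to one large batch
# (assumed numbers): every transfer pays a fixed per-request
# latency on top of payload time at bus bandwidth.

LATENCY_S = 10e-6   # per-transfer overhead (assumed, ~10 us)
BW_GBPS = 16.0      # bus bandwidth (assumed)
TOTAL_MB = 256.0    # total data moved either way

def transfer_time_s(chunk_kb: float) -> float:
    """Total time to move TOTAL_MB in chunks of the given size."""
    n_chunks = (TOTAL_MB * 1024.0) / chunk_kb
    payload_s = (TOTAL_MB / 1024.0) / BW_GBPS  # same bytes either way
    return n_chunks * LATENCY_S + payload_s

batched_s = transfer_time_s(16 * 1024.0)  # programmer-managed 16 MB batches
paged_s = transfer_time_s(4.0)            # SVM-style 4 KB page-sized chunks

print(f"16 MB batches: {batched_s * 1e3:.1f} ms")
print(f"4 KB chunks:   {paged_s * 1e3:.1f} ms")
```

The payload time is identical in both cases; the page-sized variant loses by more than an order of magnitude purely on accumulated per-transfer latency, which is the round-trip problem described above.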