Rusticl OpenCL Driver Nearing Cross-Vendor Shared Virtual Memory Support

  • kiffmet
    Senior Member
    • Jan 2016
    • 477

    #11
    Does anyone else pronounce "rusticl" like "testicle"?

    Comment

    • Nth_man
      Senior Member
      • Nov 2012
      • 1037

      #12
      Originally posted by kiffmet View Post
      Does anyone else pronounce "rusticl" like "testicle"?
      Yes, sophiestesticle.

      Comment

      • coder
        Senior Member
        • Nov 2014
        • 8924

        #13
        Originally posted by ultimA View Post
        It makes developing with OpenCL easier, but actively using it comes at a great performance cost. So demand isn't that great, which is why other stacks prioritized other more useful features.
        Sounds more like an excuse than a real reason.

        I'm pretty sure SVM has solid use cases, like when you want the GPU to have random access to more data than it has available memory onboard. Or maybe it needs only like 1% of a large pool of data and you'd rather just have it request what it needs than always send it all over. Yes, it's going to be slow, but think about how much slower it would be to build in an additional round-trip at the command layer, in order for the GPU to be able to request the data it wants!
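        Roughly the pattern I mean is sketched below - assuming an OpenCL 2.0 device with coarse-grained buffer SVM; the context/queue handles and the "gather" kernel are placeholders, and error checking is omitted:

        Code:
        #define CL_TARGET_OPENCL_VERSION 220
        #include <CL/cl.h>

        /* Hand the kernel one big SVM pool and let it dereference whatever it
         * needs; the runtime decides how to make the data visible to the GPU
         * instead of the app enqueueing explicit transfers of the whole pool. */
        void run_gather(cl_context ctx, cl_command_queue q, cl_kernel gather,
                        size_t pool_bytes, size_t n_items)
        {
            float *pool = clSVMAlloc(ctx, CL_MEM_READ_ONLY, pool_bytes, 0);

            /* Coarse-grained SVM: map while the host fills the pool, unmap
             * before the GPU touches it. */
            clEnqueueSVMMap(q, CL_TRUE, CL_MAP_WRITE, pool, pool_bytes,
                            0, NULL, NULL);
            for (size_t i = 0; i < pool_bytes / sizeof(float); ++i)
                pool[i] = (float)i;
            clEnqueueSVMUnmap(q, pool, 0, NULL, NULL);

            /* The kernel gets a raw pointer and reads only the part it wants. */
            clSetKernelArgSVMPointer(gather, 0, pool);
            size_t gws = n_items;
            clEnqueueNDRangeKernel(q, gather, 1, NULL, &gws, NULL, 0, NULL, NULL);
            clFinish(q);

            clSVMFree(ctx, pool);
        }

        The point being: no per-chunk enqueue calls on the host side, the kernel just dereferences what it needs.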

        Comment

        • nuetzel
          Senior Member
          • May 2016
          • 754

          #14
          Some input related to the 'latest' RustiCL...
          My last numbers for my poor Intel Xeon X3470 (Nehalem), 3 GHz, 4/8 c/t, GFX8 (Polaris 20, 8 GB), PCIe 2 system: clpeak

          More to come.
          Last edited by nuetzel; 05 January 2025, 12:43 AM.

          Comment

          • Linuxhippy
            Senior Member
            • Jan 2008
            • 389

            #15
            Originally posted by ultimA View Post
            It makes developing with OpenCL easier, but actively using it comes at a great performance cost. So demand isn't that great, which is why other stacks prioritized other more useful features. But it makes a nice marketing headline "Hey, we are the first to implement this thing that most people do not want to use."
            It is a killer feature for algorithms which work on large, sparse data-sets with complex data structures - typically found in systems where the GPU can only be used to accelerate a few "steps" of a processing pipeline. Thanks to SVM, you can avoid re-formatting the data for the GPU at each "step".
            I once did a demonstrator for an industrial image processing unit using an AMD Kaveri, and OpenCL + fine-grained SVM worked great - sure, you lose a good deal of memory bandwidth with SVM, but for this use case copying the data was way more expensive.
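            Just to illustrate what "no re-formatting" means (a sketch, assuming a device that reports CL_DEVICE_SVM_FINE_GRAIN_BUFFER; the node layout and the walking kernel are made up, error checking omitted): the host builds a pointer-linked structure once and the kernel follows the very same pointers.

            Code:
            #define CL_TARGET_OPENCL_VERSION 220
            #include <CL/cl.h>
            #include <stddef.h>

            /* All nodes come from a single fine-grained SVM arena, so the one
             * pointer passed to the kernel covers the whole structure. */
            typedef struct node {
                float        value;
                struct node *next;  /* same pointer value is valid on host and GPU */
            } node;

            node *build_list(cl_context ctx, size_t count)
            {
                node *arena = clSVMAlloc(ctx,
                    CL_MEM_READ_WRITE | CL_MEM_SVM_FINE_GRAIN_BUFFER,
                    count * sizeof(node), 0);
                /* Fine-grained buffer SVM: the host writes directly,
                 * no map/unmap dance required. */
                for (size_t i = 0; i < count; ++i) {
                    arena[i].value = 0.0f;
                    arena[i].next  = (i + 1 < count) ? &arena[i + 1] : NULL;
                }
                return arena;
            }

            void walk_on_gpu(cl_command_queue q, cl_kernel walk, node *head)
            {
                /* The kernel chases head->next exactly like the host would. */
                clSetKernelArgSVMPointer(walk, 0, head);
                size_t gws = 1;
                clEnqueueNDRangeKernel(q, walk, 1, NULL, &gws, NULL, 0, NULL, NULL);
                clFinish(q);
            }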

            Comment

            • coder
              Senior Member
              • Nov 2014
              • 8924

              #16
              Originally posted by nuetzel View Post
              Some input related to the 'latest' RustiCL...
              My last numbers for my poor Intel Xeon X3470 (Nehalem), 3 GHz, 4/8 c/t, GFX8 (Polaris 20, 8 GB), PCIe 2 system: clpeak

              More to come.
              FTFY.

              Global memory bandwidth (GBPS)
              Data Type    Clover     Rusticl    Ratio
              float          2.64      184.33    6982.2%
              float2         2.65      180.36    6806.0%
              float4         2.65      186.15    7024.5%
              float8         2.17      174.00    8018.4%
              float16        2.05      181.05    8831.7%

              Single-precision compute (GFLOPS)
              Data Type    Clover     Rusticl    Ratio
              float        3209.49    6193.49    193.0%
              float2       3208.70    6173.33    192.4%
              float4       3205.00    5958.12    185.9%
              float8       3193.70    5918.83    185.3%
              float16      3158.80    5827.80    184.5%

              Double-precision compute (GFLOPS)
              Data Type    Clover     Rusticl    Ratio
              double        403.84     401.21    99.3%
              double2       403.80     401.20    99.4%
              double4       403.25     400.05    99.2%
              double8       401.71     398.89    99.3%
              double16      390.15     397.50    101.9%

              Integer compute (GIOPS)
              Data Type    Clover     Rusticl    Ratio
              int          1260.00    1249.90    99.2%
              int2         1236.25    1243.64    100.6%
              int4         1253.34    1242.55    99.1%
              int8         1251.42    1241.08    99.2%
              int16        1250.63    1240.84    99.2%

              Integer compute Fast 24bit (GIOPS)
              Data Type    Clover     Rusticl    Ratio
              int          5529.18    1246.70    22.5%
              int2         5352.50    1240.93    23.2%
              int4         5265.23    1240.65    23.6%
              int8         5216.86    1239.58    23.8%
              int16        5109.00    1241.88    24.3%

              Integer char (8bit) compute (GIOPS)
              Data Type    Clover     Rusticl    Ratio
              char         6093.27    1028.33    16.9%
              char2        3527.38    5739.72    162.7%
              char4        3490.04    5444.48    156.0%
              char8        3268.79    5432.31    166.2%
              char16       3262.62    5397.22    165.4%

              Integer short (16bit) compute (GIOPS)
              Data Type    Clover     Rusticl    Ratio
              short        6000.48    1009.90    16.8%
              short2       3774.82    5577.88    147.8%
              short4       3531.09    5304.85    150.2%
              short8       3488.43    5393.45    154.6%
              short16      3497.31    5353.42    153.1%

              Transfer bandwidth (GBPS)
              Operation                          Clover     Rusticl    Ratio
              enqueueWriteBuffer                   5.04        4.68    92.86%
              enqueueReadBuffer                    5.07        4.76    93.89%
              enqueueWriteBuffer non-blocking      5.04        4.73    93.85%
              enqueueReadBuffer non-blocking       5.07        4.79    94.48%
              enqueueMapBuffer(for read)        3154.82        3.45    0.11%
              memcpy from mapped ptr               5.05        4.89    96.83%
              enqueueUnmap(after write)         3852.68        4.85    0.13%
              memcpy to mapped ptr                 5.03        4.95    98.41%

              Kernel Launch Latency (usec)
              Clover     Rusticl    Ratio (lower is better)
              240.69       61.91    25.72%

              Some big gains, but also notable regressions. I wouldn't break out the champagne just yet.
              Last edited by coder; 05 January 2025, 04:18 AM.

              Comment

              • nuetzel
                Senior Member
                • May 2016
                • 754

                #17
                coder
                Can you PLEASE redo this when my FIXED numbers are ready? (Very nice table!)
                Greetings,
                Dieter

                Comment

                • ultimA
                  Senior Member
                  • Jul 2011
                  • 292

                  #18
                  Originally posted by coder View Post
                  Sounds more like an excuse than a real reason.

                  I'm pretty sure SVM has solid use cases, like when you want the GPU to have random-access to more data than it has available memory onboard. Or, maybe it needs only like 1% from a large pool of data and you'd rather just have it request what it needs than always send it all over. Yes, it's going to be slow, but think about how much slower it would be to build in an additional round-trip at the command layer, in order for the GPU to be able to request the data it wants!
                  But that is exactly the reason why it makes programs slower. It's not like the bus bandwidth is lower when SVM is in use or that it needs a lot of extra processing power. The problem with SVM is that even though the virtual memory space gets shared between GPU and host, their memories are still distinct. So what ends up happening is that there will be a lot of round-trips of data back and forth, or trips of small chunks of data instead of more optimized large batches, because the programmer treats the memories as if they were unified, but they are not. The GPU cannot predict what data the algorithm will need in the future, and it certainly cannot rewrite an algorithm to transfer data more optimally. Without SVM, the programmer is forced to think about transfer sizes, shoving data at the right times between algorithmic stages, overlapping data transfers with processing and so on. With SVM, a lot of these happen in the background automatically but very suboptimally. The absence of a shared memory space basically forces you to think about these problems (but ofc you can still choose sub-optimal solutions), whereas SVM lets the programmer get away without thinking at all. And if you are targeting efficient and fast code, not just code offloaded to the GPU, then you'll need to come up with similarly complex schemes with SVM as without SVM, save for a few cases.
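                  A rough sketch of the kind of hand-scheduled pipelining I mean (assuming an out-of-order command queue, two pre-created buffers and a chunked input; names and sizes are made up, error checking omitted) - while the kernel works on chunk N, chunk N+1 is already uploading:

                  Code:
                  #define CL_TARGET_OPENCL_VERSION 220
                  #include <CL/cl.h>

                  /* Double-buffered, explicitly overlapped transfers: the event
                   * wiring below is exactly the bookkeeping SVM lets you skip -
                   * at the risk of the runtime moving data in small, badly timed
                   * pieces instead. */
                  void process_chunks(cl_command_queue q, cl_kernel k, cl_mem buf[2],
                                      const char *src, size_t chunk, size_t n)
                  {
                      cl_event ran[2] = {NULL, NULL};  /* kernel-done event per buffer */

                      for (size_t i = 0; i < n; ++i) {
                          int cur = i & 1;
                          cl_event wrote;

                          /* Re-using a buffer: wait for the kernel that last read it. */
                          clEnqueueWriteBuffer(q, buf[cur], CL_FALSE, 0, chunk,
                                               src + i * chunk,
                                               ran[cur] ? 1 : 0,
                                               ran[cur] ? &ran[cur] : NULL, &wrote);
                          if (ran[cur]) clReleaseEvent(ran[cur]);

                          /* The kernel for chunk i depends only on its own upload. */
                          clSetKernelArg(k, 0, sizeof(cl_mem), &buf[cur]);
                          size_t gws = chunk;
                          clEnqueueNDRangeKernel(q, k, 1, NULL, &gws, NULL,
                                                 1, &wrote, &ran[cur]);
                          clReleaseEvent(wrote);
                      }
                      clFinish(q);
                      for (int b = 0; b < 2; ++b)
                          if (ran[b]) clReleaseEvent(ran[b]);
                  }

                  That explicit scheduling is work, but it is also the control you give up when you just hand the runtime a shared pointer.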
                  Last edited by ultimA; 05 January 2025, 11:01 AM.

                  Comment

                  • cb88
                    Senior Member
                    • Jan 2009
                    • 1346

                    #19
                    Originally posted by ultimA View Post
                    It makes developing with OpenCL easier, but actively using it comes at a great performance cost. So demand isn't that great, which is why other stacks prioritized other more useful features. But it makes a nice marketing headline "Hey, we are the first to implement this thing that most people do not want to use."
                    Except that is false... the performance cost is due to the LACK of this on most SDKs, thus requiring GPU->SYSTEM->GPU for any transfers.

                    Comment

                    • Svyatko
                      Senior Member
                      • Dec 2020
                      • 211

                      #20
                      Originally posted by ultimA View Post

                      But that is exactly the reason why it makes programs slower. It's not like the bus bandwidth is lower when SVM is in use or that it needs a lot of extra processing power. The problem with SVM is that even though the virtual memory space gets shared between GPU and host, their memories are still distinct. So what ends up happening is that there will be a lot of round-trips of data back and forth, or trips of small chunks of data instead of more optimized large batches, because the programmer treats the memories as if they were unified, but they are not. The GPU cannot predict what data the algorithm will need in the future, and it certainly cannot rewrite an algorithm to transfer data more optimally. Without SVM, the programmer is forced to think about transfer sizes, shoving data at the right times between algorithmic stages, overlapping data transfers with processing and so on. With SVM, a lot of these happen in the background automatically but very suboptimally. The absence of a shared memory space basically forces you to think about these problems (but ofc you can still choose sub-optimal solutions), whereas SVM lets the programmer get away without thinking at all. And if you are targeting efficient and fast code, not just code offloaded to the GPU, then you'll need to come up with similarly complex schemes with SVM as without SVM, save for a few cases.
                      But in the case of shared memory (APU - iGPU), the memory is ... shared?
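                      Whether a given device actually exposes that unified behaviour can be checked at runtime - a small sketch, assuming OpenCL 2.0 headers and an already-obtained cl_device_id:

                      Code:
                      #define CL_TARGET_OPENCL_VERSION 220
                      #include <CL/cl.h>
                      #include <stdio.h>

                      /* Print which SVM flavours the device reports. An APU/iGPU sharing
                       * physical memory with the host is the usual candidate for
                       * fine-grained system SVM (plain malloc'd pointers usable in kernels). */
                      void print_svm_caps(cl_device_id dev)
                      {
                          cl_device_svm_capabilities caps = 0;
                          clGetDeviceInfo(dev, CL_DEVICE_SVM_CAPABILITIES,
                                          sizeof(caps), &caps, NULL);

                          printf("coarse-grain buffer: %s\n",
                                 (caps & CL_DEVICE_SVM_COARSE_GRAIN_BUFFER) ? "yes" : "no");
                          printf("fine-grain buffer:   %s\n",
                                 (caps & CL_DEVICE_SVM_FINE_GRAIN_BUFFER) ? "yes" : "no");
                          printf("fine-grain system:   %s\n",
                                 (caps & CL_DEVICE_SVM_FINE_GRAIN_SYSTEM) ? "yes" : "no");
                          printf("SVM atomics:         %s\n",
                                 (caps & CL_DEVICE_SVM_ATOMICS) ? "yes" : "no");
                      }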

                      Comment
