
LLVMpipe Scaling With Intel's Core i7 Gulftown


  • LLVMpipe Scaling With Intel's Core i7 Gulftown

    Phoronix: LLVMpipe Scaling With Intel's Core i7 Gulftown

When we found out that an Intel Core i7 970 "Gulftown" CPU was on the way, boasting six physical cores plus another six logical cores via Hyper Threading, the first thing that came to mind was trying out this latest Intel 32nm processor with the Gallium3D LLVMpipe driver. There is a lot to love about Gallium3D when it comes to open-source Linux graphics drivers, with the possibilities presented by the different state trackers (such as native Direct3D 11 support on Linux) and with the hardware drivers themselves being more advanced, easier to write, and eventually expected to be much faster than the classic Mesa drivers. One of the drivers of particular interest is LLVMpipe, an attempt to finally make a useful CPU-based software rasterizer for Linux by leveraging the Low-Level Virtual Machine (LLVM) infrastructure. Our introductory LLVMpipe article showed that even with a Core i7 "Bloomfield" processor the driver is very demanding, but with Intel's Gulftown the results are somewhat surprising as we experiment with how this CPU-based driver scales up to twelve threads.

    http://www.phoronix.com/vr.php?view=15407

  • #2
    Given that graphics is an "embarrassingly parallel" problem, shouldn't it be possible -- theoretically -- to achieve very nearly linear scaling with the number of CPU cores? I'm not saying it would be easy, or that llvmpipe is flawed if it doesn't -- just asking whether, theoretically, it's within the realm of possibility to achieve.

    Though I guess one complicating factor here is that it's not just the graphics, but also the normal game logic itself which is running on the CPU at the same time. Have you guys considered trying some kind of purely-graphics benchmark to try and isolate that factor?
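For what it's worth, the textbook ceiling on that question is Amdahl's law: if a fraction s of the work per frame is inherently serial, the best possible speedup on N cores is

S(N) = 1 / (s + (1 - s) / N)

so even with only 5% serial work per frame, 12 threads top out at roughly 7.7x rather than 12x. Near-linear scaling requires the serial fraction to be tiny.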



    • #3
So, going by these test results, it seems that adding the 6 logical (HT) cores on top of the physical cores actually hinders performance at low resolutions and only becomes beneficial at high resolutions, and even then only minimally, at least as far as LLVMpipe is concerned.



      • #4
Is this a joke? A $1K CPU used as a software renderer that can only play games at 800x600.

I don't understand the point of this article. To show that LLVMpipe scales well? But who's going to use it anyway?



        • #5
Originally posted by sirdilznik
So, going by these test results, it seems that adding the 6 logical (HT) cores on top of the physical cores actually hinders performance at low resolutions and only becomes beneficial at high resolutions, and even then only minimally, at least as far as LLVMpipe is concerned.
          "The performance improvement seen is very application-dependent, however when running two programs that require full attention of the processor it can actually seem like one or both of the programs slows down slightly when Hyper Threading Technology is turned on. "
          http://en.wikipedia.org/wiki/Hyper-threading



          • #6
Originally posted by illissius
            Given that graphics is an "embarrassingly parallel" problem, shouldn't it be possible -- theoretically -- to achieve very nearly linear scaling with the number of CPU cores?
Is it known that current mainstream rendering techniques are embarrassingly parallel? I haven't studied the algorithms in any real detail, but it would surprise me if they are (I'd expect some issues with Z-sorting and overlapping fragments, at least). Surely some important parts of it are, but that's different from the whole pipeline scaling ideally.



            • #7
In the last year, ATI got nearly double the performance by going from 160 to 320 execution cores, so yes, 3D rendering is very definitely embarrassingly parallel.

With the currently accepted rendering algorithms, a Z-sort doesn't always need to happen. You only need to sort for transparent rendering, and even then only what falls within the tile frustum.
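A minimal sketch of that split, with made-up names (Draw, submitFrame - nothing from LLVMpipe itself): opaque draws just rely on the per-pixel depth test, and only the transparent list gets sorted back to front before blending.

Code:
#include <algorithm>
#include <vector>

struct Draw {
    float viewDepth;    // distance from the camera, larger = farther away
    bool  transparent;  // needs blending, hence ordering
    // mesh, material, etc. would live here in a real renderer
};

void submitFrame(std::vector<Draw>& draws) {
    // Opaque pass: submission order does not matter, the per-pixel depth
    // test resolves visibility, so no global sort is needed.
    for (const Draw& d : draws)
        if (!d.transparent) { /* rasterize d */ }

    // Transparent pass: sort farthest-first so blending composites correctly.
    std::vector<const Draw*> blended;
    for (const Draw& d : draws)
        if (d.transparent) blended.push_back(&d);
    std::sort(blended.begin(), blended.end(),
              [](const Draw* a, const Draw* b) { return a->viewDepth > b->viewDepth; });
    for (const Draw* d : blended) { /* rasterize and blend d */ }
}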



              • #8
Originally posted by Ex-Cyber
Is it known that current mainstream rendering techniques are embarrassingly parallel? I haven't studied the algorithms in any real detail, but it would surprise me if they are (I'd expect some issues with Z-sorting and overlapping fragments, at least). Surely some important parts of it are, but that's different from the whole pipeline scaling ideally.
                Not to mention that CPUs themselves do not scale linearly either as each core is going to be sharing L2 cache and main memory bandwidth.



                • #9
A summary of sort-of typical 3D rendering (without considering the actual game logic), with big-O notation for each step:

1. Determine the view frustum - O(1) - serial
2. Determine which objects are in the frustum - O(log n) - somewhat parallel, but not great

3. Roughly sort opaque objects front to back - O(log n) - mostly serial
4. Emit every object - O(n) - serial
4.1 Split the surface into tiles - almost O(n) parallelization (reasonable gain here)
4.1.1 Throw the object away if it is not needed in the tile - cheap, early exit point
4.2 Emit each part of the object - O(n) - serial
4.2.1 Compute the render region - O(1) - serial
4.2.2 For each pixel under the region - stupidly parallel (most of the gain here)
4.2.2.1 Test if visible - O(1) - cheap, early exit point
4.2.2.2 Render - O(1)

5 & 6. More or less the same as 3 & 4, but with transparent objects sorted back to front. Sorting here can be more expensive, and the early exit points get used much less.

7. For each post-processing pass - O(n) - serial
7.1 For each pixel - stupidly parallel (most of the gain here)
7.1.1 Do something

Um, I think that is about it?
Of course, limits such as cache misses, bandwidth, unbalanced workloads, etc. all contribute to slowing it down.
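To make step 4 a bit more concrete, here is a rough, hedged sketch of a tile-parallel loop. All the names (Framebuffer, Triangle, shadeTile, renderFrame, kTile) are invented for illustration and are not LLVMpipe's actual code; a real renderer would use a fixed thread pool rather than one thread per tile as shown here.

Code:
#include <algorithm>
#include <cstddef>
#include <cstdint>
#include <functional>
#include <thread>
#include <vector>

constexpr int kTile = 64;  // tile edge in pixels

struct Framebuffer {
    int width, height;
    std::vector<uint32_t> pixels;  // assumed sized width * height by the caller
};

struct Triangle { /* screen-space vertices, attributes, ... */ };

// Cheap bounding-box test: the early exit point from step 4.1.1.
bool overlapsTile(const Triangle&, int /*tileX*/, int /*tileY*/) {
    return true;  // placeholder; a real test compares bounding boxes
}

// One tile's worth of work: fully independent of every other tile.
void shadeTile(Framebuffer& fb, const std::vector<Triangle>& tris,
               int tileX, int tileY) {
    const int x0 = tileX * kTile, y0 = tileY * kTile;
    const int x1 = std::min(x0 + kTile, fb.width);
    const int y1 = std::min(y0 + kTile, fb.height);
    for (const Triangle& t : tris) {
        if (!overlapsTile(t, tileX, tileY)) continue;  // throw away early
        for (int y = y0; y < y1; ++y)
            for (int x = x0; x < x1; ++x) {
                // depth test, then shade - the "stupidly parallel" part
                fb.pixels[static_cast<std::size_t>(y) * fb.width + x] = 0xff202020u;
            }
    }
}

void renderFrame(Framebuffer& fb, const std::vector<Triangle>& tris) {
    const int tilesX = (fb.width  + kTile - 1) / kTile;
    const int tilesY = (fb.height + kTile - 1) / kTile;
    std::vector<std::thread> workers;
    for (int ty = 0; ty < tilesY; ++ty)
        for (int tx = 0; tx < tilesX; ++tx)
            workers.emplace_back(shadeTile, std::ref(fb), std::cref(tris), tx, ty);
    for (std::thread& w : workers) w.join();  // tiles never overlap, so no locking
}

Since the tiles write to disjoint pixels, there is no synchronization inside the pixel loops, which is where the near-linear part of the scaling would come from.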



                  • #10
I think the issue here is that while graphics still has a big chunk of embarrassingly parallel work, the individual tasks are extremely small, so for real scalability you either need some hardware scheduling (like a GPU has) or you need to design the software renderer from day one around the idea of having a very large number of cores/threads (as was attempted with the Larrabee renderer).
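A hedged sketch of what "designing around many threads" can look like in software: a fixed pool of workers pulling batches of tiny tasks off a shared counter, so the scheduling cost is paid once per batch instead of once per pixel or quad. The names and numbers here are made up for illustration.

Code:
#include <algorithm>
#include <atomic>
#include <cstddef>
#include <functional>
#include <thread>
#include <vector>

// Run `taskCount` tiny tasks on a fixed pool of threads, handing each
// worker `batchSize` tasks at a time from a shared atomic cursor.
void runBatched(std::size_t taskCount, std::size_t batchSize,
                const std::function<void(std::size_t)>& task,
                unsigned threads = std::thread::hardware_concurrency()) {
    std::atomic<std::size_t> next{0};
    std::vector<std::thread> pool;
    for (unsigned t = 0; t < threads; ++t)
        pool.emplace_back([&] {
            for (;;) {
                const std::size_t begin = next.fetch_add(batchSize);
                if (begin >= taskCount) return;  // nothing left to grab
                const std::size_t end = std::min(begin + batchSize, taskCount);
                for (std::size_t i = begin; i < end; ++i) task(i);
            }
        });
    for (std::thread& w : pool) w.join();
}

// Usage: shade a 1920x1080 frame, 4096 pixels per grab.
// runBatched(1920 * 1080, 4096, [&](std::size_t px) { /* shade pixel px */ });

The batch size trades scheduling overhead against load balance: too small and the atomic counter becomes the bottleneck, too large and one slow batch leaves cores idle at the end of the frame.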

                    AFAIK the LLVMpipe renderer was designed for "one to a small number" of threads... I'm pretty impressed with how well it scales.

I'm only looking at the results from 1 core to 6 cores, since the jump from 6 to 12 isn't really bringing more cores online, just more threads per core.



                    • #11
True. I guess after LLVMpipe is mostly feature-complete, they might do a few optimisation runs, and we will probably see some improvements.

                      And what will happen when the Gallium3D developers decide to deprecate TGSI, or use an LLVM front-end?
                      I believe that discussion came up several times recently.
                      There will be less translation going on, so possibly even more speedups?

Could you use this renderer as a way of defining the LLVM-providing framework for Gallium?
Hmm, would be cool, I think.



                      • #12
It seems like LLVMpipe at this point would be an attractive failover pipe for hardware pipes that are still being written. That way, even a pretty basic GPU pipe could provide enough acceleration to make the system usable (e.g. desktop composition, really basic games, video). Is anyone doing this or thinking about it?



                        • #13
                          Good article. The first sentence is quite awkward, but the topic of the article itself is pretty interesting.

                          Also, the new graphs, especially the bar graphs, look great. Nice work.



                          • #14
Originally posted by TechMage89
It seems like LLVMpipe at this point would be an attractive failover pipe for hardware pipes that are still being written. That way, even a pretty basic GPU pipe could provide enough acceleration to make the system usable (e.g. desktop composition, really basic games, video). Is anyone doing this or thinking about it?
                            The idea is to use it for older chips without vertex shaders to do fast vertex processing.



                            • #15
                              RE: Hyper Threading

Correct me if I'm wrong, but doesn't Hyper Threading essentially require a thread to stall temporarily (due to a cache miss or bad branch prediction) in order for there to be a performance gain? Basically, it uses the downtime caused by one thread's stall to run another thread scheduled on the same physical core. So if a thread stalls frequently, you frequently get a payoff.

But haven't we been working to avoid situations where a thread stalls in compilers, kernels, and other performance-critical software? Certainly we can't avoid them all, but does it not stand to reason that the better GCC/LLVM, the kernel, etc. get, the less benefit you'll see from Hyper Threading?
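One way to see that effect, as a purely illustrative toy (nothing to do with the article's benchmarks): a cache-miss-heavy pointer chase tends to keep gaining when you go from 6 to 12 threads on a 6-core HT chip, because the sibling thread fills the stall cycles, while a dependent arithmetic loop with almost no stalls gains little. All names and sizes below are made up for the sketch, and the results will vary by machine.

Code:
#include <algorithm>
#include <atomic>
#include <chrono>
#include <cstdint>
#include <cstdio>
#include <numeric>
#include <random>
#include <thread>
#include <vector>

// Dependent loads through a shuffled index array: the core is mostly
// stalled on cache misses, which is the case SMT is built to hide.
static uint64_t pointerChase(const std::vector<uint32_t>& next, uint64_t steps) {
    uint32_t i = 0;
    for (uint64_t s = 0; s < steps; ++s) i = next[i];
    return i;
}

// Dependent integer math: the ALU stays busy, so an HT sibling mostly
// just competes for the same execution resources.
static uint64_t arithChain(uint64_t steps) {
    uint64_t x = 1;
    for (uint64_t s = 0; s < steps; ++s)
        x = x * 6364136223846793005ULL + 1442695040888963407ULL;
    return x;
}

template <class Fn>
static double runThreads(unsigned n, Fn fn) {
    const auto t0 = std::chrono::steady_clock::now();
    std::vector<std::thread> ts;
    for (unsigned i = 0; i < n; ++i) ts.emplace_back(fn);
    for (std::thread& t : ts) t.join();
    return std::chrono::duration<double>(std::chrono::steady_clock::now() - t0).count();
}

int main() {
    // A 64 MiB random permutation: far too big for the caches, so nearly
    // every step of the chase misses.
    std::vector<uint32_t> next(16u << 20);
    std::iota(next.begin(), next.end(), 0u);
    std::shuffle(next.begin(), next.end(), std::mt19937{42});

    std::atomic<uint64_t> sink{0};  // keeps the optimizer from dropping the work
    for (unsigned n : {6u, 12u}) {  // physical cores only vs. with HT siblings
        const double miss = runThreads(n, [&] { sink += pointerChase(next, 1u << 24); });
        const double alu  = runThreads(n, [&] { sink += arithChain(1u << 28); });
        std::printf("%2u threads: pointer chase %.2fs, arithmetic %.2fs\n", n, miss, alu);
    }
    return static_cast<int>(sink.load() & 1);
}

Roughly speaking, however much the 12-thread pointer-chase run improves over the 6-thread one is the kind of gain HT has to offer for stall-bound work on that machine.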

Also, as the number of physical cores goes up, unless your software scales with it, you won't see as much of a benefit. If you have 6 cores, for example, you need to be able to peg all 6 and be hungry for more for Hyper Threading to matter. Mostly, unless I'm doing some "make -j4", my processors are sitting idle and scaled down to 800 MHz. If I'm gaming, I may see activity on up to two cores, with most of the heavy lifting done on the GPU.

So I guess I find myself wondering whether, if it came down to just HT as the distinguishing bullet point, I would be just as well off choosing a cheaper processor without HT. HT was just so much more interesting when single-core systems were more common.

Another thought... If we knew the likelihood a thread has of stalling within a time slice, we could employ some interesting scheduling for even greater performance than assuming all threads stall with about the same frequency. Advanced PGO for Intel SMT? Of course, it would be a very architecture-specific optimization and perhaps not attractive to developers aiming to catch more than just the Intel crowd.
