Intel Announces Arc B-Series "Battlemage" Discrete Graphics With Linux Support

  • pong
    Senior Member
    • Oct 2022
    • 316

    #61
    Originally posted by pioto View Post
    I meant why add the cost on the host side by implementing PCIe 5.0x16 support (in Intel Gen 12), with no GPU in sight to use it?
    The same Intel Gen 12 introduced DDR5 support (when AMD's Zen3 did not have it). This gave Intel competitive advantage for a period of time, until Zen4 appeared. The point is that DDR5 memory sticks were immediately available together with Gen 12. PCIe 5.0x16 host support is completely out of sync with consumer GPUs.
    Consumer GPUs, OK, but what about the Grace Hopper (PCIe 5.0) or Blackwell (PCIe 6.0) GPUs already productized for the server side?
    I assume Intel may want to support that level of kit from Nvidia and others there.

    And what about other possible upcoming USB / Thunderbolt / PCIe use cases on the laptop side that they expect to support before
    they generationally change their basic IC technology again?

    I also imagine that, at the data center level, networking may already be eager to make use of higher-bandwidth PCIe.

    Or maybe just a "we did not lose marketing specification parity / superiority with respect to AMD Zen" thing at the consumer level.

    To what extent Intel wants to use a similar "generation" of IC design across laptop / client desktop / server / data center use cases I don't know -- there are cost advantages to minimizing the number of IP cores one has to put into ASICs and verify for a given generation / family of chips, so if they want to move SOME things to PCIe 5.0 now, it can be advantageous to do the same for others in the same IC / fab / IP core generation.

    Anyway, since the CPU / chipsets are so horribly PCIe-lane-limited on consumer desktops, it kind of makes sense to
    "go faster": any PCIe peripheral stands a pretty good chance of being stuck running at x8 or x4 width, or of running out of lanes to handle NVMe, USB/TB/DP, NICs, etc.

    Comment

    • Quackdoc
      Senior Member
      • Oct 2020
      • 5051

      #62
      Originally posted by PapagaioPB View Post

      Hello, thank you for being willing to run this test. If it's not too much trouble, could you test a demo of some AAA games?

      Play the opening section of FINAL FANTASY XVI. Save data can be carried over to the full version of the game.

      300 years of tyranny. A mysterious mask. Lost pain and memories. Wield the Blazing Sword and join a mysterious, untouchable girl to fight your oppressors. Experience a tale of liberation, featuring characters with next-gen graphical expressiveness!

      https://store.steampowered.com/app/1295510/DRAGON_QUEST_XI_S_Echoes_of_an_Elusive_Age__Definitive_Edition/



      Hi, the issue isn't so much that it's slow right now as Intel's commitment to developing the open-source driver. AMD gets a lot of support from companies like Valve; if companies started hiring developers to help with Intel's driver, things would improve significantly.
      NOTE: I am running xfce4 nested on top of COSMIC using xwayland-run, so there is a slight performance hit there.

      Sorry about the dumb bar in the Tales of Arise video; no idea why that happened, but I can run a better test later if you really want me to. Also sorry about the quality: bad internet, so I had to really crunch the encodes. Videos are hosted on catbox. I couldn't grab numbers from the FF demo because, for some reason, it hammered my CPU so hard that the compositors would constantly stop responding, even on sway.

      Dragon Quest
      Tales of Arise

      EDIT: But DQ is easily playable at 1080p and almost at 1440p; if we get overclocking, I am confident I could make it do 1440p. As for Tales of Arise, messing around with the settings was annoying, and on max I couldn't hit a stable 60, but my frametimes were rather solid regardless, so it was playable.

      Comment

      • coder
        Senior Member
        • Nov 2014
        • 8920

        #63
        Originally posted by pong View Post
        So IIRC PCIe 5.0: x16 = 64 GB/s; x8 = 32 GB/s; x4 = 16 GB/s.
        All of those are WAY less than what one would hope one's RAM BW would be on a modern DDR5 PC,
        Chips & Cheese has a database of bandwidth benchmark data, so you don't need to "hope". They measured Ryzen 9 7950X single-thread read performance at just 57.3 GB/s. I know the 9950X improved on that front and they've also tested one, so I guess they must have gotten lazy about updating the DB. For Intel, it's even worse - an i9-12900K managed only 40.3 GB/s with DDR5. I'm guessing the PCIe controller is going to be hitting the memory controller somewhat like a single thread on one of the CPU cores.
        Originally posted by pong View Post
        and WAY WAY less than what one's VRAM BW must be. So right from the start even PCIE x16 is a bottleneck transferring between RAM / VRAM and CPU cache / VRAM relative to what the CPU and RAM and GPU should be capable of.
        Whether it's a bottleneck depends on how the GPU is being used. Most GPU data transfers are async. If the game engine is well-written, it's predicting which assets the GPU will need and sending them in advance. If the GPU is minimally blocked on host memory data transfers, then you can't call it a bottleneck. The PCIe scaling data I linked earlier shows that games are pretty good at this sort of thing.
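
        To illustrate the "send assets in advance" idea, here's a minimal, GPU-agnostic sketch of double-buffered streaming (plain Python threads with stand-in upload/render functions, not any real engine's API): while the GPU renders frame N from one staging buffer, the next frame's assets are copied into the other, so the transfer overlaps rendering instead of blocking it.

        import threading

        def upload(buffer_id: int, frame: int) -> None:
            # Stand-in for an async host-to-VRAM copy of a frame's assets.
            print(f"uploading assets for frame {frame} into staging buffer {buffer_id}")

        def render(buffer_id: int, frame: int) -> None:
            # Stand-in for the GPU consuming the previously uploaded assets.
            print(f"rendering frame {frame} from staging buffer {buffer_id}")

        def stream_frames(num_frames: int) -> None:
            upload(0, 0)  # prime the first buffer before rendering starts
            for frame in range(num_frames):
                next_buf = (frame + 1) % 2
                # Kick off the next frame's transfer while this frame renders.
                t = threading.Thread(target=upload, args=(next_buf, frame + 1))
                t.start()
                render(frame % 2, frame)
                t.join()  # the transfer must finish before its buffer is consumed

        stream_frames(4)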

        Originally posted by pong View Post
        ​Then consider many PCs will have one or more GPUs loaded in slots that are only functional at x8 or x4 width
        Many?? I can understand the x8 case, because that's what you get with an APU. However, if you're getting less than x8 for your graphics card, it's only because you made a conscious decision to prioritize another device.

        Originally posted by pong View Post
        ​as well as "hurry up and get idle" ability to save power on a sooner-idle PCIE link after the transmission is done.
        This is not a net power savings, which is why even most laptops with dGPUs don't have PCIe 5.0.

        Originally posted by pong View Post
        ​Really the PC design is backwards since the GPU is EFFECTIVELY more powerful wrt. any reasonable metric of FLOPS / MIPS / VRAM BW than the PC it is inserted in but it's attached to the mainboard (and can only communicate IPC or to system RAM) at PCIE x16 bottlenecked speeds well less than what the system RAM / CPU or parallel operating (multiple) GPUs are capable of handling if it were not for the PCIE BW bottleneck.
        That's not true. The whole reason GPUs have local memory is so that they aren't bottlenecked by using host memory. Sure, they do still require assets and commands to arrive from the host, but most assets are compressed (and geometry tessellation can be thought of as a form of compression) and if you look at the average FPS impact in the TechPowerUp scaling data I linked, it shows PCIe speed having surprisingly little impact on game performance. You really have to go to the 1% graphs to see much win, and even that amounts to just tweaking at the margins, if we're talking about 16 GB/s vs. 32 GB/s.

        So, going from PCIe 3.0 x16 to PCIe 4.0 x16 only nets you an average of 2.1% more mean fps, and that's with the fastest available GPU and an i9-13900K CPU with DDR5-6000 memory. Using a slower CPU or GPU will only lessen the bottleneck effect of PCIe.

        As I mentioned before, the only exception might be when using a low-memory GPU at high settings, resulting in lots of data turnover in VRAM.

        Comment

        • coder
          Senior Member
          • Nov 2014
          • 8920

          #64
          Originally posted by pioto View Post
          I meant why add the cost on the host side by implementing PCIe 5.0x16 support (in Intel Gen 12), with no GPU in sight to use it?
          Yeah, this was one weird-ass decision. I can think of 3 reasons, none of them great:
          1. As a PCIe 5.0 development platform, since Sapphire Rapids was still a ways off.
          2. To beat AMD at something, specs-wise (AMD went to PCIe 4.0 a couple years ahead of Intel).
          3. To provide LGA 1700 buyers with a sense of the platform being somewhat future-proof.


          What I find ironic is that Intel didn't implement PCIe 5.0 in the one place where you can actually get consumer products @ PCIe 5.0, which is the M.2 slot!

          If Intel had just upgraded their DMI link to 5.0 (and they could then also drop the link width back to x4 without losing bandwidth), they could at least hang PCIe 5.0 M.2 slots off of the chipset.
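
          The back-of-the-envelope behind that (my own arithmetic, using nominal link rates; DMI is electrically PCIe):

          # Nominal uplink bandwidth for a DMI-style link, ignoring protocol overhead.
          def link_gbps(gts_per_lane: float, lanes: int) -> float:
              return gts_per_lane * (128 / 130) * lanes / 8

          print("DMI 4.0 x8     :", round(link_gbps(16, 8), 1), "GB/s")  # current chipset uplink
          print("DMI 5.0 x4     :", round(link_gbps(32, 4), 1), "GB/s")  # same bandwidth, half the lanes
          print("DMI 5.0 x8     :", round(link_gbps(32, 8), 1), "GB/s")  # double the uplink
          print("PCIe 5.0 x4 SSD:", round(link_gbps(32, 4), 1), "GB/s")  # one fast M.2 drive can fill a x4 uplink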

          Comment

          • coder
            Senior Member
            • Nov 2014
            • 8920

            #65
            Originally posted by pong View Post
            To what extent Intel wants to use a similar "generation" of IC design across laptop / client desktop / server / data center use cases I don't know -- there are cost advantages to minimizing the number of IP cores one has to put into ASICs and verify for a given generation / family of chips, so if they want to move SOME things to PCIe 5.0 now, it can be advantageous to do the same for others in the same IC / fab / IP core generation.
            The server CPU cores & dies are completely different animals than the consumer versions.

            They already had PCIe 4.0 controller IP they'd used in previous generations that they could've repurposed for Alder Lake. The best argument for putting PCIe 5.0 in it is that they wanted to be able to use it as a test platform, but even this doesn't totally explain why they didn't restrict its use just to engineering samples. Either way, PCIe 5.0 controllers use more die space than PCIe 4.0, so it's not "free".

            Comment

            • pong
              Senior Member
              • Oct 2022
              • 316

              #66
              Originally posted by coder View Post
              Chips & Cheese has a database of bandwidth benchmark data, so you don't need to "hope". They measured Ryzen 9 7950X single-thread read performance at just 57.3 GB/s. I know the 9950X improved on that front and they've also tested one, so I guess they must have gotten lazy about updating the DB. For Intel, it's even worse - an i9-12900K managed only 40.3 GB/s with DDR5. I'm guessing the PCIe controller is going to be hitting the memory controller somewhat like a single thread on one of the CPU cores.

              Whether it's a bottleneck depends on how the GPU is being used. Most GPU data transfers are async. If the game engine is well-written, it's predicting which assets the GPU will need and sending them in advance. If the GPU is minimally blocked on host memory data transfers, then you can't call it a bottleneck. The PCIe scaling data I linked earlier shows that games are pretty good at this sort of thing.


              Many?? I can understand the x8 case, because that's what you get with an APU. However, if you're getting less than x8 for your graphics card, it's only because you made a conscious decision to prioritize another device.


              This is not a net power savings, which is why even most laptops with dGPUs don't have PCIe 5.0.


              That's not true. The whole reason GPUs have local memory is so that they aren't bottlenecked by using host memory. Sure, they do still require assets and commands to arrive from the host, but most assets are compressed (and geometry tessellation can be thought of as a form of compression) and if you look at the average FPS impact in the TechPowerUp scaling data I linked, it shows PCIe speed having surprisingly little impact on game performance. You really have to go to the 1% graphs to see much win, and even that amounts to just tweaking at the margins, if we're talking about 16 GB/s vs. 32 GB/s.

              So, going from PCIe 3.0 x16 to PCIe 4.0 x16 only nets you an average of 2.1% more mean fps, and that's with the fastest available GPU and an i9-13900K CPU with DDR5-6000 memory. Using a slower CPU or GPU will only lessen the bottleneck effect of PCIe.

              As I mentioned before, the only exception might be when using a low-memory GPU at high settings, resulting in lots of data turnover in VRAM.

              > Chips & Cheese has a database of bandwidth benchmark data,...

              Thanks, that's a good reference for many relevant systems. I haven't kept up my bookmarks for who's got
              good BW benchmarks so I was going from approximation / memory.

              The distinctions between single thread vs. multi thread cases, read vs. write cases, the relevance of what amount of data can be transferred from CPU cache (if for SW driven copy), and then the cases relevant to the DMA operations that might be set up between PCIe MM space and system RAM space are all good distinctions to consider. But even the cited 57 GBy/s Rz 7950X ST read rate is right around the PCIE5 peak, and when looking at the MT benchmarks, somewhat smaller transfer sizes, or cases where cache could be relevant, it is clear that PCIe 5.0 BW could indeed be a bottleneck compared to what a CPU/RAM transfer might achieve. CPUs / DIMMs / GPUs are gaining speed faster than Intel / AMD are generationally advancing PCIe slot BW (v5 x16, v6 x16, ...), and it's an unfortunate limitation that one's HIGHEST performance device/memory (GPU/VRAM) in the system is "stranded" on the other end of a sub-64 GB/s MMIO bus.

              Besides that, sadly, for current "consumer" GPUs (without sideband physical fabric links) the PCIe slots are the only paths for GPUs in the same host to communicate with each other. So even if the system CPU / RAM were slow enough not to be a bottleneck vs. PCIe, certainly PCIE 64GB/s BW is like an order of magnitude slower than one might wish to transfer data between cooperating GPUs (VRAM to VRAM DMA or so on) sharing data relating to a parallel / concurrent computation whose data and work are spread among 2+ GPUs. That's common enough for STEM type engineering / ML use cases and is even edging closer to consumer relevance given "edge" / "client" based ML, simulation, visualization et. al. domains.
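
              To put a rough number on that "order of magnitude": a quick sketch, where the VRAM figure is a placeholder of ~1 TB/s (in the ballpark of current high-end cards; exact numbers vary by model):

              # Time to move an 8 GB buffer between two GPUs over the slot, vs. at local VRAM speed.
              PCIE5_X16_GBPS = 63.0   # ~usable PCIe 5.0 x16 bandwidth
              VRAM_GBPS = 1000.0      # placeholder: ~1 TB/s local VRAM bandwidth

              buffer_gb = 8.0         # e.g. a halo region or weight shard to exchange
              over_pcie = buffer_gb / PCIE5_X16_GBPS
              in_vram = buffer_gb / VRAM_GBPS

              print(f"8 GB over PCIe 5.0 x16: ~{over_pcie * 1e3:.0f} ms")
              print(f"8 GB at VRAM speed:     ~{in_vram * 1e3:.0f} ms")
              print(f"ratio: ~{over_pcie / in_vram:.0f}x slower over the slot")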

              IDK what various current GPUs look like wrt. the memory controller "ports" that can be used by PCIe-related DMA on current client platforms. You're right that it COULD be quite limited in width / throughput / "virtual" channel count etc., and look no better than a ST CPU's R/W-to-RAM path, though it's also possible that it's in some ways superior in "connection / flow" while still BW-limited by whatever the weakest link is (PCIe BW or whatever). There are lots of OpenCL / CUDA and other synthetic benchmarks wrt. VRAM-to-RAM and RAM-to-VRAM sequential block transfer BW, parallel multi-transfer BW, scattered / small block transfers, etc. But IDK which site shows current benchmarks for AMD / Intel / Nvidia client GPU to amd64 client platform memory transfers.

              It was an overstatement on my part to say "many" GPU attachments are limited to x4. I was intending to point out that in many consumer systems it's easy enough to exhaust the available options to use the few available x16 or x8 link width slots, so if one wants more than a couple of PCIe slot-attached devices like GPUs (and NICs, NVMe cards, maybe MB-attached NVMes, ...), the only remaining option might quickly degrade to an x4 or x1 slot after the first better one or two slots are used up. I've got systems with such limits, and I've seen similar / worse specs elsewhere.

              Yes, I know the GPU's local VRAM is there to enable local processing and avoid the I/O bottleneck of accessing remote memory that can be made local for the duration of the computation. And I know the presence and location of a bottleneck depend on the application's needs as well as the system. But, for instance, there are lots of applications (HPC, ML, etc.) where one commonly "teams" multiple GPUs to scale things in parallel, and they do spend significant time loading large amounts of data over PCIe more or less iteratively to sync it with RAM or simply with other GPUs. In such cases PCIe BW is a large latency / bottleneck compared to even a "mid range consumer" GPU's VRAM transfer BW or "processing data flow BW" (which is often equal to the VRAM BW for memory-bound computations like simple low-compute-intensity linear algebra).
              Imagine something like a 4k x 4k x 4k 3d computational domain where one is storing say 32 bytes of state per grid cell. That's already ~2TB, and you're using all of your system RAM plus a few GPUs' full VRAM for every pass through whatever you're simulating, so you're going to feel the memory bottlenecks for IPC and system RAM all the more if your GPUs are fast and end up starving while waiting to exchange data. And that'd still be true even with a small fraction of such a thing, merely processing the more feasible scale of NN GB of data on some personal box.
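
              For what it's worth, that ~2TB figure checks out; a quick sanity check:

              # Dense footprint of a 4k^3 grid at 32 bytes of state per cell.
              cells = 4096 ** 3                 # ~6.9e10 grid cells
              total_bytes = cells * 32
              print(f"{total_bytes / 2**40:.1f} TiB ({total_bytes / 1e12:.2f} TB)")
              # -> 2.0 TiB (2.20 TB): beyond any single GPU's VRAM and typical desktop RAM,
              # so the data has to stream over PCIe repeatedly.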

              Anyway, one can come up with all kinds of use cases where today's prosumer-consumer CPUs / motherboards (or consumer GPUs) are wonderful, but I still say the overall "packaging" is all wrong mechanically, electrically, interfacing, etc. (wrt. the consumer "enthusiast" space), in that it's just got no more room to "scale" because there's a wall that's too often in the platform "architecture" as opposed to being isolated within a given peripheral that could be scaled independently of the system. If in 2025 CPU demand switches from 8 to 16 cores for the consumer, you're still more or less stuck with the same RAM BW (unless you switch platforms entirely away from the "desktop"), the same slot PCIe BW (x16), and the same heatsink volume. More likely / relevant: if you upgrade from an NV 2080 to a 3090 to a 4090 to a 5090, you've still got a "PC" (barely), but you may find you're dangerously close to exceeding the physical volume available in your case, on top of what your MB PCIe slots, PSU, cooling, and cabling can deal with while still fitting / working well. And especially, you've now got an "accelerator" whose internal processing / memory BW / throughput is close to fully 20x higher than its maximum possible attachment BW to what we're still (not so seriously) calling "main system memory" and "main system CPU", or to talk to any sibling GPU via "snail speed" PCIe 5.0 x16.

              And you're LONG past any practical correspondence with what a "single PCIe slot" even means, considering you're often well into triple- or quadruple-slot-wide "peripherals" just to "play kids' games", to say nothing of how that would scale for more demanding uses where one wants N of them teamed.

              Comment

              • coder
                Senior Member
                • Nov 2014
                • 8920

                #67
                Originally posted by pong View Post
                the relevance of what amount of data can be transferred from CPU cache (if for SW driven copy),
                None. Ryzen doesn't support that. AMD only just started enabling it in Genoa/EPYC.

                Intel had a feature that allowed PCIe to snoop the cache, called DDIO (Data-Direct I/O), however I think it needs to be explicitly used by drivers and isn't automatic. It was originally a server-only feature, as well. I don't know if clients have it or if drivers bother to use it.

                Originally posted by pong View Post
                ​But even the cited 57 GBy/s Rz 7950X ST read rate is right around the PCIE5 peak and when looking at the MT benchmarks
                Yes, I'm aware it's near the peak of PCIe 5.0 x16, which means PCIe isn't "WAY less than" RAM bandwidth, as you had hoped.

                Also, I want someone to show me that the PCIe controller acts like MT, before I'm willing to consider that data. Better yet would be just to benchmark what throughput is actually achievable via PCIe.

                Originally posted by pong View Post
                CPUs / DIMMs / GPUs are gaining speed faster than Intel / AMD are generationally advancing PCIe slot BW (v5 x16, v6 x16, ...)
                PCIe 5.0 is still overkill for client machines, even 3 years after it was introduced! Again, the TechPowerUp data is quite clear about this. PCIe just isn't the bottleneck you'd have us believe.

                It took at least a full year after the introduction of AM5 even for SSDs to truly surpass PCIe 4.0 x4 speeds. It's telling that datacenter SSDs, which are generally much higher performance, only started crossing over to PCIe 5.0 this year!

                Originally posted by pong View Post
                ​​​PCIE 64GB/s BW is like an order of magnitude slower than one might wish to transfer data between cooperating GPUs (VRAM to VRAM DMA or so on) sharing data
                The multi-GPU use case is basically dead, for client GPUs. Most games don't support multi-GPU rendering and the GPU vendors have been steadily walking back this functionality. Nvidia basically killed off SLI in the previous generation of consumer GPUs.

                Originally posted by pong View Post
                ​​​​That's common enough for STEM type engineering / ML use cases and is even edging closer to consumer relevance given "edge" / "client" based ML, simulation, visualization et. al. domains.
                They want you to buy workstation GPUs for that. Then, you can get over-the-top links, which bypass PCIe, entirely.

                Originally posted by pong View Post
                ​​​​​in many consumer systems it's easy enough to exhaust the available options to use the few available x16 or x8 link width slots,
                Yeah, the CPU-direct PCIe link is usually configurable as x16, x8 + x8, or x8 + x4 + x4. The exception I mentioned was AMD's APUs, which have just x8 PCIe (and maybe also limited to PCIe 4.0?). Beyond that, you have to rely on the chipset lanes, which Intel connects to the CPU (in aggregate) via PCIe 4.0 x8 and AMD connects at only PCIe 4.0 x4.

                If you want more than that, you need to move up to either Xeon-W or Threadripper platforms, which are considerably more expensive but have a lot more CPU-direct PCIe 5.0 lanes (and more memory bandwidth, too!).

                Originally posted by pong View Post
                Imagine something like a 4k x 4k x 4k 3d computational domain where one is storing say 32 bytes of state per grid cell. That's already ~2TB, and you're using all of your system RAM plus a few GPUs' full VRAM for every pass through whatever you're simulating,
                It's for this sort of use case that Nvidia created Grace. Each Grace CPU node has 512 GB of memory. In a system with 8 Grace nodes, you'd hold 2 TB quite easily.

                BTW, if I were processing such amounts of volumetric data, I'd look hard at the sparsity of the data and how well I might be able to apply lossless compression to it.
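
                A rough illustration of why sparsity matters here (the 5% occupancy is purely hypothetical; real volumes vary wildly): storing only occupied cells as (index, state) pairs shrinks the footprint dramatically.

                # Dense vs. coordinate-list (COO-style) storage for a mostly-empty 4k^3 volume.
                cells = 4096 ** 3
                state_bytes = 32
                index_bytes = 8                 # one 64-bit linear index per occupied cell
                occupancy = 0.05                # hypothetical: 5% of cells hold anything interesting

                dense = cells * state_bytes
                sparse = int(cells * occupancy) * (state_bytes + index_bytes)

                print(f"dense : {dense / 2**40:.2f} TiB")
                print(f"sparse: {sparse / 2**40:.2f} TiB ({dense / sparse:.0f}x smaller)")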

                Originally posted by pong View Post
                ​​​​​​​I still say the overall "packaging" is all wrong mechanically, electrically, interfacing, etc. (wrt. the consumer "enthusiast" space)
                It's not wrong for consumers. Games do fine with this architecture and those are the heaviest sort of thing most people are running.

                LLM inferencing is probably the most interesting of the exceptions, because those models can quite easily exceed the memory capacity of a GPU and a powerful GPU would indeed be bottlenecked on reading in the weights from host memory. This use case is probably the single biggest reason next gen GPUs might move up to PCIe 5.0.
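
                A back-of-the-envelope for why that is (the model size is a hypothetical example, roughly a 70B-parameter model at 4-bit quantization): if the offloaded weights have to cross the bus once per generated token, the PCIe link caps tokens per second.

                # Worst case: every offloaded weight streams over the bus for each token.
                weights_gb = 40.0   # hypothetical offloaded model size
                for name, gbps in [("PCIe 3.0 x16", 15.8), ("PCIe 4.0 x16", 31.5), ("PCIe 5.0 x16", 63.0)]:
                    print(f"{name}: at most ~{gbps / weights_gb:.2f} tokens/s")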

                Originally posted by pong View Post
                ​​​​​​​​it's just got no more room to "scale" because there's a wall that's too often in the platform "architecture"
                They want to keep the mass market systems cheap and simple. For people who need scale, they will sell you a much more expensive workstation or server solution that does.

                Originally posted by pong View Post
                ​​​​​​​​​If in 2025 CPU demand switches from 8 to 16 cores for the consumer you're still more or less stuck with the same RAM BW
                CUDIMMs are making impressive progress on this front, and DDR6 isn't far off. For now, it seems latency is a bigger issue for most memory-intensive client tasks (e.g. gaming) than overall bandwidth.

                For some MT workloads, you can definitely max out the bandwidth of 128-bit DDR5 and AMD recently suggested that's a reason they haven't gone further than 16 cores in desktop platforms.
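
                Rough numbers for that last point (nominal dual-channel peak, ignoring efficiency losses): the per-core share of a 128-bit DDR5 interface gets thin quickly as core counts climb.

                # Peak bandwidth of a 128-bit (dual-channel) DDR5-6000 interface, split per core.
                def ddr5_gbps(mt_per_s: int, bus_bits: int = 128) -> float:
                    return mt_per_s * (bus_bits // 8) / 1000  # MT/s * bytes per transfer -> GB/s

                total = ddr5_gbps(6000)
                for cores in (8, 16, 32):
                    print(f"{cores:2d} cores: {total:.0f} GB/s total, ~{total / cores:.1f} GB/s per core")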

                Comment

                • Svyatko
                  Senior Member
                  • Dec 2020
                  • 208

                  #68
                  Originally posted by bug77 View Post

                  As far as I understood, yes, ReBAR is still required.
                  Desired, not required.

                  Available since PCIe 2.0. AM4 mobos support it with BIOS upgrade. You can manually enable ReBAR (AMD SAM).

                  Comment

                  • Svyatko
                    Senior Member
                    • Dec 2020
                    • 208

                    #69
                    Originally posted by Guilleacoustic View Post
                    Probably a silly question, but what's the current status of Xe/Xe2 with Wayland ?
                    AFAIK XeSS is not supported with Linux.

                    Comment

                    • Quackdoc
                      Senior Member
                      • Oct 2020
                      • 5051

                      #70
                      Originally posted by Svyatko View Post

                      AFAIK XeSS is not supported with Linux.
                      I cry. Never forget: they told us it would be open source.

                      Comment
