
Intel Details New Data Streaming Accelerator For Future CPUs - Linux Support Started


  • #21
    Would the DSA on a Sapphire Rapids function as the primary controller for transferring data among multiple GPU accelerators, for example in the Argonne nodes with the six GPUs? It seems something has to orchestrate that.
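
    For reference, on the host side the DSA is driven by writing 64-byte descriptors into a work queue exposed by the Linux idxd driver. Below is a minimal, hedged sketch of a single memory-move submission; the device node /dev/dsa/wq0.0, the pre-configured dedicated work queue (e.g. set up with accel-config), and shared-virtual-addressing support are assumptions for illustration, not something the article states.

    ```c
    /* A hedged sketch of one DSA memory move, not a drop-in tool: it assumes a
     * dedicated work queue already configured (e.g. with accel-config) and
     * exposed as /dev/dsa/wq0.0, a kernel with the idxd driver and its uapi
     * header installed, and shared-virtual-addressing support.
     * Build (hypothetically): gcc -O2 -mmovdir64b dsa_memmove.c -o dsa_memmove */
    #include <fcntl.h>
    #include <stdint.h>
    #include <stdio.h>
    #include <string.h>
    #include <sys/mman.h>
    #include <unistd.h>
    #include <immintrin.h>    /* _movdir64b(), _mm_pause() */
    #include <linux/idxd.h>   /* struct dsa_hw_desc, struct dsa_completion_record */

    int main(void)
    {
        static char src[4096] = "hello from the DSA";
        static char dst[4096];
        static struct dsa_completion_record comp __attribute__((aligned(32)));

        int fd = open("/dev/dsa/wq0.0", O_RDWR);   /* assumed WQ device node */
        if (fd < 0) { perror("open wq"); return 1; }

        /* The work-queue "portal" is a write-only MMIO page; storing a 64-byte
         * descriptor into it with MOVDIR64B submits the job (dedicated WQ). */
        void *portal = mmap(NULL, 0x1000, PROT_WRITE,
                            MAP_SHARED | MAP_POPULATE, fd, 0);
        if (portal == MAP_FAILED) { perror("mmap portal"); return 1; }

        memset(dst, 0, sizeof(dst));   /* touch dst up front so this sketch
                                          needn't handle DSA page-fault
                                          (partial) completions */

        struct dsa_hw_desc desc __attribute__((aligned(64)));
        memset(&desc, 0, sizeof(desc));
        desc.opcode          = DSA_OPCODE_MEMMOVE;
        desc.flags           = IDXD_OP_FLAG_RCR | IDXD_OP_FLAG_CRAV; /* ask for a completion record */
        desc.src_addr        = (uintptr_t)src;
        desc.dst_addr        = (uintptr_t)dst;
        desc.xfer_size       = sizeof(src);
        desc.completion_addr = (uintptr_t)&comp;

        _movdir64b(portal, &desc);               /* hand the copy to the DSA */

        volatile uint8_t *status = &comp.status;
        while (*status == 0)                     /* 0 = not complete yet */
            _mm_pause();

        printf("completion status=%u, dst=\"%s\"\n", *status, dst);
        munmap(portal, 0x1000);
        close(fd);
        return 0;
    }
    ```

    Whether that same descriptor path is what would feed data to six GPUs per node is exactly the open question here.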



    • #22
      Originally posted by jayN View Post
      Would the DSA on a Sapphire Rapids function as the primary controller for transferring data among multiple GPU accelerators, for example in the Argonne nodes with the six GPUs? It seems something has to orchestrate that.
      How are they connected?

      Whatever is orchestrating it probably resides on each GPU, with the interconnectivity (e.g. CXL, PCIe, or whatever) simply acting as a switched network. That would scale best and have the lowest latency. Did Intel hint at having DSAs in their GPUs?



      • #23
        Originally posted by coder View Post
        How are they connected?

        Whatever is orchestrating it probably resides on each GPU, with the interconnectivity (e.g. CXL, PCIe, or whatever) simply acting as a switched network. That would scale best and have the lowest latency. Did Intel hint at having DSAs in their GPUs?
        Slides leaked during the last week show DSA in Sapphire Rapids.

        All the coherency is handled in the host CPU caches. That's one of the main features of CXL... the accelerators can be relatively simple.



        • #24
          Originally posted by jayN View Post
          All the coherency is handled in the host CPU caches. That's one of the main features of CXL... the accelerators can be relatively simple.
          Yeah, but even with bandwidth equivalent to PCIe 5.0, CXL will still be a bottleneck if it's the only way GPUs are connected. Nvidia and AMD both have multi-link inter-GPU connectivity solutions that don't route through the CPU.

          According to the above MI100 specs, each accelerator features 3 IF links, though I'm not 100% certain if they're each at 92 GB/s or if that's across all links. It sounds like the former, but one can't be too careful when interpreting mfg specs.
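
          For what it's worth, here's the back-of-the-envelope math for both readings of that spec, next to raw PCIe 4.0/5.0 x16 link rates (encoding overhead only, no protocol overhead); just a rough sketch, with the 92 GB/s figure taken from the MI100 spec discussed above:

          ```c
          /* Rough link-rate arithmetic only: raw GT/s with 128b/130b encoding,
           * ignoring packet/protocol overhead and measured throughput. */
          #include <stdio.h>

          int main(void)
          {
              const double lanes = 16.0;
              const double enc   = 128.0 / 130.0;        /* 128b/130b encoding */
              double pcie4 = 16.0 * enc * lanes / 8.0;   /* 16 GT/s -> GB/s, x16 */
              double pcie5 = 32.0 * enc * lanes / 8.0;   /* 32 GT/s -> GB/s, x16 */

              double if_link = 92.0;                     /* GB/s, from the MI100 spec */
              double if_former = 3.0 * if_link;          /* "former": 92 GB/s per link */
              double if_latter = if_link;                /* "latter": 92 GB/s total */

              printf("PCIe 4.0 x16: %5.1f GB/s per direction\n", pcie4);
              printf("PCIe 5.0 x16: %5.1f GB/s per direction\n", pcie5);
              printf("MI100 IF, 92 GB/s per link x3: %5.1f GB/s\n", if_former);
              printf("MI100 IF, 92 GB/s aggregate:   %5.1f GB/s\n", if_latter);
              return 0;
          }
          ```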



          • #25
            Originally posted by coder View Post
            Yeah, but even with bandwidth equivalent to PCIe 5.0, CXL will still be a bottleneck if it's the only way GPUs are connected. Nvidia and AMD both have multi-link inter-GPU connectivity solutions that don't route through the CPU.

            According to the above MI100 specs, each accelerator features 3 IF links, though I'm not 100% certain if they're each at 92 GB/s or if that's across all links. It sounds like the former, but one can't be too careful when interpreting mfg specs.
            https://www.nextplatform.com/2020/11...-accelerators/

            The current Infinity Fabric data rate is below PCIe 5. I believe AMD currently uses IF on top of PCIe 4 lanes.

            According to the linked article, the current IF GPU configuration is limited to 4 GPUs by the number of IF links, and does not include the CPU.

            Symmetric coherency becomes a bottleneck ... it isn't scalable, and it complicates the GPUs, requiring them to carry AMD's proprietary home agent. CXL GPUs would use whatever home agent the host CPUs provide ... so they would be compatible with different hosts. CXL biased coherency is more scalable in terms of bus bandwidth, since snoops are not issued while the bias is held ... in other words, the GPU's local memory pool is treated as private while the bias is held.
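
            To make the "no snoops while the bias is held" point concrete, here is a toy accounting model. It is emphatically not the CXL.cache protocol itself, just the bookkeeping the paragraph above describes: the GPU pays one bias flip up front, then hits its local pool snoop-free until the host touches the page again.

            ```c
            /* Toy model of page bias, for illustration only: it counts when a
             * snoop/bias-flip would be needed, nothing more. */
            #include <stdio.h>

            enum bias { HOST_BIAS, DEVICE_BIAS };

            struct page_state {
                enum bias bias;
                unsigned long snoops;   /* coherency actions charged to this page */
            };

            /* GPU touches its local memory: snoop-free while it holds device bias. */
            static void gpu_access(struct page_state *p)
            {
                if (p->bias == HOST_BIAS) {
                    p->snoops++;            /* consult the host's home agent once */
                    p->bias = DEVICE_BIAS;
                }
                /* in DEVICE_BIAS the access proceeds locally, no snoop */
            }

            /* Host touches the same page: bias flips back, costing coherency traffic. */
            static void host_access(struct page_state *p)
            {
                if (p->bias == DEVICE_BIAS) {
                    p->snoops++;
                    p->bias = HOST_BIAS;
                }
            }

            int main(void)
            {
                struct page_state p = { HOST_BIAS, 0 };

                for (int i = 0; i < 1000000; i++)
                    gpu_access(&p);         /* one flip, then a million snoop-free hits */
                host_access(&p);            /* host reads the result: one more flip */

                printf("~1e6 GPU accesses, snoops charged: %lu\n", p.snoops);
                return 0;
            }
            ```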

            PCIe 5 connections are private between the CPU and GPU. I haven't seen a spec for how many simultaneous DMAs the DSA supports, but Intel has two Sapphire Rapids CPUs on the Aurora nodes, each with 80 PCIe 5 lanes, so it looks like they could support up to 10 GPUs, each on a 16-lane bidirectional PCIe 5 link, which is 2x the PCIe 4 data rate.

            The Aurora topology only has 6 Xe-HPC GPUs per node, leaving four 16-lane, bidirectional CXL connections for communicating with other nodes.
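
            A quick sanity check of that lane budget, using only the figures from this post (two sockets, 80 PCIe 5.0 lanes each, x16 per device); this is plain arithmetic, not a published Aurora block diagram:

            ```c
            /* Lane-budget arithmetic, assuming the figures quoted in the post. */
            #include <stdio.h>

            int main(void)
            {
                const int sockets        = 2;
                const int lanes_per_cpu  = 80;   /* assumption from the post */
                const int lanes_per_link = 16;

                int total_lanes = sockets * lanes_per_cpu;        /* 160 */
                int x16_links   = total_lanes / lanes_per_link;   /* 10  */
                int gpus        = 6;                              /* Aurora node */
                int spare_links = x16_links - gpus;               /* 4 for the fabric */

                /* PCIe 5.0 doubles the per-lane rate of PCIe 4.0 (32 GT/s vs
                 * 16 GT/s), hence the "2x the PCIe 4 data rate" per x16 link. */
                printf("x16 links per node: %d (GPUs: %d, left over: %d)\n",
                       x16_links, gpus, spare_links);
                return 0;
            }
            ```
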
            Last edited by jayN; 12 April 2021, 09:29 AM.



            • #26
              jayN, in case you didn't see this yet, here's what Nvidia just announced:



              Source: https://www.anandtech.com/show/16611...0am-pt1630-utc



              • #27
                Originally posted by coder View Post
                jayN, in case you didn't see this yet, here's what Nvidia just announced:


                Source: https://www.anandtech.com/show/16611...0am-pt1630-utc
                It's a pretty marketing picture, but doesn't make any sense as an AI training topology.

                Intel bought Habana for its Gaudi training architecture, which was preferred by FB. Intel has the tile manufacturing capability and silicon photonics technology to help Habana reduce power and increase interconnect bandwidth.



                • #28
                  Originally posted by jayN View Post
                  It's a pretty marketing picture, but doesn't make any sense as an AI training topology.
                  Okay, sure. Nvidia, who's been leading the way for basically the whole AI revolution, doesn't know where the bottlenecks are, how to scale up, or what makes sense. Right.

                  Originally posted by jayN View Post
                  Intel bought Habana for its Gaudi training architecture,
                  Got any references, so we can compare?



                  • #29
                    Originally posted by coder View Post
                    Okay, sure. Nvidia, who's been leading the way for basically the whole AI revolution, doesn't know where the bottlenecks are, how to scale up, or what makes sense. Right.


                    Got any references, so we can compare?
                    Nvidia knows what they're doing. They apparently just let their marketing team come up with that graphic... which doesn't give away anything.

                    Habana GAUDI Training Whitepaper v1.2.pdf



                    • #30
                      Oh, that's right. I forgot it was all Ethernet.

                      So, is the data movement all push-based? It seems like they'd be at a disadvantage for reads.

                      BTW, I think NVLink is cache-coherent, which is just unnecessary overhead in some important cases (but nice, in others).

                      One thing that I'll say is definitely working against Nvidia, going forward, is their apparent desire to be all things to all users. AI training doesn't need fp64, so why exactly does the A100 still have it?
