Announcement

**coder** · 14 September 2019, 06:09 PM

Originally posted by tildearrow

The standard doesn't deliver (PCIe 4.0 isn't as fast as NVLink), so they're forced to make their own.

Well, PCIe 5.0 does surpass NVLink 2.0, and CXL is based on PCIe 5.0.

So, the question is valid: will Nvidia continue to push NVLink? Especially since PCIe 6.0 is already in draft and doubles bandwidth yet again.

That said, I don't know if CXL completely covers NVLink's use cases - at least for GPU-to-GPU communication.

**coder** · 14 September 2019, 06:27 PM

Originally posted by milkylainen

Ie. GPU does not share high throughput coherency state with the CPU.

Vega's HBCC might beg to differ.

https://www.techpowerup.com/gpu-spec...chitecture.pdf

Of course, whether it's high-throughput or not depends entirely on the workload. But, the hardware does seem to support it.

**milkylainen** · 15 September 2019, 04:15 AM

Originally posted by coder

Well, PCIe 5.0 does surpass NVLink 2.0, and CXL is based on PCIe 5.0.

So, the question is valid: will Nvidia continue to push NVLink? Especially since PCIe 6.0 is already in draft and doubles bandwidth yet again.

That said, I don't know if CXL completely covers NVLink's use cases - at least for GPU-to-GPU communication.

First I'll answer your questions.
Yes. NVidia will push NVLink 2.0 and higher. It simply outclasses PCIe for compute applications.
CXL might cover NVLink. CXL, RapidIO, NVLink etc. all have similar goals: to cater for the stuff that PCIe has dropped.
I'm all for an industry standard. NVLink being proprietary bugs me a lot.

I think you're both completely missing the point and the technical spec.
1. It's the SerDes that is important for lane speed. They are shared between all protocols.
2. NVLink 2.0 is _currently_ available on the same SerDes as PCIe 4.0.
3. PCIe 5.0 or 6.0 is _not available_ now. NVLink 2.0 is.
4. You seem to think that bigger version numbers mandate higher speed for one protocol and not the others.
Speed has to do with the SerDes, not protocol numbers. Once newer SerDes are available, all protocols will benefit.
5. It will be trivial to add higher-speed SerDes to the NVLink protocol. The biggest changes are in the protocol implementation, not lane bandwidth.
6. No, PCIe 5.0 and PCIe 6.0 will not "outspeed" NVLink 2.0. Per lane, yes, but NVLink 2.0 can be trunked to 6x, which means 300 GB/sec bidirectional.
7. Again. PCIe does not care much for turnaround latency and trunking + routing.

NVLink and PCIe serve different purposes. Yes. It is expensive to maintain NVLink for that purpose.
But as long as PCIe looks the way it does, other protocols will be faster at serving compute purposes.

**milkylainen** · 15 September 2019, 04:26 AM

Originally posted by coder

Vega's HBCC might beg to differ.

https://www.techpowerup.com/gpu-spec...chitecture.pdf

Of course, whether it's high-throughput or not depends entirely on the workload. But, the hardware does seem to support it.

Yes, you are right. Most if not all modern GPUs have full coherency state inside.
They have both MMUs and IOMMUs, and they have multiple executing contexts.
So they are fully capable of sharing that context with the outside world.
But current implementation in PC over PCIe does not look at the GPU as a context equal load sharing execution unit.
The GPU is still a peripheral. That does not mean it cannot execute load sharing tasks, but rather it does that through a peripheral interface.
If the context were an OpenCL kernel in a common bytecode, then perhaps it could be shared directly with the CPU. Maybe that is a future thing?
But then again, what would be the point if the GPU is much, much faster?

Even in high-end supercomputers like Summit or NVidia's DGX-2, the GPU is not seen as a context equal to the CPUs.
NVLink is mostly used to share context with other GPUs over NVSwitch. These systems use PCIe and also NVLink for peripheral control over the GPUs.

Maybe the question should be:
Will NVLink replace PCIe in the PC when there is no need for context coherency on the desktop? Or will NVLink and friends remain a speciality for high-GPU-count supercomputing solutions?

**coder** · 15 September 2019, 10:49 AM

Originally posted by milkylainen

NVLink 2.0 can be trunked to 6x, which means 300 GB/sec bidirectional.

What I'm reading says NVLink 2.0 does 25 Gbits/sec per lane per direction. If I understand correctly, the lanes are bundled 8 per sub-link, for 25 GBytes/sec per sub-link per direction. Then, the V100 (the most capacious) has 6 of these, for a max of 150 GBytes/sec per direction.

If that's incorrect, then maybe you should fix this: https://en.wikipedia.org/wiki/NVLink
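
For what it's worth, the arithmetic above can be checked with a minimal Python sketch. The constants are just the figures quoted in this post (not read from a spec), the function names are made up for illustration, and encoding overhead is ignored.

```python
# NVLink 2.0 figures as quoted in the post above (assumed, not from a spec).
GBITS_PER_LANE = 25      # Gbit/s per lane, per direction
LANES_PER_SUBLINK = 8    # lanes bundled into one sub-link
SUBLINKS_V100 = 6        # sub-links on a V100


def sublink_gbytes_per_dir(gbits_per_lane=GBITS_PER_LANE, lanes=LANES_PER_SUBLINK):
    """Raw GByte/s per sub-link per direction (8 bits per byte, no overhead)."""
    return gbits_per_lane * lanes / 8


def total_gbytes_per_dir(sublinks=SUBLINKS_V100):
    """Raw GByte/s per direction across all sub-links."""
    return sublink_gbytes_per_dir() * sublinks


per_dir = total_gbytes_per_dir()
print(per_dir, per_dir * 2)  # 150.0 300.0
```

So "150 per direction" and "300 bidirectional" are the same claim, as long as both are in GBytes/sec.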

Originally posted by milkylainen

7. Again. PCIe does not care much for turnaround latency and trunking + routing.

Well, are the latencies fixed in time, or proportional to the bus clock?

Originally posted by milkylainen

as long as PCIe looks the way it does, other protocols will be faster at serving compute purposes.

Nvidia seems to be jumping into the ARM ecosystem, rather seriously. Perhaps because POWER doesn't seem to be taking off, and AMD & Intel are both increasingly competitors. Anyway, if ARM server chips are dragging their feet on adding NVLink support, then perhaps that is pushing Nvidia more towards embracing CXL.

**coder** · 15 September 2019, 10:52 AM

Originally posted by milkylainen

But current implementation in PC over PCIe does not look at the GPU as a context equal load sharing execution unit.

Why do they need to be context equal? There can still be CPU-oriented tasks and GPU-oriented tasks, in the same pipeline. In fact, it's not uncommon to see this in deep learning, where some layer types simply do not map well to a GPU. Therefore, you need to shuffle data back and forth between them. The majority of work still happens on the GPU, but as long as some of it is CPU-oriented, you still have a data-movement problem.
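
As a rough illustration of that data-movement cost, here is a back-of-the-envelope Python sketch. The tensor shape is hypothetical, the link rates are the round per-direction figures quoted in this thread (32 GB/sec for PCIe 4.0 x16, 150 GB/sec for six NVLink 2.0 sub-links), and latency and protocol overhead are ignored.

```python
def transfer_ms(num_bytes, gbytes_per_sec):
    """Ideal time in milliseconds to move num_bytes over a link (bandwidth only)."""
    return num_bytes / (gbytes_per_sec * 1e9) * 1e3


# Hypothetical activation tensor: batch 64, 256 channels, 56x56, float32.
activation_bytes = 64 * 256 * 56 * 56 * 4

for name, rate in (("PCIe 4.0 x16", 32), ("NVLink 2.0 x6", 150)):
    # Round trip: GPU -> CPU for the CPU-oriented layer, then back again.
    round_trip = 2 * transfer_ms(activation_bytes, rate)
    print(f"{name}: {round_trip:.2f} ms per round trip")
```

The point is only that the cost scales with how often the pipeline crosses the link, whatever the link is.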

**milkylainen** · 15 September 2019, 01:18 PM

Originally posted by coder

What I'm reading says NVLink 2.0 does 25 Gbits/sec per lane per direction. If I understand correctly, the lanes are bundled 8 per sub-link, for 25 GBytes/sec per sub-link per direction. Then, the V100 (the most capacious) has 6 of these, for a max of 150 GBytes/sec per direction.

If that's incorrect, then maybe you should fix this: https://en.wikipedia.org/wiki/NVLink

Correct. 150 GB/sec single direction. 300 GB/sec bidirectional. Maybe I misunderstood you?
PCIe 4.0 is 32 GB/sec single direction, 64 GB/sec bidi. 5.0 will be 64 GB/sec single direction, 128 GB/sec bidi, both in x16 config, with no means of trunking via protocol support.
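
Those round numbers can be reproduced with a small Python sketch. It assumes the published per-lane signalling rates (16 GT/s for PCIe 4.0, 32 GT/s for 5.0) and 128b/130b line encoding, neither of which is stated in this thread, and it ignores packet and protocol overhead, so it gives an upper bound. The function name is made up for illustration.

```python
# Assumed per-lane transfer rates in GT/s (not stated in the thread).
TRANSFER_RATE_GT = {"4.0": 16, "5.0": 32}
# Assumed 128b/130b encoding: 128 payload bits per 130 bits on the wire.
ENCODING_EFFICIENCY = 128 / 130


def pcie_gbytes_per_dir(gen, lanes=16):
    """Upper-bound GByte/s per direction for a PCIe link of the given width."""
    return TRANSFER_RATE_GT[gen] * ENCODING_EFFICIENCY * lanes / 8


for gen in ("4.0", "5.0"):
    one_way = pcie_gbytes_per_dir(gen)
    print(f"PCIe {gen} x16: {one_way:.1f} GB/s per direction, {2 * one_way:.1f} GB/s bidi")
```

Under those assumptions an x16 link comes out slightly below 32 and 64 GB/sec per direction, which is where the rounded figures come from.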

Well, are the latencies fixed in time, or proportional to the bus clock?

Indeed proportional to transaction time. But a protocol can provide way faster turnarounds.
For example, PCIe would require inbound + outbound transaction time on a memory MOESI coherency answer.
That is excluding any arbitration if someone is contending for the same resource.
RapidIO or other protocols could begin the turnaround before the completing inbound transaction is over.
Ergo they avoid any arbitration while in the transaction, plus the time to complete and the answer turnaround time.
Again. Different purposes. I'm not saying that PCIe could not handle a coherency implementation, but it is slower than other protocols in doing so.

Nvidia seems to be jumping into the ARM ecosystem, rather seriously. Perhaps because POWER doesn't seem to be taking off, and AMD & Intel are both increasingly competitors. Anyway, if ARM server chips are dragging their feet on adding NVLink support, then perhaps that is pushing Nvidia more towards embracing CXL.

I vote for CXL. Mainly because I hate the closed implementation of NVLink.
I wanted a free NVLink block to implement on high-end FPGA SerDes to do really interesting stuff.
But no can do. Shit is proprietary. $$$$
So maybe yes. Maybe NVidia is digging its own grave with NVLink, because to make it take off they need to open the lid and free the implementation.

**milkylainen** · 15 September 2019, 01:33 PM

Originally posted by coder

Why do they need to be context equal? There can still be CPU-oriented tasks and GPU-oriented tasks, in the same pipeline. In fact, it's not uncommon to see this in deep learning, where some layer types simply do not map well to a GPU. Therefore, you need to shuffle data back and forth between them. The majority of work still happens on the GPU, but as long as some of it is CPU-oriented, you still have a data-movement problem.

Absolutely. I fully agree. They don't have to be context equals. I already explained that. There is little need for the CPU to be context equal to the GPU.
But GPU to GPU is another story altogether. PCIe is a peripheral interconnect. It is very obvious from the protocol specs, even though it has developed since.

It plain sucks as a context-coherent solution for GPU-to-GPU. An interconnect with low hop count and low transaction time can significantly reduce the computational time of a problem. It is no different from why we stick CPUs in large coherent environments and spend a shitload of $$$ and time on cache-coherency protocols (Sun, Cray, IBM etc).
Not everything scales easily as batch-oriented tasks. And the other way around: scaling some types of problems is useless, because you are way better off with cheaper batch-oriented solutions.
So it depends. PCIe is fine for a lot of stuff. And not so shiny when it comes to other problems.

**coder** · 15 September 2019, 09:15 PM

Originally posted by milkylainen

Correct. 150 GB/sec single direction. 300 GB/sec bidirectional. Maybe I misunderstood you?
PCIe 4.0 is 32 GB/sec single direction, 64 GB/sec bidi. 5.0 will be 64 GB/sec single direction, 128 GB/sec bidi, both in x16 config, with no means of trunking via protocol support.

Well, aren't 6 NVLink sub-links an equivalent number of pins to x48 lanes of PCIe? Though, I believe PCIe only allows up to x32 lanes to be bundled. But, if you're talking about 6 NVLinks, then the closest thing would be to compare it with PCIe 5.0 x32 @ 256 GB/sec bidir. I think neither is actually used, in practice, so you might as well go ahead and compare it with the aggregate of Epyc/Rome's x130 PCIe 4.0 lanes @ 520 GB/sec, for purposes of pissing contests.
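
A minimal Python sketch of that lane-count comparison, using the round per-lane figures implied earlier in the thread (about 2 GB/sec per lane per direction for PCIe 4.0, 4 GB/sec for 5.0). Reading the x32 figure as PCIe 5.0 is an assumption based on the 256 GB/sec number, and the helper name is made up for illustration. It counts lanes only and does not settle whether the pin counts really match.

```python
# Round per-lane, per-direction rates in GB/s implied by the x16 figures above.
PER_LANE_GB = {"pcie4": 2, "pcie5": 4}


def bidir_gbytes(lanes, per_lane_gb):
    """Aggregate bidirectional GB/s for a bundle of lanes (both directions)."""
    return 2 * lanes * per_lane_gb


nvlink_lanes = 6 * 8  # 6 sub-links of 8 lanes each, per direction
print(nvlink_lanes)                            # 48
print(bidir_gbytes(32, PER_LANE_GB["pcie5"]))  # 256
print(bidir_gbytes(130, PER_LANE_GB["pcie4"])) # 520
```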

Originally posted by milkylainen

Maybe NVidia is digging its own grave with NVLink, because to make it take off they need to open the lid and free the implementation.

Perhaps it already served its purpose. It let them get out ahead and scale up, when comparable options weren't necessarily available.


Arm Joins The Compute Express Link Bandwagon (CXL)
