PCIe 6.0 Specification Released With 64 GT/s Transfer Speeds


  • #41
    Originally posted by oiaohm View Post
    AMD high-end dGPU cards are only PCIe 4.0 x8, not x16, even though they take an x16 slot.
    This is demonstrably false. It's only the RX 5500 and below, which are x8. In the 6000 series, it's the RX 6600 that's x8, while the RX 6500 is x4.

    Originally posted by oiaohm View Post
    Even with Nvidia PCIe 4.0 x16 cards, we are at the point where there is no performance difference to be gained from going faster.
    This is also demonstrably false.

    Since all your conclusions are based on faulty data, I can see how you arrive at such ridiculous conclusions. Troll harder, next time.

    Originally posted by oiaohm View Post
    Basically, PCIe bandwidth is expanding faster than dGPUs can use it; the question is by how much.
    I'll grant you this. There was no justification for Intel adding PCIe 5.0 x16 to Alder Lake. That's why I didn't believe they'd really do it.



    • #42
      Originally posted by coder View Post
      I'll grant you this. There was no justification for Intel adding PCIe 5.0 x16 to Alder Lake. That's why I didn't believe they'd really do it.
      https://www.pcmag.com/news/alder-lak...ative-platform

      Intel did provide a split dual PCIe 5.0 x8 configuration as a motherboard-maker option. PCIe 5.0 x16 does make sense for particular accelerators; dGPUs are not one of them.

      Originally posted by coder View Post
      This is demonstrably false. It's only the RX 5500 and below, which are x8. In the 6000 series, it's the RX 6600 that's x8, while the RX 6500 is x4.
      No, what I said is demonstrably true. AMD motherboards of the PCIe 4.0 era support bifurcation: you have a PCIe 4.0 x16 slot, you add another card, and it becomes an x8 slot. Even the highest-end AMD dGPU in a PCIe 4.0 x8 slot shows no change in performance. So yes, it's a PCIe 4.0 x16 card, but it's really not using that bandwidth; over half of it goes unused.

      I was talking about PCIe 4.0 x8 in terms of bandwidth, not the connections wired up on the card. Of course the higher-end AMD cards have all 16 PCIe lanes wired up, which means they hold their performance when dropped into a PCIe 3.0 x16 slot. With the lower-end cards you start noticing problems when you drop them into a PCIe 3.0 slot, because then you fall under the bandwidth required for full performance.

      A dGPU not needing a PCIe 5.0 or 6.0 x16 slot does not mean there will be absolutely no gain at all. The reality here, with the bifurcation of slots on different motherboards, is that it's demonstrable that dGPUs are not using more than PCIe 3.0 x16 or PCIe 4.0 x8 worth of bandwidth.

      Yes, bifurcation of PCIe 3.0 x16 into two PCIe 3.0 x8 slots used to cause quite a performance fall-off in particular benchmarks.

      So a PCIe 4.0 dGPU that is x16 ends up behaving the same in a PCIe 4.0 x8 slot as in an x16 slot, still with the ~1 percent performance improvement over the old PCIe 3.0 x16.

      We kind of hit a threshold here.

      The doubling cadence of PCIe performance has outstripped dGPU development. Coder, just because something is wired up does not mean it's doing anything useful.
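The arithmetic behind the x8 claim is easy to check. A quick Python sketch: the per-lane rates and 128b/130b encoding are the published spec figures, while TLP/DLLP protocol overhead above line encoding is ignored for simplicity.

```python
# Approximate per-direction PCIe bandwidth for a given generation/width.
RATES_GT = {3: 8.0, 4: 16.0, 5: 32.0}            # GT/s per lane
ENCODING = 128 / 130                             # 128b/130b line code, gens 3-5

def gbps(gen, lanes):
    """Usable GB/s in one direction, counting line encoding only."""
    return RATES_GT[gen] * ENCODING * lanes / 8  # Gbit/s -> GB/s

print(f"PCIe 3.0 x16: {gbps(3, 16):.1f} GB/s")   # ~15.8 GB/s
print(f"PCIe 4.0 x8 : {gbps(4, 8):.1f} GB/s")    # ~15.8 GB/s, identical
print(f"PCIe 4.0 x16: {gbps(4, 16):.1f} GB/s")   # ~31.5 GB/s
```

PCIe 4.0 x8 and PCIe 3.0 x16 come out identical in raw bandwidth, which is why the benchmarks compare exactly those two configurations.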



      • #43
        Originally posted by oiaohm View Post
        Intel did provide a split dual PCIe 5.0 x8 configuration as a motherboard-maker option.
        Yes, I did see that. I still think PCIe 5.0 doesn't make sense on a desktop. Intel simply wanted to be first, and as long as people are content to pay for PCIe 5.0 in their Alder Lake motherboards, I guess that's fine.

        Originally posted by oiaohm View Post
        PCIe 5.0 x16 does make sense for particular accelerators
        But this is a mainstream desktop platform. I'm not talking about their workstations or server chips.

        Originally posted by oiaohm View Post
        You have a PCIe 4.0 x16 slot, you add another card, and it becomes an x8 slot.
        Duh. Their desktop CPUs don't have PCIe 4.0 x32.

        Originally posted by oiaohm View Post
        Even the highest-end AMD dGPU in a PCIe 4.0 x8 slot shows no change in performance.
        Again, that's not true now, nor was it even true of the 5700 XT. There's good data out there, like this:


        Unfortunately, their analysis isn't quite up to the same level. First, it should be noted that they used only a Ryzen 3900X, which was not even the best gaming CPU at the time of the article's publication. The next caveat is that the GPU they used is only an RTX 3080 FE. On such a system, PCIe is less of a bottleneck than it'd be on a top-spec machine built even at that time, much less today. Finally, they looked at average framerate rather than 99th percentile, which again would be more revealing.

        With all that said, let's look into what it actually shows. At all resolutions, the mean & median speedup is a bit over 1%. However, there's a fair amount of variation within that. What a gamer cares about is their favorite games, and not at all about the performance of games they have no interest in playing. So, let's look at where it helps and by how much.

        Resolution | Max Speedup | # Above Average
        1920x1080  | 3.9%        | 9
        2560x1440  | 4.2%        | 8
        3840x2160  | 6.0%        | 7

        That's not an insignificant shift in average FPS. As I said, the improvement in 99th percentile should be even greater. Finally, it will only increase with faster CPUs and GPUs.

        Originally posted by oiaohm View Post
        The doubling cadence of PCIe performance has outstripped dGPU development.
        I already agreed on this point. PCIe 4.0 adds little, but more than you say.

        For gaming, PCIe 5.0 x16 is absolutely useless. It does add motherboard cost and burns more power (when actually used). The only argument for it would be to support dual-5.0 x8 GPUs, but we don't even know when any PCIe 5.0 GPUs will exist, nor is good multi-GPU support among software very widespread.



        • #44
          Originally posted by coder View Post
          Resolution | Max Speedup | # Above Average
          1920x1080  | 3.9%        | 9
          2560x1440  | 4.2%        | 8
          3840x2160  | 6.0%        | 7

          That's not an insignificant shift in average FPS. As I said, the improvement in 99th percentile should be even greater. Finally, it will only increase with faster CPUs and GPUs.
          💻 Recommended computer builds: https://bit.ly/3aJ79xg — In today's video I checked for you what the performance differences are with an RTX 3080 in PCIe 4... modes


          Except they did not test PCIe 4.0 x8, because they presumed it performs exactly the same as PCIe 3.0 x16 since it has the same bandwidth; the problem is that this is not the case. This video (not in English) tests PCIe 4.0 x8 against PCIe 4.0 x16, and there are others out there who have done the test. A card in a PCIe 4.0 x8 slot is slightly faster than the same card in a PCIe 3.0 x16 slot. The reasons are simple: PCIe 4.0's higher clock speed means lower latency in knowing when the GPU wants data, plus PCIe 4.0 protocol improvements.

          Yes, four lanes of PCIe 5.0 are slightly faster than eight lanes of PCIe 4.0 as well, and two lanes of PCIe 6.0 will be slightly faster again, even though all of these have the same bandwidth in theory. The problem is not just bandwidth, it's latency in communication. Higher bus speed means lower latency. Lower latency allows slightly more usage of the same bandwidth because there is less dead time (dead time as in waiting, not sending anything, waiting for a request). So the fraction of the theoretical bandwidth that a dGPU can actually consume goes up as the PCIe standard improves. But it appears that once you get to 16 GB/s, the important factor for a dGPU is the latency of the bus, not the bandwidth any more. Yes, that is what the benchmarks putting PCIe 4.0 x8 against x16 are showing.
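The dead-time argument can be put in a toy model. Everything here is illustrative: the transfer size and turnaround times are made-up numbers, and the assumption that the newer link halves the turnaround is exactly the claim under discussion, not a measured fact.

```python
# Toy model: a GPU issues a synchronous read, waits a fixed turnaround
# ("dead time"), then receives the data, and repeats back-to-back.

def utilization(transfer_bytes, link_gbs, turnaround_ns):
    """Fraction of the link kept busy by back-to-back synchronous reads."""
    transfer_ns = transfer_bytes / link_gbs   # GB/s is the same as bytes/ns
    return transfer_ns / (transfer_ns + turnaround_ns)

# Same ~16 GB/s total bandwidth; assume the newer link halves the turnaround.
for label, turnaround_ns in [("PCIe 3.0 x16", 400), ("PCIe 4.0 x8 ", 200)]:
    busy = utilization(4096, 16, turnaround_ns)
    print(f"{label}: {busy:.0%} of the bandwidth actually usable")
```

With lower turnaround, the same nominal bandwidth delivers more bytes per second, which is the shape of the effect being claimed.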

          If we stay at this bandwidth-usage stall point, PCIe 5.0 x8 for dGPUs will be absolute overkill, with all the performance gains of a PCIe 5.0 GPU already obtained in an x4 slot. We could be at a stall point.

          Yes, there has been an incorrect presumption that the bandwidth figure is the be-all and end-all. The problem is that it's bandwidth and latency; improvements in either can result in performance improvements for a dGPU, depending on how it is bottlenecked. Current dGPUs are not bottlenecked by bandwidth but by latency.



          • #45
            Originally posted by oiaohm View Post
            A card in a PCIe 4.0 x8 slot is slightly faster than the same card in a PCIe 3.0 x16 slot. The reasons are simple: PCIe 4.0's higher clock speed means lower latency in knowing when the GPU wants data
            For tiny transactions, you will see latency reductions, but they're still commensurate with bandwidth increases.

            Originally posted by oiaohm View Post
            PCIe 4.0 protocol improvements.
            Such as?

            Originally posted by oiaohm View Post
            Yes, four lanes of PCIe 5.0 are slightly faster than eight lanes of PCIe 4.0
            Why would that be?

            Originally posted by oiaohm View Post
            as well, and two lanes of PCIe 6.0 will be slightly faster again, even though all of these have the same bandwidth in theory.
            Not necessarily. If you're latency-bound rather than bandwidth-limited, you could actually see a regression going to 6.0. That's because the minimum transaction size is now a 256-byte Flit. So, the overhead and latency of tiny transactions will actually go up, and with no clock speed boost to compensate!

            In raw bandwidth terms, PCI-SIG claims that 1b/1b encoding + Flits are a slight win over the previous 128b/130b NRZ encoding, but I think the amount saved is probably on the order of only a couple of bits per Flit.
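The tiny-transaction point can be put in rough numbers. The fixed 256-byte Flit is from the 6.0 spec; the ~28 bytes of pre-6.0 TLP framing and the 236 usable TLP bytes per Flit are approximations assumed for illustration.

```python
# Back-of-envelope wire cost for one isolated tiny write TLP.

FLIT_BYTES = 256       # fixed Flit size in PCIe 6.0 Flit mode
FLIT_TLP_BYTES = 236   # assumed usable TLP bytes per Flit (rest: DLP/CRC/FEC)
TLP_FRAMING = 28       # assumed header + seq + LCRC + framing cost, pre-6.0

def wire_bytes_pre60(payload_bytes):
    """Pre-6.0: each TLP pays only its own framing."""
    return payload_bytes + TLP_FRAMING

def wire_bytes_60(payload_bytes):
    """6.0: an isolated TLP still occupies whole 256-byte Flits."""
    flits_needed = -(-(payload_bytes + TLP_FRAMING) // FLIT_TLP_BYTES)  # ceil
    return flits_needed * FLIT_BYTES

print(wire_bytes_pre60(4))  # 32 bytes on the wire
print(wire_bytes_60(4))     # 256 bytes: a whole Flit for a 4-byte write
```

Of course Flits can carry several TLPs when traffic is dense, so the penalty only bites when small transactions are isolated, which is exactly the latency-bound case.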

            Originally posted by oiaohm View Post
            The problem is not just bandwidth, it's latency in communication. Higher bus speed means lower latency. Lower latency allows slightly more usage of the same bandwidth because there is less dead time (dead time as in waiting, not sending anything, waiting for a request).
            I've actually made this exact same argument, and I do believe it. However, graphics & compute APIs bend over backwards to make interactions pipelined and async, specifically so that neither the GPU nor the CPU is constrained by turnaround time. There are obviously limitations to the degree this is possible, especially for interactive apps like games.

            You need look no further than PCIe scaling benchmarks, like the ones I cited, and note how some games' performance is almost uncorrelated with PCIe speed (these are mostly the ones dragging down the mean speedup figure, as well). That's because they have enough bandwidth and do enough async/pipelined communication that virtually all latency is effectively hidden.
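That latency-hiding effect is essentially a bandwidth-delay-product argument, and can be sketched numerically. The round-trip time, request size, and queue depths below are made-up numbers for illustration only.

```python
# With enough requests in flight, throughput is capped by link bandwidth,
# not by round-trip time; with too few, latency dominates.

def achieved_gbs(link_gbs, rtt_ns, request_bytes, in_flight):
    """Sustained GB/s for a pipeline of `in_flight` outstanding requests."""
    per_rtt_gbs = in_flight * request_bytes / rtt_ns  # bytes/ns == GB/s
    return min(link_gbs, per_rtt_gbs)

RTT_NS = 500  # assumed end-to-end turnaround
print(achieved_gbs(16, RTT_NS, 4096, 1))  # one request in flight: ~8.2 GB/s
print(achieved_gbs(16, RTT_NS, 4096, 8))  # deep pipeline: the full 16 GB/s
```

Games that pipeline well sit on the flat part of that curve, which is why their framerates barely move with PCIe speed.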



            • #46
              Originally posted by coder View Post
              For tiny transactions, you will see latency reductions, but they're still commensurate with bandwidth increases.
              Yes, linked to the bandwidth per lane, not to the total bandwidth of an x16/x8/x4... slot. The issue is that we are not seeing the total bandwidth used.

              Originally posted by coder View Post
              Why would that be?
              https://blogs.synopsys.com/vip-centr...-new-features/
              There were quite a few apparently minor changes. Altering the polling system is kind of a big one. It's not as big as the difference between USB 2.0 and USB 3.0, but when you start changing some of these areas there is going to be a performance change.

              The majority of the change is the higher speed per lane resulting in lower latency.

              Originally posted by coder View Post
              Not necessarily. If you're latency-bound rather than bandwidth-limited, you could actually see a regression going to 6.0. That's because the minimum transaction size is now a 256-byte Flit. So, the overhead and latency of tiny transactions will actually go up, and with no clock speed boost to compensate!

              In raw bandwidth terms, PCI-SIG claims that 1b/1b encoding + Flits are a slight win over the previous 128b/130b NRZ encoding, but I think the amount saved is probably on the order of only a couple of bits per Flit.
              There is kind of a killer to that problem. The low-latency transfers in OpenGL and Vulkan are 256 bytes the majority of the time, so the most common tiny transactions are not that tiny. So the majority of the time this change will be a boost, not a hindrance. Now, with CUDA/OpenCL this might be a different matter. So you might be right that some things are hindered by PCIe 6.0; I had not considered that change.

              Originally posted by coder View Post
              I've actually made this exact same argument and I do believe it. However, graphics & compute APIs bend over backwards to make pipelined, asynch interactions, specifically so that neither the GPU nor the CPU are constrained by turnaround time. There are obviously limitations to the degree this is possible, especially for interactive apps like games.

              You need look no further than PCIe scaling benchmarks, like the ones I cited, and note how some games' performance is almost uncorrelated with PCIe speed (these are mostly the ones dragging down the mean speedup figure, as well). That's because they have enough bandwidth and do enough async/pipelined communication that virtually all latency is effectively hidden.
              The numbers you quoted for the difference: other people have run those same benchmarks at PCIe 4.0 x8 and PCIe 4.0 x16 and noticed no performance difference. So the performance gain is not coming from the total bandwidth increase; it's the per-lane bandwidth increase plus the polling and other minor changes that were made in PCIe 4.0. Yes, it's the games that hit the latency limit, with traffic the graphics and compute APIs cannot hide, where you are seeing differences.

              Yes, there are a large number of programs for which PCIe 2.0 x16 already provides enough bandwidth and low enough latency that you will see no further performance gain.

              So far I have not seen a single benchmark finding anything faster at PCIe 4.0 x16 than at PCIe 4.0 x8 with a dGPU by a value greater than the margin of error. Yes, you can find examples of PCIe 4.0 x16 beating PCIe 3.0 x16 by margins that cannot be measurement error, yet this still matches what PCIe 4.0 x8 does, so the PCIe 4.0 protocol changes do some positive things, as does the faster lane speed. Async/pipelined communication also undermines increasing bandwidth past a particular point, because it improves utilisation of the bandwidth you already have. Yes, this pipelining is why you see some programs basically topping out at PCIe 2.0 x16.

              coder, it's kind of a funny situation: PCIe 4.0 doubled the x16 slot bandwidth and basically nothing is using it. Of course, because in the past performance improvements came from bandwidth increases, different parties did not consider that latency and small protocol changes could be the performance difference they were seeing. This is why the benchmarks putting PCIe 4.0 x16 head to head with PCIe 4.0 x8 have been very interesting, because of the basically zero difference. You don't get zero difference when you put PCIe 3.0 x16 head to head with PCIe 3.0 x8.
              Last edited by oiaohm; 17 January 2022, 09:39 PM.



              • #47
                Originally posted by oiaohm View Post
                The numbers you quoted for the difference.
                The article from which my numbers were derived tested 23 different games. A few of them showed performance relatively uncorrelated with PCIe, at least at the upper levels. That's what I was referring to.



                • #48
                  Originally posted by coder View Post
                  The article from which my numbers were derived tested 23 different games. A few of them showed performance relatively uncorrelated with PCIe, at least at the upper levels. That's what I was referring to.
                  Take the ones that show the biggest difference from the site you chose: Civilization VI, Death Stranding, Detroit: Become Human, Devil May Cry 5, DOOM Eternal, F1 2020 and Metro Exodus. When you dig around you will find other parties with an NVIDIA GeForce RTX 3080 who benchmarked PCIe 4.0 x8 against PCIe 4.0 x16 and found no performance difference at all, yet a slight difference from PCIe 3.0 x16. So it's the minor protocol changes and the lower latency from faster lanes.

                  The reality is that there are no benchmarks with dGPUs showing the total PCIe 4.0 x16 bandwidth being anywhere near used.

                  PCIe 5.0 x8 has the same total bandwidth as PCIe 4.0 x16. Yes, PCIe 6.0 x4 is meant to have the same total bandwidth again. The reality here is that by the time PCIe 6.0 appears in the consumer space, we might still not be using the total bandwidth of PCIe 4.0 x16 for dGPUs.
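That equal-bandwidth ladder is straightforward to verify in raw per-direction numbers. Per-lane rates are the headline figures from each spec; encoding and protocol overheads are ignored here for simplicity.

```python
# Raw per-direction bandwidth: 4.0 x16, 5.0 x8 and 6.0 x4 all line up.
RATES_GT = {4: 16.0, 5: 32.0, 6: 64.0}  # GT/s per lane

def raw_gbs(gen, lanes):
    """Headline GB/s in one direction, before encoding overhead."""
    return RATES_GT[gen] * lanes / 8    # Gbit/s -> GB/s

for gen, lanes in [(4, 16), (5, 8), (6, 4)]:
    print(f"PCIe {gen}.0 x{lanes}: {raw_gbs(gen, lanes):.0f} GB/s")  # 32 each
```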

                  AMD releasing 8-lane and 4-lane cards is important for another reason: the pipelined and async interactions you mentioned, coder. The reason I pay attention to AMD releasing 4- and 8-lane PCIe cards is that these cards make sure the driver's ability to pipeline suits these lower lane counts.

                  Yes, before AMD did x8/x8 PCIe bifurcation on consumer boards, they also released dGPU cards with only x8 lanes wired. Yes, these were entry-level cards.

                  Something most people are not aware of: some X570 AM4 boards supporting x4/x4/x4/x4 bifurcation of the x16 PCIe 4.0 link straight from the CPU exist in the server space.

                  Not only does the PCIe standard slowly work its way from the server space to the consumer space, so does bifurcation. It's very possible that by PCIe 6.0 we will have the next level of bifurcation.

                  Yes, if dGPUs don't need more total bandwidth, doing bifurcation makes more sense: it reduces the number of traces, and possibly layers, in a consumer motherboard, reducing cost.

                  Something I have always found annoying about the PCIe standard is that a slot has to be entirely one PCIe generation and connected to a single PCIe controller.

                  With direct-from-storage access, it would be really good if you could have, say, four PCIe 6.0 lanes straight from the CPU and twelve PCIe 3.0/4.0 lanes straight from the chipset for directly accessing storage.

                  Unless there is some usage reason against it, I don't see any good reason not to do more bifurcation in the consumer space with the move to PCIe 6.0.



                  • #49
                    Originally posted by coder View Post
                    Thanks for that. I was aware of the theoretical possibility of x32, but never knew of an actual example!
                    Here's a 10-year-old PCIe 3.0 x32 left riser slot on a Supermicro X10SRW-F motherboard.

                    The x32 size is rare, but the socket is available for motherboard manufacturers to purchase; of course, there need to be cards of that size, which requires the motherboard to be manufactured first. It also eats up a lot of space when almost all cards are smaller, so it's usually used in proprietary applications such as risers; but if you have a card that big, it will work.



                    • #50
                      Originally posted by coder View Post
                      Thanks for that. I was aware of the theoretical possibility of x32, but never knew of an actual example!
                      The PCIe standard has x1, x2, x4, x8, x12, x16, and x32. The really hard ones to find are x2 and x12; I don't know of anyone who makes x2 or x12 slots. When you see an x2 on a motherboard it's a half-populated x4, or an M.2.



                      Yes, x24 slots for riser use are made as well, even though that size is not part of the standard.

                      The first time I saw an x24 was inside a Dell compact computer.

                      There is a handful of PCIe 3.0 x32 interconnect cards (as in fewer than ten), and about four makers have made PCIe 4.0 x32 interconnect cards as well; it is an absolute nightmare to find a motherboard with the slot for them. These are mostly used in data centres on custom-ordered motherboards.


