
Another Sizable Performance Optimization To Benefit Network Code With Linux 5.17


  • #11
    Originally posted by tuxd3v View Post
    There is a big elephant in the room..
    AMD64 is little endian, network is big endian.. and so all bytes need to be swapped around, for every packet, and in high-speed networks that takes its toll.. we are already at 400Gbps; there will be a time when you will need "30 cores just for that"..
    Yep. We're carrying the fallout of a big mistake named big endian, and forever will, because changing all the routers in the world is infeasible. It should never have been the network endianness to begin with.



    • #12
      Originally posted by tuxd3v View Post
      There is a big elephant in the room..
      AMD64 is little endian, network is big endian.. and so all bytes need to be swapped around, for every packet, and in high-speed networks that takes its toll.. we are already at 400Gbps; there will be a time when you will need "30 cores just for that"..
      No, endianness swapping is never "a big elephant in the room". On any AMD64 CPU, endianness swaps are essentially free. The function that was optimized in this patch is far more expensive than endian-swapping every field in the packet, and it still has to run for every packet.
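      To illustrate just how cheap that is, here is a minimal sketch (generic C, not code from the patch): on x86-64, the standard htonl() conversion typically compiles down to a single byte-swap instruction.

```c
#include <arpa/inet.h>
#include <stdint.h>
#include <stdio.h>

int main(void) {
    uint32_t host_val = 0x12345678;

    /* htonl() converts host order to network (big-endian) order. On AMD64
     * it typically compiles to a single BSWAP (or MOVBE) instruction,
     * roughly one cycle of work per header field. */
    uint32_t net_val = htonl(host_val);

    printf("host 0x%08x -> network 0x%08x\n", host_val, net_val);
    return 0;
}
```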



      • #13
        Originally posted by sinepgib View Post
        Yep. We're carrying the fallout of a big mistake named big endian, and forever will, because changing all the routers in the world is infeasible. It should never have been the network endianness to begin with.
        Yeah, I even disagree with the classic "endianness" analogy (it comes from a children's tale about meaningless prejudice: the little-endians are those who eat a boiled egg starting from the little end, the big-endians from the big end), because it implies the choice doesn't matter. Little endian has real benefits, which is why it is the choice for the majority of processors.

        Heck, even in the moral tale the "endianness" metaphor comes from, there is a clear advantage to eating a boiled egg from the little end: the big end fits better in a soft-boiled egg cup.
        Last edited by microcode; 23 November 2021, 11:42 AM.



        • #14
          Originally posted by microcode View Post

          No, endianness swapping is never "a big elephant in the room". On any AMD64 CPU, endianness swaps are essentially free. The function that was optimized in this patch is far more expensive than endian-swapping every field in the packet, and it still has to run for every packet.
          True, and furthermore, TCP/IP being big endian only applies to the packet headers. There's no requirement to byte-swap the actual data you're transferring, unless the higher-layer protocol you're using specifies that all data must be big endian (which some protocols do, but many others don't).
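          As a minimal sketch of that split, take a made-up msg_header struct (illustrative only, not from any real protocol): only the fixed header fields get converted, while the payload passes through untouched.

```c
#include <arpa/inet.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical application header: the fixed fields arrive big-endian
 * (network order); the payload that follows is opaque bytes. */
struct msg_header {
    uint16_t src_port;
    uint16_t dst_port;
    uint32_t payload_len;
};

/* Convert only the header fields to host order. The payload itself is
 * never byte-swapped unless the application protocol demands it. */
const uint8_t *parse_header(const uint8_t *wire, struct msg_header *out) {
    struct msg_header raw;
    memcpy(&raw, wire, sizeof raw);
    out->src_port    = ntohs(raw.src_port);
    out->dst_port    = ntohs(raw.dst_port);
    out->payload_len = ntohl(raw.payload_len);
    return wire + sizeof raw;   /* points at the untouched payload */
}
```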



          • #15
            Michael, maybe you could add a note that the previous patch is already available in the Xanmod kernel?

            So there's no need to wait for 5.17 to enjoy those earlier performance improvements.



            • #16
              Originally posted by sinepgib View Post
              Yep. We're carrying the fallout of a big mistake named big endian, and forever will, because changing all the routers in the world is infeasible. It should never have been the network endianness to begin with.
              At the time it was the correct choice, since all the computers that really mattered were big endian, and we are basically talking about servers and supercomputers. Personal computers simply didn't exist yet.
              The problem was that the Intel/Microsoft duopoly grew so much that it became the standard for personal computers, and now also the standard in the datacenter. IBM/Motorola, Sun, MIPS, HP, DEC (Alpha) and others left the CPU market because they couldn't compete with the low prices Intel/Microsoft were charging, and they also lacked Microsoft's cheap (and buggy) software.

              IBM's OS/2 was very stable and very advanced, but it cost an arm and a leg; their hardware was top-notch too, but far too expensive.
              So Intel/Microsoft basically continued on their track without any challenge.

              If the network were designed today, of course it would be little endian, but that boat sailed long ago. Right now, I think only MIPS, ARMv7, SPARC and PowerPC are bi-endian; maybe there are more I don't know about.

              But software was optimized for little endian, and so a lot of CPUs started to work only in little endian.
              In fairness, I don't see a big advantage to being big or little endian, but little endian is here to stay. I believe PCIe is little endian too (or was optimized for little endian), and it's used heavily today by all sorts of accelerators, moving a lot of bandwidth.

              So today you have a problem...
              • if you want a very fast network, go big endian
              • if you want PCIe accelerators (optimized for speed), go little endian
              And what if I want both? We have a problem, because 400Gbps Fibre Channel and FCoE are already deployed in a lot of places, yet in today's servers you also want tons of PCIe lanes.
              In the future, CPUs for networking will need to be big endian, or, if little endian, have offload accelerators (via PCIe, for example); but in that case you couldn't use other accelerators over PCIe, because PCIe lanes are limited.

              For a long time little endian worked well on the desktop (the desktop will never be the real problem, as its requirements are low; at least not the real problem for the next 30 years or so), and now on the server too, but with the massive growth in network bandwidth the problem starts to appear, and it's only the tip of the iceberg.
              I was looking at a previous article from Michael: we can't even saturate 100Gbps with AMD64, so how will we saturate 400Gbps?
              And when networks reach 800Gbps, how will we saturate those? Big problem.


              • #17
                If I remember correctly, basically all 100Gb+ network cards have TCP checksum offloading, right? So when do we actually need to run the csum_partial() function on the CPU?
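                For reference, here is a naive sketch of what csum_partial() computes: the 16-bit ones'-complement sum from RFC 1071. The kernel's real implementation is heavily optimized (that is the function this patch speeds up), and checksum offload lets the NIC do this work instead when it can.

```c
#include <stddef.h>
#include <stdint.h>

/* Naive version of the 16-bit ones'-complement checksum (RFC 1071) that
 * csum_partial() computes; the kernel's version is heavily optimized. */
uint16_t internet_checksum(const uint8_t *data, size_t len) {
    uint32_t sum = 0;

    while (len > 1) {                  /* sum the data as 16-bit words */
        sum += (uint32_t)((data[0] << 8) | data[1]);
        data += 2;
        len  -= 2;
    }
    if (len == 1)                      /* pad an odd trailing byte */
        sum += (uint32_t)(data[0] << 8);

    while (sum >> 16)                  /* fold the carries back in */
        sum = (sum & 0xffff) + (sum >> 16);

    return (uint16_t)~sum;
}
```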



                • #18
                  Originally posted by microcode View Post
                  .. On any AMD64 CPU, endianness swaps are essentially free.
                  No, it definitely is NOT.. and you can see that with a 100Gbps adapter we are at ~65Gbps with MTU 1500, or more than 90Gbps with jumbo frames, I believe, using 2 cores (it's in the previous article Michael wrote about networking).. but now just imagine wanting to saturate 400Gbps..



                  • #19
                    Originally posted by tuxd3v View Post
                    No, it definitely is NOT.. and you can see that with a 100Gbps adapter we are at ~65Gbps with MTU 1500, or more than 90Gbps with jumbo frames, I believe, using 2 cores (it's in the previous article Michael wrote about networking).. but now just imagine wanting to saturate 400Gbps..
                    Indeed. You have a very limited number of instructions you can run on the CPU to achieve those throughputs. Swapping is an extra instruction, and not only that, it's an extra instruction that necessarily introduces a data dependency, which in turn means it takes extra space in your reorder buffer, slowing down your pipeline. The only operation you get for free is the one you don't do.
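                    A minimal sketch of that dependency argument, using the GCC/Clang __builtin_bswap32 builtin (how much the extra uop actually costs depends on the microarchitecture):

```c
#include <stddef.h>
#include <stdint.h>

/* With the swap, every add waits on the bswap: load -> bswap -> add. */
uint32_t sum_swapped(const uint32_t *words, size_t n) {
    uint32_t acc = 0;
    for (size_t i = 0; i < n; i++)
        acc += __builtin_bswap32(words[i]);
    return acc;
}

/* Without the swap, each add depends only on the load: load -> add. */
uint32_t sum_native(const uint32_t *words, size_t n) {
    uint32_t acc = 0;
    for (size_t i = 0; i < n; i++)
        acc += words[i];
    return acc;
}
```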

                    Originally posted by tuxd3v View Post
                    At the time it was the correct choice, since all the computers that really mattered were big endian, and we are basically talking about servers and supercomputers.
                    It was the wrong choice because making those computers big endian was the wrong choice to begin with. Of course, we can't blame the TCP protocol for adapting to the world it was invented in. But that world was already in the wrong.
                    Essentially, whether big endian or little endian is better comes down to whether you care more about reading numbers in "human order" directly or about making things like downcasting and carry propagation easier for the computer. I care more about efficiency. If I want to read a binary number I can tell my debugger I'm working with an int; I need to do that anyway, because binaries don't know about types.
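                    A quick sketch of the downcasting point: on a little-endian machine a narrowing read can reuse the same address, so the snippet below prints 0x55667788; a big-endian machine would print 0x11223344 instead.

```c
#include <stdint.h>
#include <stdio.h>
#include <string.h>

int main(void) {
    uint64_t wide = 0x1122334455667788ULL;
    uint32_t narrow;

    /* On a little-endian machine the least significant bytes live at the
     * lowest address, so truncating to a narrower type can reread the
     * same address; a big-endian machine would have to offset the pointer. */
    memcpy(&narrow, &wide, sizeof narrow);

    printf("64-bit 0x%016llx -> 32-bit 0x%08x\n",
           (unsigned long long)wide, narrow);  /* 0x55667788 on LE */
    return 0;
}
```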



                    • #20
                      Originally posted by sinepgib View Post
                      Indeed. You have a very limited number of instructions you can run on the CPU to achieve those throughputs. Swapping is an extra instruction, and not only that, it's an extra instruction that necessarily introduces a data dependency, which in turn means it takes extra space in your reorder buffer, slowing down your pipeline. The only operation you get for free is the one you don't do.
                      Yeah, and also in a lot of cases, if you send files over the network, you need to convert them to the right byte order, depending on the file format.
                      One format that is stored big endian is JPEG: even on little-endian machines you always have to convert, because that's what the standard mandates (even just to open an image a conversion is needed, likewise to store one, etc.), but a lot could be said about files over the network. Then there are the formats stored little endian as well.
                      PCIe, for example, is a little-endian thing, and the majority of graphics drivers were written for little endian, or with little-endian machines in mind..
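                      To make the JPEG point concrete, here is a minimal sketch of reading one big-endian field from the start of a JFIF file (read_be16 is an illustrative helper):

```c
#include <stdint.h>
#include <stdio.h>

/* JPEG stores its multi-byte values big-endian, so a little-endian host
 * must assemble them byte by byte (or swap) on every read. */
static uint16_t read_be16(const uint8_t *p) {
    return (uint16_t)((p[0] << 8) | p[1]);
}

int main(void) {
    /* First bytes of a JFIF file: SOI marker (FFD8), then an APP0
     * segment (FFE0) whose 16-bit length field is big-endian. */
    const uint8_t jpeg_start[] = { 0xFF, 0xD8, 0xFF, 0xE0, 0x00, 0x10 };

    uint16_t app0_len = read_be16(&jpeg_start[4]);
    printf("APP0 segment length: %u bytes\n", app0_len);  /* prints 16 */
    return 0;
}
```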
                      I just found this..

                      Originally posted by sinepgib View Post
                      It was the wrong choice because making those computers big endian was the wrong choice to begin with. Of course, we can't blame the TCP protocol for adapting to the world it was invented in. But that world was already in the wrong.
                      Essentially, whether big endian or little endian is better comes down to whether you care more about reading numbers in "human order" directly or about making things like downcasting and carry propagation easier for the computer. I care more about efficiency. If I want to read a binary number I can tell my debugger I'm working with an int; I need to do that anyway, because binaries don't know about types.
                      Here is the thing, from my perspective (this subject has already been debated almost to death on the internet)..
                      Programming for big endian is great, "you are always on top of the cake", because of its natural representation to humans in hexadecimal, and even in binary.

                      To be honest, in my opinion endianness doesn't matter in terms of efficiency: an algorithm created for big endian can be recreated for little endian with the same efficiency, or vice versa.
                      Of course the algorithm won't be identical; it will differ if optimized for one or the other, but it will do the same task in a different representation. There is no bogeyman in endianness efficiency.
                      However, tons and tons of code have been optimized for little endian over the years, and because of that you are left with the impression that little-endian architectures are faster... it's not the architecture, it's what the code was written for.
                      Nowadays most userspace code is optimized for little endian, and that translates to lower efficiency when it runs on big endian, or even to errors (most of the time) if proper attention isn't paid to the problem. That doesn't make big endian less efficient per se, but it gives little endian an advantage: run that code on little endian and you see better results, thanks to the optimizations it was written with.
                      The compilers have also received tons and tons of little-endian optimizations, far more than for big endian, and in the end that makes a big difference, at least for multimedia, which needs low latency.
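                      That failure mode is easy to sketch: below is a common little-endian-only shortcut next to a portable read of a little-endian value (illustrative helpers, not from any particular codebase).

```c
#include <stdint.h>
#include <string.h>

/* Little-endian-only shortcut: reinterpret the raw bytes in host order.
 * Fast on little endian, silently wrong on big endian whenever the data
 * is defined as little-endian on disk or on the wire. */
uint32_t read_u32_host_order(const uint8_t *buf) {
    uint32_t v;
    memcpy(&v, buf, sizeof v);
    return v;
}

/* Portable version: assemble the little-endian value explicitly. It
 * compiles to a plain load on little endian and a load plus swap on
 * big endian, so correctness costs nothing where it matters most. */
uint32_t read_u32_le(const uint8_t *buf) {
    return (uint32_t)buf[0]
         | (uint32_t)buf[1] << 8
         | (uint32_t)buf[2] << 16
         | (uint32_t)buf[3] << 24;
}
```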

                      If little endian were not so popular today and someone asked me to choose, I would choose big endian, without any doubt.
                      Why? Because it's easier to program for, it's already the network byte order, and it feels nice (at least to me); others will say little endian is the best format because that's how it feels to them.
                      What matters now is that virtually all machines operated by humans are little endian, at least in the domestic segment. In the datacenter little endian has grown without competition, due to the historical facts we all know, and I believe it's the majority there too, even though the big-endian machines in the datacenter are huge, with big processing power and scalability (but they cost tons of money); big-endian machines also have tremendous performance when dealing with the network, and that's not by chance..

                      What bothers me a bit is when I start to think about the amount of processing power we waste, every second, using the network with little endian.. but at the same time, on big endian you'd burn tons of processing dealing with software coded for little endian and with little-endian standards such as PCIe, etc. It's a messy situation..
                      Everybody talks about going green... how can we talk about going green when we burn so much processing power?!
                      Last edited by tuxd3v; 23 November 2021, 06:20 PM. Reason: end of quote.. :S

