Another Sizable Performance Optimization To Benefit Network Code With Linux 5.17


  • tuxd3v
    replied
    Originally posted by sinepgib View Post
    I really don't see the big deal here.
    Not everyone has programmed big endian in assembler; that is the case where you will mostly notice it, but there are also situations where you lose out..
    Originally posted by sinepgib View Post
    It's the efficiency of the chip that is affected, plus the mixed endianness issues. There are certain things that require extra operations to work in big endian compared to little endian. They can be done in hardware or in software, but in both cases they mean wasted energy.
    Yes, that is my point: it's the chip architecture, fabrication node, etc., and has nothing to do with endianness..

    There are also certain things that would require more operations in Little Endian too, but the code was written to follow the logical path in which the programmer was coding.. for example, testing whether a number is positive or negative..
    Originally posted by sinepgib View Post
    I can't think of a case where the (high level, programmer visible) algorithm would change due to endianness handling. Just whether you'll be doing swaps or not during serialization and deserialization.
    Algorithms are not a fixed thing; there are always, or almost always, situations where the algorithm can, and would, change to take advantage of one situation or another, independently of the arch. Of course the programmer is the one who defines that; sometimes he doesn't even realize he is coding for Little Endian, because it feels so natural, probably for someone who was taught to think that way. He can take advantage of comparisons for positive or negative, and maybe he can find a way to adapt his algorithm to take advantage of that.. it's only one example of a Big Endian optimization..

    Serialization/deserialization is the "typical elephant in the room" case where you need to explicitly swap bytes...
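
    For illustration, a minimal sketch of that explicit swap when serializing and deserializing a wire header (hypothetical field names, assuming the standard POSIX htons/htonl/ntohs/ntohl helpers):
    Code:
    #include <arpa/inet.h>  /* htons/htonl/ntohs/ntohl: host <-> network (big endian) */
    #include <stdint.h>

    /* hypothetical wire header, fields kept in network byte order on the wire */
    struct wire_hdr {
        uint16_t port;
        uint32_t length;
    };

    static void serialize(struct wire_hdr *out, uint16_t port, uint32_t length)
    {
        out->port   = htons(port);   /* swap on little endian hosts, no-op on big endian */
        out->length = htonl(length);
    }

    static void deserialize(const struct wire_hdr *in, uint16_t *port, uint32_t *length)
    {
        *port   = ntohs(in->port);   /* swap back to host byte order */
        *length = ntohl(in->length);
    }
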
    Originally posted by sinepgib View Post
    The kind of optimization you would do here is the kind that is forced by the endianness itself, and even that would most likely be implicit (done by the compiler or architecture itself).
    There are situations that favour Big Endian or Little Endian; if you are on Little Endian you will program to take advantage of what favours Little Endian.
    It's the way the algorithm was implemented that gives the advantage to big or little.. the majority of code on GNU/Linux was thought out for Little Endian. Yes, the endianness maybe only gives marginal gains, but marginal upon marginal, over thousands and thousands of iterations or lines of code, makes a difference..
    Originally posted by sinepgib View Post
    This has nothing to do with which endianness you used to write the software, but whether or not you handled mixing those.
    That can also happen, yes..
    Originally posted by sinepgib View Post
    I don't follow.
    An algorithm is not a static thing; it can be moulded and adapted to take advantage of the endianness you are programming for..
    Sometimes it is easier to follow one route because you know in advance that it favours your Little Endian target when the compiler generates assembler code.., or even, in assembler, to do some operations instead of others..
    Originally posted by sinepgib View Post
    Hmmm, it sounds like a bad idea. If LE weren't so popular but still used by consumers you would still need to write endian-aware code for anything not strictly backend. Besides, that's the problem. I don't think we should be using BE as the network endianness. But the harm is done and there's no going back.
    Yes, you would, but in that case the code could take advantage of Big Endian, because the majority of people would work there and would have moulded the code to be faster on Big Endian... in that case it would be the Little Endian CPUs that carried the burden on GNU/Linux..

    Well, now that we know the path that CPUs/operating systems took, it's easy to say that Big Endian was the wrong move for the network, but at the time..
    Originally posted by sinepgib View Post
    I'd ask the same about using BE for the network. There are two ways to solve the issue after all.
    Indeed,
    If today's consumer hardware were mixed, some would suffer from one thing, others would suffer from other things.
    If we were using BE for personal computers today, but PCIe, etc. had been invented little endian, it would also be a mess either way..
    But I suspect that in that BE scenario, PCIe would have been invented BE (to take advantage of its endianness..).

    In any case the ideal solution would have been to pick the same one from the beginning and not change it.. in my opinion.
    Last edited by tuxd3v; 24 November 2021, 01:33 AM.



  • sinepgib
    replied
    Originally posted by tuxd3v View Post
    Programming Big Endian is amazing, "you are always on top of the cake", because of its natural representation in hexadecimal for humans, even in binary..
    I really don't see the big deal here.

    Originally posted by tuxd3v View Post
    To be honest, in my opinion, the endianness doesn't matter in terms of efficiency; an algorithm created for Big Endian can be created for Little Endian with the same efficiency, or vice versa..
    It's the efficiency of the chip that is affected, plus the mixed endianness issues. There are certain things that require extra operations to work in big endian compared to little endian. They can be done in hardware or in software, but in both cases they mean wasted energy.

    Originally posted by tuxd3v View Post
    Of course the algorithm will not be the same; it will be different if optimized for one of those.. but it will do the same task, in a different representation.. there is no bogeyman in efficiency for endianness..
    I can't think of a case where the (high level, programmer visible) algorithm would change due to endianness handling. Just whether you'll be doing swaps or not during serialization and deserialization.

    Originally posted by tuxd3v View Post
    However tons and tons of code have been optimized for Little Endian over the years, and because of that you are left with the impression that Little Endian archs are faster... it's not the arch, it's what the code was written for..
    Nowadays userspace code, at least the majority of it, is optimized for Little Endian
    The kind of optimization you would do here is the kind that is forced by the endianness itself, and even that would most likely be implicit (done by the compiler or architecture itself).

    Originally posted by tuxd3v View Post
    and that translates to lower efficiency if that code runs on Big Endian, or even to errors (the majority of times..), if proper attention is not paid to the problem..
    This has nothing to do with which endianness you used to write the software, but whether or not you handled mixing those.

    Originally posted by tuxd3v View Post
    but that doesn't make Big Endian less efficient per se; however it gives Little Endian an advantage, because of the code optimized for it; when you run on Little Endian you see a better result, due to the optimizations it was coded for..
    I don't follow.

    Originally posted by tuxd3v View Post
    Why? Because it's easier to program for, it's already the network byte order, it feels nice (at least to me..); others will say that Little Endian is the best format because they feel it is..
    Hmmm, it sounds like a bad idea. If LE weren't so popular but still used by consumers you would still need to write endian-aware code for anything not strictly backend. Besides, that's the problem. I don't think we should be using BE as the network endianness. But the harm is done and there's no going back.

    Originally posted by tuxd3v View Post
    even though Big Endian machines in the datacenter are huge, with big processing power and scalability (but they cost tons of money..), Big Endian machines also have tremendous performance when dealing with the network.. it's not by chance..
    No, it's not by chance, it's precisely because swaps and whatnot don't come for free, and for historical reasons we're stuck with BE for the network.

    Originally posted by tuxd3v View Post
    What costs me a bit is when I start to think about the amount of processing power we are wasting, every second, using the network with Little Endian.. but at the same time you will have tons of processing on Big Endian to deal with software coded for Little Endian, and also with Little Endian standards, PCIe, etc.. it's a messy situation..
    Everybody talks about going green... how can we think about going green when we burn so much processing power..?!
    I'd ask the same about using BE for the network. There are two ways to solve the issue after all.



  • tuxd3v
    replied
    Originally posted by sinepgib View Post
    Indeed. You have a very limited number of instructions you can run on the CPU to achieve those throughputs. Swapping is an extra instruction, and not only that, it's an extra instruction that necessarily introduces a data dependency, which in turn means it takes extra space on your reorder buffer, slowing down your pipeline. The only operation you can get for free is the one you don't do.
    Yeah, and also in a lot of cases, if you send files over the network, you need to convert them to the network byte order, depending on the format.
    One of the file formats stored in Big Endian is JPEG; even on Little Endian machines you always need to convert, because that is the byte order of the standard (even just to open an image a conversion is needed, likewise to store one, etc..), but a lot can be said about files over the network.. then there are the formats that are stored in Little Endian as well.
    PCIe, for example, is a Little Endian thing; the majority of graphics drivers were written for Little Endian, or with Little Endian machines in mind..
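
    As a small illustration of that kind of conversion: JPEG stores its multi-byte values big endian, so reading something like a marker segment length on a little endian host means reassembling/swapping the bytes explicitly (a minimal sketch, not taken from any particular decoder):
    Code:
    #include <stdint.h>

    /* JPEG multi-byte fields are big endian; build the value byte by byte
     * so it works on any host, which on x86 amounts to a swap. */
    static uint16_t read_be16(const uint8_t *p)
    {
        return (uint16_t)((p[0] << 8) | p[1]);
    }

    /* e.g. the two bytes following a marker hold the segment length */
    uint16_t jpeg_segment_length(const uint8_t *segment)
    {
        return read_be16(segment);
    }
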
    I just found this..

    Originally posted by sinepgib View Post
    It was the wrong choice because making those computers big endian was the wrong choice to begin with. Of course, we can't blame the TCP protocol for adapting to the world it was invented in. But that world was already in the wrong.
    Essentially, whether big endian or little endian is better comes down to whether you care more about being able to read numbers in "human order" directly or making it easier on the computer to do things like down casting and carry. I care more about efficiency. If I want to read a binary number I can tell my debugger I'm working with an int. I still need to anyway because binaries don't know about types.
    Here is the thing, from my perspective (this subject has already been debated almost to death on the internet..)..
    Programming Big Endian is amazing, "you are always on top of the cake", because of its natural representation in hexadecimal for humans, even in binary..

    To be honest, in my opinion, the endianness doesn't matter in terms of efficiency; an algorithm created for Big Endian can be created for Little Endian with the same efficiency, or vice versa..
    Of course the algorithm will not be the same; it will be different if optimized for one of those.. but it will do the same task, in a different representation.. there is no bogeyman in efficiency for endianness..
    However tons and tons of code have been optimized for Little Endian over the years, and because of that you are left with the impression that Little Endian archs are faster... it's not the arch, it's what the code was written for..
    Nowadays userspace code, at least the majority of it, is optimized for Little Endian, and that translates to lower efficiency if that code runs on Big Endian, or even to errors (the majority of times..), if proper attention is not paid to the problem.. but that doesn't make Big Endian less efficient per se; however it gives Little Endian an advantage, because of the code optimized for it; when you run on Little Endian you see a better result, due to the optimizations it was coded for..
    Also the compilers have received tons and tons of optimizations for Little Endian, a lot more than for Big Endian; in the end it makes a big difference, at least for multimedia, which needs low latency..

    If Little Endian were not so popular today, and someone asked me to choose, I would choose Big Endian, without any doubt.
    Why? Because it's easier to program for, it's already the network byte order, it feels nice (at least to me..); others will say that Little Endian is the best format because they feel it is..
    What matters now is that virtually all machines operated by humans are Little Endian in the domestic segment; in the datacenter Little Endian has grown without competition due to the historical facts that we all know, and it is also the majority, I believe, even though Big Endian machines in the datacenter are huge, with big processing power and scalability (but they cost tons of money..); Big Endian machines also have tremendous performance when dealing with the network.. it's not by chance..

    What costs me a bit is when I start to think about the amount of processing power we are wasting, every second, using the network with Little Endian.. but at the same time you will have tons of processing on Big Endian to deal with software coded for Little Endian, and also with Little Endian standards, PCIe, etc.. it's a messy situation..
    Everybody talks about going green... how can we think about going green when we burn so much processing power..?!





    Last edited by tuxd3v; 23 November 2021, 06:20 PM. Reason: end of quote.. :S



  • sinepgib
    replied
    Originally posted by tuxd3v View Post
    No, it definitely is NOT.. and you can see that on a 100Gbps adapter we are at ~65Gbps with MTU 1500, or more than 90Gbps with jumbo frames, I believe, using 2 cores (it is in the previous article Michael wrote about networking).. but now just imagine if you want to saturate 400Gbps..
    Indeed. You have a very limited number of instructions you can run on the CPU to achieve those throughputs. Swapping is an extra instruction, and not only that, it's an extra instruction that necessarily introduces a data dependency, which in turn means it takes extra space on your reorder buffer, slowing down your pipeline. The only operation you can get for free is the one you don't do.
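
    To put that in concrete terms: on x86-64 the swap itself is a single instruction, but it still sits on the critical path of whatever consumes the value. A minimal sketch (using the GCC/Clang __builtin_bswap32 intrinsic; the comment about the emitted instruction is only what compilers typically generate, not something measured here):
    Code:
    #include <stdint.h>

    /* Summing 32-bit big endian values received from the network on a
     * little endian host: every load is followed by a byte swap, and the
     * add cannot start until the swap has produced its result. */
    uint64_t sum_be32(const uint32_t *net, int n)
    {
        uint64_t sum = 0;
        for (int i = 0; i < n; i++)
            sum += __builtin_bswap32(net[i]); /* typically one bswap or movbe */
        return sum;
    }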

    Originally posted by tuxd3v View Post
    At the time it was the correct choice.. since all computers that were something real were Big Endian, and we are basically talking about servers and supercomputers..
    It was the wrong choice because making those computers big endian was the wrong choice to begin with. Of course, we can't blame the TCP protocol for adapting to the world it was invented in. But that world was already in the wrong.
    Essentially, whether big endian or little endian is better comes down to whether you care more about being able to read numbers in "human order" directly or making it easier on the computer to do things like down casting and carry. I care more about efficiency. If I want to read a binary number I can tell my debugger I'm working with an int. I still need to anyway because binaries don't know about types.



  • tuxd3v
    replied
    Originally posted by microcode View Post
    .. On any AMD64 CPU, endianness swaps are essentially free.
    No, it definitely is NOT.. and you can see that on a 100Gbps adapter we are at ~65Gbps with MTU 1500, or more than 90Gbps with jumbo frames, I believe, using 2 cores (it is in the previous article Michael wrote about networking).. but now just imagine if you want to saturate 400Gbps..



  • ASBai
    replied
    If I remember correctly, basically all 100Gb+ network cards have TCP checksum offloading, right? So when do we need to run the csum_partial() function on the CPU?
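
    For context, csum_partial() is essentially the RFC 1071 Internet checksum: a 16-bit ones' complement sum over a buffer. A simplified, portable sketch of that computation (the kernel's version is heavily unrolled and arch-specific; this is only meant to show the work that hardware offload avoids):
    Code:
    #include <stddef.h>
    #include <stdint.h>

    /* RFC 1071 style Internet checksum: sum 16-bit words, fold the carries. */
    uint16_t internet_checksum(const uint8_t *buf, size_t len)
    {
        uint32_t sum = 0;

        while (len > 1) {
            sum += (uint32_t)((buf[0] << 8) | buf[1]); /* 16-bit big endian word */
            buf += 2;
            len -= 2;
        }
        if (len)                          /* odd trailing byte, padded with zero */
            sum += (uint32_t)(buf[0] << 8);

        while (sum >> 16)                 /* fold carries back into 16 bits */
            sum = (sum & 0xffff) + (sum >> 16);

        return (uint16_t)~sum;            /* ones' complement of the sum */
    }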



  • tuxd3v
    replied
    Originally posted by sinepgib View Post
    Yep. We're carrying the fallout of a big mistake named big endian, and forever will, because changing all routers in the world is infeasible. It should have never been the network endianness to begin with.
    At the time it was the correct choice.. since all computers that were something real were Big Endian, and we are basically talking about servers and supercomputers..
    Because at the time personal computers didn't exist..
    The problem was that Intel and Microsoft, the duopoly, grew so much that they became the standard for personal computers, and now also the standard in the datacenter.. also IBM/Motorola, Sun, MIPS, HP, DEC (Alpha) and others left the CPU market because they couldn't compete with the low prices that Intel/Microsoft were practising.. they also lacked Microsoft's cheap and buggy software..

    IBM OS/2 was very stable and very advanced, but it cost both legs, and also both arms.. their hardware was top as well, but too costly..
    So Intel/Microsoft basically just continued on their track without any challenge.

    Now, if the network were designed today, of course it would be little endian, but that boat sailed long ago.. right now I think only MIPS, ARMv7, SPARC and PowerPC are bi-endian; maybe there are more, I don't know..

    But the software was optimized for Little Endian, and so a lot of CPUs started to work only in Little Endian.
    In fairness I don't see a big advantage in being big or little; however the Little Endian concept is here to stay. I believe PCIe is Little Endian as well (or was optimized for Little Endian..), something very much used today in diverse accelerators, and it uses a lot of bandwidth..

    So today you have a problem...
    • if you want a very fast network, go Big Endian
    • if you want PCIe accelerators (optimized for speed), go Little Endian
    Now, what if I want both things??
    We have a problem here, because nowadays 400Gbps Fibre Channel and FCoE are already deployed in a lot of places, but in today's servers you also want tons of PCIe lanes.
    In the future, CPUs for networking will need to be Big Endian or, if Little Endian, have offload accelerators (via PCIe, for example), but in that case you couldn't use other accelerators via PCIe because PCIe lanes are limited..

    For a long time it worked well with Little Endian on the desktop (the desktop will never be the real problem here, as desktop requirements are low.. at least not for the next 30 years or so..), and now also on the server, but with the increase to massive network bandwidth the problem starts to appear, and it's only the tip of the iceberg..
    I was looking at a previous article from Michael, and we can't even saturate 100Gbps with amd64; how will we saturate 400Gbps??
    And when the network is 800Gbps, how will we saturate that?? Big problem..






  • Vistaus
    replied
    Michael, maybe you could add a note that the previous patch is already available in the Xanmod kernel?


    So no need to wait for 5.17 to enjoy the previous performance improvements.



  • jabl
    replied
    Originally posted by microcode View Post

    No, endianness swapping is never "a big elephant in the room". On any AMD64 CPU, endianness swaps are essentially free. The function that was optimized in this patch is way more expensive than endian swapping every field in the packet, and still has to be done for every packet.
    True, and further, TCP/IP being big endian only applies to the packet headers. There's no requirement to byteswap the actual data you're transferring, unless the higher layer protocol you're using specifies that all data must be big endian (which some protocols do, but many others don't).
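
    To make that concrete: only the protocol header fields get passed through ntohs()/ntohl(); the payload bytes are handed up exactly as they arrived. A minimal sketch with a made-up header layout (field names are illustrative, not from any real protocol):
    Code:
    #include <arpa/inet.h>   /* ntohs/ntohl */
    #include <stdint.h>
    #include <string.h>

    /* illustrative wire header: two ports and a payload length, big endian on the wire */
    struct hdr {
        uint16_t src_port;
        uint16_t dst_port;
        uint32_t payload_len;
    };

    /* Only the header fields are byteswapped; the payload that follows the
     * header is passed through untouched. */
    uint32_t parse_packet(const uint8_t *pkt, uint16_t *src, uint16_t *dst,
                          const uint8_t **payload)
    {
        struct hdr h;
        memcpy(&h, pkt, sizeof h);        /* avoid unaligned access */

        *src = ntohs(h.src_port);         /* header fields: convert to host order */
        *dst = ntohs(h.dst_port);
        *payload = pkt + sizeof h;        /* payload: no conversion at all */
        return ntohl(h.payload_len);
    }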



  • microcode
    replied
    Originally posted by sinepgib View Post
    Yep. We're carrying the fallout of a big mistake named big endian, and forever will, because changing all routers in the world is infeasible. It should have never been the network endianness to begin with.
    Yeah, I even disagree with the classic "endianness" analogy (it comes from a children's tale about meaningless prejudice: the little-endians being those who eat a boiled egg starting from the little end, the big-endians from the big end), because it implies that the choice isn't important. Little endian has real benefits, which is why it is the choice for the majority of processors.

    Heck, even in the moral tale that the "endianness" metaphor comes from, there is a clear advantage to eating a boiled egg from the little end: the big end fits better in a soft-boiled egg cup.
    Last edited by microcode; 23 November 2021, 11:42 AM.

