Another Sizable Performance Optimization To Benefit Network Code With Linux 5.17


  • #41
    Originally posted by microcode
    What makes you say "it's definitely NOT" free? Because as far as I can tell, CPUs like those that implement AMD64 could easily eliminate endianness swaps in a store at decode time; and if you have seen some of the things that are eliminated at decode time in these chips, it's hard to imagine they didn't bother with this... unless it's already so cheap that it's not worth doing, which may also be true. I think people here genuinely don't understand how cheap basic integer operations are, let alone simple MOVs.

    I think the home market of computers doesn't, for now, have to deal with heavy packet processing, and not all servers have to deal with large amounts of packet processing either.
    But especially if you are operating in a heterogeneous environment, you have no escape.
    It's not cheap, because there are tons of swap operations, conditions and such in the code. For one packet that is OK, but when you are processing 100Gbps it is not OK, and at 400Gbps it is worse. To swap just 8 bytes (64 bits) you do:
    Code:
    /usr/include/x86_64-linux-gnu/bits/byteswap.h
    /* Swap bytes in 64-bit value. */
    ((((x) & 0xff00000000000000ull) >> 56) \
    | (((x) & 0x00ff000000000000ull) >> 40) \
    | (((x) & 0x0000ff0000000000ull) >> 24) \
    | (((x) & 0x000000ff00000000ull) >> 8) \
    | (((x) & 0x00000000ff000000ull) << 8) \
    | (((x) & 0x0000000000ff0000ull) << 24) \
    | (((x) & 0x000000000000ff00ull) << 40) \
    | (((x) & 0x00000000000000ffull) << 56))
    This code generally consists of preprocessor macros.
    You usually don't call these functions directly; you call them indirectly, through other macros.
    Imagine the number of operations you will have when running loops of these things.
    For a single swap, or even for one packet or a dozen, no problem.



    • #42
      Originally posted by tuxd3v
      Sooner or later it will have to be done, in a binary file: host to network, and network to host.

      As mentioned, network byte order applies only to protocol headers, not payload.

      I know what endianness means, thank-you-very-much.

      JPEG, like I said, is already in network byte order (big endian), so no change for it. A program on a little-endian machine opening a JPEG already knows that it has to convert the file for reading, and the same for writing.

      That JPEG is big endian matters only to JPEG encoders and decoders (and there, doing a simple byteswap as part of all the other (de)compression work is noise). Other software, like network protocol implementations, doesn't need to care.

      For another example, if we're transferring a little-endian image format like GIF over HTTP, there is no need to byteswap the GIF image before pushing it out over the wire. The GIF file can be transferred as-is because neither HTTP nor TCP/IP cares about the data payload. Just like for JPEG, the only software that needs to care is GIF encoders and decoders.
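
To make the header-versus-payload point concrete, here is a minimal sketch in C. The framing used here (a 4-byte length prefix in network byte order) and the name frame_message are invented for illustration; they are not part of HTTP or TCP. The only thing the sketch is meant to show is that the header field goes through htonl while the payload bytes are copied as they are.

```c
#include <arpa/inet.h>  /* htonl */
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical framing, invented for this example: a 4-byte length in
 * network byte order, followed by the payload bytes exactly as given. */
size_t frame_message(unsigned char *out, size_t out_cap,
                     const void *payload, uint32_t payload_len)
{
    uint32_t be_len = htonl(payload_len);  /* only the header field is converted */

    assert(out_cap >= sizeof be_len + (size_t)payload_len);
    memcpy(out, &be_len, sizeof be_len);
    /* The payload is copied untouched, whatever its internal endianness. */
    memcpy(out + sizeof be_len, payload, payload_len);
    return sizeof be_len + (size_t)payload_len;
}
```

Whether the payload is a JPEG, a GIF or anything else makes no difference to this function; only the program that later decodes the payload has to know its internal byte order.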




      • #43
        Originally posted by tuxd3v

        I think the home market of computers doesn't, for now, have to deal with heavy packet processing, and not all servers have to deal with large amounts of packet processing either.
        But especially if you are operating in a heterogeneous environment, you have no escape.
        It's not cheap, because there are tons of swap operations, conditions and such in the code. For one packet that is OK, but when you are processing 100Gbps it is not OK, and at 400Gbps it is worse. To swap just 8 bytes (64 bits) you do:
        Code:
        /usr/include/x86_64-linux-gnu/bits/byteswap.h
        /* Swap bytes in 64-bit value. */
        ((((x) & 0xff00000000000000ull) >> 56) \
        | (((x) & 0x00ff000000000000ull) >> 40) \
        | (((x) & 0x0000ff0000000000ull) >> 24) \
        | (((x) & 0x000000ff00000000ull) >> 8) \
        | (((x) & 0x00000000ff000000ull) << 8) \
        | (((x) & 0x0000000000ff0000ull) << 24) \
        | (((x) & 0x000000000000ff00ull) << 40) \
        | (((x) & 0x00000000000000ffull) << 56))
        This code generally consists of preprocessor macros.
        You usually don't call these functions directly; you call them indirectly, through other macros.
        Imagine the number of operations you will have when running loops of these things.
        For a single swap, or even for one packet or a dozen, no problem.

        You neglected to copy-paste the full code:

        Code:
        /* Swap bytes in 64-bit value. */
        #define __bswap_constant_64(x) \
        ((((x) & 0xff00000000000000ull) >> 56) \
        | (((x) & 0x00ff000000000000ull) >> 40) \
        | (((x) & 0x0000ff0000000000ull) >> 24) \
        | (((x) & 0x000000ff00000000ull) >> 8) \
        | (((x) & 0x00000000ff000000ull) << 8) \
        | (((x) & 0x0000000000ff0000ull) << 24) \
        | (((x) & 0x000000000000ff00ull) << 40) \
        | (((x) & 0x00000000000000ffull) << 56))
        
        __extension__ static __inline __uint64_t
        __bswap_64 (__uint64_t __bsx)
        {
        #if __GNUC_PREREQ (4, 3)
          return __builtin_bswap64 (__bsx);
        #else
          return __bswap_constant_64 (__bsx);
        #endif
        }
        So that macro is used only if you have an ancient GCC (< 4.3). Otherwise the function uses the compiler builtin, which on x86_64 compiles to a single instruction.

        (Also, as I recall from looking at ASM output, modern optimizing compilers are really quite clever and can often recognize the byteswap pattern and replace code like the macro above with the target's byteswap instruction.)
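
For anyone who wants to check this themselves, a small sketch in C follows. The function names swap64_manual, swap64_builtin and check_swap64 are made up for this example; __builtin_bswap64 is the GCC/Clang builtin from the snippet above, so this assumes one of those compilers. The sketch only shows that the shift-and-mask form and the builtin compute the same value; what machine code each one becomes depends on the compiler and optimization level, so compare the assembly output yourself rather than taking it on faith.

```c
#include <assert.h>
#include <stdint.h>

/* Shift-and-mask swap, same shape as the glibc macro quoted above. */
uint64_t swap64_manual(uint64_t x)
{
    return ((x & 0xff00000000000000ull) >> 56)
         | ((x & 0x00ff000000000000ull) >> 40)
         | ((x & 0x0000ff0000000000ull) >> 24)
         | ((x & 0x000000ff00000000ull) >> 8)
         | ((x & 0x00000000ff000000ull) << 8)
         | ((x & 0x0000000000ff0000ull) << 24)
         | ((x & 0x000000000000ff00ull) << 40)
         | ((x & 0x00000000000000ffull) << 56);
}

/* Compiler builtin (GCC and Clang). */
uint64_t swap64_builtin(uint64_t x)
{
    return __builtin_bswap64(x);
}

/* Sanity check: both versions agree, and swapping twice is the identity. */
void check_swap64(uint64_t x)
{
    assert(swap64_manual(x) == swap64_builtin(x));
    assert(swap64_builtin(swap64_builtin(x)) == x);
}
```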



        • #44
          Originally posted by tuxd3v

          I think the home market of computers doesn't, for now, have to deal with heavy packet processing, and not all servers have to deal with large amounts of packet processing either.
          But especially if you are operating in a heterogeneous environment, you have no escape.
          It's not cheap, because there are tons of swap operations, conditions and such in the code. For one packet that is OK, but when you are processing 100Gbps it is not OK, and at 400Gbps it is worse. To swap just 8 bytes (64 bits) you do:
          Code:
          /usr/include/x86_64-linux-gnu/bits/byteswap.h
          /* Swap bytes in 64-bit value. */
          ((((x) & 0xff00000000000000ull) >> 56) \
          | (((x) & 0x00ff000000000000ull) >> 40) \
          | (((x) & 0x0000ff0000000000ull) >> 24) \
          | (((x) & 0x000000ff00000000ull) >> 8) \
          | (((x) & 0x00000000ff000000ull) << 8) \
          | (((x) & 0x0000000000ff0000ull) << 24) \
          | (((x) & 0x000000000000ff00ull) << 40) \
          | (((x) & 0x00000000000000ffull) << 56))
          This code generally consists of preprocessor macros.
          You usually don't call these functions directly; you call them indirectly, through other macros.
          Imagine the number of operations you will have when running loops of these things.
          For a single swap, or even for one packet or a dozen, no problem.

          As far as I'm aware, 64-bit swaps are not required for TCP. A TCP segment header has two 32-bit fields, the sequence number and the ACK number (this is zero unless ACK is set); the other numeric fields are either 16-bit or weird (the 4-bit data offset). There are also optional fields, but the largest datatypes in there are 32-bit. Furthermore, please read what jabl shared.
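
As a rough illustration of the field sizes involved, here is a sketch in C that reads the ports, sequence number and ACK number out of a raw TCP header. The struct tcp_fixed_fields and the function parse_tcp_fixed are hypothetical names for this example, not anything from the kernel; the offsets follow the standard TCP header layout, and ntohs/ntohl are the usual POSIX conversions. Nothing wider than 32 bits is ever swapped.

```c
#include <arpa/inet.h>  /* ntohs, ntohl */
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical holder for a few fixed TCP header fields, in host order. */
struct tcp_fixed_fields {
    uint16_t src_port;
    uint16_t dst_port;
    uint32_t seq;
    uint32_t ack;
};

struct tcp_fixed_fields parse_tcp_fixed(const unsigned char *hdr, size_t len)
{
    struct tcp_fixed_fields f;
    uint16_t v16;
    uint32_t v32;

    assert(len >= 20);  /* minimum TCP header size in bytes */

    /* memcpy avoids unaligned access; each field needs one 16- or 32-bit swap. */
    memcpy(&v16, hdr + 0, sizeof v16);  f.src_port = ntohs(v16);
    memcpy(&v16, hdr + 2, sizeof v16);  f.dst_port = ntohs(v16);
    memcpy(&v32, hdr + 4, sizeof v32);  f.seq = ntohl(v32);
    memcpy(&v32, hdr + 8, sizeof v32);  f.ack = ntohl(v32);
    return f;
}
```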



          • #45
            Originally posted by jabl
            So that macro is used only if you have an ancient GCC (< 4.3). Otherwise the function uses the compiler builtin, which on x86_64 compiles to a single instruction.

            (Also, as I recall from looking at ASM output, modern optimizing compilers are really quite clever and can often recognize the byteswap pattern and replace code like the macro above with the target's byteswap instruction.)

            It's even better: if you have GCC < 4.3, it uses a different builtin called __bswap_constant_....

            Originally posted by tuxd3v
            I think the home market of computers doesn't have...

            AMD64 chips are not just the home market of computers; they are the most common server chips in the world, and they are the machines that move the most TCP in software, period.

            Originally posted by tuxd3v
            But especially if you are operating in a heterogeneous environment, you have no escape.
            What?

            Originally posted by tuxd3v
            It's not cheap, because there are tons of swap operations, conditions and such in the code. For one packet that is OK, but when you are processing 100Gbps it is not OK, and at 400Gbps it is worse. To swap just 8 bytes (64 bits) you do:
            Code:
            /usr/include/x86_64-linux-gnu/bits/byteswap.h
            /* Swap bytes in 64-bit value. */
            ((((x) & 0xff00000000000000ull) >> 56) \
            | (((x) & 0x00ff000000000000ull) >> 40) \
            | (((x) & 0x0000ff0000000000ull) >> 24) \
            | (((x) & 0x000000ff00000000ull) >> 8) \
            | (((x) & 0x00000000ff000000ull) << 8) \
            | (((x) & 0x0000000000ff0000ull) << 24) \
            | (((x) & 0x000000000000ff00ull) << 40) \
            | (((x) & 0x00000000000000ffull) << 56))
            This code generally consists of preprocessor macros.
            You usually don't call these functions directly; you call them indirectly, through other macros.
            Imagine the number of operations you will have when running loops of these things.
            For a single swap, or even for one packet or a dozen, no problem.

            Here's a Godbolt showing what those compiler builtins are compiled to: https://godbolt.org/z/ez8rc3MT9
            32-bit and 64-bit byteswaps are each handled with a single instruction. A 32-bit byteswap is encoded as a two-byte instruction, and a 64-bit byteswap (unused in TCP) is encoded as three bytes. __builtin_bswap16 technically expands to two instructions and five bytes, but that is well within the range of things instruction decoders can detect. Keep in mind that instruction !== operation in a machine like this: just because an instruction is encoded doesn't mean that there's a corresponding operation, and operations can have zero latency. That is, if you are doing some other work to compute that field, or moving it from somewhere else, no time is taken between that operation and storing the swapped version; ditto for loading.
            Last edited by microcode; 26 November 2021, 10:44 AM.



            • #46
              Originally posted by microcode

              As far as I'm aware, 64-bit swaps are not required for TCP. A TCP segment header has two 32-bit fields, the sequence number and the ACK number (this is zero unless ACK is set); the other numeric fields are either 16-bit or weird (the 4-bit data offset). There are also optional fields, but the largest datatypes in there are 32-bit. Furthermore, please read what jabl shared.
              Which is slightly worse if you think about it. You have optimized instructions for swapping, but then you need to issue two instructions instead of one and use two slots of the ROB instead of one.
              On another note, out of complete ignorance, doesn't IPv6 require a swap for addresses too? Those are bigger AFAIR.



              • #47
                Originally posted by sinepgib
                You have optimized instructions for swapping, but then you need to issue two instructions instead of one and use two slots of the ROB instead of one.

                Again, none of these instructions imply that there are any slots taken up in the ROB; the smart way to implement them, if this cost were of any concern to begin with, would be to fuse them with load and store ops.

                Originally posted by sinepgib
                On another note, out of complete ignorance, doesn't IPv6 require a swap for addresses too? Those are bigger AFAIR.
                That depends on how you stored them to begin with, and whether you are computing on them as numbers (pretty uncommon); my understanding is that when you have a socket open like this, those IP fields are templated in.

                Overall, the bigger problem with TCP would not be byte swaps, but the bit manipulation required to fill those odd-shaped fields.
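
A sketch in C of the kind of bit manipulation meant here, for the 4-bit data offset and the flag bits. The function names and the TCP_FLAG_* constants are hypothetical, chosen for this example; the byte offsets (12 for the data offset nibble, 13 for the flags) and the flag bit values follow the standard TCP header layout. This is a sketch under those assumptions, not how any particular stack does it.

```c
#include <assert.h>

/* Flag bits in byte 13 of the TCP header (names are local to this sketch). */
enum {
    TCP_FLAG_FIN = 0x01,
    TCP_FLAG_SYN = 0x02,
    TCP_FLAG_RST = 0x04,
    TCP_FLAG_PSH = 0x08,
    TCP_FLAG_ACK = 0x10
};

/* The data offset lives in the high nibble of byte 12, counted in 32-bit words. */
unsigned tcp_header_len(const unsigned char *hdr)
{
    return (unsigned)(hdr[12] >> 4) * 4u;
}

/* Write the data offset while preserving the low nibble of byte 12. */
void tcp_set_header_len(unsigned char *hdr, unsigned bytes)
{
    assert(bytes >= 20 && bytes <= 60 && bytes % 4 == 0);
    hdr[12] = (unsigned char)((hdr[12] & 0x0f) | ((bytes / 4u) << 4));
}

int tcp_has_flag(const unsigned char *hdr, unsigned flag)
{
    return (hdr[13] & flag) != 0;
}
```

Note that none of this involves a byte swap at all: the cost is in the shifts, masks and read-modify-write of sub-byte fields.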



                • #48
                  Originally posted by microcode
                  It's even better: if you have GCC < 4.3, it uses a different builtin called __bswap_constant_....

                  Er, no. If you look at the code snippet I posted, you'll see that __bswap_constant_64 is the name of the macro defined just above.



                  • #49
                    Originally posted by microcode
                    Again, none of these instructions imply that there are any slots taken up in the ROB; the smart way to implement them, if this cost were of any concern to begin with, would be to fuse them with load and store ops.

                    Going by the tables at https://www.agner.org/optimize/instruction_tables.pdf, it looks like MOVBE is equal to MOV in both ops and reciprocal throughput (at least on Zen), so it does look like modern CPUs actually do this at zero cost (as long as the compiler optimizes to use MOVBE and not BSWAP+MOV).
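
For reference, this is the kind of source pattern that gives a compiler the chance to pick a combined load-and-swap. It is a sketch in C with made-up helper names (load_be32, store_be32), assuming GCC or Clang for __builtin_bswap32 and the __BYTE_ORDER__ macros. Whether the output is actually MOVBE or BSWAP+MOV depends on the compiler and on the target options (for example -mmovbe or a suitable -march on GCC), so check the assembly for your own build.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Read a 32-bit big-endian value from memory into host order. */
uint32_t load_be32(const void *p)
{
    uint32_t v;

    assert(p != NULL);
    memcpy(&v, p, sizeof v);          /* plain load, alignment-safe */
#if defined(__BYTE_ORDER__) && __BYTE_ORDER__ == __ORDER_LITTLE_ENDIAN__
    v = __builtin_bswap32(v);         /* swap only on little-endian hosts */
#endif
    return v;
}

/* Write a host-order 32-bit value to memory as big endian. */
void store_be32(void *p, uint32_t v)
{
    assert(p != NULL);
#if defined(__BYTE_ORDER__) && __BYTE_ORDER__ == __ORDER_LITTLE_ENDIAN__
    v = __builtin_bswap32(v);
#endif
    memcpy(p, &v, sizeof v);
}
```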



                    • #50
                      Originally posted by jabl
                      You neglected to copy-paste the full code:

                      I didn't; I just see very badly, thanks.
                      Yes, it does use __builtin_bswapxx for swapping.
                      I believed that support only started to land in 2012.
                      GCC has advanced a lot in recent years.

                      There are also architectures that perform badly with the byte-swapping builtins, because they just emit calls to __bswapsi2 and __bswapdi2 in libgcc.
                      I found this about the subject, but it is old and probably outdated.

