Improving The Linux Kernel's Memory Performance


  • #21
    Originally posted by Shining Arcanine View Post
    SSE4A is AMD's variant of Intel's SSE4 extensions.
    Not at all.

    Comment


    • #22
      Originally posted by RealNC View Post
      No, that's not what -mtune is doing. It does not generate any instructions that would not run on other CPUs. It just applies changes that will work everywhere, but are known to result in faster execution. From the docs:

      "-mtune=cpu-type - Tune to cpu-type everything applicable about the generated code, except for the ABI and the set of available instructions."

      So no multiple code paths or anything. That's what the Intel compiler does. GCC doesn't provide that functionality.
      You are right, but my initial point about him having to recompile his kernel is correct. Since GCC is not generating multiple code paths, getting a kernel that uses SSE3 requires compiling it for SSE3.
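A minimal sketch of the compile-time side of this point, assuming GCC or Clang on x86: the predefined macro `__SSE3__` is set only when the build actually enables SSE3 (for example with `-msse3` or a suitable `-march`), and `-mtune` on its own leaves it undefined. The helper names below are hypothetical and only serve the illustration.

```c
/*
 * GCC and Clang define __SSE3__ only when SSE3 code generation is enabled
 * for this translation unit (e.g. -msse3 or a matching -march).
 * -mtune alone does not change the available instruction set.
 */
static int built_with_sse3(void)
{
#ifdef __SSE3__
    return 1;
#else
    return 0;
#endif
}

/* Hypothetical helper: a human-readable summary of the build setting. */
static const char *sse3_build_status(void)
{
    return built_with_sse3() ? "compiler may emit SSE3 instructions"
                             : "compiler will not emit SSE3 on its own";
}
```

Building the same file once with `-mtune=core2` and once with `-march=core2` should flip the result, which is the distinction being argued about here.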

      Originally posted by whitecat View Post
      You say that, but your reference agrees with me:

      AMD currently supports only 4 instructions from the SSE4 instruction set, but have also added four new SSE instructions, naming the group SSE4a. These instructions are not found in Intel's processors supporting SSE4.1 and alternatively AMD processors aren't supporting Intel's SSE4.1. Support was added for SSE4a for unaligned SSE load-operation instructions (which formerly required 16-byte alignment).
      The general story is that some of the instructions Intel was implementing were hard to implement in the K8 architecture, so they picked the easy ones and added a few. It is AMD's version of SSE4.

      Comment


      • #23
        Originally posted by Shining Arcanine View Post
        You are right, but my initial point about him having to recompile his kernel is correct. Since GCC is not generating multiple code paths, getting a kernel that uses SSE3 requires compiling it for SSE3.
        A program that uses SSE3 to manipulate data and a program that is compiled with SSE3 are two different things.
        You don't have to compile the kernel with "sse3" in order to enable (i.e. "use") SSE3 in your program or kernel. I'm no specialist, but given the purpose of SSE3, I think that if you compile the kernel with SSE3, very few lines of code will actually turn into useful SSE3 instructions. Also, if the devs use SSE3 optimizations to speed up some functions, you don't have to compile your kernel with SSE3.


        Originally posted by Shining Arcanine View Post
        You say that, but your reference agrees with me:
        SSE4 is:
        Intel -> SSE4.1 : 47 instructions implemented in Intel CPU.
        Intel -> SSE4.2 : 7 instructions implemented in Intel CPU.
        AMD -> SSE4a : 4 "intel" instructions (don't know which ones precisely) + 4 exclusive AMD instructions (not found in Intel implementation).


        Originally posted by Shining Arcanine View Post
        It is AMD's version of SSE4.
        Very light version hence.
        AMD: 8 instructions (4 "intel" + 4 "amd exclusive")
        Intel: 54 instructions

        Comment


        • #24
          Originally posted by whitecat View Post
          A program that uses SSE3 to manipulate data and a program that is compiled with SSE3 are two different things.
          You don't have to compile the kernel with "sse3" in order to enable (i.e. "use") SSE3 in your program or kernel. I'm no specialist, but given the purpose of SSE3, I think that if you compile the kernel with SSE3, very few lines of code will actually turn into useful SSE3 instructions. Also, if the devs use SSE3 optimizations to speed up some functions, you don't have to compile your kernel with SSE3.
          I am not a Linux kernel developer, but I have enough experience in software development that I have a decent idea of how the kernel works. If the processor family is not set to an SSE3-capable processor at the time it is compiled, then SSE3 should not be used in the kernel. The only exception to this should be if the compiler could generate multiple code paths, which I thought GCC could do, but RealNC demonstrated that I was wrong in thinking that.

          It is possible to write code for less capable x86 processor families that will automatically detect more capable x86 processor families and adjust the path to make things faster, but that is something that developers would rely on a compiler to do. It is difficult to maintain kernel code if you do your compiler's job for it. The reason for this is that the Linux kernel supports more than just x86. It doesn't make sense for the kernel developers to write hacks that second guess a specific processor family unless they do it in a way that benefits all architectures. I would be surprised if Linus Torvalds committed code that did this while only being relevant to a single architecture.

          Originally posted by whitecat View Post

          SSE4 is:
          Intel -> SSE4.1 : 47 instructions implemented in Intel CPU.
          Intel -> SSE4.2 : 7 instructions implemented in Intel CPU.
          AMD -> SSE4a : 4 "intel" instructions (don't know which ones precisely) + 4 exclusive AMD instructions (not found in Intel implementation).



          Very light version hence.
          AMD: 8 instructions (4 "intel" + 4 "amd exclusive")
          Intel: 54 instructions
          Whatever it is called by either of us is pointless as it has no effect on how actual CPUs function. I still call it AMD's version, as I understand it to be a derivative extension.
          Last edited by Shining Arcanine; 17 August 2011, 02:15 PM.

          Comment


          • #25
            Originally posted by Shining Arcanine View Post
            It is possible to write code for less capable x86 processor families that will automatically detect more capable x86 processor families and adjust the path to make things faster, but that is something that developers would rely on a compiler to do. It is difficult to maintain kernel code if you do your compiler's job for it. The reason for this is that the Linux kernel supports more than just x86. It doesn't make sense for the kernel developers to write hacks that second guess a specific processor family unless they do it in a way that benefits all architectures. I would be surprised if Linus Torvalds committed code that did this while only being relevant to a single architecture.
            You are wrong about this; a classic example would be software RAID. Do a grep for raid6 in your dmesg (assuming you have software RAID support built in) and you will get something that looks like this:

            raid6: int64x1 1715 MB/s
            raid6: int64x2 2401 MB/s
            raid6: int64x4 1701 MB/s
            raid6: int64x8 1626 MB/s
            raid6: sse2x1 2894 MB/s
            raid6: sse2x2 4924 MB/s
            raid6: sse2x4 5496 MB/s
            raid6: using algorithm sse2x4 (5496 MB/s)
            md: raid6 personality registered for level 6

            As you can see, the kernel developers did implement code that works only on x86 machines, and they also wrote a mini-benchmark for figuring out which version is best for the CPU it is being run on. If the SSE3 version of memcpy proves to be fast enough to justify the effort, they will do the same.
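The benchmark-and-pick idea can be sketched roughly as follows. This is not the kernel's raid6 code: the two XOR routines are hypothetical stand-ins that only borrow their labels from the log above, and `clock()` stands in for whatever timer the kernel really uses.

```c
#include <stddef.h>
#include <stdint.h>
#include <time.h>

/* Two hypothetical implementations of the same job: XOR all words together. */
static uint64_t xor_x1(const uint64_t *p, size_t n)
{
    uint64_t acc = 0;
    for (size_t i = 0; i < n; i++)
        acc ^= p[i];
    return acc;
}

/* Same result, but unrolled by two with independent accumulators. */
static uint64_t xor_x2(const uint64_t *p, size_t n)
{
    uint64_t a = 0, b = 0;
    size_t i = 0;
    for (; i + 2 <= n; i += 2) {
        a ^= p[i];
        b ^= p[i + 1];
    }
    if (i < n)
        a ^= p[i];
    return a ^ b;
}

struct xor_impl {
    const char *name;
    uint64_t (*fn)(const uint64_t *, size_t);
};

static const struct xor_impl impls[] = {
    { "int64x1", xor_x1 },
    { "int64x2", xor_x2 },
};
#define N_IMPLS (sizeof(impls) / sizeof(impls[0]))

/* Time every candidate on the same buffer; return the index of the fastest. */
static size_t pick_fastest(const uint64_t *buf, size_t n, int rounds)
{
    size_t best = 0;
    clock_t best_time = 0;
    volatile uint64_t sink = 0; /* keeps the calls from being optimized away */

    for (size_t i = 0; i < N_IMPLS; i++) {
        clock_t start = clock();
        for (int r = 0; r < rounds; r++)
            sink ^= impls[i].fn(buf, n);
        clock_t elapsed = clock() - start;
        if (i == 0 || elapsed < best_time) {
            best = i;
            best_time = elapsed;
        }
    }
    (void)sink;
    return best;
}
```

A caller would run `pick_fastest()` once at startup and then route all later work through `impls[best].fn`, so the measurement cost is paid a single time.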

            Comment


            • #26
              Originally posted by Shining Arcanine View Post
              I am not a Linux kernel developer, but I have enough experience in software development that I have a decent idea of how the kernel works. If the processor family is not set to an SSE3-capable processor at the time it is compiled, then SSE3 should not be used in the kernel. The only exception to this should be if the compiler could generate multiple code paths, which I thought GCC could do, but RealNC demonstrated that I was wrong in thinking that.
              I only said that I can write lines of SSE3 ASM code in my program, compile it with no special option (amd64 with SSE2, for instance), and it will work well, with optimized SSE3 paths. This program can be the kernel or anything else. The only thing I have to do is check at runtime whether SSE3 is available, obviously.
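A rough sketch of that approach, assuming GCC or Clang on x86: the file is built with default options, one function is marked with the `target("sse3")` attribute and uses intrinsics (standing in for hand-written ASM), and a CPUID check at runtime decides whether to call it. The function names are hypothetical; on other architectures only the generic path is compiled.

```c
#include <stddef.h>

/* Generic path: plain C, runs on any CPU. */
static float sum_generic(const float *v, size_t n)
{
    float total = 0.0f;
    for (size_t i = 0; i < n; i++)
        total += v[i];
    return total;
}

#if defined(__GNUC__) && (defined(__x86_64__) || defined(__i386__))
#include <cpuid.h>
#include <pmmintrin.h>

/* CPUID leaf 1 reports SSE3 support in bit 0 of ECX. */
static int cpu_has_sse3(void)
{
    unsigned int eax, ebx, ecx, edx;
    if (!__get_cpuid(1, &eax, &ebx, &ecx, &edx))
        return 0;
    return (ecx & 1u) != 0;
}

/* SSE3 path: only this function is compiled with SSE3 enabled. */
static __attribute__((target("sse3"))) float sum_sse3(const float *v, size_t n)
{
    __m128 acc = _mm_setzero_ps();
    size_t i = 0;
    for (; i + 4 <= n; i += 4)
        acc = _mm_add_ps(acc, _mm_loadu_ps(v + i));
    acc = _mm_hadd_ps(acc, acc); /* SSE3 horizontal add */
    acc = _mm_hadd_ps(acc, acc);
    float total = _mm_cvtss_f32(acc);
    for (; i < n; i++)           /* leftover elements */
        total += v[i];
    return total;
}
#endif

typedef float (*sum_fn)(const float *, size_t);

/* Pick the implementation once, based on what the running CPU reports. */
static sum_fn pick_sum(void)
{
#if defined(__GNUC__) && (defined(__x86_64__) || defined(__i386__))
    if (cpu_has_sse3())
        return sum_sse3;
#endif
    return sum_generic;
}
```

The rest of the program never needs `-msse3`; it just calls whatever `pick_sum()` returned.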

              Originally posted by Shining Arcanine View Post
              Whatever it is called by either of us is pointless as it has no effect on how actual CPUs function. I still call it AMD's version, as I understand it to be a derivative extension.
              Google tells me SSE4a is not a derivative. Or explain to me how AMD engineers have implemented 54 Intel instructions in only 8.

              Comment


              • #27
                Originally posted by Shining Arcanine View Post
                It is possible to write code for less capable x86 processor families that will automatically detect more capable x86 processor families and adjust the path to make things faster, but that is something that developers would rely on a compiler to do. It is difficult to maintain kernel code if you do your compiler's job for it. The reason for this is that the Linux kernel supports more than just x86. It doesn't make sense for the kernel developers to write hacks that second guess a specific processor family unless they do it in a way that benefits all architectures. I would be surprised if Linus Torvalds committed code that did this while only being relevant to a single architecture.
                The kernel has to include a decent amount of architecture specific assembly code. The C/C++ language has no way to interact with a lot of the instructions required to simply boot an OS and manage the CPU. Because there's a lot of this code already present, kernel developers tend to be OK with adding more assembly code for specific hotspots in the kernel or drivers - at least if it can be shown that such code makes a significant difference. Typically you just have to make sure there is a generic C path, then you can add optimizations to specific architectures without having to worry about keeping it portable.

                In fact, I believe the kernel's memcpy function is already written directly in assembly, likely just to ensure that everything was being done in the best way possible for such an important function. While using a compiler is generally better for large chunks of code, humans can often still beat it at optimizing small pieces and that can be important. So simply adding a new SSE3 case to the existing x86 assembly probably isn't a big deal.
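The "generic C path plus per-architecture override" pattern can be sketched like this. It is an illustration only, not the kernel's memcpy: the function names are hypothetical, and the x86 variant uses GCC-style inline assembly around `rep movsb`.

```c
#include <stddef.h>

/* Portable fallback: a plain C byte copy that works on every architecture. */
static void *copy_generic(void *dst, const void *src, size_t n)
{
    unsigned char *d = dst;
    const unsigned char *s = src;
    while (n--)
        *d++ = *s++;
    return dst;
}

#if defined(__GNUC__) && (defined(__x86_64__) || defined(__i386__))
/*
 * x86-only variant: "rep movsb" copies the count in (r/e)cx bytes from
 * the address in (r/e)si to the address in (r/e)di.
 */
static void *copy_x86(void *dst, const void *src, size_t n)
{
    void *d = dst;
    __asm__ volatile("rep movsb"
                     : "+D"(d), "+S"(src), "+c"(n)
                     :
                     : "memory");
    return dst;
}
#define copy_bytes copy_x86
#else
/* Every other architecture keeps using the generic C path. */
#define copy_bytes copy_generic
#endif
```

Callers only ever use `copy_bytes`, so adding an optimized variant for one architecture does not disturb the others.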
                Last edited by smitty3268; 17 August 2011, 04:02 PM.

                Comment


                • #28
                  Originally posted by whitecat View Post
                  Google tells me SSE4a is not a derivative. Or explain to me how AMD engineers have implemented 54 Intel instructions in only 8.
                  SSE4a is not a SSE4 derivative at all. That's like calling 3DNow an SSE derivative - the only difference is that in the former case AMD took the SSE name rather than creating their own. Something Intel wasn't very happy about, IIRC.

                  Comment


                  • #29
                    Originally posted by smitty3268 View Post
                    SSE4a is not a SSE4 derivative at all. That's like calling 3DNow an SSE derivative - the only difference is that in the former case AMD took the SSE name rather than creating their own. Something Intel wasn't very happy about, IIRC.
                    How is SSE4a not a SSE4 derivative if half of its instructions match SSE4 instructions in opcode, name and functionality?

                    SSE was made after 3DNow, while SSE4a was made after Intel published its SSE4 extensions. The instructions provided by SSE and 3DNow do not intersect.

                    I feel like these points on SSE4a not being a SSE4 derivative are derived from the following rather than any actual technical reason:

                    Last edited by Shining Arcanine; 18 August 2011, 09:22 AM.

                    Comment


                    • #30
                      Originally posted by Ansla View Post
                      You are wrong about this; a classic example would be software RAID. Do a grep for raid6 in your dmesg (assuming you have software RAID support built in) and you will get something that looks like this:

                      raid6: int64x1 1715 MB/s
                      raid6: int64x2 2401 MB/s
                      raid6: int64x4 1701 MB/s
                      raid6: int64x8 1626 MB/s
                      raid6: sse2x1 2894 MB/s
                      raid6: sse2x2 4924 MB/s
                      raid6: sse2x4 5496 MB/s
                      raid6: using algorithm sse2x4 (5496 MB/s)
                      md: raid6 personality registered for level 6

                      As you can see, the kernel developers did implement code that works only on x86 machines, and they also wrote a mini-benchmark for figuring out which version is best for the CPU it is being run on. If the SSE3 version of memcpy proves to be fast enough to justify the effort, they will do the same.
                      Thanks for that. I didn't realize that they had something like this in the code. Still, there are two things I said here. The first was that it was unlikely that the kernel developers would second guess the processor architecture. The second was that if the kernel developers implemented something like this, it would be done in a way that benefits all architectures. I was wrong about the first, but I think I am quite right about the second. Those messages suggest to me that they modularized the code that does this, so on other architectures, code for supporting similar extensions can be put in its place.

                      My original point about having to do recompilation is not wrong though. The compiler won't generate its own SSE3 assembly unless it is told to do it by the build system, so strictly speaking, he would need to recompile his kernel to get SSE3 instructions into areas where the kernel developers did not do this manually.
                      Last edited by Shining Arcanine; 18 August 2011, 09:17 AM.

                      Comment
