
Improving The Linux Kernel's Memory Performance


  • Shining Arcanine
    replied
    Originally posted by whitecat View Post
    A program that uses SSE3 in order to manipulate data and a program that is compiled with SSE3 are two different things.
    You don't have to compile the kernel with "sse3" in order to enable (i.e. "use") SSE3 for your program/kernel. I'm not a specialist, but given the purpose of SSE3, I think that if you compile the kernel with SSE3, very few lines of code will actually become SSE3 instructions. Also, if devs use SSE3 optimizations to speed up some functions, you don't have to compile your kernel with SSE3.
    I am not a Linux kernel developer, but I have enough experience in software development that I have a decent idea of how the kernel works. If the processor family is not set to an SSE3-capable processor at the time it is compiled, then SSE3 should not be used in the kernel. The only exception to this should be if the compiler could generate multiple code paths, which I thought GCC could do, but RealNC demonstrated that I was wrong in thinking that.

    It is possible to write code for less capable x86 processor families that will automatically detect more capable x86 processor families and adjust the path to make things faster, but that is something that developers would rely on a compiler to do. It is difficult to maintain kernel code if you do your compiler's job for it. The reason for this is that the Linux kernel supports more than just x86. It doesn't make sense for the kernel developers to write hacks that second guess a specific processor family unless they do it in a way that benefits all architectures. I would be surprised if Linus Torvalds committed code that did this while only being relevant to a single architecture.

    Originally posted by whitecat View Post

    SSE4 is:
    Intel -> SSE4.1 : 47 instructions implemented in Intel CPU.
    Intel -> SSE4.2 : 7 instructions implemented in Intel CPU.
    AMD -> SSE4a : 4 "intel" instructions (don't know which ones precisely) + 4 exclusive AMD instructions (not found in Intel implementation).



    Very light version hence.
    AMD: 8 instructions (4 "intel" + 4 "amd exclusive")
    Intel: 54 instructions
    Whatever it is called by either of us is pointless as it has no effect on how actual CPUs function. I still call it AMD's version, as I understand it to be a derivative extension.
    Last edited by Shining Arcanine; 17 August 2011, 02:15 PM.



  • whitecat
    replied
    Originally posted by Shining Arcanine View Post
    You are right, but my initial point about him having to recompile his kernel is correct. Since GCC is not generating multiple code paths, getting a kernel that uses SSE3 requires compiling it for SSE3.
    A program that uses SSE3 in order to manipulate data and a program that is compiled with SSE3 are two different things.
    You don't have to compile the kernel with "sse3" in order to enable (i.e. "use") SSE3 for your program/kernel. I'm not a specialist, but given the purpose of SSE3, I think that if you compile the kernel with SSE3, very few lines of code will actually become SSE3 instructions. Also, if devs use SSE3 optimizations to speed up some functions, you don't have to compile your kernel with SSE3.


    Originally posted by Shining Arcanine View Post
    You say that, but your reference agrees with me:
    SSE4 is:
    Intel -> SSE4.1 : 47 instructions implemented in Intel CPU.
    Intel -> SSE4.2 : 7 instructions implemented in Intel CPU.
    AMD -> SSE4a : 4 "intel" instructions (don't know which ones precisely) + 4 exclusive AMD instructions (not found in Intel implementation).


    Originally posted by Shining Arcanine View Post
    It is AMD's version of SSE4.
    Hence a very light version.
    AMD: 8 instructions (4 "intel" + 4 "amd exclusive")
    Intel: 54 instructions



  • Shining Arcanine
    replied
    Originally posted by RealNC View Post
    No, that's not what -mtune is doing. It does not generate any instructions that would not run on other CPUs. It just applies changes that will work everywhere, but are known to result in faster execution. From the docs:

    "-mtune=cpu-type - Tune to cpu-type everything applicable about the generated code, except for the ABI and the set of available instructions."

    So no multiple code paths or anything. That's what the Intel compiler does. GCC doesn't provide that functionality.
    You are right, but my initial point about him having to recompile his kernel is correct. Since GCC is not generating multiple code paths, getting a kernel that uses SSE3 requires compiling it for SSE3.

    Originally posted by whitecat View Post
    You say that, but your reference agrees with me:

    AMD currently supports only 4 instructions from the SSE4 instruction set, but have also added four new SSE instructions, naming the group SSE4a. These instructions are not found in Intel's processors supporting SSE4.1 and alternatively AMD processors aren't supporting Intel's SSE4.1. Support was added for SSE4a for unaligned SSE load-operation instructions (which formerly required 16-byte alignment).
    The general story is that some of the instructions Intel was implementing were hard to implement in the K8 architecture, so they picked the easy ones and added a few. It is AMD's version of SSE4.



  • whitecat
    replied
    Originally posted by Shining Arcanine View Post
    SSE4A is AMD's variant of Intel's SSE4 extensions.
    Not at all.



  • whitecat
    replied
    Originally posted by movieman View Post
    BTW, wasn't SSE3 a compulsory part of AMD64? If so, then this would presumably go into any AMD64 kernel with no need to check processor flags.
    No, AMD64 implies SSE2.



  • curaga
    replied
    You mean the if (!intel) slow(); path?



  • RealNC
    replied
    Originally posted by Shining Arcanine View Post
    There is a distinction between -march style optimizations, rather than -mtune style optimizations. With -mtune the compiler will generate multiple versions of different code paths so that every CPU gets its own "optimized" version. With -march, the compiler assumes that these instructions are always available.
    No, that's not what -mtune is doing. It does not generate any instructions that would not run on other CPUs. It just applies changes that will work everywhere, but are known to result in faster execution. From the docs:

    "-mtune=cpu-type - Tune to cpu-type everything applicable about the generated code, except for the ABI and the set of available instructions."

    So no multiple code paths or anything. That's what the Intel compiler does. GCC doesn't provide that functionality.
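One way to see the difference RealNC describes is through the macros the compiler predefines. The following is a minimal sketch, assuming GCC or Clang on x86; the function name `built_with_sse3` is made up for illustration. `__SSE3__` is only defined when the compiler is allowed to emit SSE3 instructions, which `-march` (or `-msse3`) controls and `-mtune` does not.

```c
#include <stdbool.h>

/* Hypothetical helper: reports whether this translation unit was compiled
 * with SSE3 code generation enabled (e.g. via -msse3 or a -march value
 * that implies SSE3). -mtune alone does not define __SSE3__. */
bool built_with_sse3(void)
{
#ifdef __SSE3__
    return true;    /* compiler may emit SSE3 instructions anywhere */
#else
    return false;   /* generated code stays within the baseline set */
#endif
}
```

On a typical x86_64 toolchain, building this with `-mtune=core2` should leave the function returning false, while `-march=core2` should make it return true, because only the latter changes the set of available instructions.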



  • Shining Arcanine
    replied
    Originally posted by Dylar View Post
    It's indeed pretty interesting.
    But if I use a prebuilt generic x86_64 kernel provided by my distro, is there a way the kernel could autodetect if my CPU has support for SSE3 at runtime, or do I have to recompile the kernel ?
    You seem to be asking two separate questions at the same time.

    1. Yes, there is a way for the kernel to check for SSE3 support. cat /proc/cpuinfo demonstrates this capability.

    2. There is a distinction between -march style optimizations, rather than -mtune style optimizations. With -mtune the compiler will generate multiple versions of different code paths so that every CPU gets its own "optimized" version. With -march, the compiler assumes that these instructions are always available. To the best of my knowledge, the kernel uses -march. I have not checked the code to verify this, but I am fairly certain that if you configure your kernel compilation to build a kernel for hardware newer than what you have, it breaks when you try to run it on the older hardware, which is consistent with -march. Therefore, the kernel must be explicitly compiled for it.
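The runtime check from point 1 can be sketched in userland C as follows. This is only an illustration, assuming GCC or Clang and their `<cpuid.h>` helper; the function name `cpu_has_sse3` is hypothetical and this is not how the kernel itself is written. SSE3 is reported in CPUID leaf 1, ECX bit 0.

```c
#include <stdbool.h>

#if defined(__x86_64__) || defined(__i386__)
#include <cpuid.h>   /* GCC/Clang wrapper around the CPUID instruction */
#endif

/* Hypothetical helper: returns true when CPUID leaf 1 reports SSE3
 * (ECX bit 0), the feature Linux lists as "pni" in /proc/cpuinfo. */
bool cpu_has_sse3(void)
{
#if defined(__x86_64__) || defined(__i386__)
    unsigned int eax, ebx, ecx, edx;

    if (!__get_cpuid(1, &eax, &ebx, &ecx, &edx))
        return false;                 /* leaf 1 is not available */
    return (ecx & (1u << 0)) != 0;    /* bit 0 of ECX = SSE3 */
#else
    return false;                     /* not an x86 CPU */
#endif
}
```

A binary built for a generic x86_64 target can call a check like this once at startup and then decide which routine to use, without being compiled for SSE3 as a whole.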

    Originally posted by dimko View Post
    cat /proc/cpuinfo |grep sse3

    It's sort of weird, but I don't seem to have SSE3 on my AMD quad core. However, I think the extension was there; it was just called something else for licensing reasons. I wonder what SSE4A is and whether it absorbs SSE3 into itself?
    SSE4A is AMD's variant of Intel's SSE4 extensions.

    Originally posted by dimko View Post
    pni - checked!

    Is it still called PNI on Intel CPU?
    They never renamed it. It would cause newer processors to use code paths meant for much older processors when executing older binaries if they did that.

    They call it pni on AMD cpus for the same reason.
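Since the flag is spelled pni, a plain substring search for "sse3" in /proc/cpuinfo can mislead: it finds nothing on a CPU that has SSE3 but not SSSE3, and it matches "ssse3" on CPUs that have that. A small sketch of a whole-token match, with `has_cpu_flag` being a made-up name for illustration:

```c
#include <stdbool.h>
#include <string.h>

/* Hypothetical helper: returns true if `flag` appears as a whole
 * whitespace-separated token in `line` (e.g. the "flags" line of
 * /proc/cpuinfo), not merely as part of a longer flag name. */
bool has_cpu_flag(const char *line, const char *flag)
{
    size_t len = strlen(flag);
    const char *p = line;

    if (len == 0)
        return false;
    while ((p = strstr(p, flag)) != NULL) {
        bool starts = (p == line) || p[-1] == ' ' || p[-1] == '\t';
        bool ends = p[len] == '\0' || p[len] == ' ' || p[len] == '\n';
        if (starts && ends)
            return true;
        p++;   /* skip partial matches such as "sse3" inside "ssse3" */
    }
    return false;
}
```

With a flags line like `flags : fpu sse sse2 pni ssse3`, looking up `pni` succeeds while looking up `sse3` does not, which matches what dimko saw.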

    Originally posted by Smorg View Post
    memcpy? As in... memcpy from string.h? What does this have to do with the kernel? Is there some system call that's also called memcpy?
    If you write a kernel, you need a way of copying data back and forth between real memory and virtual memory. You will have a problem if you hit a page boundary and the rest of what you are copying does not continue on the next page.

    Also, when writing a kernel, you need to write your own library routines, because libraries specified in the ANSI C specification are meant for userland, not kernels.

    Originally posted by liam View Post
    Do we really want to add more x86 specific code to the kernel?
    Other than that, sounds cool. I had no idea hitting the SSE was so costly. I suppose it makes sense that they were intended for rather larger data sets, but still, it hadn't occurred to me.
    It is not necessarily x86 specific. It is a technique that applies to any CPU that has SSE3-like vector instructions and if they implement it properly, every CPU with such instructions should see a boost.


    Originally posted by movieman View Post
    BTW, wasn't SSE3 a compulsory part of AMD64? If so, then this would presumably go into any AMD64 kernel with no need to check processor flags.
    AMD produced the K8 architecture first and then Intel produced Prescott in response. SSE3 did not exist when the K8 architecture was made, so it was not part of the original x86_64 instruction set.
    Last edited by Shining Arcanine; 16 August 2011, 10:57 PM.



  • DanL
    replied
    Originally posted by movieman View Post
    BTW, wasn't SSE3 a compulsory part of AMD64? If so, then this would presumably go into any AMD64 kernel with no need to check processor flags.
    No. The first AMD64 CPUs didn't have it (130nm and the initial 90nm chips).



  • abral
    replied
    Originally posted by Drago View Post
    You did exactly what on your kernel? Replaced x86 memcopy() implementation with SSE one?
    Sorry, I didn't explain that well. My kernel isn't Linux; it's a kernel I've written from scratch.

    However I wrote four different versions of memcpy (normal, MMX, SSE, SSE2) and used the best according to the CPU support.
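The approach abral describes, several copy routines with the best one picked according to CPU support, could look roughly like the sketch below. It is not abral's code: it is a userland illustration with only two variants, assuming GCC or Clang, and the names `copy_bytes`, `copy_sse2`, `select_copy` and `my_memcpy` are all made up.

```c
#include <stddef.h>

#if defined(__x86_64__)
#include <emmintrin.h>   /* SSE2 intrinsics; SSE2 is baseline on x86_64 */
#endif

typedef void *(*copy_fn)(void *dst, const void *src, size_t n);

/* Portable fallback: plain byte-by-byte copy. */
static void *copy_bytes(void *dst, const void *src, size_t n)
{
    unsigned char *d = dst;
    const unsigned char *s = src;

    while (n--)
        *d++ = *s++;
    return dst;
}

#if defined(__x86_64__)
/* SSE2 variant: 16 bytes per iteration using unaligned loads and stores. */
static void *copy_sse2(void *dst, const void *src, size_t n)
{
    unsigned char *d = dst;
    const unsigned char *s = src;

    while (n >= 16) {
        __m128i v = _mm_loadu_si128((const __m128i *)s);
        _mm_storeu_si128((__m128i *)d, v);
        d += 16;
        s += 16;
        n -= 16;
    }
    copy_bytes(d, s, n);   /* copy the remaining tail, if any */
    return dst;
}
#endif

/* Probe the CPU and return the best routine it supports. */
static copy_fn select_copy(void)
{
#if defined(__x86_64__)
    __builtin_cpu_init();
    if (__builtin_cpu_supports("sse2"))
        return copy_sse2;
#endif
    return copy_bytes;
}

/* Entry point: resolves the implementation on first use, then reuses it. */
void *my_memcpy(void *dst, const void *src, size_t n)
{
    static copy_fn impl;   /* NULL until the first call */

    if (!impl)
        impl = select_copy();
    return impl(dst, src, n);
}
```

Inside a real kernel, the SSE path would also have to save and restore the FPU/SSE register state around the copy (Linux wraps this in kernel_fpu_begin() and kernel_fpu_end()), so it is not free.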

