Improving The Linux Kernel's Memory Performance
Originally posted by RealNC:
No, that's not what -mtune is doing. It does not generate any instructions that would not run on other CPUs. It just applies changes that will work everywhere, but are known to result in faster execution. From the docs:
"-mtune=cpu-type - Tune to cpu-type everything applicable about the generated code, except for the ABI and the set of available instructions."
So no multiple code paths or anything. That's what the Intel compiler does. GCC doesn't provide that functionality.
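To make the point concrete: since the compiler is not producing a second code path for you, you get one by writing the dispatch by hand. Below is a minimal sketch, assuming a reasonably recent GCC or Clang on x86, where the `__builtin_cpu_supports` builtin and the `target` function attribute are available. On any other platform it simply falls back to the generic path. The names `sum_generic`, `sum_sse3`, `pick_sum` and `HAVE_X86_DISPATCH` are made up for illustration and are not from any real project.

```c
#include <assert.h>
#include <stddef.h>

#if defined(__GNUC__) && (defined(__x86_64__) || defined(__i386__))
#include <pmmintrin.h>          /* SSE3 intrinsics such as _mm_hadd_ps */
#define HAVE_X86_DISPATCH 1     /* hypothetical macro, only for this sketch */
#endif

typedef float (*sum_fn)(const float *v, size_t n);

/* Baseline path: plain C, runs on any CPU the build targets. */
static float sum_generic(const float *v, size_t n)
{
    assert(n == 0 || v != NULL);
    float s = 0.0f;
    for (size_t i = 0; i < n; i++)
        s += v[i];
    return s;
}

#ifdef HAVE_X86_DISPATCH
/* SSE3 path: only this one function is compiled with SSE3 enabled,
 * so the rest of the program still runs on CPUs without it. */
__attribute__((target("sse3")))
static float sum_sse3(const float *v, size_t n)
{
    __m128 acc = _mm_setzero_ps();
    size_t i = 0;
    for (; i + 4 <= n; i += 4)
        acc = _mm_add_ps(acc, _mm_loadu_ps(v + i));
    acc = _mm_hadd_ps(acc, acc);    /* horizontal add, an SSE3 instruction */
    acc = _mm_hadd_ps(acc, acc);
    float s = _mm_cvtss_f32(acc);
    for (; i < n; i++)              /* leftover elements */
        s += v[i];
    return s;
}
#endif

/* Pick an implementation at run time based on what the CPU reports. */
sum_fn pick_sum(void)
{
#ifdef HAVE_X86_DISPATCH
    __builtin_cpu_init();
    if (__builtin_cpu_supports("sse3"))
        return sum_sse3;
#endif
    return sum_generic;
}
```

A caller would do `sum_fn f = pick_sum();` once and then call `f(data, count)`. Note that the two paths add the floats in a different order, so results can differ in the last bits for general input.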
Originally posted by whitecat:
Not at all.
http://en.wikipedia.org/wiki/SSE4
AMD currently supports only 4 instructions from the SSE4 instruction set, but have also added four new SSE instructions, naming the group SSE4a. These instructions are not found in Intel's processors supporting SSE4.1 and alternatively AMD processors aren't supporting Intel's SSE4.1. Support was added for SSE4a for unaligned SSE load-operation instructions (which formerly required 16-byte alignment).
Comment
-
Originally posted by Shining Arcanine:
You are right, but my initial point about him having to recompile his kernel is correct. Since GCC is not generating multiple code paths, getting a kernel that uses SSE3 requires compiling it for SSE3.
You don't have to compile the kernel with "sse3" in order to enable (i.e. "use") SSE3 for your program/kernel. I'm not a specialist, but given the purpose of SSE3, I think that if you compile the kernel with SSE3 enabled, only a few lines of code will actually end up as SSE3 instructions. Also, if the devs use hand-written SSE3 optimizations to speed up some functions, you don't have to compile your kernel with SSE3 to get them.
Originally posted by Shining Arcanine:
You say that, but your reference agrees with me:
Intel -> SSE4.1 : 47 instructions implemented in Intel CPU.
Intel -> SSE4.2 : 7 instructions implemented in Intel CPU.
AMD -> SSE4a : 4 "intel" instructions (don't know which ones precisely) + 4 exclusive AMD instructions (not found in Intel implementation).
Originally posted by Shining Arcanine:
It is AMD's version of SSE4.
AMD: 8 instructions (4 "intel" + 4 "amd exclusive")
Intel: 54 instructions
Comment
-
Originally posted by whitecat:
A program that uses SSE3 in order to manipulate data and a program that is compiled with SSE3 are two different things.
You don't have to compile the kernel with "sse3" in order to enable (i.e. "use") SSE3 for your program/kernel. I'm not a specialist, but given the purpose of SSE3, I think that if you compile the kernel with SSE3 enabled, only a few lines of code will actually end up as SSE3 instructions. Also, if the devs use hand-written SSE3 optimizations to speed up some functions, you don't have to compile your kernel with SSE3 to get them.
It is possible to write code for less capable x86 processor families that will automatically detect more capable x86 processor families and adjust the path to make things faster, but that is something that developers would rely on a compiler to do. It is difficult to maintain kernel code if you do your compiler's job for it. The reason for this is that the Linux kernel supports more than just x86. It doesn't make sense for the kernel developers to write hacks that second guess a specific processor family unless they do it in a way that benefits all architectures. I would be surprised if Linus Torvalds committed code that did this while only being relevant to a single architecture.
Originally posted by whitecat:
SSE4 is:
Intel -> SSE4.1 : 47 instructions implemented in Intel CPU.
Intel -> SSE4.2 : 7 instructions implemented in Intel CPU.
AMD -> SSE4a : 4 "intel" instructions (don't know which ones precisely) + 4 exclusive AMD instructions (not found in Intel implementation).
Very light version hence.
AMD: 8 instructions (4 "intel" + 4 "amd exclusive")
Intel: 54 instructions
Last edited by Shining Arcanine; 17 August 2011, 02:15 PM.
Comment
-
Originally posted by Shining Arcanine:
It is possible to write code for less capable x86 processor families that will automatically detect more capable x86 processor families and adjust the path to make things faster, but that is something that developers would rely on a compiler to do. It is difficult to maintain kernel code if you do your compiler's job for it. The reason for this is that the Linux kernel supports more than just x86. It doesn't make sense for the kernel developers to write hacks that second guess a specific processor family unless they do it in a way that benefits all architectures. I would be surprised if Linus Torvalds committed code that did this while only being relevant to a single architecture.
raid6: int64x1 1715 MB/s
raid6: int64x2 2401 MB/s
raid6: int64x4 1701 MB/s
raid6: int64x8 1626 MB/s
raid6: sse2x1 2894 MB/s
raid6: sse2x2 4924 MB/s
raid6: sse2x4 5496 MB/s
raid6: using algorithm sse2x4 (5496 MB/s)
md: raid6 personality registered for level 6
As you can see, the kernel developers did implement code that works only on x86 machines, and they also wrote a mini-benchmark to figure out which version is best for the CPU it is being run on. If the SSE3 version of memcpy proves to be fast enough to justify the effort, they will do the same.
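The "benchmark every variant at startup, then keep the fastest" idea in that dmesg output can be sketched in user space. This is only an illustration under stated assumptions: it is not the kernel's raid6 code, it times a simple XOR routine rather than the RAID6 syndrome calculation, it uses `clock()` where the kernel uses its own timing, and every name in it (`xor_algo`, `xor_bytes`, `xor_words`, `pick_fastest`) is hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>
#include <time.h>

typedef void (*xor_fn)(uint8_t *dst, const uint8_t *src, size_t n);

struct xor_algo {
    const char *name;
    xor_fn fn;
};

/* Variant 1: one byte at a time. */
static void xor_bytes(uint8_t *dst, const uint8_t *src, size_t n)
{
    for (size_t i = 0; i < n; i++)
        dst[i] ^= src[i];
}

/* Variant 2: eight bytes at a time, bytewise for the tail. */
static void xor_words(uint8_t *dst, const uint8_t *src, size_t n)
{
    size_t i = 0;
    for (; i + 8 <= n; i += 8) {
        uint64_t a, b;
        memcpy(&a, dst + i, 8);     /* memcpy avoids alignment problems */
        memcpy(&b, src + i, 8);
        a ^= b;
        memcpy(dst + i, &a, 8);
    }
    for (; i < n; i++)
        dst[i] ^= src[i];
}

static const struct xor_algo algos[] = {
    { "bytes", xor_bytes },
    { "words", xor_words },
};

static volatile uint8_t sink;   /* keeps the timed work observable */

/* Time each variant on the same buffer and return the quickest one. */
const struct xor_algo *pick_fastest(void)
{
    static uint8_t dst[4096], src[4096];
    const struct xor_algo *best = NULL;
    clock_t best_ticks = 0;

    memset(src, 0xA5, sizeof src);
    for (size_t i = 0; i < sizeof algos / sizeof algos[0]; i++) {
        clock_t start = clock();
        for (int rep = 0; rep < 2000; rep++)
            algos[i].fn(dst, src, sizeof dst);
        clock_t ticks = clock() - start;
        sink = dst[0];
        if (best == NULL || ticks < best_ticks) {
            best = &algos[i];
            best_ticks = ticks;
        }
    }
    assert(best != NULL);
    return best;
}
```

The program would call `pick_fastest()` once at startup and route all later work through the returned `fn` pointer. Which variant wins depends on the machine and the compiler flags, which is the whole reason for measuring instead of guessing.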
Comment
-
Originally posted by Shining Arcanine:
I am not a Linux kernel developer, but I have enough experience in software development that I have a decent idea of how the kernel works. If the processor family is not set to an SSE3-capable processor at the time it is compiled, then SSE3 should not be used in the kernel. The only exception to this should be if the compiler could generate multiple code paths, which I thought GCC could do, but RealNC demonstrated that I was wrong in thinking that.
Originally posted by Shining Arcanine:
Whatever it is called by either of us is pointless as it has no effect on how actual CPUs function. I still call it AMD's version, as I understand it to be a derivative extension.
Comment
-
Originally posted by Shining Arcanine:
It is possible to write code for less capable x86 processor families that will automatically detect more capable x86 processor families and adjust the path to make things faster, but that is something that developers would rely on a compiler to do. It is difficult to maintain kernel code if you do your compiler's job for it. The reason for this is that the Linux kernel supports more than just x86. It doesn't make sense for the kernel developers to write hacks that second guess a specific processor family unless they do it in a way that benefits all architectures. I would be surprised if Linus Torvalds committed code that did this while only being relevant to a single architecture.
In fact, I believe the kernel's memcpy function is already written directly in assembly, likely just to ensure that everything was being done in the best way possible for such an important function. While using a compiler is generally better for large chunks of code, humans can often still beat it at optimizing small pieces and that can be important. So simply adding a new SSE3 case to the existing x86 assembly probably isn't a big deal.
Last edited by smitty3268; 17 August 2011, 04:02 PM.
Comment
-
Originally posted by whitecat:
Google tells me SSE4a is not a derivative. Or explain to me how AMD engineers have implemented 54 Intel instructions in only 8.
Comment
-
Originally posted by smitty3268:
SSE4a is not an SSE4 derivative at all. That's like calling 3DNow an SSE derivative - the only difference is that in the former case AMD took the SSE name rather than creating their own. Something Intel wasn't very happy about, IIRC.
SSE was made after 3DNow, while SSE4a was made after Intel published its SSE4 extensions. The instructions provided by SSE and 3DNow do not intersect.
I feel like these points on SSE4a not being an SSE4 derivative are derived from the following rather than any actual technical reason:
Last edited by Shining Arcanine; 18 August 2011, 09:22 AM.
Comment
-
Originally posted by Ansla:
You are wrong about this; a classic example would be software RAID. Do a grep for raid6 in your dmesg (assuming you have software RAID support built in) and you will get something that looks like this:
raid6: int64x1 1715 MB/s
raid6: int64x2 2401 MB/s
raid6: int64x4 1701 MB/s
raid6: int64x8 1626 MB/s
raid6: sse2x1 2894 MB/s
raid6: sse2x2 4924 MB/s
raid6: sse2x4 5496 MB/s
raid6: using algorithm sse2x4 (5496 MB/s)
md: raid6 personality registered for level 6
As you can see, the kernel developers did implement code that works only on x86 machines, and they also wrote a mini-benchmark to figure out which version is best for the CPU it is being run on. If the SSE3 version of memcpy proves to be fast enough to justify the effort, they will do the same.
My original point about having to do recompilation is not wrong though. The compiler won't generate its own SSE3 assembly unless it is told to do it by the build system, so strictly speaking, he would need to recompile his kernel to get SSE3 instructions into areas where the kernel developers did not do this manually.
Last edited by Shining Arcanine; 18 August 2011, 09:17 AM.
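One way to see the compile-time side of this: GCC and Clang define the `__SSE3__` macro only when the build flags permit SSE3 (for example `-msse3`, or a `-march=` value that implies it). A minimal sketch that reports which way a given build went; `built_with_sse3` and `describe_build` are hypothetical helper names for this example only.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>

/* Returns 1 if this translation unit was compiled with SSE3 enabled,
 * 0 otherwise. Decided entirely at compile time by the build flags. */
int built_with_sse3(void)
{
#ifdef __SSE3__
    return 1;
#else
    return 0;
#endif
}

/* Writes a one-line summary into buf; returns what snprintf returns. */
int describe_build(char *buf, size_t len)
{
    assert(buf != NULL && len > 0);
    return snprintf(buf, len, "compiler %s emit SSE3 in this build",
                    built_with_sse3() ? "may" : "will not");
}
```

Building the same file once with `-msse3` and once without flips the result with no change to the source. Hand-written assembly or per-function target attributes are not covered by this check, which matches the "did not do this manually" exception above.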
Comment