Is Assembly Still Relevant To Most Linux Software?

  • #51
    BareMetal OS

    Originally posted by TemplarGR View Post
    This made my day...

    BTW, since when was RollerCoaster Tycoon written in assembly? Any proof?
    Even though I also don't think big projects should be written in ASM, there is a parallel-computing OS written entirely in ASM: BareMetal OS.

    To be honest, it is not much compared to the Linux kernel, but it is a very interesting project!



    • #52
      Originally posted by Obscene_CNN View Post
      I didn't suggest that Windows 8 be written in assembly, just that they abandon the crappy philosophy that ease of development far outweighs all other considerations.
      In the real world, a program has specifications. You take the fastest and cheapest route to meet them. You don't overachieve the specs, because that's worthless. So yes, ease of development far outweighs all other considerations, at least as long as you can still meet your specs that way.
      And to measure that, you use profiling; you don't just say "wow, I could optimize that" at random.

      Originally posted by Obscene_CNN View Post
      Hell, if they would just abandon the use of C++ templates it would cut the size to about a quarter of what it is now.
      OT, but I'm curious: what would you replace templates with, and at what cost?



      • #53
        Originally posted by wargames View Post
        AFAIK compilers don't produce optimized assembly using SIMD instructions, which is the only "corner case" that comes to my mind for using assembly along with writing kernel code that directly interfaces with the hardware.
        I'm 99% sure that they do: look up auto-vectorization in GCC or LLVM. Even Visual Studio (2012) does it, to say nothing of Intel's compilers.

        In fact, if you write a matrix multiply properly, you will find that with -O3 in release mode GCC generates basically the same code that you would have written by hand in assembly.
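
        As a minimal sketch of what "written properly" means here (the function and variable names are made up for illustration), an i-k-j loop order keeps the inner loop unit-stride, which is exactly the pattern GCC's auto-vectorizer handles well:

            /* matmul.c -- build with: gcc -O3 -march=native -c matmul.c */
            void matmul(int n, const float *a, const float *b, float *restrict c)
            {
                for (int i = 0; i < n; i++) {
                    for (int j = 0; j < n; j++)
                        c[i * n + j] = 0.0f;
                    for (int k = 0; k < n; k++) {
                        float aik = a[i * n + k];
                        /* unit-stride over b and c: GCC typically turns this
                           inner loop into SIMD loads, multiplies and stores */
                        for (int j = 0; j < n; j++)
                            c[i * n + j] += aik * b[k * n + j];
                    }
                }
            }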

        Auto-vectorization depends mostly on loop unrolling to work smoothly, and the lower optimization levels don't unroll extensively because of code bloat, so without some flag mining you sometimes cannot get the best performance even out of the best compilers.

        Assembly code cannot keep getting more optimization (while staying maintainable) the way compiled code can and will. If you know you're writing a ray tracer (or something else that really requires a fast executable), you can use PGO to profile your application, and the compiler will inline and optimize your code selectively based on your actual usage. Using LTO makes the inlining decisions easier. It is humanly impossible to trace a huge chunk of code by hand, but a compiler can do it. It may require many MB to track all your variables, but I digress: why should you care whether that happens or not?
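
        For reference, the PGO + LTO workflow mentioned here looks roughly like this with GCC (the flags are real GCC options; the program name and the training run are made up for illustration): build once with instrumentation, exercise the program on a representative workload, then rebuild using the recorded profile:

            gcc -O3 -flto -fprofile-generate *.c -o rt
            ./rt some-representative-scene
            gcc -O3 -flto -fprofile-use *.c -o rt

        The third step is where the compiler uses the recorded execution counts to decide what to inline and how to lay out the hot paths, and -flto lets it make those decisions across translation units.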

        Lastly, I really like the balanced comment about "managed vs native": because most compilers handle high-level code well, managed code has only a minimal performance gap (I would argue it still exists and always will), which makes it suitable for most applications. A lot of tools in Fedora (or Ubuntu) use Python or Perl to update packages, for example, and maybe on a slow machine a full update would take 1 hour instead of 1 hour and 3 minutes if all the scripts were rewritten in the fastest alternative, but hey, Python and Perl are mostly interpreted! I'm also using IntelliJ IDEA, and I only notice a "startup gap" compared with native code; after a minute of usage it just works, like thunder.



        • #54
          Apologies :-(

          Originally posted by oliver View Post
          While not surprising, this little bit is extremely disappointing:


          The disappointing bit, of course, is that it is in Excel, or at least saved to an .xls file. I would have expected that the 'better format' would be an .ods (Open Document Spreadsheet). But then again, while Linaro is a Linux group, for Linux on ARM, I'm pretty sure most just use Windows and Office and only play with Linux via a VM.
          Actually, no. Most of us are real Free Software developers used to working on distros, compilers and other packages. I'm a Debian developer (and Debian Project Leader emeritus), for example.

          The .xls thing was purely a daft mistake on my part. I've been working locally in Gnumeric (my spreadsheet of choice) using its native format, but uploaded in a different format to make it easier for others. I should have picked .ods, yes. Apologies.



          • #55
            Originally posted by ciplogic View Post
            Assembly code cannot keep getting more optimization (while staying maintainable) the way compiled code can and will. If you know you're writing a ray tracer (or something else that really requires a fast executable), you can use PGO to profile your application, and the compiler will inline and optimize your code selectively based on your actual usage. Using LTO makes the inlining decisions easier. It is humanly impossible to trace a huge chunk of code by hand, but a compiler can do it. It may require many MB to track all your variables, but I digress: why should you care whether that happens or not?
            Or you can use perf, which will even color-code the disassembly to show exactly which instruction is the bottleneck,
            and it's way easier to change/rearrange instructions in assembly than to make a compiler understand what it's doing wrong.

            So no, compiled code (in general) will never beat someone who knows what he's doing.
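
            For reference, the perf workflow being described here is roughly the following (record/report/annotate are real perf subcommands; the program and function names are made up for illustration):

                perf record ./raytracer scene-file     # sample where the time goes
                perf report                            # find the hottest functions
                perf annotate hot_function             # per-instruction sample counts over the disassembly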

            Btw, in fact I plan to make a ray tracer.
            So far, from what I understand of the mathematics, it's almost all matrix operations.
            I learned how to use SSE for matrix operations (matrix multiplication), but this kind of matrix math will require a lot more thinking and creativity,
            all for that couple of percent over what a compiler can do. It probably won't be much, but hey.



            • #56
              Originally posted by gens View Post
              Or you can use perf, which will even color-code the disassembly to show exactly which instruction is the bottleneck,
              and it's way easier to change/rearrange instructions in assembly than to make a compiler understand what it's doing wrong.

              So no, compiled code (in general) will never beat someone who knows what he's doing.

              Btw, in fact I plan to make a ray tracer.
              So far, from what I understand of the mathematics, it's almost all matrix operations.
              I learned how to use SSE for matrix operations (matrix multiplication), but this kind of matrix math will require a lot more thinking and creativity,
              all for that couple of percent over what a compiler can do. It probably won't be much, but hey.
              Could you use gcc -S (to export the assembly) and share an instance where you get much better assembly code by hand?



              • #57
                Originally posted by ciplogic View Post
                Could you use gcc -S (to export the assembly) and share an instance where you get much better assembly code by hand?
                You can also do it with "objdump -d" and "perf record" (which uses objdump and libelf),
                but since GCC doesn't care whether registers are used in numerical order, and does more passes in stages, it can become hard for a human to read,
                even with the original source code overlaid on the disassembly.

                Also, a whole program can be over a megabyte, and even parts of it (a single .o) can go over 300k.
                That's a lot of code.


                Mostly you'd use assembly, be it inline or linked in later, to write short functions,
                so the classical approach is best:
                "use it where it is best to use it"

                I'm only bothered by people saying things like "you can't" and "the compiler is better".
                The only thing I'm missing with FASM is a simple debugger; I just need to know where it crashed and the values of the registers/stack.
                When I get less lazy I'll make one; I thought that was hard too, but it's really simple.

                Edit: about the matrix, it has to be SIMD because it's a lot of parallel calculations.
                You can compare with and learn from the compiler, but as everybody says, compilers sometimes produce... slower code (as in the algorithm).
                Last edited by gens; 04 April 2013, 05:46 PM.



                • #58
                  Originally posted by Obscene_CNN View Post
                  Your end user may differ in opinion, especially when they are the ones who have to wait for it and pay for the hardware to store and run it. Take a look at Microsoft's Surface, where half of the flash was eaten up by the base software.

                  Also, a commonly overlooked thing that is more and more important today is power consumption. Memory reads/writes and instruction cycles take energy, and the more of them you need to accomplish a task, the shorter your battery lasts. Power management is a race to get to sleep.
                  I agree with your original post that you can realize significant performance improvement with Assembly or "fast tight code" in general. I also think that code should not be pre-optimized. I believe that code should be clean and readable first. And in most cases, this code will be fast enough. If it's not, then it should be profiled and the slow parts optimized.

                  I'm not sure what you were saying about profiling and small functions. It seems like you were saying that if code is made up of a lot of small functions, profiling wouldn't pinpoint the slowness because the CPU would only spend a small amount of time in each function. Well, there should be a higher-level function that calls these small functions, and profiling would show the CPU spending more time in that higher-level function. Then, if necessary, the clean/readable rules could be broken to optimize that area.

                  Getting back to video drivers, yes, they should be optimized. Written to be clean, readable and working first. Then, again, the slow parts optimized. This gets back to not pre-optimizing.

                  As far as MS products go, I have no idea what they do to generate products with such a large footprint and performance issues. My guess is that they are more concerned with getting to market than getting to market with a good product.



                  • #59
                    Originally posted by gens View Post
                    (...)
                    so the classical approach is best:
                    "use it where it is best to use it"

                    I'm only bothered by people saying things like "you can't" and "the compiler is better".
                    The only thing I'm missing with FASM is a simple debugger; I just need to know where it crashed and the values of the registers/stack.
                    When I get less lazy I'll make one; I thought that was hard too, but it's really simple.

                    Edit: about the matrix, it has to be SIMD because it's a lot of parallel calculations.
                    You can compare with and learn from the compiler, but as everybody says, compilers sometimes produce... slower code (as in the algorithm).
                    I say: you can't write it better than today's compilers do, and if you do, it is only in very small cases where there is a compiler bug or you crafted the code specifically against your compiler. In short: "you can't" and "the compiler is better".

                    Try writing your SIMD in assembly, take a matrix multiply as the function, and see which gives the better assembly code.

                    In the past (like 10 years ago) I wrote a memcpy that ran 2-3x faster than the runtime's (I was using Delphi as a starting point), but later Delphi's and Windows' memcpy were faster than my code, and even a plain loop that copied an integer at a time instead of a byte at a time was really fast enough (as was about 90% of my assembly-optimized code) for all purposes. Today's compilers will generate this memcpy maybe even faster.
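
                    A minimal sketch of the "integer at a time instead of a byte at a time" loop being described (the name is made up for illustration, and it assumes the buffers are suitably aligned for word access):

                        #include <stddef.h>
                        #include <stdint.h>

                        void copy_words(void *dst, const void *src, size_t n)
                        {
                            uint32_t *d = dst;
                            const uint32_t *s = src;
                            while (n >= sizeof(uint32_t)) {    /* copy a word per iteration */
                                *d++ = *s++;
                                n -= sizeof(uint32_t);
                            }
                            unsigned char *db = (unsigned char *)d;
                            const unsigned char *sb = (const unsigned char *)s;
                            while (n--)                        /* finish the remaining bytes */
                                *db++ = *sb++;
                        }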

                    About SIMD: yes, most compilers in release mode generate SIMD code if it is safe to do so. Even Mono does it (using Mono.SIMD)! And Mono is not a high-end compiler. As the code size grows, it is very unlikely that a normal user, even with 4-5 years of C++ experience, can write better code in assembly. He will either break the CPU's out-of-order pipeline, or forget to move a loop-invariant variable out of a loop, or not think to keep the working data in the L1 cache.

                    One last item: your assembly will always remain the performance baseline of your code. By that I mean: if you write optimized low-level code instead of using a library, your code will be exactly as fast or as slow as written, but you will not benefit when a new Java JIT optimization targets your Java code, or a new LLVM register allocator appears, or a new GCC LTO inlines your function aggressively, or GCC removes a parameter by cloning your function and specializing it for the constant you call it with. When you drop to assembly, you give up the chance that tomorrow your code will run better (if it doesn't already).

                    Jake2 was running better than the original C code (and I think it still does). It was already faster under Java 5, but today Java 7 (and soon 8) has better GC throughput and escape analysis, so the Jake2 benchmarks that showed Java beating the C code could show an even wider difference. And Quake2 was optimized in places with assembly and built with the C compilers of its time.



                    • #60
                      Originally posted by ciplogic View Post
                      I say: you can't write it better than today's compilers do, and if you do, it is only in very small cases where there is a compiler bug or you crafted the code specifically against your compiler. In short: "you can't" and "the compiler is better".

                      Try writing your SIMD in assembly, take a matrix multiply as the function, and see which gives the better assembly code.

                      In the past (like 10 years ago) I wrote a memcpy that ran 2-3x faster than the runtime's (I was using Delphi as a starting point), but later Delphi's and Windows' memcpy were faster than my code, and even a plain loop that copied an integer at a time instead of a byte at a time was really fast enough (as was about 90% of my assembly-optimized code) for all purposes. Today's compilers will generate this memcpy maybe even faster.

                      About SIMD: yes, most compilers in release mode generate SIMD code if it is safe to do so. Even Mono does it (using Mono.SIMD)! And Mono is not a high-end compiler. As the code size grows, it is very unlikely that a normal user, even with 4-5 years of C++ experience, can write better code in assembly. He will either break the CPU's out-of-order pipeline, or forget to move a loop-invariant variable out of a loop, or not think to keep the working data in the L1 cache.

                      One last item: your assembly will always remain the performance baseline of your code. By that I mean: if you write optimized low-level code instead of using a library, your code will be exactly as fast or as slow as written, but you will not benefit when a new Java JIT optimization targets your Java code, or a new LLVM register allocator appears, or a new GCC LTO inlines your function aggressively, or GCC removes a parameter by cloning your function and specializing it for the constant you call it with. When you drop to assembly, you give up the chance that tomorrow your code will run better (if it doesn't already).

                      Jake2 was running better than the original C code (and I think it still does). It was already faster under Java 5, but today Java 7 (and soon 8) has better GC throughput and escape analysis, so the Jake2 benchmarks that showed Java beating the C code could show an even wider difference. And Quake2 was optimized in places with assembly and built with the C compilers of its time.
                      I started this hobby back in the days of, I guess, GCC 4.6.something.
                      GCC has changed since then and I haven't looked at disassemblies in a while.

                      Well, anyway:
                      memcpy in GCC is a builtin, meaning it will just drop in a template function.
                      If you don't tell it to use SSE/AVX, that function will be something like
                      "rep movsb", or to copy more at a time "rep movsd", or in 64-bit "rep movsq".
                      The amd64 C calling convention is good for this, as it puts the pointers and the counter in the right places.
                      (Actually three lines; a label, "rep movsb" and "ret" should be a working memcpy, for those that don't know.)
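
                      As a hedged sketch of that idea from C, using GCC extended inline assembly (the function name is made up; note that in a stand-alone assembly version one extra instruction is needed, because the SysV ABI passes the byte count in rdx while rep movsb counts down rcx):

                          #include <stddef.h>

                          static void *movsb_memcpy(void *dst, const void *src, size_t n)
                          {
                              void *ret = dst;
                              /* rep movsb copies rcx bytes from [rsi] to [rdi] */
                              __asm__ volatile ("rep movsb"
                                                : "+D" (dst), "+S" (src), "+c" (n)
                                                :
                                                : "memory");
                              return ret;
                          }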

                      If you tell it to use SSE, I don't know if it will, and if it does it's probably from a template.
                      If you use "-fno-builtin" as documented here, it could do worse,
                      meaning that for functions that are not popular, the compiler has to think for itself and will probably do worse if they are not simple.

                      One example is here, where the author asked for help and ended up with code twice as fast as the Fortran version.

                      About the matrix multiply:
                      I did write a 3x3 matrix by 1x3 matrix multiply, albeit with intrinsics.
                      The problem was that SSE processes 4 floats at a time and there are lots of 3x3 matrices.
                      My solution was to load an extra number from the next matrix (with shuffles) and do 4 matrices per loop iteration (4x3 = 12, which is divisible by 4, giving 3 steps).
                      I don't know how a compiler could come up with this solution, especially since it doesn't know that that loop will process thousands of matrices.
                      The funny thing is I had a lot more problems with the pointers in C++ (I'm bad at C++), and then the hard drive failed.
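
                      (For illustration only, and deliberately not gens's shuffle-based layout: another way to keep all four SSE lanes busy on 3x3 work is to batch four matrix-vector products per iteration by storing the elements in structure-of-arrays form. All names below are made up.)

                          #include <xmmintrin.h>

                          /* m[r][c][i] holds element (r,c) of matrix i; x/y/z and rx/ry/rz hold the
                           * input and output vectors, one float per matrix; n is a multiple of 4 */
                          typedef struct {
                              float *m[3][3];
                              float *x, *y, *z;
                              float *rx, *ry, *rz;
                          } Mat3Batch;

                          static void mat3_mulvec_batch(const Mat3Batch *b, int n)
                          {
                              for (int i = 0; i < n; i += 4) {
                                  __m128 x = _mm_loadu_ps(b->x + i);
                                  __m128 y = _mm_loadu_ps(b->y + i);
                                  __m128 z = _mm_loadu_ps(b->z + i);
                                  for (int r = 0; r < 3; r++) {
                                      /* row r of four different matrices: four dot products at once */
                                      __m128 acc = _mm_mul_ps(_mm_loadu_ps(b->m[r][0] + i), x);
                                      acc = _mm_add_ps(acc, _mm_mul_ps(_mm_loadu_ps(b->m[r][1] + i), y));
                                      acc = _mm_add_ps(acc, _mm_mul_ps(_mm_loadu_ps(b->m[r][2] + i), z));
                                      float *out = (r == 0) ? b->rx : (r == 1) ? b->ry : b->rz;
                                      _mm_storeu_ps(out + i, acc);
                                  }
                              }
                          }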

                      Also, about the cache:
                      It's true that a compiler respects cache lines and the L2 cache size, but it also emits a long, unrolled stretch of machine code.
                      Also, there is no way to specify the cache sizes (-mcpu doesn't help, since it's a flag for an architecture, and one architecture can come with different cache sizes).

                      I think I can, and in at least one case I did.
                      It took me longer than it would have in a higher-level language, but that loop was running 5% of total CPU time and I had nothing better to do.
