Rewriting Old Solaris C Code In Python Yielded A 17x Performance Improvement


  • #31
    Originally posted by ermo View Post
    Of course you could do it faster in *insert pet compiled language here* if you needed to; the real win was clearly that Python has a built-in set operation that made the logic trivial to implement and maintain going forward.

    That the new implementation showed no run-time performance deficit for sysadmin purposes was merely an unexpected -- but no less welcome -- bonus.
    For sysadmin purposes, shell scripts were considered fast enough for huge amounts of standard functionality... need I say more than SYSV Init? Speed was never a huge sysadmin goal as far as programming was concerned. If it was, then the task was handed off to a ... programmer. ;-)
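The point about built-in set operations can be made concrete. Here is a minimal sketch (package names hypothetical, not the article's actual code) of the kind of membership logic that is tedious with hand-rolled C linked lists but trivial with Python sets:

```python
# Hypothetical example: set difference and intersection, the kind of logic
# the rewritten tool needed. With Python sets these are one-liners.
installed = {"pkg-a", "pkg-b", "pkg-c"}
manifest = {"pkg-b", "pkg-c", "pkg-d"}

missing = manifest - installed   # in the manifest but not installed
extra = installed - manifest     # installed but not in the manifest
common = installed & manifest    # present in both

print(sorted(missing))  # ['pkg-d']
print(sorted(extra))    # ['pkg-a']
```

Each of these operations would be a hand-written loop (or three) over linked lists in the original C.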

    Comment


    • #32
      Originally posted by F.Ultra View Post

Well, it also depends on the size of the object stored. Even if we don't consider the cost of copying memory around, we also face the problem of fitting the set itself in the cache so we are not hit by cache misses when traversing the set. Now of course we can hopefully trust that the CPU will prefetch in a sane manner here (which is no certainty, especially if the lookup is not linear). Sometimes I miss the simpler days when I could remember the exact cycle count for each instruction and just look at two pieces of code and immediately see which one was faster...
These days, with the depth of pipelining and the reorder buffers of modern processors, plus SMT, back-to-back cache misses are not a big deal at all, unless it's several back-to-back misses, in which case something is terribly wrong with the way the code was written. Hash tables should be built so that the first branch target is the most-taken one, so that its mere presence in the first position can be taken as a hint that it should be tried first when cached stats for that branch location don't already exist.

If the branch index has some dense nodes and the order of index values is unpredictable by the branch predictor (profile-guided optimisation mode), then jump tables become preferable even though they waste a lot of cache. Why? Because the $IC waste for a lot of branches exceeds the $DC waste, and it's easier to mitigate $DC misses than $IC misses. $IC misses always (eventually) result in a PC hazard, which means the pipeline has been starved to a halt. Luckily, with SMT, even in that worst-case scenario the core remains busy with the other thread. As long as that condition alternates between threads, it's unlikely that both pipelines will starve, and that's the real beauty of SMT for code with random memory access... it hides a ton of memory latency. It effectively doubles the effectiveness of your BP+ROB+pipelining, cutting your effective memory latency in half. So don't worry about 2 or 3 back-to-back branch prediction misses if you have IVB+. Worry about them on short pipelines with a small ROB and no SMT, like the A53.

      This takes me back to GCC 1.0 architecture discussions in USENET lol. Years before SMT. Crap I am getting old.
      Last edited by linuxgeex; 18 October 2019, 08:44 PM.

      Comment


      • #33
        Originally posted by linuxgeex View Post
        Java's RAD days are long gone.
        The JRE is so massive that compiling takes far longer than C, even when JITting with jar files.
Java's RAD model is different from any desktop model; it is based in the back end, not in the front end.

At compile time, Java compiles what it can infer statically; then it can JIT-compile at run time to continue optimising the application (or the parts of it that couldn't be handled at compile time). This process is slower, of course, but in a lot of cases it pays off in speed, at the cost of a longer compilation process and more memory.


        Originally posted by linuxgeex View Post
        That doesn't stop you from throwing 99% of the JRE in the garbage and using the true core language in a much lighter manner (like Google did by creating Dalvik), but then you discard the entire community around the language.
There are several compilation phases: compilation and inference at what we call compile time, then JIT compiling at run time or during warm-up. Because of all that it takes longer, but it can be useful for "almost abstract" types, variables that most of the time store some known storage type, and so on.

On mobile devices, I don't think Android is a good model to follow: too much memory, too much CPU, too slow.

        Originally posted by linuxgeex View Post
        And in the case of Google, they got sued for re-implementing the core APIs, and some idiot judges actually sided with Oracle because they don't understand the concept of an open specification. And they don't understand the costs of implementation. Using the same API isn't derivative in any way. The work to precisely match an existing API actually costs significantly more than creating your own from scratch. It was an extra cost to Google, in the spirit of cooperation, and the morons are looking only at the small body of the interface definition files and claiming that since those were "copied wholesale" therefore the work is derivative. Each of those judges should be sentenced to 1 year of systems programming with zero chance for parole, and then revise their findings.
I don't agree with what Google did.
Microsoft did much the same, with one slight difference: Sun licensed Java on Windows for free. Sun won in court, but it was determined that no fees or damages were owed (due to the big mistake Sun made when it licensed Java for free... though the court failed to acknowledge that the JVM is something separate, and proprietary... and C# is still out there without any fines).

Google tried the same, but Sun never licensed the JVM technology for free on any other OS. So Google continues to infringe the Java technology owner's licence, and profits big from it in Android.

You can run Scala and other languages on the Sun JVM; they compile for the JVM but are different languages. Something a lot of people fail to acknowledge, a lot of the time, is that the JVM is something completely different from the Java language itself.

Sun invested a lot in optimisations across many architectures, and of course that technology is proprietary; it is not licensed for free. So Google should have been sued big for stealing Sun's technology, and Microsoft as well for using its VM for C#.
        Last edited by tuxd3v; 18 October 2019, 08:56 PM.

        Comment


        • #34
The original Oracle article is full of bullshit. He didn't rewrite C code in Python; he replaced C linked lists with C sets (I'm pretty sure Python's "native" sets are implemented in C).
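For what it's worth, the gap between the two data structures is easy to observe from Python itself. A small timing sketch (numbers will vary by machine) comparing a worst-case list lookup, which scans element by element like a linked-list walk, against a set lookup:

```python
import timeit

n = 20000
as_list = list(range(n))  # scanned element by element, like the C linked list: O(n)
as_set = set(as_list)     # C hash table under the hood: O(1) average

# Look up the worst-case element (the last one) repeatedly.
list_time = timeit.timeit(lambda: (n - 1) in as_list, number=200)
set_time = timeit.timeit(lambda: (n - 1) in as_set, number=200)
print(f"list: {list_time:.4f}s  set: {set_time:.6f}s")
```

The set lookup wins by orders of magnitude as `n` grows, which is the whole story behind the headline speedup.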

          Comment


          • #35
            Originally posted by jacob View Post
            In C he would have used some kind of tree library and he would have been skinned alive because that's - oh my god - a dependency!
The only joke here is your poor understanding of software development. The command, or "app", does nothing more than read through one or two text files and print the information in a different format. This kind of tiny app should be standalone, because it is just dumb to have dependencies for something so trivial. Hence it is actually better implemented in a scripting language than in barebones C, because otherwise you're going to end up writing your own list abstractions to cut down on dependencies.

It's hard for me to imagine where the original performance deficit came from, such that it could be improved at all, regardless of the choice of language. Maybe the original author was trying to use as little RAM as possible (considering 1988 equipment) and had to read through the files multiple times, resulting in excess disk usage. I don't think the performance was found in the code but in the I/O.

            Comment


            • #36
              Originally posted by vladpetric View Post
One shouldn't make an overly-general statement, that if C is slower, then something must be wrong with the code. Performance happens to be a really complex topic.
Probably not, but it's rather unlikely to be anything except a developer mistake (either in the code or in the compiler settings). It's about as safe a bet as a pool of oil showing up under your car while you're at work being an oil leak. Sure, it could be a prankster, but the first thing to do is still going to be having a look, or getting a mechanic to do that if you don't service your own car.

              Comment


              • #37
                Originally posted by F.Ultra View Post

Well, it also depends on the size of the object stored. Even if we don't consider the cost of copying memory around, we also face the problem of fitting the set itself in the cache so we are not hit by cache misses when traversing the set. Now of course we can hopefully trust that the CPU will prefetch in a sane manner here (which is no certainty, especially if the lookup is not linear). Sometimes I miss the simpler days when I could remember the exact cycle count for each instruction and just look at two pieces of code and immediately see which one was faster...
                Ok, so to be clear - you want to have the option of whether to inline the data or not, not be forced to use a pointer all the time. Sometimes it's ok to use a pointer, but many times it is not. As for copying data - stores are really cheap, if there's no load that needs their data immediately afterwards. They go to a write buffer, and get retired even before their data reaches the L1 cache. Yes, it is possible to fill up that buffer ... so there are obviously limits here. But again, you really want the option to inline or not to inline the data.

                With respect to prefetching - I can assure you that no modern processor has a prefetch system that can handle that (and btw, feel free to check my research work on the topic as well). Essentially prefetching is done linearly (usually in the L2 cache) - so it works for contiguous data, not linked stuff, or via prefetch instructions. And prefetch instructions won't work here. For effective prefetching you need to know the address a good number of cycles before you actually need the data. There's no point in having a prefetch instruction followed by the load in the same basic block.

                Feel free to prove me otherwise - show me a processor that has a prefetching system capable of handling this. BTW, there are research proposals to deal with them. Problem is that they're not too good (in terms of expected speedups), really expensive in terms of chip complexity, and/or require profiling. So - nobody implemented them.

Out-of-order superscalar execution is an absolute necessity for good speed these days. The good old days, well, they weren't that good. I'd seriously (and respectfully) suggest you get up to date on processor microarchitecture. I honestly think you'll have fun with that. I for one find a lot of engineering beauty in modern processor design.

                Comment


                • #38
                  Originally posted by L_A_G View Post

Probably not, but it's rather unlikely to be anything except a developer mistake (either in the code or in the compiler settings). It's about as safe a bet as a pool of oil showing up under your car while you're at work being an oil leak. Sure, it could be a prankster, but the first thing to do is still going to be having a look, or getting a mechanic to do that if you don't service your own car.
                  Replace C with C++ and I actually agree with you.

                  But C code oftentimes ends up inefficient because of lack of good libraries. There's no good hash map implementation in the standard library, because, well, you can't really write a good, generic one, in C. C++ - yeah, totally.

                  Comment


                  • #39
                    Originally posted by linuxgeex View Post
                    Java's RAD days are long gone. The JRE is so massive that compiling takes far longer than C, even when JITting with jar files.
That's literally impossible, at least for anything of a significant size. Java compiles to bytecode, while C compiles to native instructions. The former is far simpler, and that's ignoring things like compiler optimizations (which obviously you don't do with Java, since you leave them to the JVM) and preprocessing/header files (which are slow).

                    As many people already said, all this shows is that algorithms + data structures are more important than programming languages. The minute I saw that the code was implementing sets using linked lists, I knew it would be slow. A red black tree would be fine, a hash table would be optimal.
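The three options named above can be sketched side by side in Python (using a sorted list with binary search as a stand-in for a balanced tree, since the standard library has no red-black tree):

```python
import bisect

data = list(range(100000))

# Linked list / unsorted scan: O(n) per membership test.
def in_list(x):
    return x in data

# Balanced-tree stand-in: binary search over a sorted list, O(log n).
def in_sorted(x):
    i = bisect.bisect_left(data, x)
    return i < len(data) and data[i] == x

# Hash table: Python set, O(1) on average.
data_set = set(data)
def in_set(x):
    return x in data_set

assert in_list(99999) and in_sorted(99999) and in_set(99999)
assert not in_sorted(100000) and 100000 not in data_set
```

All three return the same answers; only the cost per lookup differs, which is exactly why the data structure choice dominated the language choice here.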

                    Originally posted by pal666 View Post
The original Oracle article is full of bullshit. He didn't rewrite C code in Python; he replaced C linked lists with C sets (I'm pretty sure Python's "native" sets are implemented in C).
They are, but that's not really meaningful to say. Their implementation is 2500 lines of C code (minus some Python-specific stuff). It makes a lot more sense to use Python than to write your own implementation of sets in C... For anyone who is curious, Python sets are implemented using hash tables with a hybrid of linear and randomized probing. Apparently they got the algorithm from Knuth's books (which reminds me that I need to read those).
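As an illustration of the general idea only, here is a toy open-addressing table with plain linear probing and no resizing; CPython's actual set uses a perturb-based probe sequence and grows dynamically, so this is a sketch of the concept, not of the real implementation:

```python
# Toy open-addressing hash table with linear probing (illustrative only).
# No resizing: capacity is fixed, so don't fill it up.
class TinySet:
    def __init__(self, size=8):
        self.slots = [None] * size

    def add(self, key):
        i = hash(key) % len(self.slots)
        # Probe linearly until we find an empty slot or the key itself.
        while self.slots[i] is not None and self.slots[i] != key:
            i = (i + 1) % len(self.slots)
        self.slots[i] = key

    def __contains__(self, key):
        i = hash(key) % len(self.slots)
        # An empty slot means the key was never inserted.
        while self.slots[i] is not None:
            if self.slots[i] == key:
                return True
            i = (i + 1) % len(self.slots)
        return False

s = TinySet()
s.add("a")
s.add("b")
assert "a" in s and "b" in s and "c" not in s
```

The real table additionally mixes high hash bits into the probe sequence to break up clustering, which is the "randomized" half of the hybrid mentioned above.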
                    Last edited by cynical; 18 October 2019, 10:28 PM.

                    Comment


                    • #40
                      "instead of just fixing the program, we completely rewrote it... in an interpreted language with a runtime dramatically larger than the program itself."

                      Comment
