Rewriting Old Solaris C Code In Python Yielded A 17x Performance Improvement


  • #41
    Originally posted by linuxgeex View Post

    These days, with the depth of pipelining and the reorder buffers of modern processors and SMT, back-to-back cache misses are not a big deal at all - unless it's several back-to-back misses, in which case something is terribly wrong with the way it's been coded. Hash tables should be built so that the first branch target is the most-taken one, so that its mere existence in the first position can be taken as a hint that it should be the first one tried when cached stats for that branch location don't already exist.

    This takes me back to GCC 1.0 architecture discussions on USENET lol. Years before SMT. Crap, I am getting old.
    You're wrong.

    First of all, the depth of pipelining doesn't help at all with IPC. Actually, it can make things worse (everything in the pipeline needs to be kept somewhere, and the deeper the pipeline, the more record keeping you need). Maybe the size of the re-order buffer can help. Still, given that it takes hundreds of cycles to get to main memory, you really don't wanna go to main memory twice, in series (the second load depends on the first, so they can't be parallelized). Even the largest ROBs can't compensate for that.
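
    To see why, here's a minimal sketch of serialized (dependent) loads - the list type and loop are illustrative, not from any code discussed here:

    // Each `next` pointer is only known after the previous load completes,
    // so consecutive cache misses cost roughly one full memory round-trip
    // each, back to back - no reorder buffer can overlap them.
    struct Node {
        int   value;
        Node* next;
    };

    int sum_list(const Node* n) {
        int s = 0;
        while (n) {
            s += n->value;  // this load depends on...
            n = n->next;    // ...the pointer loaded in the previous iteration
        }
        return s;
    }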

    Second, the branch mispredict cost is far smaller than going all the way to memory. Won't matter here.

    Third - SMT is irrelevant here - it's a multi-threaded throughput technique, not a single-threaded latency thing. Actually, it can make single-threaded latency worse.

    Please, read:

    1. https://people.freebsd.org/~lstewart.../cpumemory.pdf

    2. some Hennessy and Patterson, a recent edition

    Most likely, the last time you looked at these issues, the latency of main memory was much, much lower than it is right now. Indeed, you mention gcc 1.0, and yeah, in 1987 we barely had L1 caches. There's a reason we have L3 and even L4 caches these days: the gap between processor speed and memory speed has grown considerably. But you haven't been keeping up.

    Comment


    • #42
      Originally posted by vladpetric View Post

      Ok, so to be clear - you want to have the option of whether to inline the data or not, not be forced to use a pointer all the time. Sometimes it's ok to use a pointer, but many times it is not. As for copying data - stores are really cheap, if there's no load that needs their data immediately afterwards. They go to a write buffer, and get retired even before their data reaches the L1 cache. Yes, it is possible to fill up that buffer ... so there are obviously limits here. But again, you really want the option to inline or not to inline the data.

      With respect to prefetching - I can assure you that no modern processor has a prefetch system that can handle that (and btw, feel free to check my research work on the topic as well). Essentially, prefetching is done either linearly (usually in the L2 cache) - so it works for contiguous data, not linked structures - or via prefetch instructions. And prefetch instructions won't work here: for effective prefetching you need to know the address a good number of cycles before you actually need the data, and there's no point in a prefetch instruction followed by the load in the same basic block.

      Feel free to prove otherwise - show me a processor with a prefetching system capable of handling this. BTW, there are research proposals to deal with these cases. Problem is, they're not too good (in terms of expected speedups), really expensive in terms of chip complexity, and/or require profiling. So - nobody implemented them.
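
      A minimal sketch of the contrast, assuming GCC/Clang's __builtin_prefetch (the loop and the prefetch distance of 16 are illustrative):

      #include <cstddef>

      // Prefetching works when the address is computable well before the
      // data is needed - e.g. a contiguous array, where &a[i + 16] is
      // known many iterations ahead of its use.
      long sum_array(const int* a, std::size_t n) {
          long s = 0;
          for (std::size_t i = 0; i < n; ++i) {
              if (i + 16 < n)
                  __builtin_prefetch(&a[i + 16]);  // issued far ahead of use
              s += a[i];
          }
          return s;
      }

      // For linked data, the next address is only known once the current
      // load finishes, so a prefetch right before the dependent load in
      // the same basic block hides essentially nothing:
      //   __builtin_prefetch(n->next);
      //   n = n->next;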

      Out-of-order superscalar execution is an absolute necessity for good speed these days. The good old days, well, they weren't that good. I'd seriously (and respectfully) suggest you get up to date on processor microarchitecture. I honestly think you'll have fun with it. I for one find a lot of engineering beauty in modern processor design.
      I think you misunderstood what I wrote, since I was not talking about prefetching for linked lists but about your C++-style inline sets!

      Comment


      • #43
        Originally posted by F.Ultra View Post

        I think you misunderstood what I wrote, since I was not talking about prefetching for linked lists but about your C++-style inline sets!
        But there is no need for a prefetch if key and value fit within a cache line (typically 64 bytes or larger). You just access your bucket, and everything's in there.
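
        A minimal sketch of that layout (the field types and the 64-byte line size are assumptions for illustration):

        #include <cstdint>

        // If the whole bucket fits in one 64-byte cache line, probing it
        // costs at most one miss: key and value arrive together, with no
        // second, dependent trip to memory.
        struct Bucket {
            std::uint64_t key;
            std::uint64_t value;
            std::uint8_t  state;  // empty / occupied / tombstone
        };

        static_assert(sizeof(Bucket) <= 64,
                      "bucket should fit in one cache line");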

        Comment


        • #44
          Originally posted by vladpetric View Post

          But there is no need for a prefetch if key and value fit within a cache line (typically 64 bytes or larger). You just access your bucket, and everything's in there.
          Buckets?

          For unordered_map, maybe, but for this specific case a normal RB-tree map might be better, as it gives sorting for free.

          But would a B-tree be better than both? It is still sorted, but has many child nodes per branch node, so it doesn't access as many different memory locations during traversal - see the sketch below.
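
          To make the trade-off concrete, a rough sketch of the two node layouts (the field choices and the fan-out are illustrative):

          #include <cstdint>

          // An RB-tree node holds a single key, so each level of a lookup
          // is a separate pointer dereference - roughly one cache line
          // touched per level.
          struct RBNode {
              std::uint64_t key;
              RBNode*       left;
              RBNode*       right;
              RBNode*       parent;
              bool          red;
          };

          // A B-tree node packs many sorted keys contiguously, so a single
          // memory access covers several comparisons and the tree needs
          // far fewer levels.
          struct BTreeNode {
              static const int kFanout = 8;       // illustrative fan-out
              int              count;             // keys currently in use
              std::uint64_t    keys[kFanout - 1]; // scanned linearly or by binary search
              BTreeNode*       children[kFanout];
          };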

          Comment


          • #45
            Originally posted by DoMiNeLa10 View Post

            How much of a speedup would you get if you used a modern compiler and tweaked it with basic settings such as -Os?
            -Os is generally terrible for performance compared with -O2 or -O3.

            Comment


            • #46
              Originally posted by microcode View Post
              "instead of just fixing the program, we completely rewrote it... in an interpreted language with a runtime dramatically larger than the program itself."
              What a load of crap. Python is in core anyway, so the runtime size means nothing. Besides, the options were always going to be rewriting it in one language or another.

              Comment


              • #47
                Same old discussion: why use a UNIX shell, and why not use Python (whichever version), for writing boot scripts or services. Python pulls in too many dependencies and libraries to be usable there, and memory or any other resource usage is, or can be, significantly higher than with a classic solution like an sh shell.

                A compiled version written in C is a single executable, and it also uses proper algorithms designed for the purpose. In the '90s and earlier, C CGI was a thing, meaning it was the only acceptable way to solve critical problems on devices with little memory or CPU.

                If using Python, why not JavaScript?

                Why doesn't someone have some fun and replace all shell scripts with JavaScript scripts, or even better, rewrite the whole of SystemD in JavaScript?
                JavaScript everywhere!

                Comment


                • #48
                  Originally posted by smitty3268 View Post

                  -Os is generally terrible for performance compared with -O2, or -O3.
                  In the case of tight loops it might be better, as you want to keep all of your code inside the instruction cache. The point is that these optimizations are applicable to all compatible hardware, unlike compiling for a specific microarchitecture, which requires a specific instruction set that's not the base standard.
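
                  As a concrete sketch (the flags are standard GCC/Clang options; the file name and loop are made up for illustration):

                  //   g++ -O2 hot.cpp   # optimize for speed; the usual default
                  //   g++ -O3 hot.cpp   # aggressive inlining/vectorization -> more code
                  //   g++ -Os hot.cpp   # optimize for size; can win when the extra
                  //                     # code pushes a hot loop out of the i-cache
                  #include <cstddef>
                  #include <cstdint>

                  // A tight loop where total code footprint can matter more
                  // than per-iteration cleverness.
                  std::uint64_t sum(const std::uint32_t* a, std::size_t n) {
                      std::uint64_t s = 0;
                      for (std::size_t i = 0; i < n; ++i)
                          s += a[i];
                      return s;
                  }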

                  Comment


                  • #49
                    Critical parts should remain written in C, Rust, D, or another language compiled to native code.

                    Comment


                    • #50
                      Totally bollocks implication in that statement.
                      It's the equivalent of saying that a 3-foot plank is taller than a 1-foot brick wall.

                      Comment
