Rewriting Old Solaris C Code In Python Yielded A 17x Performance Improvement


  • #61
    Originally posted by cybertraveler View Post
    The runtime is not a penalty if you can reasonably expect it to already be installed on the system.
    The runtime isn't a penalty if it will already be in memory. For example, bash is almost always already in memory, but sed is almost always not. But sed is so tiny we don't care. Python, however, is rather large. The cost of a program in the cloud is, at a minimum, anonymous_memory_footprint*execution_duration. So if Python uses 17x as much memory and runs 17x faster, it's a wash. If, however, there's a 100MB anonymous-memory task waiting on it (i.e. a heavy admin script for cPanel), then it's no longer a wash, because the 17x-faster result frees up that 100MB of memory sooner. So it depends on your workload, a lot.
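    To put rough numbers on that footprint*duration cost model, here's a minimal back-of-the-envelope sketch. All figures are invented purely for illustration; only the formula comes from the paragraph above.
    Code:
    /* anonymous_memory_footprint * execution_duration, with hypothetical numbers:
     * a small C tool vs. a Python rewrite using 17x the memory, finishing 17x sooner. */
    #include <stdio.h>

    int main(void)
    {
        double c_footprint_mb  = 2.0,  c_duration_s  = 17.0; /* hypothetical C version      */
        double py_footprint_mb = 34.0, py_duration_s = 1.0;  /* hypothetical Python version */

        printf("C:      %.1f MB*s\n", c_footprint_mb * c_duration_s);   /* 34.0 */
        printf("Python: %.1f MB*s\n", py_footprint_mb * py_duration_s); /* 34.0 - a wash */
        return 0;
    }
    The tradeoff only tips once something else is waiting on that memory, as described above.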
    Last edited by linuxgeex; 20 October 2019, 02:50 AM.



    • #62
      Originally posted by vladpetric View Post
      But there is no need for a prefetch if key and value fit within a cache line (typically 64 bytes or larger). You just access your bucket, and everything's in there.
      If you store the value alongside the key in memory, and the key needs to be searched, then you've just cut the likelihood that the key is in the cache in half. In some cases that may be a good thing; for example, if the key is 32 bytes wide (a secure hash), the odds of getting multiple key matches without a cache miss were probably low to begin with. That decision should be up to the compiler, and it can be informed by PGO. It doesn't need to be a language-specific construct or a developer decision.
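      A minimal sketch of the two bucket layouts under discussion; the struct names and field sizes are my own assumptions (a 64-byte cache line, a short inline key), not anyone's actual hash table.
      Code:
      #include <stdint.h>

      /* Inline layout: key and value share one bucket, so a probe touches a
       * single 64-byte cache line - no prefetch needed, but fewer keys per line. */
      struct bucket_inline {
          uint64_t hash;      /*  8 bytes */
          char     key[40];   /* 40 bytes, key stored in place */
          uint64_t value;     /*  8 bytes -> 56 bytes total, fits in one line */
      };

      /* Out-of-line layout: the bucket holds only a pointer to the key, so more
       * buckets fit per line, but confirming a match may cost a second miss. */
      struct bucket_indirect {
          uint64_t    hash;
          const char *key;    /* points elsewhere in memory */
          uint64_t    value;
      };
      Which layout wins depends on key size and access patterns, which is exactly the kind of decision PGO could in principle feed back to the compiler.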

      PS: I apologise on behalf of Michael that my response to your arch discussion post was "not approved". I'm not sure why the forum software is so prone to blocking responses full of technical language... but I'm sorry that it is biased that way, and nobody else is going to apologise to you for the lack of delivery.
      Last edited by linuxgeex; 20 October 2019, 03:10 AM.



      • #63
        It misses that the Python version probably uses a lot more memory, and for small lists is definitely slower.
        It should say that good list/array/table management will speed up any program in any language.
        However, good scripting languages (like Lua, Perl or maybe Python) can definitely decrease the amount of code needed. And with that, Lua is probably the fastest for small tables, as its compiler is fast. Very, very fast. Perl scripts that I swapped for Lua scripts were usually finished by the time perl had just loaded its interpreter/compiler (as python would too).
        As for Python performance, every time I hear about it, smart people say that Python is fast as long as you use <engine ...> and not CPython. All I know is that Lua runs fast on my ESP8266.
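        One rough way to see that interpreter-startup effect is simply to time an empty program in each language. This is a sketch of my own; the interpreter commands are assumptions, so adjust them to whatever is actually installed.
        Code:
        #include <stdio.h>
        #include <stdlib.h>
        #include <time.h>

        /* Time a full fork/exec of an interpreter running an empty program,
         * so the measurement is dominated by startup cost. */
        static double run(const char *cmd)
        {
            struct timespec t0, t1;
            clock_gettime(CLOCK_MONOTONIC, &t0);
            if (system(cmd) != 0)
                fprintf(stderr, "command failed: %s\n", cmd);
            clock_gettime(CLOCK_MONOTONIC, &t1);
            return (t1.tv_sec - t0.tv_sec) + (t1.tv_nsec - t0.tv_nsec) / 1e9;
        }

        int main(void)
        {
            printf("lua:     %.3f s\n", run("lua -e ''"));
            printf("perl:    %.3f s\n", run("perl -e ''"));
            printf("python3: %.3f s\n", run("python3 -c ''"));
            return 0;
        }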



        • #64
          Originally posted by linuxgeex View Post

          I think you misunderstood everything I said. It had nothing to do with IPC and everything to do with hiding memory latency, which you didn't address. A 192-entry ROB with 4 retire units can hide 48 clocks of latency. With SMT you can hide 96 clocks of latency by having the 2 threads take turns. Meanwhile, other instructions from both threads in the ROB can still be scheduled to execution ports. Scheduler slots for the execution ports extend this even further, but not by much.

          A 19-stage pipeline can hide about 17 cycles of latency if the fetch occurs at decode and the execute happens at the second-to-last stage. Between pipelining, ROB, execution-port scheduler, and SMT, it's possible to hide 110 cycles of latency. That's enough to hide 2 back-to-back cache misses if the fetches were fired off in the first dozen cycles.

          Deeper pipelines don't hurt IPC directly... quite the opposite: they enable ILP, which is the first goal of IPC increases. But you can get too much of a good thing. Deeper pipelines increase trace length and trace quantity, which increases die capacitance, which increases energy per cycle, which brings the core to its thermal limits sooner, which lowers clock rate. A healthy uarch won't have a pipeline much deeper than triple the number of its execute ports, because the thermal tradeoff would sacrifice too much frequency.

          But that won't stop them from increasing the ROB, BP, and SMT to hide more memory latency as a way to push ILP wider and cope with higher clock multipliers in future designs. Knights Landing had SMT4 to help cope with large NUMA latency. Power8/9 have SMT4 to help cope with workloads that create a lot of cache misses (in-memory databases in particular.)

          Yes, SMT decreases per-thread IPC, but it increases per-core IPC by guaranteeing that the core always has something it can be doing, and the wider that core's execution ports get, the more important it becomes to keep them busy to keep up the efficiency of the core as a whole. On the bright side, the OS can choose to schedule only one thread per core on an SMT-capable CPU if absolute minimum latency is required for that thread. On Linux you can pin processes to specific cores, and you can enable and disable SMT on each core individually, so you have full control over those performance/efficiency tradeoffs. Rejoice!


          Everything else being equal, deeper pipelines do hurt IPC directly (and we really don't need to get into circuit effects for that). If you have a deeper frontend pipeline, branch misspeculation cost gets higher. If you have a deeper execution pipeline, then it takes longer to execute back-to-back dependent instructions (essentially you increase the data flow height). If you have a deeper back-end, it takes longer to retire instructions, so you increase back-pressure and prevent the front-end from pushing new instructions onto the re-order buffer. These are all first-order effects (though far from being the only ones) - and all described in H&P btw.

          The instruction window (aka reservation stations, or scheduling window) + re-order buffer + dynamic scheduler is what really hides latency.

          Pipelining is a technique to increase clock, with potentially negative IPC effects (the effects are not equal, obviously).

          The memory latency when going to main memory is in the order of hundreds of cycles (200-500, and that assumes that you don't cross the bus to another NUMA zone, because if that's the case it's much higher). My original point is that you really don't want two of them back-to-back, instead of one. And nothing you're saying contradicts that ...

          Sure, you can increase any structure but large is slow, small is fast. You can add extra pipeline stages to not decrease clock, but then you have IPC problems and increased complexity (see my first point).

          Again, SMT doesn't hide latency. SMT increases throughput if there's something else to be done on the colocated thread. But, the latency of the original thread that misses is still there ... So, how can you claim that it's hidden? Note that this works quite differently within the same thread. When using dynamic scheduling within the same thread, executing instructions from downstream of the load that missed does hide latency (essentially, the future becomes faster).

          Also, the colocated thread may simply be idling (spinning and waiting to service an event). SMT doesn't guarantee anything really in that case.

          Oh yeah, with out-of-order execution, latencies aren't additive. I suggest reading some work on microarchitectural critical path (see Fields' work for instance). All I'm saying here is that you can't really add what you hide.
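          A minimal sketch of that distinction (the function names are invented; this only illustrates dependent vs independent loads, it doesn't benchmark anything):
          Code:
          #include <stddef.h>

          struct node { struct node *next; long payload; };

          /* Dependent loads: the address of the second load comes from the first,
           * so its miss cannot even start until the first miss resolves -
           * the two latencies effectively serialize. */
          long chase(const struct node *a)
          {
              return a->next->payload;
          }

          /* Independent loads: both addresses are known up front, so the core can
           * issue both and overlap their miss latencies within the OoO window. */
          long gather(const long *p, const long *q)
          {
              return *p + *q;
          }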

          Finally, I have a PhD on the topic, and have done research that is also quoted in two Intel patents. My work also includes microarchitectural critical path analysis.



          • #65
            Originally posted by linuxgeex View Post

            The runtime isn't a penalty if it will already be in memory. For example, bash is almost always already in memory, but sed is almost always not. But sed is so tiny we don't care. Python, however, is rather large. The cost of a program in the cloud is, at a minimum, anonymous_memory_footprint*execution_duration. So if Python uses 17x as much memory and runs 17x faster, it's a wash. If, however, there's a 100MB anonymous-memory task waiting on it (i.e. a heavy admin script for cPanel), then it's no longer a wash, because the 17x-faster result frees up that 100MB of memory sooner. So it depends on your workload, a lot.
            Good points about the memory.

            Practically speaking, python3 doesn't seem to be too bad on memory consumption. Better than I expected, anyway. It seems to use about 1.5x the memory of bash (just judging from starting an interpreter of both bash4 and python3). I expect that if you use it with memory consumption in mind, it won't be much worse than an equivalent C program, especially if you consider what you are gaining in terms of development time, readability & maintainability.
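            For anyone who wants to repeat that kind of spot check, here's one Linux-only way to do it (a sketch of my own, not how the comparison above was made): start each interpreter idle, then print its resident set size from /proc.
            Code:
            /* Print the VmRSS line from /proc/<pid>/status for a given PID,
             * e.g. an idle bash or python3 started in another terminal. */
            #include <stdio.h>
            #include <string.h>

            int main(int argc, char **argv)
            {
                char path[64], line[256];
                FILE *f;

                if (argc != 2) {
                    fprintf(stderr, "usage: %s <pid>\n", argv[0]);
                    return 1;
                }
                snprintf(path, sizeof path, "/proc/%s/status", argv[1]);
                f = fopen(path, "r");
                if (!f) { perror(path); return 1; }

                while (fgets(line, sizeof line, f))
                    if (strncmp(line, "VmRSS:", 6) == 0)
                        fputs(line, stdout);   /* e.g. "VmRSS:    9876 kB" */

                fclose(f);
                return 0;
            }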



            • #66
              Originally posted by curfew View Post
              What a load of crap. Python being in core anyway, the runtime size bears no meaning. Besides, the options were always going to be rewriting it in one language or the other.
              The runtime needs to be loaded into memory now, before you can use a basic system utility.



              • #67
                Wrote large set-manipulation tests in Python and C.
                The C app linked to a publicly available C set library.
                Timed from various starting points: from the Python REPL, from the shell, and from inside main after init too.
                C was faster in all cases by large margins. C used less memory by large margins.
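                Not the actual test code, but a sketch of the general shape of such a benchmark on the C side; here the "set" is just a sorted array probed with bsearch, standing in for whatever set library was really used.
                Code:
                #include <stdio.h>
                #include <stdlib.h>
                #include <time.h>

                #define N 1000000

                static int cmp_long(const void *a, const void *b)
                {
                    long x = *(const long *)a, y = *(const long *)b;
                    return (x > y) - (x < y);
                }

                int main(void)
                {
                    long *set = malloc(N * sizeof *set);
                    long hits = 0;
                    struct timespec t0, t1;

                    if (!set) return 1;
                    for (long i = 0; i < N; i++)
                        set[i] = i * 2;                 /* even numbers only, already sorted */

                    clock_gettime(CLOCK_MONOTONIC, &t0);
                    for (long i = 0; i < N; i++) {      /* half of these lookups miss */
                        long key = i;
                        if (bsearch(&key, set, N, sizeof *set, cmp_long))
                            hits++;
                    }
                    clock_gettime(CLOCK_MONOTONIC, &t1);

                    printf("%ld hits, %.3f s\n", hits,
                           (t1.tv_sec - t0.tv_sec) + (t1.tv_nsec - t0.tv_nsec) / 1e9);
                    free(set);
                    return 0;
                }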

                Sane code used sanely will be faster whether it's in C or Python, but Python will always have more overhead than C; sometimes that doesn't matter.

                The premise of promoting the capabilities and good qualities of Python is mired here by an unjustifiable performance comparison aimed at language differences instead of implementation differences.

                edit: Don't care that Python took 3.5s and C took 300ms. Python was easier to write. Don't care if the actual time spent manipulating the set was significant. Would not rewrite in C until runtime exceeded damns given.

                FFS
                Last edited by TylerY86; 20 October 2019, 08:25 PM.



                • #68
                  Originally posted by microcode View Post

                  The runtime needs to be loaded into memory now, before you can use a basic system utility.
                  And this is completely irrelevant.



                  • #69
                    Originally posted by TylerY86 View Post
                    edit: Don't care that Python took 3.5s and C took 300ms. Python was easier to write. Don't care if the actual time spent manipulating the set was significant. Would not rewrite in C until runtime exceeded damns given.

                    FFS
                    "Easier" compiled language might be a compromise there for quick whipout of some code, provided it has libraries you need. Object Pascal, Ada(Core)..? Python is tempting tho..



                    • #70
                      Originally posted by vladpetric View Post
                      Replace C with C++ and I actually agree with you.

                      But C code oftentimes ends up inefficient because of a lack of good libraries. There's no good hash map implementation in the standard library because, well, you can't really write a good, generic one in C. C++ - yeah, totally.
                      Considering that compilers that only do C aren't really a thing anymore, and that pure C is mostly used in limited embedded systems (where Python obviously isn't an option), talking about C++ was implied.

