Rewriting Old Solaris C Code In Python Yielded A 17x Performance Improvement
-
Originally posted by microcode View Post
"instead of just fixing the program, we completely rewrote it... in an interpreted language with a runtime dramatically larger than the program itself."
In my experience, Python comes pre-installed on most GNU/Linux distros I try or use. It's more rare for me to find PHP, a Java runtime, Ruby or even Perl pre-installed.
I'm not sure if distro makers are deliberately installing CPython or if they just often ship apps which happen to depend on it. I know that in the case of the Ubuntu & Gnome projects, they both like and encourage Python use.
-
I look forward to the Linux kernel and Google Chrome getting converted to Python!
Articles like this are nothing more than click-bait for fanbois. If a particular algorithm is implemented inefficiently, then it will be slow - who woulda thunk?
Ask yourself this: What software running on your computer has to be extremely fast and extremely reliable? What programming language is used for that software?
Answers: the kernel, C
There you have it.
-
Originally posted by DoMiNeLa10 View Post
How much of a speedup would you get if you used a modern compiler and tweaked it with basic settings such as -Os?
As someone wrote here, more news at 11 o'clock.
Edit: for small programs -Os might bring a benefit over -O2 or -O3, though in STL-heavy big programs I've seen 100% perf gains with -O3 vs -O2 (gcc). For the particular application in the article, ease of maintenance is favoured over raw speed, so the use of Python (over, e.g., C++) is understandable.
Last edited by reavertm; 19 October 2019, 11:00 AM.
-
Originally posted by archsway View Post
Buckets?
For unordered_map, maybe, but for this specific case a normal RB-tree map might be better as it gives sorting for free.
But would a B-tree be better than both? It is still sorted, but has many child-nodes for each branch node, so doesn't access as many different memory locations during traversal.
Originally posted by onicsis View Post
Same old discussion about why to use a UNIX shell, and not Python (whichever version), for writing boot scripts or services: Python pulls in too many dependencies and libraries to be usable there, and memory and other resource usage is, or can be, significantly higher than with a classic solution like an sh shell.
If using Python, why not jAvAscript?
Why doesn't someone want to have [some] fun and replace all shell scripts with JavaScript scripts, or even better, rewrite the whole SystemD in JavaScript?
JavaScript everywhere
Originally posted by tuxd3v
Python copied the Lua way, but copied a lot of things wrong...
Last edited by cynical; 19 October 2019, 10:57 AM.
-
Originally posted by archsway View Post
Buckets?
For unordered_map, maybe, but for this specific case a normal RB-tree map might be better as it gives sorting for free.
But would a B-tree be better than both? It is still sorted, but has many child-nodes for each branch node, so doesn't access as many different memory locations during traversal.
In any case, the main issue is whether being forced to hold the hashed value behind a pointer is a good idea. Sometimes it's a good choice, but often it's not, particularly when the value is reasonably small: the extra indirection can cost a lot.
-
Originally posted by cybertraveler View Post
In my experience, Python comes pre-installed on most GNU/Linux distros I try or use. It's more rare for me to find PHP, a Java runtime, Ruby or even Perl pre-installed.
So unless you are talking about a container or embedded distribution that uses busybox, I'd be surprised if you found a distribution not shipping with Perl.
-
Originally posted by vladpetric View Post
You're wrong.
First of all, the depth of pipelining doesn't help at all with IPC. Actually, it can make things worse (everything in flight in the pipeline needs to be kept somewhere, and the deeper the pipeline, the more bookkeeping you need). Maybe the size of the re-order buffer can help. Still, given that it takes hundreds of cycles to get to main memory, you really don't want to go to main memory twice in series (the second load depends on the first, so they can't be parallelized). Even the largest ROBs can't compensate for that.
Second, the branch mispredict cost is far smaller than going all the way to memory. Won't matter here.
Third - SMT is irrelevant here- it's a multi-threaded throughput technique, not a single-threaded latency thing. Actually, it can make single-threaded latency worse.
A 19-stage pipeline can hide about 17 cycles of latency if the fetch occurs at decode and the execute happens at the second-to-last stage. Between pipelining, the ROB, the execution-port scheduler, and SMT, it's possible to hide 110 cycles of latency. That's enough to hide two back-to-back cache misses if the fetches were fired off in the first dozen cycles.
Deeper pipelines don't hurt IPC directly; quite the opposite, they enable the ILP that is the first goal of IPC increases. But you can get too much of a good thing: deeper pipelines increase trace length and trace quantity, which increases die capacitance, which increases energy per cycle, which brings the core to its thermal limits sooner, which lowers clock rate. A healthy uarch won't have a pipeline much deeper than triple the number of its execute ports, because the thermal tradeoff would sacrifice too much frequency.
But that won't stop them from increasing the ROB, BP, and SMT to hide more memory latency as a way to push ILP wider and cope with higher clock multipliers in future designs. Knights Landing had SMT4 to help cope with large NUMA latency. Power8/9 have SMT4 to help cope with workloads that create a lot of cache misses (in-memory databases in particular.)
Yes, SMT decreases per-thread IPC, but it increases per-core IPC by guaranteeing that the core always has something it can be doing, and the wider that core's execution ports get, the more important it becomes to keep them busy, to keep up the efficiency of the core as a whole. On the bright side, the OS can choose to schedule only one thread per core on an SMT-capable CPU if absolute minimum latency is required for that thread. On Linux you can pin cores to specific processes, and you can enable and disable SMT on each core individually, so you have full control over those performance/efficiency tradeoffs. Rejoice!