Rewriting Old Solaris C Code In Python Yielded A 17x Performance Improvement


  • #51
    Originally posted by vladpetric View Post

    But there is no need for a prefetch if key and value fit within a cache line (typically 64 bytes or larger). You just access your bucket, and everything's in there.
    Because the world is larger than just hash tables/sets?

    Comment


    • #52
      Originally posted by microcode View Post
      "instead of just fixing the program, we completely rewrote it... in an interpreted language with a runtime dramatically larger than the program itself."
      The runtime is not a penalty if you can reasonably expect it to already be installed on the system.

      In my experience, Python comes pre-installed on most GNU/Linux distros I try or use. It's rarer for me to find PHP, a Java runtime, Ruby or even Perl pre-installed.

      I'm not sure if distro makers are deliberately installing CPython or if they just often ship apps which happen to depend on it. I know that in the case of the Ubuntu & Gnome projects, they both like and encourage Python use.

      Comment


      • #53
        I look forward to the Linux kernel and Google Chrome getting converted to Python!

        Articles like this are nothing more than click-bait for fanbois. If a particular algorithm is implemented inefficiently, then it will be slow - who woulda thunk?
        Ask yourself this: What software running on your computer has to be extremely fast and extremely reliable? What programming language is used for that software?
        Answers: the kernel, C
        There you have it.

        Comment


        • #54
          Originally posted by g04th3rd3r View Post
          I look forward to the Linux kernel and Google Chrome getting converted to Python!
          Well, it would be a bigger mess...
          Python copied the Lua way, but copied a lot of things wrong...

          Comment


          • #55
            Originally posted by DoMiNeLa10 View Post

            How much of a speedup would you get if you used a modern compiler and tweaked it with basic settings such as -Os?
            Close to none. C code doesn't gain performance from -Os, which is optimisation for size (except in some extreme cases where the smaller code fits in cache). Even with -O3 the gain would likely be minimal, since the problem is elsewhere: in the data structures and the algorithm. A linked-list-based set with linear lookup (a list can't be vectorised even with -O3) will be slower than a hash set with constant-time lookup; with big enough n, much slower, 17x slower even. A proper C++ implementation with -O3 would wipe the floor with the Python version here.
            As someone wrote here, more news at 11 o'clock.
            Edit: for small programs -Os might bring a benefit over -O2 or -O3, though in big STL-heavy programs I've seen 100% perf gains with -O3 vs -O2 (gcc). For the particular application in the article, ease of maintenance is favoured over raw speed, so the use of Python (over e.g. C++) is understandable.
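            To make that gap concrete, here's a minimal, hedged sketch (container choice and sizes are mine, not from the article): membership in a linked list is a linear, pointer-chasing scan, while a hash set resolves it with a single bucket probe.

            ```cpp
            #include <chrono>
            #include <cstdio>
            #include <list>
            #include <unordered_set>

            int main() {
                constexpr int n = 100000;
                std::list<int> list_set;           // linked-list "set": O(n) lookup
                std::unordered_set<int> hash_set;  // hash set: O(1) average lookup
                for (int i = 0; i < n; ++i) {
                    list_set.push_back(i);
                    hash_set.insert(i);
                }

                auto time_lookups = [&](const char* name, auto&& contains) {
                    auto t0 = std::chrono::steady_clock::now();
                    long hits = 0;
                    for (int i = 0; i < n; i += 100) hits += contains(i);
                    auto us = std::chrono::duration_cast<std::chrono::microseconds>(
                                  std::chrono::steady_clock::now() - t0).count();
                    std::printf("%s: %ld hits in %lld us\n", name, hits, (long long)us);
                };

                // Linear scan: chases a pointer per node, O(n) per lookup.
                time_lookups("list", [&](int k) {
                    for (int v : list_set) if (v == k) return 1;
                    return 0;
                });
                // Hash + probe: one bucket access, O(1) on average.
                time_lookups("hash", [&](int k) { return (int)hash_set.count(k); });
            }
            ```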
            Last edited by reavertm; 19 October 2019, 11:00 AM.

            Comment


            • #56
              Originally posted by archsway
              Buckets?

              For unordered_map, maybe, but for this specific case a normal RB-tree map might be better as it gives sorting for free.

              But would a B-tree be better than both? It is still sorted, but has many child-nodes for each branch node, so doesn't access as many different memory locations during traversal.
              Order may not matter if you just care about membership. For simple in-memory use an RB-tree can be the better choice, since the data is stored directly in the tree nodes. You see B-trees where indirection is unavoidable due to the massive size of the data, as in filesystems or web indexes.
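              As a hedged illustration (container choice is mine): std::set, typically an RB-tree, hands you sorted traversal for free, while std::unordered_set answers membership alone.

              ```cpp
              #include <cstdio>
              #include <set>
              #include <unordered_set>

              int main() {
                  std::set<int> ordered{5, 1, 4, 2, 3};             // RB-tree in common stdlibs
                  std::unordered_set<int> unordered{5, 1, 4, 2, 3}; // hash buckets

                  // Both answer membership, which is all many workloads need.
                  std::printf("contains 3: %d %d\n",
                              (int)ordered.count(3), (int)unordered.count(3));

                  // Sorted traversal comes for free only from the tree.
                  for (int v : ordered) std::printf("%d ", v);  // 1 2 3 4 5
                  std::printf("\n");
              }
              ```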

              Originally posted by onicsis View Post
              Same old discussion: why use a UNIX shell, and not Python (whichever version), for writing boot scripts or services. Because Python will pull in too many dependencies and libraries to be usable, and memory or any other resource use is, or can be, significantly higher when using Python as a shell instead of a classic solution like an sh shell.

              If using python why not jAvAacript

              Why would someone want to have [some] fun and replace all shell scripts with JavaScript scripts, or even better, rewrite the whole SystemD in JavaScript?
              JavaScript everywhere
              Why not Python or JavaScript? The days of limited memory are almost gone, and both languages are much nicer to use than shell scripts. You are also less likely to have errors in the code because they are more expressive and a lot of basic functionality is built-in. JavaScript can run even on embedded devices, and these days it is very fast. If the runtime is there already (often true in Python's case) it's just a pure gain to make use of it.

              Originally posted by tuxd3v
              Python copied the Lua way, but copied a lot of things wrong..
              Lua didn't exactly get everything right either... *cough* 1-based indexing *cough*
              Last edited by cynical; 19 October 2019, 10:57 AM.

              Comment


              • #57
                Originally posted by onicsis View Post
                If using python why not jAvAacript

                Why would someone want to have [some] fun and replace all shell scripts with JavaScript scripts, or even better, rewrite the whole SystemD in JavaScript?
                JavaScript everywhere
                Node.js doesn't work that well if you want to write a proper input filter that can be piped together with other tools, due to how it handles I/O, so it's a bad idea, not to mention the startup times of V8.

                Comment


                • #58
                  Originally posted by archsway View Post

                  Buckets?

                  For unordered_map, maybe, but for this specific case a normal RB-tree map might be better as it gives sorting for free.

                  But would a B-tree be better than both? It is still sorted, but has many child-nodes for each branch node, so doesn't access as many different memory locations during traversal.
                  An in-memory, cache-aligned B(+) tree can actually work quite well; I've seen it in a few places. With C++ you can figure out the sizes of the keys and values at compile time and size a node accordingly, so that it all lines up nicely.
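                  A minimal sketch of that compile-time sizing (the node layout and the 64-byte line size are my assumptions, not from the post): pick the slot count so keys plus child pointers fill one cache line, then align the node to it.

                  ```cpp
                  #include <cstddef>
                  #include <cstdint>

                  // Assumed line size; real code might use std::hardware_constructive_interference_size.
                  constexpr std::size_t kLineSize = 64;

                  template <typename Key>
                  struct alignas(kLineSize) BTreeNode {
                      // Solve: sizeof(count) + N*sizeof(Key) + (N+1)*sizeof(ptr) <= kLineSize
                      static constexpr std::size_t kSlots =
                          (kLineSize - sizeof(std::uint16_t) - sizeof(void*)) /
                          (sizeof(Key) + sizeof(void*));

                      std::uint16_t count = 0;         // keys currently in use
                      Key keys[kSlots];                // sorted keys, all on this line
                      BTreeNode* children[kSlots + 1];
                  };

                  // With 4-byte keys: 4 keys + 5 children + count pad out to exactly 64 bytes.
                  static_assert(sizeof(BTreeNode<std::uint32_t>) <= kLineSize,
                                "node must fit a single cache line");

                  int main() { BTreeNode<std::uint32_t> root; (void)root; }
                  ```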

                  In any case, the main issue is whether being forced to keep the hashed value behind a pointer is a good idea. Sometimes it's a good choice, but often it's not, particularly when the value is reasonably small. The extra indirection can cost a lot.
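                  A small sketch of that indirection cost (the types are illustrative): a pointer-valued map adds a dependent load on every lookup, while inline storage keeps the value in the node the lookup already touched.

                  ```cpp
                  #include <memory>
                  #include <string>
                  #include <unordered_map>

                  struct Stats { long hits = 0, misses = 0; };

                  // Pointer-valued: find() lands on the bucket node, then reading the
                  // Stats is a second, dependent load somewhere else on the heap.
                  std::unordered_map<std::string, std::unique_ptr<Stats>> by_pointer;

                  // Inline-valued: the Stats sit in the node the lookup already fetched.
                  std::unordered_map<std::string, Stats> by_value;

                  long hits_for(const std::string& key) {
                      auto it = by_value.find(key);
                      return it != by_value.end() ? it->second.hits : 0;  // no extra deref
                  }

                  int main() {
                      by_value["x"].hits = 3;
                      return (int)hits_for("x");
                  }
                  ```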

                  Comment


                  • #59
                    Originally posted by cybertraveler View Post
                    In my experience, Python comes pre-installed on most GNU/Linux distros I try or use. It's more rare for me to find PHP, a Java runtime, Ruby or even Perl pre-installed.
                    Really, a distribution that doesn't pre-install Perl? At least on Arch, Perl is a dependency of OpenSSL, LXC, cowsay, git, perf, rsync, etc., and OpenSSL alone among those will be pulled in by packages such as wpa_supplicant, ssh, Python, coreutils or ostree (which flatpak needs), among others.

                    So unless you are talking about a container or embedded distribution that uses busybox, I'd be surprised if you found a distribution not shipping with Perl.

                    Comment


                    • #60
                      Originally posted by vladpetric View Post

                      You're wrong.

                      First of all, the depth of pipelining doesn't help at all with IPC. Actually, it can make things worse (all the things in the pipeline need to be kept somewhere, and the deeper the pipeline, the more record keeping you need). Maybe the size of the re-order buffer can help. Still, given that it takes hundreds of cycles to get to main memory, you really don't wanna go to main memory twice, in series (the second load depends on the first load, so they can't be parallelized). Even the largest ROBs can't compensate for that.

                      Second, the branch mispredict cost is far smaller than going all the way to memory. Won't matter here.

                      Third, SMT is irrelevant here; it's a multi-threaded throughput technique, not a single-threaded latency thing. Actually, it can make single-threaded latency worse.
                      I think you misunderstood everything I said. It had nothing to do with IPC and everything to do with hiding memory latency, which you didn't address. A 192-entry ROB with 4 retire units can hide 48 clocks of latency. With SMT you can hide 96 clocks by having the two threads take turns. Meanwhile, other instructions from both threads in the ROB can still be scheduled to execution ports. The scheduler slots for the execution ports extend this even further, but not by much.

                      A 19-stage pipeline can hide about 17 cycles of latency if the fetch is issued at decode and the execute happens at the second-to-last stage. Between pipelining, the ROB, the execution-port scheduler, and SMT, it's possible to hide around 110 cycles of latency. That's enough to hide two back-to-back cache misses if the fetches were fired off in the first dozen cycles.
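                      Restating that arithmetic as a hedged back-of-envelope (the figures are this post's assumptions, not universal constants):

                      ```cpp
                      // Latency a core can hide, per the numbers above; real cores differ.
                      constexpr int rob_entries   = 192;
                      constexpr int retire_width  = 4;
                      constexpr int rob_hide      = rob_entries / retire_width; // 48 cycles
                      constexpr int smt2_hide     = 2 * rob_hide;               // 96 cycles: two threads alternate
                      constexpr int pipeline_hide = 17;                         // 19 stages, fetch-to-execute overlap
                      constexpr int total_hide    = smt2_hide + pipeline_hide;  // 113, rounded to ~110 above

                      static_assert(rob_hide == 48 && total_hide >= 110, "matches the arithmetic above");

                      int main() {}
                      ```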

                      Deeper pipelines don't hurt IPC directly; quite the opposite, they enable ILP, which is the primary route to IPC increases. But you can get too much of a good thing. Deeper pipelines increase trace length and trace quantity, which increases die capacitance, which increases energy per cycle, which brings the core to its thermal limits sooner, which lowers the clock rate. A healthy uarch won't have a pipeline much deeper than triple the number of its execute ports, because the thermal tradeoff would sacrifice too much frequency.

                      But that won't stop them from increasing the ROB, branch predictor, and SMT to hide more memory latency, as a way to push ILP wider and cope with higher clock multipliers in future designs. Knights Landing had SMT4 to help cope with large NUMA latencies. POWER8/9 support SMT4 and even SMT8 to help cope with workloads that generate a lot of cache misses (in-memory databases in particular).

                      Yes, SMT decreases per-thread IPC, but it increases per-core IPC by guaranteeing that the core always has something it can be doing, and the wider a core's execution ports get, the more important it becomes to keep them busy, to keep up the efficiency of the core as a whole. On the bright side, the OS can choose to schedule only one thread per core on an SMT-capable CPU if absolute minimum latency is required for that thread. On Linux you can pin processes to specific cores, and you can enable and disable SMT on each core individually, so you have full control over those performance/efficiency tradeoffs. Rejoice!
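                      A hedged sketch of the pinning half of that (Linux-specific; the core number is an arbitrary choice): a process can restrict itself to one core with sched_setaffinity, and a core's SMT sibling can be taken offline separately via /sys/devices/system/cpu/cpuN/online.

                      ```cpp
                      #ifndef _GNU_SOURCE
                      #define _GNU_SOURCE  // sched_setaffinity is a GNU extension
                      #endif
                      #include <sched.h>
                      #include <cstdio>

                      int main() {
                          cpu_set_t set;
                          CPU_ZERO(&set);
                          CPU_SET(2, &set);  // core 2: arbitrary pick for the example

                          // 0 == calling process; the kernel will now only schedule it on core 2.
                          if (sched_setaffinity(0, sizeof(set), &set) != 0) {
                              std::perror("sched_setaffinity");
                              return 1;
                          }
                          std::printf("pinned to core 2\n");
                          return 0;
                      }
                      ```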



                      Comment
