Intel Publishes Blazing Fast AVX-512 Sorting Library, Numpy Switching To It For 10~17x Faster Sorts


  • #31
    Originally posted by ms178 View Post
    I totally get your point when it comes to Intel's milking, but isn't AMD just as bad nowadays? AM5 is disgustingly expensive as a platform, and so are their GPUs. I miss the times when there was a price/performance champion giving us more for less.
    AMD has announced a lower-cost A620 chipset and their RX 6000 GPUs are currently selling below MSRP. Once that GPU inventory clears out, maybe they'll reprice the RX 7000 models, and I'm also hoping they'll do a refresh.

    I still think it's nice how the Ryzen 7000 CPUs launched below their predecessors' MSRP, inflation be damned.
    Last edited by coder; 16 February 2023, 03:06 PM.

    Comment


    • #32
      Originally posted by chuckula View Post
      Cloudflare is a web hosting service and they are not experts in actually writing software targeting hardware, unlike the developers who posted these results.
      Instead of ad hominem attacks, how about either pointing out a flaw in their methodology or providing better data?

      Originally posted by chuckula View Post
      If AVX-512 is so terrible for Linux, he should also stop most GPU driver development because it takes literally millions of lines of often buggy kernel code just to get a different model of GPU to turn on. Compared to that AVX-512 is a trivial point-patch.
      You've mischaracterized his position, which wasn't that it was bad for Linux, but bad for CPUs. I'm not going to waste more time on rehashing that debate, because I don't 100% agree with him and it was also a point that can't be completely divorced from the time period in which it was made.

      All I'm saying is that clock-throttling was a very real issue on 14 nm CPUs. I don't know how bad it is on subsequent ones, which is why I'd like to see some analysis of that. Unlike you, I try to be data-driven about these things.

      Furthermore, I actually experienced the clock-throttling issue myself. By disabling AVX-512 and using only AVX2, we were able to increase throughput by nearly 50% on a Skylake SP server CPU. After having been burned by it, I really want to know to what extent it's still an issue, before I get behind the notion of sprinkling AVX-512 in yet more places.

      Originally posted by chuckula View Post
      In the real world, Daniel Lemire has been showing massive improvements using AVX-512 on a wide range of workloads
      In synthetic benchmarks, or within real-world applications? Because synthetic benchmarks tend to benefit from AVX-512 more than they suffer from clock-throttling. But it's when you mix like 10% AVX-512 with a bunch of non-AVX-512 code that the downclocking hurts you more than the acceleration helps.

      Originally posted by chuckula View Post
      there just needs to be more software support for it.
      If it's a net negative, then we need less, not more. That's why this question is so crucial.

      Comment


      • #33
        Originally posted by ms178 View Post

        I totally get your point when it comes to Intel's milking, but isn't AMD just as bad nowadays? AM5 is disgustingly expensive as a platform, and so are their GPUs. I miss the times when there was a price/performance champion giving us more for less.

        About AM5, I have not been able to find any confirmation that the mobos are so expensive because AMD decided they should be.

        About DDR5, same thing.

        CPU-wise? They are way faster and either priced the same as or cheaper than the previous gen, so how is that bad?

        GPUs? Yes, I agree with you, they should be cheaper.

        But they observed that Nvidia is charging US$2K for a GPU while the number of rabid followers keeps increasing, so they said: maybe we can find some similar fools.

        Comment


        • #34
          Originally posted by coder View Post
          FWIW, Intel seems to do its best when they're down. After Pentium 4, we got Core 2. And after Ice Lake & Rocket Lake, we got Alder Lake.

          I'm eager to see how much they can catch up to TSMC, with Intel 4 and their 18A nodes. Sierra Forest should be very interesting, as well. Doing some scaling analysis on it could be fascinating.
          As I said, I don't want to see that or anything from them.

          They have shown over and over what they will do to us when they are on top, so I am baffled by people wanting them back on top.

          We should learn our lesson and perhaps pray for them to die and be replaced by someone else.

          Comment


          • #35
            Originally posted by bug77 View Post

            I'm pretty sure that we did; O(n*log(n)) is the best we can do. We can work on the constant factor a bit, but as far as general purpose sorting is concerned, that's it.
            What we can do these days, I believe, is design algorithms for sets of data following a particular pattern and then pick the right algorithm for the job (think of how bubble sort beats quick sort and friends for data that is almost sorted from the start - something that happens a lot IRL).

            That said, I have to state the obvious here: a 17x improvement means current sorts will happen that much faster, but it doesn't mean we get to sort 17x as much data in the same timeframe. Unfortunately.

            Also thanks for mentioning those algorithms, I'll be sure to take a look.
            Companies will pay large sums of money to reduce constant factors.

            Also, for what it is worth, non-general-purpose sorting algorithms are situationally useful. I have been sitting on an algorithm for sorting filesystem path lists in O(n log(n/k)) time, where k is the number of directories, since 2015 (give or take a year). I had initially implemented it in zfs list, but it had failed a test case, which blocked it from being merged into ZFS. Before fixing it, I received extremely demotivating feedback from an academic who became hostile when I claimed that I could sort file paths asymptotically faster than O(n log(n)) time, so I ended up sitting on it for years. I plan to fix the test case issues and get it into ZFS later this year.

            The algorithm is a meta-algorithm that uses a general purpose sorting algorithm as its core, which I will assume in what follows to be an O(n log(n)) algorithm. When the paths are represented in a tree structure, the result of a comparison between siblings is identical to the result of a comparison between any child from the first sibling and any child from the second sibling. Thus, we can skip comparisons between non-siblings when sorting directory paths by invoking the general purpose sorting algorithm on only the immediate children of a directory (i.e. siblings). Skipping comparisons whose results we already know from the tree structure lets us sort a list of paths with fewer than n log(n) comparisons, and the math works out to asymptotically fewer comparisons, since we invoke the general purpose sorting algorithm, which takes O(n/k * log(n/k)) time (following a substitution), k times. The result is that we can traverse the tree from the root to the leaves, sorting only the immediate children at each step, and the end result of the traversal will be sorted, in O(n log(n/k)) time.

            A neat aspect of it is that every time you finish sorting leaves via pre-order traversal, you can print the leaves, since the algorithm only needs to sort a subset of data before what comes first in the list is known. That makes an implementation very quick to start printing output if it takes advantage of this property. Similarly, you only need to read the subset being sorted via the general purpose sort prior to sorting it, so an implementation can begin printing sorted file paths while it is still reading directories. That avoids the wall of wait that you have with the traditional method where you cannot sort until you have all file paths and cannot print until you finish sorting.

            The best case performance of the algorithm is O(n), when every directory has at most 2 children, while the worst case time is O(n log(n)), when the only directory is the root directory. In the former case, at most 1 comparison per node gives a linear number of comparisons. In the latter case, the insight that comparisons between non-siblings can be skipped does not avoid any comparisons, since everything (minus the root directory) is a sibling, so our time complexity becomes that of the general purpose sorting algorithm we wrap. Outside of the worst case, the constant factor is also smaller than it is for a general purpose sorting algorithm, since we only compare base names rather than full paths.
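
            To make the idea concrete, here is a rough Python sketch of the approach (an illustration only, not the actual zfs list code, with made-up helper names, and glossing over the distinction between listed paths and implied parent directories): paths go into a nested dict keyed by path component, then a pre-order traversal sorts only the immediate children of each directory and emits entries as soon as each level is sorted.

            ```python
            # Rough sketch of the tree-based path sort described above.
            # Illustration only -- not the zfs list implementation.

            def build_tree(paths):
                """Insert each path into a nested dict keyed by path component."""
                root = {}
                for path in paths:
                    node = root
                    for part in path.strip("/").split("/"):
                        node = node.setdefault(part, {})
                return root

            def emit_sorted(node, prefix, out):
                """Pre-order traversal: the general purpose sort (sorted) only ever
                sees the immediate children of one directory, and each level is
                emitted as soon as it has been sorted."""
                for name in sorted(node):  # compares base names only, not full paths
                    full = prefix + "/" + name
                    out.append(full)
                    emit_sorted(node[name], full, out)

            def sort_paths(paths):
                out = []
                emit_sorted(build_tree(paths), "", out)
                return out

            print(sort_paths(["/b/z", "/a", "/b/a", "/b"]))
            # ['/a', '/b', '/b/a', '/b/z']
            ```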

            It is a shame that I let myself be discouraged. zfs list would have been much faster had I finished implementing it years ago. Anyway, that is an example of a special purpose sorting algorithm. When your data has special properties that you can exploit to sort faster, you can get some amazing results.
            Last edited by ryao; 16 February 2023, 02:57 PM.

            Comment


            • #36
              Originally posted by coder View Post
              ark.intel.com actually has a field which indicates single vs. dual-FMA of their large-die 14 nm CPUs. The label is literally "# of AVX-512 FMA Units" and here are two Cascade Lake models with differing numbers:
              And so, to answer my question ... yes, even the lowest core count models of these new SPR-WS chips have dual AVX-512 FMA units per core.



              Comment


              • #37
                Originally posted by NeoMorpheus View Post
                About AM5, I have not been able to find any confirmation that the mobos are so expensive because AMD decided they should be.
                I saw this analysis:

                It was written before AMD announced the A620 chipset.

                Comment


                • #38
                  Originally posted by NeoMorpheus View Post
                  They have shown over and over what they will do to us when they are on top so I am baffled by people wanting them back on top.
                  Not necessarily on top, but I want Intel to be competitive. That's best for the sector.

                  Look at it this way: if AMD didn't face stiff competition from Intel, AMD's investors would do the same things as Intel's did. They'd probably boot Lisa Su and put someone in charge who's better at cutting costs and milking their customers. And the windfall would go into funding dividends and stock buybacks. Meanwhile, the pace of improvements from AMD would slow to a rate similar to what we got accustomed to Intel delivering, during the quad-core era.

                  As long as AMD can keep growing but faces strong competition from Intel and Nvidia, they can hopefully continue to justify reinvesting their earnings back into their business. I think it's noteworthy that AMD hasn't done big layoffs like most other big tech companies.
                  Last edited by coder; 16 February 2023, 03:19 PM.

                  Comment


                  • #39
                    Originally posted by coder View Post
                    Not necessarily on top, but I want Intel to be competitive. That's best for the sector.

                    Look at it this way: if AMD didn't face stiff competition from Intel, AMD's investors would do the same things as Intel's did. They'd probably boot Lisa Su and put someone in charge who's better at cutting costs and milking their customers. And the windfall would go into funding dividends and stock buybacks. Meanwhile, the pace of improvements from AMD would slow to a rate similar to what we got accustomed to Intel delivering, during the quad-core era.

                    As long as AMD can keep growing but faces strong competition from Intel and Nvidia, they can hopefully continue to justify reinvesting their earnings back into their business. I think it's noteworthy that AMD hasn't done big layoffs like most other big tech companies.
                    I understand the need for competition, and as I said before, others can provide it: ARM, RISC-V, POWER, and who knows who else.

                    The one that is not needed and honestly should just die is Intel.

                    Comment


                    • #40
                      Originally posted by ryao View Post
                      When the paths are represented in a tree structure, the result of a comparison between siblings is identical to the result of a comparison between any child from the first sibling and any child from the second sibling. Thus, we can skip comparisons between non-siblings when sorting directory paths by invoking the general purpose sorting algorithm on only the immediate children of a directory (i.e. siblings).
                      This reminds me of tries. If it helps, or if you want to see a similar implementation, it might be worth looking up a trie sorting example.
                      Perhaps citing that similarity will also make it more palatable to some of your reviewers.
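
                      For instance, a toy trie sort over strings (just a sketch to show the structural similarity; comparisons only ever happen between siblings at one character position) might look something like this:

                      ```python
                      # Toy trie sort -- illustrative sketch only, to show the similarity
                      # with the sibling-only comparisons in the path sorting algorithm.

                      END = object()  # marker meaning "a string ends at this node"

                      def trie_sort(strings):
                          # Build the trie: one dict level per character position.
                          root = {}
                          for s in strings:
                              node = root
                              for ch in s:
                                  node = node.setdefault(ch, {})
                              node[END] = node.get(END, 0) + 1  # count duplicates

                          # Depth-first traversal, visiting children in character order.
                          out = []
                          def walk(node, prefix):
                              out.extend([prefix] * node.get(END, 0))
                              for ch in sorted(k for k in node if k is not END):
                                  walk(node[ch], prefix + ch)
                          walk(root, "")
                          return out

                      print(trie_sort(["banana", "apple", "app", "band"]))
                      # ['app', 'apple', 'banana', 'band']
                      ```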

                      Originally posted by ryao View Post
                      Anyway, that is an example of a special purpose sorting algorithm. When your data has special properties that you can exploit to sort faster, you can get some amazing results.
                      😎

                      Comment
