One-Line Patch For Intel Meteor Lake Yields Up To 72% Better Performance, +7% Geo Mean


  • panikal
    replied
    Originally posted by coder View Post
    I like to tell people: "get it right, then make it fast". If the answer's not right (or the system isn't stable), I don't care how fast it is.

    Of course, I work in software. In hardware, I expect you tend to find stability problems that don't crop up until you run the thing at speed. However, if it's not stable at slow speeds, chances are the situation is going to be even worse at high speeds.
    The majority of the testing involved pushing specific patterns of 1s and/or 0s (the patterns and codes were super proprietary) through the various buses and electrical connections at very high rates, to trigger electromagnetic interference and "worst case" / "best case" data flow scenarios. By the time it's released, a server board / CPU has had many tens of thousands of hours of testing pushing its electrical connections to their corners... and now we get Meltdown and Spectre.

    Testing is a constant struggle to evolve your methods against increasing complexity and change within a profitable time frame.
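    For anyone curious what that kind of stimulus looks like in principle, here is a minimal Python sketch of the textbook versions: alternating bits to maximize toggling and crosstalk, plus a walking one to exercise each line in isolation. Purely illustrative; the actual patterns and codes used in validation were proprietary, as noted above.
    Code:
    # Illustrative only: classic "worst case" bus stimuli, not the
    # proprietary patterns used in real validation.
    def alternating_pattern(words: int, width: int = 64):
        """Yield 0xAAAA.../0x5555... words to maximize bit toggling."""
        a = int("10" * (width // 2), 2)   # 0b1010...10
        b = a >> 1                        # 0b0101...01
        for i in range(words):
            yield a if i % 2 == 0 else b

    def walking_one(width: int = 64):
        """Yield words with a single set bit marching across the bus."""
        for bit in range(width):
            yield 1 << bit

    if __name__ == "__main__":
        print([hex(w) for w in walking_one(8)])
        print([hex(w) for w in alternating_pattern(4, width=16)])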



  • coder
    replied
    Originally posted by panikal View Post
    Mainstream benchmarking was actually a separate process that occurred from Alpha onwards but was pretty lightweight compared to actual hardware validation that happened. The entire validation cycle could take 12-18 months.
    I like to tell people: "get it right, then make it fast". If the answer's not right (or the system isn't stable), I don't care how fast it is.

    Of course, I work in software. In hardware, I expect you tend to find stability problems that don't crop up until you run the thing at speed. However, if it's not stable at slow speeds, chances are the situation is going to be even worse at high speeds.



  • panikal
    replied
    I did work on Westmere; Sandy Bridge was halfway done when I left. Even then we did thousands of hours of validation per server board, per cycle (Alpha/Beta/Silver/etc.), using internal tooling and pattern testing in a ton of configs across a range of mainstream OSes (Win/RHEL/SLES) - performance was part of it, but not the focus of most testing... Mainstream benchmarking was actually a separate process that occurred from Alpha onwards but was pretty lightweight compared to actual hardware validation that happened. The entire validation cycle could take 12-18 months.

    There are hundreds of kernel knobs exposed that can tune performance, so it doesn't surprise me that stuff like this falls through the cracks no matter how robust the process gets.
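    If you want to see a few of those knobs for yourself, here is a small Python sketch that dumps some per-CPU cpufreq tunables, including the energy_performance_preference knob relevant to this article. It assumes a Linux box with cpufreq sysfs entries (e.g. intel_pstate); exactly which files exist varies by driver and kernel, so treat it as a rough illustration.
    Code:
    #!/usr/bin/env python3
    # Dump a few per-CPU cpufreq tunables from sysfs.
    from glob import glob
    import os

    KNOBS = ("scaling_governor", "energy_performance_preference",
             "scaling_min_freq", "scaling_max_freq")

    for cpu_dir in sorted(glob("/sys/devices/system/cpu/cpu[0-9]*/cpufreq")):
        cpu = cpu_dir.split("/")[-2]          # e.g. "cpu0"
        fields = []
        for knob in KNOBS:
            try:
                with open(os.path.join(cpu_dir, knob)) as f:
                    fields.append(f"{knob}={f.read().strip()}")
            except OSError:
                fields.append(f"{knob}=<n/a>")  # knob absent for this driver
        print(cpu, " ".join(fields))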



  • ms178
    replied
    Originally posted by panikal View Post

    Having done this actual job at Intel once upon a long time ago (specifically, recommending kernel/BIOS defaults for new product initiatives in the server division), I can say that these defaults (at least when I was doing it) were the result of someone setting 'sane' defaults and no one on the kernel, BIOS, or product teams realizing that changing a value and running a benchmark = big change.

    Stuff like this is often coded by a dev in a vacuum (a red cover guide and maybe a prototype), with minimal benchmarking on the initial roll-out scenarios, because in order for the more mainstream Engineering Product teams to test it, it has to be in a kernel, so tuning parameters like this... fall through the cracks. Someone chose a conservative default that sounded good at the time and no one revisited that decision until now... but someone at least DID revisit it!
    Thanks a lot for your valuable insights! I would have thought they would do a benchmark marathon with the relevant teams two weeks before release, with lots of pizza and soft drinks, tweaking all possible knobs with experts from all teams together in a room for a weekend or two.

    But seriously, I would have thought that there would be a team responsible, and a process, for revisiting all of the small things that could yield a huge impact before release.

    By the way, I still have fond memories of the X58 platform and the Westmere Xeons that were easily overclockable from 2.66 GHz to 4.2 GHz. In my opinion it was the best platform in terms of longevity and value over time with used server CPUs. Were you involved in working on these back in the day?



  • coder
    replied
    Originally posted by rmfx View Post
    How come some schedulers regularly claim +50 percent for some given tasks, then?
    It would help if you could be more specific.

    Most of the time we hear about eye-popping performance improvements, they're from outlier cases in some microbenchmark designed to tease out such performance problems, and they're barely applicable to the typical user or scenario. Case in point:

    [embedded Phoronix article link]

    If you aren't running servers with hundreds or thousands of active connections over like 100 Gbps links, you'd never see the inefficiency they fixed.

    Originally posted by rmfx View Post
    I'll build a silly hypothesis this way:
    If Linux were developed from now on by an alien team with an average IQ of 800, I bet we would be surprised by the amount of extra performance they could unlock, performance we thought wasn't even feasible, using next-gen logic and math and killing every single bottleneck at the lowest level.
    We know Linux has some inefficiencies, because we read about optimizations like folios and fixes to other weird, performance-robbing legacy code. However, you should also keep in mind that vendors of high-performance hardware and big cloud users (as well as embedded & realtime devs) are profiling, analyzing, and optimizing the kernel to get more performance. It's really not as if nobody has looked at this stuff, or that it isn't continually being looked at. You can bet nearly all of the low-hanging fruit and big wins have already been taken.

    As I said before, there's a pretty simple test you can do by comparing the theoretical performance vs. practical, observed performance. People do variations of this all the time, just to sanity-check their benchmark results and to see whether it would be worth digging deeper to understand why the performance isn't closer to what they think it should be. However, if it is performing how you think it should, then there's probably not a lot of places performance-robbing code can hide!
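    As a trivial sketch of that kind of sanity check: time a big memory copy and compare the observed bandwidth against whatever your platform's spec sheet claims. The peak figure below is a placeholder assumption, not a real number for any particular machine.
    Code:
    # Rough check: observed copy bandwidth vs. an assumed theoretical peak.
    import time

    THEORETICAL_GBPS = 50.0        # placeholder: substitute your DRAM spec
    SIZE = 512 * 1024 * 1024       # 512 MiB

    src = bytearray(SIZE)
    start = time.perf_counter()
    dst = bytes(src)               # one full copy of the buffer
    elapsed = time.perf_counter() - start

    observed = SIZE / elapsed / 1e9
    print(f"observed: {observed:.1f} GB/s "
          f"({100 * observed / THEORETICAL_GBPS:.0f}% of assumed peak)")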

    Birdie has a variation on this, where he likes to claim that if your application is bottlenecked by OS performance, then you're doing something wrong. Well, he seems to have a very narrow view of how people use computers, but there's a point in there. You can just look at how much system time a process is using. If that number is near zero, then there's probably very little potential for optimizations in the kernel to yield a performance improvement.
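    A minimal sketch of that check, using two hypothetical toy workloads (one CPU-bound, one syscall-heavy) and getrusage to split user vs. system time:
    Code:
    # If ru_stime stays near zero, kernel-side optimizations have little to offer.
    import os, resource

    def cpu_bound():
        return sum(i * i for i in range(5_000_000))

    def syscall_heavy(tmp="/tmp/systime_demo.bin"):
        with open(tmp, "wb") as f:
            for _ in range(20_000):
                f.write(b"x" * 4096)
                f.flush()                 # push each block to the kernel
        os.remove(tmp)

    for name, fn in (("cpu_bound", cpu_bound), ("syscall_heavy", syscall_heavy)):
        before = resource.getrusage(resource.RUSAGE_SELF)
        fn()
        after = resource.getrusage(resource.RUSAGE_SELF)
        print(f"{name:14s} user={after.ru_utime - before.ru_utime:.2f}s "
              f"sys={after.ru_stime - before.ru_stime:.2f}s")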

    That said, I have my own reservations about how Linux does certain things around multicore and multithreading. I do think there are decent-sized wins to be had by revisiting how userspace code utilizes multiple cores, but about the only place you'd see a big win from refactoring that is on a busy machine that's heavily oversubscribed. It wouldn't tend to make much difference in PTS benchmarks, since those just run one thing at a time.



  • rmfx
    replied
    Originally posted by coder View Post
    In general, no. The things which are easily quantifiable (e.g. how fast a given FFT kernel runs) tend to be tuned to within a reasonable approximation of the hardware's theoretical limits. What's harder is when you're doing complex operations that involve lots of different pieces of code, perhaps something like a database, which has its own complexity, then has to go through the kernel, filesystem, and down to the hardware. In such cases, it's much harder to say how fast it should be, and it can also be hard to figure out which parts are really the limiting factors.

    That's why things like io_uring tend to take a while to come around. Someone has to start beating on a particular performance area and decide there are big wins to be had by refactoring. In the case of io_uring, by removing one set of bottlenecks, Axboe was able to uncover other bottlenecks, and eventually tuned the entire stack. It should be noted that the other big enabler was SSDs. Back when we only had spinning hard disks, the hardware was the main bottleneck and not one you could do much about. The software only had to be fast enough not to get in its way, but that still allowed the fast path to get cluttered enough that it ended up bottlenecking fast NVMe storage.
    How come some schedulers regularly claim +50 percent for some given tasks, then?

    I'll build a silly hypothesis this way:
    If Linux were developed from now on by an alien team with an average IQ of 800, I bet we would be surprised by the amount of extra performance they could unlock, performance we thought wasn't even feasible, using next-gen logic and math and killing every single bottleneck at the lowest level.
    Last edited by rmfx; 12 June 2024, 12:42 PM.



  • bhptitotoss
    replied
    Originally posted by coder View Post
    You troll enough that it probably was not so obvious to most of us whether or not you were truly joking. Not helped by the fact that it sucked as a joke.

    Then again, since you seem to think you can do anything as well as anyone far more experienced and better trained than you, I guess it shouldn't surprise me that you fancy yourself a comedian (among many other things).
    Oh please, your whining is ridiculous. Even trolls know better; try harder next time.



  • bhptitotoss
    replied
    Originally posted by coder View Post
    You obviously didn't read far enough, because I was crystal clear that I didn't like it. However, you said the marketing people were bad at their job, but all you succeeded in doing was to show that you don't understand their job. I think none of us have the data to say whether their tactics are actually successful in their aims. That's what's actually needed in order to say whether they're any good or not.


    If you willfully misunderstand what I'm saying, it does seem rather pointless.
    Sounds like you're mad because you can't understand anything I said, LOL. Stop crying.



  • bhptitotoss
    replied
    Originally posted by Phoronos View Post

    Intel processor names are impossible to understand and you defend that? LOL
    Stop writing, please.
    OMFG, seriously, what a joke. Intel naming is totally cringe; stop defending it, dude.



  • coder
    replied
    Originally posted by rmfx View Post
    Who knows how much performance is buried by improvable code...
    Maybe computers could get 30x faster with unchanged hardware if a few thousand lines of code were replaced.
    In general, no. The things which are easily quantifiable (e.g. how fast a given FFT kernel runs) tend to be tuned to within a reasonable approximation of the hardware's theoretical limits. What's harder is when you're doing complex operations that involve lots of different pieces of code, perhaps something like a database, which has its own complexity, then has to go through the kernel, filesystem, and down to the hardware. In such cases, it's much harder to say how fast it should be, and it can also be hard to figure out which parts are really the limiting factors.
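    For the "easily quantifiable" case, here is a sketch of how you'd do that comparison for an FFT: estimate the flops with the usual 5*N*log2(N) rule of thumb and divide by the measured time. It assumes NumPy is available, and the peak figure is a placeholder to be replaced with your CPU's real number.
    Code:
    # Achieved GFLOP/s of an FFT vs. an assumed peak for the machine.
    import time
    import numpy as np

    PEAK_GFLOPS = 100.0      # placeholder: plug in your CPU's actual peak
    N = 1 << 22              # 4M-point complex FFT

    x = (np.random.standard_normal(N)
         + 1j * np.random.standard_normal(N)).astype(np.complex128)

    np.fft.fft(x)            # warm-up run
    start = time.perf_counter()
    np.fft.fft(x)
    elapsed = time.perf_counter() - start

    flops = 5 * N * np.log2(N)
    achieved = flops / elapsed / 1e9
    print(f"{achieved:.1f} GFLOP/s ({100 * achieved / PEAK_GFLOPS:.0f}% of assumed peak)")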

    That's why things like io_uring tend to take a while to come around. Someone has to start beating on a particular performance area and decide there are big wins to be had by refactoring. In the case of io_uring, by removing one set of bottlenecks, Axboe was able to uncover other bottlenecks, and eventually tuned the entire stack. It should be noted that the other big enabler was SSDs. Back when we only had spinning hard disks, the hardware was the main bottleneck and not one you could do much about. The software only had to be fast enough not to get in its way, but that still allowed the fast path to get cluttered enough that it ended up bottlenecking fast NVMe storage.

