Core i9 7900X vs. Threadripper 1950X On Ubuntu 17.10, Antergos, Clear Linux


  • chithanh
    replied
    With ffmpeg this is not surprising, since ffmpeg employs runtime CPU detection to automatically switch code paths. With -O3 and potentially other options you may see bigger differences.
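    For illustration, a minimal sketch of this kind of runtime dispatch, using GCC's __builtin_cpu_supports (the function names are hypothetical; this is not ffmpeg's actual code):

    ```c
    /* Minimal sketch of runtime CPU dispatch in the spirit of what
     * ffmpeg does at init time. Function names are hypothetical;
     * requires GCC or Clang on x86. */
    #include <stdio.h>

    static void scale_scalar(float *v, float s, int n) {
        for (int i = 0; i < n; i++)
            v[i] *= s;
    }

    /* In real code this would be a hand-written AVX2 kernel;
     * here the same loop stands in for one. */
    static void scale_avx2(float *v, float s, int n) {
        for (int i = 0; i < n; i++)
            v[i] *= s;
    }

    typedef void (*scale_fn)(float *, float, int);

    /* Pick the fastest implementation the CPU can run, once, at startup. */
    static scale_fn pick_scale(void) {
        __builtin_cpu_init();
        if (__builtin_cpu_supports("avx2"))
            return scale_avx2;
        return scale_scalar;
    }

    int main(void) {
        float v[4] = {1, 2, 3, 4};
        scale_fn scale = pick_scale();
        scale(v, 2.0f, 4);
        printf("%g %g %g %g\n", v[0], v[1], v[2], v[3]);
        return 0;
    }
    ```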

  • geearf
    replied
    Originally posted by arjan_intel

    It's mostly the math-heavy things (libm from glibc, the BLAS library of your choice, etc.) where there is a real split in performance between generations. The AVX2+FMA line is a split where performance fundamentally changes.
    (On, say, Skylake CPUs like this Core i9, a floating-point add takes 4 cycles, a floating-point mul takes 4 cycles, but an FMA (multiply-and-add) ALSO takes 4 cycles. This means that code that does lots of multiplies and adds on FP can get a significant benefit.)
    So I've just done a quick test, but it wasn't that exciting:
    I rebuilt yasm, glibc (that took a lot longer than I expected), ffmpeg and x264 with -march=native (Haswell) and -O2, and I gained about 20 seconds encoding a 20-minute-long video. 20 seconds doesn't sound too bad, but that was actually less than 3% of the total time...
    Maybe with -O3 or other flags it would be a bigger percentage, as in Michael's test.
    Last edited by geearf; 13 September 2017, 11:55 AM.
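    One quick way to check whether a rebuild like this actually emits FMA instructions is to compile a trivial multiply-add to assembly and grep for it (a sketch; the file name is made up):

    ```c
    /* fma_check.c -- compile to assembly and look for fused
     * multiply-adds:
     *
     *   gcc -O2 -march=native -S fma_check.c
     *   grep vfmadd fma_check.s
     *
     * On a Haswell-or-newer CPU, GCC will typically contract the
     * expression below into a single vfmadd instruction. */
    float muladd(float a, float b, float c) {
        return a * b + c;
    }
    ```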

  • chithanh
    replied
    Originally posted by arjan_intel
    This is not actually correct. FMA is still allowed even at -O2;
    I don't know what "allowed even at -O2" is supposed to mean. It happens at -O2 if you enable -mfma explicitly, or indirectly through -march=...
    However, C99 requires that contractions only happen when there is a corresponding source-level expression. GCC ignores this and will contract even where it is not allowed. Some references:
    https://gcc.gnu.org/c99status.html
    https://gcc.gnu.org/bugzilla/show_bug.cgi?id=37845#c5

    Originally posted by arjan_intel
    the result is not less accurate than without FMA... it's a little more accurate.
    The result is changed and becomes unpredictable.
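    A small example of that unpredictability (a sketch; whether the compiler contracts here depends on compiler version and flags):

    ```c
    /* contract.c -- how FMA contraction changes a result.
     * Algebraically r is 0, and without contraction it is exactly 0
     * (both products round the same way). If the compiler contracts
     * one multiply into an FMA, r becomes the rounding error of a*a:
     * small, but nonzero.
     *
     *   gcc -O2 -mfma contract.c                    (may print nonzero)
     *   gcc -O2 -mfma -ffp-contract=off contract.c  (prints 0)
     */
    #include <stdio.h>

    int main(void) {
        volatile double a = 1.0000001;  /* volatile defeats constant folding */
        double r = a * a - a * a;
        printf("%.20g\n", r);
        return 0;
    }
    ```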

  • torsionbar28
    replied
    Originally posted by tiwake
    (have yet to run gentoo... I'm not man enough.. maybe when I get a threadripper desktop sometime next year or something)
    Tried it for a while myself, but I found it cumbersome in practice. It's not unusual to experience breakage due to unenforced dependencies, etc., and then you're left perusing the forums for a solution. Meanwhile your machine is totally borked because the systemd you just installed is incompatible with the glibc you built yesterday. Ain't nobody got time fo' dat.

    But at least machines these days are fast enough that compile times are not a major issue. I was running Gentoo on a 667 MHz Alpha 21264 for a while, and sometimes it took *days* to compile all the updates, and by the time it was finished, there was a stack of new updates that needed compiling. The machine spent more time compiling its own updates than doing any productive work, lol.

  • sdack
    replied
    Originally posted by jrch2k8
    This has nothing to do with hardware support, since x86_64 defaults to SSE2-or-better class hardware at a bare minimum, and the kernel can switch implementations of algorithms at runtime when needed (if your CPU supports AVX2, the kernel will load the modules that use SIMD optimized for AVX2; kernels are smarter than user-space code).

    The main difference is the compilation process and how much runtime debug apparatus the kernel has enabled (frame pointers, kprobes, etc.). I think Fedora enables a lot of debug features in its default kernel, but I'm not sure.

    The cases where you see those huge spikes between distros are mostly due to loops that are too complex for a given compiler version to optimize, especially at default settings. This means that if a newer compiler version with certain extra flags finds a way to break up those loops and optimize them (vectorize, parallelize, etc.), the gain will be huge compared to the unoptimized version, which is basically a worst-case result.
    Again, these are explanations of why it is the way it is, but not why it should stay this way.
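    A minimal sketch of the "too complex loops" point from the quote above: the first loop below has independent iterations and GCC can vectorize it; the second carries a dependency from one iteration to the next, so it stays scalar:

    ```c
    /* loops.c -- see what vectorizes with:
     *   gcc -O3 -march=native -fopt-info-vec -c loops.c */

    /* Independent iterations: the vectorizer can process several
     * elements per instruction. */
    void saxpy(float *y, const float *x, float a, int n) {
        for (int i = 0; i < n; i++)
            y[i] = a * x[i] + y[i];
    }

    /* Loop-carried dependency: each iteration needs the previous
     * value of p, so this loop runs scalar whatever the flags. */
    float chain(const float *x, int n) {
        float p = 1.0f;
        for (int i = 0; i < n; i++)
            p = p * x[i] + 1.0f;
        return p;
    }
    ```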

  • geearf
    replied
    Originally posted by arjan_intel

    It's mostly the math-heavy things (libm from glibc, the BLAS library of your choice, etc.) where there is a real split in performance between generations. The AVX2+FMA line is a split where performance fundamentally changes.
    (On, say, Skylake CPUs like this Core i9, a floating-point add takes 4 cycles, a floating-point mul takes 4 cycles, but an FMA (multiply-and-add) ALSO takes 4 cycles. This means that code that does lots of multiplies and adds on FP can get a significant benefit.)
    Interesting, thank you!

  • arjan_intel
    replied
    Originally posted by chithanh
    Given that Debian is very stability oriented, I don't think they can follow all of those.

    Using contractions like FMA will potentially affect the result of computations; therefore they are allowed by IEEE 754 and ISO C/C++ only within very strict limits. If you turn them on in GCC (via -ffp-contract=fast), the results of floating-point math effectively become implementation-defined.

    In order to be effective, autovectorization often involves algebraic optimizations (reordering of operations), which runs afoul of language standards.
    This is not actually correct. FMA is still allowed even at -O2; the result is not less accurate than without FMA... it's a little more accurate.
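    A small illustration of the accuracy point (a sketch; fma() here is the C99 math-library function, which rounds once):

    ```c
    /* fma_acc.c -- why an FMA can be slightly *more* accurate:
     * fma(a, b, c) computes a*b + c with one rounding at the end,
     * while the two-step version rounds the product first.
     *
     * Build without FMA codegen so the two-step expression is not
     * itself contracted:  gcc -O2 fma_acc.c -lm                   */
    #include <stdio.h>
    #include <math.h>

    int main(void) {
        double a = 1.0 + 0x1p-27;           /* a*a needs more bits than a double holds */
        double two_step = a * a - 1.0;      /* product rounded, then exact subtract */
        double one_step = fma(a, a, -1.0);  /* single rounding keeps the 2^-54 tail */
        printf("two-step: %.20g\n", two_step);
        printf("fma:      %.20g\n", one_step);
        return 0;
    }
    ```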

  • chithanh
    replied
    Given that Debian is very stability oriented, I don't think they can follow all of those.

    Using contractions like FMA will potentially affect the result of computations; therefore they are allowed by IEEE 754 and ISO C/C++ only within very strict limits. If you turn them on in GCC (via -ffp-contract=fast), the results of floating-point math effectively become implementation-defined.

    In order to be effective, autovectorization often involves algebraic optimizations (reordering of operations), which runs afoul of language standards.

  • arjan_intel
    replied
    Originally posted by jrch2k8

    All right, I'll try it later once I have all the questions.
    We're also usually on freenode IRC (mostly US timezones, though) in #clearlinux.

  • jrch2k8
    replied
    Originally posted by arjan_intel

    The mailing list is the preferred place; no need to be shy, we're all friendly peeps.
    All right, I'll try it later once I have all the questions.
