Targeted Intel oneAPI DPC++ Compiler Optimization Rules Out 2k+ SPEC CPU Submissions


  • Developer12
    replied
    Originally posted by Kabbone View Post

    He might be trolling, but I don't think you know what you are talking about. For engineering you need massive FP performance, and not all code is optimized or suited for a GPU.
    Even in very FP-heavy code you still encounter a lot of integer instructions, and improving those makes *everyone* faster. These days a modern CPU with only a modest number of FP units can do FP fast enough that it becomes a moot point for most software, even engineering stuff like MATLAB.

    The whole reason Intel brought in AVX512 is that they want to market their CPUs (having really no GPUs to speak of yet) as being suitable for AI workloads in some way, even if it's just inference. The alternative is to cede market share to AMD/Nvidia, as well as lose out on potential stock value gains by not having an answer to the AI boom.



  • Developer12
    replied
    Originally posted by sophisticles View Post

    You haven't ever taken a computer architecture class, have you?

    I take it Mr. "Developer12" (were "Developer1-11" already taken?) that you are unaware that modern x86 CPU's do not have floating point units.

    All of AMD's and Intel's current processors use the SIMD units for x87, aka floating point, math.

    You may also want to recalculate your "100% of the population uses integer" conclusion but in an ironic twist that only people with functioning brains could have seen coming, you need floating point math to do so.

    Thanks for playing.
    Right off to the rude comments, are you?

    You have no idea how modern processors implement floating point, and even less idea what CPU designers (like myself) are referring to when they say "integer." Go have a look at my previous comment for a brief list of example instruction classes. Damn near every instruction a CPU executes relies on the integer pipelines, and FP-heavy code is no exception. Even in the best case a significant number of instructions are not FP.

    Have a look at the block diagram for how a modern, superscalar x86 CPU works. Zen is a great example. You'll see the usual instruction ingest and decoding, followed by a large scheduling and reorder buffer that feeds into a series of integer, floating point, and sometimes branch pipelines. The instructions are issued to these units in parallel as they become available and as their dependencies become resolved. When instructions complete they pass out of the end of these parallel units and are retired, which means the results are committed back to memory or software-visible registers as needed.

    Yes, you can overdose on tons of parallel FPU pipeline units if you want really amazing FP performance (and extensions like AVX512 demand it), but you're not going to see much improvement on most workloads. Every single workload under the sun, meanwhile, has to pass through the integer pipelines at some point or another. You can't write code, even FP-heavy code, without loop indexing (which requires integer addition and comparison) and branches (which may or may not be handled by the integer pipelines, depending on the microarchitecture). You also can't go without various load/store instructions, since you need to actually have the numbers your FP is operating on. These in turn often rely on integer math for indexing/address calculation (especially on x86!).
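
    To make that concrete, here's a minimal, hypothetical C sketch (the function is made up for illustration); even in this trivially FP-centric kernel, most of the per-iteration work is integer:

        #include <stddef.h>

        /* Hypothetical kernel: the only FP operation is the multiply, yet every
         * iteration also needs integer work for loop control and addressing. */
        void scale(float *dst, const float *src, float k, size_t n)
        {
            for (size_t i = 0; i < n; i++) {    /* integer add, compare, branch */
                dst[i] = src[i] * k;            /* integer address math (base + 4*i),
                                                   a load and a store, one FP multiply */
            }
        }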

    There's also a pretty quick drop-off in additional performance as you keep adding FP pipelines, because there's only so much ILP (instruction-level parallelism) that you can extract. Extensions like AVX512 can help with this to some degree by making it easier to express big, wide FP operations (yay "SIMD"), but at the end of the day there's precious little in the way of FP workloads that can benefit from wider and wider FP instruction sets that wouldn't see MUCH better performance on a GPU. Ultimately, it makes more sense to run such heavy FP code on a hardware device (a GPU) specialized for that task. You'll see much wider throughput and much lower latency, without impacting everyone else including yourself.
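
    And for the "big, wide FP operations" point, a minimal sketch with AVX-512F intrinsics (assuming an AVX-512 capable CPU and compiler, and n a multiple of 16 to keep it short); note that the loop control and address arithmetic are still integer work:

        #include <immintrin.h>
        #include <stddef.h>

        /* Minimal AVX-512F sketch: each _mm512_mul_ps does 16 FP multiplies.
         * A real kernel would also need a masked or scalar tail loop when
         * n is not a multiple of 16. */
        void scale_avx512(float *dst, const float *src, float k, size_t n)
        {
            __m512 vk = _mm512_set1_ps(k);                /* broadcast k to 16 lanes */
            for (size_t i = 0; i < n; i += 16) {          /* integer loop math       */
                __m512 v = _mm512_loadu_ps(src + i);      /* load 16 floats          */
                _mm512_storeu_ps(dst + i, _mm512_mul_ps(v, vk));
            }
        }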

    Do some research and then maybe you can find valid criticisms. I spent a hell of a lot of years studying and implementing this stuff when I got my (multiple!) degrees in this area.
    Last edited by Developer12; 10 February 2024, 04:22 PM.



  • sophisticles
    replied
    Originally posted by osw89 View Post

    Leave talking about hardware to actual EEs, since you obviously don't even know the basics. All modern CPUs have SIMD-capable integer and FP units, not magic SIMD units that do math and somehow replace the FPU. You implement SIMD by having multiple instances of data-processing blocks like adders and multipliers in your execution units. You see those blocks labeled "floating point"? Those make up the FPU that you claim doesn't exist anymore, and it doesn't matter whether it's a separate chip or on the same die, it's still an FPU. There are multiple MUL/ADD/ALUs to enable SIMD. SIMD is a feature of FP/integer units; calling an FPU an SIMD unit is like calling a GPU an h264 unit since it can decode and encode h264.
    [Attached image: AMD-Zen-2-vs-Zen-3.png]
    So you're an electrical engineer?

    May want to take a refresher course:

    https://en.wikipedia.org/wiki/Floati...urrent%20architectures%2C%20the,newer%20Intel%20and%20AMD%20processors.

    In some current architectures, the FPU functionality is combined with SIMD units to perform SIMD computation; an example of this is the augmentation of the x87 instructions set with SSE instruction set in the x86-64 architecture used in newer Intel and AMD processors.




    Intel has actually created three separate generations of floating-point hardware for the x86. They started with a rather hideous stack-oriented FPU modeled after a pocket calculator in the early 1980's, started over again with a register-based version called SSE in the 1990's, and have just recently created a three-input extension of SSE called AVX.
    I know for a fact that Intel, since the introduction of the Pentium III, has combined the FPU with the SSE unit, which is one of the reasons why Intel used to tell developers that floating point operations were deprecated and SIMD should be used whenever possible.
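
    This is easy to check for yourself, by the way: compile a trivial scalar FP function for x86-64 and look at the generated assembly. In my experience you'll see scalar SSE instructions (mulss/addss) rather than x87 ones, since x86-64 compilers route scalar float/double math through the SSE registers by default. A minimal sketch (the file and function names are just placeholders):

        /* mad.c -- placeholder example. On x86-64, "gcc -O2 -S mad.c" typically
         * emits mulss/addss (scalar SSE); getting x87 code takes an explicit
         * option such as gcc's -mfpmath=387. */
        float mad(float a, float b, float c)
        {
            return a * b + c;
        }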

    I could have sworn that AMD had followed suit decades ago but maybe they didn't and maybe that's why many people used to avoid AMD processors for scientific workloads.

    It could also be why there were applications in the late 90's that would only run on Intel processors.



  • intelfx
    replied
    Originally posted by pong View Post
    I would not be surprised to see NVIDIA pull an Apple and just stick some general ARM/RISCV cores or user-equivalent execution capability into their GPU dies in a generation or two and just call that a GPGPU computer, forget x86, forget traditional motherboard form factors and the GPU being a "peripheral" to some lame CPU.
    They did just that; see the recent Phoronix article on NVIDIA's GH200 (Grace Hopper).



  • s_j_newbury
    replied
    Originally posted by acobar View Post

    Let's put a bit of context to it: Linus said it at a time when AVX512 frequently caused the CPU to overheat and so triggered thermal throttling, lowering system performance by a big chunk. Intel's implementation was somewhat flawed, but it seems to be better in the new, and quite expensive, special parts now.

    Linus' criticisms were merited then, but things change as time goes by.
    Other, more generally useful improvements could potentially have been made instead of optimizing the design for AVX512. There's no free lunch. There is a generally held view amongst many "programmers" that FP is how maths is done, which is wrong. FP is how maths is approximated.
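
    The classic minimal illustration of that approximation, in C:

        #include <stdio.h>

        /* 0.1 and 0.2 have no exact binary floating-point representation,
         * so their sum only approximates 0.3. */
        int main(void)
        {
            double a = 0.1, b = 0.2;
            printf("%.17g\n", a + b);        /* prints 0.30000000000000004 */
            printf("%d\n", a + b == 0.3);    /* prints 0 (not equal)       */
            return 0;
        }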



  • acobar
    replied
    Originally posted by sophisticles View Post

    I think it's a shame that a guy with a Master's degree in Computer Science, that leads development of a project with such a wide reach and makes millions a year, thinks that floating point is such a special use case that no one cares about.

    I guess mathematicians, scientists, analysts, economists and business people don't count.
    Let's put a bit of context to it: Linus said it at a time when AVX512 frequently caused the CPU to overheat and so triggered thermal throttling, lowering system performance by a big chunk. Intel's implementation was somewhat flawed, but it seems to be better in the new, and quite expensive, special parts now.

    Linus' criticisms were merited then, but things change as time goes by.



  • osw89
    replied
    Originally posted by sophisticles View Post

    All of AMD's and Intel's current processors use the SIMD units for x87, aka floating point, math.
    Leave talking about hardware to actual EEs, since you obviously don't even know the basics. All modern CPUs have SIMD-capable integer and FP units, not magic SIMD units that do math and somehow replace the FPU. You implement SIMD by having multiple instances of data-processing blocks like adders and multipliers in your execution units. You see those blocks labeled "floating point"? Those make up the FPU that you claim doesn't exist anymore, and it doesn't matter whether it's a separate chip or on the same die, it's still an FPU. There are multiple MUL/ADD/ALUs to enable SIMD. SIMD is a feature of FP/integer units; calling an FPU an SIMD unit is like calling a GPU an h264 unit since it can decode and encode h264.
    [Attached image: AMD-Zen-2-vs-Zen-3.png]



  • lowflyer
    replied
    Just another affirmation to *never ever buy Intel again*. It's not the first time that Intel has been caught cheating.



  • pong
    replied
    He's got some (general / abstract) points, but I found some of the ad hoc assertions cringe-worthy, as you apparently did as well.

    It may be fair to say that we've reached a point where some things in computing architecture are better specialized than not. Within a domain of specialization, you strive to create the most generally useful implementation until the trade-offs get too painful and something is too bottlenecked to serve multiple distinct use cases well, at which point you need to bifurcate the implementation again and sub-specialize into area-optimal implementations for each use case.

    Early x86 had specializations for CPU vs FPU as separate chips, and the FPU did provide a big speed boost for its intended use cases.

    Then we've had "DSP" processors good at MAC and small matrix / vector / FIR / FFT type stuff and eventually those bifurcated into FP DSP and integer DSP.

    Then we got fixed-function GPUs, which worked OK for rendering 1990s 3D but were too limited and inflexible, particularly given the increasing costs and varied use cases, so they morphed into programmable GPUs, which are basically fast, wide SIMD machines.

    Now people (ab)use GPUs for HPC because they're programmable, highly parallel, and have 10x the RAM BW of motherboards. Except at the consumer level GPUs suck, because they're overpriced "toys" that aren't integrated into the overall computer architecture, so in a way they're the sort of thing Linus laments -- good for special-purpose FP / integer M/V/tensor / SIMD stuff but totally "special case", bringing no integration to the common computer architecture and hanging off slow, crappy PCIe slots, etc.

    Now we're getting NPUs, which are in a way bifurcating from GPUs: still doing fast / parallel stuff, but not always mapping ideally to GPUs, because GPUs aren't optimized for those NN architectures, data types, etc.

    And meanwhile, ~1990s-to-now CPUs basically continue to "suck" at architecture. They've scaled pretty well on single-thread performance, they've gotten a very modest degree of parallelism (you see 8-32 core boxes often enough), and attached RAM capacity / BW has modestly increased (though it's still totally sucky on consumer platforms).

    But compared to GPUs/NPUs, CPUs suck at vector operations, FP and integer. CPUs suck at NN/ML. CPUs suck at RAM BW: a 12-DIMM server ($10k*N) has something like 700 GBy/s of RAM BW, while for under $1k you get a GPU with over 1 TBy/s of VRAM BW that it can actually come close to saturating routinely on real-world, memory-BW-heavy streaming dataflow calculations.

    In contrast, you've finally gotten SOME architectural deviations from the "hasn't improved much since the 1990s" consumer CPU / motherboard level, like the Apple M-series high-end parts with "unified memory" that can achieve something like 400 GBy/s of RAM-to-"processing" throughput, where that "processing" is an SOC-level aggregation of NPU/CPU/GPU/DSP-like functional blocks all sitting on a fast, wide memory bus, at a (bit) less eye-watering cost / size than getting 400 GBy/s of RAM BW on x86.

    Meanwhile, x86 CPUs have variously been sucking at performance / power, so ARM et al. try to establish a TCO, MIPS-per-watt, and density-efficiency advantage for multi-core and similar workloads.

    GPUs/NPUs, being streaming / dataflow "DSP"-like things, can basically just run full-out and will suck as much power & BW as they're designed to handle, at close to peak, indefinitely, in a tight loop of processing.

    So it's sort of double-speak to say "I want scalar general-purpose CPUs to be better, damn the special vector instructions!" yet apparently accept GPUs as "good", without acknowledging that one has GHz limits and power limits, that consumer x86 is severely memory-BW starved (compared to GPUs), and that the only reasons "GPUs" are good are that they're SIMD, deal with multiple use-case-optimized data types from FP32 down to i8, and have HUGE RAM BW and efficient streaming-memory dataflow designs for thousands of thread groups processing vectors / matrices of strided / contiguous RAM blocks.

    He's right that Intel / AMD have done stupid stuff "for the benchmarks", as NVIDIA et al. probably do in their own ways, and it's good to call them out for that.

    OTOH there's nothing wrong with HPC / vector / NPUs and satisfying their need for FLOPS / IPS / BW and special instructions / architectures, etc.
    Is AVX512 good for useful stuff? Well, I guess. But overall it seems "too little, too late": if Intel/AMD had wanted to scale the fundamental PC/CPU/RAM/MB ARCHITECTURE, they could have followed GPU-like paths while keeping CISC/RISC general-purpose compute functionality along for the ride, and we'd have massive RAM BW, ECC, MMU, IOMMU, virtualization, efficient matrix / vector / streaming computing, massive SIMD, etc. But we don't, and in a lot of workloads the CPUs sit nearly idle (being relatively useless) while the GPUs run at 100%.
    It's a sad time for Intel/AMD when newcomers like ARM/RISC-V/GPUs/NPUs threaten to eat their lunches, but here we've been for a decade+.

    Until NPUs specialize entirely into their own thing for training / inference, the best we've got are GPUs (which sucks), and CPUs are kind of sad, left-behind things. I would not be surprised to see NVIDIA pull an Apple and just stick some general ARM/RISCV cores or user-equivalent execution capability into their GPU dies in a generation or two and just call that a GPGPU computer, forget x86, forget traditional motherboard form factors and the GPU being a "peripheral" to some lame CPU.

    Wake me up when I can get 1 TBy/s of ECCed RAM BW to 256 GBy of RAM on a platform AMD/Intel can sell me for around the price of a 4090, with SIMD FP / integer capability that roughly matches a 4090; then I'll say the processing platform has evolved in a general-purpose, useful way for OS/application/GPU/DSP/NPU functions. Until then, the "specialized" HW is still what's most relevant, and for mere "OS" tasks an ordinary ~16-core CPU is pretty good until it falls down at things it can't even fractionally compete at (NPU, GPU, ...). For now I have more hope in NVIDIA/Apple than in Intel/AMD.



    Originally posted by sophisticles View Post

    I think it's a shame that a guy with a Master's degree in Computer Science, that leads development of a project with such a wide reach and makes millions a year, thinks that floating point is such a special use case that no one cares about.

    I guess mathematicians, scientists, analysts, economists and business people don't count.



  • Kabbone
    replied
    Originally posted by Developer12 View Post

    Do you know what 100% of the population uses? Integer performance. Your standard adds, subtracts, stores, shifts, loads, branches, jumps, calls, etc. Integer perf is what determines how fast the kernel runs, and it's what determines how fast every application runs.

    Most of your examples are someone using a spreadsheet. This isn't the 90's anymore, when 100 MHz was king. Even if you drop FP hardware entirely you can emulate it in software on a modern CPU using integer operations far faster than a human can perceive. The bottleneck for these jobs is how fast human meat fingers can enter numbers. They're not going to notice the difference if you trim a few hundred thousand transistors and have two FP units in each core instead of four.

    And for those people *actually* relying on floating point performance? I don't know if you noticed but for the last 10 years they've been doing it on a GPU. 3D rendering? GPU. AI? GPUs. Weather modelling? GPUs. Exhaustive mathematical proofs? GPU. Nuclear weapons simulation? GPUs. Guess what: job submission to the GPU is 100% determined by integer performance.

    It's worth noting for historical context that in the era to which Linus is referring (the late 90's), floating point was a big deal for one single reason: rendering 3D video games. 0% of games still do that on the CPU. No games are even still capable of it.
    He might be trolling, but I don't think you know what you are talking about. For engineering you need massive FP performance, and not all code is optimized or suited for a GPU.
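
    (As an aside on the quoted point about emulating FP in software with integer operations: that's roughly what soft-float runtime libraries do. Below is a minimal sketch of a soft-float multiply for normal IEEE-754 singles; it ignores zero/inf/NaN/subnormals and rounding, which a real library such as libgcc's or compiler-rt's soft-float routines must handle.)

        #include <stdint.h>
        #include <stdio.h>
        #include <string.h>

        /* Soft-float multiply for *normal* IEEE-754 singles using only integer
         * operations: split sign/exponent/mantissa, do an integer multiply on
         * the mantissas, renormalize, and repack. No special-case handling. */
        static float softfloat_mul(float a, float b)
        {
            uint32_t ua, ub;
            memcpy(&ua, &a, sizeof ua);
            memcpy(&ub, &b, sizeof ub);

            uint32_t sign = (ua ^ ub) & 0x80000000u;            /* sign: XOR            */
            int32_t  exp  = (int32_t)((ua >> 23) & 0xFF)        /* add biased exponents */
                          + (int32_t)((ub >> 23) & 0xFF) - 127; /* and re-bias          */
            uint64_t ma   = (ua & 0x7FFFFFu) | 0x800000u;       /* implicit leading 1   */
            uint64_t mb   = (ub & 0x7FFFFFu) | 0x800000u;

            uint64_t prod = (ma * mb) >> 23;                    /* 24x24-bit multiply   */
            if (prod & 0x1000000u) {                            /* renormalize if >= 2  */
                prod >>= 1;
                exp  += 1;
            }

            uint32_t bits = sign | ((uint32_t)exp << 23) | ((uint32_t)prod & 0x7FFFFFu);
            float out;
            memcpy(&out, &bits, sizeof out);
            return out;
        }

        int main(void)
        {
            printf("%f vs hardware %f\n", softfloat_mul(3.5f, 2.25f), 3.5f * 2.25f);
            return 0;
        }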

