L1 Terminal Fault - The Latest Speculative Execution Side Channel Attack


  • oiaohm
    replied
    Originally posted by Weasel View Post
    Bandwidth or throughput is stuff that you can do at the same time as long as it doesn't depend on the results of previous stuff. Like, you know, a memcpy, where it reads 64 bytes, writes 64 bytes, none of which depend on each other. Then the next 64 bytes are read, next 64 bytes are written. Again none of these depend on each other, because they're unaliased memory addresses. What you wrote is never read when you do the memcpy. There's absolutely NO fucking reason to wait 4 clock cycles. (that's what memmove is for, if aliasing is a possibility)
It's not 1 clock cycle; it's at least 2. First you have to confirm that the 64 bytes you want to read from the L1 cache are in fact in the L1 cache. If not, they have to be transferred from the L2 cache, and if they're not in L2, from the L3 cache.

    You are not getting what causes the saturation. Caches are only a limited size and can only store so much.

Speculative execution has the downside of needing to store more in the cache, which increases your cache miss rate. Think about your memcpy: you have two different memcpy calls on the two sides of a branch. One memcpy is in a debug path, so it is almost never called; if speculative execution goes down that path 30 percent of the time, it is polluting your cache with data the cache does not need.

Please note you focused on the 4-cycles bit. It does not change the fact that unless speculative execution gets a 70% branch hit rate on a 14-stage pipeline, it is slower than a 5-stage pipeline that simply waits for the branch to be resolved, assuming the pipelines contain exactly the same optimisations other than speculative execution. Stalling does not pollute the cache and does not cause extra cache misses.

You have two causes of stalls: cache misses and branch pauses. Speculative execution attempts to deal with branch pauses while creating more cache misses and being wasteful.

Barrel processors deal with branch pauses and add tolerance to cache misses without increasing the cache miss rate, for the price that you must multi-thread at the CPU level.

In short-pipeline CPUs (under 5 stages), the increased cache miss rate caused by speculative execution causes larger performance problems than pausing at branches, because you have more cache miss delays and those delays consume more time than waiting at the branch would.

Basically, speculative execution is not much of a solution. It kind of works in CPUs with long pipelines and large caches because the delay in resolving a branch on a long pipeline is so long, though past a 14-stage pipeline it is too long even for speculative execution to help. Also, L1 cache has a maximum electrical size, so a CPU with a 5-stage pipeline has the same size L1 as a CPU with a 14-stage pipeline. You cannot in fact expand L1 to deal with speculative execution increasing the cache miss rate; instead you have to add things like L2, adding more processing steps to request information from memory and so making your memory stalls worse.

Out-of-order execution is where you accelerate processing of the code a branch depends on and delay code that is not needed to resolve the branch, using it to fill in the branch delay. Since you can do out-of-order using only instructions you know you have to execute, this does not cause increased cache misses. Out-of-order like this can be used to try to cover cache misses as well.

The reality is there is a very big question whether speculative execution really works, or whether we have just used speculative execution to cover up bad chip design.

Your transfers from L3 to L2 run at about 1/3 of the speed of L2 to L1, and that is about 1/3 of the speed from L1 to the core. Of course you can run out of bandwidth. One request for something that is not in L2 but is in L3 takes longer for the CPU to get than a CPU with a 14-stage pipeline takes to resolve a branch. One request for something not in L1 but in L2, on a system that has an L2 if not an L3, takes longer than a CPU with a 5-stage pipeline takes to resolve a branch. A short pipeline makes the cache misses caused by speculative execution intolerable: you start having longer delays from cache misses than from branch resolution, so speculative execution gains you absolutely nothing.

Speculative execution looks fine until you wake up to how expensive a cache miss is, how slow the cache bandwidth to the CPU is, and the fact that L1 is basically a fixed size for everyone.

As more of our applications become multi-threaded, there comes a point where there may not be enough single-threaded workloads to bother with out-of-order or speculative execution; instead, just use barrel-designed processors that love multi-threading.

    The general workload is changing.



  • Weasel
    replied
    Originally posted by oiaohm View Post
Exactly: at some point you will have to wait anyhow. If your branch waits are short enough, they can line up with your memory waits and so effectively become non-existent.

I am not talking about main memory bandwidth. I am talking about the fact that you saturate the bandwidth between your CPU core and memory. Once that is saturated you are going to stall anyhow; it is the limited core-to-memory bandwidth that brings the idea of speculative execution undone.
    Yes, which is no worse than waiting on the branch itself (for single-threaded performance). And you can't saturate it with only 1 core. And the stuff "between the cpu core and memory" is the main memory, you know? I mean the main memory only gets requests from the caches or the core (unless you bypass that).

    Originally posted by oiaohm View Post
The core L1 bandwidth is 64 bytes per 4 clock cycles.
    Ok dude, I'm done, you're just a parrot, you don't read, I even bold + capitalize stuff for emphasis in the hope that you'll actually manage to read it, and you still go on with the same nonsense bullshit... really lost cause.

Here's what the 4 clock cycle latency is: if you write to the L1 cache, you must wait 4 clock cycles before that same memory (not cache line, but only the memory you wrote) is available for reading. This is store-forwarding latency. You can fucking write a LOT more in that time or read OTHER, UNRELATED stuff, even from the same cache line, without waiting for 4 clock cycles; this is why throughput is MUCH higher than latency.

    Bandwidth or throughput is stuff that you can do at the same time as long as it doesn't depend on the results of previous stuff. Like, you know, a memcpy, where it reads 64 bytes, writes 64 bytes, none of which depend on each other. Then the next 64 bytes are read, next 64 bytes are written. Again none of these depend on each other, because they're unaliased memory addresses. What you wrote is never read when you do the memcpy. There's absolutely NO fucking reason to wait 4 clock cycles. (that's what memmove is for, if aliasing is a possibility)

In this case, say you write 64 bytes to cache line 1, then you write 64 bytes to cache line 2. These are done in 2 clock cycles total, each done in 1 clock cycle. You don't fucking wait for the RESULT (latency) of the first write before you can write the second. ffs.

    Even writing a SINGLE BYTE to a cache line and reading that SINGLE BYTE will incur a 4 clock cycle latency. Of course, other writes will run at the same time in parallel, since throughput is much higher. Throughput is 64 bytes per clock so...
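To make the distinction concrete, here is a minimal C sketch of the two access patterns being argued about. It assumes nothing beyond standard C, and the cycle figures in the comments are the Haswell-era numbers quoted in this thread, not guarantees:
Code:
#include <stdint.h>
#include <string.h>

/* memcpy-style copy: every 64-byte read and write is independent of
   the previous ones, so the core can issue them back to back at full
   throughput; nothing ever waits on an earlier store's result. */
void copy_independent(uint8_t *dst, const uint8_t *src, size_t n)
{
    memcpy(dst, src, n);    /* unaliased buffers: pure throughput */
}

/* Store followed by a load of the SAME location: the load must wait
   for store-forwarding, the ~4 clock cycle LATENCY figure. volatile
   keeps the optimiser from folding the load away. */
uint8_t store_then_load(volatile uint8_t *p)
{
    *p = 42;
    return *p;
}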
    Last edited by Weasel; 24 August 2018, 07:57 AM.



  • oiaohm
    replied
    Originally posted by Weasel View Post
    Latency is another problem, at some point you will have to wait. But that's no different than waiting on a branch, Lol.
Exactly: at some point you will have to wait anyhow. If your branch waits are short enough, they can line up with your memory waits and so effectively become non-existent.

I am not talking about main memory bandwidth. I am talking about the fact that you saturate the bandwidth between your CPU core and memory. Once that is saturated you are going to stall anyhow; it is the limited core-to-memory bandwidth that brings the idea of speculative execution undone.

The core L1 bandwidth is 64 bytes per 4 clock cycles. If your pipeline is short (5 stages), your worst case to resolve a branch is 9 clock cycles. That is 9 clock cycles to process 2 instructions at the very worst. With correctly ordered code, since this is 5 instruction groups, the typical delay can be brought back to 4 clock cycles; compilers can achieve this most of the time. So you skip at most one memory operation, and if a memory operation has had to go out to L2/L3, it takes longer than the branch resolution even in the worst case. So memory bandwidth to the caches is the big stall problem in short-pipeline CPUs, not branch processing.

Now let's look at your 14-stage pipeline. Your worst case to resolve a branch is 27 clock cycles, and 27 clock cycles to process 2 instructions is downright horrible. At 14 instruction groups, given how close together branches can be, there is no real way to have the compiler reorder code to halve this most of the time. Half is still 13 clock cycles, so you lose 3 memory operations. That is a lot harder to hide even when you hit the cache directly. In long-pipeline CPUs, branch processing gets worse than reading memory from L1, and the longer the pipeline, the worse this becomes.

Let's say we add speculative execution to deal with this problem and keep the 14-stage pipeline. If you get 70/30, i.e. 70% hit and 30% miss, this brings the 14-stage pipe back to losing 1 memory operation per branch, the same as you would normally lose in a 5-stage-pipeline CPU with well-optimised code and no branch prediction at all. When you get the reverse, 30/70, i.e. 30% hit and 70% miss, you are losing 2 memory operations per branch, and if you only get 50/50 you are losing 1.5, again worse than the 5-stage CPU. With a 14-stage pipeline you only match a 5-stage CPU if you are lucky. But that is not the end of the story.

When you fill in those lost memory operations with speculative actions, you can trigger ripples down through your cache, requesting stuff that a non-speculative processor would never use. Those requests consume your bandwidth between L3 and L2 and between L2 and L1, making the requests for the memory you actually need to proceed more congested. The speculation is also filling your cache with data and code you would not need if you were not speculating.

So a 14-stage pipeline is truly the upper limit; you really want a pipeline of 9, so that 50/50 with speculative execution equals a 5-stage pipeline. Even so, speculative execution costs quite a bit. Think about an item you need being pushed out of your L3 cache by a speculative action: pulling it back in from main memory has quite a cost, and your bandwidth from main memory to the CPU core is quite limited.
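A back-of-the-envelope sketch of the arithmetic in the last few paragraphs, in C. This is one framing of it, not a measurement: it assumes a mispredict costs roughly a full pipeline flush, a correct prediction costs nothing, and a non-speculating pipeline always waits out its depth:
Code:
#include <stdio.h>

/* Expected stall cycles per branch = miss rate x flush cost,
   with the flush cost taken as roughly the pipeline depth. */
static double expected_stall(double hit_rate, int depth)
{
    return (1.0 - hit_rate) * depth;
}

int main(void)
{
    printf("14-stage, 70%% hit: %.1f cycles/branch\n", expected_stall(0.70, 14)); /* 4.2 */
    printf("14-stage, 50%% hit: %.1f cycles/branch\n", expected_stall(0.50, 14)); /* 7.0 */
    printf("14-stage, 30%% hit: %.1f cycles/branch\n", expected_stall(0.30, 14)); /* 9.8 */
    printf(" 5-stage, always wait: %.1f cycles/branch\n", expected_stall(0.00, 5)); /* 5.0 */
    return 0;
}
On those assumptions a 14-stage pipeline needs roughly a 65% hit rate just to break even with a 5-stage pipeline that always waits, which is the shape of the claim being made here.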

When you look at a barrel processor, things are different.
When a 5-stage barrel processor hits the point where it has to wait for a branch to resolve, and there is another thread available, it can fill those 4-7 otherwise unused cycles with that thread. You are not stressing your limited cache bandwidth with any requested data you are not going to use.

A 5-stage barrel processor is happy with 8 threads; it does not like 1 thread. With 8 threads, every time a branch comes up it is effectively resolved instantly. The only stall for a 5-stage barrel processor with 8 threads is running out of memory bandwidth through the caches. Those 8 slots do not have to be real threads: 8 non-interdependent instruction streams pulled from single-threaded code could fill the barrel processor's 8 slots. This is where out-of-order execution boosts a barrel processor's single-thread performance a lot, by converting a single thread into more threads. Instead of speculation, with a barrel processor you focus on out-of-order execution, because if you can make a single thread out-of-order enough, it becomes enough threads to keep the barrel design happy.
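A toy sketch of the barrel idea, with all names made up for the illustration (real designs differ): each cycle the issue stage rotates to the next ready thread context, so a context waiting on a branch or a cache miss simply loses its turn instead of stalling the core.
Code:
#define NTHREADS 8

/* One hardware thread context (hypothetical, minimal). */
struct hw_thread {
    int blocked_until;   /* cycle at which this context is ready again */
    int pc;              /* its program counter */
};

/* Round-robin pick of the context to issue from this cycle;
   returns -1 only when every context is blocked. */
static int pick_context(const struct hw_thread t[NTHREADS], int cycle, int last)
{
    for (int i = 1; i <= NTHREADS; i++) {
        int c = (last + i) % NTHREADS;
        if (t[c].blocked_until <= cycle)
            return c;          /* ready: this context issues */
    }
    return -1;                 /* all waiting: only now does the core stall */
}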

A 5-stage barrel processor can have more than 8 threads, so when one thread has to wait on a long operation like a slow divide, it can just let that happen. Basically, a barrel processor can be a dynamically delay-tolerant processor, again processing as much as possible.

A barrel processor without an out-of-order system will not keep up with a non-barrel processor on a single-threaded workload, even if that processor does not have speculative execution. But a barrel processor without out-of-order will outperform on multi-threaded workloads even if the competing processor has speculative execution. And a barrel processor with a short pipeline plus an out-of-order system, for when it does not have enough threads, can keep up with and beat a processor with speculative execution most of the time, because it is not wasting memory bandwidth and not filling caches with stuff that is not needed.

The issue here is that resources in CPUs are limited; any waste cannot be recovered.



  • Weasel
    replied
    Originally posted by oiaohm View Post
It is not that simple. 64 bytes from L1 to L2 and from L2 to L3, right? No, it's not. You have 64 bytes for the data cache and 64 bytes for the instruction cache in L1, so you can be doing 128 bytes. Then you have your speculation on your L2 and your speculation on your L3.
    https://www.7-cpu.com/cpu/Haswell.html
Nothing like being horribly wrong. Generally 64 bytes from L1 in Intel CPUs takes 4 clock cycles. A request to L2 takes 12 clock cycles before the CPU core has it, and if you are out at L3 you have been stalled for at least 30 cycles.
    You REALLY have no idea what you are talking about.

Just for your info, the 4 clock cycle is store-forwarding LATENCY and we were talking about BANDWIDTH. In fact we were talking about bandwidth to the MAIN MEMORY (i.e. RAM, not a fucking cache). RAM has no concept of "instruction cache" or "data cache". Obviously the LATENCY will be way slower than 1 clock cycle, way way way slower; it also depends on RAM speed, if you want to access that data fast. The cache itself can only pull (currently) 64 bytes per clock per core, but that won't saturate the BANDWIDTH of the main memory with just 1 core.

Latency is another problem; at some point you will have to wait. But that's no different than waiting on a branch, Lol.

Just stop man, you're embarrassing yourself.
    Last edited by Weasel; 22 August 2018, 07:41 AM.



  • oiaohm
    replied
    Originally posted by Weasel View Post
    Yes, and memcpy, or just a simple
    Code:
    rep movsb
    on newer CPUs will, on large amounts of memory copied, do it 64 bytes per clock cycle (obviously, barring latency, which is irrelevant for bandwidth). You can't go faster (i.e. more bytes at once) than that with just 1 core. So no matter what the speculative code does it can't do it by itself.
It is not that simple. 64 bytes from L1 to L2 and from L2 to L3, right? No, it's not. You have 64 bytes for the data cache and 64 bytes for the instruction cache in L1, so you can be doing 128 bytes. Then you have your speculation on your L2 and your speculation on your L3.
    https://www.7-cpu.com/cpu/Haswell.html
Nothing like being horribly wrong. Generally 64 bytes from L1 in Intel CPUs takes 4 clock cycles. A request to L2 takes 12 clock cycles before the CPU core has it, and if you are out at L3 you have been stalled for at least 30 cycles.

Welcome to the horrible fact of CPU caches: you run out of memory bandwidth very quickly. Each level is about 3 times faster than the one below it, i.e. L1 is 3 times faster than L2 and L2 is 3 times faster than L3. Drop L2 out of the design and you can have an L3 that is just 1/3 of the speed of L1.
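These level-to-level gaps are easy to see for yourself with a pointer chase, where every load depends on the one before it, so throughput tricks cannot hide the latency. A rough sketch (buffer sizes and iteration counts are arbitrary, and the stride only reduces prefetching rather than defeating it completely):
Code:
#include <stdio.h>
#include <stdlib.h>
#include <time.h>

int main(void)
{
    for (size_t size = 1 << 12; size <= 1 << 26; size <<= 2) {
        size_t n = size / sizeof(size_t);
        size_t *buf = malloc(n * sizeof(size_t));
        if (!buf) return 1;
        for (size_t i = 0; i < n; i++)
            buf[i] = (i + 4099) % n;       /* one big cycle, large stride */

        struct timespec t0, t1;
        clock_gettime(CLOCK_MONOTONIC, &t0);
        size_t idx = 0;
        for (long i = 0; i < 10000000; i++)
            idx = buf[idx];                /* dependent load chain */
        clock_gettime(CLOCK_MONOTONIC, &t1);

        double ns = (t1.tv_sec - t0.tv_sec) * 1e9 + (t1.tv_nsec - t0.tv_nsec);
        printf("%8zu KiB: %.2f ns/load (idx=%zu)\n", size / 1024, ns / 1e7, idx);
        free(buf);
    }
    return 0;
}
As the working set falls out of L1, then L2, then L3, the ns/load figure steps up at roughly the ratios being described.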

The RISC-V chips I have been talking about can do 64 bytes per clock cycle. x86 chips are not that good on cache bandwidth usage or function. Speculation in the caches consumes many clock cycles of cache access. You are correct that the x86 hardware can do 64 bytes per cycle, but half of that goes on filling the two halves of the L1, and it comes out at 64 bytes per 4 clock cycles because you are losing at least half to speculation.

    Originally posted by Weasel View Post
    lol what? You do realize a coin flip is 50% right? It's not 50% "when you're lucky", wtf. This made my day. :-).
The problem here is that branch speculation is not a coin flip.
    https://en.wikipedia.org/wiki/Monty_Hall_problem
It is in fact related to the Monty Hall problem; that is why it is 30% hit, 70% failure. Picking a branch is very much like picking the Monty Hall doors, except one door is already open and you are always picking the door with the 1/3 chance instead of the door with the 2/3 chance.

Branch prediction is in fact 2 coin flips: what you choose and what actually happens.

This is really like the Monty Hall problem: with LZMA it works out that you are pretty much always hitting the 1/3 door. There are other algorithms where branch prediction will get you 70% success with 30% failure. You are insanely lucky if an algorithm gives you exactly 50% on branch prediction in something like an x86 CPU that is only processing 1 branch at a time. Your branch prediction is either really good or really bad: if an algorithm is good for branch prediction, it is like spinning a biased coin. Computer algorithms are much more commonly 30 percent success and 70 percent failure at branch prediction.
    https://www.smithsonianmag.com/scien...050-145465423/

A coin flip is not as unbiased as you think. The algorithm at play is very much a matter of what has been done to the coin.

If you read some of the early white papers on speculative execution, they picked some lucky workloads and concluded speculative execution was great because they were getting branch prediction right 70 percent of the time.

    Originally posted by Weasel View Post
    Think about it: why even have a long pipeline on a system that waits on branches and then runs another thread on the same core?!? Why not have small pipelines but more cores instead, which is the same thing but more efficient?
Barrel processors exist for a reason, and a barrel processor does not equal a longer pipeline.
    https://en.wikipedia.org/wiki/Barrel_processor
Small pipelines with more cores result in more hardware sitting idle. The objective of a barrel processor is much the same as that of speculative execution: keep the processing units as full as possible. A barrel processor uses threads to have enough instructions in flight to keep the CPU's processing parts under full load.

Really you don't want a long pipeline. If you break your single-threaded code into as many independent streams as possible, the way compilers split code around register usage but into threads instead, you get something a barrel processor likes. There have been some historic out-of-order barrel processors that did this when they had only a single-threaded program running.

You see a 14-stage pipeline in an x86 and it does suffer from cache stalls. A 5-stage-pipeline barrel CPU running with 10 threads will perform well with no stalls.

Basically there are two ways to take on the single-thread performance problem. Use speculation to guess, executing 1 branch and taking stalls/jitter when you get it wrong; or have methods that turn single-threaded code into multi-threaded, because your core is naturally multi-threaded. If the CPU core you have is naturally single-threaded, turning code into multiple threads does not help you much.

Speculative execution on a barrel processor can get insanely warped: don't pick one branch, process each branch direction in an individual thread in the CPU at the same time, and throw away whichever thread was wrong. This is the only form of speculative execution that lands you with exactly 50% success and 50% failure every single time, but it also means more complex cache and memory management, which is a path to failure. It is just better to have methods that turn a single thread into multiple threads when there is not enough workload to keep a barrel processor 100 percent happy. There are quite a few methods to turn single-threaded into multi-threaded that don't use speculation.

On properly designed barrel processors that are truly multi-threaded at the core, you do not have jitter from speculative execution failures/successes even if they have speculative execution. Properly multi-threaded CPU cores have predictable execution times.

Please note that with small pipelines and more cores, even at a 95% branch hit rate you still have 5 percent misses, whereas a barrel processor that automatically switches between threads can reach 100% processing load.

Barrel is one of those things that performs great in all multi-threaded workloads. Barrel is great for real-time systems needing dependable processing timing. Barrel needs work to be the best when you only have a single thread to process, like the ability to make more than 1 thread from a single stream of instructions. What is at the core of x86 is not ideal for real-time work needing dependable timing and not ideal under heavy multi-threaded workload. Worse, x86 is not really super great at single-threaded either.

Please note I don't see that long pipelines can be justified; really we need all CPUs to get their pipeline length back under 10. They will be forced to, because once we cannot go down to a smaller nm, the only way to reduce power usage will be to reduce circuitry. The best way to reduce circuitry without reducing the number of operations processed per clock cycle is to optimise the heck out of your pipeline, making it as short as you can while keeping all the performance optimisations like issuing multiple instructions at once. Also, once we cannot go to a smaller nm, building thread management into hardware will be another power saving, by reducing context-switch overhead.



  • Weasel
    replied
    Originally posted by oiaohm View Post
This is wrong off the bat: current Intel x86 CPUs cannot process anything that is not transported from L3 to L2 and then from L2 to L1.

An SSE cache-bypass store basically directs the caches not to keep a copy of the operation and to write into a straight-up disposable section of cache. You are still writing through the caches. The reason you must read/write through a cache is that the MMU can be busy when your instruction attempts the read/write.

With a memcpy that is not sparse you will not notice much of a problem, because it is a workload hitting the cache most of the time. A cache line is 64 bytes on x86. Please note some of the highest-compression algorithms are sparse problems; the 10-minute mark shows why 8 and 16 bytes are optimal for sparse problems. When you are not using a cache system optimal for sparse problems and you put speculative execution on top, you have a path to hell.
    Yes, and memcpy, or just a simple
    Code:
    rep movsb
    on newer CPUs will, on large amounts of memory copied, do it 64 bytes per clock cycle (obviously, barring latency, which is irrelevant for bandwidth). You can't go faster (i.e. more bytes at once) than that with just 1 core. So no matter what the speculative code does it can't do it by itself.
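For what it's worth, a claim like this is easy to sanity-check yourself. A minimal sketch in standard C; whether your libc's memcpy actually lowers to rep movsb on an ERMSB-capable CPU is an assumption to verify, not a given:
Code:
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <time.h>

int main(void)
{
    size_t size = 64 * 1024 * 1024;        /* 64 MiB: well past L3 */
    char *src = malloc(size), *dst = malloc(size);
    if (!src || !dst) return 1;
    memset(src, 1, size);                  /* fault the pages in first */
    memset(dst, 0, size);

    struct timespec t0, t1;
    clock_gettime(CLOCK_MONOTONIC, &t0);
    for (int i = 0; i < 16; i++)
        memcpy(dst, src, size);
    clock_gettime(CLOCK_MONOTONIC, &t1);

    double s = (t1.tv_sec - t0.tv_sec) + (t1.tv_nsec - t0.tv_nsec) / 1e9;
    printf("%.2f GiB/s copied (actual bus traffic is ~2x: read + write)\n",
           16.0 * size / s / (1u << 30));
    free(src);
    free(dst);
    return 0;
}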

    Originally posted by oiaohm View Post
The CPU dropping back to a blind guess is how LZMA ends up at 30 percent. An algorithm that takes branches unpredictably quickly kicks the living heck out of speculative execution. A blind guess is not a 50% success rate; it is 50% if you are lucky, and in the real world mostly 30% with 70% failure.
    lol what? You do realize a coin flip is 50% right? It's not 50% "when you're lucky", wtf. This made my day. :-)

    Longer pipelines have diminishing returns obviously, I never denied it. Yet, we do it because we care about single-threaded performance. Most of the CPU's design, from speculative execution to long pipelines, is made for that purpose only.

    Think about it: why even have a long pipeline on a system that waits on branches and then runs another thread on the same core?!? Why not have small pipelines but more cores instead, which is the same thing but more efficient? The fact is that both situations only work when you have multiple threads. And guess what? That's called a GPU.

    In contrast, speculative execution + long pipeline is the only design that is made to increase single-threaded performance. It's really as simple as that.

    At some point of course, the gains are too small in single-threaded performance, so they resort to adding more cores and the like. But they're still mostly focused on single-threaded performance, otherwise we'd all be using 128-core mini-core CPUs or such which have zero speculative execution and are thus super slow at single-threaded performance.
    Last edited by Weasel; 20 August 2018, 08:17 AM.



  • oiaohm
    replied
    Originally posted by Weasel View Post
    A single-threaded memcpy won't saturate your memory bandwidth, and cache has nothing to do with it, since it bypasses the cache anyway. The cache doesn't load faster than a memcpy can.
This is wrong off the bat: current Intel x86 CPUs cannot process anything that is not transported from L3 to L2 and then from L2 to L1.

An SSE cache-bypass store basically directs the caches not to keep a copy of the operation and to write into a straight-up disposable section of cache. You are still writing through the caches. The reason you must read/write through a cache is that the MMU can be busy when your instruction attempts the read/write.

With a memcpy that is not sparse you will not notice much of a problem, because it is a workload hitting the cache most of the time. A cache line is 64 bytes on x86. Please note some of the highest-compression algorithms are sparse problems; the 10-minute mark shows why 8 and 16 bytes are optimal for sparse problems. When you are not using a cache system optimal for sparse problems and you put speculative execution on top, you have a path to hell.

    Originally posted by Weasel View Post
If a branch is consistently predicted correctly less than 50% of the time, what makes you think the CPU won't simply reset to a blind guess, which is still a 50% success rate while being completely blind (i.e. flipping a coin)? You literally have no idea what you're talking about when it comes to branch prediction.
The CPU dropping back to a blind guess is how LZMA ends up at 30 percent. An algorithm that takes branches unpredictably quickly kicks the living heck out of speculative execution. A blind guess is not a 50% success rate; it is 50% if you are lucky, and in the real world mostly 30% with 70% failure.

200 processing units is not that many either. When you have a 14-stage pipeline, as x86 does, a lot of people would think that is 200/14, but it is really 200/27 once you allow for items with high latency, which is roughly 8 processing units being cleared per clock cycle. This is the other problem with increasing pipeline length: you need more and more processing units to clear the same number per clock cycle. So that is exactly the same number of processing units cleared per clock cycle as an A73, but that has only an 11-stage pipeline, so 21x8=168 processing units (which kind of explains the lower power usage, right?).

The longer your pipeline gets, the more problems you have. The first problem comes when you cross 8 stages and start to find your branch processing too slow, so at that point you need either the Cray barrel method for accelerating single-threaded code or speculative execution.

The Cray barrel optimisation for single-threaded execution (out-of-order execution with read-ahead in a barrel CPU design) splits the instructions needed for branch calculation into one high-priority thread and the general processing into another thread, all from a 1-thread workload. The result is that there is no such thing as a single-threaded workflow. The Cray method starts failing you when you get to a pipeline length of 10-11, and it was also patented until 2008. This method does not suffer from https://en.wikipedia.org/wiki/Pipeline_stall, yes, the nasty pipeline stall. Of course, to use this method you need a barrel thread-management engine in your CPU so you can break a single thread into 2 or more threads as your out-of-order solution, particularly so you can process the branch path through the code quickly and know what instructions to feed into the general thread or threads. The barrel thread management in the CPU would also know when it has only 1 thread to process and should attempt thread splitting, versus when it should just cycle between threads. Because the Cray stuff was patented and not usable until 2008, it is still not in most textbooks.

Everything fails when you cross a pipeline length of 14. It becomes power-inefficient, and speculative execution starts failing you at a pipeline length of 15+.

Weasel, I guess it never crossed your mind that it is possible to design a CPU in which internally there is no such thing as a single-threaded workflow, yet which processes single-threaded workflows insanely well. This is the problem with being tunnel-visioned on single-thread performance: it means you don't look at the Cray barrel, which is a pure multi-threaded CPU core design, and the optimisations that can be put on top of it to convert single-threaded work to multi-threaded. If you have great multi-thread performance, it is possible to use methods that make single threads be processed by multiple threads and so get good single-threaded performance as well. But if you have poor multi-thread performance, there is no optimisation you can magically do to fix it. Worse, if you have poor single-threaded performance as well, then you are totally screwed, and totally screwed is the x86 chip.

To design a true all-rounder CPU, you would look back at the Cray barrel system and the Cray patents for optimising single-threaded workflows on a barrel-designed CPU, and you would have to keep your pipeline length under 10 and as close to 4 as you can get. This kind of CPU looks completely different from our current general CPUs, but it would be a true general-purpose CPU that does multi-threaded as well as single-threaded.



  • Weasel
    replied
    Originally posted by oiaohm View Post
In an Intel x86 you can, because the caches are designed for a 90 percent hit rate. So if you are doing something that has a 70 percent cache miss rate, the CPU's demand, measured in Intel cache lines (the size of the transfers done between memory and the caches), is greater than your memory speed.
    A single-threaded memcpy won't saturate your memory bandwidth, and cache has nothing to do with it, since it bypasses the cache anyway. The cache doesn't load faster than a memcpy can.

    Originally posted by oiaohm View Post
    https://www.youtube.com/watch?v=QdTwLs_8RZE
Here you see RISC-V beating your Xeon, Xeon Phi, and GPU processors for big data, and not by a small margin, entirely by focusing on making sure the memory system is effective: using a cache that has a lower cost when you have a high miss rate, with cache lines reduced down to 8 to 16 bytes. All Intel x86 parts use 64-byte cache lines.
Where in the video are the actual measured benchmarks? It's too long and I don't have time to watch it all, so at what minute/sec is it?

And I hope you are talking about a single-threaded workload, right? I keep having to repeat this for some stupid reason since you keep dodging it.

    Since, you know, that's the... whole point of speculative execution...

    Originally posted by oiaohm View Post
That is straight up ignoring that you have a CPU with caches designed for a 90% hit rate, so not waiting and attempting to execute anyway lands you in cache hell.

Also, it is not waiting 100% of the time when not doing speculative execution.
    When encountering a branch, you are waiting literally 100% of the time without speculative execution. Obviously we're talking about the point where you encounter a branch.

    Originally posted by oiaohm View Post
That is wrong: Cray-style barrel multi-threading, where the CPU is not just processing 1 thread at a time, can fill a 200-entry queue without speculative execution, and that is not the only way.
So does any CPU with SMT or Hyperthreading, but I don't care about filling the entire queue with DIFFERENT THREADS.

I'm talking about SINGLE THREADED PERFORMANCE FFS. That's the **whole** god damn point of speculative execution: to fill the pipeline with ONE THREAD's execution as much as possible and thus to increase SINGLE THREADED PERFORMANCE, not to fill it with other threads, because that's what SMT is for, not speculative execution.

Speculative execution is **ONLY** about single threaded performance. **ONLY**. Why the fuck do you keep ignoring this on purpose?

    I don't give a fuck if any of your RISC examples are for multiple threads: that completely invalidates them. It is **ONLY** about single-threaded performance. That's the entire point behind speculative execution. Single-threaded performance.

    Repeat that 1000 times until you get it.

    Originally posted by oiaohm View Post
Option 1) If you have a shorter-pipeline system processing branches and feeding an out-of-order engine, it has no problem filling the execution units without speculative execution most of the time. You can get as high as a 95% hit rate on branches in a 5-stage-pipeline CPU (hit rate meaning no stall: when you got to the branch, everything was ready to resolve it). Yet a speculative execution solution is more often than not under 50%. Of course, if your CPU has a 14-stage pipeline you cannot do this.

Option 2) An out-of-order system biased towards processing whatever resolves branches (barrel multi-threading), like how compilers do loop unrolling. There is some extra complexity to get performance without speculative execution, but the extra-complexity method does not have the downside problem. This kind of system normally gets you to a pipeline length of 8-10; once you get longer than that, the pipeline delays become too much to mask by this method.
If a branch is consistently predicted correctly less than 50% of the time, what makes you think the CPU won't simply reset to a blind guess, which is still a 50% success rate while being completely blind (i.e. flipping a coin)? You literally have no idea what you're talking about when it comes to branch prediction.
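For reference, the "reset towards a coin flip" behaviour falls out of the textbook 2-bit saturating counter that simple predictors keep per branch-table entry. A sketch of the generic scheme, not of any particular Intel predictor:
Code:
/* 2-bit saturating counter: states 0-1 predict not-taken, 2-3 predict
   taken. A branch has to mispredict twice in a row to flip the
   prediction, which damps the damage from occasional odd outcomes. */
typedef unsigned char bp2_t;   /* holds 0..3 */

static int bp2_predict(bp2_t c)
{
    return c >= 2;                          /* 1 = predict taken */
}

static void bp2_update(bp2_t *c, int taken)
{
    if (taken)  { if (*c < 3) (*c)++; }     /* saturate at 3 */
    else        { if (*c > 0) (*c)--; }     /* saturate at 0 */
}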
    Last edited by Weasel; 19 August 2018, 07:48 AM.



  • oiaohm
    replied
    Originally posted by Weasel View Post
    Lol man, you can't even saturate the entire memory bandwidth with just 1 core (thread).
In an Intel x86 you can, because the caches are designed for a 90 percent hit rate. So if you are doing something that has a 70 percent cache miss rate, the CPU's demand, measured in Intel cache lines (the size of the transfers done between memory and the caches), is greater than your memory speed.

Originally posted by Weasel View Post
Show me RISC-V that can compete with whatever current x86 CPU (at performance and same class), Xeon Phi or not, doesn't matter.
    https://www.youtube.com/watch?v=QdTwLs_8RZE
Here you see RISC-V beating your Xeon, Xeon Phi, and GPU processors for big data, and not by a small margin, entirely by focusing on making sure the memory system is effective: using a cache that has a lower cost when you have a high miss rate, with cache lines reduced down to 8 to 16 bytes. All Intel x86 parts use 64-byte cache lines.

You should not be able to saturate the entire memory bandwidth with 1 core, but with Intel's design you can, and it is all because of how the cache is designed.

    Originally posted by Weasel View Post
    SIMD is vector so you're just using random buzzwords right now to refer to the same thing and appear as if you have a point.
No: RISC-V has Cray-style vector, and one of the prototypes is a SIMD implementation kind of like how Intel did it. So they are two different things when talking about RISC chips. This is just you not knowing the terminology.


    Originally posted by Weasel View Post
    Throwing 70% of the time still means you execute 30% of the time.
    Waiting 100% of the time means you execute 0% of the time.
That is straight up ignoring that you have a CPU with caches designed for a 90% hit rate, so not waiting and attempting to execute anyway lands you in cache hell.

Also, it is not waiting 100% of the time when not doing speculative execution.

    Originally posted by Weasel View Post
There's absolutely no way you will fill a 200-entry queue without speculative execution. Have you ever looked at 99.9% of HLL code? There are ifs, whiles, and other loops all over the place. Branches probably come every 20 instructions at most, if not sooner.
That is wrong: Cray-style barrel multi-threading, where the CPU is not just processing 1 thread at a time, can fill a 200-entry queue without speculative execution, and that is not the only way.

Also, there are ways of filling a 200-entry queue with a different optimisation. Here is something you have not considered, because you have not looked at in-order cores: does every branch cause a stall? The answer is no, it does not. When you don't do speculative execution and run a pure out-of-order core, once the information required for a branch is resolved you can proceed down that branch even if everything from before the branch has not been processed yet.

    Originally posted by Weasel View Post
    Reality is most software for normal users is FULL of branches (and yes a loop is a conditional branch!!!).
The question you have not asked is how much overhead that causes in a CPU with out-of-order execution but no speculative execution. Please note pipeline length is also a factor here: 20 instructions between branches with a 14-stage pipeline, hmm, problem. This is where the pipeline is the problem.

Remember, a 14-stage pipeline means there can be a 14 clock cycle delay between when an instruction enters and when its result comes out. So take your basic for loop:
for (c = 0; c < 10; c++) {
}
There could be a 14 clock cycle delay between when the c++ enters the pipeline and when you can compare whether c is less than 10, so quite a large stall. 20 instructions with multi-instruction issue, say roughly 2 instructions at a time, is 10 clock cycles of work against that 14 clock cycle pipeline delay; with a pipeline that long you are forced into speculative execution because you cannot get results back out of the pipeline quickly enough for the branch. If the c++ is at the start of the loop and the c<10 at the end, so they sit in two pipeline processing groups, this becomes not 14 but 27 cycles. Yes, even if it is not 20 instructions but 20 clocks between branches, you are in trouble: pipeline length has forced you into speculative execution.

If the CPU has a 5-stage pipeline, that is a 9 clock cycle overhead at worst against the same 10 clock cycles of work at 2 instructions at a time. If the c++ is placed at one end of the loop and the compare c<10 at the other, then by the time you get to the branch you already know the result, so you don't need speculative execution.
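A sketch of the scheduling trick being described here, written as C for readability; a compiler's instruction scheduler would do this itself, and do_work is a hypothetical stand-in for the loop body:
Code:
/* Compute the loop condition as early in the body as possible so it
   is already resolved by the time the bottom-of-loop branch issues;
   the independent body work in between hides the compare's latency. */
static void do_work(int c) { (void)c; /* stand-in for real work */ }

void loop_scheduled(void)
{
    int c = 0, more;
    do {
        more = (++c < 10);   /* branch condition computed first...   */
        do_work(c);          /* ...independent work fills the delay  */
    } while (more);          /* branch sees an already-known result  */
}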

Pipeline length is quite a factor in whether you need speculative execution or not. The shorter the pipeline, the less benefit speculative execution gives, until there is absolutely no benefit at all.

    Originally posted by Weasel View Post
Read the entire thing and you realize why that happens: without speculative execution you'd need about 4x logical cores per physical core to even attempt to use all those idle execution units. Which is useless when you only have a single thread (see below).
No, this is wrong: a big reason for needing speculative execution is the overhead of too long a pipeline.

Option 1) If you have a shorter-pipeline system processing branches and feeding an out-of-order engine, it has no problem filling the execution units without speculative execution most of the time. You can get as high as a 95% hit rate on branches in a 5-stage-pipeline CPU (hit rate meaning no stall: when you got to the branch, everything was ready to resolve it). Yet a speculative execution solution is more often than not under 50%. Of course, if your CPU has a 14-stage pipeline you cannot do this.

Option 2) An out-of-order system biased towards processing whatever resolves branches (barrel multi-threading), like how compilers do loop unrolling. There is some extra complexity to get performance without speculative execution, but the extra-complexity method does not have the downside problem. This kind of system normally gets you to a pipeline length of 8-10; once you get longer than that, the pipeline delays become too much to mask by this method.

Barrel processing groups your instructions so that you can prioritise the instructions that resolve the next branch, delay the instructions not needed for it, and use them to fill the space. With barrel multi-threading, your single-threaded code becomes multiple threads, so it is more suitable for SMT processing, and better biased at that.

Option 3) Speculative execution, the worst option: this gets you to 14-stage pipelines and is wasteful, wasting processing units on results you will throw away and, worse, wasting cache requests on information that will never be used. Pipelines longer than 14 are not effective for anyone. The only way 15-stage and longer pipelines could be effective is if someone designs a new method for accelerating branches.

A mix of options 1 and 2 is insanely effective, particularly when option 2 can be done as a thread-management engine that accelerates SMT workloads and reduces their overhead.

Cray was making highly multi-threaded hardware, but people were giving it lots of single-threaded work to process. Barrel multi-threading is a different way to achieve a lot of the same things as speculative execution, except there is no speculation in barrel multi-threading; it is more a biasing, so you process branches fast, then process all the instructions you know you have to process, and use that to fill all the slots.

Weasel, there is more than 1 way to solve the problem of filling the processing units, but first you have to see the bottlenecks. Pipeline length is a major bottleneck: the longer the pipeline, the more restricted your branch-resolving optimisation methods become. The way the cache system is optimised is another major bottleneck.



  • Weasel
    replied
    Originally posted by oiaohm View Post
Quite a bit of difference, and that is the problem, particularly when your memory bandwidth is saturated. The problem is you will be waiting on the memory system, be it an in-order or out-of-order core of any form. Waiting at the branches, and so keeping memory requests to what is required, can increase your performance by 200 times on some single-threaded workloads, because you are not stalled as long since you are not consuming as much memory bandwidth.

You are right that you will have to wait when you are at max memory bandwidth. Failed speculative execution putting more pressure on the memory system causes you to wait more than the performance you gain from speculative execution. You are not thinking about it right.

If you are getting 30 percent right, you have about 2/3 more memory bandwidth consumed by the wrong paths. If 1/3 would already have saturated the memory system anyhow, you are well and truly in the performance weeds, as the complete memory system stalls your processing; this is why you see insane performance boosts when you avoid it. Speedups of 200 to 2000 times on particular algorithms have been seen on CPUs that have instructions to say "do not use speculative execution here, because we know it is not going to work well".

While you have surplus memory bandwidth to use, speculative execution can be a good idea in some cases. Once you are short on memory bandwidth, speculative execution becomes really bad, and being stalled at branch points in the code stops being the big problem. This is why you need the ability to turn speculative execution off.
    Lol man, you can't even saturate the entire memory bandwidth with just 1 core (thread).

    Originally posted by oiaohm View Post
Not exactly true. RISC-V BOOM shows that correctly done out-of-order designs are the same size as their in-order relations. There is a historic way of designing out-of-order that is quite silicon-expensive, but it is not the only way you can design it.
    Alright, since you don't seem to get it.

    Show me RISC-V that can compete with whatever current x86 CPU (at performance and same class), Xeon Phi or not, doesn't matter.

If you don't then it's meaningless. I don't give a shit about "theory" or claims or what they claim or babbling or "lab" or "prototypes". In "theory", we'd have 50 GHz by now, or so people said in the 90s. ffs dude.

    So, put up or shut up about RISC-V. Seriously. (and energy efficiency is completely *useless* in our current discussion, don't even bring it up and then "extrapolating" based on that, that's not how reality works)

    Originally posted by oiaohm View Post
SIMD turns out to take more transistors than doing vector in RISC-V, and the vector hardware in RISC-V gets recycled for out-of-order.
    SIMD is vector so you're just using random buzzwords right now to refer to the same thing and appear as if you have a point.

    Originally posted by oiaohm View Post
Out-of-order execution does not mean you have to do speculative execution. Think about it: this is only a 200-entry queue. Let's say you fill it up with 70 percent instructions whose results you will be throwing away; with speculative execution you could be really hurting things.
    Waiting isn't any better.

    Throwing 70% of the time still means you execute 30% of the time.
    Waiting 100% of the time means you execute 0% of the time.

    I wonder which one is better, indeed.

There's absolutely no way you will fill a 200-entry queue without speculative execution. Have you ever looked at 99.9% of HLL code? There are ifs, whiles, and other loops all over the place. Branches probably come every 20 instructions at most, if not sooner.

    This is called reality. General purpose CPUs are not for FORTRAN, that's what GPUs or GPU-like supercomputer CPUs are for. Reality is most software for normal users is FULL of branches (and yes a loop is a conditional branch!!!).

    Oh by the way, here's a post from bridgman (AMD): https://www.phoronix.com/forums/foru...53#post1041653

    Here's an excerpt for you: As cores get a bit wider each year the number of workloads where SMT can help goes up a bit as well.

Read the entire thing and you realize why that happens: without speculative execution you'd need about 4x logical cores per physical core to even attempt to use all those idle execution units. Which is useless when you only have a single thread (see below).

    Think about it until it sinks in.

    Originally posted by oiaohm View Post
So not only does speculative execution hurt memory bandwidth; when you are out of bandwidth it can in fact hurt overall performance, particularly if you have another thread you could have been processing whose results you would have 100 percent used.
    I told you already, I don't care about another thread, I'm talking about single-threaded performance.
    Last edited by Weasel; 18 August 2018, 08:21 AM.

