L1 Terminal Fault - The Latest Speculative Execution Side Channel Attack


  • #71
    Originally posted by Weasel View Post
    A single-threaded memcpy won't saturate your memory bandwidth, and cache has nothing to do with it, since it bypasses the cache anyway. The cache doesn't load faster than a memcpy can.
This is wrong right off the bat: a current Intel x86 CPU cannot process anything that has not been transported from L3 to L2 and then from L2 to L1.

An SSE cache-bypass store basically directs the caches not to keep a copy of the data from this operation and to write it into a straight-up disposable section of cache. You are still writing through the caches. The reason you must read/write through a cache at all is that the MMU can be busy when your instruction attempts the read/write.
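
For illustration, a minimal sketch of that kind of cache-bypass copy, assuming SSE2 intrinsics (the function name and the alignment assumptions are mine, not from any particular library):

Code:
#include <emmintrin.h>  /* SSE2: _mm_loadu_si128, _mm_stream_si128, _mm_sfence */
#include <stddef.h>

/* Copy n bytes using non-temporal ("streaming") stores.
 * Assumes dst is 16-byte aligned and n is a multiple of 16. */
static void copy_streaming(void *dst, const void *src, size_t n)
{
    __m128i *d = (__m128i *)dst;
    const __m128i *s = (const __m128i *)src;
    for (size_t i = 0; i < n / 16; i++) {
        __m128i v = _mm_loadu_si128(&s[i]);  /* ordinary cached load            */
        _mm_stream_si128(&d[i], v);          /* store that hints the caches not */
    }                                        /* to keep a copy of the data      */
    _mm_sfence();  /* make the streaming stores globally visible */
}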

With a memcpy that is not sparse you will not notice much of a problem, because the workload will be hitting the cache most of the time. A cache line is 64 bytes long on x86. Please note that some of the highest-ratio compression algorithms are sparse problems; the 10 minute mark shows why 8 and 16 byte accesses are optimal for sparse problems. When you are not using a cache system optimal for sparse problems and you put speculative execution on top, you have a path to hell.

    Originally posted by Weasel View Post
    If a branch is consistently predicted less than 50% of the time, what makes you think the CPU won't simply reset to blind guess, which is still a 50% success rate being completely blind (i.e. flip a coin). You literally have no idea what you're talking about when it comes to branch prediction.
The CPU dropping back to a blind guess is how LZMA ends up at 30 percent. An algorithm that takes unpredictable branches quickly kicks the living heck out of speculative execution. A blind guess is not a 50% success rate; it is 50% if you are lucky, and in the real world it is mostly 30% success with 70% failure.
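
The cost of an unpredictable branch can be seen with a toy benchmark along these lines (a sketch, not a proper benchmark; build without aggressive optimisation, e.g. -O1, or the branch may be compiled into a conditional move and the difference disappears):

Code:
#include <stdio.h>
#include <stdlib.h>
#include <time.h>

#define N (1 << 22)

/* Sum the elements above 128; the branch is taken about half the time.
 * With random data the branch is unpredictable; once the data is sorted
 * the predictor learns the pattern almost perfectly. */
static long sum_above(const unsigned char *a, size_t n)
{
    long sum = 0;
    for (size_t i = 0; i < n; i++)
        if (a[i] > 128)          /* the data-dependent branch */
            sum += a[i];
    return sum;
}

static int cmp(const void *x, const void *y)
{
    return *(const unsigned char *)x - *(const unsigned char *)y;
}

int main(void)
{
    unsigned char *a = malloc(N);
    for (size_t i = 0; i < N; i++)
        a[i] = (unsigned char)(rand() & 0xff);

    clock_t t0 = clock();
    long s1 = sum_above(a, N);   /* unpredictable branch */
    clock_t t1 = clock();

    qsort(a, N, 1, cmp);         /* same data, now sorted */
    long s2 = sum_above(a, N);   /* same branch, now predictable */
    clock_t t2 = clock();

    printf("random: %ld in %ld ticks, sorted: %ld in %ld ticks\n",
           s1, (long)(t1 - t0), s2, (long)(t2 - t1));
    free(a);
    return 0;
}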

200 processing units is not that many either. With a pipeline 14 deep, as on x86, a lot of people would think that is 200/14, but it is really 200/27 once you allow for items with high latency, which is roughly 8 processing units being cleared per clock cycle. This is the other problem with increasing pipeline length: you need more and more processing units just to clear the same number of processing units per clock cycle. So that is exactly the same number of processing units cleared per clock cycle as an A73, which has only an 11-deep pipeline, so 21x8=168 processing units (which kind of explains the lower power usage, right?).

The longer your pipeline gets, the more problems you have. The first problem is crossing 8 and finding that your branch processing is too slow, so at that point you need either the Cray barrel method for accelerating single-threaded code or speculative execution.

The Cray barrel optimisation for single-threaded execution (out-of-order execution with read-ahead in a barrel CPU design) splits the instructions needed for branch calculation into one high-priority thread and the general processing into another thread, all from a single-thread workload. The result is that there is no such thing as a single-threaded workflow. The Cray method starts failing you when you get to a pipeline length of 10-11, and it was patented until 2008. This method does not suffer from https://en.wikipedia.org/wiki/Pipeline_stall - yes, the nasty pipeline stall. Of course, to use this method you need a barrel thread-management engine in your CPU so you can break a single thread into 2 or more threads as your out-of-order solution, particularly so you can process the branch path through the code quickly and know which instructions to feed into the general thread(s). The barrel thread management in the CPU knows when it has only 1 thread to process and should attempt thread splitting, or when it should just cycle between threads. Because the Cray stuff was patented and not usable until 2008, it is still not in most textbooks.

Everything fails when you cross a pipeline length of 14: the design becomes power-inefficient, and speculative execution starts failing you at a pipeline length of 15+.

Weasel, I guess it never crossed your mind that it is possible to design a CPU where internally there is no such thing as a single-thread workflow, yet it processes single-threaded workloads insanely well. This is the problem with being tunnel-visioned on single-thread performance: you don't look at the Cray barrel, which is a purely multi-threaded CPU core design, and the optimisations that can be layered on it to convert single-threaded work into multi-threaded work. If you have great multi-thread performance, it is possible to use methods that make single threads be processed by multiple threads, so you get good single-threaded performance as well. If you have poor multi-thread performance, there is no optimisation you can magically apply to get good multi-thread performance. Worse, if you have poor single-threaded performance as well, then you are totally screwed, and totally screwed is the x86 chip.

To design a true all-rounder CPU you will be looking back at the Cray barrel system and the Cray patents for optimising single-threaded workloads on a barrel-designed CPU, and you will have to keep your pipeline length under 10 and as close to 4 as you can get. This kind of CPU looks completely different from our current general-purpose CPUs, but it would be a true general-purpose CPU that does multi-threaded as well as it does single-threaded.



    • #72
      Originally posted by oiaohm View Post
This is wrong right off the bat: a current Intel x86 CPU cannot process anything that has not been transported from L3 to L2 and then from L2 to L1.

An SSE cache-bypass store basically directs the caches not to keep a copy of the data from this operation and to write it into a straight-up disposable section of cache. You are still writing through the caches. The reason you must read/write through a cache at all is that the MMU can be busy when your instruction attempts the read/write.

With a memcpy that is not sparse you will not notice much of a problem, because the workload will be hitting the cache most of the time. A cache line is 64 bytes long on x86. Please note that some of the highest-ratio compression algorithms are sparse problems; the 10 minute mark shows why 8 and 16 byte accesses are optimal for sparse problems. When you are not using a cache system optimal for sparse problems and you put speculative execution on top, you have a path to hell.
      Yes, and memcpy, or just a simple
      Code:
      rep movsb
on newer CPUs will, on large amounts of memory copied, do it at 64 bytes per clock cycle (obviously, barring latency, which is irrelevant for bandwidth). You can't go faster (i.e. more bytes at once) than that with just 1 core. So no matter what the speculative code does, it can't saturate the bandwidth by itself.
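
For reference, a sketch of how rep movsb might be wrapped from C with GCC/Clang inline assembly (the wrapper name is mine; on CPUs advertising ERMSB the microcode moves whole cache lines internally for large copies):

Code:
#include <stddef.h>

/* Copy n bytes with "rep movsb"; on recent Intel CPUs with ERMSB the
 * microcode moves whole cache lines internally for large copies. */
static void copy_rep_movsb(void *dst, const void *src, size_t n)
{
    __asm__ volatile ("rep movsb"
                      : "+D" (dst), "+S" (src), "+c" (n)
                      :
                      : "memory");
}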

      Originally posted by oiaohm View Post
The CPU dropping back to a blind guess is how LZMA ends up at 30 percent. An algorithm that takes unpredictable branches quickly kicks the living heck out of speculative execution. A blind guess is not a 50% success rate; it is 50% if you are lucky, and in the real world it is mostly 30% success with 70% failure.
      lol what? You do realize a coin flip is 50% right? It's not 50% "when you're lucky", wtf. This made my day. :-)

      Longer pipelines have diminishing returns obviously, I never denied it. Yet, we do it because we care about single-threaded performance. Most of the CPU's design, from speculative execution to long pipelines, is made for that purpose only.

      Think about it: why even have a long pipeline on a system that waits on branches and then runs another thread on the same core?!? Why not have small pipelines but more cores instead, which is the same thing but more efficient? The fact is that both situations only work when you have multiple threads. And guess what? That's called a GPU.

      In contrast, speculative execution + long pipeline is the only design that is made to increase single-threaded performance. It's really as simple as that.

      At some point of course, the gains are too small in single-threaded performance, so they resort to adding more cores and the like. But they're still mostly focused on single-threaded performance, otherwise we'd all be using 128-core mini-core CPUs or such which have zero speculative execution and are thus super slow at single-threaded performance.
      Last edited by Weasel; 20 August 2018, 08:17 AM.



      • #73
        Originally posted by Weasel View Post
        Yes, and memcpy, or just a simple
        Code:
        rep movsb
on newer CPUs will, on large amounts of memory copied, do it at 64 bytes per clock cycle (obviously, barring latency, which is irrelevant for bandwidth). You can't go faster (i.e. more bytes at once) than that with just 1 core. So no matter what the speculative code does, it can't saturate the bandwidth by itself.
It is not that simple. 64 bytes from L1 to L2 and from L2 to L3, right? No, it's not. You have 64 bytes for the data cache and 64 bytes for the instruction cache in L1, so you can be doing 128 bytes. Then you have your speculation hitting your L2 and your speculation hitting your L3.

Nothing like being horribly wrong. Generally, getting 64 bytes from L1 in Intel CPUs takes 4 clock cycles. A request to L2 takes 12 clock cycles before the CPU core has the data, and if you are out at L3 you have been stalled for at least 30 cycles.

Welcome to the horrible fact of CPU caches: you run out of memory bandwidth very quickly. Each level is about 3 times faster than the one below it, i.e. L1 is 3 times faster than L2 and L2 is 3 times faster than L3. Drop L2 out of the design and you only get one of those 3x steps between L3 and L1.

The RISC-V chips I have been talking about can do 64 bytes per clock cycle. x86 chips are not that good on cache bandwidth usage or function; speculation in the caches consumes a lot of clock cycles of cache access. You are correct that the hardware can do 64 bytes per cycle on x86, but that is half of what you need for filling the two halves of the L1, and it ends up at 4 clock cycles because you are losing at least half of it to speculation.
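
Those load-to-use latencies can be roughly estimated with a pointer chase, something like the sketch below (the buffer sizes are assumptions about typical L1/L2/L3 capacities, and a serious measurement would need to control for TLBs, prefetchers and timer resolution):

Code:
#include <stdio.h>
#include <stdlib.h>
#include <time.h>

/* Pointer-chase through 'slots' pointers linked into one shuffled cycle:
 * every load depends on the previous one, so the time per step roughly
 * approximates the load-to-use latency of whichever cache level the
 * buffer fits in. */
static double chase_ns(size_t slots, size_t steps)
{
    void **buf = malloc(slots * sizeof *buf);
    size_t *order = malloc(slots * sizeof *order);
    for (size_t i = 0; i < slots; i++)
        order[i] = i;
    for (size_t i = slots - 1; i > 0; i--) {          /* Fisher-Yates shuffle */
        size_t j = (size_t)rand() % (i + 1);
        size_t t = order[i]; order[i] = order[j]; order[j] = t;
    }
    for (size_t i = 0; i < slots; i++)                /* link into one cycle */
        buf[order[i]] = &buf[order[(i + 1) % slots]];

    void **p = &buf[order[0]];
    clock_t t0 = clock();
    for (size_t i = 0; i < steps; i++)
        p = (void **)*p;                              /* dependent load chain */
    clock_t t1 = clock();

    volatile void *sink = p; (void)sink;              /* keep the chain alive */
    free(buf); free(order);
    return (double)(t1 - t0) / CLOCKS_PER_SEC / steps * 1e9;
}

int main(void)
{
    /* Rough guesses: ~32 KiB fits in L1, ~256 KiB in L2, a few MiB in L3. */
    printf("L1-sized buffer: %.1f ns/load\n", chase_ns(32 * 1024 / sizeof(void *), 10000000));
    printf("L2-sized buffer: %.1f ns/load\n", chase_ns(256 * 1024 / sizeof(void *), 10000000));
    printf("L3-sized buffer: %.1f ns/load\n", chase_ns(8 * 1024 * 1024 / sizeof(void *), 10000000));
    return 0;
}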

        Originally posted by Weasel View Post
        lol what? You do realize a coin flip is 50% right? It's not 50% "when you're lucky", wtf. This made my day. :-).
The problem here is that branch speculation is not a coin flip.

It is in fact related to the Monty Hall problem; that is why it is 30% hit, 70% failure. Picking a branch is very much like picking the Monty Hall doors, except one door is already open and you are always picking the door with the 1/3 chance instead of the door with the 2/3 chance.

Branch prediction is in fact 2 coin flips: what you choose and what actually happens.

This really is like the Monty Hall problem: LZMA works out such that you are fairly much always hitting the 1/3 door. There are other algorithms where branch prediction gets you 70% success with 30% failure. It is insanely lucky if an algorithm gives you exactly 50% on branch prediction in something like an x86 CPU that is only processing 1 branch at a time. Your branch prediction is either really good or really bad; if an algorithm is really good for branch prediction, it is like spinning a biased coin. Computer algorithms are much more commonly 30 percent success and 70 percent failure at branch prediction.


A coin flip is not as unbiased as you think. The algorithm at play is very much a question of what you have done to the coin.

If you read some of the early white papers on speculative execution, they picked some lucky workloads and concluded that speculative execution was great because they were getting branch prediction right 70 percent of the time.

        Originally posted by Weasel View Post
        Think about it: why even have a long pipeline on a system that waits on branches and then runs another thread on the same core?!? Why not have small pipelines but more cores instead, which is the same thing but more efficient?
Barrel processors exist for a reason. A barrel processor does not imply a longer pipeline.

Small pipelines with more cores result in more hardware sitting idle. The objective of a barrel processor is very much the same as speculative execution: keep the processing units as full as possible. A barrel processor uses threads to have enough instructions in flight to keep the CPU's processing parts under full load.

Really, you don't want a long pipeline. If you break your single-threaded code into as many independent strands as possible (the way compilers split code around register usage, but into threads instead), you get something a barrel processor likes. There have been historic out-of-order barrel processors that did this when they had only a single-threaded program running.

You see, a 14-deep pipeline in an x86 does suffer from cache stalls. A 5-deep pipeline barrel CPU running with 10 threads will perform well, with no stalls. A toy model of the scheduling idea is sketched below.
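
As a purely illustrative toy model of what a barrel scheduler does, the sketch below round-robins over a fixed set of hardware thread contexts and skips any context that is waiting on memory, so an issue slot is only wasted when every context is stalled (all names and numbers here are made up for the example):

Code:
#include <stdbool.h>
#include <stdio.h>

#define NTHREADS 8

/* Toy model of one hardware thread context in a barrel CPU. */
struct hw_thread {
    bool active;        /* has instructions left to run            */
    int  stall_until;   /* cycle when its outstanding miss returns */
};

/* One simulated cycle: issue from the next ready context, round-robin. */
static int barrel_issue(struct hw_thread t[], int start, int cycle)
{
    for (int i = 0; i < NTHREADS; i++) {
        int idx = (start + i) % NTHREADS;
        if (t[idx].active && t[idx].stall_until <= cycle)
            return idx;      /* this context issues this cycle        */
    }
    return -1;               /* every context is stalled: wasted slot */
}

int main(void)
{
    struct hw_thread t[NTHREADS] = {0};
    for (int i = 0; i < NTHREADS; i++) t[i].active = true;
    t[3].stall_until = 12;   /* pretend thread 3 missed in L2 */

    int last = 0, wasted = 0;
    for (int cycle = 0; cycle < 20; cycle++) {
        int who = barrel_issue(t, (last + 1) % NTHREADS, cycle);
        if (who < 0) { wasted++; continue; }
        last = who;
        printf("cycle %2d: issue from thread %d\n", cycle, who);
    }
    printf("wasted issue slots: %d\n", wasted);
    return 0;
}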

Basically there are two ways to take on the single-thread performance problem. Either use speculation to guess, executing 1 branch path and accepting stalls/jitter when you get it wrong, or have methods that turn single-threaded code into multi-threaded code, because your core is naturally multi-threaded. If the CPU core you have is naturally single-threaded, turning code into multi-threaded code does not help you that much.

Speculative execution on a barrel processor can get insanely warped: don't pick one branch, process each branch in an individual thread in the CPU at the same time, and throw away whichever thread was wrong. This is the only form of speculative execution that lands you at exactly 50% success and 50% failure every single time, but it also means more complex cache and memory management, which is a path to failure. It is just better to have methods that turn a single thread into multiple threads when there is not enough workload to keep the barrel processor 100 percent happy. There are quite a few methods to turn single-threaded into multi-threaded that don't use speculation.

With properly designed barrel processors that are truly multi-threaded at the core, you do not have jitter due to speculative execution failures/successes, even if they have speculative execution. Properly multi-threaded CPU cores have predictable execution times.

Please note that with a small pipeline and more cores, even at a 95% branch hit rate you still have a 5 percent miss rate, whereas a barrel processor that automatically switches between threads can reach 100% processing load.

Barrel is one of those things that performs great in all multi-threaded workloads. Barrel is great for real-time systems needing dependable processing timing. Barrel needs extra work to be the best when you only have a single thread to process, such as the instruction processing having the ability to make more than 1 thread from a single stream of instructions. What is at the core of x86 is not ideal for real-time work needing dependable timing and is not ideal under heavy multi-thread workloads. Worse, x86 is not really that great at single-threaded either.

Please note I don't see that long pipelines can be justified. Really we need all CPUs to get their pipeline length back under 10, and they will be forced to, because once we cannot go down to a smaller nm the only way to reduce power usage will be to reduce circuitry. The best way to reduce circuitry without reducing the number of processing items per clock cycle is to optimise the heck out of your pipeline, making it as short as you can while keeping all the performance optimisations, like multiple instructions at once. Also, once we cannot go to a smaller nm, building thread management into hardware will be another power saving, by reducing context-switching overhead.



        • #74
          Originally posted by oiaohm View Post
It is not that simple. 64 bytes from L1 to L2 and from L2 to L3, right? No, it's not. You have 64 bytes for the data cache and 64 bytes for the instruction cache in L1, so you can be doing 128 bytes. Then you have your speculation hitting your L2 and your speculation hitting your L3.

Nothing like being horribly wrong. Generally, getting 64 bytes from L1 in Intel CPUs takes 4 clock cycles. A request to L2 takes 12 clock cycles before the CPU core has the data, and if you are out at L3 you have been stalled for at least 30 cycles.
          You REALLY have no idea what you are talking about.

Just for your info, the 4 clock cycle figure is store-forwarding LATENCY and we were talking about BANDWIDTH. In fact we were talking about bandwidth to the MAIN MEMORY (i.e. RAM not a fucking cache). RAM has no concept of "instruction cache" or "data cache". Obviously the LATENCY will be way slower than 1 clock cycle, way way way slower; it also depends on RAM speed if you want to access that data fast. The cache itself can only pull (currently) 64 bytes per clock per core, but that won't saturate the BANDWIDTH of the main memory with just 1 core.

          Latency is another problem, at some point you will have to wait. But that's no different than waiting on a branch, Lol.

Just stop man, you're embarrassing yourself.
          Last edited by Weasel; 22 August 2018, 07:41 AM.



          • #75
            Originally posted by Weasel View Post
            Latency is another problem, at some point you will have to wait. But that's no different than waiting on a branch, Lol.
Exactly: at some point you will have to wait anyhow. If your branch waits are short enough, they can line up with your memory waits and so effectively become non-existent.

I am not talking about main memory bandwidth. I am talking about the fact that you saturate the bandwidth between your CPU core and memory. Once that is saturated you are going to stall anyhow; it is the limited core-to-memory bandwidth that brings the idea of speculative execution undone.

The core L1 bandwidth is 64 bytes per 4 clock cycles. If your pipeline is short (5 deep), your worst case to solve a branch is 9 clock cycles. That is 9 clock cycles to process 2 instructions at the very worst. With correctly ordered code, since this is 5 instruction groups, the typical delay can be brought back to 4 clock cycles, and compilers can achieve this most of the time. That is roughly the cost of skipping one memory operation, and if a memory operation has had to go out to L2/L3 it is going to take longer than the branch solve even in the worst case. So memory bandwidth to the caches is the big stall problem in short-pipeline CPUs, not branch processing.

Now let's look at your 14-deep pipeline. That is 27 clock cycles in the worst case to solve a branch, and 27 clock cycles to process 2 instructions is downright horrible. At 14 instruction groups, given that a branch can be that close, there is no real way for the compiler to reorder code to halve this most of the time. Half is still 13 clock cycles, so you are losing 3 memory operations. That is a lot harder to hide if you hit the cache directly. In long-pipeline CPUs the branch processing gets worse than reading memory from L1, and the longer the pipeline gets, the worse this becomes.

Let's say we add speculative execution to deal with this problem and we keep the 14-deep pipeline. If you get 70/30, i.e. 70% hit and 30% miss, the 14-deep pipe is brought back to losing 1 memory operation per branch, which is what you would normally lose on a 5-deep pipeline CPU with well optimised code and no branch prediction at all. When you get the reverse, 30/70, i.e. 30 percent hit and 70 percent miss, you are losing 2 memory operations per branch, and if you only get 50/50 you are losing 1.5, again worse than the 5-deep pipeline CPU. At a 14-deep pipeline you are only matching the 5-deep pipeline CPU if you are lucky. But that is not the end of the story.
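
The arithmetic being argued over can be written out directly: the expected stall per branch is the miss rate times the resolve cost. The sketch below uses this thread's rule of thumb that a pipeline of depth N takes roughly 2*N-1 cycles to resolve a branch in the worst case; that formula is the poster's, not an official figure:

Code:
#include <stdio.h>

/* Expected stall cycles per branch, using the rule of thumb from this
 * thread: worst-case branch resolve time = 2 * pipeline_depth - 1 cycles,
 * and a mispredict throws the whole resolve time away. */
static double expected_stall(int depth, double hit_rate)
{
    int resolve = 2 * depth - 1;
    return (1.0 - hit_rate) * resolve;
}

int main(void)
{
    int depths[] = { 5, 9, 14 };
    double hits[] = { 0.30, 0.50, 0.70 };
    for (int d = 0; d < 3; d++)
        for (int h = 0; h < 3; h++)
            printf("depth %2d, hit rate %.0f%%: ~%.1f stall cycles per branch\n",
                   depths[d], hits[h] * 100.0,
                   expected_stall(depths[d], hits[h]));
    return 0;
}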

When you fill in those lost memory operations with speculative actions, you can be triggering ripples down through your cache, requesting stuff that would never be used by a non-speculative processor. Of course those requests consume your bandwidth from L3 to L2 and from L2 to L1, making the requests for the memory you actually need to proceed more congested. The speculation is also filling your cache with data and code that you would not need if you were not doing speculative execution.

So a pipeline of 14 is truly the upper limit; you really want a pipeline of 9 so that 50/50 with speculative execution equals a 5-deep pipeline. Even so, speculative execution is costing quite a bit. Think about an item you need that has been pushed out of your L3 cache because of a speculative action: you have to pull it back in from main memory, and that has quite a cost. Your memory bandwidth from main memory to your CPU core is quite limited.

When you look at a barrel processor, things are different.
A 5-deep pipeline barrel hits a point where it has to wait for a branch to resolve, but if there is another thread available, those 4-7 unused cycles can be filled in with that other thread. You are not stressing your limited cache memory bandwidth with any requested data that you are not going to use.

A 5-deep pipeline barrel is happy with 8 threads. It does not like 1 thread. With 8 threads, every time a branch comes up it is able to resolve it without cost. The only stall for a 5-deep pipeline barrel processor with 8 threads is running out of memory bandwidth through the caches. Those 8 do not have to be separate threads: if you have 8 non-interdependent instruction streams from a single thread's code, these can be used to fill the barrel processor's 8 slots. This is where out-of-order execution boosts a barrel processor's single-thread processing a lot, by converting single-threaded processing into more threads. Instead of speculation, with a barrel processor you focus on out-of-order execution, because if you can make a single thread out-of-order enough, it becomes enough threads to make your barrel design happy.

A 5-deep pipeline barrel processor can have more than 8 threads, so when one thread has to wait on a long operation, like a slow divide, it can just let that happen. Basically a barrel processor can be a dynamic, delay-tolerant processor, again processing as much as possible.

A barrel processor without an out-of-order system will not keep up with a non-barrel processor on a single-threaded workload, even if that processor has no speculative execution. But a barrel processor without out-of-order will outperform on multi-threaded workloads even if the competing processor has speculative execution. The thing is, a barrel processor with a short pipeline and an out-of-order system for when it does not have enough threads can keep up with, and beat, a processor with speculative execution most of the time, because it is not wasting memory bandwidth and not filling its caches with stuff that is not needed.

The issue here is that resources in CPUs are limited; any waste cannot be recovered.



            • #76
              Originally posted by oiaohm View Post
Exactly: at some point you will have to wait anyhow. If your branch waits are short enough, they can line up with your memory waits and so effectively become non-existent.

I am not talking about main memory bandwidth. I am talking about the fact that you saturate the bandwidth between your CPU core and memory. Once that is saturated you are going to stall anyhow; it is the limited core-to-memory bandwidth that brings the idea of speculative execution undone.
              Yes, which is no worse than waiting on the branch itself (for single-threaded performance). And you can't saturate it with only 1 core. And the stuff "between the cpu core and memory" is the main memory, you know? I mean the main memory only gets requests from the caches or the core (unless you bypass that).

              Originally posted by oiaohm View Post
The core L1 bandwidth is 64 bytes per 4 clock cycles.
              Ok dude, I'm done, you're just a parrot, you don't read, I even bold + capitalize stuff for emphasis in the hope that you'll actually manage to read it, and you still go on with the same nonsense bullshit... really lost cause.

              Here's what the 4 clock cycle latency is: if you write to the L1 cache, you must wait 4 clock cycles before that same memory (not cache line, but only the memory you wrote) is available for reading. This is store-forwarding latency. You can fucking write a LOT more in that time or read OTHER, UNRELATED stuff, even from the same cache line, without waiting for 4 clock cycles, this is why throughput is MUCH higher than latency.

              Bandwidth or throughput is stuff that you can do at the same time as long as it doesn't depend on the results of previous stuff. Like, you know, a memcpy, where it reads 64 bytes, writes 64 bytes, none of which depend on each other. Then the next 64 bytes are read, next 64 bytes are written. Again none of these depend on each other, because they're unaliased memory addresses. What you wrote is never read when you do the memcpy. There's absolutely NO fucking reason to wait 4 clock cycles. (that's what memmove is for, if aliasing is a possibility)

              In this case, say you write 64-bytes to cache line 1, then you write 64-bytes to cache line 2. These are done in 2 clock cycles total, each done in 1 clock cycle. You don't fucking wait for the RESULT (latency) of the first write before you can write the second. ffs.

              Even writing a SINGLE BYTE to a cache line and reading that SINGLE BYTE will incur a 4 clock cycle latency. Of course, other writes will run at the same time in parallel, since throughput is much higher. Throughput is 64 bytes per clock so...
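
The latency-versus-throughput distinction can be illustrated with two loops over the same buffer: one where each load's address depends on the previous load (bound by latency) and one where the loads are independent (bound by throughput). A rough sketch, with no claim about exact numbers on any particular CPU:

Code:
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>
#include <time.h>

#define N (1u << 20)

int main(void)
{
    uint64_t *a = malloc(N * sizeof *a);
    for (uint64_t i = 0; i < N; i++)
        a[i] = (i + 1) % N;                 /* each element holds the next index */

    /* Latency-bound: every load's address depends on the previous load. */
    clock_t t0 = clock();
    uint64_t idx = 0;
    for (uint64_t i = 0; i < N; i++)
        idx = a[idx];
    clock_t t1 = clock();

    /* Throughput-bound: the same number of loads, but all independent,
     * so the core can keep many of them in flight at once. */
    uint64_t sum = 0;
    for (uint64_t i = 0; i < N; i++)
        sum += a[i];
    clock_t t2 = clock();

    printf("dependent chain: %ld ticks, independent sum: %ld ticks (%llu %llu)\n",
           (long)(t1 - t0), (long)(t2 - t1),
           (unsigned long long)idx, (unsigned long long)sum);
    free(a);
    return 0;
}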
              Last edited by Weasel; 24 August 2018, 07:57 AM.



              • #77
                Originally posted by Weasel View Post
                Bandwidth or throughput is stuff that you can do at the same time as long as it doesn't depend on the results of previous stuff. Like, you know, a memcpy, where it reads 64 bytes, writes 64 bytes, none of which depend on each other. Then the next 64 bytes are read, next 64 bytes are written. Again none of these depend on each other, because they're unaliased memory addresses. What you wrote is never read when you do the memcpy. There's absolutely NO fucking reason to wait 4 clock cycles. (that's what memmove is for, if aliasing is a possibility)
It is not 1 clock cycle; it is at least 2. First you have to validate that the 64 bytes of memory you want to read from the L1 cache are in fact in the L1 cache. If not, you need to have them transferred from the L2 cache, and if they are not in L2, transferred from the L3 cache.

You are not getting what causes the saturation. Caches are only of a limited size and can only store so much.

Speculative execution has the downside of needing to store more in the cache, so it increases your cache miss rate. Think about your memcpy: you can have two different memcpys on the two sides of a branch. One memcpy is a debug function, so it is almost never called; if your speculative execution is going down there 30 percent of the time, it is polluting your cache with information the cache does not need.
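
The code pattern being described looks roughly like the hypothetical sketch below; whether a given CPU actually fetches the debug buffer speculatively depends on its predictor state, so this only illustrates the shape of the code, not a measured effect:

Code:
#include <stddef.h>
#include <string.h>

/* Hypothetical example of the pattern described above: the debug path is
 * almost never taken, but if the predictor speculates into it, the loads
 * and stores it issues can drag debug_buf's cache lines into the caches
 * at the expense of the hot data. */
static char debug_buf[64 * 1024];

void process(char *dst, const char *src, size_t n, int debug_enabled)
{
    if (debug_enabled)                       /* ~never true in production */
        memcpy(debug_buf, src, n < sizeof debug_buf ? n : sizeof debug_buf);

    memcpy(dst, src, n);                     /* the hot path */
}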

Please note you focused on the 4 cycles bit. It does not change the fact that unless speculative execution gets a 70% branch hit rate, a 14-deep pipeline is slower than a 5-deep pipeline that simply waits for the branch to be resolved, if the pipelines contain exactly the same optimisations other than speculative execution. Stalling does not pollute the cache and does not cause extra cache misses.

You have two causes of stalls: cache misses and branch pauses. Speculative execution attempts to deal with branch pauses while creating more cache misses and being wasteful.

Barrel processors deal with branch pauses and add tolerance of cache misses without increasing the cache miss rate, for the price that you must multi-thread at the CPU level.

In short-pipeline CPUs (under 5 pipeline stages), the increased cache miss rate caused by speculative execution causes larger performance problems than pausing on branches, because you have more cache miss delays and those delays consume more time than waiting on the branch.

Basically speculative execution is not that much of a solution. It kind of works in CPUs with long pipelines and large caches because the delay from solving a branch on a long pipeline is so large, although over a 14-deep pipeline it is too long even for speculative execution to help. Also, the L1 cache has a maximum electrical size, so a CPU with a pipeline of 5 has the same size L1 as a CPU with a pipeline of 14. You cannot in fact expand L1 to deal with speculative execution increasing the cache miss rate; instead you have to do things like add an L2, which adds more processing steps to any request for information from memory, making your memory stalls worse.

Out-of-order execution is where you accelerate the processing of the branch-deciding code and delay code that is not needed to solve the branch, using it to fill in the branch delay. Since you can do out-of-order using only instructions you know you have to execute, this does not cause increased cache misses. Out-of-order like this can be used to attempt to cover over cache misses as well.

The reality is that there is a very big question as to whether speculative execution really works, or whether we have just used speculative execution to cover up for bad chip design.

Your transfers from L3 to L2 are about 1/3 of the speed of L2 to L1, and that is about 1/3 of the speed from L1 to internal processing. Of course you are going to be able to run out of bandwidth. 1 request for something that is not in L2 but is in L3 will take longer for the CPU to get that information than a CPU with a pipeline of 14 takes to solve a branch. 1 request for something not in L1 but in L2, on a system that has an L2 if not an L3, takes longer than a CPU with a pipeline of 5 takes to solve a branch. A short pipeline makes speculative execution causing cache misses intolerable, because you start having longer delays due to cache misses than due to branch solves, so speculative execution is gaining you absolutely nothing.

Speculative execution looks fine until you start waking up to how expensive a cache miss is, how slow the cache bandwidth to the CPU is, and the fact that L1 is basically a fixed size for everyone.

As more of our applications become multi-threaded, there comes a point where there may not be enough single-threaded workloads to bother with out-of-order execution or speculative execution; instead just use barrel-designed processors that love multi-threading.

                The general workload is changing.

