Wine Developers Release Hangover Alpha To Run Windows x86_64 Programs On 64-Bit ARM


  • #61
    Originally posted by oiaohm View Post

    Really, no. This case is quite easy to read if you understand Intel's charting standards.
    That's like saying that "platform X is quite easy to use if you understand their standards of UI design". The whole point is that one shouldn't need to understand a given business's "internal standards" for how to communicate.

    (That's why GNOME 3.x is controversial and why UX experts hate touch gestures for being a learnability regression from WIMP GUIs.)

    Originally posted by oiaohm View Post
    There is a video that goes with those slides.
    I assumed that was probably the case, but I don't remember you linking to it, so it's of limited relevance.

    Originally posted by oiaohm View Post
    Yes, 1.0 on a bar means x32 performing the same as whatever x32 is being compared against on that bar, and 1.5 on that bar would mean x32 is faster than the comparison.
    OK. I now withdraw my objection on the data side of things.

    Originally posted by oiaohm View Post
    They do, but it's to Intel's standards; that is what happened here.
    So Intel has standardized on ignoring best practices agreed upon by all communication and data visualization experts? (Things like always labelling your numeric axes and picking the proper kind of chart to highlight the relevant aspects of the data.)

    The books I was assigned in university on the topic are reprints of books written in the 1950s and 1960s which have remained popular ever since. (One is a 56th printing, bought around 2010.)

    Intel has no excuse.

    Originally posted by oiaohm View Post
    This is something I mentioned: this goes beyond x86. You find POWER and RISC-V implementations with fewer registers in hardware than in the ISA, e.g. 16 general-purpose registers in the CPU against an ISA that defines 32.

    The sweet spot for program register usage appears to be 16 registers. 32-bit x86 only has 8 general registers in the ISA, so it is in trouble: it has too few. 64-bit gets you to 16 general registers.
    That may or may not be true, but it's beside the point. The point is that you argued that register renaming is for mapping some number of logical registers to some smaller number of physical registers. I disputed that and gave citations to back it up. You then tried to fault me for not giving citations for a point I wasn't making.

    Originally posted by oiaohm View Post
    The bigger your register count, the slower the clock those registers run at.
    I've never heard this before and it doesn't seem any more obvious than "the more RAM you install, the slower your CPU will run". (Though I can imagine it being similar to "beyond a certain amount, a system of a given generation must use registered RAM and that is slower".) Do you have a citation?

    Originally posted by oiaohm View Post
    It would be very interesting to see the Itanium instruction set on a CPU with only 16 fast registers being renamed.
    Everything I've read about the Itanium instruction set focused on how the large number of registers, directly addressable by the compiler, was a core element of its design. I'm skeptical that it would be very useful to use ia64 on something with an amd64-sized register file.



    • #62
      Originally posted by ssokolow View Post
      So Intel has standardized on ignoring best practices agreed upon by all communication and data visualization experts? (Things like always labelling your numeric axes and picking the proper kind of chart to highlight the relevant aspects of the data.)

      The books I was assigned in university on the topic are reprints of books written in the 1950s and 1960s which have remained popular ever since. (One is a 56th printing, bought around 2010.)
      I said it is the Intel standard today. But if you follow it back, it is the California Institute of Technology (Caltech) standard from the 1950s; Gordon Moore, of Moore's Law fame, is who we have to blame for it at Intel. This is very much like companies still using COBOL or some other standard we think should have died out. At least the Intel documentation is consistently defective in exactly the same ways. There are parts of the Caltech 1950s standard that do disagree with more modern standards. The idea of highlighting the relevant aspects of the data is anti-Caltech, as it could bias another person's interpretation of the data. Also, not labelling the Y axis means you have to read the other data with it to understand the background of the results; sorry, this is the Caltech 1950s method again. Other than the introduction of colours instead of hatching, nothing has changed in how Intel personnel have done their charts for the past 50 years.

      There is dispute even today over best practices for how to present data without biasing a second opinion. So far the old standard has been serving Intel very well when it comes to getting second opinions on data.


      Originally posted by ssokolow View Post
      That may or may not be true, but it's beside the point. The point is that you argued that register renaming is for mapping some number of logical registers to some smaller number of physical registers. I disputed that and gave citations to back it up. You then tried to fault me for not giving citations for a point I wasn't making.
      Computer Architecture: A Quantitative Approach is the book you need to pick up.

      Let's say you have no register re-usage tricks. As that book states, moving from 8 to 16 CPU registers can increase your performance by up to 100%; this is maths, backed by real-world tests. But going from 16 to 32 real registers only increases performance by about 10 percent at best.

      Originally posted by ssokolow View Post
      I've never heard this before and it doesn't seem any more obvious than "the more RAM you install, the slower your CPU will run". (Though I can imagine it being similar to "beyond a certain amount, a system of a given generation must use registered RAM and that is slower".) Do you have a citation?
      It's part of electrical design and is also covered in the book. The fastest storage in the system is your registers; they have the tightest tolerances, so adding more physical registers to one block means the registers have to run at a slightly slower clock speed, because you have made the electrical circuit bigger. 10% is not that much of a clock drop to reach. The electrical speed limitation is why L1 cache on most processors is 32 KB data and 32 KB instruction; any bigger and you really start hurting your clock speed. Yes, this is why L2 is slower than L1 and L3 is slower than L2: the further you get from the CPU core, the slower you get. It is also why video cards these days are doing system-in-package with the video RAM, to get it electrically closer and therefore faster.

      Think about it this way: for registers to be as fast as possible, the electrical travel distance has to be as short as possible. Give me a simple breadboard with one LED to light and it is really easy to keep that wiring short. Now repeat it with two LEDs and it gets electrically longer. Each time it gets electrically longer, the maximum clock speed drops because of the higher latency. Normally, when you add RAM sticks to a computer they are on wires so long that they are already insanely slow by CPU standards, so it does not make much difference there. Inside the CPU die itself these links are very short, and it is not hard to add something like 100 percent to their length by increasing the register count.
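
      A rough way to actually see this "further from the core means slower" effect from software is a pointer-chasing micro-benchmark. This is only a minimal sketch under my own assumptions (the working-set sizes and the timing loop are illustrative, not taken from any Intel document), and the absolute numbers vary a lot between CPUs:

      /* Sketch: chase a randomly-linked list whose working set is sized to sit in
       * L1, L2, L3 or main memory, and report the average latency of one
       * dependent load. Compile with e.g. "cc -O2 chase.c". */
      #include <stdio.h>
      #include <stdlib.h>
      #include <time.h>

      static volatile size_t sink;   /* keeps the chase loop from being optimized away */

      static double chase_ns(size_t n_elems, size_t hops)
      {
          size_t *next = malloc(n_elems * sizeof *next);
          for (size_t i = 0; i < n_elems; i++)
              next[i] = i;
          /* Sattolo's algorithm: build one big cycle so every hop is a
           * hard-to-prefetch, data-dependent load. */
          for (size_t i = n_elems - 1; i > 0; i--) {
              size_t j = (size_t)rand() % i;
              size_t t = next[i]; next[i] = next[j]; next[j] = t;
          }

          struct timespec t0, t1;
          size_t idx = 0;
          clock_gettime(CLOCK_MONOTONIC, &t0);
          for (size_t h = 0; h < hops; h++)
              idx = next[idx];       /* each load depends on the previous one */
          clock_gettime(CLOCK_MONOTONIC, &t1);
          sink = idx;
          free(next);

          double ns = (t1.tv_sec - t0.tv_sec) * 1e9 + (t1.tv_nsec - t0.tv_nsec);
          return ns / (double)hops;
      }

      int main(void)
      {
          /* Roughly L1-, L2-, L3- and RAM-sized working sets on a typical desktop part. */
          const size_t sizes_kb[] = { 16, 256, 8 * 1024, 256 * 1024 };
          for (size_t i = 0; i < sizeof sizes_kb / sizeof sizes_kb[0]; i++) {
              size_t elems = sizes_kb[i] * 1024 / sizeof(size_t);
              printf("%8zu KB working set: ~%.1f ns per dependent load\n",
                     sizes_kb[i], chase_ns(elems, (size_t)20 * 1000 * 1000));
          }
          return 0;
      }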

      When people talk about register renaming, what they don't get is that in most cases this means using smaller register files inside the CPU for the task. Basically, it is a large number of registers split into smaller register files (blocks) spread out over the CPU core. Register renaming is not really taking a task and placing it inside one big register file with tons of register entries. So the way register renaming works inside the CPU is to take an ISA with a larger register count and make it work with register files that have fewer entries. So it works kind of differently to what you expect.


      You will see it with the second one there, about MIPS with renaming registers. So you have CPU-wide master registers, and you have blocks of renaming registers that are used by register renaming. So let's say your 32-bit x86 ISA code is only using 8 master registers; the other 8 master registers of the x86-64 master register file for that task are left completely unused, as they are not renaming registers. So you are not using an area inside the CPU where you could store another 8 values for rapid access.

      Intel's design documents state the renaming register configuration. The x86 design just does not let 32-bit x86 catch up to and beat 64-bit x86 in anything bar rare cases because of this.

      There are some silicon designs that only run with renaming registers and skip the master registers. It's not as if a master register file is mandatory; these are some of the odd ones that have something like 32 ISA registers and 16 physical registers.

      Originally posted by ssokolow View Post
      Everything I've read about the Itanium instruction set focused on how the large number of registers, directly addressable by the compiler, was a core element of its design. I'm skeptical that it would be very useful to use ia64 on something with an amd64-sized register file.
      Why it would be interesting: to see whether having all those extra registers for the compiler to use helps at all when combined with register renaming. It could be a complete flop as well. There is no real-world data to show how register renaming would behave with this kind of setup, but it would avoid the electrical wiring overhead. It could be good; it could also be downright horrible.



      • #63
        Originally posted by oiaohm View Post
        There is dispute even today over best practices for how to present data without biasing a second opinion. So far the old standard has been serving Intel very well when it comes to getting second opinions on data.
        I'm doubtful that there's dispute over this particular point. It's percentage change, so use a percentage change graph. Aside from making it easier to see whether a given bar is above or below 1.0 (A.K.A. 100%), it's the proper way to remove all that wasteful distance between 0.0 and 1.0 since a percentage change graph can scale from the middle rather than the bottom.

        My suspicion is that they just got lazy and couldn't find a Percentage Change preset for bar charts in PowerPoint's charting wizard.

        Originally posted by oiaohm View Post
        Computer Architecture: A Quantitative Approach is the book you need to pick up.

        Let's say you have no register re-usage tricks. As that book states, moving from 8 to 16 CPU registers can increase your performance by up to 100%; this is maths, backed by real-world tests. But going from 16 to 32 real registers only increases performance by about 10 percent at best.

        It's part of electrical design and is also covered in the book. The fastest storage in the system is your registers; they have the tightest tolerances, so adding more physical registers to one block means the registers have to run at a slightly slower clock speed, because you have made the electrical circuit bigger. 10% is not that much of a clock drop to reach. The electrical speed limitation is why L1 cache on most processors is 32 KB data and 32 KB instruction; any bigger and you really start hurting your clock speed. Yes, this is why L2 is slower than L1 and L3 is slower than L2: the further you get from the CPU core, the slower you get. It is also why video cards these days are doing system-in-package with the video RAM, to get it electrically closer and therefore faster.
        I actually have a used copy of the fourth edition of that which I've been meaning to read. What are the titles of the sections you're referring to?

        I don't have time to read all of that before I go to bed. I'll try to make time to read it and respond to your related comments after I've slept.

        EDIT: I'm going to need at least another day. I hadn't realized how close a couple of deadlines were getting.
        Last edited by ssokolow; 01 March 2019, 05:39 AM.



        • #64
          Originally posted by oiaohm View Post
          No, you have not debunked it at all. You have not given one citation with benchmarks showing that what you believe is backed by real hardware. I have given the benchmarks that show what I am saying on real hardware.
          No, both stefan and I told you that the benchmarks were improperly done, and explained why as well; I'm not going to repeat myself.

          Originally posted by oiaohm View Post
          Again, where are your benchmarks backing all these statements?
          What the fuck do benchmarks have to do with CPU complexity?! Dude, do you even UNDERSTAND what I'm talking about? It seems it's way over your head.

          Originally posted by oiaohm View Post
          That's right, they don't exist, because they are all false. Memory aliasing checks have been an ongoing CPU design problem. CPUs do not have unlimited amounts of time to check all possible code paths. Memory aliasing checks in the CPU are something that in theory can fix the registers pushed to memory and pulled back, but doing this perfectly 100 percent of the time requires effectively infinite time; basically it's a Turing-machine problem. Since your first counter-point is wrong to start off with, you are screwed.
          So much bullshit it's unreal. Please read Agner Fog's optimization and uarch docs. Those are the Bible for this sort of thing.

          Originally posted by oiaohm View Post
          So your claim that 32-bit, with fewer registers, is not more complex to solve is wrong. More registers mean fewer memory aliasing checks, resulting in a more effective solve. Benchmarks also show this is partly because of compiler behaviour: if the compiler has registers to use, it will use them.

          Let's say we have a 2-register system

          R1=1
          R2=3
          R1+R2 into R1=4
          R1 pushed to memory
          R1=3
          R2+R1 into R2=6
          R1 pulled back from memory

          Now let's say the above is on a 4-register system.
          R1=1
          R2=3
          R1+R2 into R1=4
          R3=3
          R2+R3 into R2=6

          At the end of both of these, R1=4 and R2=6
          You realize we have superscalar OoO CPUs, right? So when your "R1 pulled back from memory" happens, it is renamed to a different register, and so it happens at the same time as everything after R1 got pushed to memory. Its value does not depend on anything you do with R1 in between, because it is a completely different register internally (this is the whole point of register renaming). It's easy to do because the CPU knows that by loading a full register width from memory, it will effectively discard all previous contents, so what R1 was before simply doesn't matter. Thus it can simply rename it to a completely unused register at that point and avoid waiting AT ALL.

          This is also why x86_64 automatically zero-extends when you load 32-bit values into a 64-bit register, by design. If it zero-extends, the CPU knows it can simply rename the register, because its upper half won't depend on anything the register held prior (otherwise it would have to wait).
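
          As a small aside, you can see the zero-extension rule in ordinary compiler output. A minimal sketch (the function names are mine, and the exact instruction selection depends on the compiler):

          #include <stdint.h>

          /* Writing a 32-bit register on x86-64 zero-extends the full 64-bit register,
           * so this conversion normally needs nothing beyond the 32-bit move itself:
           * gcc/clang typically emit just "mov eax, edi / ret". */
          uint64_t widen(uint32_t x)
          {
              return x;   /* upper 32 bits are guaranteed zero after a 32-bit write */
          }

          /* Sign extension, by contrast, needs an explicit movsxd-style instruction,
           * because the upper half depends on the sign bit of the 32-bit value. */
          int64_t widen_signed(int32_t x)
          {
              return x;
          }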

          The load/store forwarding latency is only 4 clocks in this case.

          In a real, typical workload, you won't reload the register so soon after pushing it -- you will reload some other memory, which means there's no latency because it gets reloaded as soon as it is written, perhaps from the last loop iteration. The CPU does NOT execute instructions in order, one by one, because we don't use 1970s CPUs.

          And of course the load/store forwarding is already done in the CPU as I mentioned.

          Did you know that on many AMD CPUs it's faster to store in memory and reload than to move between normal register and SSE register?!???

          That's because the CPU already has a lot of this logic done (for the rest of the cases) so it simply reuses it, while a move between a normal register and an SSE register requires extra logic, and they didn't bother that much there since it's not done that often. Trade-offs and all that. It has better throughput, and that's what matters in 99% of cases, not latency. So AMD decided it wasn't worth it.

          tl;dr your 2-register code looks something like this internally in the CPU:

          R111=1
          R37=3
          R111+R37 into R111=4
          ^^^ 1 clock cycle

          R111 pushed to memory
          R99 forward loaded, can do operations with it now (this is the original R1 "renamed")
          R111=3
          R37+R111 into R37=6
          ^^^ 2 clock cycles

          Yes all of those happen at the same time.

          Now using R1 some more would use R99 not R111 because R111 is dead (got completely replaced by the load). The CPU is much smarter than you think it is.
          Last edited by Weasel; 28 February 2019, 01:29 PM.



          • #65
            Originally posted by Weasel View Post
            tl;dr your 2-register code looks something like this internally in the CPU:

            R111=1
            R37=3
            R111+R37 into R111=4
            ^^^ 1 clock cycle

            R111 pushed to memory
            R99 forward loaded, can do operations with it now (this is the original R1 "renamed")
            R111=3
            R37+R111 into R37=6
            ^^^ 2 clock cycles.
            Redo the breakdown again, this time based on the 4-register code.

            R1=1
            R2=3
            R1+R2 into R1=4
            R3=3
            R2+R3 into R2=6
            After register renaming this comes out like the following, majorly different from the 2-ISA-register renaming.
            R1=1 mapped RX1
            R2=3 mapped RX2
            R2=3 mapped RX3
            R3=3 mapped RX4
            RX1+RX2=R1=4
            RX3+RX4=R2=6
            Number of clocks: 1, because the CPU can run these in parallel. No memory operations to queue. Basically, using register renaming, the 4-register ISA has taken half the clock cycles of the 2-register ISA and does not have a memory operation. The more often this happens, the slower the ISA that is short on registers becomes compared to the ISA with more registers.

            Please note: the point of the "R1 pulled back from memory" that you missed is that the next operation, which I did not write, uses R1. For example, R1+R1=R1 could be the next operation. Not even the compiler is going to code a reload from memory if it is not about to straight up use that register value.

            Of course this value is still going to be in registers in the 4-register example. So you are using your CPU's extra registers effectively, with no extra memory operations.

            If you are short on registers in the ISA you will be doing extra memory operations. Memory operations, even to L1 cache, are a hell of a lot slower than using registers.

            Studies put the sweet spot for registers at 16, and the normal 32-bit x86 ISA has 8, so you are screwed.
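
            To make the spill effect concrete, here is a minimal sketch (the function and variable names are mine, and whether a given compiler actually spills depends on its register allocator and options): with more live values than architectural registers, a 32-bit x86 build typically has to push some of them to the stack every iteration, while an x86-64 build can usually keep them all in registers. Compare "gcc -O2 -m32 -S" against "gcc -O2 -S" on it.

            #include <stddef.h>
            #include <stdint.h>

            /* Eight accumulators plus the pointer, the length and the loop counter are
             * more live values than the 8 general-purpose registers of 32-bit x86 can
             * hold, so a 32-bit build tends to spill accumulators to the stack inside
             * the loop; with 16 GPRs on x86-64 they normally all stay in registers. */
            uint32_t sum8(const uint32_t *a, size_t n)
            {
                uint32_t s0 = 0, s1 = 0, s2 = 0, s3 = 0;
                uint32_t s4 = 0, s5 = 0, s6 = 0, s7 = 0;
                for (size_t i = 0; i + 8 <= n; i += 8) {
                    s0 += a[i + 0]; s1 += a[i + 1];
                    s2 += a[i + 2]; s3 += a[i + 3];
                    s4 += a[i + 4]; s5 += a[i + 5];
                    s6 += a[i + 6]; s7 += a[i + 7];
                }
                return s0 + s1 + s2 + s3 + s4 + s5 + s6 + s7;
            }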

            Weasel, I gave you 2-register and 4-register examples, and you, moron, believed you had proven me wrong without doing the 4-register one as well. I provided both; if you had done the register renaming solve on both, the problem would be plainly in your face. An extra operation to memory and back has sneaked in due to the lack of registers in the ISA.

            PS Weasel: that R111 you overwrote, so you lost its value out of the registers and are now forced to pull it back from L1; highly smart of your version of the CPU, right? This is exactly the kind of mistake a CPU running a register-starved ISA starts making with register renaming.
            Last edited by oiaohm; 06 March 2019, 10:33 AM.



            • #66
              Originally posted by oiaohm View Post
              Number of clocks: 1, because the CPU can run these in parallel. No memory operations to queue. Basically, using register renaming, the 4-register ISA has taken half the clock cycles of the 2-register ISA and does not have a memory operation. The more often this happens, the slower the ISA that is short on registers becomes compared to the ISA with more registers.

              Please note: the point of the "R1 pulled back from memory" that you missed is that the next operation, which I did not write, uses R1. For example, R1+R1=R1 could be the next operation. Not even the compiler is going to code a reload from memory if it is not about to straight up use that register value.
              I don't think you understand that operations are not executed in-order. We have out-of-order CPUs.

              As long as the reloaded register does not depend on the former value of R1, it can be calculated by the CPU anytime it wants. It doesn't have to be after or "at the same time"; it could even be before.

              Originally posted by oiaohm View Post
              PS Weasel: that R111 you overwrote, so you lost its value out of the registers and are now forced to pull it back from L1; highly smart of your version of the CPU, right? This is exactly the kind of mistake a CPU running a register-starved ISA starts making with register renaming.
              Also note that it does not hit a reload from L1 cache, since it is store-to-load forwarding. It does incur an additional clock or two of latency, true, but the issue in most cases is not latency, but throughput. Since it's executed out of order, its latency usually does not matter, unless you have a very long dependency chain (and I mean a very long one).
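
              As a rough illustration of where that forwarding kicks in, a minimal sketch (names are mine; the volatile pointer is only there to force the compiler to actually emit both the store and the reload):

              #include <stdint.h>

              /* Each iteration stores a value and immediately reloads it from the same
               * address and width. On a modern x86 core the reload is satisfied by
               * store-to-load forwarding out of the store buffer, costing a few cycles
               * of latency rather than a full trip through the L1 write path. */
              uint64_t store_reload(volatile uint64_t *tmp, uint64_t x, int iters)
              {
                  uint64_t acc = 0;
                  for (int i = 0; i < iters; i++) {
                      *tmp = x + (uint64_t)i;   /* store */
                      acc += *tmp;              /* dependent reload: forwarded */
                  }
                  return acc;
              }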

              That's why stores/loads to memory are recommended on some CPUs even instead of moves from a normal register to an SSE register. The CPUs are designed to handle a large number of stores/loads, and this way you make use of that. Less so for inter-register moves like that. And again, throughput is what's important, not latency.



              • #67
                Originally posted by Weasel View Post
                Also note that it does not hit a reload from L1 cache, since it is store-to-load forwarding. It does incur an additional clock or two of latency, true, but the issue in most cases is not latency, but throughput. Since it's executed out of order, its latency usually does not matter, unless you have a very long dependency chain (and I mean a very long one).
                Compilers make the longest dependency chains they think they can get away with; this is what Intel's benchmarks found. So very long dependency chains are the norm, not the rare event. This is why 32-bit with 8 registers is hit so hard and 32-bit with 16 registers is faster most of the time.

                Originally posted by Weasel View Post
                As long as the reloaded register does not depend on the former value of R1, it can be calculated by the CPU anytime it wants. It doesn't have to be after or "at the same time"; it could even be before.
                My 2-register version of the code directly depends on a former value of R1 that had been pushed out to memory; I designed it that way. The 4-register version does not face that problem. As for "as long as the reloaded register does not depend on the former value": Intel's benchmarks, and some of their more detailed white papers on the topic, say this is only likely to be true about 20 percent of the time; yes, 80 percent of the time there is going to be a dependency at 8 registers. The numbers change at 16 registers: 90 percent of the time there is no dependency on the former value, 10 percent of the time there is. At 32 registers the dependency on a former value drops to basically zero; that seems to be as complex as we humans can code most of the time.

                There is another road block: a push to memory that the CPU has received the instruction for still has to be performed.

                Throughput, as you said, is important. Memory bandwidth is a limited resource. Bandwidth to L1 is the one you will max out and kill performance with, and 32-bit code with only 8 registers gets there before 32-bit code with 16 registers does. Excess memory operations are a pure nightmare when you want to go faster.

                Register renaming does not help with the excess memory operations caused by not having enough registers to handle the complexity of the code humans write.

                Register renaming is a good way to make a particular ISA go faster than it would without it. But register renaming does nothing to solve the throughput throttling caused by running out of memory bandwidth. ISA register count has a direct effect on memory bandwidth usage, because it alters the number of memory operations the compiler will, on average, add to the code.

                Register renaming is in fact a little more evil than you think. Using register renaming you can process more, quicker, and with something like 32-bit x86 code with 8 registers you are also increasing your writes to L1 faster as well, bringing the bottleneck closer. 32-bit x86 code with 16 registers has fewer memory operations, so there is more L1 bandwidth to use and more throughput before being choked by L1. 64-bit x86 still has 16 general registers, but instead of being restricted to 32 bits they are 64 bits wide; a lot of operations are able to take advantage of that and reduce memory operations a little more, as well as improving operational effectiveness.

                Get it? Register renaming is not solving the lack-of-registers problem in the ISA; it is one of the reasons why a lack of registers in the ISA reaches out and chokes system throughput with memory operations so fast, particularly once you start doing speculative execution.

                Register renaming works well at this stage when you have 16 or more registers in the ISA. Under 16 registers in the ISA, register renaming makes this particular issue worse by being faster. Basically, register renaming makes you deplete your L1 bandwidth faster when you have an ISA without enough registers for how humans program, because how compilers turn that human-written code into binaries results in increased memory operations. Adding register renaming is not a solution to a lack of memory bandwidth, right? Increasing the ISA register count, in particular cases, can decrease the memory bandwidth requirement.

                Register renaming is not a cure for an ISA that lacks registers. The cure for an ISA that lacks registers is to move to an ISA with enough registers; current studies say between 16 and 32 registers, but we have not tested more, and these numbers might only be because programmers are used to coding under 8-register limitations. In a decade we could be looking at somewhere between 32 and 64 registers being the sweet spot.



                • #68
                  Originally posted by oiaohm View Post
                  Compilers make the longest dependency chains they think they can get away with; this is what Intel's benchmarks found. So very long dependency chains are the norm, not the rare event. This is why 32-bit with 8 registers is hit so hard and 32-bit with 16 registers is faster most of the time.
                  What the hell do compilers have to do with anything? Dependency chains are a feature of an algorithm; the compiler can't do shit about them.

                  For example, loops are where it matters most. If a loop carries a dependency between iterations, that's the long dependency chain. This is rare. Most loops operate on an array of data, and so either can be parallelized with SIMD or simply don't depend much on the previous iteration for the next array elements.

                  Note that having more registers DOES NOT help you with dependency chains in the least; it only decreases latency a tiny bit (and it doesn't even "add up" in most cases, because the CPU can already start the forwarding while it is processing the rest).
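
                  To illustrate the difference, a minimal sketch (function names are mine): the first loop has no dependency between iterations and is a good SIMD candidate; the second carries its accumulator through every iteration, which is the long-dependency-chain case.

                  #include <stddef.h>

                  /* Independent iterations: out[i] depends only on a[i] and b[i], so the
                   * CPU (and the auto-vectorizer) can keep many elements in flight. */
                  void add_arrays(float *out, const float *a, const float *b, size_t n)
                  {
                      for (size_t i = 0; i < n; i++)
                          out[i] = a[i] + b[i];
                  }

                  /* Loop-carried dependency: every multiply-add needs the previous
                   * result, so speed is limited by the latency of that chain no matter
                   * how many registers or execution units are available. */
                  float horner(const float *coeff, size_t n, float x)
                  {
                      float acc = 0.0f;
                      for (size_t i = 0; i < n; i++)
                          acc = acc * x + coeff[i];   /* depends on last iteration's acc */
                      return acc;
                  }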

                  Originally posted by oiaohm View Post
                  My 2-register version of the code directly depends on a former value of R1 that had been pushed out to memory; I designed it that way. The 4-register version does not face that problem. As for "as long as the reloaded register does not depend on the former value": Intel's benchmarks, and some of their more detailed white papers on the topic, say this is only likely to be true about 20 percent of the time; yes, 80 percent of the time there is going to be a dependency at 8 registers. The numbers change at 16 registers: 90 percent of the time there is no dependency on the former value, 10 percent of the time there is. At 32 registers the dependency on a former value drops to basically zero; that seems to be as complex as we humans can code most of the time.

                  There is another road block: a push to memory that the CPU has received the instruction for still has to be performed.

                  Throughput, as you said, is important. Memory bandwidth is a limited resource. Bandwidth to L1 is the one you will max out and kill performance with, and 32-bit code with only 8 registers gets there before 32-bit code with 16 registers does. Excess memory operations are a pure nightmare when you want to go faster.
                  Well it's clear you don't read so I'll stop bothering trying to answer you. Stay ignorant if you like it.

                  Memory bandwidth is expensive, but you're never touching it here. The L1 cache is already used, you already use it, but the RELOAD in this case is forwarded and DOES NOT HAPPEN from the L1 cache.

                  Just read up on store-to-load forwarding. That's exactly what one Spectre variant even abuses. There's no L1 cache hit in your example.

                  You seriously VASTLY underestimate how smart CPUs are, but it doesn't surprise me, since you think "speculative execution is bad" despite the fact that it is exactly the reason x86 outperforms Itanium in most real code. And speculative execution is not just about branches.
                  Last edited by Weasel; 07 March 2019, 08:46 AM.



                  • #69
                    Originally posted by Weasel View Post
                    Just read up on store-to-load forwarding. That's exactly what one Spectre variant even abuses. There's no L1 cache hit in your example.
                    Stop attempting to use magic bullets to dig your way out.

                    Intel’s modern implementations of store-to-load forwarding succeeds in almost all cases where it is practical to forward. The only case where forwarding fails (load partially needs to forward) occurs rarely in practice, so it is probably not worth implementing hardware to improve this. The high penalty for a correctly-predicted dependent load is a bit of a concern though, as it is now more than 3× worse than for the older Yorkfield.
                    It would have paid to read up on it yourself. In this case they were testing in 64-bit x86 mode. Run the same test again in 32-bit x86 mode and the thing that is claimed to occur rarely in practice now happens regularly.

                    Yes, even when you don't hit L1 you still pay a high penalty when store-to-load forwarding works, so you still have lower throughput than if you had avoided the memory operation and stuck to registers.

                    Originally posted by Weasel View Post
                    You seriously VASTLY underestimate how smart CPUs are, but it doesn't surprise me, since you think "speculative execution is bad" despite the fact that it is exactly the reason x86 outperforms Itanium in most real code. And speculative execution is not just about branches.
                    You are thinking CPUs are magic, Weasel; there is a cost to pay for store-to-load forwarding and register renaming.

                    Register renaming and store-to-load forwarding don't fix the ISA issue of a lack of registers. Again, store-to-load forwarding improves the outcome by reducing L1 work a bit, but the price paid for this operation is not cheap.

                    Yes, the cost of speculative execution and the issues with store-to-load forwarding, compared to having enough registers, mean it is particularly bad to be using store-to-load forwarding under speculative execution if you can avoid it.

                    Weasel, what under-researched CPU feature are you going to bring out next to attempt to save this argument? You are throwing up every single counter-argument of someone who does not know this topic. So you are the usual idiot who believes CPU features are magic bullets that will solve ISA errors. CPU features can reduce the harm of an ISA problem, but they only reduce the harm, they do not fix it. A fix requires fixing the ISA.


                    Let's correct you on that speculative execution thing. Itanium has all the speculative execution features of modern x86, with only one problem: the Itanium implementation is in fact secure, and it required the programmer to provide more information to the CPU so that it would not error their code out when it saw possibly suspect speculative execution paths. To be horrible about it: once they fix all the faults in x86 speculative execution we are going to be slowed down quite a bit, so x86's advantage over Itanium might not have been as much as we thought. x86 has been faster than Itanium by being insecure.





                    • #70
                      Originally posted by oiaohm View Post
                      Stop attempting to use magic bullets to dig your way out.
                      http://blog.stuffedcow.net/2014/01/x...isambiguation/
                      Intel’s modern implementations of store-to-load forwarding succeeds in almost all cases where it is practical to forward. The only case where forwarding fails (load partially needs to forward) occurs rarely in practice, so it is probably not worth implementing hardware to improve this. The high penalty for a correctly-predicted dependent load is a bit of a concern though, as it is now more than 3× worse than for the older Yorkfield.
                      It would have paid to read up on it yourself. In this case they were testing in 64-bit x86 mode. Run the same test again in 32-bit x86 mode and the thing that is claimed to occur rarely in practice now happens regularly.
                      Dude what the fuck are you talking about? Just shut up already, you don't understand what you are even reading.

                      You don't even know what a "partial load" is, holy christ. If anything, if x86_64 didn't zero-extend to 64 bits, it would suffer from this all the time, any time it used a 32-bit register. They had to do it because 32-bit values are used a lot more than 64-bit values. So it was saved by a design decision, and it is a case that x86 (32-bit) doesn't suffer from in the first place. Both suffer if you load at a different offset (i.e. store at [rax] and read from [rax+1] when it's a word or dword or qword), but again that has nothing to do with your example, which doesn't load ANYTHING partially.

                      There's literally no reason to argue with someone like you when you don't even understand basics. Keep babbling.

                      Obviously I'm not going to be able to explain stuff better than articles that YOU GET WRONG so this is hopeless. Lost cause.

