Linux Kernel Orphans Itanium Support, Linus Torvalds Acknowledges Its Death


  • #11
    Originally posted by vladpetric View Post
    I think you're also forgetting about the instruction trace caches in modern designs, which for a lot of codes get decoding off the critical path.
    Not to be overly pedantic, but I think you meant to say uop caches rather than trace caches. While both store micro-operations and therefore bypass the hardware decoder, uop caches do not have trace-building and selection circuitry. To my knowledge, the Pentium 4 was the only commercial processor that used a trace cache.
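
    For what it's worth, the lookup difference can be sketched with a couple of toy Python dictionaries (nothing here models a real design; the addresses, uop names, and keys are made up purely for illustration): a uop cache is indexed by the address of a fetch block, while a trace cache is indexed by a start address plus a predicted branch path, because the trace-build logic stitched uops together across taken branches when the line was filled.

    # Toy contrast between a uop cache and a trace cache; purely illustrative.
    uop_cache = {
        # key: fetch-block address -> decoded uops for that block, in program order
        0x1000: ["add", "cmp", "jne"],
        0x2000: ["load", "mul", "store"],
    }

    trace_cache = {
        # key: (start address, predicted branch directions) -> a trace of uops
        # that the trace-build logic stitched together across taken branches
        (0x1000, ("taken",)):     ["add", "cmp", "jne", "load", "mul", "store"],
        (0x1000, ("not_taken",)): ["add", "cmp", "jne", "sub", "ret"],
    }

    def fetch_from_uop_cache(pc):
        # One lookup per fetch block; a taken branch simply redirects the next lookup.
        return uop_cache.get(pc)

    def fetch_from_trace_cache(pc, predicted_path):
        # The predicted path is part of the key, so the same start address can
        # hit different traces depending on the branch predictor.
        return trace_cache.get((pc, tuple(predicted_path)))

    print(fetch_from_uop_cache(0x1000))
    print(fetch_from_trace_cache(0x1000, ["taken"]))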



    • #12
      Originally posted by vladpetric View Post
      Is it really annoying to decode instructions when they are arbitrary size, in bytes? Yes. Is it doable though? Yes. (and I've talked to designers of x86 as well).
      The problem is that it scales very poorly as you try to decode more instructions in parallel. This is because the start of each instruction depends on the lengths of the previous ones.

      Originally posted by vladpetric View Post
      I think you're also forgetting about the instruction trace caches in modern designs, which for a lot of codes get decoding off the critical path.
      u-op caches are good for relatively tight code, but there are lots of examples where the time spent is too spread out to get much benefit from them.
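
      To make the decode-scaling point above concrete, here is a toy sketch (plain Python, not modeling any real ISA; the assumption is simply that the first byte of each instruction encodes its total length): with variable-length instructions, finding where instruction N starts requires having decoded the lengths of instructions 0 through N-1, while with a fixed-width encoding every start offset is known up front, so N decoders can all begin in the same cycle.

      # Toy model: byte[0] of each "instruction" encodes its length in bytes.
      # (Purely illustrative; real x86 length decoding is far messier.)
      code = bytes([2, 0xAA,             # 2-byte instruction
                    4, 1, 2, 3,          # 4-byte instruction
                    1,                   # 1-byte instruction
                    3, 0xBB, 0xCC])      # 3-byte instruction

      def variable_length_starts(code):
          # Inherently serial: instruction i's start depends on the decoded
          # lengths of instructions 0..i-1.
          starts, pc = [], 0
          while pc < len(code):
              starts.append(pc)
              pc += code[pc]
          return starts

      def fixed_length_starts(code, width=4):
          # Trivially parallel: every start offset is just i * width.
          return [i * width for i in range(len(code) // width)]

      print(variable_length_starts(code))    # [0, 2, 6, 7]
      print(fixed_length_starts(bytes(16)))  # [0, 4, 8, 12]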



      • #13
        I'm hoping Intel takes up maintenance of it and gives it another go. Highly unlikely, I know, but as Intel looks around for life after x86, they might come to rediscover IA64's unrealized potential.



        • #14
          Originally posted by coder View Post
          The problem is that it scales very poorly as you try to decode more instructions in parallel. This is because the start of each instruction depends on the lengths of the previous ones.


          u-op caches are good for relatively tight code, but there are lots of examples where the time spent is too spread out to get much benefit from them.
          In my experience, u-op caches don't work well for workloads with a really large instruction footprint, such as databases and the kernel. However, those tend to have low IPC, to the point that instruction decode width is not even remotely close to being a problem (MLP, memory-level parallelism, is a much bigger deal there). There are other cases, sure, and feel free to cite some that also have high IPC.



          • #15
            Originally posted by coder View Post
            The problem is that it scales very poorly as you try to decode more instructions in parallel. This is because the start of each instruction depends on the lengths of the previous ones.


            u-op caches are good for relatively tight code, but there are lots of examples where the time spent is too spread out to get much benefit from them.
            Please show me a decode-bound workload.



            • #16
              Originally posted by vladpetric View Post
              Please show me a decode-bound workload.
              I'm guessing some JIT code probably tends towards that end of the spectrum. Maybe also some video compression cases, where I seem to recall that x264 was having problems fitting some of its loops in the L1 instruction cache.

              BTW, why reply to the same message twice?



              • #17
                Originally posted by vladpetric View Post
                Do you need additional resources to implement the ISA? Yes. Do those matter? Well, it depends. On a mobile chip they might, from a power consumption perspective (and we're really not in the 1900s anymore with transistor budgets ... ). On a desktop/server chip? Absolutely not.
                I don't understand why you think power doesn't matter for desktop or server. Power is clearly a limiting factor for Intel (and some of AMD's top-binned chips) and matters a great deal for datacenter customers.

                Die area matters, too. Look at the Ampere Altra, which packs 80 cores into much less silicon than AMD uses for 64, and will soon go to 128!



                • #18
                  Originally posted by coder View Post
                  I don't understand why you think power doesn't matter for desktop or server. Power is clearly a limiting factor for Intel (and some of AMD's top-binned chips) and matters a great deal for datacenter customers.

                  Die area matters, too. Look at the Ampere Altra, which packs 80 cores into much less silicon than AMD uses for 64, and will soon go to 128!
                  It does matter, just not as much. A desktop processor idles at tens of watts; a mobile one, somewhere in the low hundreds of mW (a couple of orders of magnitude lower; a similar argument applies to TDP). Sure, area matters too, but not as much now as 20 years ago (to put it differently, the fraction of die area the decoder takes has been shrinking considerably, with big caches dominating).



                  • #19
                    Originally posted by coder View Post
                    I'm guessing some JIT code probably tends towards that end of the spectrum. Maybe also some video compression cases, where I seem to recall that x264 was having problems fitting some of their loops in L1 instruction cache.

                    BTW, why reply to the same message twice?
                    Well, you can take HandBrake or something similar, run an x264 benchmark, and measure both IPC and icache performance. It really shouldn't be a big deal at all (perf stat <command> will probably give you that; see the sketch below).

                    What's the big deal?
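
                    A rough version of that measurement, as a Python wrapper around perf stat (the event names below are the generic perf aliases and can differ by CPU, so treat this as a sketch; the x264 invocation in the final comment is only an example):

                    import subprocess, sys

                    def measure(cmd):
                        # Generic perf event aliases; they vary by CPU (check perf list).
                        events = "instructions,cycles,L1-icache-load-misses"
                        # perf stat writes its counters to stderr; -x, selects CSV output.
                        res = subprocess.run(
                            ["perf", "stat", "-x,", "-e", events, "--"] + cmd,
                            stdout=subprocess.DEVNULL, stderr=subprocess.PIPE, text=True)
                        counts = {}
                        for line in res.stderr.splitlines():
                            f = line.split(",")
                            # CSV fields: value, unit, event name, ...; names may carry
                            # a modifier suffix such as :u.
                            if len(f) >= 3 and f[0].replace(".", "", 1).isdigit():
                                counts[f[2].split(":")[0]] = int(float(f[0]))
                        needed = ("instructions", "cycles", "L1-icache-load-misses")
                        if not all(k in counts for k in needed):
                            sys.exit("some events were not counted; check perf list")
                        ipc = counts["instructions"] / counts["cycles"]
                        mpki = 1000.0 * counts["L1-icache-load-misses"] / counts["instructions"]
                        print(f"IPC: {ipc:.2f}   L1 icache MPKI: {mpki:.1f}")

                    if __name__ == "__main__":
                        # e.g. python3 measure.py x264 -o out.mkv in.y4m
                        measure(sys.argv[1:])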



                    • #20
                      Originally posted by coder View Post
                      they might come to rediscover IA64's unrealized potential.
                      One of the problems with IA64 was that it was too far ahead of the compiler technology of its time: getting good performance out of it required compiler capabilities that were not widely available (hand-written assembly could show impressive results, but that is not practical for large code bases). Another problem was that Intel was unwilling to take the leap of faith, fully commit, put IA64 on their most advanced lithography, and displace their existing (and profitable) x86 processors, which were already supply constrained, so all the IA64 processors ended up a generation or two (or more) behind in speeds and feeds.

