Glibc 2.39 Should Be Out On 1 February & Might Drop Itanium IA64 Linux Support


  • #21
    I only had hands-on experience with Merced. It was quick, and compilers absolutely could do a good job; some could do a great job. Really great? No. Close to optimal? No. Now, I worked for the competition and have some bias, but I believe the "blame the compilers" line exists largely because Intel and HP assembled a shitty team that never produced anything interesting; to this day I'm still unaware of any interesting output from them. And to be fair, they weren't run by a group with serious software chops, so blaming the software is an easy institutional explanation that feels better than saying Intel made a crappy chip.

    EPIC instructions came in bundles of three - a 128-bit bundle holds three 41-bit instruction slots plus a 5-bit template - issued as one unit. Think of it as instruction fusion. Compilers absolutely could, and did, take that into consideration. Additionally, Merced would prefetch both sides of a branch, assuming your compiler (really just the instruction tiler) could encode that, and there wasn't a misprediction penalty; this actually made the compiler easier to write in some ways. It also had a surplus of registers and all sorts of other things that helped compilers too; optimal register allocation was a very hot and challenging problem at the time. I don't know that Rice's theorem applies any more to Merced than to any other processor.
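
    For readers who have not seen the encoding: below is a rough, illustrative model of an IA-64 bundle as documented in the architecture manuals (it is not tied to any particular toolchain). The point is that the 5-bit template fixes the unit types and dependence "stops" for all three slots, so the grouping really is decided entirely at compile time.

```c
#include <stdint.h>

/* Illustrative model of a 128-bit IA-64/EPIC bundle:
   bits 0..4    -> 5-bit template (unit types + stop bits)
   bits 5..45   -> instruction slot 0 (41 bits)
   bits 46..86  -> instruction slot 1 (41 bits)
   bits 87..127 -> instruction slot 2 (41 bits) */
typedef struct {
    uint64_t lo;   /* bits  0..63  */
    uint64_t hi;   /* bits 64..127 */
} ia64_bundle;

static inline unsigned bundle_template(ia64_bundle b) {
    return (unsigned)(b.lo & 0x1f);
}

static inline uint64_t bundle_slot(ia64_bundle b, int slot) {
    const uint64_t mask = (1ull << 41) - 1;
    unsigned start = 5 + 41 * (unsigned)slot;
    if (start + 41 <= 64)                    /* slot 0: entirely in lo */
        return (b.lo >> start) & mask;
    if (start >= 64)                         /* slot 2: entirely in hi */
        return (b.hi >> (start - 64)) & mask;
    /* slot 1 straddles the 64-bit boundary */
    return ((b.lo >> start) | (b.hi << (64 - start))) & mask;
}
```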

    It failed for the same reasons that almost all hardware architectures fail: it was expensive because of low volume, it wasn't enough faster to justify the loss of compatibility, and the biggest technical miss was that it had no energy story just when laptops were starting to match PCs in sales. It lacked the enterprise reliability and serviceability capabilities that IBM and Sun needed, so they didn't adopt it, which further hurt the volume, and then all of Intel's grand predictions just turned into dramatic failures. Making x86 64-bit fixed the most immediate technical needs of the market, and you could still run all your old DOS and Windows stuff on it.

    Comment


    • #22
      Originally posted by Dawn View Post

      Seems reasonable. I'm sure you have specific criticisms with that in mind, not just "muh compilers" or "EPIC is bad."

      What is it? Advanced loads? RSE? Rotation for loop pipelining?
      It's nice that you dismiss one of the core issues of Itanium as "muh compilers" without giving any reason.

      But the thing is, it was a pretty big issue. There is a good reason why current architectures don't try to push scheduling onto the compiler the way Itanium did. It's basically impossible to create a compiler for Itanium that reaches the efficiency other architectures get. The compiler has no way to predict exactly when and where things will be available in memory (you can't predict things like interrupts, etc.). It can only guess, and a guess is a guess - sometimes valid, but sometimes (in this case quite often) invalid. So you have an architecture where the compiler has to guess at build time what the output code should look like, and whenever that guess is wrong, performance is wasted. That is not the definition of a good architecture.
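
      To make that concrete, here is a minimal C sketch (mine, not from the thread) of the classic worst case for static scheduling: pointer chasing. The latency of each p->next load depends on which cache level (or DRAM) the node happens to sit in, which varies from run to run; a dynamically scheduled core simply overlaps other work with the miss, while a static schedule has to be built around a guessed latency.

```c
struct node { struct node *next; long value; };

/* Walking a linked list: the load of p->next is serialized and its
   latency (L1 hit vs. DRAM miss) is unknowable at compile time. */
long sum_list(const struct node *p) {
    long s = 0;
    while (p) {
        s += p->value;   /* independent work the hardware can overlap */
        p = p->next;     /* dependent load with unpredictable latency */
    }
    return s;
}
```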

      Add the poor performance of executing x86 code on top of that, and you have enough reasons why Itanium failed.
      Last edited by dragon321; 06 January 2024, 04:58 PM.

      Comment


      • #23
        Originally posted by Svyatko View Post

        Elbrus (ExpLicit Basic Resources Utilization Scheduling) uses VLIW https://en.wikipedia.org/wiki/Elbrus_(computer)
        https://en.wikipedia.org/wiki/Elbrus_2000
        Well, Elbrus is a bit of a disaster in terms of performance or performance per Watt. See for instance https://en.wikipedia.org/wiki/Elbrus-8S

        The way I look at it: the Russians didn't believe the Itanic was really such a failure, so they wanted their own version, the Itansky.

        Comment


        • #24
          Originally posted by DavidBrown View Post

          In theory, theory and practice are the same. In practice, they are different. There's nothing wrong with your theory here, but the practical issues are completely different.

          This is just a political/ideological meme. These kinds of memes are claimed to be true because they sound clever and (maybe) they are funny. Does that make them true under any circumstances? NO. As far as I'm concerned, what can be claimed without evidence can also be dismissed without evidence. As an aside, it is quite typical for such memes to be used in political "debates". But more about that later.

          Are there differences between theoretical computer science and applied computer science? Yes, absolutely there are (see, for instance, galactic algorithms). But in the vast majority of cases the results of theoretical computer science (including, but not limited to, the myriad of lower bounds, or NP-completeness/hardness) apply. The burden of proof should be on the person making a claim that a theoretical result doesn't apply, not the other way around.

          The practice of the halting problem is as follows: there are no good automatic solvers for the halting problem, period - no practical tools, applying to practical code, that detect even a large majority of infinite loops, let alone all of them. It should be fairly easy to disprove such a statement, by the way (just show me one ...). The manual tools one uses (compiler, debugger, etc.) don't do this, period. Just because I can use one, and it sometimes leads me to a situation that might be an infinite loop, doesn't mean much when I still have to figure it out myself anyway. The compiler is an automated tool: it must prove certain invariants (e.g., that two references can't alias) before doing certain optimizations. So arguments about what one can do by hand don't apply ...
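
          As a small, made-up example of the "prove the invariant or don't optimize" point: without an aliasing guarantee the compiler must assume the store through a may change *b, and reloads *b on every iteration. The restrict qualifier hands it the invariant it cannot establish on its own.

```c
/* Without 'restrict', the compiler must assume a[i] and *b may alias,
   so *b is reloaded on every iteration.  With 'restrict', *b can be
   hoisted out of the loop. */
void scale(float *restrict a, const float *restrict b, int n) {
    for (int i = 0; i < n; i++)
        a[i] *= *b;
}
```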

          Also, the evidentiary value of a program halting on some test inputs is low. It really doesn't mean that it won't loop forever on some other type of input.

          You correctly identify one of the nails in Itanium's coffin, namely small basic blocks and arbitrary control flow (IIRC the average for SPEC cpu integer 2000 & 2006 was around 5-6 instructions per basic block). But you refuse somehow to accept that this kind of arbitrary control flow is critical to Turing complete languages with their Halting problem and Rice's theorem.

          Ok, what is the second major nail? A compiler simply can't tell ahead of time how the cache hierarchy will behave at run time. Yeah, you can easily profile problem loads - typically those loads are misses all the way to memory that take 200-400 cycles to complete. However, a compiler can't do that much with that information; the problem is that you need to insert a prefetch way in advance for it to be effective. E.g., if your IPC is 1, which is fairly low for modern processors, you need to insert the prefetch about 100 dynamic instructions earlier to reduce the latency by 100 cycles. Now that is really difficult for a variety of reasons. BTW, I have done quite a bit of research in this area, so I really do know what I'm talking about.
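
          For readers who haven't done this by hand, here is roughly what "inserting a prefetch way in advance" looks like in practice. It is a sketch of mine using the GCC/Clang __builtin_prefetch intrinsic; the distance of 64 iterations is an illustrative guess, and picking it well requires exactly the latency/IPC information a compiler doesn't reliably have.

```c
#include <stddef.h>

#define PF_DIST 64   /* assumed prefetch distance, in iterations */

/* Gather-style sum: data[idx[i]] is the likely cache miss. */
double sum_gather(const double *data, const size_t *idx, size_t n) {
    double acc = 0.0;
    for (size_t i = 0; i < n; i++) {
        if (i + PF_DIST < n)                       /* prefetch well ahead */
            __builtin_prefetch(&data[idx[i + PF_DIST]], 0, 1);
        acc += data[idx[i]];
    }
    return acc;
}
```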

          But even if you profile more aggressively (collect much more information, e.g., by running an actual processor simulation), you still have a more fundamental problem: one run is not necessarily going to be representative of another. So in one run you could have a set of performance-critical loads all hitting in the L2, but in another they miss in the L3 and go all the way to memory. And the VLIW compiler is somehow supposed to schedule these misses statically? Note that when the VLIW political movement was at its peak (roughly the late 90s), misses simply didn't matter as much. But the trend of memory getting slower relative to CPU power was already there ... it was fairly obvious that more cache hierarchy levels were going to be introduced, which would result in higher variability in the latency of a single instruction. Dynamic scheduling can handle this variability. Can static scheduling handle it? No.

          There are proposals for addressing this issue (e.g., run-ahead), but honestly they all fell short.

          What about the third nail? This is a bit of a meta-nail, overlapping perhaps with the other two. Well, as you said, Rice's theorem applies irrespective of the instruction set. Yes, absolutely correct! (The theorem is about Turing-complete programming languages, not the ISA.) The problem is that VLIW requires the compiler to do a lot more than regular compiler optimization. Essentially, for VLIW the compiler doing an amazing job with instruction scheduling is a hard requirement. In comparison, compiler optimizations for regular dynamically scheduled processors are nice-to-haves/optional, because those processors are intrinsically capable of high IPC (the most powerful and efficient processor line at this time, the M line from Apple, can issue a maximum of ~11 integer instructions per cycle - something that can't be sustained continuously, but that still provides much better performance than something that can only issue 3).

          The CPU performance world has been plagued by cherry-picking from day one, and things are not much better now. It's always easy to pick one example that makes one's point, e.g., DSP working well with VLIW, but that tells you almost nothing about how things will work on other types of workloads. So yeah, cherry-picking cum laude.

          BTW, performance critical DSP for modern processors is done by hand (assembly or intrinsics) and small vectors (SSE, AVX, and similar instruction sets for ARM etc). The compilers genuinely suck at vectorizing actually ... (even with the most recent compilers, it's faster to write a vectorized loop by hand, either with assembly or intrinsics, than it is to massage the code for the compiler to do it).
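
          To illustrate the intrinsics-versus-autovectorization point, here is a small sketch (mine) of a dot product written both ways with plain SSE. Note the explicit reassociation of the floating-point sum in the vector version - precisely the transformation a compiler will not do on its own without something like -ffast-math, which is one common reason reductions don't autovectorize.

```c
#include <immintrin.h>   /* SSE intrinsics */
#include <stddef.h>

/* Scalar reference version. */
float dot_scalar(const float *a, const float *b, size_t n) {
    float s = 0.0f;
    for (size_t i = 0; i < n; i++)
        s += a[i] * b[i];
    return s;
}

/* Hand-vectorized: 4 floats per iteration, sum reassociated into 4 lanes. */
float dot_sse(const float *a, const float *b, size_t n) {
    __m128 acc = _mm_setzero_ps();
    size_t i = 0;
    for (; i + 4 <= n; i += 4)
        acc = _mm_add_ps(acc, _mm_mul_ps(_mm_loadu_ps(a + i),
                                         _mm_loadu_ps(b + i)));
    float lanes[4];
    _mm_storeu_ps(lanes, acc);
    float s = lanes[0] + lanes[1] + lanes[2] + lanes[3];
    for (; i < n; i++)               /* scalar tail */
        s += a[i] * b[i];
    return s;
}
```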

          To be continued ...​
          Last edited by vladpetric; 07 January 2024, 04:58 PM.

          Comment


          • #25
            Originally posted by DavidBrown View Post
            You could avoid all the complications of hardware scheduling, speculative execution, register renaming, branch prediction, etc., that makes x86 processors so complicated and means that a large proportion of the power used by x86 devices is "bureaucracy", rather than actually executing instructions.
            At a technical level, you're maybe partly correct. Strictly speaking, if you can get rid of something that consumes power, then yes, you get a net gain.

            But:

            1. You do have to do something better instead. In this case, VLIW actually made things worse. Through poor resource utilization, the performance per Watt became far worse.
            2. Your power breakdown is based on the situation in the late 90s/early 2000s. What you're so up in arms about matters a lot less these days.

            At a political level, the above statement indicates to me that you're part of the large group of dynamic scheduling haters. Of course, nobody likes being called a hater, and at the same time this won't make them stop hating, either.

            In general, reformist political movements work as follows:

            1. One starts with something to hate (in this case, dynamic scheduling/out-of-order execution). This is fairly critical, because hate/fear/disgust are far more powerful motivators than enthusiasm/love. A political movement needs an element of hate, in other words.
            2. One identifies a problem about the thing they hate (and typically it is absolutely real ...).
            3. Through cherry-picking/aggressive spinning that problem becomes the biggest problem in the world. If you only look at the negatives of something (OoO execution/dynamic scheduling) you can make it look so bad, that people start believing that everyone would be better off if we just stopped doing it altogether.

            As far as I'm concerned, you're doing exactly this.
            Last edited by vladpetric; 07 January 2024, 04:48 PM.

            Comment


            • #26
              Originally posted by archkde View Post
              I do not disagree that "don't know" may happen frequently in practice, or even that it can be hard to quantify in the first place. All I'm saying is that it is perfectly compatible with Rice's theorem that "don't know" is so rare it only happens once every 10 years in the entire world in practice.
              The problem with "don't know" is that the compiler needs to treat it as "no!" (as in, can't optimize).

              Ok, so there is a gray area, which we agree about. But in that gray area, I think you're implicitly assuming the theoretically best possible situation from an optimizing compiler/static optimizer angle. After working with processors and compilers for a while, I think you're wrong.
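
              A small made-up example of "don't know" collapsing to "no": if log_progress() below lives in another translation unit, the compiler cannot tell whether it touches the global total, so total is reloaded and stored on every iteration instead of being kept in a register. (Both names are invented for illustration.)

```c
long total;                   /* global, visible to other code       */
void log_progress(long i);    /* opaque call defined in another file */

/* Because log_progress() *might* read or write 'total', the compiler
   must keep the global up to date across every call - "don't know"
   is treated as "no, can't keep it in a register". */
void accumulate(const long *a, long n) {
    for (long i = 0; i < n; i++) {
        total += a[i];
        log_progress(i);
    }
}
```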

              Originally posted by archkde View Post
              And what makes Itanium so much different here? My understanding was that it is much more about coming up with a reasonable desired property in the first place, but you seem to be much more knowledgeable in that area.
              As I detailed in a parallel thread over here (sorry, I won't repeat it - I'll ask you to read it, it's two messages above, the third nail), for regular (dynamically scheduled/OoOe) processors optimizations are nice-to-haves/optional, whereas for Itanium they are absolutely necessary. Essentially, the Intel Itanium proponents/early champions claimed they would do something they simply couldn't ...
              Last edited by vladpetric; 07 January 2024, 04:59 PM.

              Comment


              • #27
                Originally posted by vladpetric View Post
                This is just a political/ideological meme. These kinds of memes are claimed to be true because they sound clever and (maybe) they are funny. Does that make them true under any circumstances? NO. As far as I'm concerned, what can be claimed without evidence can also be dismissed without evidence. As an aside, it is quite typical for such memes to be used in political "debates". But more about that later.
                I had thought it was obvious that I was not being serious. But sometimes these things are not as obvious in written form as they would be when spoken. We all know that theory and practice are different because they cover different things and have different priorities. As an example, in theory the fastest multiplication algorithm currently known scales as O(n log n). In practice, the fastest algorithm used scales as O(n log n log log n), because the asymptotically faster algorithm only wins on numbers so big that the earth does not have enough energy to run it.
                Originally posted by vladpetric View Post
                Are there differences between theoretical computer science and applied computer science? Yes, absolutely there are (see, for instance, galactic algorithms). But in the vast majority of cases the results of theoretical computer science (including, but not limited to, the myriad of lower bounds, or NP-completeness/hardness) apply. The burden of proof should be on the person making a claim that a theoretical result doesn't apply, not the other way around.
                Of course computer science theory is very important to practical computer programming. I have not suggested anything different.
                Originally posted by vladpetric View Post
                The practice of the halting problem is as follows: there are no good automatic solvers for the halting problem, period - no practical tools, applying to practical code, that detect even a large majority of infinite loops, let alone all of them.
                You are misunderstanding a good deal here, I think - or at least explaining things very badly.
                The theory of the halting problem says there are no general automatic solvers for the halting problem. It is not concerned with them being "good" or not, or whether they apply to "practical code". And computation theory also says there /is/ a general automatic solver for the halting problem for a given practical computer. Do you want a halting decider that can handle any program that will run on your PC, determining whether the program will halt within your lifetime? It's easy, in theory. But of course it is totally impossible in practice.
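
                To spell out the "easy in theory" part: a machine with finite memory has finitely many states, so a decider only has to simulate it and watch for a repeated state. The toy below (my sketch, with an arbitrary Collatz-style step function and 1 as the halt state) does exactly that with Floyd's cycle detection; the same idea applies to a real PC in principle, it is only the 2^(billions of bits) state count that makes it useless in practice.

```c
#include <stdint.h>
#include <stdio.h>
#include <stdbool.h>

#define HALT 1u

/* One deterministic step of a toy finite-state "program"
   (a Collatz step; wrap-around keeps the state space finite). */
static uint32_t step(uint32_t s) {
    return (s & 1u) ? 3u * s + 1u : s / 2u;
}

/* Decide halting by simulation + Floyd cycle detection: either the
   trajectory reaches HALT, or it revisits a state and loops forever. */
static bool halts(uint32_t start) {
    uint32_t slow = start, fast = start;
    for (;;) {
        if (slow == HALT) return true;
        slow = step(slow);
        if (fast == HALT) return true;
        fast = step(fast);
        if (fast == HALT) return true;
        fast = step(fast);
        if (slow == fast) return false;   /* repeated state: never halts */
    }
}

int main(void) {
    printf("halts(27) = %d\n", halts(27));   /* 1: this trajectory reaches HALT */
    return 0;
}
```
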
                Originally posted by vladpetric View Post
                Ok, what is the second major nail? A compiler simply can't tell ahead of time how the cache hierarchy will behave at run time.
                Agreed.
                Originally posted by vladpetric View Post
                BTW, performance critical DSP for modern processors is done by hand (assembly or intrinsics) and small vectors (SSE, AVX, and similar instruction sets for ARM etc). The compilers genuinely suck at vectorizing actually ... (even with the most recent compilers, it's faster to write a vectorized loop by hand, either with assembly or intrinsics, than it is to massage the code for the compiler to do it).
                Also true.

                Compilers are definitely getting better at vectorising SIMD code on general-purpose CPUs, but can usually only handle fairly simple cases without significant manual help. Getting the best out of DSPs is much harder again - and if you are not getting the best out of the inner loops of your kernels, there's little point in using a dedicated DSP in the first place.

                Comment


                • #28
                  Originally posted by vladpetric View Post

                  At a technical level, you're maybe partly correct. Strictly speaking, if you can get rid of something that consumes power, then yes, you get a net gain.

                  But:

                  1. You do have to do something better instead. In this case, VLIW actually made things worse. Through poor resource utilization, the performance per Watt became far worse.
                  So far, so good.
                  Originally posted by vladpetric View Post

                  2. Your power breakdown is based on the situation in the late 90s/early 2000s. What you're so up in arms about matters a lot less these days.
                  But that is not correct. The work I have called "bureaucracy" overhead - hardware re-ordering, register renaming, speculative execution (especially when it is discarded, but also when it must be tracked and turned into "real" execution), and so on - is a large cost for modern processors. It is a major cost in development time and effort for the design, a big cost in complexity (and therefore risk of errors or security problems), a cost in die area (though not a big one), a big cost in power consumption (depending on the characteristics of the code that is running), and a limiting factor for clock speed.

                  Processors that emphasise lower power and cheaper devices minimise the "bureaucracy". They don't do multiple instructions per cycle, and their throughput is limited either by pipeline stalls or by the lower clock speed limits of short pipeline designs. But avoiding or reducing this kind of overhead is a major reason why ARM cores are generally more power efficient than x86 cores, and a major part of the design of the "little" cores that have become popular for ARM, Intel and AMD.

                  Originally posted by vladpetric View Post
                  At a political level, the above statement indicates to me that you're part of the large group of dynamic scheduling haters. Of course, nobody likes being called a hater, and at the same time this won't make them stop hating, either.
                  Now you have moved from being wrong to being completely deluded. My comments are purely technical and fact-based (to the best of my knowledge, though of course I could be getting some things wrong) - not "political" (whatever you mean by that), and not "hate". The very idea that there exists a "large group of dynamic scheduling haters" is utterly bizarre. And even if such a group existed, I would not be in it - dynamic scheduling (and the other features I called "bureaucracy" for short) is a perfectly good trade-off, giving higher processing speeds at the cost of more power and more complex devices. Sometimes high speed is your priority, sometimes low power is your priority.

                  Feel free to continue a technical discussion here, but please leave the conspiracy theories at home. And please don't extrapolate so wildly from what other people write.

                  Comment


                  • #29
                    Ok, so let me start by saying that when it comes to X86 you seem to conflate/bundle together the complexities of the instruction set (architecture, instruction set architecture, such as ARM, X86, and Itanium) with the implementation type (microarchitecture - whether in-order or out-of-order). Yes, it is the case that Itanium gets rid of both out-of-order execution and the complexities of x86 (while adding some of its own, such as the register stack). But the two things are almost completely orthogonal.

                    The vast majority of ARM implementations are not aimed at the high-performance segment. They are mobile low-power chips, which indeed can have better performance per Watt than high-performance chips. But they don't really scale to high performance, and there also aren't really any noteworthy x86 chips in that space (so you can't really make an honest comparison).

                    There is a significant exception to that: the Apple M line, which is a high-performance ARM design - a super-aggressive out-of-order superscalar. (As an aside, the Apple line traces back to PA Semi, whose team included Jim Keller, an engineer who participated in one way or another in many of the most successful processor designs on this planet, using different instruction sets, and pretty much all of them out-of-order superscalar since the mid 90s.)

                    Back in the late 90s, the designers of the MIPS R10000, the Pentium II, and the Alpha 21264 all chose to implement dynamic scheduling, when transistor budgets were in the range of 6-25 million per chip, and those decisions worked quite well from a performance standpoint. Guess what: the relative cost of dynamic scheduling is much lower now (and the relative speed of memory vs. the CPU is significantly worse). It made sense back then, and it makes a lot more sense now, when a single core can have about 2 billion transistors.

                    You seem to be quite negative about speculative execution and branch prediction, and I find it perplexing ... Speculative execution is a critical enabler for high performance processors/high IPC. More importantly, VLIW actually needs even better branch prediction/speculative execution than Out-of-order processors, because VLIW is in order and thus its performance is more dependent on good prediction (reasons are complex, see critical path work by Fields and others to get a better understanding). So, does VLIW make things simpler for branch prediction and speculation? Ummm, not at all, on the contrary.

                    From a power perspective, if you compare a VLIW design to an OoO design of exactly the same width, there are a bunch of power-critical structures that are the same, e.g. register file, bypass network, and functional units (VLIW doesn't improve these things). And yes, these will consume more power than the renaming logic. But more importantly, this comparison doesn't take into account the fact that OoO processors can be made much much wider than VLIW ones. E.g., the Apple M processors are 11 wide for integer, and 2-3 wide for FP if I remember correctly (so in a single cycle the processor could issue as many as 13 instructions at the same time, though that is not longer-term sustainable). Can a VLIW design be made that wide? No, not really. I mean the compilers had a hard time making bundles of size 3 with EPIC ...

                    This doesn't mean that the structures required for OoOe don't take power - they do, obviously. But really, their impact on the power bottom line is considerably smaller than you'd think. No, I don't have a publicly available paper with a breakdown between all those components (if you do, please share it)

                    BTW, as one of the architects of the Power line said (forgot his name), when you have complexity, you deal with it, because the alternative is to have lower performance, and that's not acceptable.

                    Would it be nice not to have such complexity (e.g., by moving that complexity to software) ? Sure, if it's technically feasible. But it's worth keeping an eye on the big picture here.

                    When looking at instruction scheduling within a fixed window of N instructions (with speculative execution of course), the time in hardware is O(N), and the hardware cost has both linear and O(N * M) components with M being the number of instructions that can be issued every cycle.
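
                    For a feel of where that O(N * M) term comes from, here is a toy model (mine, not from any real design) of the wakeup step of an issue window: every cycle, each of the up-to-M destination tags produced is compared against the source tags of all N waiting entries. Real hardware does this with CAM cells rather than loops, but the amount of comparison work is the same.

```c
#include <stdbool.h>

enum { N_ENTRIES = 32, ISSUE_WIDTH = 4 };   /* illustrative sizes */

typedef struct {
    int  src1, src2;                /* register tags this entry waits on */
    bool src1_ready, src2_ready;
    bool valid;
} iq_entry;

/* Wakeup: broadcast the tags produced this cycle to every waiting entry. */
void wakeup(iq_entry win[N_ENTRIES],
            const int produced[ISSUE_WIDTH], int n_produced) {
    for (int i = 0; i < N_ENTRIES; i++) {          /* N entries ...       */
        if (!win[i].valid) continue;
        for (int j = 0; j < n_produced; j++) {     /* ... times M results */
            if (win[i].src1 == produced[j]) win[i].src1_ready = true;
            if (win[i].src2 == produced[j]) win[i].src2_ready = true;
        }
    }
}
```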

                    When looking at static scheduling of instructions, the complexity of that is ... ah wait, you can't put an O() cost on it because the problem itself is undecidable.

                    The bottom line here is that Static Instructions != Dynamic Instructions. Given Rice's theorem, one can't treat them as interchangeable. VLIW proponents try to do exactly this - handwave that Static Instructions, Dynamic Instructions ... same thing really! (Not!)

                    I strongly suggest you read the Shen and Lipasti book. While the current edition is from 2013 (so a tad old), it's still very much relevant today. If you find a newer/better book, please do share a link.​

                    Comment


                    • #30
                      Originally posted by vladpetric View Post
                      Ok, so let me start by saying that when it comes to X86 you seem to conflate/bundle together the complexities of the instruction set (architecture, instruction set architecture, such as ARM, X86, and Itanium) with the implementation type (microarchitecture - whether in-order or out-of-order). Yes, it is the case that Itanium gets rid of both out-of-order execution and the complexities of x86 (while adding some of its own, such as the register stack). But the two things are almost completely orthogonal.
                      No, the two things are most certainly not orthogonal. They are not the same, and a given ISA can have many very different implementations, but they are not unrelated. While modern x86 cpus translate incoming CISC-style x86 instructions into more RISC-style instructions for execution, they still have to support the semantics of the x86 ISA. For example, the x86 ISA requires strong memory ordering, support for bus locking and RMW instructions, and specialised register usage. A large proportion of instructions affect the flag register. These sorts of things make the implementation massively more complex, and are a strong tie between the ISA and the implementation.

                      But it is entirely true that a given ISA can be implemented in many ways, ranging from simple one instruction at a time through to massively OOO implementations.
                      Originally posted by vladpetric View Post
                      The vast majority of ARM implementations are not aimed at the high-performance segment. They are mobile low-power chips, which indeed can have better performance per Watt than high-performance chips. But they don't really scale to high performance, and there also aren't really any noteworthy x86 chips in that space (so you can't really make an honest comparison).

                      There is a significant exception to that: the Apple M line, which is a high-performance ARM design - a super-aggressive out-of-order superscalar. (As an aside, the Apple line traces back to PA Semi, whose team included Jim Keller, an engineer who participated in one way or another in many of the most successful processor designs on this planet, using different instruction sets, and pretty much all of them out-of-order superscalar since the mid 90s.)
                      ARM implementations cover a wide variety too - and they include high-power, high-speed devices for server usage with vastly more total throughput than Apple's line. Apple's "M" cores are the fastest desktop ARM devices, but not the fastest ARM devices overall. Still, it is absolutely true that most ARM core designs, and certainly the majority by numbers produced, are simpler implementations with lower power requirements. ARM implementations completely outclass x86 devices on throughput per watt except occasionally at the very high end. A key reason for this is the x86 ISA compared to the ARM ISA - x86 has many features that were fine for early single-issue designs, but scale very badly to superscalar and OOO - even pipelined execution can be difficult. There's a reason why all "big" ISAs, other than x86, have many orthogonal registers, load/store architectures, and minimal use of flags. It is vastly simpler to make a superscalar OOO implementation of ARM, PowerPC, RISC-V, etc., than to make such an implementation for the x86 ISA. The only reason that there are fast x86 implementations is that Intel and AMD can throw a lot more money at the task - giving processors that are very fast at the cost of requiring a great deal of power. As I said originally, the x86 ISA requires much more complicated, and therefore more expensive and power-consuming, "bureaucracy" overhead for fast designs than more RISC-style ISAs need.

                      This is not particularly contentious, or difficult to understand. With RISC-style ISAs, if you have two "add" instructions back-to-back, they might be "add r1, r2, r3" and "add r4, r5, r6". These are easy to run in parallel - there's no contention on any resource (other than easily duplicated things like adders and register ports). No bus locking, no register overlap, no flags, no problem. Of course there will be stalls when the same register is used in successive instructions, but that only happens when it is actually needed, and it's only one kind of tracking. For your two x86 adds, even if the registers used are different and there's no memory access - so you have "add a, b" and "add c, d" - each instruction has register contention of its own (the destination is also a source) and both need to set the flags register. In real x86 code the "a" (accumulator) register is used vastly more than the other registers, instructions can trap, and memory accesses can be mixed in with ALU operations.

                      The more you try to do things in parallel or pipelined, the worse this effect gets - there are diminishing returns. Even with good RISC ISAs, 32 registers become a limiting factor and you have to have all your dynamic register renaming and tracking. (More than 32 GPRs is counter-productive for general software, as higher register counts mean more of the instruction encoding space is taken up by register numbers, and you can't do compile-time scheduling over a long enough distance in code/time to use them well. But x86-64's 16 GPRs are too few, adding to the challenges for that ISA.) This is a major reason for the move towards multi-core rather than faster single cores.

                      Another special challenge that the x86 world has, and RISC does not, is that such a large proportion of existing x86 code is complete crap. It was made with outdated or poor-quality tools, or by people who don't know how to use their tools well, and a great deal of it is very old. Compatibility with pre-existing x86 code is the sole reason the x86 ISA has its current position. But a lot of that code is poor quality - a lot of it is still 32-bit and restricted to 8 GPRs, and perhaps even compiled for "generic 386". So modern x86 CPUs have to put a great deal more effort into making old, poor code run fast, while modern RISC CPUs can expect code compiled with better scheduling, register allocation, etc.
                      Originally posted by vladpetric View Post
                      ​You seem to be quite negative about speculative execution and branch prediction, and I find it perplexing ... Speculative execution is a critical enabler for high performance processors/high IPC.
                      It would be perplexing if it were remotely true. I can only assume you don't actually read what I write.

                      I have said this kind of thing is a significant cost in complexity and power requirements. I have said that the x86 ISA requires much more of the complexity than RISC ISAs do, and that the aim of the EPIC design of the Itanium was to minimise the complexity here. But I am in no way negative to these techniques - they are essential to higher instructions per clock cycle and higher throughput. Pointing out the costs, and what makes these costs greater or lesser, is not negativity.
                      Originally posted by vladpetric View Post
                      More importantly, VLIW actually needs even better branch prediction/speculative execution than Out-of-order processors, because VLIW is in order and thus its performance is more dependent on good prediction (reasons are complex, see critical path work by Fields and others to get a better understanding). So, does VLIW make things simpler for branch prediction and speculation? Ummm, not at all, on the contrary.
                      Again, you seem to be imagining things that I did not write.

                      However, it is perhaps worth noting that the only successful VLIW designs are DSPs, and they do not (normally) use branch prediction, speculative execution or OOO. They use VLIW to get explicit parallel execution in the inner loops of the DSP algorithms. Dynamic re-arrangements or scheduling would not help these loops, but would mess up the consistency of the timings.
                      Originally posted by vladpetric View Post
                      ​Would it be nice not to have such complexity (e.g., by moving that complexity to software) ? Sure, if it's technically feasible. But it's worth keeping an eye on the big picture here.

                      When looking at instruction scheduling within a fixed window of N instructions (with speculative execution of course), the time in hardware is O(N), and the hardware cost has both linear and O(N * M) components with M being the number of instructions that can be issued every cycle.

                      When looking at static scheduling of instructions, the complexity of that is ... ah wait, you can't put an O() cost on it because the problem itself is undecidable.
                      Of course you can figure out cost functions. Just because a problem is, in its most general and unlimited form, unsolvable does not mean you throw your hands up in the air and say it can't be done, or that you have no idea of the cost! Do you think compiler writers should not bother doing instruction scheduling during compilation, just because some guy on the internet says it is undecidable? Do you think it matters what can and cannot be proven about the ultimate limits of scheduling? It is not remotely relevant to the practice of compilers (or processors, or anything else). The practice is limited by the work that compiler writers can do, and the patience that compiler users have waiting for builds to finish. The costs and benefits are seen in real-world statistics - how fast real code runs.
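
                      To make the "of course you can put costs on it" point concrete, here is a toy greedy list scheduler (entirely my sketch, with made-up latencies and no resource constraints): nodes of a dependence DAG are issued in critical-path-first order as soon as their predecessors have finished. This is the kind of bounded, polynomial-cost heuristic real compilers use, undecidability of the general problem notwithstanding.

```c
#include <stdio.h>
#include <stdbool.h>

enum { N = 5 };

/* Made-up per-node latencies and dependence DAG: dep[i][j] means j needs i. */
static const int  latency[N] = { 3, 1, 3, 1, 1 };
static const bool dep[N][N] = {
    /*        0  1  2  3  4 */
    /* 0 */ { 0, 1, 0, 0, 0 },
    /* 1 */ { 0, 0, 0, 0, 1 },
    /* 2 */ { 0, 0, 0, 1, 0 },
    /* 3 */ { 0, 0, 0, 0, 1 },
    /* 4 */ { 0, 0, 0, 0, 0 },
};

/* Priority = longest latency path from this node (critical-path-first). */
static int path_len(int i) {
    int best = latency[i];
    for (int j = 0; j < N; j++)
        if (dep[i][j] && latency[i] + path_len(j) > best)
            best = latency[i] + path_len(j);
    return best;
}

int main(void) {
    int  finish[N];
    bool done[N] = { false };
    for (int k = 0; k < N; k++) {
        int pick = -1, best_prio = -1, ready_at = 0;
        for (int i = 0; i < N; i++) {
            if (done[i]) continue;
            bool ready = true;
            int earliest = 0;
            for (int p = 0; p < N; p++) {
                if (!dep[p][i]) continue;
                if (!done[p]) { ready = false; break; }
                if (finish[p] > earliest) earliest = finish[p];
            }
            if (ready && path_len(i) > best_prio) {
                pick = i; best_prio = path_len(i); ready_at = earliest;
            }
        }
        finish[pick] = ready_at + latency[pick];
        done[pick] = true;
        printf("node %d: issue at cycle %d, result at cycle %d\n",
               pick, ready_at, finish[pick]);
    }
    return 0;
}
```
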
                      Originally posted by vladpetric View Post
                      The bottom line here is that Static Instructions != Dynamic Instructions. Given Rice's theorem, one can't treat them as interchangeable. VLIW proponents try to do exactly this - handwave that Static Instructions, Dynamic Instructions ... same thing really! (Not!)
                      Rice's theorem is irrelevant here. Seriously - stop going on about it. It's like the rest of your attempts at name-dropping - when you get the details wrong and misunderstand the relevance, it does not make you look sophisticated or educated. It shows you have heard of a few things and think mentioning them will make you look clever.

                      I have not heard VLIW proponents equate dynamic and static scheduling. Not even when Intel was most enthusiastic and optimistic about the EPIC Itanium architecture did they mix this up. They aimed to use better static scheduling to reduce the need for dynamic scheduling while maintaining high IPC - they did not think the two were the same thing. Real-world VLIW designs are, for the most part, DSPs - and their designers, users and advocates know the difference. I don't personally know anyone who would describe themselves as a "VLIW proponent" to ask, but I expect that most people who are interested in the details of processor design know the difference.
                      Originally posted by vladpetric View Post

                      I strongly suggest you read the Shen and Lipasti book.
                      I strongly suggest you read what I have written, and stop tilting at windmills and imagining what you think I am saying.

                      And I've snipped your advert for a site dedicated to the destruction of book shops. If you want to post a link to a book, post a link to the book (I don't mean illegal copies) - such as their website, or the publisher's site. The appropriate link is https://www.waveland.com/browse.php?t=624

                      Comment
