Glibc 2.39 Should Be Out On 1 February & Might Drop Itanium IA64 Linux Support


  • Eirikr1848
    replied
    Does anyone know if IA64 support was dropped from glibc?



  • Gonk
    replied
    Another big change - though one that we've known was coming - with glibc 2.39 is the final removal of libcrypt. Building of it was turned off by default in 2.38, but in 2.39 it is gone. And good riddance.
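
    (For anyone whose code still calls crypt(): the interface lives on in libxcrypt, which is what most distributions already point -lcrypt at. A minimal, illustrative sketch - the salt string and build command are just assumptions for the example, not anything mandated by glibc 2.39:)

    ```c
    /* Hash a password via crypt_r() from libxcrypt, which provides the
     * crypt(3) interface now that glibc no longer ships libcrypt.
     * Build (assuming libxcrypt is installed):  cc demo.c -lcrypt */
    #include <crypt.h>
    #include <stdio.h>

    int main(void)
    {
        struct crypt_data data = {0};
        /* "$6$" selects SHA-512; the salt here is a made-up example. */
        char *hash = crypt_r("hunter2", "$6$examplesalt", &data);
        if (hash == NULL || hash[0] == '*') {   /* libxcrypt failure token */
            fprintf(stderr, "hashing failed\n");
            return 1;
        }
        printf("%s\n", hash);
        return 0;
    }
    ```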



  • Gonk
    replied
    Originally posted by OingoBoingo View Post
    It's just that when IA64 originally started, in the early 90s, neither Intel nor HP expected x86 to be so scalable as to have also evolved into an out-of-order, high-performance 64-bit architecture by the beginning of the century. Props to AMD's arch team for that.
    Jim Keller again.



  • OingoBoingo
    replied
    Dear lord, it's fascinating how the FUD about Itanium from decades ago is still being spouted ha ha.

    The compilers were fine from the get-go, and so was the architecture once Itanium2 was released - which, BTW, was one of the top-performing CPUs of its generation.

    IA64 wasn't particularly VLIW, as each bundle was just 3 instructions long. Besides, VLIW code analysis wasn't a "mystery"; it had been actively worked on since the 70s. Trace compilers, even by the late 80s, were fairly good at figuring out static superscalar schedules, and IA64 provided huge rotating architectural register files for the compiler.

    Besides, Itanium2 also had plenty of dynamic execution support. I just laugh at some of the discussions that involve Itanium where people try to "solve" problems they don't understand, thinking that some of the top architecture/compiler teams in the industry (Intel/HP) somehow missed the basics of computer organization and the intro compiler classes.

    What sunk Itanium wasn't performance or compilers. What did the architecture in was poor downward scalability due to the high power consumption introduced by predication, poor backwards compatibility with what was then the largest software library in the world (x86), and high cost due to lacking access to the same economies of scale as x86.
    Once AMD figured out how to extend x86 to 64-bit with an architecture that kept 32-bit performance as well, IA64 had little value proposition in comparison.

    But from an architectural performance standpoint, on its own, Itanium wasn't a bad architecture, since it managed to either match or outperform its contemporary out-of-order high-performance RISC designs in its original use cases. It's just that when IA64 originally started, in the early 90s, neither Intel nor HP expected x86 to be so scalable as to have also evolved into an out-of-order, high-performance 64-bit architecture by the beginning of the century. Props to AMD's arch team for that.



  • DavidBrown
    replied
    Originally posted by vladpetric View Post
    Ok, so let me start by saying that when it comes to X86 you seem to conflate/bundle together the complexities of the instruction set (architecture, instruction set architecture, such as ARM, X86, and Itanium) with the implementation type (microarchitecture - whether in-order or out-of-order). Yes, it is the case that Itanium gets rid of both out-of-order execution and the complexities of x86 (while adding some of its own, such as the register stack). But the two things are almost completely orthogonal.
    No, the two things are most certainly not orthogonal. They are not the same, and a given ISA can have many very different implementations, but they are not unrelated. While modern x86 cpus translate incoming CISC-style x86 instructions into more RISC-style instructions for execution, they still have to support the semantics of the x86 ISA. For example, the x86 ISA requires strong memory ordering, support for bus locking and RMW instructions, and specialised register usage. A large proportion of instructions affect the flag register. These sorts of things make the implementation massively more complex, and are a strong tie between the ISA and the implementation.

    But it is entirely true that a given ISA can be implemented in many ways, ranging from simple one-instruction-at-a-time designs through to massively OOO implementations.
    Originally posted by vladpetric View Post
    The vast majority of ARM implementations are not for the high performance area. They are mobile low-power chips, which indeed can have more performance per Watt than high performance chips. But they don't really scale to high performance, and also there aren't really any noteworthy x86 chips in that space (can't really make a comparison honestly).

    There is a significant exception to that, the Apple M line, which is an ARM high-performance, super-aggressive out-of-order superscalar (as an aside, the Apple line traces back to PA Semi, where Jim Keller worked - an engineer who participated in one way or another in many of the most successful processor designs on this planet, using different instruction sets, and pretty much all of them out-of-order superscalar since the mid 90s).
    ARM implementations cover a wide variety too - and they include high-power, high-speed devices for server usage with vastly more total throughput than Apple's line. Apple's "M" cores are the fastest desktop ARM devices, but not the fastest ARM devices overall. Still, it is absolutely true that most ARM core designs, and certainly the majority by numbers produced, are simpler implementations with lower power requirements. ARM implementations completely outclass x86 devices on throughput per watt except occasionally at the very high end.

    A key reason for this is the x86 ISA compared to the ARM ISA - x86 has many features that were fine for early scalar designs, but scale very badly to superscalar and OOO implementations; even pipelined execution can be difficult. There's a reason why all "big" ISAs, other than x86, have many orthogonal registers, load/store architectures, and minimal use of flags. It is vastly simpler to make a superscalar OOO implementation of ARM, PowerPC, RISC-V, etc., than to make such an implementation for the x86 ISA. The only reason that there are fast x86 implementations is that Intel and AMD can throw a lot more money at the task - giving processors that are very fast at the cost of requiring a great deal of power. As I said originally, the x86 ISA requires much more complicated, and therefore more expensive and power-consuming, "bureaucracy" overhead for faster designs than you need for more RISC-style ISAs.

    This is not particularly contentious, or difficult to understand. With RISC-style ISAs, if you have two "add" instructions back-to-back, they might be "add r1, r2, r3" and "add r4, r5, r6". These are easy to run in parallel - there's no contention on any resource (other than easily duplicated things like adders and register ports). No bus locking, no register overlap, no flags, no problem. Of course there will be stalls when the same register is used in successive instructions, but that only happens when needed, and it's only one type of tracking. For your two x86 adds, even if the registers used are different and there's no memory access, so you have "add a, b" and "add c, d", both instructions have register contention of their own (each destination is both input and output) and both need to set the flag register. In real x86 code, the "a" register is used vastly more than other registers, instructions can trap, and memory accesses can be mixed with ALU operations.
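
    (To make that concrete, here is a tiny, purely illustrative C fragment; the comments describe the general code-generation pattern, not the output of any particular compiler:)

    ```c
    /* Two independent sums.  On a RISC-style target each maps to a
     * three-operand register add ("add r1, r2, r3" style): no shared
     * registers, no flags written, so a superscalar core can issue both
     * in the same cycle with minimal dependency tracking.  On x86 the
     * natural encoding is the two-operand "add dst, src", where dst is
     * both read and written and EFLAGS is written by both instructions,
     * so the hardware has to rename the destination and the flags to
     * keep the two additions independent. */
    long sum_pairs(long a, long b, long c, long d)
    {
        long x = a + b;   /* independent of the next statement */
        long y = c + d;   /* no shared inputs or outputs       */
        return x + y;
    }
    ```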

    The more you try to do things in parallel or pipelined, the worse this effect gets - there are diminishing returns. Even with good RISC ISAs, 32 registers become a limiting factor and you have to have all your dynamic register renaming and tracking. (More than 32 GPRs is counter-productive in general software, as higher register counts mean more of the instruction encoding space is taken up by register numbers, and you can't do compile-time scheduling over a long enough distance in code/time to use them well. But x86-64's 16 GPRs are too few, adding to the challenges for that ISA.) This is a major reason for the move towards multi-core rather than faster single cores.

    Another special challenge that the x86 world has, and RISC does not, is that such a large proportion of existing x86 code is complete crap. It is made with outdated or poor-quality tools, or by people who don't know how to use their tools well, and a great deal of it is very old. Compatibility with pre-existing x86 code is the sole reason the x86 ISA has its current position. But a lot of that code is poor quality - a lot is still 32-bit and restricted to 8 GPRs, and perhaps even compiled for "generic 386". So modern x86 CPUs have to put a great deal more effort into making old, poor code fast, while modern RISC CPUs can expect code to be compiled with better scheduling, register allocation, etc.
    Originally posted by vladpetric View Post
    ​You seem to be quite negative about speculative execution and branch prediction, and I find it perplexing ... Speculative execution is a critical enabler for high performance processors/high IPC.
    It would be perplexing if it were remotely true. I can only assume you don't actually read what I write.

    I have said this kind of thing is a significant cost in complexity and power requirements. I have said that the x86 ISA requires much more of the complexity than RISC ISAs do, and that the aim of the EPIC design of the Itanium was to minimise the complexity here. But I am in no way negative about these techniques - they are essential to higher instructions per clock cycle and higher throughput. Pointing out the costs, and what makes those costs greater or lesser, is not negativity.
    Originally posted by vladpetric View Post
    More importantly, VLIW actually needs even better branch prediction/speculative execution than Out-of-order processors, because VLIW is in order and thus its performance is more dependent on good prediction (reasons are complex, see critical path work by Fields and others to get a better understanding). So, does VLIW make things simpler for branch prediction and speculation? Ummm, not at all, on the contrary.
    Again, you seem to be imagining things that I did not write.

    However, it is perhaps worth noting that the only successful VLIW designs are DSPs, and they do not (normally) use branch prediction, speculative execution or OOO. They use VLIW to get explicit parallel execution in the inner loops of the DSP algorithms. Dynamic re-arrangements or scheduling would not help these loops, but would mess up the consistency of the timings.
    Originally posted by vladpetric View Post
    ​Would it be nice not to have such complexity (e.g., by moving that complexity to software) ? Sure, if it's technically feasible. But it's worth keeping an eye on the big picture here.

    When looking at instruction scheduling within a fixed window of N instructions (with speculative execution of course), the time in hardware is O(N), and the hardware cost has both linear and O(N * M) components with M being the number of instructions that can be issued every cycle.

    When looking at static scheduling of instructions, the complexity of that is ... ah wait, you can't put an O() cost on it because the problem itself is undecidable.
    Of course you can figure out cost functions. Just because a problem is, in its most general and unlimited form, unsolvable does not mean you throw your hands up in the air and say it can't be done, or that you have no idea of the cost! Do you think compiler writers should not bother doing instruction scheduling during compilation, just because some guy on the internet says it is undecidable? Do you think it matters what can and cannot be proven about the ultimate limits of scheduling? It is not remotely relevant to the practice of compilers (or processors, or anything else). The practice is limited by the work that compiler writers can do, and the patience that compiler users have waiting for builds to finish. The costs and benefits are seen from real-world statistics - how fast real code runs.
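
    (For the record, "figuring out cost functions" in practice looks roughly like this - a toy, single-issue greedy list scheduler over a dependency DAG. The node count, latencies, dependency edges and the "issue the first ready op" heuristic are all made up for illustration; real compilers rank ready ops by critical-path height, register pressure and so on:)

    ```c
    #include <stdio.h>

    #define N 5

    /* Made-up latencies (cycles) and dependency matrix: dep[i][j] != 0
     * means op j consumes the result of op i. */
    static const int latency[N] = {3, 1, 1, 2, 1};
    static const int dep[N][N] = {
        /* 0 */ {0, 1, 0, 0, 0},
        /* 1 */ {0, 0, 0, 1, 0},
        /* 2 */ {0, 0, 0, 1, 0},
        /* 3 */ {0, 0, 0, 0, 1},
        /* 4 */ {0, 0, 0, 0, 0},
    };

    int main(void)
    {
        int ready_at[N] = {0};   /* earliest cycle each op's inputs are ready */
        int done[N] = {0};
        int scheduled = 0;

        for (int cycle = 0; scheduled < N; cycle++) {
            for (int i = 0; i < N; i++) {
                if (done[i] || ready_at[i] > cycle)
                    continue;
                int preds_done = 1;
                for (int p = 0; p < N; p++)
                    if (dep[p][i] && !done[p])
                        preds_done = 0;
                if (!preds_done)
                    continue;
                /* Greedy choice: issue the first ready op. */
                printf("cycle %2d: issue op %d (latency %d)\n",
                       cycle, i, latency[i]);
                done[i] = 1;
                scheduled++;
                for (int j = 0; j < N; j++)
                    if (dep[i][j] && ready_at[j] < cycle + latency[i])
                        ready_at[j] = cycle + latency[i];
                break;   /* single-issue machine: one op per cycle */
            }
        }
        return 0;
    }
    ```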
    Originally posted by vladpetric View Post
    ​Bottomline here is that Static Instructions != Dynamic Instructions. Given Rice's theorem, one can't treat them as idempotent. VLIW proponents try to do exactly this - handwave that Static Instructions, Dynamic Instructions ... same thing really! (not!).
    Rice's theorem is irrelevant here. Seriously - stop going on about it. It's like the rest of your attempts at name-dropping - when you get the details wrong and misunderstand the relevance, it does not make you look sophisticated or educated. It shows you have heard of a few things, and think mentioning them will make you look clever.

    I have not heard VLIW proponents equate dynamic and static scheduling. Not even when Intel was most enthusiastic and optimistic about the EPIC Itanium architecture did they mix this up. They aimed to use better static scheduling to reduce the need for dynamic scheduling while maintaining high IPC counts - they did not think they were the same thing. Real-world VLIW designs are, for the most part, DSPs - and the designers, users and advocates know the difference. I don't personally know anyone who would describe themselves as a "VLIW proponent" to ask, but I expect that most people who are interested in the details of processor design know the difference.
    Originally posted by vladpetric View Post

    I strongly suggest you read the Shen and Lipasti book.
    I strongly suggest you read what I have written, and stop tilting at windmills and imagining what you think I am saying.

    And I've snipped your advert for a site dedicated to the destruction of book shops. If you want to post a link to a book, post a link to the book (I don't mean illegal copies) - such as their website, or the publisher's site. The appropriate link is https://www.waveland.com/browse.php?t=624



  • vladpetric
    replied
    Ok, so let me start by saying that when it comes to X86 you seem to conflate/bundle together the complexities of the instruction set (architecture, instruction set architecture, such as ARM, X86, and Itanium) with the implementation type (microarchitecture - whether in-order or out-of-order). Yes, it is the case that Itanium gets rid of both out-of-order execution and the complexities of x86 (while adding some of its own, such as the register stack). But the two things are almost completely orthogonal.

    The vast majority of ARM implementations are not for the high performance area. They are mobile low-power chips, which indeed can have more performance per Watt than high performance chips. But they don't really scale to high performance, and also there aren't really any noteworthy x86 chips in that space (can't really make a comparison honestly).

    There is a significant exception to that, the Apple M line, which is an ARM high-performance, super-aggressive out-of-order superscalar (as an aside, the Apple line traces back to PA Semi, where Jim Keller worked - an engineer who participated in one way or another in many of the most successful processor designs on this planet, using different instruction sets, and pretty much all of them out-of-order superscalar since the mid 90s).

    Back in the late 90s, the designers of the MIPS R10000, Pentium II and Alpha 21264 all made the choice to implement dynamic scheduling, when transistor budgets were in the range of 6-25 million per chip, and those decisions worked quite well from a performance standpoint. Guess what: the relative cost of dynamic scheduling is much lower now (and the relative speed of memory vs. the CPU is significantly worse). It made sense back then; it makes a lot more sense now, when a single core can have about 2 billion transistors.

    You seem to be quite negative about speculative execution and branch prediction, and I find it perplexing ... Speculative execution is a critical enabler for high performance processors/high IPC. More importantly, VLIW actually needs even better branch prediction/speculative execution than Out-of-order processors, because VLIW is in order and thus its performance is more dependent on good prediction (reasons are complex, see critical path work by Fields and others to get a better understanding). So, does VLIW make things simpler for branch prediction and speculation? Ummm, not at all, on the contrary.

    From a power perspective, if you compare a VLIW design to an OoO design of exactly the same width, there are a bunch of power-critical structures that are the same, e.g. register file, bypass network, and functional units (VLIW doesn't improve these things). And yes, these will consume more power than the renaming logic. But more importantly, this comparison doesn't take into account the fact that OoO processors can be made much much wider than VLIW ones. E.g., the Apple M processors are 11 wide for integer, and 2-3 wide for FP if I remember correctly (so in a single cycle the processor could issue as many as 13 instructions at the same time, though that is not longer-term sustainable). Can a VLIW design be made that wide? No, not really. I mean the compilers had a hard time making bundles of size 3 with EPIC ...

    This doesn't mean that the structures required for OoOe don't take power - they do, obviously. But really, their impact on the power bottom line is considerably smaller than you'd think. No, I don't have a publicly available paper with a breakdown of all those components (if you do, please share it).

    BTW, as one of the architects of the Power line said (I forget his name): when you have complexity, you deal with it, because the alternative is lower performance, and that's not acceptable.

    Would it be nice not to have such complexity (e.g., by moving that complexity to software)? Sure, if it's technically feasible. But it's worth keeping an eye on the big picture here.

    When looking at instruction scheduling within a fixed window of N instructions (with speculative execution of course), the time in hardware is O(N), and the hardware cost has both linear and O(N * M) components with M being the number of instructions that can be issued every cycle.

    When looking at static scheduling of instructions, the complexity of that is ... ah wait, you can't put an O() cost on it because the problem itself is undecidable.

    Bottomline here is that Static Instructions != Dynamic Instructions. Given Rice's theorem, one can't treat them as idempotent. VLIW proponents try to do exactly this - handwave that Static Instructions, Dynamic Instructions ... same thing really! (not!).

    I strongly suggest you read the Shen and Lipasti book. While the current edition is from 2013 (so a tad old), it's still very much relevant today. If you find a newer/better book, please do share a link.



  • DavidBrown
    replied
    Originally posted by vladpetric View Post

    At a technical level, you're maybe partly correct. Strictly speaking, if you can get rid of something that consumes power, then yes, you get a net gain.

    But:

    1. You do have to do something better instead. In this case, VLIW actually made things worse. Through poor resource utilization, the performance per Watt became far worse.
    So far, so good.
    Originally posted by vladpetric View Post

    2. Your power breakdown is based on the situation in the late 90s/early 2000s. What you're so up in arms about matters a lot less these days.
    But that is not correct. The work I have called "bureaucracy" overhead - hardware re-ordering, register renaming, speculative execution (especially when it is discarded, but also when it must be tracked and turned into "real" execution), and so on - is a large cost for modern processors. It adds significantly to the development cost and time of a design, it is a big cost in complexity (and therefore in risk of errors or security problems), a cost in die area (though not a big one), a big cost in power consumption (depending on the characteristics of the code that is running), and a limiting factor for clock speed.

    Processors that emphasise lower power and cheaper devices minimise the "bureaucracy". They don't do multiple instructions per cycle, and their throughput is limited either by pipeline stalls or by the lower clock speed limits of short pipeline designs. But avoiding or reducing this kind of overhead is a major reason why ARM cores are generally more power efficient than x86 cores, and a major part of the design of the "little" cores that have become popular for ARM, Intel and AMD.

    Originally posted by vladpetric View Post
    At a political level, the above statement indicates to me that you're part of the large group of dynamic scheduling haters. Of course, nobody likes being called a hater, and at the same time this won't make them stop hating, either.
    Now you have moved from being wrong to being completely deluded. My comments are purely technical and fact-based (to the best of my knowledge, though of course I could be getting some things wrong) - not "political" (whatever you mean by that), and not "hate". The very idea that there exists a "large group of dynamic scheduling haters" is utterly bizarre. And even if such a group existed, I would not be in it - dynamic scheduling (and the other features I called "bureaucracy" for short) is a perfectly good trade-off, giving higher processing speeds at the cost of more power and more complex devices. Sometimes high speed is your priority, sometimes low power is your priority.

    Feel free to continue a technical discussion here, but please leave the conspiracy theories at home. And please don't extrapolate so wildly from what other people write.



  • DavidBrown
    replied
    Originally posted by vladpetric View Post
    This is just a political/ideological meme. These kind of memes are claimed true because they sound clever and (maybe) they are funny. Does that make them true under any circumstances? NO. As far as I'm concerned what can be claimed without evidence, can also be dismissed without evidence. As an aside, it is quite typical for such memes to be used in political "debates". But more about that later.
    I had thought it was obvious that I was not being serious. But sometimes these things are not as obvious in written form as they would be when spoken. We all know that theory and practice are different because they cover different things and have different priorities. As an example, in theory the fastest multiplication algorithm currently known scales as O(n·log n). In practice, the fastest algorithm scales as O(n·log n·log log n), because the asymptotically faster algorithm only wins for numbers so big that the earth does not have enough energy to run it.
    Originally posted by vladpetric View Post
    Are there differences between theoretical computer science and applied computer science? Yes, absolutely there are (see, for instance, galactic algorithms). But in the vast majority of cases the results of theoretical computer science (including, but not limited to, the myriad of lower bounds, or NP-completeness/hardness) apply. The burden of proof should be on the person making a claim that a theoretical result doesn't apply, not the other way around.
    Of course computer science theory is very important to practical computer programming. I have not suggested anything different.
    Originally posted by vladpetric View Post
    ​The practice of the halting problem is as follows: there are no good automatic solvers for the halting problem, period (practical tools, applying to practical code, that detect a large majority of infinite loops, not even all of them).
    You are misunderstanding a good deal here, I think - or at least explaining things very badly.
    The theory of the halting problem says there are no general automatic solvers for the halting problem. It is not concerned with whether they are "good" or not, or whether they apply to "practical code". And computation theory also says there /is/ a general automatic solver for the halting problem for a given practical computer - a machine with finite memory is a finite state machine, so in principle you can detect when it revisits a state. Do you want a halting decider that can handle any program that will run on your PC, determining if the program will halt within your lifetime? It's easy, in theory. But of course it is totally impossible in practice.
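    (To be concrete about the "easy in theory" part: for a deterministic machine with finite state, halting detection reduces to detecting a repeated state. A toy sketch with a made-up 32-bit "machine" - the step rule and halt condition are purely illustrative:)

    ```c
    #include <stdint.h>
    #include <stdio.h>

    /* Toy deterministic "machine": the entire machine state is one 32-bit
     * word and step() is its transition function.  The update rule (a
     * Galois LFSR) and the halt condition are made up for the example;
     * from a nonzero state this particular machine never reaches 0. */
    static uint32_t step(uint32_t s)
    {
        return (s >> 1) ^ ((s & 1u) ? 0xEDB88320u : 0u);
    }

    static int halted(uint32_t s)
    {
        return s == 0;
    }

    /* Halting decider for this finite-state machine.  If the machine has
     * not halted after 2^32 steps then, by pigeonhole, it has revisited a
     * state, and a deterministic machine that revisits a state without
     * halting loops forever.  "Easy in theory" - but a real PC has
     * astronomically more than 2^32 states, so the same idea is hopeless
     * in practice. */
    static int halts(uint32_t start)
    {
        uint32_t s = start;
        for (uint64_t i = 0; i <= UINT32_MAX; i++) {   /* brute force */
            if (halted(s))
                return 1;
            s = step(s);
        }
        return 0;
    }

    int main(void)
    {
        /* Takes a few seconds - the point is only that it terminates. */
        printf("halts from state 1? %s\n", halts(1) ? "yes" : "no");
        return 0;
    }
    ```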
    Originally posted by vladpetric View Post
    Ok, what is the second major nail? A compiler simply can't tell ahead of time how the cache hierarchy will behave at run time.
    Agreed.
    Originally posted by vladpetric View Post
    BTW, performance critical DSP for modern processors is done by hand (assembly or intrinsics) and small vectors (SSE, AVX, and similar instruction sets for ARM etc). The compilers genuinely suck at vectorizing actually ... (even with the most recent compilers, it's faster to write a vectorized loop by hand, either with assembly or intrinsics, than it is to massage the code for the compiler to do it).
    Also true.

    Compilers are definitely getting better at vectorising SIMD code on general-purpose CPUs, but can usually only handle fairly simple cases without significant manual help. Getting the best out of DSPs is much harder again - and if you are not getting the best out of the inner loops of your kernels, there's little point in using a dedicated DSP in the first place.
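
    (For a small illustration of the kind of loop that still ends up hand-vectorised: the same saxpy-style kernel written as a plain scalar loop and with SSE intrinsics. The function names are just for the example, and unaligned loads are used so no alignment assumption is needed:)

    ```c
    #include <stddef.h>
    #include <xmmintrin.h>   /* SSE: _mm_loadu_ps, _mm_add_ps, _mm_mul_ps, ... */

    /* Plain scalar loop: y[i] += a * x[i].  Auto-vectorisers can usually
     * handle something this simple, but real kernels rarely stay this clean. */
    void saxpy_scalar(size_t n, float a, const float *x, float *y)
    {
        for (size_t i = 0; i < n; i++)
            y[i] += a * x[i];
    }

    /* Hand-vectorised SSE version, 4 floats per iteration, with a scalar
     * tail for the leftover elements. */
    void saxpy_sse(size_t n, float a, const float *x, float *y)
    {
        __m128 va = _mm_set1_ps(a);
        size_t i = 0;
        for (; i + 4 <= n; i += 4) {
            __m128 vx = _mm_loadu_ps(&x[i]);
            __m128 vy = _mm_loadu_ps(&y[i]);
            vy = _mm_add_ps(vy, _mm_mul_ps(va, vx));
            _mm_storeu_ps(&y[i], vy);
        }
        for (; i < n; i++)           /* scalar tail */
            y[i] += a * x[i];
    }
    ```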



  • vladpetric
    replied
    Originally posted by archkde View Post
    I do not disagree that "don't know" may happen frequently in practice, or even that it can be hard to quantify in the first place. All I'm saying is that it is perfectly compatible with Rice's theorem that "don't know" is so rare it only happens once every 10 years in the entire world in practice.
    The problem with "don't know" is that the compiler needs to treat it as "no!" (as in, can't optimize).
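
    (A concrete example of "don't know" collapsing to "no": pointer aliasing. The function names are made up; the two loops are identical apart from the restrict qualifiers:)

    ```c
    /* Without 'restrict' the compiler cannot prove that 'out' and 'in'
     * don't overlap, so "don't know" becomes "assume they might": in[0]
     * and in[1] must be reloaded every iteration (or a run-time overlap
     * check emitted), because the store through out[i] could change them. */
    void scale_maybe_alias(float *out, const float *in, int n)
    {
        for (int i = 0; i < n; i++)
            out[i] = in[0] + in[1];
    }

    /* Identical loop, but 'restrict' promises no overlap.  The two loads
     * become loop-invariant, so the compiler is free to hoist them out of
     * the loop and vectorise the stores. */
    void scale_no_alias(float *restrict out, const float *restrict in, int n)
    {
        for (int i = 0; i < n; i++)
            out[i] = in[0] + in[1];
    }
    ```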

    Ok, so there is a gray area, which we agree about. But in that gray area, I think you're implicitly assuming the theoretically best possible situation from an optimizing compiler/static optimizer angle. After working with processors and compilers for a while, I think you're wrong.

    Originally posted by archkde View Post
    And what makes Itanium so much different here? My understanding was that it is much more about coming up with a reasonable desired property in the first place, but you seem to be much more knowledgeable in that area.
    As I detailed in a parallel thread over here (sorry, I won't detail it again; I'll ask you to read that - it's two messages above, the third nail), for regular (dynamically scheduled/OoOe) processors optimizations are nice-to-haves/optional, whereas for Itanium they are absolutely necessary. Essentially, the Intel Itanium proponents/early champions claimed that they would do something they simply couldn't ...
    Last edited by vladpetric; 07 January 2024, 04:59 PM.



  • vladpetric
    replied
    Originally posted by DavidBrown View Post
    You could avoid all the complications of hardware scheduling, speculative execution, register renaming, branch prediction, etc., that makes x86 processors so complicated and means that a large proportion of the power used by x86 devices is "bureaucracy", rather than actually executing instructions.
    At a technical level, you're maybe partly correct. Strictly speaking, if you can get rid of something that consumes power, then yes, you get a net gain.

    But:

    1. You do have to do something better instead. In this case, VLIW actually made things worse. Through poor resource utilization, the performance per Watt became far worse.
    2. Your power breakdown is based on the situation in the late 90s/early 2000s. What you're so up in arms about matters a lot less these days.

    At a political level, the above statement indicates to me that you're part of the large group of dynamic scheduling haters. Of course, nobody likes being called a hater, and at the same time this won't make them stop hating, either.

    In general, reformist political movements work as follows:

    1. One starts with something to hate (in this case, dynamic scheduling/out-of-order execution). This is fairly critical, because hate/fear/disgust are far more powerful motivators than enthusiasm/love. A political movement needs an element of hate, in other words.
    2. One identifies a problem about the thing they hate (and typically it is absolutely real ...).
    3. Through cherry-picking/aggressive spinning that problem becomes the biggest problem in the world. If you only look at the negatives of something (OoO execution/dynamic scheduling) you can make it look so bad, that people start believing that everyone would be better off if we just stopped doing it altogether.

    As far as I'm concerned, you're doing exactly this.
    Last edited by vladpetric; 07 January 2024, 04:48 PM.

