Glibc Adds Arm SVE-Optimized Memory Copy - Can "Significantly" Help Performance


  • #11
    Originally posted by carewolf View Post
    No, I mean the code this story is about. It operates on 32 bytes at a time, aka 256 bits, so it won't run faster on a 512-bit implementation (at least according to this summary).
    Oh, right. The patch does call it an initial implementation, leaving open the possibility for further improvements.

    Realistically, the A64FX is the only 512-bit implementation I'm aware of, so supporting nothing wider than 256 bits isn't currently very consequential. Also, perhaps there's a point of diminishing returns that 256 bits already exceeds, though 512 bits is a typical cacheline size.

    Comment


    • #12
      Originally posted by coder View Post
      I disagree. You & Linus aren't thinking hard enough about the practical realities of software, on such a system. What's going to happen is that software will spawn too many threads, they'll get faulted off of the weak cores, and will simply contend for time on the more capable cores.

      Often, apps are ignorant of what ISA extensions the libraries they're using even employ. So, putting the burden on the app developer to manage threads and affinities based on core capabilities is unreasonable and unrealistic.
      Maybe, maybe not. With source-based Linux distros, there's potential for clever solutions. Libraries generally have multiple paths with fallbacks, and you could use the namespace mechanism as a sort of LD_PRELOAD so that libraries know which path to pick, transparently to the app. Maybe you don't even enable the option until an app shows high CPU demand.
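      To make that concrete, here's a minimal sketch of how a library with multiple internal paths could honor an external hint without the app being aware of it. The FORCE_ISA_LEVEL environment variable and the path functions are hypothetical, standing in for whatever knob a distro or loader mechanism would actually expose:

      ```c
      /* Sketch of transparent code-path selection inside a library.
       * FORCE_ISA_LEVEL is a hypothetical knob (not a real glibc feature);
       * the two paths stand in for, e.g., a scalar and an AVX2/SVE copy loop. */
      #include <assert.h>
      #include <stdlib.h>
      #include <string.h>

      static int copy_baseline(void) { return 0; }  /* path safe on all cores */
      static int copy_wide(void)     { return 1; }  /* stand-in for a wide-vector path */

      /* The entry point an app calls through the library's public API. */
      int lib_copy(void)
      {
          const char *hint = getenv("FORCE_ISA_LEVEL");
          if (hint && strcmp(hint, "baseline") == 0)
              return copy_baseline();  /* externally pinned to the lowest path */
          return copy_wide();          /* default: fastest path detected */
      }

      int main(void)
      {
          assert(lib_copy() == 1);     /* wide path unless the hint is set */
          return 0;
      }
      ```

      The point is that the selection lives entirely inside the library's dispatcher, so the app's code never changes.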

      In any case, while I believe you that making it work well would be tricky, I also believe that at least a few would find it worth the trouble.

      Comment


      • #13
        Originally posted by WorBlux View Post
        there is the potential for clever solutions. Libraries generally have multiple paths and fallback, and you could use namespace mechanism as a sort of LD_PRELOAD so that libraries know which path to pick in a way transparent to the app.
        That's a nice idea, but the problem with a core-specific dispatch mechanism is that once a thread starts executing on a more capable core, it's basically bound to it. When it gets preempted, it could have intermediate state involving ISA extensions not available on the smaller cores. It might be feasible for the OS to track which address ranges correspond to which ISA levels, so it could know when it's safe to migrate a thread to a lower-class core.

        The other problem you'll likely find is that code paths optimized for a particular ISA extension tend to have in-memory data structures specific to that code path. So, if some threads are executing the AVX2 path and others the AVX-512 path, shared state might not be read and written consistently between them.

        Originally posted by WorBlux View Post
        In any case, while I believe you that making it work well would be tricky, I also believe that at least a few would find it worth the trouble.
        I'm not opposed to experimentation and even giving users a non-default option to shoot themselves in the foot.

        Also, what Guest said is right - OS developers can already trap certain opcodes on certain cores, if they wanted to simulate a heterogeneous CPU.

        Comment


        • #14
          Originally posted by atomsymbol
          Secondly, some of the solutions I mentioned to Torvalds are based on forms of binary translation, which would enable an app to call a library function without the app dealing with the problem of whether-or-when the library is using AVX2 or AVX-512. But that is no longer a "single-line patch" nor a "1000-line patch".
          Anything that's JIT-compiled can conceivably do it right, assuming the compiler is sophisticated enough.

          Comment


          • #15
            Originally posted by coder View Post
            I'm trying to understand your point about TSX. So, you agree that it was comparable to TME, but you're concerned that it's gone and doesn't appear to be coming back?
            Yeah, I should have been clearer: TSX is at best segmented to parts of the enterprise space, which is not doing adoption any favors.


            The situation seems to be more complicated than that though.
            Last edited by brucethemoose; 10 June 2022, 12:44 AM.

            Comment


            • #16
              Originally posted by carewolf View Post

              This implementation works on 32 bytes at a time, so 32 * 8 = 256 bits, similar to AVX2.
              Right, but future implementations will be wider.

              My point is that SVE2 software written now is going to support those wider implementations in the future, while AVX2 software written now is going to be stuck with AVX2, and without the more flexible instructions of AVX512.

              ARM is going to standardize SVE2 in everything relatively early, while it appears that Intel will have AVX2-only products floating around a while longer, and AVX-512 remains extremely fragmented.
              Last edited by brucethemoose; 10 June 2022, 12:45 AM.

              Comment


              • #17
                Originally posted by brucethemoose View Post
                Yeah, I should have been clearer: TSX is at best segmented to parts of the enterprise space, which is not doing adoption any favors.
                Well, it wasn't, previously. Haswell had it disabled due to bugs. It did actually work in Skylake and Coffee Lake, until Intel started rolling out microcode updates that disabled it (using security vulnerabilities as the excuse).

                This all came as a great disappointment to me, as I thought TSX and HLE were truly innovative features that really helped move x86 forward. So, I can only shrug in agreement as you tout the benefits and advantages of TME.

                I think there's a real opportunity for AMD to step in and revive TSX and HLE. I hope they make an appearance in Zen 4, though that's probably a long shot.

                Comment


                • #18
                  Originally posted by brucethemoose View Post
                  SVE2 software written now is going to support those wider implementations in the future,
                  Let's see what widths the SVE2 version of memcpy() decides to support.

                  Originally posted by brucethemoose View Post
                  ARM is going to standardize SVE2 in everything relatively early,
                  ARMv9-A already did standardize on it, though ARMv9-A products are still just starting to trickle onto the market.

                  Perhaps the SVE2 requirement is even to blame for the Cortex-A510's weird option to share a single FPU between two cores, in a fashion slightly reminiscent of AMD's Bulldozer. Mediatek has at least one SoC where they gave each A510 core its own FPU, but I think Qualcomm opted for the shared approach, in their latest flagship.

                  Originally posted by brucethemoose View Post
                  it appears that Intel will have AVX2-only products floating around a while longer, and AVX-512 remains extremely fragmented.
                  It's even worse than you say, with Alder Lake supporting only AVX2. Where Alder Lake did manage to raise the bar is in finally widening the E-cores. Previous generations didn't even support AVX!

                  Comment


                  • #19
                    Originally posted by brucethemoose View Post

                    Right, but future implementations will be wider.

                    My point is that SVE2 software written now is going to support those wider implementations in the future, while AVX2 software written now is going to be stuck with AVX2, and without the more flexible instructions of AVX512.
                    Not if the code works on 32 bytes at a time. It is equally stuck at a fixed width, just like AVX2 code.

                    Comment


                    • #20
                      Originally posted by carewolf View Post
                      Not if the code works on 32 bytes at a time. It is equally stuck at a fixed width, just like AVX2 code.
                      SVE isn't fixed width. The S indeed stands for "Scalable". The 32-byte aspect is merely an upper bound of this current implementation. The idea is that even implementations that are only 128 bits wide will still benefit.

                      https://developer.arm.com/documentat...in-your-c-code
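                      The width-agnostic style can be sketched in portable C. Here vector_bytes() is a hypothetical stand-in for SVE's CNTB instruction, which reports the hardware's vector length at run time; on real SVE hardware the same binary would process 16, 32, or 64 bytes per iteration without recompilation:

                      ```c
                      /* Portable sketch of SVE's scalable style: the loop never
                       * hard-codes a vector width but queries one at run time.
                       * vector_bytes() is a hypothetical stand-in for SVE's CNTB. */
                      #include <assert.h>
                      #include <stddef.h>
                      #include <string.h>

                      static size_t vector_bytes(void) { return 32; }  /* pretend: 256-bit vectors */

                      static void copy_scalable(unsigned char *dst, const unsigned char *src, size_t n)
                      {
                          const size_t vl = vector_bytes();
                          size_t i = 0;
                          for (; i + vl <= n; i += vl)       /* one full "vector" per iteration */
                              memcpy(dst + i, src + i, vl);
                          if (i < n)                          /* SVE would mask the tail with a predicate */
                              memcpy(dst + i, src + i, n - i);
                      }

                      int main(void)
                      {
                          unsigned char src[100], dst[100] = {0};
                          for (int i = 0; i < 100; i++)
                              src[i] = (unsigned char)i;
                          copy_scalable(dst, src, sizeof src);
                          assert(memcmp(dst, src, sizeof src) == 0);
                          return 0;
                      }
                      ```

                      On real SVE the tail wouldn't need a separate branch at all; predicated loads and stores handle the partial final vector.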

                      BTW, I noticed sysdeps/aarch64/multiarch/memcpy.c has a whole dispatch loop for different ARM server CPUs. Interestingly, the Neoverse N2 doesn't even dispatch to the new SVE version, in spite of supporting it (maybe due to its 128-bit implementation?). Also, it seems as though the A64FX already had its own optimized version. Not a good advertisement for SVE's generic credentials, I think.
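                      That dispatch loop follows glibc's usual multiarch pattern: pick an implementation once, based on CPU identification, rather than on every call. A rough sketch of the idea, using a plain function pointer instead of a real GNU ifunc resolver, with cpu_has_sve() and cpu_is_a64fx() as hypothetical probes standing in for glibc's HWCAP/MIDR checks:

                      ```c
                      /* Sketch of the multiarch dispatch idea, modeled with a plain
                       * function pointer instead of a real GNU ifunc resolver.
                       * The probe functions are hypothetical stand-ins for glibc's
                       * HWCAP/MIDR checks. */
                      #include <assert.h>
                      #include <stdbool.h>
                      #include <stddef.h>
                      #include <string.h>

                      static void *memcpy_generic(void *d, const void *s, size_t n) { return memcpy(d, s, n); }
                      static void *memcpy_sve(void *d, const void *s, size_t n)     { return memcpy(d, s, n); }
                      static void *memcpy_a64fx(void *d, const void *s, size_t n)   { return memcpy(d, s, n); }

                      static bool cpu_has_sve(void)  { return true;  }  /* pretend HWCAP reports SVE */
                      static bool cpu_is_a64fx(void) { return false; }  /* pretend MIDR check */

                      /* Resolver: consulted once at startup, like an ifunc, so the
                       * chosen implementation is called directly ever after. */
                      static void *(*resolve_memcpy(void))(void *, const void *, size_t)
                      {
                          if (cpu_is_a64fx())
                              return memcpy_a64fx;  /* A64FX keeps its own tuned version */
                          if (cpu_has_sve())
                              return memcpy_sve;    /* generic SVE path */
                          return memcpy_generic;
                      }

                      int main(void)
                      {
                          void *(*my_memcpy)(void *, const void *, size_t) = resolve_memcpy();
                          char dst[16];
                          my_memcpy(dst, "hello", 6);
                          assert(strcmp(dst, "hello") == 0);
                          return 0;
                      }
                      ```

                      The per-CPU special cases are exactly why the N2 can skip the new SVE version even while supporting SVE: the resolver can prefer whatever its tables say is fastest for that microarchitecture.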
                      Last edited by coder; 11 June 2022, 04:11 AM.

                      Comment
