Skylake AVX-512 Benchmarks With GCC 8.0


  • #21
    Originally posted by jabl View Post
    In defence of AVX-512, Intel has ever so slowly evolved their SIMD extensions towards a "real vector ISA", which is a significantly better target for vectorizing compilers than packed SIMD. With AVX2 we got gather; now with AVX-512 we finally have scatter and predication. The major things lacking, AFAICS, are a vector length register (so the compiler doesn't have to generate a separate tail loop) and strided load/store.
    Fair points, but the world has moved on. Not much code would benefit from AVX-512 that wouldn't benefit more from a GPU.

    That said, I do like Xeon Phi as a catch-all. If, for some reason, you can't fully avail your code of a GPU, then Xeon Phi isn't too bad a compromise (if an expensive one).
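    To make the predication point concrete, here is a rough sketch of how a masked tail replaces the separate scalar remainder loop a compiler would otherwise emit. It uses the standard AVX-512F intrinsics from <immintrin.h>; the function and its names are made up for illustration, not taken from this thread or from GCC's output.

    Code:
    #include <immintrin.h>
    #include <stddef.h>

    /* Add two float arrays; the remainder (n % 16) is handled with a mask,
       so no scalar tail loop is needed. */
    void add_arrays(float *dst, const float *a, const float *b, size_t n)
    {
        size_t i = 0;
        for (; i + 16 <= n; i += 16) {              /* full 16-lane chunks */
            __m512 va = _mm512_loadu_ps(a + i);
            __m512 vb = _mm512_loadu_ps(b + i);
            _mm512_storeu_ps(dst + i, _mm512_add_ps(va, vb));
        }
        if (i < n) {                                /* masked tail */
            __mmask16 k = (__mmask16)((1u << (n - i)) - 1);  /* low n-i lanes active */
            __m512 va = _mm512_maskz_loadu_ps(k, a + i);
            __m512 vb = _mm512_maskz_loadu_ps(k, b + i);
            _mm512_mask_storeu_ps(dst + i, k, _mm512_add_ps(va, vb));
        }
    }

    A vector length register, as in the classic vector machines jabl alludes to, would make even that mask bookkeeping unnecessary.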



    • #22
      Originally posted by coder View Post
      That said, I do like Xeon Phi as a catch-all. If, for some reason, you can't fully avail your code of a GPU, then Xeon Phi isn't too bad a compromise (if an expensive one).
      Better yet, don't compromise: http://www.nec.com/en/global/solutio...sx/A100-1.html http://www.nec.com/en/global/solutio.../sx/index.html

      Btw, for the record, Intel isn't wasting die space on these instructions.

      First off, these instructions are essentially a (very) limited alternative to exposed pipelines or wider issue, where a single instruction is broken down into multiple branches and pipelines and uses lots of existing units. That's to say, it's just a way to use existing silicon rather than a whole lot of new silicon.

      More importantly, Intel has nothing better to do with the silicon. Die space has been abundant and cheap for over a decade, to the point that Intel had nothing better to do than pack in the bridges and add all sorts of meh-to-useless features. What they're lacking is space next to the cache lanes, which is why the chips need to be throttled down when so many units get fired up all at once... But adding these instructions mostly taxes the decoder, which is harmless so long as people profile their code.

      Overall, I'm more pissed at those damn graphics cores robbing me of perfectly good lanes I could use for actually decent graphics cards and expansion cards for SATA/RAID and whatnot. Well, at least AMD gets it.



      • #23
        Okay, when Cannonlake launches, ~40% of the die area goes to the iGPU and 25% to AVX-512 (something that could be done by a GPU). Does Intel even know what people use CPUs for anymore?
        I am so glad AMD started the core wars; at least we will have software that can run on multiple cores.



        • #24
          Total non sequitur. First, if you can build your code to effectively utilize the vector engine, then you can certainly run it on a GPU. Second, a GPU would be cheaper and faster.

          IMO, that chip is just a silly waste of Japanese taxpayers' money. ...not that I care what they spend their money on. There are certainly worse things...

          Originally posted by c117152 View Post
          Btw, for the record, Intel isn't wasting die space on these instructions.

          First off, these instructions are essentially a (very) limited alternative to exposed pipelines or wider issue, where a single instruction is broken down into multiple branches and pipelines and uses lots of existing units. That's to say, it's just a way to use existing silicon rather than a whole lot of new silicon.
          Sorry, I don't buy it. Got a source on that?

          Even if you're right about the execution units, which I certainly doubt, you're wrong about the vector registers. They definitely had to double those.

          Originally posted by c117152 View Post
          More importantly, Intel has nothing better to do with the silicon. Die space has been abundant and cheap for over a decade, to the point that Intel had nothing better to do than pack in the bridges and add all sorts of meh-to-useless features.
          This is more nonsense. It would be true if they were still building single-core CPUs, but their Xeon dies do get quite big and die area on the latest process node certainly ain't cheap.

          Originally posted by c117152 View Post
          Overall, I'm more pissed at those damn graphics cores robbing me of perfectly good lanes I could use for actually decent graphics cards and expansion cards for SATA/RAID and whatnot.
          Back in the days of Sandybridge/Ivybridge, you got x4 more PCIe lanes per socket. I guess your point is that, rather than add more pins to the socket, Intel chose to repurpose those lanes for more display outputs? Otherwise, I don't see why you blame their GPU for their paucity of PCIe lanes.

          Speaking of AMD, I wish they'd brought out another x4, in AM4. That would seal the deal, for me. The die seems to have x32, so it's just a matter of the darn socket.



          • #25
            Originally posted by coder View Post
            Totally non-sequitur. First, if you can build your code to effectively utilize the vector engine, then you can certainly run it on a GPU. Second, a GPU would be cheaper and faster.
            Ha? Did you even read through the data sheets? It's generic OpenMP. You literally compile and run Linux and have your floating-point operations offloaded to the vector engine.
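            For anyone unfamiliar with it, the offload model being described is just standard OpenMP 4.5 target directives; here is a minimal sketch (the function name is made up, and nothing in it is NEC-specific; the same pragma can target a GPU or simply fall back to the host, depending on how the compiler was built).

            Code:
            /* daxpy offloaded to whatever device the toolchain supports.
               Build with an OpenMP 4.5+ compiler with offloading enabled,
               e.g. gcc -fopenmp. */
            void daxpy(double *y, const double *x, double a, int n)
            {
                #pragma omp target teams distribute parallel for \
                    map(to: x[0:n]) map(tofrom: y[0:n])
                for (int i = 0; i < n; i++)
                    y[i] = a * x[i] + y[i];
            }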

            Originally posted by coder View Post
            Sorry, I don't buy it. Got a source on that?
            Their documentation, and the thermal imaging released a couple of years back, which clearly shows an AVX-256 operation heating up the decoder and the central regions of the different cores smack dab next to the L1s.

            Originally posted by coder View Post
            Even if you're right about the execution units, which I certainly doubt, you're wrong about the vector registers. They definitely had to double those.
            Look at slides 7 onwards here: https://www.usenix.org/sites/default...ides_koppe.pdf Look up the video and read the associated paper.

            Originally posted by coder View Post
            This is more nonsense. It would be true if they were still building single-core CPUs, but their Xeon dies do get quite big and die area on the latest process node certainly ain't cheap.
            The way they scale Xeon and i7 production on the same wafers, and sometimes even target the same die, means a lot of useless crap that barely makes sense on Xeons ends up in non-server hardware. If it's a useful feature they'll disable it, but if it's useless crap barely anyone uses they'll just leave it in, since it doesn't affect their market segmentation. Also, a big part of this is using the wider user base as beta testers of sorts. If all of us idiots think "Oh! Cool! Look, Intel gives us free vector instructions!" and start optimizing stuff for it even when it isn't actually worth our time, Intel's real customers, the data centers, benefit. Microsoft took the same approach with WSL.

            Originally posted by coder View Post
            Back in the days of Sandybridge/Ivybridge, you got x4 more PCIe lanes per socket. I guess your point is that, rather than add more pins to the socket, Intel chose to repurpose those lanes for more display outputs? Otherwise, I don't see why you blame their GPU for their paucity of PCIe lanes.
            Same points, different meaning. As a counterpoint: if AMD doesn't see the point in putting any of their good GPUs in Ryzen's desktop CPU dies and feels it's better to let people populate PCIe slots with dGPUs instead, then when Intel puts bad GPUs in their dies they're effectively wasting die space and PCIe lanes, since as far as they're concerned they have die space to spare and their customers should fork over extra cash and buy Xeons for those extra PCIe lanes.

            But we can leave the x86 world and examine the ARM and POWER dies and see the same patterns: as nodes shrink, they run out of useful features, so extensions and accelerators of dubious benefit start coming and going. You can also observe this in how, despite officially deprecating old instructions, ARM cores still implement most of their old stuff. The reason is they simply have nothing better to do with the extra space for now.

            Btw, all the recent gcc and llvm releases have had significant auto-vectorization improvements. If those instructions were truly helpful in the general case, you would be hearing something by now.
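            On that last point, it's easy to check what the auto-vectorizer actually does with a given loop; a small illustrative example (the flags are standard GCC options, the file name is hypothetical):

            Code:
            /* saxpy.c: the kind of loop the auto-vectorizer targets. Compile with
                 gcc -O3 -march=skylake-avx512 -fopt-info-vec-optimized -c saxpy.c
               and GCC reports which loops it vectorized and with what vector size. */
            void saxpy(float *restrict y, const float *restrict x, float a, int n)
            {
                for (int i = 0; i < n; i++)
                    y[i] = a * x[i] + y[i];
            }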



            • #26
              Originally posted by c117152 View Post
              Ha? Did you even read through the data sheets? It's generic OpenMP. You literally compile and run Linux and have your floating-point operations offloaded to the vector engine.
              You don't think OpenMP can be used with GPUs? And just because you can use it doesn't mean it makes sense. Otherwise, there'd be no need for CUDA, OpenCL, etc.

              Originally posted by c117152 View Post
              Their documentation, and the thermal imaging released a couple of years back, which clearly shows an AVX-256 operation heating up the decoder and the central regions of the different cores smack dab next to the L1s.
              I had asked for a source showing that AVX-512 is "just a way to use existing silicon rather than a whole lot of new silicon". If you have one, let's see it. Otherwise, this is pure and utter BS.

              Originally posted by c117152 View Post
              Look at slides 7 onwards here: https://www.usenix.org/sites/default...ides_koppe.pdf Look up the video and read the associated paper.
              Okay, looked at the slides. Don't see anything about not having to enlarge registers for AVX-512.

              Originally posted by c117152 View Post
              The way they scale Xeon and i7 production on the same wafers, and sometimes even target the same die, means a lot of useless crap that barely makes sense on Xeons ends up in non-server hardware.
              No. The workstation & server dies are distinct from their desktop chips.

              Originally posted by c117152 View Post
              when Intel puts bad GPUs in their dies they're effectively wasting die space and PCIe lanes, since as far as they're concerned they have die space to spare and their customers should fork over extra cash and buy Xeons for those extra PCIe lanes.
              I still don't see why you think their lack of PCIe lanes has anything to do with their iGPUs. It's a marketing decision they made about how many lanes they think desktop users need. If they wanted to add more, I'm sure the iGPU isn't stopping them. And it's not even connected via PCIe, FYI. It sits on the CPU's internal ring bus, as a peer of the CPU cores and system agent.

              Originally posted by c117152 View Post
              But we can leave the x86 world and examine the ARM and POWER dies and see the same patterns: as nodes shrink, they run out of useful features, so extensions and accelerators of dubious benefit start coming and going. You can also observe this in how, despite officially deprecating old instructions, ARM cores still implement most of their old stuff. The reason is they simply have nothing better to do with the extra space for now.
              This is nonsense. That's not free - you can bet they wish they could simplify their instruction decoder. The only conceivable reason they still support deprecated features is to maintain binary backward compatibility, because many customers probably insist on it.



              • #27
                Originally posted by coder View Post
                You don't think OpenMP can be used with GPUs? And just because you can use it doesn't mean it makes sense. Otherwise, there'd be no need for CUDA, OpenCL, etc.
                But it does make sense with this. It runs perfectly well on vector engines; that's the point. They don't even offer an alternative, just automatic offloading of floating-point operations and OpenMP.


                Originally posted by coder View Post
                I had asked for a source showing that AVX-512 is "just a way to use existing silicon rather than a whole lot of new silicon". If you have one, let's see it. Otherwise, this is pure and utter BS.
                Look in https://www.intel.com/content/dam/ww...ion-manual.pdf and compare p.34 to p.40 under execution units. They clearly renamed SIMD to VEC units and added a couple of ports to handle the wider instructions. More importantly, (v)addp* was under its own FP Mov execution unit and is now shared with Vec Add, meaning they've moved some of the floating-point operations to use the same silicon as the vector operations. Even disregarding the power and thermal dissipation issues I've brought up earlier, this is likely also why the new vector operations conflict with floating point (and probably other stuff I'm missing) when run concurrently. You're literally saturating ports (physical cache lines between execution units) with competing instructions.

                Originally posted by coder View Post
                Okay, looked at the slides. Don't see anything about not having to enlarge registers for AVX-512.
                Compare the number of execution stacks on p.39 to p.33. If it's just a wider instruction, there's no need to double the ports; but if it's broken down into multiple smaller instructions, then you need the interconnect.
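                If anyone wants to poke at the port-pressure question directly rather than argue from slides, llvm-mca prints per-port resource-pressure tables for a snippet of compiled code; a hypothetical workflow (the file names are made up, the tool and flags are stock LLVM):

                Code:
                /* ports.c: a 512-bit add next to a scalar FP multiply.
                   Generate assembly and feed it to llvm-mca:
                     gcc -O2 -S -mavx512f ports.c -o ports.s
                     llvm-mca -mcpu=skylake-avx512 ports.s
                   The resource-pressure tables show which execution ports
                   each instruction ends up occupying. */
                #include <immintrin.h>

                __m512 vadd(__m512 a, __m512 b) { return _mm512_add_ps(a, b); }
                double smul(double x, double y) { return x * y; }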

                Originally posted by coder View Post
                No. The workstation & server dies are distinct from their desktop chips.
                Same wafers; features are disabled by fusing off faulty circuits. Google it; it's common knowledge.

                Originally posted by coder View Post
                I still don't see why you think their lack of PCIe lanes has anything to do with their iGPUs.
                Because sharing an ~8 MB LLC isn't enough for textures, and no one wants their iGPU thrashing their CPU caches.

                Originally posted by coder View Post
                This is nonsense. That's not free - you can bet they wish they could simplify their instruction decoder. The only conceivable reason they still support deprecated features is to maintain binary backward compatibility, because many customers probably insist on it.
                No. ARM doesn't cater to binary backwards compatibility the way Intel does. They've been breaking instructions left and right over the years ( https://www.riscosopen.org/wiki/docu...ility%20primer ) when it was necessary, and they've also kept some instructions in the cores without bothering to remove them (or letting anyone know), even after announcing their deprecation, since it didn't bother anyone.



                • #28
                  Originally posted by c117152 View Post
                  ...
                  Okay, mate. I'm going to do us both a favor and stop this nonsense. I could (and actually would like to) refute each of your points, but it's clear that the issue isn't your points so much as your spouting off about something with which you have no firsthand experience or direct knowledge, and then doubling down when you're called out to see if you can throw enough at the wall that something sticks.

                  I think if you put half the effort into educating yourself about this stuff that you seem to put into winning internet arguments, you'd actually be in a much better position to win said arguments (not to mention succeed in more fruitful endeavors).

                  Call this what you will, but I'm cutting my losses (i.e. lost time & energy).

