Libre RISC-V Open-Source Effort Now Looking At POWER Instead Of RISC-V


  • nokipaike
    replied
Originally posted by lkcl:

this was the hope that inspired Larrabee. they created an absolutely fantastic "Parallel Compute Engine". unfortunately, the GPU-level performance was so bad that the team was *not allowed* to publish the numbers

Jeff Bush from Nyuzi had to research it, and i talked with him over the course of several months: we established that a software-only GPU - with no custom accelerated opcodes - would have only TWENTY-FIVE percent of the performance of, say, MALI 400, for the same silicon die area. that means that a comparably-performing software-only GPU would require FOUR times the power (and die area).

    obviously, that's not going to fly

    in speaking with Mitch Alsup on comp.arch i found out a little bit more about why this is. it turns out one of the reasons is that if you want a "fully accurate" IEEE754 FP unit, to get that extra 0.5 ULP (units in last place), you need THREE TIMES the silicon area.

    in a GPU you just don't care that much about accuracy, and that's why in the Vulkan Spec you are allowed a lot less accurate answers in SQRT, RSQRT, SIN, COS, LOG etc.

basically there are areas where you are trading speed for accuracy, and these tend to conflict badly with the "accuracy" requirements of traditional "Compute" Engines. we are kinda... loonies for even trying. however, if you look at the MIPS 3D ASE (you can still find it online), running instructions twice to get better accuracy is a known technique, and if we plan the ALUs in advance, we can "reprocess" intermediary results using microcoding, and serve *both* markets - GPU (less accurate, less time, less power), and IEEE754 (fully accurate, longer, more power).

You all tend to exclude neural-network techniques - managing the cores and threads for energy and performance - as a winning piece of the puzzle.

It's useless for me to try to explain how the wheel is made; there are people who have already built it and who understand it much better than me.
This is an interesting video that tries to make explicit where the big players are moving:

    The FUTURE of Computing Performance
    https://youtu.be/3PjNgRWmv90
    Last edited by nokipaike; 24 October 2019, 07:40 PM.



  • lkcl
    replied
Originally posted by nokipaike:
    if you think about it, it would consume very little. it would be easy to design and build. the power would be all in parallelism.
yeah, but sadly, its general-purpose performance would suck. we want to combine the two tasks, so that you don't *have* two L1 caches, two sets of RAM, two sets of everything-but-slightly-different.

so we are making a compromise: it turns out that if you make every other pipeline latch dynamically "transparent", you can turn a 5-stage pipeline into a 10-stage one at the flick of a switch.

running on low power, at low speed, you open up the gates and two pipeline combinatorial blocks are now connected back-to-back. want to run at desktop-level speeds? close up the gates and you have a 10-stage pipe that can run at 1.6GHz.
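the latch trick above can be put into numbers. a toy timing model (the 0.625ns per-stage delay is an assumed figure, chosen purely so that the fast mode lands on the 1.6GHz quoted above):

```python
# toy timing model of dynamically-transparent pipeline latches.
# assumption: each of the 10 combinatorial blocks has the same delay,
# picked here as 0.625 ns so the 10-stage mode clocks at 1.6 GHz.
STAGE_DELAY_NS = 0.625

def max_clock_ghz(blocks_merged):
    # the clock period is set by the longest combinatorial path between
    # opaque latches; making a latch transparent joins two blocks
    # back-to-back, multiplying that path delay
    return 1.0 / (STAGE_DELAY_NS * blocks_merged)

fast = max_clock_ghz(1)  # latches opaque: 10-stage pipe, 1.6 GHz
slow = max_clock_ghz(2)  # every other latch transparent: 5-stage, 0.8 GHz
```

the point of the trade-off: the 5-stage mode halves the clock but also halves the in-flight latency in cycles and the latch-toggling power, which is what you want on battery.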



  • lkcl
    replied
Originally posted by starshipeleven:
    Wrong, no GPU dedicates transistor logic to each Vulkan command.
    Vulkan is not as simple as a media codec (x264 for example) where you can make an ASIC for that specific algorithm.

    GPUs are using cores that are more general-purpose than that, while not as general-purpose as CPU cores.
we're finding this out. Jacob has been studying the Vulkan spec for some time, and Mitch Alsup has been helping on comp.arch to keep us on the "straight and narrow" as far as gate-count is concerned. yes, you have sin, cos and atan2, but you do *not* bother to put down arctan or arccos etc., because those can be computed to reasonable accuracy in software, just like on any general-purpose processor, and they're so infrequently used that on the face of it it's not worthwhile adding them

    however, one of the things that we want to provide is the "unusual" stuff - the "long tail" of 3D, so that people can innovate and don't get caught out by the "mass market" GPU focus. and for that, we simply can't predict what people *might* use the Libre GPU for. therefore, we may just have to put the hardware opcodes in anyway. buuut, doing so is... expensive (if they are dedicated units) so, one thing we might do is just put in a CORDIC engine, and use microcode for anything that's not commonly used. CORDIC is so versatile it can do almost anything, it's really amazing.

    that way, all the "unpopular" opcodes, well, at least there *is* a small performance gain to be had, and we can see what happens in the market as customers pick up on it.
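as a sketch of why CORDIC is so versatile: a single shift-and-add loop produces sin and cos simultaneously (and other modes of the same hardware give atan2, sqrt, log and more). here is a minimal floating-point model of the rotation mode - a real engine would use fixed-point arithmetic and share one adder, this is just to show the structure:

```python
import math

def cordic_sincos(theta, iterations=32):
    # rotation-mode CORDIC: rotate the vector (K, 0) by theta using
    # only shifts (multiply by 2^-i) and adds, steering by the sign
    # of the residual angle z
    angles = [math.atan(2.0 ** -i) for i in range(iterations)]
    # pre-scale by the aggregate CORDIC gain so no final multiply is needed
    k = 1.0
    for i in range(iterations):
        k /= math.sqrt(1.0 + 2.0 ** (-2 * i))
    x, y, z = k, 0.0, theta
    for i in range(iterations):
        d = 1.0 if z >= 0.0 else -1.0
        x, y = x - d * y * 2.0 ** -i, y + d * x * 2.0 ** -i
        z -= d * angles[i]
    return y, x  # (sin(theta), cos(theta))
```

convergence holds for |theta| below roughly 1.74 radians (the sum of the arctan table); a real unit folds larger angles into range first. each extra iteration buys about one more bit of accuracy, which is exactly the "microcode it to the precision the opcode needs" property mentioned above.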



  • lkcl
    replied
Originally posted by nokipaike:
do you want to create an open-source GPU?
write an architecture that is already multicore - parallel computing.
Create a "micro-core that has the minimum of cycles required" for each existing OpenGL/Vulkan API call. Get help from a machine-learning architecture to optimize performance, consumption, memory allocation and semaphores.
voila, here is your open GPU.
this was the hope that inspired Larrabee. they created an absolutely fantastic "Parallel Compute Engine". unfortunately, the GPU-level performance was so bad that the team was *not allowed* to publish the numbers

Jeff Bush from Nyuzi had to research it, and i talked with him over the course of several months: we established that a software-only GPU - with no custom accelerated opcodes - would have only TWENTY-FIVE percent of the performance of, say, MALI 400, for the same silicon die area. that means that a comparably-performing software-only GPU would require FOUR times the power (and die area).

    obviously, that's not going to fly

    in speaking with Mitch Alsup on comp.arch i found out a little bit more about why this is. it turns out one of the reasons is that if you want a "fully accurate" IEEE754 FP unit, to get that extra 0.5 ULP (units in last place), you need THREE TIMES the silicon area.

    in a GPU you just don't care that much about accuracy, and that's why in the Vulkan Spec you are allowed a lot less accurate answers in SQRT, RSQRT, SIN, COS, LOG etc.

basically there are areas where you are trading speed for accuracy, and these tend to conflict badly with the "accuracy" requirements of traditional "Compute" Engines. we are kinda... loonies for even trying. however, if you look at the MIPS 3D ASE (you can still find it online), running instructions twice to get better accuracy is a known technique, and if we plan the ALUs in advance, we can "reprocess" intermediary results using microcoding, and serve *both* markets - GPU (less accurate, less time, less power), and IEEE754 (fully accurate, longer, more power).
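the "run it again for more accuracy" idea shows up in software too, with the classic bit-trick reciprocal square root: a cheap low-accuracy estimate, then optional Newton-Raphson passes that each roughly double the number of good bits. a python sketch (the 0x5f3759df constant is the well-known one; the step counts and tolerances below are illustrative):

```python
import struct

def fast_rsqrt(x, newton_steps=0):
    # reinterpret the float's bits as an integer to get a cheap
    # estimate of 1/sqrt(x) (the famous "magic constant" trick)
    i = struct.unpack('<I', struct.pack('<f', x))[0]
    i = 0x5f3759df - (i >> 1)
    y = struct.unpack('<f', struct.pack('<I', i))[0]
    for _ in range(newton_steps):
        # each Newton-Raphson pass roughly doubles the accurate bits
        y = y * (1.5 - 0.5 * x * y * y)
    return y
```

the raw estimate is only good to a few percent - plenty for a Vulkan-style RSQRT opcode - while one or two extra passes approach the accuracy an IEEE754 "Compute" user expects, which is exactly the GPU-vs-accurate split described above.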



  • log0
    replied
    Originally posted by Qaridarium
if your "many tiny cores" idea were right, then the core count would have gone up over the last 4 years.
    Nonsense

    More cores/processing units means more transistors. With increasing number of cores you also need more logic to feed them efficiently. All this means bigger chips, lower yields, higher costs.

    In the end it is a design parameter. If you can push frequency you'll rather go with less cores, as it is simply cheaper.

Btw a GPU "core" is typically just an FP/INT unit, very much like the INT and FP units in a CPU. So calling it a "tiny core" is not wrong.
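the "bigger chips, lower yields" point follows directly from the classic Poisson die-yield model (the defect density used here is an assumed illustrative value, not a real process figure):

```python
import math

def poisson_yield(die_area_mm2, defects_per_mm2=0.001):
    # Poisson model: a die is good only if it has zero fatal defects,
    # so yield falls exponentially with die area
    return math.exp(-defects_per_mm2 * die_area_mm2)

small = poisson_yield(100)  # roughly 90% of 100mm^2 dies are good
big   = poisson_yield(400)  # quadrupling the area drops yield to ~67%
```

since cost per good die scales as wafer cost divided by yield, adding cores (and the interconnect to feed them) gets disproportionately expensive as the die grows.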



  • starshipeleven
    replied
    Originally posted by Qaridarium
this is just wrong. it means "many specific cores", which means ASIC transistor logic for every Vulkan command

many small cores, if it is not specific ASIC transistor logic, will not do the job.
    Wrong, no GPU dedicates transistor logic to each Vulkan command.
    Vulkan is not as simple as a media codec (x264 for example) where you can make an ASIC for that specific algorithm.

    GPUs are using cores that are more general-purpose than that, while not as general-purpose as CPU cores.



  • nokipaike
    replied
    Originally posted by Qaridarium

I really have no time to talk with people who claim OpenGL is SANE and should be implemented on an Open-Source GPU without going over an OpenGL-over-Vulkan layer for legacy purposes only. read again: legacy purposes only, and read again: only OpenGL-over-Vulkan, not OpenGL on the hardware directly.
In my first comment, if you notice, I talked about the OpenGL/Vulkan APIs; since it was a comment on a theoretical architecture, I considered the question of which APIs less relevant. I wanted to express some basic concepts in a few words... but you started disputing details that are futile for the purpose of the comment, just to show off your knowledge of technical syllogisms...
and you don't want to understand...
I'm not dueling...



  • starshipeleven
    replied
    Originally posted by Qaridarium
    yes but many tiny cores is not what makes them efficient.
    Yes it is. They are efficient because they are tiny.

    if you have 4096 Power9 cores
    Power9 cores are not tiny. They are big because they are general-purpose.

    "Power" is an architecture name, and "Power9" is an implementation name.

    It is like "x86_64" and "Ivy Bridge".

    "hardware ASIC chips who have the Transistor ASIC to run GPU tasks efficient with low power consumption."
    It means "many small cores" in practice.



  • nokipaike
    replied
    Originally posted by Qaridarium

"how many API calls does opengl have? 300-400?"

OpenGL is an obsolete API, and OpenGL was biased from the start by its first member, Microsoft; later Nvidia sabotaged the OpenGL standard. this means no sane person would ever touch OpenGL as an API for an Open-Source GPU...

you go with Vulkan as a Nvidia-free API, or you can swallow toxin pills and die.

"300-400 micro-cores"

if you do not have the ASIC logic transistors for the Vulkan micro-operations then you are doomed, even with 400 cores, and it will burn much energy.

this means: you first need the "ASIC logic transistors", and only after that can you go multi-core
Sorry, but you keep pushing back with futile objections.
The standards exist, they are open source, and they are not obsolete. All that is needed is to build efficient machines around them.

I gave you a conceptual example of theoretical functioning.

...and you continue to ignore the deep-learning core...



  • nokipaike
    replied
    Originally posted by Qaridarium

"multicore - parallel computing" does not do what a GPU wants. you can run the Gallium LLVMpipe driver (Mesa3D) on a 24-core Power9 CPU, but the result is high power consumption and low FPS per workload.
actually I imagined something quite different.
how many API calls does OpenGL have? 300-400?
here, I imagined 300-400 identical micro-cores, each running at a few MHz (once you build a GPU from scratch, it is much easier to design one efficient core and clone it for each API call), and obviously you design every core so that it powers up only when needed... then the optimizations and tricks would come from a dedicated deep-learning core and algorithms, teaching the cores to reach maximum performance at the minimum energy required.

if you think about it, it would consume very little. it would be easy to design and build. the power would be all in parallelism.
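the power-gating half of this idea can at least be put into a toy model (every per-core power number below is invented purely for illustration - no real silicon figures are implied):

```python
# toy model of "one tiny core per API call, power-gated when idle":
# only the cores a frame actually touches draw dynamic power;
# the rest contribute only leakage.
from dataclasses import dataclass

@dataclass
class MicroCore:
    name: str
    active_mw: float = 5.0   # assumed dynamic power while servicing calls
    leak_mw: float = 0.01    # assumed leakage while power-gated

def frame_power(cores, calls_used):
    used = set(calls_used)
    return sum(c.active_mw if c.name in used else c.leak_mw
               for c in cores)

# hypothetical names: 400 cores, a frame touching only 20 of them
cores = [MicroCore(f"call{i}") for i in range(400)]
p = frame_power(cores, [f"call{i}" for i in range(20)])
```

the model also shows the counter-argument made elsewhere in the thread: all 400 cores still cost die area (and hence yield) whether they are powered or not, which is why real GPUs use fewer, more general units instead of one per API call.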

