AMD Certifies PRO W7800 & RX 7900 GRE For ROCm, Officially Adds ONNX Runtime


  • #21
    Originally posted by qarium View Post
    I am pretty sure the lowest integer length in use in AI workloads is INT8, not INT4... I do not know of any inference implementation using INT4...

    So it is nonsense to demand INT4 acceleration... what is in use are minifloats like FP4, FP6 and FP8, but of course my Vega64 only supports FP16 in hardware, nothing smaller. Also keep in mind that INT8 is not used on GPUs; INT8 is only used to accelerate inference on CPUs.
    https://github.com/ggerganov/llama.cpp uses 4-bit quantisation. I looked at the source code, and when running on the GPU it still seems to use 8-bit. But as far as I understand it, a 4-bit GPU implementation should be feasible.

    And for CPU inference it definitely uses 4-bit.

    Comment


    • #22
      Originally posted by seesturm View Post
      https://github.com/ggerganov/llama.cpp uses 4-bit quantisation. I looked at the source code, and when running on the GPU it still seems to use 8-bit. But as far as I understand it, a 4-bit GPU implementation should be feasible.
      And for CPU inference it definitely uses 4-bit.
      Yes, you can use 4-bit, but what you can do with it is very limited. And as you say, most hardware does not have 4-bit units, and that has been the case for 40+ years.
      In the past 4/8/16-bit were abandoned, and then in the last 6-7 years the AI stuff came up... an AMD Radeon RX 580 was fully 32-bit;
      they added 16-bit shaders with Vega...
      The last deep learning book I read only talked about 8-bit for inference... 4-bit is really not widely in use.
      However, like DLSS, you can use it for work that absolutely does not need precision. Some of DLSS's calculations are only 4-bit...

      I do not know what llama.cpp does in 4-bit... but I also do not know of any hardware that supports it, which means even modern CPUs do not have independent 4-bit calculation units.

      Comment
