Originally posted by SilverBird775
There are so many things wrong in this that I cannot even comprehend it.
How many lines of assembly have you written in your life? And how many times have you looked at Godbolt-style output compiled with -march=native and -O2 or higher?
1st. AVX2 is already widely used in games, to the point that certain games raised their minimum requirements to include AVX support (Assassin's Creed Origins, for example). Ubisoft later released an alternative non-AVX patch so people with older CPUs could play, but AVX and AVX2 are in use, and if those two are used, AVX512 probably could be too, if not for the downclocking problems and poor market adoption. Even the Windows desktop uses AVX, and it is widely used in CAD software, Blender, codecs, etc. All of them will gladly take advantage of AVX and AVX2.
2nd. Not everything can be parallelized, and some things only benefit from parallelization up to a certain point. Read about Amdahl's law. It is not a big deal for scientific workloads, which can mostly be split up well, but it is a problem for games, for example. Also, the more cores you have, the higher the core-to-core latency. That is why a 9980XE at 5GHz is still slower in gaming than a 9900k at 5GHz: HEDT core-to-core latency is higher. Desktop computers are not servers/supercomputers where you can just "assign a thread to every single user". Games running at 144+ fps care a LOT about latency, and there more cores is not better. So in a lot of cases, if you can get the workload done faster on one thread, that is a lot better.
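To put rough numbers on the Amdahl's law point, here is a minimal sketch. The function name `amdahl_speedup` and the 60% parallel fraction are my own illustrative assumptions, not measurements from any real game. The formula itself is the standard one: speedup = 1 / ((1 - p) + p / n), where p is the fraction of the work that can run in parallel and n is the number of cores.

```python
def amdahl_speedup(parallel_fraction: float, cores: int) -> float:
    """Theoretical speedup from Amdahl's law: 1 / ((1 - p) + p / n)."""
    if not 0.0 <= parallel_fraction <= 1.0:
        raise ValueError("parallel_fraction must be between 0 and 1")
    if cores < 1:
        raise ValueError("cores must be at least 1")
    serial_fraction = 1.0 - parallel_fraction
    return 1.0 / (serial_fraction + parallel_fraction / cores)


if __name__ == "__main__":
    # Hypothetical workload: assume 60% of the frame time parallelizes.
    p = 0.6
    for n in (1, 2, 4, 8, 18):
        print(f"{n:2d} cores -> {amdahl_speedup(p, n):.2f}x")
```

With a serial part that large, the curve flattens out quickly, and the ceiling is 1 / (1 - p) no matter how many cores you add. Note this ignores core-to-core latency entirely, which only makes the many-core case look worse in practice.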
3rd. "No you cannot. It is a real problem. It is way, way too wide. There are common 3D/4D matrix operations which can be coded with AVX/AVX2, but that is already too luxurious for the price. Luxurious because of the data permutation toll you have to pay to prepare "chunks". It's like trying to move the whole Earth's baggage with a single train! Yes, it does indeed move quickly and looks epic. But loading and unloading this train is a grand problem."
?!?!?!? Horrible comparison, absolutely not true for desktop users, and btw AVX of all kinds already has instructions to load data in one go, and they don't take longer than 2 cycles. Actually it is not a big deal to unload the train, since data meant for AVX is usually nicely laid out in arrays, which means the data is not scattered all around memory, can be cached nicely, and so on. With multicore CPUs, oh boy: not only does each core make its own requests to RAM, but the cores ask for data all around the memory space and often request data from other cores. Also, from a programming perspective, you do lots of unnecessary copying and add more data such as locks and synchronization stuff just to avoid problems. For encoders or web browsers, the more cores you have, the higher the memory consumption. AVX stuff doesn't have such problems.
4th. "So you just stating that doubling the transistor budget for a whole CPU for something no one will ever use is not a problem but it is." Even the 9800X is not even close to double the size of the 9900k, despite having LOTS more PCI-e lanes, 2 more memory channels, 2 dummy cores, being an older generation, and having slightly less cache. Per core, 9980XE vs 9900k is just a 4mm^2 difference (26 vs 22), and that is totally normal considering all the other differences (like mesh vs ring). It is the same sort of BS people say about nvidia and its "ridiculously large tensor/raytracing cores". In reality they cover less than 0.3% of the die in the current RTX series.