Originally posted by Weasel
An SSE cache-bypass store basically tells the caches not to keep a copy of the data from this operation and to write it through a straight-up disposable section of cache. You are still writing through the caches. Why must you read/write through a cache? Because the MMU can be busy when your instruction attempts to read/write.
With a memcpy that is not sparse you will not notice much of a problem, because it is a workload that hits the cache most of the time. A cache line is 64 bytes long on x86. Please note that some of the highest-compression algorithms are sparse problems; the 10 min mark shows why 8 and 16 bytes are optimal for sparse problems. When you are using a cache system that is not optimal for sparse problems and you put speculative execution on top, you have a path to hell.
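To put rough numbers on the sparse case, here is a minimal sketch. It assumes the worst case where every access lands on a different cache line and each access only wants a few bytes; `line_utilization` and `bytes_moved` are names made up for this illustration, not anything from a real library.

```python
def line_utilization(useful_bytes: int, line_bytes: int) -> float:
    """Fraction of one fetched cache line that a single sparse access uses.

    Assumption: the access fits inside one line and nothing else on that
    line is touched before it is evicted.
    """
    if useful_bytes <= 0 or line_bytes <= 0:
        raise ValueError("sizes must be positive")
    # An access can never use more than the line it pulled in.
    return min(useful_bytes, line_bytes) / line_bytes


def bytes_moved(accesses: int, line_bytes: int) -> int:
    """Bytes transferred when every access hits a different line."""
    return accesses * line_bytes


if __name__ == "__main__":
    for line in (8, 16, 64):
        used = line_utilization(8, line)
        moved = bytes_moved(1000, line)
        print(f"line={line:2d}B  utilization={used:.3f}  moved={moved} bytes")
```

Under those assumptions an 8-byte access uses one eighth of a 64-byte line, while an 8-byte line is fully used. A dense memcpy touches every byte of each line, so the same 64-byte line costs it nothing extra.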
Originally posted by Weasel
200 processing units is not that many either. When you have a pipeline 14 long, as on x86, you would think 200/14 is a lot, but it is really 200/27 once you allow for items with high latency, which is roughly 8 processing units being cleared per clock cycle. This is the other problem with increasing pipeline length: you need more and more processing units to clear the same number of processing units per clock cycle. The A73 clears exactly the same number of processing units per clock cycle, but it only has an 11 long pipeline, so 21x8=168 processing units (which kind of explains the lower power usage, right?).
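The arithmetic above can be checked with a small sketch. The effective depths 27 and 21 are taken straight from the figures in this post (pipeline length plus an allowance for high-latency items); the function names are only for illustration.

```python
def cleared_per_cycle(units_in_flight: int, effective_depth: int) -> float:
    """Processing units cleared per clock, given how many cycles each one occupies."""
    return units_in_flight / effective_depth


def units_needed(per_cycle: int, effective_depth: int) -> int:
    """Processing units required to sustain a given clearing rate."""
    return per_cycle * effective_depth


if __name__ == "__main__":
    # x86 figures from the post: 200 units, 14 stages, 27 effective.
    print(round(cleared_per_cycle(200, 14), 2))  # naive estimate
    print(round(cleared_per_cycle(200, 27), 2))  # with high-latency allowance
    # A73 figures from the post: 8 per cycle at an effective depth of 21.
    print(units_needed(8, 21))
```

So 200/27 comes out at about 7.4, which is the "roughly 8" figure, and holding 8 per cycle at an effective depth of 21 needs 168 units.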
The longer your pipeline gets, the more problems you have. The first problem comes when you cross a length of 8 and start finding your branch processing too slow, so at that point you need either the Cray barrel method for accelerating single-threaded code or speculative execution.
The Cray barrel optimisation for single-threaded execution (out-of-order execution with read-ahead in a barrel CPU design) takes a 1-thread workload and splits it: the instructions needed for branch calculation go into one thread with high priority and the general processing goes into another thread. So the result is that there is no such thing as a single-threaded workflow. The Cray method starts failing you when you get to a pipeline length of 10-11, and it was also patented until 2008. This method does not suffer from https://en.wikipedia.org/wiki/Pipeline_stall, yes, the nasty pipeline stall. Of course, to use this method you need a barrel thread management engine in your CPU so that you can break a single thread into 2 or more threads as your out-of-order solution, particularly so that you can process the branch path through the code quickly and know what instructions to feed into the general thread/threads. Of course, the barrel thread management in the CPU would know when it only has 1 thread to process and should attempt thread splitting, and when it should just cycle between threads. Because the Cray stuff was patented and not usable until 2008, it is still not in most textbooks.
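Why a barrel design avoids pipeline stalls can be shown with a toy model. This is only a sketch under loud assumptions: strict round-robin issue, one issue slot per cycle, and every instruction depending on the previous instruction of its own thread. It does not model the thread-splitting step described above, only the basic barrel property, and `barrel_stalls` is a made-up name.

```python
def barrel_stalls(num_threads: int, pipeline_depth: int, instrs_per_thread: int) -> int:
    """Count stall cycles in a toy round-robin barrel core.

    Assumption: each instruction needs the result of the previous
    instruction from the same thread, which is available
    `pipeline_depth` cycles after that instruction issued.
    """
    ready_at = [0] * num_threads            # earliest cycle each thread may issue
    remaining = [instrs_per_thread] * num_threads
    cycle = 0
    stalls = 0
    thread = 0
    while any(remaining):
        if remaining[thread] > 0:
            if ready_at[thread] > cycle:
                # Previous result not ready yet: the core waits.
                stalls += ready_at[thread] - cycle
                cycle = ready_at[thread]
            ready_at[thread] = cycle + pipeline_depth
            remaining[thread] -= 1
            cycle += 1
        thread = (thread + 1) % num_threads   # strict round-robin
    return stalls


if __name__ == "__main__":
    for threads in (1, 2, 4):
        print(threads, "thread(s):", barrel_stalls(threads, 4, 3), "stall cycles")
```

In this model a depth-4 pipeline with only 1 thread wastes cycles on every dependent instruction, while 4 threads keep it full with no stalls at all. That is the reason for wanting to turn one thread into several.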
Everything fails when you cross a pipeline length of 14. It becomes power inefficient, and speculative execution starts failing you at a pipeline length of 15+.
Weasel, I guess it never crossed your mind that it is possible to design a CPU in which, internally, there is no such thing as a single-threaded workflow, yet which processes single-threaded workflows insanely well. This is the problem with being tunnel-visioned on single-thread performance: it means you don't look at the Cray barrel, which is a pure multi-threaded CPU core design, or at the optimisations that can be put on top of it to convert single-threaded work to multi-threaded. If you have great multi-thread performance, it is possible to use methods that make single threads be processed by multiple threads and so get good single-threaded performance as well. Now, if you have poor multi-thread performance, there is no optimisation you can magically do to get good multi-thread performance. Worse is if you have poor single-threaded performance as well; then you are totally screwed, and totally screwed is the x86 chip.
To design a true all-rounder CPU you will be looking back at the Cray barrel system and the Cray patents for optimising single-threaded workflows on a barrel-designed CPU, and you will have to keep your pipeline length under 10 and as close to 4 as you can get. This kind of CPU looks completely different from what our current general CPUs look like, but it would be a true general-purpose CPU that does multi-threaded as well as single-threaded.