CPUs From 2004 Against AMD's New 64-Core Threadripper 3990X + Tests Against FX-9590
Originally posted by Raka555 View Post
With int64_t:
r7-3700x:
real 0m8.138s vs 0m8.138s
i7-3770:
real 0m13.938s vs 0m4.083s
i7-4600u:
real 0m15.719s vs 0m4.883s
But that aside, the test program shows that even for a plain "int", current compilers emit a 32-bit division; so maybe this case is common enough that it's worth optimizing the hardware for.
Last edited by mlau; 10 February 2020, 03:17 PM.
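For anyone who wants to verify that on their own toolchain, here is a minimal sketch, assuming a typical GNU setup; the file and function names are mine, not from the thread. Compile it and look at the disassembly: the plain-int version should show a 32-bit idiv/div, the int64_t version a 64-bit one.

/* divwidth.c - check which division width the compiler emits.
 * Inspect with e.g.:  gcc -O3 -c divwidth.c && objdump -d divwidth.o
 */
#include <stdint.h>

/* Plain int: on x86-64 this normally compiles to a 32-bit idiv. */
int div_int(int n, int d)
{
    return n / d;
}

/* int64_t: the same expression now requires a 64-bit idiv, which is
 * noticeably slower on the older Intel cores tested in this thread. */
int64_t div_i64(int64_t n, int64_t d)
{
    return n / d;
}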
- Likes 2
Comment
-
Originally posted by boitano View Post
Maybe I'm wrong, but it seems to me AMD put a 64-core workstation CPU on the market more because they can than because there's a sizeable market for it. Feels more like a trollish move against Intel.
It's true there isn't a ton of software yet that can take advantage of that many threads at once. But those who have that kind of workload, already own the required software, and have pockets deep enough to buy these flagship chips. Think MATLAB, Maya, 4K video editing, or finite element analysis (structural stress, etc). Outside of these specialized segments, threading has always been a chicken or egg scenario. The hardware doesn't exist because there's no software to take advantage of it. The software doesn't exist because why put in the effort when there's no hardware to run it on.
AMD is doing something bold here, and they're walking the walk when it comes to delivering massive performance in a single socket. This is the future. IPC has not increased dramatically over the years, as the benchmarks clearly show. If it were up to Intel, we'd all still be running 4-core chips based on Sandy Bridge and 14nm++++. Or maybe even 32-bit chips, because Itanium.
Last edited by torsionbar28; 10 February 2020, 04:56 PM.
- Likes 3
Comment
-
Originally posted by Raka555 View Post
With int64_t:
r7-3700x:
real 0m8.138s vs 0m8.138s
i7-3770:
real 0m13.938s vs 0m4.083s
i7-4600u:
real 0m15.719s vs 0m4.883s
This is a nice find. So 64-bit integers are hurting Intel big time, while they do very well with 32-bit integers. No wonder they were pushing x32 so hard.
I wonder what happened to the x32 efforts...
Something similar seems to be happening on the RPis as well. "lilunxm12" reported the following on their Pi 3B+ running 64-bit Ubuntu 20.04:
real 0m18.248s
user 0m18.193s
sys 0m0.005s
I am getting the following on my RPi 3B+ with 32-bit Raspbian 10:
real 1m26.870s
user 1m26.827s
sys 0m0.022s
That is a huge difference. But my RPi might be throttling, as I don't have active cooling on it (a quick way to check the temperature is sketched below).
real 0m33.329s
user 0m33.280s
sys 0m0.009s
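If throttling is the suspect, the SoC temperature can be read from sysfs before and during the run. A minimal sketch, assuming the usual Raspberry Pi OS thermal zone path (adjust if your board exposes it elsewhere):

/* cputemp.c - print the SoC temperature (illustrative only).
 * /sys/class/thermal/thermal_zone0/temp reports millidegrees Celsius
 * on a typical Raspberry Pi kernel.
 */
#include <stdio.h>

int main(void)
{
    FILE *f = fopen("/sys/class/thermal/thermal_zone0/temp", "r");
    long millideg = 0;

    if (!f) {
        perror("thermal_zone0");
        return 1;
    }
    if (fscanf(f, "%ld", &millideg) != 1) {
        fclose(f);
        fprintf(stderr, "could not parse temperature\n");
        return 1;
    }
    fclose(f);
    printf("%.1f C\n", millideg / 1000.0);
    return 0;
}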
Comment
-
Originally posted by torsionbar28 View Post
$10,000 Xeon Platinum 8280
Originally posted by torsionbar28 View Post
threading has always been a chicken or egg scenario.
- Likes 3
Comment
-
Originally posted by Raka555 View Post
If you mean what I get when timing it:
r7-3700x:
real 0m8.138s
i7-3770:
real 0m4.083s
i7-4600u:
real 0m4.883s
Snapdragon 820:
$clang --version
clang version 8.0.1 (tags/RELEASE_801/final)
Target: aarch64-unknown-linux-android
$ clang prime.c -o prime -O3
$ time ./prime
664580
real 0m2.402s
user 0m2.380s
sys 0m0.000s
Update:
Int64 version
$ clang prime64.c -O3 -o prime64
$ time ./prime64
664580
real 0m2.489s
user 0m2.470s
sys 0m0.000s
So either clang generates significantly more optimal code compared to gcc, or it's time to switch to arm64. But I'd rather say this benchmark is not that representative of real-world CPU performance.
Last edited by klokik; 10 February 2020, 06:33 PM.
- Likes 4
Comment
-
Originally posted by TemplarGR View Post
Bulldozer was a great architecture and was a step towards Fusion. AMD's grand plan was to eliminate FPU and SIMD from the cpu cores completely, eventually, and move those calculations to the iGPU. This makes a metric ton of sense, since cpu cores only rarely calculate floating point math. And those calculations are better suited for gpgpu, which is only hindered these days by pcie latency. AMD Fusion was the best idea for cpus in 2 decades. But AMD didn't have the software and marketing grunt to push for such a change, and Intel, realising they would lose if AMD went that road, doubled up on AVX and their floating point calculations, especially per thread.
The first bet worked: Bulldozer cores were smaller than K10 cores had been. The second bet succeeded marginally with Piledriver -- AMD was definitely able to get higher clocks out of the uarch compared to K10. But the third bet -- the one about IPC? That failed. Instead of being confined to a small number of scenarios, the "corner cases" of Bulldozer dominated its performance. L1 cache contention between the two CPU cores caused lower-than-expected scaling in multi-threaded workloads, which exacerbated the problem. Kaveri and Carrizo would later reduce this penalty by increasing the size of the L1 caches and making other improvements to the chip, but those changes came much later.
The original BD design was downright bad, which is why the CPU's launch was delayed in 2011. AMD had good yield on Bulldozer (a CPU they knew wouldn't be competitive) and poor yield on Llano, the APU it could have actually sold. AMD delayed Bulldozer and forced GF to eat a mountain of losses by only paying them for good Llano die. GF paid them back the following year by forcing AMD to give up its interest in the foundry in exchange for being able to manufacture parts at TSMC.
The idea you have -- namely that you can just replace your FPU with a GPU -- never would have worked. For starters, there's non-zero latency involved in spinning a workload off to the GPU. That workload has to be set up and initialized by the CPU, and GPUs are high-latency devices. Yes, AMD made some noise about this idea at the very beginning of the Fusion PR rollout, but there's a reason they never pursued it. GPU acceleration is about using the GPU for workloads and in areas where it makes sense to do so, not in areas where it does not. In many cases, the latency hit for setting a problem up on the GPU is larger than the increased performance the video card can bring to bear on the problem.
Originally posted by TemplarGR View Post
These days on 7nm, cpu cores, even with all those SIMD parts, are TINY. It would have made a lot more sense to have even tinier cpu cores by removing the floating point units (which cost a LOT of silicon), adding tons of cache and a beefy igpu, and moving those calculations there. It would have performed far better. It would allow the cpu cores to stop bothering with things they are not at their best, and let the igpu do what it is best suited for... But this failed to evolve because idiots thought Bulldozer was a failure just because video games still relied on single and dual cores and, as we all know, gaming is the most important thing in computing.... Even today Intel sells a ton of cpus because it has slightly better per-core performance and this matters to gaming. People are cretins. Now all AMD is doing is copying Intel's design but selling it at a far lower profit margin.... Yay.
I was at AMD's 2013 HSA event, where Sun executives pledged that Java 8 would be fully HSA-aware and capable. It was not. Bulldozer was a bad architecture. According to Jim Keller (who told me this in person), when he was hired, AMD had to make a choice between fixing BD or building Zen. The effort to do one was judged to be as difficult as the other. He decided to go for Zen, and the team backed him.
Bulldozer didn't get put down because it sucked in gaming. Bulldozer got put down because it was a poor design, period, and AMD felt it wasn't worth the effort of fixing. The decision to build Zen was made about a month after Keller was hired, and they never looked back. Nor should they have. Best decision AMD ever made.
- Likes 3
Comment
-
Originally posted by klokik View Post
So either clang generates significantly more optimal code compared to gcc, or it's time to switch to arm64. But I'd rather say this benchmark is not that representative of real-world CPU performance.
This was originally not a "benchmark". I actually needed to produce prime numbers, so it stems from a real-world application.
(And I know this is not the optimized version.)
Last edited by Raka555; 11 February 2020, 01:38 AM.
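Since the original prime.c isn't posted in the thread, here is a minimal sketch of a trial-division counter along the same lines; the file name, the LIMIT value and the loop bounds are my guesses, not Raka555's code. The point is that switching num_t between int and int64_t is enough to flip the compiler between 32-bit and 64-bit division in the inner loop, which is the effect being measured above.

/* prime_sketch.c - illustrative trial-division prime counter.
 * Build: gcc -O3 prime_sketch.c -o prime_sketch
 */
#include <stdio.h>
#include <stdint.h>

#define LIMIT 10000000          /* hypothetical upper bound */

typedef int num_t;              /* switch to int64_t for the 64-bit variant */

int main(void)
{
    long long count = 0;

    for (num_t n = 2; n <= LIMIT; n++) {
        int is_prime = 1;
        /* the modulo below is the division the thread is timing */
        for (num_t d = 2; d * d <= n; d++) {
            if (n % d == 0) {
                is_prime = 0;
                break;
            }
        }
        count += is_prime;
    }
    printf("%lld\n", count);
    return 0;
}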
Comment
-
Originally posted by lilunxm12 View Post
You just shouldn't do that on a 32-bit system, as a single register can't hold the operand
I would rather put my money on "modern compilers" not generating optimal code for 32-bit ARM.
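That matches how the two targets actually handle it: 32-bit ARM has no 64-bit divide instruction, so the compiler has to call a runtime helper, while AArch64 divides 64-bit values natively. A tiny illustration (the function name is mine; compile with -O2 -S for each target and compare):

/* div64_demo.c - illustrative only.
 * On AArch32 (armv7/armv8 in 32-bit mode), the 64-bit modulo below is
 * lowered to a call to the EABI runtime helper __aeabi_ldivmod, provided
 * by libgcc or compiler-rt.  On AArch64 it becomes a native sdiv plus an
 * msub to recover the remainder.
 */
#include <stdint.h>

int64_t mod64(int64_t n, int64_t d)
{
    return n % d;
}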
Comment