Benchmarking AMD FX vs. Intel Sandy/Ivy Bridge CPUs Following Spectre, Meltdown, L1TF, Zombieload

  • skeevy420
    replied
    Originally posted by atomsymbol

    It is probable you will regret your behavior 20 years from now.

    Monkeys cannot write Shakespeare's work in finite time. https://en.wikipedia.org/wiki/Infinite_monkey_theorem

    You have free will to choose whether to lean toward the monkeys or toward Shakespeare.
    If you had any idea how obscene Shakespeare actually is, you probably wouldn't use that as your example. The puns and wordplay used back then don't come off the same way these days...but if you do know what to look for, Shakespeare makes for some good and funny reading.

    Honestly, I can tell you right now I'm not going to regret saying "IO fucked" or calling a certain "testing mouse" "retarded".

    Are you passing -pipe to the compiler?
    Yes I am. Is it detrimental in a ramdisk context? Note that I have a 24GB ramdisk (systemd default) and 24GB of system memory available while it's compiling (48GB total).
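
    For reference, a minimal sketch of what passing -pipe looks like in a GCC/Clang, make-driven build (the flags are only illustrative); -pipe tells the compiler to pass intermediate output between compilation stages through pipes instead of temporary files, so those intermediates never land in /tmp or the ramdisk in the first place:

    # illustrative only: a make-driven build with -pipe in the flags
    export CFLAGS="-O2 -pipe"
    export CXXFLAGS="${CFLAGS}"
    make -j"$(nproc)"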


  • skeevy420
    replied
    Originally posted by atomsymbol

    I am not entirely convinced a ramdisk makes sense on a spinning hard disk for compilation tasks if the machine has a lot of RAM. A lot of RAM means that almost all intermediate data generated during the task will be served from the Linux kernel disk cache rather than be re-read from the disk. Writes to the spinning disk happen asynchronously, so they don't interrupt the task as long as the writes do not exceed the disk write bandwidth. The initial (cold) read from the disk is there in both cases; a ramdisk does not reduce the amount of initial cold reads from the spinning disk. Compared to a cache, a ramdisk prevents automatic eviction of data from RAM: the user has explicit control over which data is in RAM and which data is on disk.

    In summary, ramdisk is only useful when:
    • Task data writes exceed the disk write bandwidth (which is about 100 MB/s in case of a spinning disk)
      • The task write bandwidth requirement can be lowered by data compression (gzip/xz/zstd on individual files, compressed debug sections, ccache compression, btrfs filesystem compression, jpeg instead of png, ...)
    • The task's data access pattern does not match the Linux kernel's disk cache eviction policy
    Multitask during a Wine or kernel compile on spinners without using a ramdisk and get back to me.

    I suppose if all one is doing is just compiling something, sure, it's probably not as necessary. The second you want to watch Game of Thrones with SMPlayer or comment on Phoronix with Firefox, you'll throw your computer out the window.

    EDIT: And that's with keeping my sources on one disk, my OS on another, and my media on yet another disk. It's very easy to IO fuck yourself with spinners.
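
    For anyone who wants to try the ramdisk approach for builds, a minimal sketch of a dedicated tmpfs scratch area (the size and mount point are just examples; it needs root, and everything in it is lost on unmount or reboot):

    sudo mkdir -p /mnt/build
    sudo mount -t tmpfs -o size=24G,mode=1777 tmpfs /mnt/build
    # ... run the build inside /mnt/build, copy anything worth keeping back to disk ...
    sudo umount /mnt/build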


  • numacross
    replied
    Originally posted by Wielkie G View Post
    How are they different? The SMT is symmetric - each hardware thread is equal to the other one.

    Try your favorite workload on core 0 (by setting core affinity) and then core 1. See that there is no difference between these two results.

    Now try to run two instances of the workload, one on core 0 and the other on core 1. You will see that each workload is slower, but the aggregate throughput might be higher. For example, if each core throughput is 60% the original, then the aggregate is 120% and the SMT performance uplift is +20%.

    For example (Windows, as I don't have access to Linux right now), the 7-Zip compression benchmark on my machine (i7 3770K) shows 4500-4600 MIPS on core 0 and on core 1 when only one core is being used. When I run two instances (one on core 0 and the other on core 1) they show 3000-3100 MIPS each - that's 6000-6200 MIPS aggregate and a 30-40% uplift.
    7-Zip is a well-behaved integer load that scales great with HT. The decompression benchmark would produce even greater numbers.

    Trying it with ffmpeg version 4.1.1, which abuses every SIMD extension including AVX2, on my 4790K (Turbo disabled, constant 4.3GHz):
    • Single instance limited to 1 thread (both decoding and encoding) in a cmd.exe started with /affinity 1 - core 0
    ffmpeg -threads 1 -i test.mp4 -benchmark -preset slow -crf 22 -c:a copy -threads 1 test_out.mkv
    bench: rtime=150.166s
    • Single instance limited to 2 threads in a cmd.exe started with /affinity 3 - cores 0 and 1
    ffmpeg -threads 2 -i test.mp4 -benchmark -preset slow -crf 22 -c:a copy -threads 2 test_out2.mkv
    bench: rtime=130.316s
    • Single instance limited to 1 thread in a cmd.exe started with /affinity 1 - core 0, with a prime95 FMA3 load running on core 1 at the same time
    bench: rtime=257.803s

    As you can see, even an AVX2 load that is well behaved (-threads 2) actually got sped up a little by HT. However, when the physical core is already loaded with AVX2, both virtual threads have to compete for shared resources. To be fair, the aggregate throughput is still a bit better than the -threads 1 case.

    I'm not arguing that HT doesn't work - in most cases it does, but it has quirks and disadvantages with some loads. It can also complicate matters if you rely on execution latency or are running threads with long chains of computation dependencies.

    Another potential trap is the cost of keeping all 8 logical cores synchronized vs. the cost of doing it for only 4 physical cores. It's not unheard of for games to perform better with HT disabled.
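
    On Linux (4.19 and newer) that comparison is easy to test without touching the BIOS, since SMT can be toggled at runtime; a rough sketch (needs root, and the change does not persist across reboots):

    cat /sys/devices/system/cpu/smt/control                    # reports on/off/forceoff/notsupported
    echo off | sudo tee /sys/devices/system/cpu/smt/control    # offline the sibling threads
    echo on  | sudo tee /sys/devices/system/cpu/smt/control    # bring them back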


  • skeevy420
    replied
    Originally posted by atomsymbol

    I am not sure I understand the advantages of ramdisks. Assuming the machine already has an SSD/NVMe disk mounted to /, is there a measurable performance advantage to using a ramdisk for /tmp?
    You assume too much about my system.

    But, yeah, if you compile a lot of source it's a damn great speed-up for IO on spinners, and it reduces drive wear regardless of the underlying disk. You can also move games over to them and load up assets just a hair faster...I've been known to set mine as large as 40GB to play modded Skyrim.

    While it has never happened to me, I'd rather my ramdisk fill up with a buggy program's log spam than my hard drive.
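
    If you're not sure whether your distro already puts /tmp on a ramdisk, a quick check:

    findmnt /tmp    # shows whether /tmp is a tmpfs or just part of the root filesystem
    df -h /tmp      # shows how large that tmpfs is allowed to grow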


  • Wielkie G
    replied
    Originally posted by numacross View Post

    But they are different, and every operating system knows that there is a difference between core 0 and core 1 in a single-core HT CPU. This of course scales to multi-core ones.
    How are they different? The SMT is symmetric - each hardware thread is equal to the other one.

    Try your favorite workload on core 0 (by setting core affinity) and then core 1. See that there is no difference between these two results.

    Now try to run two instances of the workload, one on core 0 and the other on core 1. You will see that each workload is slower, but the aggregate throughput might be higher. For example, if each core throughput is 60% the original, then the aggregate is 120% and the SMT performance uplift is +20%.

    For example (Windows, as I don't have access to Linux right now), the 7-Zip compression benchmark on my machine (i7 3770K) shows 4500-4600 MIPS on core 0 and on core 1 when only one core is being used. When I run two instances (one on core 0 and the other on core 1) they show 3000-3100 MIPS each - that's 6000-6200 MIPS aggregate and a 30-40% uplift.
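
    Roughly the same experiment on Linux, as a sketch (it assumes p7zip is installed and that logical CPUs 0 and 1 are SMT siblings of the same physical core; check /sys/devices/system/cpu/cpu0/topology/thread_siblings_list first, since the numbering differs between systems):

    taskset -c 0 7z b -mmt1                                      # single-threaded benchmark pinned to CPU 0
    taskset -c 0 7z b -mmt1 & taskset -c 1 7z b -mmt1 & wait     # both SMT siblings loaded at once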


  • skeevy420
    replied
    Originally posted by debianxfce View Post

    Of course you need endless amounts of RAM when you run Windows, GNOME 3 or KDE in your VMs. An average Xfce desktop PC user needs 4GB RAM, but you cannot buy 2x2GB memory sticks anymore, and 2x4GB is getting rare too. 2x8GB is becoming mainstream.
    Which means you'd need 16GB of RAM to cover 4 Debian XFCE VMs and 8 more for the host system, before you even count /tmp....so 24GB of RAM is where the average XFCE desktop PC user who runs VMs would want to start.


  • skeevy420
    replied
    Originally posted by atomsymbol

    In my opinion, with year 2019 common CPU thread counts (8-16) and assuming 2 GiB of memory per thread in parallel tasks using all CPU threads, 16-32 GiB of memory is slowly becoming the norm for a desktop/workstation. 8-16 GiB is on the border of being a limiting factor to full utilization of the CPU.

    Taking a look at https://www.ec2instances.info, most of the EC2 instances have at least 2 GiB of memory per vCPU. The Nano instances have less memory per vCPU (the minimum being 256 MiB per vCPU), which is enough to run certain types of applications, but this does not negate the fact that the optimum for a year 2019 desktop/workstation is at least 2 GiB of memory per CPU thread.
    That's sort of how I factored RAM for my current system: 8 cores * 2 for HT = 16 threads, * 2GB = 32GB. So I figured 32GB was a decent starting point and ended up getting 48GB because it was $10 more. Why not? What I didn't account for was systemd using half of that for /tmp by default, so it really comes out to 48GB / 2 (systemd /tmp) = 24GB, and 24GB / 16 threads = 1.5GB per thread. Just means I need to pick up another 24GB of RAM to get that 2GB per thread (with a 36GB ramdisk as a bonus). I do all my compiles on my current 24GB ramdisk except for Firefox with PGO...24GB ain't enough for that (seriously), so I could actually make use of 72GB of RAM.

    Keep my large numbers in mind if you compile your own software or plan on doing it. We need assloads of RAM for some of these compiler and optimization processes.
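
    If systemd's default of half the RAM for /tmp doesn't suit you, the size can be pinned explicitly; a minimal sketch via /etc/fstab (the 8G figure is only an example):

    # /etc/fstab: override the default tmpfs size for /tmp
    tmpfs  /tmp  tmpfs  size=8G,mode=1777,nosuid,nodev  0  0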


  • Veerappan
    replied
    Originally posted by debianxfce View Post
    4-8GB RAM is enough to disable swapping.
    That's funny... I have a minimum of 12GB of memory constantly committed to non-cache things all day at work. This goes way up when I spin up testing VMs (often 2 or 3 at a time).

    My work laptop is maxed out at 16GB (Thinkpad t440p) and even with zswap enabled, I often go a few GB into swap when I have to test certain workflows.

    I'm looking forward to my next laptop refresh (this fall), so I can finally jump to 32GB RAM.
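
    For what it's worth, zswap is enabled purely through kernel command-line parameters; a minimal sketch (the compressor and pool percentage are example values, not recommendations):

    # appended to GRUB_CMDLINE_LINUX_DEFAULT in /etc/default/grub, then regenerate grub.cfg
    zswap.enabled=1 zswap.compressor=lz4 zswap.max_pool_percent=20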


  • skeevy420
    replied
    Originally posted by ermo View Post
    I tend to keep my hardware around for a while, and years back I initially bought a cheap FX-8350 instead of a 3770K for packaging purposes (it's even outfitted with a surprisingly cheap 32GB DDR3-2400 RAM kit), since it cost half the money (both CPU and motherboard) for the same amount of performance in that specific task back when I got it.

    With the newest vulnerabilities coming to light, I'm not exactly regretting that decision. And with DDR4 RAM prices only dropping recently, moving to a newer platform hasn't really been on the table cost-benefit-wise in the past.

    Thanks for the benchmarks Michael!
    That's how I feel about the Westmeres I picked up a few years ago when my Q6600 (with the FSB mod) didn't cut it, since they were both dirt cheap and I wanted a system that supported ECC for ZFS. I ended up with dual X5687s (8 cores @ 3.6GHz, 16 threads with SMT) in a Dell T5500 with 48GB of RAM (DDR3-1333 R-ECC) and 2x 480GB 7200RPM HDDs for $350. I'm cheap, so if I can get an entire workstation for the cost of a new CPU, hells yeah. Found an RX 580 4GB for only $140 earlier this year. For $490 total, it's a pretty decent setup for 1080p Linux gaming and compiling stuff here and there, especially once mitigations are factored in.


  • ermo
    replied
    I tend to keep my hardware around for a while, and years back I initially bought a cheap FX-8350 instead of a 3770K for packaging purposes (it's even outfitted with a surprisingly cheap 32GB DDR3-2400 RAM kit), since it cost half the money (both CPU and motherboard) for the same amount of performance in that specific task back when I got it.

    With the newest vulnerabilities coming to light, I'm not exactly regretting that decision. And with DDR4 RAM prices only dropping recently, moving to a newer platform hasn't really been on the table cost-benefit-wise in the past.

    Thanks for the benchmarks Michael!
