Benchmarking AMD FX vs. Intel Sandy/Ivy Bridge CPUs Following Spectre, Meltdown, L1TF, Zombieload

  • #21
    Originally posted by numacross View Post
    There are no "hardware threads" because they share resources of the single physical core making them not equal.
    Sharing the resources of a single physical core doesn't make the hardware threads unequal. If you are comparing a situation where both hardware threads are running with one where only one is running, then you are right, but that's not the situation I was talking about. I just meant that the two hardware threads are exactly the same when they run under the same circumstances (whether alone on a physical core or alongside another hardware thread).

    Originally posted by numacross View Post
    Linux from 2.6 and Windows from XP have HT-aware schedulers that try to avoid placing 2 threads on both "virtual cores" of the same physical core. They instead prefer using one thread from 2 separate cores in order to minimize the cost of sharing resources.
    Correct, making the effect of HT on single-thread benchmarks (like perl interpreter startup) even less likely.

    Originally posted by numacross View Post
    If what you say is true then why bother with modifying scheduling because of HT?
    As written above, I didn't say that the hardware threads don't lose performance when they are not running alone on a core. They do. But when both threads run the same code, their performance should be mostly identical. When one of them runs code and the other doesn't, it shouldn't matter which one does - the performance should be the same either way.

    The term "virtual core" makes you believe that there are "good cores" and "virtual cores", which is totally incorrect. The original quote can be modified in the following way to make it less ambiguous:

    Originally posted by modified quote
    Perl can actually perform better if HT is disabled, to avoid the chance of it getting stuck running on a hardware thread that shares a physical core with another running thread.
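
    As a quick sanity check on Linux, you can see which logical CPUs are siblings on the same physical core; a minimal sketch, assuming a standard sysfs layout and util-linux:

    # logical CPUs sharing a physical core with CPU 0 (e.g. "0,4" on a 4-core/8-thread part)
    cat /sys/devices/system/cpu/cpu0/topology/thread_siblings_list
    # or the full CPU -> core -> socket mapping
    lscpu --extended=CPU,CORE,SOCKET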



    • #22
      Originally posted by Wielkie G View Post
      But when both threads run the same code, their performance should be mostly identical.
      I think this is the part I disagree with the most. If said code uses the shared units heavily (floating-point or SIMD, for example) then the threads will vary wildly in execution latency because of waiting periods.

      Originally posted by Wielkie G View Post
      When one of them runs code and the other doesn't, it shouldn't matter which one does - the performance should be the same either way.
      Yes, or if they are able to saturate the duplicated execution pipelines (integer loads for example).

      Originally posted by Wielkie G View Post
      The term "virtual core" makes you believe that there are "good cores" and "virtual cores", which is totally incorrect.
      But they are different, and every operating system knows that there is a difference between core 0 and core 1 on a single-core HT CPU. This of course scales to multi-core ones.

      Depending on the load and scheduling you can ignore this difference or it'll bite you hard.




      • #23
        I tend to keep my hardware around for a while, and I initially bought a cheap FX-8350 instead of a 3770k years back for packaging purposes (it's even outfitted with a surprisingly cheap 32GB DDR3-2400 RAM kit), since it cost half the money (both CPU and motherboard) for the same amount of performance in that specific task back when I got it.

        With the newest vulnerabilities coming to light, I'm not exactly regretting that decision. And with the DDR4 RAM prices only dropping recently, moving to a newer platform hasn't really been on the table cost-benefit wise in the past.

        Thanks for the benchmarks Michael!



        • #24
          Originally posted by ermo View Post
          I tend to keep my hardware around for a while, and I initially bought a cheap FX-8350 instead of a 3770k years back for packaging purposes (it's even outfitted with a surprisingly cheap 32GB DDR3-2400 RAM kit), since it cost half the money (both CPU and motherboard) for the same amount of performance in that specific task back when I got it.

          With the newest vulnerabilities coming to light, I'm not exactly regretting that decision. And with the DDR4 RAM prices only dropping recently, moving to a newer platform hasn't really been on the table cost-benefit wise in the past.

          Thanks for the benchmarks Michael!
          That's how I feel about my Westmeres I picked up a few years ago when my Q6600 (with the FSB mod) didn't cut it, since they were both dirt cheap and I wanted a system that supported ECC for ZFS. I ended up with dual X5687s (8 cores @ 3.6GHz, 16 threads with SMT) in a Dell T5500 with 48GB of RAM (DDR3-1333 R-ECC) and 2x 480GB 7200rpm HDDs for $350. I'm cheap, so if I can get an entire workstation for the cost of a new CPU, hells yeah. Found an RX 580 4GB for only $140 earlier this year. For $490 total, it's a pretty decent setup for 1080p Linux gaming and compiling stuff here and there, especially once mitigations are factored in.



          • #25
            Originally posted by debianxfce View Post
            4-8GB RAM is enough to disable swapping.
            That's funny... I have a minimum of 12GB of memory constantly committed to non-cache things all day at work. This goes way up when I spin up testing VMs (often 2 or 3 at a time).

            My work laptop is maxed out at 16GB (Thinkpad t440p) and even with zswap enabled, I often go a few GB into swap when I have to test certain workflows.

            I'm looking forward to my next laptop refresh (this fall), so I can finally jump to 32GB RAM.



            • #26
              Originally posted by atomsymbol

              In my opinion, with year 2019 common CPU thread counts (8-16) and assuming 2 GiB of memory per thread in parallel tasks using all CPU threads, 16-32 GiB of memory is slowly becoming the norm for a desktop/workstation. 8-16 GiB is on the border of being a limiting factor to full utilization of the CPU.

              Taking a look at https://www.ec2instances.info most of the EC2 instances have at least 2 GiB of memory per vCPU. The Nano instances have less memory per vCPU (minimum being 256 MiB per vCPU) which is enough to run certain types of applications, but this does not negate the fact that the optimum for a year 2019 desktop/workstation is at least 2 GiB of memory per CPU thread.
              That's sort of how I factored RAM for my current system: 8 cores * 2 for HT = 16 threads, * 2GB = 32GB. So I figured 32GB was a decent starting point and ended up getting 48GB because it was $10 more. Why not? What I didn't account for was systemd using half of that for /tmp by default, so it really comes out to 48GB / 2 = 24GB / 16 threads = 1.5GB per thread. That just means I need to pick up another 24GB of RAM to get to 2GB per thread (with a 36GB ramdisk as a bonus). I do all my compiles on my current 24GB ramdisk, except for Firefox with PGO... 24GB isn't enough for that (seriously), so I could actually make use of 72GB of RAM.
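
              If the 50% default is a problem, the /tmp tmpfs can also simply be capped; a minimal sketch, assuming /tmp is a systemd-managed tmpfs (the 8G figure is only an example):

              # shrink the running /tmp tmpfs without a reboot
              sudo mount -o remount,size=8G /tmp
              # check the result
              df -h /tmp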

              Keep my large numbers in mind if you compile your own software or plan on doing it. We need assloads of ram for some of these compiler and optimization processes.



              • #27
                Originally posted by debianxfce View Post

                Of course you need endless amount of RAM when you run windows, gnome3 or kde in your VMs. An average Xfce desktop PC user needs 4GB RAM, but you can not buy 2x2GB memory sticks and 2x4GB starts to be rare too. 2x8GB starts to be mainstream.
                Which means you'd need 16GB of ram to cover 4 Debian XFCE VMs, 8 more for the host system, 8 more for /tmp....or 24GB of ram is where the average XFCE desktop PC user who runs VMs would want to start.



                • #28
                  Originally posted by numacross View Post

                  But they are different, and every operating system knows that there is a difference between core 0 and core 1 on a single-core HT CPU. This of course scales to multi-core ones.
                  How are they different? SMT is symmetric - each hardware thread is equal to the other one.

                  Try your favorite workload on core 0 (by setting core affinity) and then on core 1. You will see that there is no difference between the two results.

                  Now try to run two instances of the workload, one on core 0 and the other on core 1. You will see that each instance is slower, but the aggregate throughput might be higher. For example, if each instance's throughput is 60% of the original, then the aggregate is 120% and the SMT performance uplift is +20%.

                  For example (Windows, as I don't have access to Linux right now), the 7zip compression benchmark on my machine (i7 3770k) shows 4500-4600 MIPS on core 0 and on core 1, when only one core is being used. When I run two instances (one on core 0 and the other on core 1) they show 3000-3100MIPS each - that's 6000-6200 MIPS aggregate and a 30-40% uplift.
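
                  A rough Linux equivalent of that experiment, assuming p7zip is installed and that CPUs 0 and 4 are HT siblings (check the topology first, the numbering varies between systems):

                  # single-threaded 7-Zip benchmark pinned to one logical CPU, then to its sibling
                  taskset -c 0 7z b -mmt1
                  taskset -c 4 7z b -mmt1
                  # both siblings at once: each instance drops, but the aggregate usually rises
                  taskset -c 0 7z b -mmt1 & taskset -c 4 7z b -mmt1 & wait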



                  • #29
                    Originally posted by atomsymbol

                    I am not sure I understand the advantages of ramdisks. Assuming the machine already has a SSD/NVMe disk mounted to /, is there a measurable performance advantage to using a ramdisk for /tmp?
                    You assume too much with my system

                    But, yeah, if you compile a lot of source it's a damn great speed-up for I/O on spinners, and it reduces drive wear regardless of the underlying disk. You can also move games onto one and load assets just a hair faster... I've been known to set mine as large as 40GB to play modded Skyrim.

                    While it has never happened to me, I'd rather have my ramdisk fill up with a buggy program's log spam than my hard drive.
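
                    For anyone who wants to try it, a build ramdisk is just a tmpfs mount; a minimal sketch (the mount point and the 24G size are arbitrary):

                    sudo mkdir -p /mnt/ramdisk
                    sudo mount -t tmpfs -o size=24G,mode=1777 tmpfs /mnt/ramdisk
                    # build in /mnt/ramdisk, then unmount to release the memory
                    sudo umount /mnt/ramdisk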



                    • #30
                      Originally posted by Wielkie G View Post
                      How are they different? SMT is symmetric - each hardware thread is equal to the other one.

                      Try your favorite workload on core 0 (by setting core affinity) and then on core 1. You will see that there is no difference between the two results.

                      Now try to run two instances of the workload, one on core 0 and the other on core 1. You will see that each instance is slower, but the aggregate throughput might be higher. For example, if each instance's throughput is 60% of the original, then the aggregate is 120% and the SMT performance uplift is +20%.

                      For example (Windows, as I don't have access to Linux right now), the 7zip compression benchmark on my machine (i7 3770k) shows 4500-4600 MIPS on core 0 and on core 1, when only one core is being used. When I run two instances (one on core 0 and the other on core 1) they show 3000-3100MIPS each - that's 6000-6200 MIPS aggregate and a 30-40% uplift.
                      7-Zip is a well-behaved integer load which scales well with HT. The decompression benchmark would produce even better numbers.

                      Trying it with ffmpeg 4.1.1, which abuses every SIMD extension including AVX2, on my 4790K (Turbo disabled, constant 4.3GHz):
                      • Single instance limited to 1 thread (both decoding and encoding) in a cmd.exe started with /affinity 1 - core 0
                      ffmpeg -threads 1 -i test.mp4 -benchmark -preset slow -crf 22 -c:a copy -threads 1 test_out.mkv
                      bench: rtime=150.166s
                      • Single instance limited to 2 threads in a cmd.exe started with /affinity 3 - cores 0 and 1
                      ffmpeg -threads 2 -i test.mp4 -benchmark -preset slow -crf 22 -c:a copy -threads 2 test_out2.mkv
                      bench: rtime=130.316s
                      • Single instance limited to 1 thread in a cmd.exe started with /affinity 1 - core 0, with a prime95 FMA3 load running on core 1 at the same time
                      bench: rtime=257.803

                      As you can see, even a well-behaved AVX2 load (-threads 2) actually got sped up a little by HT; however, when the physical core is already loaded with AVX2, both virtual threads have to compete for shared resources. To be fair, the aggregate throughput is still a bit better than with -threads 1.
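
                      A hedged Linux equivalent of the contention case pins ffmpeg and a CPU-bound worker to sibling logical CPUs (the CPU numbers are placeholders, and stress-ng merely stands in for prime95 here):

                      taskset -c 4 stress-ng --cpu 1 --timeout 600s &    # CPU-bound worker on the HT sibling, runs in the background
                      taskset -c 0 ffmpeg -threads 1 -i test.mp4 -benchmark -preset slow -crf 22 -c:a copy -threads 1 test_out.mkv
                      kill %1    # stop the background worker once the benchmark finishes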

                      I'm not arguing that HT doesn't work - in most cases it does, but it has quirks and disadvantages with some loads. It can also complicate matters if you are reliant on latency of execution or are running threads that have long chains of computation dependencies.

                      Another potential trap is the cost of keeping all 8 HT cores synchronized vs. the cost of doing it for only 4 physical cores. It's not unheard of for games to perform better with HT disabled.

