AMD EPYC 7773X Performance Continues To Impress With Tremendous Opportunity For Large-Cache Server CPUs


  • #11
    Originally posted by jochendemuth View Post
    One set of results stuck out a little. Relational databases (MariaDB, PostgreSQL) didn't scale from 1P to 2P. Results were dire for AMD, but even Intel had slowdowns going to 2P.
    There are also cases where Intel improved from 1P -> 2P while AMD significantly regressed! For instance, look at Facebook RocksDB 7.0.1, Test: Read Random Write Random. And while it's true that PostgreSQL regressed on 2P for both Intel and AMD, the regression was far worse on AMD.

    If I were AMD, I'd be analyzing these and every other case where performance regressed on 2P. Ideally, there are simple software tweaks they can make to improve how the software utilizes NUMA systems. And either way, it could inform future CPU design.
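
    The classic first experiment here would be to pin the whole database to a single NUMA node and re-run the benchmark: if the pinned 2P numbers look like the 1P numbers, the problem is cross-node traffic rather than the cores themselves. A minimal sketch using numactl (the command line and data path are my own illustration, not the article's test setup):

    ```python
    # Launch a process with both its CPU scheduling and its memory
    # allocations bound to one NUMA node, so it never pays cross-socket
    # memory latency. Requires the numactl tool; the example command and
    # path below are hypothetical.
    import subprocess

    def run_on_node(node: int, argv: list[str]) -> subprocess.Popen:
        cmd = [
            "numactl",
            f"--cpunodebind={node}",  # schedule only on this node's cores
            f"--membind={node}",      # allocate only from this node's RAM
            "--",
        ] + argv
        return subprocess.Popen(cmd)

    # e.g. keep PostgreSQL entirely on node 0 of a 2P box:
    # run_on_node(0, ["postgres", "-D", "/var/lib/postgresql/data"])
    ```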



    • #12
      Originally posted by brucethemoose View Post

      Technically a splitter like av1an is very scalable, and TBH it's what most people should be using for non-realtime encoding if they have the RAM.

      Live reencoding is very different, yeah.
      Generally, a video can be split into many parts and each part encoded separately; you can also split within each frame, similarly to rendering. The issue is that, unlike rendering, there are strong dependencies between the work done by one thread and another. Macroblocks are variable in size and position, so they also bite into each other.

      The same goes for frame-to-frame encoding - there are things like I-frames and B-frames, so the next frame relies on work done in the previous frame. But if you split a video into chunks you have a problem: what about reference frames where the 1st chunk ends and the 2nd chunk starts? At those boundaries you lose compression.

      Encoding a 2-hour 4K+ video with 16 cores / 32 threads? Not an issue; some compression loss won't be noticeable in the long run.

      Encoding a 60-second 480p video on a 2P, 128-core / 256-thread machine? You're going to have a bad time. Encoder defaults don't even let you use the machine's full potential, for that very reason.

      Generally speaking: pick any modern video encoder, encode at one bitrate with just 1 thread, then encode the same video to the same bitrate with as many threads as you can, and compare SSIM/PSNR/VMAF. Not only will the 1-thread run spend fewer cycles than the multithreaded version, its quality will be better at the same file size.

      Of course, encoding on 1 thread takes insanely long, but most video encoders are optimized for 4-16 cores.
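
      A minimal sketch of that experiment, driving x265 through ffmpeg and scoring with libvmaf (file names, the bitrate, and the thread counts are my own assumptions, and this presumes an ffmpeg built with libx265 and libvmaf):

      ```python
      # Encode the same clip at the same bitrate with 1 thread and with
      # many, then score both against the source with VMAF.
      import subprocess

      SRC = "input.y4m"  # placeholder source clip

      def encode(threads: int, out: str) -> None:
          subprocess.run([
              "ffmpeg", "-y", "-i", SRC,
              "-c:v", "libx265", "-b:v", "2M",
              # pools = worker threads; frame-threads capped at x265's max of 16
              "-x265-params", f"pools={threads}:frame-threads={min(threads, 16)}",
              out,
          ], check=True)

      def vmaf(distorted: str) -> None:
          # libvmaf takes the distorted stream first, the reference second;
          # the score is printed in ffmpeg's log output.
          subprocess.run([
              "ffmpeg", "-i", distorted, "-i", SRC,
              "-lavfi", "libvmaf", "-f", "null", "-",
          ], check=True)

      encode(1, "one_thread.mkv")     # slow, but the quality reference point
      encode(64, "many_threads.mkv")  # fast; expect a slightly lower score
      vmaf("one_thread.mkv")
      vmaf("many_threads.mkv")
      ```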
      Last edited by piotrj3; 13 July 2022, 06:34 AM.



      • #13
        Originally posted by Linuxxx View Post

        Both intel_pstate & intel_cpufreq are also just CPU drivers, yet there is a significant difference between the two, even though both of them are using the performance governor:

        https://www.phoronix.com/scan.php?pa...k-pstate&num=7

        Care to explain where that difference comes from?
        intel_pstate is weird. It has its own governors that behave differently from the regular kernel cpufreq ones; those are used in "active + no HWP" mode, which isn't the default on any CPUs anymore, IIRC. That's why "powersave" with the cpufreq governors means "always lowest frequency", but on intel_pstate it means "something kind of like ondemand".

        On newer CPUs, the default is "HWP", hardware pstates, where the CPU firmware is in full control of the CPU frequency. From page 1 of that article, it sounds like intel_pstate also sets the energy-performance preference MSR to max performance, if the performance governor is chosen. Documentation agrees.

        amd-pstate doesn't work quite like that. Apparently the chips do have significant autonomy, but it seems like it's only used for staying inside thermal and power limits. For long-term, efficiency-related CPU frequency scaling, amd-pstate just wires the kernel cpufreq governor up to the "desired performance" field of the relevant MSR.

        AMD does have an energy-performance preference field (Ctrl+F "EnergyPerfPref" in the Processor Programming Reference (PPR) for AMD Family 19h Model 51h, Revision A1), but grepping through the kernel source and reading amd-pstate.c, it looks like they aren't touching it (follow what modifies the cppc_req_cached field), so it would be left in whatever state it was in before. According to that PPR, the reset value is 0, which means "prefer maximum performance" if it works the same way as Intel's. Unless the BIOS messes with it in response to something acpi_cpufreq does, I wouldn't expect any difference between amd-pstate performance and acpi_cpufreq performance.
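
        If you want to check what a given box is actually doing, the standard cpufreq sysfs files will tell you (a quick sketch; note that energy_performance_preference only shows up when the active driver exposes EPP, so on drivers that don't wire it up the file is simply absent):

        ```python
        # Print the active cpufreq driver, governor, and (if exposed) the
        # energy-performance preference for cpu0, via standard sysfs paths.
        from pathlib import Path

        CPU0 = Path("/sys/devices/system/cpu/cpu0/cpufreq")

        def read(name: str) -> str:
            f = CPU0 / name
            return f.read_text().strip() if f.exists() else "<not exposed>"

        print("driver:  ", read("scaling_driver"))    # intel_pstate, amd-pstate, acpi-cpufreq, ...
        print("governor:", read("scaling_governor"))  # performance, powersave, schedutil, ...
        print("EPP:     ", read("energy_performance_preference"))
        ```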

        I think I remember seeing somewhere that Intel's reset value for EPP is either chip-dependent, or it's very common for firmware to change it when the platform power profile changes.

        Unfortunately I don't have access to any chips new enough to try any of this, so everything I've written here is based on docs and the kernel source.
        Last edited by yump; 14 July 2022, 09:20 PM.



        • #14
          Originally posted by piotrj3 View Post

          Generally, a video can be split into many parts and each part encoded separately; you can also split within each frame, similarly to rendering. The issue is that, unlike rendering, there are strong dependencies between the work done by one thread and another. Macroblocks are variable in size and position, so they also bite into each other.

          The same goes for frame-to-frame encoding - there are things like I-frames and B-frames, so the next frame relies on work done in the previous frame. But if you split a video into chunks you have a problem: what about reference frames where the 1st chunk ends and the 2nd chunk starts? At those boundaries you lose compression.

          Encoding a 2-hour 4K+ video with 16 cores / 32 threads? Not an issue; some compression loss won't be noticeable in the long run.

          Encoding a 60-second 480p video on a 2P, 128-core / 256-thread machine? You're going to have a bad time. Encoder defaults don't even let you use the machine's full potential, for that very reason.

          Generally speaking: pick any modern video encoder, encode at one bitrate with just 1 thread, then encode the same video to the same bitrate with as many threads as you can, and compare SSIM/PSNR/VMAF. Not only will the 1-thread run spend fewer cycles than the multithreaded version, its quality will be better at the same file size.

          Of course, encoding on 1 thread takes insanely long, but most video encoders are optimized for 4-16 cores.
          This isn't necessarily true. "Smart" splitters like av1an split the video wherever there should be an I-frame anyway, and they vary parameters per segment (by doing a quick test encode + VMAF measurement - sketched below), so they can spend the bitrate budget on more difficult scenes more intelligently than standalone encoders typically can. The final VMAF is usually *higher* than a standalone encoder's at the same average bitrate, and they can specifically avoid the black-crush issues many encoders suffer from (which don't necessarily show up in objective measurements).

          And if you're feeling really fancy, you can insert some filters into the encoding chain (like denoising or sharpening) that further scale across cores.

          The tradeoff is much more RAM usage, more external dependencies, and some CPU overhead, but that's certainly worth it on a 256T machine.

          Is that going to saturate such a machine with a 3 scene, 60 second video? No. But hopefully such a server is doing some other stuff concurrently anyway.
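
          To make the per-segment idea concrete, here's a toy version of the probe loop (emphatically not av1an's real algorithm, just the shape of it; the encoder choice, CRF ladder, and VMAF target are all my own assumptions):

          ```python
          # Toy per-scene rate search: quick test encode per scene, measure
          # VMAF, keep the cheapest CRF that still clears the target. Scenes
          # are independent, so many of these loops can run in parallel.
          import json
          import subprocess

          TARGET_VMAF = 93.0

          def encode(scene: str, crf: int, out: str) -> None:
              subprocess.run([
                  "ffmpeg", "-y", "-i", scene,
                  "-c:v", "libsvtav1", "-crf", str(crf), out,
              ], check=True)

          def vmaf_of(distorted: str, reference: str) -> float:
              # Have libvmaf write a JSON log, then read the pooled mean.
              subprocess.run([
                  "ffmpeg", "-i", distorted, "-i", reference,
                  "-lavfi", "libvmaf=log_fmt=json:log_path=vmaf_log.json",
                  "-f", "null", "-",
              ], check=True)
              with open("vmaf_log.json") as f:
                  return json.load(f)["pooled_metrics"]["vmaf"]["mean"]

          def pick_crf(scene: str) -> int:
              for crf in (45, 38, 31, 24):  # cheapest first
                  encode(scene, crf, "probe.mkv")
                  if vmaf_of("probe.mkv", scene) >= TARGET_VMAF:
                      return crf
              return 20                     # fall back to high quality
          ```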
          Last edited by brucethemoose; 17 July 2022, 06:22 PM.



          • #15
            Originally posted by brucethemoose View Post
            that's certainly worth it on a 256T machine.

            Is that going to saturate such a machine with a 3 scene, 60 second video? No. But hopefully such a server is doing some other stuff concurrently anyway.
            BTW, it feels to me like the main use case for encoding on a high-core-count machine is cloud batch processing. Most cloud users are probably processing large batches of videos and are more interested in efficiency and aggregate throughput than in the absolute shortest processing time for a single video. So, I wouldn't expect them to push the thread count per video past the point of diminishing returns.
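
            That layout, many concurrent encodes with each one capped at a modest thread count, is easy to sketch (the job queue and flags are hypothetical; the point is JOBS_AT_ONCE * THREADS_PER_JOB roughly matching the hardware threads):

            ```python
            # Throughput-oriented batch encoding: run many ffmpeg jobs at
            # once, each capped near the thread count encoders still use
            # efficiently, instead of one encode across all 256 threads.
            import os
            import subprocess
            from concurrent.futures import ThreadPoolExecutor

            THREADS_PER_JOB = 8  # near the per-encode efficiency sweet spot
            JOBS_AT_ONCE = (os.cpu_count() or 8) // THREADS_PER_JOB

            def encode(src: str) -> None:
                subprocess.run([
                    "ffmpeg", "-y", "-i", src,
                    "-c:v", "libx265",
                    "-x265-params", f"pools={THREADS_PER_JOB}",  # cap per-job parallelism
                    src + ".hevc.mkv",
                ], check=True)

            batch = [f"video_{i:04}.mkv" for i in range(500)]  # hypothetical queue
            # Each worker just blocks on one ffmpeg process, so a thread
            # pool (not a process pool) is enough to keep the box busy.
            with ThreadPoolExecutor(max_workers=JOBS_AT_ONCE) as pool:
                list(pool.map(encode, batch))
            ```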

