Announcement

**coder** · 09 January 2021, 04:50 PM

I wonder if the 10-bit decode path involves mostly scalar, integer ops. That would explain why the x86 architectures do so much worse on it, as well as why the Altra is really able to pull ahead.

**zxy_thf** · 09 January 2021, 07:28 PM

Originally posted by coder

I wonder if the 10-bit decode path involves mostly scalar, integer ops. That would explain why the x86 architectures do so much worse on it, as well as why the Altra is really able to pull ahead.

My guess is there is some bug in the x86 code path.
For PPC and ARM the FPS were halved (compared with Chimera 1080p), but x86 performance goes down to 1/4.
This is weird.

**Michael** · 09 January 2021, 07:29 PM

Originally posted by coder

I wonder if the 10-bit decode path involves mostly scalar, integer ops. That would explain why the x86 architectures do so much worse on it, as well as why the Altra is really able to pull ahead.

At least according to a reader on Twitter, Netflix paid for hand-written ARM assembly for the 10-bit code path, but no one has paid for hand-tuned x86_64 10-bit code yet...

**coder** · 09 January 2021, 11:53 PM

Originally posted by zxy_thf

My guess is there is some bug in the x86 code path.
For PPC and ARM the FPS were halved (compared with Chimera 1080p), but x86 performance goes down to 1/4.
This is weird.

It's not weird, if you consider that x86 has 256-bit AVX2 (and 512-bit AVX-512) that is certainly used in the 8-bit path. 10-bit probably has more trouble using it, so it's probably implemented using scalar code. That explains why it takes a bigger hit on x86 than Power.

And ARMv8-A only has 128-bit vector extensions, which explains its performance disadvantage at 8-bit.
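A rough back-of-the-envelope check of the vector-width argument above. It assumes that 10-bit samples are stored in 16-bit containers, which is an assumption made for illustration and not something stated in this thread. Under that assumption, every register holds half as many pixels at 10-bit as at 8-bit, whatever the ISA:

```python
# Container width per pixel: 8-bit video fits in one byte, while 10-bit
# samples are ASSUMED here to be stored in 16-bit integers.
CONTAINER_BITS = {8: 8, 10: 16}

# Vector register widths mentioned in the thread.
REGISTER_BITS = {"NEON (ARMv8-A)": 128, "AVX2": 256, "AVX-512": 512}


def pixels_per_register(register_bits: int, bit_depth: int) -> int:
    """Number of pixels one vector register can hold at the given bit depth."""
    return register_bits // CONTAINER_BITS[bit_depth]


for name, width in REGISTER_BITS.items():
    lanes8 = pixels_per_register(width, 8)
    lanes10 = pixels_per_register(width, 10)
    print(f"{name}: {lanes8} pixels at 8-bit, {lanes10} pixels at 10-bit")
```

Lane count alone would predict roughly a 2x slowdown on every architecture, so a 4x drop on x86 suggests something more than narrower lanes, such as a code path that is not vectorized at all.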

**edwaleni** · 10 January 2021, 12:15 AM

Dav1d 0.8.1

- ARM32 optimizations for 10bit bitdepth for SGR
- ARM32 optimizations for 16bit bitdepth for blend/w_mask/emu_edge
- ARM64 optimizations for 10bit bitdepth for SGR
- x86 optimizations for wiener in SSE2/SSSE3/AVX2

I also took a look at the use of VSX vector instructions in the wiener filter for PPC. I have seen a lot of SSE/AVX related code but not any VSX before.

I was also curious to see how they handled the discrete CPU architecture references.

P7ZIP needs an update to support multi-arch, and I was looking for some ideas.
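For the multi-arch idea, one common pattern is a table of per-architecture implementations with a portable fallback, selected once at startup. The sketch below is a generic illustration only. It is not how dav1d or p7zip actually do it, and every function name in it is hypothetical. The "optimized" variants are stand-ins that simply compute the same result as the portable one.

```python
import platform
from typing import Callable, Dict, Optional


def checksum_portable(data: bytes) -> int:
    """Portable fallback that works on any architecture."""
    return sum(data) & 0xFFFFFFFF


# Hypothetical per-architecture variants. In a real project these would wrap
# SSE/AVX, NEON or VSX code; here they are placeholders with identical output.
def checksum_x86_64(data: bytes) -> int:
    return sum(data) & 0xFFFFFFFF


def checksum_arm64(data: bytes) -> int:
    return sum(data) & 0xFFFFFFFF


def checksum_ppc64le(data: bytes) -> int:
    return sum(data) & 0xFFFFFFFF


# Different operating systems report the same architecture under different names.
_ALIASES: Dict[str, str] = {
    "x86_64": "x86_64",
    "amd64": "x86_64",
    "aarch64": "arm64",
    "arm64": "arm64",
    "ppc64le": "ppc64le",
}

_IMPLS: Dict[str, Callable[[bytes], int]] = {
    "x86_64": checksum_x86_64,
    "arm64": checksum_arm64,
    "ppc64le": checksum_ppc64le,
}


def select_impl(machine: Optional[str] = None) -> Callable[[bytes], int]:
    """Pick the implementation for this machine, or the portable fallback."""
    if machine is None:
        machine = platform.machine()
    arch = _ALIASES.get(machine.lower())
    return _IMPLS.get(arch, checksum_portable)


checksum = select_impl()
print(checksum.__name__, checksum(b"abc"))
```

Keeping the portable version as the default means an unknown architecture still works, just without the tuned path.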

**arideden** · 10 January 2021, 10:16 AM

To compare these wildly different machines I think you need to use common metrics like fps per watt or fps per dollar. In fact maybe create a scenario like 'which hardware is best for an online streaming setup' and take the hardware cost, rack space cost and electricity cost into account and produce a dollar cost for each type of hardware for running 1000 streams concurrently for a year.
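A minimal sketch of what such a cost comparison could look like. All parameter names are made up for illustration, the figures in the usage example are placeholders rather than measured or quoted values, and the model simply charges the full hardware price in the first year:

```python
import math

HOURS_PER_YEAR = 24 * 365


def servers_needed(streams: int, fps_per_stream: float, fps_per_server: float) -> int:
    """Servers required to sustain all streams at once, rounded up."""
    return math.ceil(streams * fps_per_stream / fps_per_server)


def first_year_cost(
    streams: int,
    fps_per_stream: float,
    fps_per_server: float,
    server_price: float,
    server_watts: float,
    price_per_kwh: float,
    rack_cost_per_server: float,
) -> float:
    """Hardware + rack space + electricity for one year of continuous operation."""
    n = servers_needed(streams, fps_per_stream, fps_per_server)
    energy_kwh = n * server_watts / 1000 * HOURS_PER_YEAR
    return n * server_price + n * rack_cost_per_server + energy_kwh * price_per_kwh


# Placeholder inputs only: substitute measured fps and real prices per platform.
cost = first_year_cost(
    streams=1000,
    fps_per_stream=30,
    fps_per_server=600,
    server_price=10_000,
    server_watts=500,
    price_per_kwh=0.10,
    rack_cost_per_server=1_000,
)
print(f"First-year cost for 1000 streams: ${cost:,.0f}")
```

Running the same function with each platform's measured fps, power draw and price would give one dollar figure per platform to compare.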

**pkese** · 10 January 2021, 11:20 AM

Originally posted by Michael

Talos II 2P server with 44 cores / 176 threads and then for ARM64 was the Altra in its 160 core 2P configuration. Also included in these quick reference tests are performance numbers for the Xeon Platinum 8280 2P and EPYC 7742 2P.

How hard would it be to add information about how many cores and threads the Xeon and EPYC platforms have?

**BlueSwordM** · 10 January 2021, 11:58 AM

Guys, guys.

The actual reason x86_64 CPUs perform a lot worse in 10-bit is that, until recently, no x86_64 SIMD assembly had been written for HBD (10-bit+) decoding.
Even now, these patches haven't been merged into the main branch.

**name99** · 10 January 2021, 08:44 PM

M1 (4+4 cores) looks at those numbers and laughs

https://twitter.com/videolan/status/1329403827309715456?s=21

POWER9 + ARM64 Performance For Dav1d 0.8 AV1 Decoding