NVIDIA GeForce RTX 3070 Ti Linux Performance

  • #31
    Originally posted by tildearrow View Post
    ...
    Haven't had a single crash (amdvlk-related issues not counted) with both the 5700 XT & 6800 since November. There were a few corruption issues with RADV & RDNA2, but they got fixed quickly and entirely.
    Also tested VAAPI encoding; it seems to encode 10-bit HEVC just fine...
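
    For reference, a typical 10-bit HEVC encode through VAAPI with ffmpeg looks roughly like the sketch below (the render node path is the usual default and may differ per system; file names are placeholders):

    Code:
    # Rough sketch of a 10-bit HEVC encode through VAAPI with ffmpeg.
    # /dev/dri/renderD128 is the common default render node; adjust if needed.
    ffmpeg -vaapi_device /dev/dri/renderD128 -i input.mp4 \
        -vf 'format=p010,hwupload' \
        -c:v hevc_vaapi -profile:v main10 \
        output.mkv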

    Comment


    • #32
      Originally posted by MadCatX View Post
      This sounds like complete nonsense. If the guy was having thermal issues, switching off interrupt balancing is not going to fix that. If you want to see what irqbalance is, how it works and why it's necessary to have it even with the recent kernels, check out this talk from LinuxConf.au (https://www.youtube.com/watch?v=hjMWVrqrt2U)
      Instead of listening to false advertising from one of the authors of "irqbalance", who is apparently desperate to stay relevant, you should look at what is happening in the real world out there.

      How about you go & teach those Debian maintainers that removed irqbalance a lesson:
      In the meantime the kernel can do balancing on its own. In 4.9, I've
      seen it working with aacraid, each queue gets hard-pinned to its own
      CPU from 0 to $NRCPUS. In 4.19 I've seen the same working properly with
      virtio-net.

      With 4.19, even on real hardware, where interrupts have an affinity for
      all CPUs, each interrupt is actually delivered to a different CPU.

      A random example of this; it even selects only one thread of each core:

      | 26: 0 0 0 0 92 0 0 0 IR-PCI-MSI 3670017-edge eno1-TxRx-0
      | 27: 0 0 0 0 0 167 0 0 IR-PCI-MSI 3670018-edge eno1-TxRx-1
      | 28: 0 0 0 0 0 0 467 0 IR-PCI-MSI 3670019-edge eno1-TxRx-2
      | 29: 0 0 0 0 0 0 0 454 IR-PCI-MSI 3670020-edge eno1-TxRx-3

      Now irqbalance comes along to re-do the existing pinning, and the result is no
      longer correct but $RANDOM for the hard queue-to-CPU case of virtio.

      At least Google considers the work irqbalance does to "correct" the existing balancing a large problem.
      And while you're at it:
      From the above quote it is also clear that Google has no clue either...
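
      If you want to see what the kernel's own placement looks like on your machine, a rough sketch like the one below dumps each IRQ's allowed and effective CPU affinity from procfs (effective_affinity_list only exists on reasonably recent kernels):

      Code:
      # Sketch: dump each IRQ's allowed and effective CPU affinity.
      # effective_affinity_list may be absent on older kernels.
      for irq in /proc/irq/[0-9]*; do
          n=${irq##*/}
          allowed=$(cat "$irq/smp_affinity_list" 2>/dev/null)
          effective=$(cat "$irq/effective_affinity_list" 2>/dev/null)
          echo "IRQ $n: allowed=$allowed effective=$effective"
      done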

      Comment


      • #33
        Originally posted by Linuxxx View Post

        Instead of listening to false advertising from one of the authors of "irqbalance", who is apparently desperate to stay relevant,
        Which parts of his presentation do you consider false advertising, and why? The guy even specifically mentions that he'd love to move more IRQ balancing logic into the kernel, but the kernel developers believe that the kind of logic irqbalance implements does not belong in the kernel.

        Originally posted by Linuxxx View Post
        you should look at what is happening in the real world out there.

        How about you go & teach those Debian maintainers that removed irqbalance a lesson:


        And while you're at it:
        From the above quote it is also clear that Google has no clue either...
        This is one quote with very little context and no details. It also seems to be related only to VMs and Debian Cloud images. Also, the distro being Debian, I'd venture a guess and say that they might have been pairing an older irqbalance with a newer kernel and the two ended up stepping on each other.

        Meanwhile, there are no real benchmarks or technically well-backed articles that demonstrate that irqbalance is useless.
        Last edited by MadCatX; 10 June 2021, 09:04 AM. Reason: I can't seem to get my words right today :)

        Comment


        • #34
          Originally posted by MadCatX View Post
          Which parts of his presentation do you consider false advertising, and why? The guy even specifically mentions that he'd love to move more IRQ balancing logic into the kernel, but the kernel developers believe that the kind of logic irqbalance implements does not belong in the kernel.


          This is one quote with very little context and no details. It also seems to be related only to VMs and Debian Cloud images. Also, the distro being Debian, I'd venture a guess and say that they might have been pairing an older irqbalance with a newer kernel and the two ended up stepping on each other.

          Meanwhile, there are no real benchmarks or technically well-backed articles that demonstrate that irqbalance is useless.
          You really ought to use Linux more often instead of just pondering it.

          Last I checked, even my Chromebook wasn't using "irqbalance".

          Also, here are some more real-world examples for the academic doubters:

          Removing IRQBALANCE fixed major stuttering in BeamNG.Drive

          I switched to Linux just a couple of days ago and chose Pop OS after trying out many distros.

          I play very few games; BeamNG is one of them. I have watched videos of it running fine on Linux, so I expected it to run fine on my system too. I set up Steam and Lutris and launched the game; it was literally unplayable and was freezing every second. Switching Wine and DXVK versions did not help. I couldn't find anything useful online. When I raised an issue on GitHub, I was told it wasn't a DXVK or Wine issue, that's for sure.

          Well, by pure luck I came to know about irqbalance while setting up the cpufreq GNOME extension (it's very cool btw). It displayed a warning about irqbalance being enabled. From what I have understood, it distributes CPU load across all CPU cores in order to increase performance. I read some threads about it on GitHub and other forums which said to disable it as it causes more harm than good.

          So I did what is mentioned in the cpufreq FAQ page,

          sudo apt purge irqbalance
          and reboot.

          It did work. The game is running flawlessly now. I am not sure if this issue is specific to Pop OS or not. Just thought posting it here might help some other people...
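
          Purging the package isn't strictly necessary to test this, by the way; assuming the distro ships the usual irqbalance.service systemd unit, a sketch like the following is enough to toggle it and confirm whether the stutter really follows irqbalance:

          Code:
          # Assuming the usual irqbalance.service systemd unit is present,
          # stop and disable it without removing the package:
          sudo systemctl disable --now irqbalance
          # ...test the game, then re-enable it to check whether the stutter comes back:
          sudo systemctl enable --now irqbalance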

          Comment


          • #35
            Originally posted by Linuxxx View Post

            You really ought to use Linux more often instead of just pondering it.
            I'm trying to have a conversation here but I see that you'd rather take a verbal swing at me instead of answering my question.

            I checked some machines I have access to - desktops, physical servers and various VMs - running Fedora, Alma, various flavors of CentOS and openSUSE and out of 9 machines 7 were running irqbalance. The only two exceptions were an ancient CentOS 6 installation and a very minimalistic CentOS 7 setup that I use as a simple Wireguard gateway. I have no Ubuntu machine at hand but the internet suggests that at least 20.04 comes with irqbalance enabled by default.

            Last I checked, even my Chromebook wasn't using "irqbalance".
            Did you check that ChromeOS doesn't use some custom tool to balance IRQs? Given that Google knows the exact hardware and the expected workload, a well-chosen static configuration might do a better job than irqbalance. The hardware is also quite simple; there are no NUMA nodes or complicated cache hierarchies that irqbalance accounts for in its logic.
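
            Such a static assignment is just a write into procfs; a minimal sketch (the IRQ number and CPU here are made up purely for illustration):

            Code:
            # Hypothetical static pinning: restrict IRQ 123 to CPU 2 (both values made up).
            echo 2 | sudo tee /proc/irq/123/smp_affinity_list
            # Verify what the kernel actually applied (file may be absent on older kernels):
            cat /proc/irq/123/effective_affinity_list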

            Also, here are some more real-world examples for the academic doubters:
            If you read the conversation thoroughly, even the DXVK developer is sceptical about irqbalance being the problem here. The person who reported this also doesn't seem to have verified their findings by reenabling irqbalance and checking if the issue comes back.

            Comment


            • #36
              Originally posted by zexelon View Post
              Kudos to Nvidia! I am toying with possibly getting a 3070 Ti, though I really want the 3080 Ti. Their Linux support is second to none. I have a 3090 for use in AI/ML work and it is untouchable in performance. I would like to replace the 1070 I use for gaming while the 3090 is being hammered with training.
              Linus' middle finger especially for you. F*ck nvidia and its fanboys.

              Comment


              • #37
                Originally posted by Volta View Post

                Linus' middle finger especially for you. F*ck nvidia and its fanboys.
                That was a well-known photoshop using the deep-fakes algorithm. It was originally a thumbs up at how well they spank ATI at performance and features.

                Comment


                • #38
                  Originally posted by MadCatX View Post

                  I'm trying to have a conversation here but I see that you'd rather take a verbal swing at me instead of answering my question.

                  I checked some machines I have access to - desktops, physical servers and various VMs - running Fedora, Alma, various flavors of CentOS and openSUSE and out of 9 machines 7 were running irqbalance. The only two exceptions were an ancient CentOS 6 installation and a very minimalistic CentOS 7 setup that I use as a simple Wireguard gateway. I have no Ubuntu machine at hand but the internet suggests that at least 20.04 comes with irqbalance enabled by default.


                  Did you check that ChromeOS doesn't use some custom tool to balance IRQs? Given that Google knows the exact hardware and the expected workload, a well-chosen static configuration might do a better job than irqbalance. The hardware is also quite simple; there are no NUMA nodes or complicated cache hierarchies that irqbalance accounts for in its logic.


                  If you read the conversation thoroughly, even the DXVK developer is sceptical about irqbalance being the problem here. The person who reported this also doesn't seem to have verified their findings by reenabling irqbalance and checking if the issue comes back.
                  Here's the output of cat /proc/interrupts on ChromeOS (just a small part, because of the crappy formatting):
                  HTML Code:
                  CPU0 CPU1 CPU2 CPU3
                  0    1678 0    0   IO-APIC 9-fasteoi acpi
                  2419 0    0    0   IO-APIC 45-fasteoi mmc0
                  0    0    0    177 PCI-MSI 442368-edge snd_hda_intel:card0
                  0    0    642  0   PCI-MSI 327680-edge xhci_hcd
                  As you can clearly see, each hardware module's interrupts are associated with only a single core.
                  The very same pattern also occurs on my Ubuntu PC with »irqbalance« disabled.

                  Now compare the output of the above command with one of your systems where »irqbalance« is running.

                  Notice something?

                  Well, I do:
                  Each of your hardware drivers' interrupts is bouncing around between two different CPU cores, randomly chosen by »irqbalance«.
                  (Please post the output here, so that everybody can clearly see what I mean.)

                  Now, tell me:
                  Why should it be beneficial for your "i915", "amdgpu" or "nvidia" kernel modules to be scheduled around like mad between two distinctly different CPU cores with their own cache locality all the damn time?

                  The Linux kernel already distributes the interrupts all by itself by default; all »irqbalance« achieves is to waste CPU time & increase latency!

                  Don't believe me?
                  Well, then at least listen to the advice given by the SuSE folks in their technical documentation for AMD's EPYC:
                  irqbalance can be a source of latency, for no significant performance improvement. Thus the suggestion is again to disable it.
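
                  If you'd rather not eyeball the raw table, a rough sketch like the one below counts, for each interrupt line in /proc/interrupts, how many CPU columns are non-zero:

                  Code:
                  # Rough sketch: count how many CPUs have actually serviced each interrupt.
                  # The header line of /proc/interrupts gives the number of CPU columns.
                  awk 'NR == 1 { ncpu = NF; next }
                       {
                           active = 0
                           for (i = 2; i <= ncpu + 1; i++) if ($i + 0 > 0) active++
                           printf "%-8s serviced by %d CPU(s)\n", $1, active
                       }' /proc/interrupts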

                  Comment


                  • #39
                    Originally posted by Linuxxx View Post

                    Here's the output of cat /proc/interrupts on ChromeOS (just a small part, because of the crappy formatting):
                    HTML Code:
                    CPU0 CPU1 CPU2 CPU3
                    0    1678 0    0   IO-APIC 9-fasteoi acpi
                    2419 0    0    0   IO-APIC 45-fasteoi mmc0
                    0    0    0    177 PCI-MSI 442368-edge snd_hda_intel:card0
                    0    0    642  0   PCI-MSI 327680-edge xhci_hcd
                    As you can clearly see, each hardware module's interrupts are associated with only a single core.
                    The very same pattern also occurs on my Ubuntu PC with »irqbalance« disabled.

                    Now compare the output of the above command with one of your systems where »irqbalance« is running.

                    Well, I do:
                    Each of your hardware drivers' interrupts is bouncing around between two different CPU cores, randomly chosen by »irqbalance«.
                    (Please post the output here, so that everybody can clearly see what I mean.)
                    Machine 1 (4C/8T laptop, Arch Linux): (https://pastebin.com/raw/mdpEiXqz)
                    Machine 2 (4C/4T file server with cheap desktop hardware, CentOS 7): (https://pastebin.com/raw/d4Ea5M6X)
                    Machine 3 (2 socket EPYC server, CentOS Stream): (https://pastebin.com/raw/gxUKQeih)
                    Machine 4 (4 socket Intel server, openSuSE): (https://pastebin.com/raw/6HM7VNst)
                    EDIT: Added one more laptop to make the data as clear as possible
                    Machine 5 (8C/16T laptop, Arch Linux): (https://pastebin.com/raw/CpTrfh8p)

                    All of the listed machines have irqbalance active.

                    Notice something?
                    One interesting thing I noticed is the handling of NICs. Both of the high-performance servers I listed have NICs with 8 RxTx queues. irqbalance seems to have picked 8 CPU cores to service the NIC, but it didn't assign a specific core to each queue. This sounds like a reasonable compromise between resource locality and load balancing. It is especially apparent on the EPYC machine.

                    Now, tell me:
                    Why should it be beneficial for your "i915", "amdgpu" or "nvidia" kernel modules to be scheduled around like mad between two distinctly different CPU cores with their own cache locality all the damn time?
                    Because the idea is not to pin a specific interrupt to one and only one CPU core. The idea is to maximize resource efficiency.

                    The Linux kernel already distributes the interrupts all by itself by default; all »irqbalance« achieves is to waste CPU time & increase latency!
                    What kind of concrete evidence do you have to back this up?

                    Don't believe me?
                    Well, then at least listen to the advice given by the SuSE folks in their technical documentation for AMD's EPYC:
                    Link, or it didn't happen!
                    Last edited by MadCatX; 11 June 2021, 03:35 AM. Reason: Added one more machine for good measure

                    Comment


                    • #40
                      Originally posted by WolfpackN64 View Post

                      NVIDIA's Linux support is amongst the worst. If they work, they work, great performance and all. If they don't, they're taking your entire system down with them.
                      What the flying f**k are you talking about? Firstly, from using NVidia for the past 10 years on countless Linux distros/desktop VMs, this has never happened. Secondly, from a technical perspective, since Linux is a monolithic kernel, any graphics driver (or any driver, really) can "bring the system down". You would need a micro/formally verified kernel (like seL4) to guarantee that a driver won't bring the system down.

                      Originally posted by mppix View Post

                      The Nvidia driver still breaks periodically unless you are really careful with updates, and Nvidia Wayland support is so bad that non-gamers are better off running the desktop on the iGPU/APU and using the discrete Nvidia card for compute.
                      I don't know what your definition of "breaks periodically" is, but I have had this happen maybe once in 5 years. The only real issue is that after a new kernel release it may take time for NVidia to update their driver, but this is to be expected.

                      Originally posted by mppix View Post
                      Then you also have the GPU acceleration issues. Many distros don't compile their packages with support for Nvidia, e.g. ffmpeg does not come with Nvidia acceleration support enabled in Debian/Ubuntu, and to my knowledge Firefox and Chromium don't have Nvidia backends either.
                      Blame Linux for having 3 different video hardware acceleration APIs (and Google is trying to make another). In any case, distributions solved this issue ages ago (at least Arch/Manjaro has, which maintains a patchset to support all of these video acceleration APIs).
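
                      If you're unsure what a particular ffmpeg build was compiled with, a quick check along these lines shows which hardware acceleration methods and NVENC encoders it offers (the output obviously varies per distro build):

                      Code:
                      # List the hardware acceleration methods this ffmpeg build supports:
                      ffmpeg -hide_banner -hwaccels
                      # Check whether any NVENC encoders were compiled in:
                      ffmpeg -hide_banner -encoders | grep nvenc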

                      Also, in regards to Intel's drivers being stable, this is true, but at the same time the performance of their GPUs is shit, so they are only really used for desktop (and basic acceleration for things like YouTube).

                      It's really easy to have a stable driver if the performance is only average, so it's not an apples-to-apples comparison.
                      Last edited by mdedetrich; 10 June 2021, 06:47 PM.

                      Comment
