The Issues With The Linux Kernel DRM, Continued

Written by Michael Larabel in Linux Kernel on 25 March 2011 at 10:14 AM EDT.
Yesterday Linus Torvalds once again voiced his anger towards DRM. Not the commonly criticized Digital Rights Management, but the Linux kernel's Direct Rendering Manager. With the Linux 2.6.39 kernel, Linus has once more been less than happy with the pull request for this sub-system, which handles the open-source graphics drivers. Changes are needed.

Responses in our forums, mostly from end-users, have been mixed. For example:
So, Linus, whats better solution:
- "module $BLOB taints kernel", or
- unstable opensource driver stack

Of course, if s/Linux/Windows/ then s/kernel folks/responsible company/; but they keep pushing that then it would inevitably be s/opensource/proprietary/

Why? I dunno. Greed, Fear, Dinosaurs?

In the past day, though, the discussion between Linus Torvalds and the DRM developers has continued in a heated manner. David Airlie's response to the message from Linus:
Linus,

Take a step back, it was an enhancement to a current API, had gotten reviewed by two people when I merged it and made sense. Michel raised his concern after that point, so no matter what it was already in a tree I'd pushed out to public so the only answer when he raised his concern was to revert or fix it. Its a minor problem. Like I'd have pushed this patch post merge window, it solves a real problem that Ilija was seeing and he stepped up and fixed it, post-merge review is what happened here, and really this is nothing compared to say the fallout in the VFS after 2.6.38-rc1.

If you think this has anything to do with Intel's ability to break your hardware on every merge then you've got your wires crossed.

Dave

The response by Ilija Hadzic, author of the patch in question, was:
Just to be fair to Michel (and prevent any unnecessary "fights" on the list, for which I am sure people have had enough by now), the concern in question that triggered revision of the API was raised in time, but I oversaw it in the pile of other comments I was also trying to address.

I am neither the first nor the last guy to inadvertently drop a review comment due to limited "bandwidth", so this should not be made into a big deal. Especially, when no damage was done (except for a few extra E-mails and a little extra work).

Dave did the follow-up patch to satisfy Michel (thanks!) and I have already submitted the user space stuff that matches it, so hopefully we are all aligned now.

thanks,

Ilija

Still dissatisfied, Linus said:
No, it's about the fact that I expect to be pushed code that is WRITTEN AND TESTED BEFORE THE MERGE WINDOW.

The merge window is not for writing new code. The code that gets merged should have been written two weeks ago already. The only new code that I want to see are actual regressions.

I have been talking about this for YEARS now. It's not a new issue. I hate seeing patches sent to me while they are clearly still being discussed and developed. There's something seriously wrong there when that happens.

Linus

And David Airlie again:
Like seriously you really think VFS locking rework wasn't under development or discussion when you merged it? I'm sure Al would have something to say about it considering the number of times he cursed in irc about that code after you merged it.

Here's the point you are missing. I'd quite happily have pushed this *outside the merge window* because it solves a real problem with 0 probability of introducing any new problems, so f'ing what if it was under discussion everything in the kernel is still being discussed and developed. The ABI change was a minor move of the field to leave a larger hole for future changes, it wasn't a fucking fanotify syscall.

This isn't even close to the level of the usual type of fuckups you get in a merge window, it just happens you were cc'ed on the discusson, otherwise I'm betting you'd never even notice. I'm betting something much worse landed in this merge window that you should be giving a fuck about, but this isn't the droid you are lookin for.

Dave.

Linus again:
Umm. That code was basically over a year old by the time it was merged.

How old was the code we're talking about now? Seriously?

And your argument that this case is something you'd have pushed even outside the merge window - I think that sounds like more of the same problem. You say it fixes a problem - but does it fix a REGRESSION?

Do you see the difference? Every single commit I get "fixes a problem". But our rules for these things are much stricter than that.

###
This isn't even close to the level of the usual type of fuckups you get in a merge window, it just happens you were cc'ed on the discusson, otherwise I'm betting you'd never even notice. I'm betting something much worse landed in this merge window that you should be giving a fuck about, but this isn't the droid you are lookin for.
###

Maybe not. But why is it always the DRM tree that has these issues? Why is it that the DRM tree is the one that gets relatively _huge_ patches after -rc1 is out?

I really REALLY wish that you graphics people would at some point admit that you guys have a problem. I am hoping that the intel side is being worked on.

Instead, I see what seems to be you being in a hurry, and arguments why uncooked code should be merged even outside the merge window.

Do you see what I'm aiming at here?

If this was a one-time event, we wouldn't be having this discussion. But the DRM tree is one of the BIGGEST issues after the merge window has closed. And it's EVERY SINGLE RELEASE.

Why? Some introspection please. You don't even have to answer me. I ask you to answer that to yourself.

Linus

After that, Linus ended up pulling the DRM tree and writing another brief message. ".. regardless, it's pulled now. I just hope that some day I'll be taken by surprise, and the drm tree won't be the biggest issue after -rc1."

Jerome Glisse, who has long worked on the open-source ATI driver, going back to his R500-series xf86-video-avivo driver before AMD even had an open-source strategy, also responded. He essentially argues that the DRM sub-system should be treated as an exception because of its complexity and because it is still far behind:
Below are my feeling and likely don't reflect any others people feeling.

DRM have been trying to play catchup for years, GPU are likely the most complex piece of hardware you can find in a computer (from memory management, to complex modesetting with all kind of tweaks to the utterly crazy 3d engine and everythings that comes with it) Unlike others piece of hardware, when it comes to fully use a GPU acceleration, there is no common denominator that we would be able to put in the kernel (like a tcp stack or filesystem). I am sure very few people would like to see a full GL stack into the kernel.

This results in various non common API that each driver expose to the userspace and it's all half cooked, because we have a tendency to release early (which is not necessarily wrong in my eyes). If i were to do it cleanly for one device i wouldn't freeze the API before i got a full fast stack (ie fast GL driver, video decompression, dual GPU, efficient power management) this is exactly what nouveau is doing, they are in the experimental for good reasons, they have the freedom to fix their API and they keep improving it each time their userspace progress.

So from my POV either frozen API for DRM is not a good solution (there is a reason why closed source driver bundly everythings together kernel, GL, ddx, ...) either we should leave in experimental until we get our API right which would likely means several years (2-4years as a wild guess) given current number of people working on this. That would mean that most distribution wouldn't enable the open source driver and then open source driver likely wouldn't get enough testing (kind of a chicken and egg problem).

I am not even talking about on dramatic GPU changes in last few years. For instance few years ago having a dual screen setup meaned that you were king of somethings, or least on the top of the hill. Nowadays dual, triple screens or even more, is common setup but some DRM API was designed without even thinking that one day there would be more than 2 crtc (expectation was likely that there would be flying car by then).

Well this are my feeling, we are just chasing a fast moving target and always shooting short on the API and freeze ourself into corner case. Maybe we, or just i, are bad at designing API (well not always i do believe for instance that the modesetting API we expose is a good one).

Cheers,
Jerome


Ben Skeggs, who now works at Red Hat and has long been involved with the Nouveau project, also had his own response on the matter. Ben basically says that the Linux binary drivers are the lucky ones with a "huge advantage" here: since they ship their entire driver stack as one package, they can break their internal APIs and make other changes whenever they wish.

Ben even suggests that moving the open-source GPU drivers to a model similar to that of the binary drivers may be the better approach. "Part of me does think such an approach with the open source graphics drivers would be better. The current model doesn't really fit too well in my opinion. Though, admittedly, there's different problems to going other ways."

David responded again this morning about the problems with Linus's model: if the policy of only fixing "regressions" after -rc1 is followed, patches have to be queued for up to six months, until the next kernel's release cycle, before fixes get out the door. Meanwhile, user-space components can be updated in a matter of days.

Besides these problems, there's a lack of testing going on for experimental DRM code living in the drm-next tree. "I'm also aware we never get enough testing coverage before stuff hits your tree, we'd need 1000s of testers to run drm-next and we just don't have that variation. So yes when new features hit -rc1 with the drm they nearly always cause regressions, its just not possible to test this stuff on every GPU/monitor/bios combination in existance before we give it to you, that just isn't happening. Like radeon pageflipping caused machines to completely hang and I didn't find out until -rc7 due to lack of testing coverage."

Like Ben, David Airlie is even considering out-of-tree drivers or some other fundamental change to better address these open-source driver issues. "I'm seriously contemplating going back to out-of-tree drivers so we can actually get test coverage before you get things, however that comes with its own set of completely insane problems. Its not like I'm not aware of the problems here, I'm very aware, I'm just clueless on how to provide actual valuable drm code to users in anything close to a timely manner, people buy new graphics card quicker than I can get code into the kernel."

Otherwise users end up waiting months for kernel fixes. "Thats the problem really I read all the discussion and there wasn't much that seemed bad, I think the problem with your suggestions was there was a lot of latitude to disagree with them and I read the comments and disagreed with them as well, and it fixed the problem so I decided it should be pushed or we'd end up waiting another 6 months to fix it for the people who it actually affects. This isn't the message that I'd like to send to people who get off their arses and fix our fuckups."

Jerome Glisse added that too much code sharing in the open-source DRM drivers is also a problem. "My feeling on that is that maybe too much code sharing across gpu of different generation hurt more than it helps. I have got the feeling that some of the newer Intel asic share some of the bit of older one and that intel is focusing there attention on newer one and obviously doesn't have time or resource to check that change they do don't impact older hw (i don't think such testing is doable without massive investment which is very very unlikely to happen given size of linux market)."

Under the current model, it's also challenging to deliver open-source driver support for new hardware: months typically pass after a new ASIC becomes available before there is "out of the box" support or an easy install path. Speaking of DRM bugs, open-source AMD Fusion graphics support is still broken.
About The Author
Michael Larabel

Michael Larabel is the principal author of Phoronix.com and founded the site in 2004 with a focus on enriching the Linux hardware experience. Michael has written more than 20,000 articles covering the state of Linux hardware support, Linux performance, graphics drivers, and other topics. Michael is also the lead developer of the Phoronix Test Suite, Phoromatic, and OpenBenchmarking.org automated benchmarking software. He can be followed via Twitter, LinkedIn, or contacted via MichaelLarabel.com.
