Linux Patch Sparks Differing Views Over External Monitor Handling With iGPU vs. dGPU

  • #61
I agree there should be a toggle for people who don't need the dGPU for their multi-monitor setup.

In my experience, users who buy Nvidia GPUs to use with Linux usually need CUDA or the Tensor cores. In that case they use the dGPU headless, even on their laptops for dev work, prior to scaling up to a supercomputer.

    Originally posted by Volta View Post

Nobody cares about a POS like the proprietary Nvidia blobs. Kernel developers can do what they want, except breaking user space. In this case the tainted Nvidia blob is the problem.
I do; it's the best GPU driver experience I've had on Linux. When I was using RADV in 2019, every simple change had to be done manually via a config file or the DE's display settings, including enabling variable refresh rate. On top of that, I had to wait for the distro to release a kernel upgrade to get RADV updates. I hope Wayland doesn't completely demolish nvidia-settings; it was nice having a consistent experience across all GNU distros.

If I wanted to use ROCm I was restricted to a handful of supported operating systems. With my Nvidia GPU I just need to install the Container Toolkit on any distribution to use CUDA.



    • #62
      Originally posted by Xake View Post

      Oh, you are so right, they should really be grateful for their toxic users demanding everything their way, pissing on their work, calling them names, and over all behaving like the Karens normally found on r/EntitledPeople.

      Yes, that was sarcasm.

Just because a fix for a race condition broke your setup does not make it any less of a fix. Putting "fix" in scare quotes, as reserv0 does in the bug report, is exactly the kind of toxic behaviour that makes devs quit.
Hell, maybe the commit really did fix something, and in doing so exposed things that had been breaking for people in the background in peculiar, hard-to-debug ways.
Just because it broke things for a vocal bunch does not mean it did not fix things for a less vocal bunch. It certainly did not break everything for everyone.
I am saying this as a user who has three different computers (two of them laptops) with different setups and different Nvidia graphics cards, all needing the binary blob for parts of the setup to work, and I cannot in any way reproduce the bugs mentioned in the reports. Or maybe I am just lucky that my Fedora install has kernel maintainers who fix things before they hit users.

Also, about whether the kernel devs are volunteers: what are they paid to do? Are they first-line support agents employed to hand-hold crybabies into debugging a problem so the real problem can be fixed? Are they employed to take abuse from crybabies who question their work whenever they disagree about the correct way forward with a problem?
No. That is not what they are employed to do.
That essentially makes the time they spend on something like that whiny bug report volunteered time: time spent on things outside their job description.

As I have come to understand it, even as maintainers of kernel subsystems they are not really required to handle every single bug report, because there is a general agreement that many bug reports are just bad. Mostly this is when the reports have not reached the maintainers the correct way, and the users do not want to troubleshoot: they lock their minds onto how it should be fixed, and fail to acknowledge that what they think is a fix might break other things instead, and worse, things the maintainer may well know about, since they made the fix to begin with.

So yeah, you bought a product with a bad reputation: it needs an out-of-tree, binary, non-open-source driver that often breaks with new kernel releases. No one else knows what it does, whether it might break with changes to the mainline kernel, or why. The vendor (Nvidia) also has a reputation for being slow to fix breakage for users it does not really care about (read: bleeding-edge or non-high-volume customers), even more so if the problem is not easily reproduced.
Then something breaks due to a fix, and you get hit by it.
The correct thing to do is to find out _why_ that fix broke the driver from that vendor. Only the vendor can do that and help with it, since no one else knows how the driver works.
Instead, you ask the maintainers to remove the fix. And when the maintainer says this is not the correct way to handle the problem, and that the vendor needs to look into why it breaks, you start to whine and make comments suggesting that the fix is not really a fix, stuff like:

which just goes to show that you know nothing about the problems, about the policies on what is and is not OK to break (userspace is not OK to break; the Nvidia driver is not userspace), or about what is going on here. You just want to demand things, because something broke. For you. On a device. From a vendor about which the head maintainer of the kernel said this about 10 years ago: https://youtu.be/_36yNWw_07g
So nothing here is new. Nvidia breaking is not a new issue, not a new problem, and not something that is going to stay "BroKEn ForEVEr!1!!", since the problem will sooner or later get fixed. Or there is a workaround, from the distro or of your own devising, that works until something else breaks.
The bug report against the mainline kernel is not the correct place to whine about that. It is the correct place to take the time to debug why it broke (not just what broke; that is only something to help figure out why) and fix that problem, which might need the involvement of the Nvidia devs.
So go somewhere other than the bug report to whine about it being ignored or "not cared about".

Oh, hell. I just realized what you and birdie use the Phoronix forums for: a place to hopefully find other whiny entitled people to form your own little bubble of self-pity.
1st: when the user got a little toxic, so did the response.

2nd: a user who reports a bug is not just a user; they are a contributor who helps document issues. So are people writing technical documentation for hardware, and people reporting security issues. Not only people contributing code are contributing. Underrating the work of people who put in no less effort is something I feel is EXTREMELY prevalent in GPL communities.

3rd: it is normal that users or QA care more about things working the way they should than about the means by which they work. For a commercial developer, priority number 1 is still whether it works and is usable; readable code, open sourcing, extensibility etc. come 2nd. And I feel that for an open source project it is quite important, when those priorities conflict, to look at the situation overall and ask whether this is really what you want. Issues like this definitely do not help users.

4th: the issue is impactful and doesn't only affect Nvidia. It impacted an ARM64 GPU driver (that was open source) during the merge window, and it impacted AMD's proprietary driver, which HAS TO EXIST (for many reasons) even though AMD is so nice and open and contributes stuff.

5th: many kernel developers forget the use case where a relatively old machine with an older kernel gets a new GPU and needs a driver for it. Then you have to backport the driver, or risk installing a newer kernel that might be incompatible with the distro.

6th: companies want to give you day-1 support on Linux, and here I don't mean the bleeding-edge kernel but day-1 support on already existing distros, where kernel patches might arrive in months, if ever. The ability to install something proprietary, or open source but not mainlined, is a real advantage.

7th: because of (6th), if a company wants to be nice, it must at minimum either not suffer from the 6th issue or spend money maintaining two driver teams, one for mainline and one for the proprietary driver. If you do open source support only, support for the distro of your choice might appear a few months too late, which limits the usability of your product. If you do proprietary only, you have the Nvidia situation. If you do both, you waste money doing the same work twice.

8th: screw Nvidia, think about AMD. Even though AMD is nice and has both teams, they still get hit in the face the same as Nvidia, for the very same reason. Funny, huh?


Personally? I wish Nvidia in the kernel were treated the same way as Nvidia in KDE. In KDE an Nvidia employee is often called in to help with Nvidia-related issues; he contributes, the contribution goes in, and no one cares whether the contribution is "in the spirit of the GPL" or "must interact with open source userspace" etc. The patch is done, it helps users, in it goes.

Lower the fences a little and simply let Nvidia contribute (even if the contribution only helps the kernel <-> proprietary driver interaction). That would help users a lot.
      Last edited by piotrj3; 18 August 2022, 07:18 PM.



      • #63
piotrj3 Nvidia devs can contribute as much as they want; the open source community isn't trying to hold them back. Rather, it is likely Nvidia that is holding them back.

The API in question is an internal kernel API; the kernel developers can change it however they like.

If that breaks the Nvidia driver, then the best thing to do is to downgrade your kernel while waiting for the fix from Nvidia.



        • #64
          Originally posted by piotrj3 View Post
3rd: it is normal that users or QA care more about things working the way they should than about the means by which they work. For a commercial developer, priority number 1 is still whether it works and is usable; readable code, open sourcing, extensibility etc. come 2nd. And I feel that for an open source project it is quite important, when those priorities conflict, to look at the situation overall and ask whether this is really what you want. Issues like this definitely do not help users.
Couldn't agree more. I have always felt that in Linux and many other open source projects, ideology or code readability/correctness comes at the expense of usability. I'll just leave a quote from Linus Torvalds himself, from the end of this video, that I think really fits this situation:

          You can write code that looks beautiful, but just doesn't actually solve the problem.
          Last edited by user1; 19 August 2022, 10:14 AM.



          • #65
            Originally posted by piotrj3 View Post
Anyway, we are going too far off topic.

I think the change should at least be optional, because:
a) copying framebuffers isn't totally free,
b) there are legitimate use cases that benefit from it (G.A.M.E.R.S. WHY DOES NO ONE TALK ABOUT GAMES?)



I hate this approach. First, such dual-Intel and dual-AMD systems exist, and nothing stops anyone from building an Intel/AMD dual-GPU system. Only Nvidia, however, cared enough to have the external outputs (in laptops) wired directly to the dGPU, because it was faster, lower latency, and in general makes more sense (e.g. if you use external monitors, you are likely on the power brick as well). Again Linux philosophy wins over making the system actually good for users, which is basically saying: I don't care about making a system used by millions of users, I care that it is done the way I want.

Also, karolherbst, I am not certain, but doesn't it only apply to MUX-ed scenarios?

In laptops there are 2 configurations:
- MUX-less, where everything is wired to the iGPU,
- MUX-ed, where the display can be routed to either the dGPU or the iGPU and switched between them.

If we talk only about the 2nd case: it mostly only occurs on laptops that support things like G-Sync/FreeSync, or on beast workstations, and I believe we are not talking about the average Dell XPS here (those are MUX-less). Seriously, I don't know of a single laptop with an Nvidia MX150-450 or a GTX/RTX xx50 (or below) that uses a MUX. This is why I don't think the issue raised is actually serious: 90% of internal/external displays in laptops are wired to the iGPU, and I only know from the news that a few MUX-ed cases exist, all of them insane configurations like an RTX 2080 in a laptop.

The patch, for me (if I understand it right):
- in the MUX-less case (90%+ of all laptops with an Nvidia GPU), things work as they did before,
- in the MUX-ed case (a few unusual configurations with ultra-powerful components), behaviour does change,
- while the case the patch addresses is when the external display is not supported at all, because it is wired only to the dGPU.

Overall? I think it is a good patch, because most MUX-ed cases will be using the dGPU anyway.
            So the entire situation is a bit more complex overall.

            At the moment all the composition is done on the iGPU, so there is a huge amount of overhead displaying anything on the dGPU: 1. render on dGPU 2. copy to iGPU 3. composite on iGPU 4. copy to dGPU.

            In theory having a mux can allow you to make all the overhead go away: you composite on the GPU the display belongs to. Sadly this needs major reworks in all the compositors we have today (and is outright impossible to do with X anyway).

We already have laptops where the _internal_ display is on a mux, and this is intended for dynamic use cases. E.g. you want to play a game but don't want the iGPU <-> dGPU copy overhead, so you flip the display over to the dGPU and let the compositor transition its composition context over as well (which compositors can't do yet, but that's the theory).

So treating muxes as a dynamic switch you can flip at runtime gives you interesting choices.

You run into trouble if you don't, and that's why the patch is bad. There aren't many people who really care about being able to connect 4, or even just 3, displays. Also the overhead isn't all that big, especially given that if you want to game on a budget you'd do that on a desktop anyway.

The main reason here is really just being able to connect more displays, and I honestly don't see that as worth the increased power consumption if it makes the difference between no fan noise and some fan noise. I have a laptop where all displays connected via USB-C route through the iGPU, and my fans are off most of the time.
Once the dGPU is in use, there is already too much heat, so the fans start to spin.

So given all that, what would people generally prefer?
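To put rough numbers on that copy overhead (back-of-the-envelope arithmetic only, assuming an uncompressed 4-byte-per-pixel framebuffer; not a measurement of any real compositor):

```python
# Rough bandwidth eaten by the extra framebuffer copies in the
# render-on-dGPU path: render on dGPU -> copy to iGPU -> composite
# -> copy back to the dGPU-driven display (2 copies per frame).
# Purely illustrative; real compositors may copy less or use compression.

def copy_bandwidth_gib_s(width, height, hz, bytes_per_pixel=4, copies=2):
    """GiB/s of bus traffic spent just copying frames between GPUs."""
    frame_bytes = width * height * bytes_per_pixel
    return frame_bytes * hz * copies / 2**30

# A 4K panel at 60 Hz with two copies per frame:
print(f"{copy_bandwidth_gib_s(3840, 2160, 60):.2f} GiB/s")  # prints 3.71 GiB/s
```

Small next to PCIe bandwidth, but it keeps both GPUs and the bus busy, which is exactly the power and fan-noise trade-off described above.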



            • #66
This is a patch which toggles display multiplexing via ACPI? That is cool. I didn't know laptops with display multiplexing still existed, let alone had ACPI control. I had two multiplexing laptops, a W520 and a P50, but the multiplexer could only be toggled in the BIOS. ... Reading the comments I get the feeling that most people don't even know it's a thing.

              So the kernel devs are opposed to enabling this because nouveau is a bad experience? Laptops that have this capability don't have to turn it on.



              • #67
                Originally posted by karolherbst View Post

                So the entire situation is a bit more complex overall.

                At the moment all the composition is done on the iGPU, so there is a huge amount of overhead displaying anything on the dGPU: 1. render on dGPU 2. copy to iGPU 3. composite on iGPU 4. copy to dGPU.

                In theory having a mux can allow you to make all the overhead go away: you composite on the GPU the display belongs to. Sadly this needs major reworks in all the compositors we have today (and is outright impossible to do with X anyway).

We already have laptops where the _internal_ display is on a mux, and this is intended for dynamic use cases. E.g. you want to play a game but don't want the iGPU <-> dGPU copy overhead, so you flip the display over to the dGPU and let the compositor transition its composition context over as well (which compositors can't do yet, but that's the theory).

So treating muxes as a dynamic switch you can flip at runtime gives you interesting choices.

You run into trouble if you don't, and that's why the patch is bad. There aren't many people who really care about being able to connect 4, or even just 3, displays. Also the overhead isn't all that big, especially given that if you want to game on a budget you'd do that on a desktop anyway.

The main reason here is really just being able to connect more displays, and I honestly don't see that as worth the increased power consumption if it makes the difference between no fan noise and some fan noise. I have a laptop where all displays connected via USB-C route through the iGPU, and my fans are off most of the time.
Once the dGPU is in use, there is already too much heat, so the fans start to spin.

So given all that, what would people generally prefer?
Wouldn't the simple solution then be to default to the iGPU for 1-3 displays (as that is all AMD or Intel supports anyway), and to the dGPU for 4+?



                • #68
                  Originally posted by piotrj3 View Post

Wouldn't the simple solution then be to default to the iGPU for 1-3 displays (as that is all AMD or Intel supports anyway), and to the dGPU for 4+?
yeah well... apparently, as the developer told me, it's a switch you can only toggle once at boot time, so it's all or nothing. Oh, and the toggle flips it over for all of DisplayPort or something.



                  • #69
                    Originally posted by piotrj3 View Post

1st: when the user got a little toxic, so did the response.

2nd: a user who reports a bug is not just a user; they are a contributor who helps document issues. So are people writing technical documentation for hardware, and people reporting security issues. Not only people contributing code are contributing. Underrating the work of people who put in no less effort is something I feel is EXTREMELY prevalent in GPL communities.
I agree with you on that. But that also requires the bug report to be constructive, and the reporter to be helpful; just locking yourself into a position and going toxic is neither.
That bug report is a pretty hard read, and you cannot get the full picture of the problem from it either. Read it again and tell me exactly how many versions of which drivers are impacted on which versions of the kernel, and whether any driver is fixed in any version of the kernel. Most of the bug report contributes nothing towards making the maintainers aware of the full problem, and thus does nothing towards a good fix.

                    Originally posted by piotrj3 View Post
3rd: it is normal that users or QA care more about things working the way they should than about the means by which they work. For a commercial developer, priority number 1 is still whether it works and is usable; readable code, open sourcing, extensibility etc. come 2nd. And I feel that for an open source project it is quite important, when those priorities conflict, to look at the situation overall and ask whether this is really what you want. Issues like this definitely do not help users.
And here is where you are both right and wrong.
A fix for a bigger problem can create problems for a smaller set of users/QA, and you do not re-break the majority of users, or paying customers, to satisfy a subset. Nor do you remove a fix for a headache that was committed in full knowledge that it might expose other problematic behaviours which should be fixed anyway, because some code was using/abusing the possibilities left open by the headache-inducing broken behaviour.

I work in development for a company, so I know the tension between priorities and working code.
I am currently aware of a problem some of our users have with one of the systems I am responsible for: their experience broke due to a bugfix. The problem is that said bugfix fixed a years-old faulty behaviour that silently created problems in the background which no one noticed until an event triggered an outage for a couple of thousand of our customers.
Our company's priority is not to risk breaking a couple of thousand customers again, so the handful of broken users have to wait for a proper fix. The next problem is that said bugfix is not prioritized, since new legal requirements we have to adapt our systems to come with hard dates, and thus take priority over fixing that handful of users.

The race-condition fix feels very much like this kind of problem. It fixed a faulty behaviour; some other drivers stopped working in some edge cases.

                    Originally posted by piotrj3 View Post
4th: the issue is impactful and doesn't only affect Nvidia. It impacted an ARM64 GPU driver (that was open source) during the merge window, and it impacted AMD's proprietary driver, which HAS TO EXIST (for many reasons) even though AMD is so nice and open and contributes stuff.
Which part of the AMD driver broke? The only thing I have heard broke for them was the closed-source userspace part of the driver; I have not heard that users of the open-source userspace part have had any problems.
And did it break for all the customers using the AMD binary driver?

This is the problem with the bug reports for this issue I have seen so far: there is so much focus on shit-flinging that what is and is not impacted, in which versions, and for whom, is hard to find out. As I said, I have at least three different systems with the Nvidia binary blob, and none of them experience the reported problems.
I also have an HTPC running the AMD open source driver without problems.
Why do I not experience any problems? What do I do differently? If I cannot figure out how to reproduce it, how could a maintainer, who needs to figure out whether the fix really is to revert, or whether the fix might be something different?

What I have seen reported so far is:
* Nvidia - however I have yet to be able to reproduce it, so it cannot be everyone on Nvidia
* AMD binary - I have seen the AMD binary reported once, as part of a list somewhere (I am using the open source driver on a fourth system without problems). But that list, like everything else naming AMD, has been users pointing at other users pointing at yet other users to say "it is not only Nvidia"; I have yet to see a real user say "_I_ have the problem, and these are the versions I am using"
* vmwgfx - if you read the thread on patchwork, they state that they found and fixed the problem during the 5.19 rc cycle. Has anybody actually been hit by this? Did they not backport the vmwgfx fix from 5.19 to 5.18? Would that not be a more correct way of handling it than reverting the change?

                    Originally posted by piotrj3 View Post
5th: many kernel developers forget the use case where a relatively old machine with an older kernel gets a new GPU and needs a driver for it. Then you have to backport the driver, or risk installing a newer kernel that might be incompatible with the distro.
This is a problem with all operating systems. Have you tried running an RTX 3060 on Windows Vista?
Windows 7 still gets new drivers because some pretty large companies pay MS for ESU (extended until at least January 10, 2023), and Nvidia is interested in those large companies' large budgets.
The real difference between Windows and Linux is that Linux has a plethora of distros with different kernel versions, different patch levels of those kernels, and also users who refuse to update/upgrade.
How many times have you seen this problem on an Ubuntu LTS or a Red Hat Enterprise installation? I am seriously asking, since I myself have not needed an Enterprise/LTS for a long time except on headless virtual servers, and the closest problems there are pretty different from the ones we are talking about here.

Also, if the distro does not care enough about its users to help them fix problems in the distro, why should the hardware vendor?

                    Originally posted by piotrj3 View Post
6th: companies want to give you day-1 support on Linux, and here I don't mean the bleeding-edge kernel but day-1 support on already existing distros, where kernel patches might arrive in months, if ever. The ability to install something proprietary, or open source but not mainlined, is a real advantage.
The distros want this too, and should work with said company to enable it. So if a customer/user has a problem, the correct way to escalate it is to the distribution, so they can help the user gather the information the distribution needs to take it up with the company/vendor/maintainer. And if you as a user do that last step yourself, realize that the people you are talking to in kernel development often know much more than you do, both about the details and about how they fit into something bigger, and adapt your attitude to match.

                    Originally posted by piotrj3 View Post
7th: because of (6th), if a company wants to be nice, it must at minimum either not suffer from the 6th issue or spend money maintaining two driver teams, one for mainline and one for the proprietary driver. If you do open source support only, support for the distro of your choice might appear a few months too late, which limits the usability of your product. If you do proprietary only, you have the Nvidia situation. If you do both, you waste money doing the same work twice.

8th: screw Nvidia, think about AMD. Even though AMD is nice and has both teams, they still get hit in the face the same as Nvidia, for the very same reason. Funny, huh?
So why do both AMD and Nvidia have binary blobs? The difference is that AMD has a fully open source driver and maintains an optimized binary alternative for the userspace parts of the driver (probably to hide IP and other things), while Nvidia has yet to come that far and so far only really supports its binary blob versions.

And as I said: I have yet to see anyone report problems with the fully open source AMD driver. That said, every thread on the issue is so spammy with irrelevant stuff that I might have missed it.

                    Originally posted by piotrj3 View Post
Personally? I wish Nvidia in the kernel were treated the same way as Nvidia in KDE. In KDE an Nvidia employee is often called in to help with Nvidia-related issues; he contributes, the contribution goes in, and no one cares whether the contribution is "in the spirit of the GPL" or "must interact with open source userspace" etc. The patch is done, it helps users, in it goes.

Lower the fences a little and simply let Nvidia contribute (even if the contribution only helps the kernel <-> proprietary driver interaction). That would help users a lot.
And here you are wrong. Nvidia is part of the kernel community; they do release and commit fixes to the kernel, look at the Tegra driver for example.
The problem, as far as I have come to understand it, is that it is Nvidia that has fences up against the kernel developers.

Nvidia is like: do you have a problem with our driver? Report it in our forums, and we will not look at it unless you attach the report from our report tool as part of your bug report, even if your system is so broken you cannot run the tool.
What can the kernel developers do in this kind of situation? They cannot troubleshoot what the Nvidia blob does or does not do. They are _guessing_ what the blob does in the bug report, and that is why they say "you need to report this to Nvidia". They cannot really tell what they might do to fix the situation; they need help from Nvidia. And they know that Nvidia will not touch the bug report in the kernel Bugzilla, since it was not escalated the "proper" Nvidia way.

