Originally posted by Weasel
I actually don't know of any way to actively abort jobs on the firmware,
so this is pretty much the only option I have. I've even seen
long-running compute jobs on macOS run to completion even if you kill
the submitting process, so there might be no way to do this at all.
Though in practice since we unmap everything from the VM anyway when the
userspace stuff gets torn down, almost any normal GPU work is going to
immediately fault at that point (macOS doesn't do this because macOS
effectively does implicit sync with BO tracking at the kernel level...).
Weasel, Rust's safe/unsafe distinction is one factor in why the developer looked at this code, but the nature of some GPUs is a factor as well.
Hopefully Apple has not made a mistake with the AGX and left in a way to infinite-loop it, a "halt and catch fire" scenario. There are reasons why you want to be able to force-terminate GPU jobs, so most GPU designs include this feature. Note I said most; the AGX is not the only exception.
Weasel, the prior stuff-ups in these areas that caused kernel panics no longer do after this modification.
Weasel, think about drm_sched_fini: why does every driver have to do its own checks before calling drm_sched_fini to make sure no job is still using the device? That is code duplication. The Rust developers' insistence that C code should validate its own state does lead to better C design choices in the long term in lots of cases.
Yes, the one you mentioned that relates to the Rust driver, "can_run_job", in fact does not relate to safe vs unsafe code.
It relates to how on earth you cap the maximum number of jobs you can run on a GPU at any one time; that is a different matter.
But keep in mind none of this is documented, and there's no way for us
driver authors to understand what we're supposed to do without
documentation.
Both of these solutions are authors with different backgrounds looking at the code and coming up with different answers.
Rust developers looking to have C code they can declare safe does cause some interesting design choices, like the drm_sched_fini one, that could over time remove a lot of graphics driver code and make it simpler to write new drivers, because there are fewer lines of code to get right before the driver works correctly.
Weasel, the developer has gone into detailed reasons for both alterations; it's not just "hey, we want Rust safe support". Rust safe support caused them to look at drm_sched_fini in more detail and notice: hang on, why are we doing this in a way where a person can miss cleaning up all the jobs, then call drm_sched_fini and kernel panic the system? It would not add much overhead to do automatic cleanup here, particularly when this function is only called when you are basically shutting down a GPU.
Weasel, think about it: how would you like to shut down your system and have it kernel panic because of some GPU job, so your files are not written to disk correctly, just because drm_sched_fini works the way it currently does and some driver developer did not notice that a job on the GPU was not closed before calling drm_sched_fini? Just because something is motivated by Rust safe vs unsafe does not invalidate the assessment that the current implementation of drm_sched_fini can be stupidly dangerous and that a change should be suggested, particularly considering drm_sched_fini is not a path that needs to be high performance, and a failure leading to a panic at the point where drm_sched_fini would normally be called can mean data loss. People make safety checklists for a reason; making sure you don't terminate a scheduler with jobs still on its list belongs on that checklist.