Originally posted by linuxgeex
Normally application or framework notification is handled via separate paths, either by the application checking vendor-specific error codes or (more recently) through robustness-type extensions that are being added to various APIs.
That said, the general industry philosophy seems to be to draw a line between <HW + OS> and <application>, i.e. the HW + OS is responsible for keeping the hardware reliable and understanding when something has gone wrong, while the application is supposed to run in a protected and reliable world and not have to care about things like hardware failures.
The missing link AFAICS is that the model was developed when mainframes primarily did batch processing, and so things like checkpoint/restart processing could be done at an OS level. Now that we have dozens of different frameworks all doing sort-of-batch-processing in a vendor-independent way, it feels like some more standardization is required. There seems to be a fair amount of that happening on the CPU side, and a few of us are working on extending that to GPUs as well.
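To make the checkpoint/restart idea a bit more concrete, here is a rough sketch of what it looks like when a framework has to do it for itself instead of relying on the OS. Everything in it is made up for illustration: `DeviceLost` is a hypothetical stand-in for whatever error a real API would report, and `run_batch`, `save_checkpoint` and `load_checkpoint` are not taken from any actual framework.

```python
import json
import os
import tempfile


class DeviceLost(Exception):
    """Hypothetical stand-in for a vendor- or API-specific 'device lost' error."""


def save_checkpoint(path, state):
    # Write to a temp file and rename it, so a crash never leaves a
    # half-written checkpoint behind.
    fd, tmp = tempfile.mkstemp(dir=os.path.dirname(path) or ".")
    with os.fdopen(fd, "w") as f:
        json.dump(state, f)
    os.replace(tmp, path)


def load_checkpoint(path):
    # Resume from the last saved state, or start fresh if there is none.
    if os.path.exists(path):
        with open(path) as f:
            return json.load(f)
    return {"step": 0, "total": 0}


def run_batch(path, items, process, max_restarts=3):
    """Process items in order, restarting from the last checkpoint on DeviceLost."""
    restarts = 0
    while True:
        state = load_checkpoint(path)
        try:
            for i in range(state["step"], len(items)):
                state["total"] += process(items[i])
                state["step"] = i + 1
                save_checkpoint(path, state)
            return state["total"]
        except DeviceLost:
            # Assumption: the device can be reused after the error. A real
            # framework would have to reset or recreate it here.
            restarts += 1
            if restarts > max_restarts:
                raise
```

The point of the sketch is that every framework ends up writing its own version of this loop, with its own idea of what "state" and "device lost" mean, which is where some standardization would help.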