ByteDance Working To Make It Faster Kexec Booting The Linux Kernel
-
Originally posted by lowflyer: Most seem to be OK with just about anything "as long as it's good"
Originally posted by lowflyer: Mistakes happen and that's the reason why I think it is important to look at the intent of this "fix".
Apart from that, the Linux Foundation & kernel development community should continue pushing the envelope in security tools, testing, & practices.
-
Originally posted by davidhendricks: Why settle for a mediocre solution when you can do as the ByteDance folks have done and make it work to best suit your business needs?
If you want to use kexec for fast reboots on bare hardware, that's up to you. Personally, I'll take the hit on reboot time in the interest of better stability.
-
After learning about "systemctl kexec" a while back, I read the docs and wrote a script to present a menu of kernels, load up the correct initramfs, and do a kexec reboot. But it never gets used, because I came to the conclusion that half the benefit of a reboot is verifying that the machine is still capable of it.
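A script along those lines could be sketched roughly as follows. This is only a minimal illustration, not the poster's actual script: the function names (`find_kernels`, `kexec_commands`, `main`) are hypothetical, the two initramfs naming patterns are assumptions that vary by distribution, and the sketch prints the `kexec` and `systemctl kexec` commands rather than running them.

```python
from pathlib import Path


def find_kernels(boot_dir="/boot"):
    """Return (version, kernel_path, initrd_path) for each kernel with an initramfs."""
    boot = Path(boot_dir)
    entries = []
    for kernel in sorted(boot.glob("vmlinuz-*")):
        version = kernel.name[len("vmlinuz-"):]
        # Assumption: initramfs naming differs by distro; only two common
        # patterns are checked here.
        candidates = [
            boot / f"initrd.img-{version}",
            boot / f"initramfs-{version}.img",
        ]
        initrd = next((c for c in candidates if c.exists()), None)
        if initrd is not None:
            entries.append((version, kernel, initrd))
    return entries


def kexec_commands(kernel, initrd):
    """Build the commands to stage a kernel and then reboot into it via kexec."""
    # Stage the new kernel, reusing the running kernel's command line.
    load = ["kexec", "-l", str(kernel), f"--initrd={initrd}", "--reuse-cmdline"]
    # Shut services down cleanly, then jump into the staged kernel.
    reboot = ["systemctl", "kexec"]
    return [load, reboot]


def main():
    entries = find_kernels()
    for index, (version, _, _) in enumerate(entries):
        print(f"{index}: {version}")
    choice = int(input("Kernel to boot: "))
    _, kernel, initrd = entries[choice]
    for cmd in kexec_commands(kernel, initrd):
        # Dry run: print the commands instead of executing them as root.
        print(" ".join(cmd))
```

Calling `main()` lists the kernels it found and prints the two commands for the chosen one. Actually running them requires root and, as the rest of the thread discusses, depends on the drivers coping with devices that were never reset by firmware.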
-
Originally posted by coder: The problem is non-local. We're talking about any given hardware devices in your system not being properly reset by their driver. That doesn't have a centralized solution, such as this patch.
If you want to use kexec for fast reboots on bare hardware, that's up to you. Personally, I'll take the hit on reboot time in the interest of better stability.
-
Originally posted by davidhendricks:
The centralized solution is to fix broken drivers. Large companies such as Bytedance have very tight control over their hardware and use a limited number of components. Problems with the general PC industry often do not apply. They also have a small army of kernel developers and can get vendors to fix things as needed.
Fair enough, do what works best for you. Large companies with datacenters full of servers tend to want maximum ROI (e.g. minimum downtime), so spending some effort to make kexec fast and reliable is well worth it if it means kernel/OS updates, error handling, etc. take seconds instead of minutes.
Then comes the security aspect. Let's say you use a dGPU for compute tasks: you also need to make sure that the GPU driver zero-fills the dGPU RAM, so that no old code is left there. Otherwise graphical artifacts can occur because you suddenly see a ghost image of the game you played a few hours ago, or you get some funny compute results from the previous kernel.
You need some kind of driver quality standard when kexec is used, to ensure all device registers get rewritten with sane values and all dedicated RAM gets zeroed.
You even need to ensure that the kexec kernel is not loaded into the previous kernel's memory space and that the previous kernel gets zeroed.
Copied from the wiki:
While feasible, implementing a mechanism such as kexec raises two major challenges:
- Memory of the currently running kernel is overwritten by the new kernel, while the old one is still executing.
- The new kernel will usually expect all hardware devices to be in a well defined state, in which they are after a system reboot because the system firmware resets them to a "sane" state. Bypassing a real reboot may leave devices in an unknown state, and the new kernel will have to recover from that.
-
Originally posted by davidhendricks: The centralized solution is to fix broken drivers.
Originally posted by davidhendricks: Large companies such as Bytedance have very tight control over their hardware and use a limited number of components.
-
Originally posted by erniv2: Then comes the security aspect. Let's say you use a dGPU for compute tasks: you also need to make sure that the GPU driver zero-fills the dGPU RAM, so that no old code is left there. Otherwise graphical artifacts can occur because you suddenly see a ghost image of the game you played a few hours ago, or you get some funny compute results from the previous kernel.
Realistically, I expect all GPU compute runtimes are good about ensuring memory is zeroed when it is first allocated to a process. I think they all now support memory encryption as well.
Originally posted by erniv2: You need some kind of driver quality standard when kexec is used to ensure all device registers get rewritten with sane values
-
Originally posted by erniv2: You need some kind of driver quality standard when kexec is used to ensure all device registers get rewritten with sane values