KDBUS Is Being Removed From Fedora, Could Be A While Before Being Mainlined


  • #41
    Originally posted by interested View Post
Well, on systemd distros init (systemd) does actually start and control services already in initramfs, it then pivots into rootfs after it is mounted, and later may pivot back into initramfs after rootfs has been dismounted, to e.g. shut down and disassemble complex raid arrays.
    dbus can't pivot back and forth, so it can't really be used in initramfs, which is a major reason for using kdbus, since it will be available from the earliest boot to after rootfs has been dismounted.
I have seen the dracut source code. dracut should not start systemd's init until it pivots into the actual rootfs. Either the kernel starts the dracut init script, which runs its "drivers" prior to the pivot, or it starts the systemd-init daemon. It cannot do both. I suppose that the systemd-init daemon could be execve'd after that script to do the pivot itself, but that does not provide any real benefit. If it really is starting systemd-init, would you provide actual links to dracut's source code showing where it does this?

As for "disassembling complex raid arrays", this is just overhead that slows down the shutdown process. You can remount read-only to force dirty data to be written out and then shut down the system. There is no need to disassemble, because it does not change what the boot process sees when it starts. If you are using a sane next-generation filesystem like ZFS, you do not even need to do that: after a sync call to flush dirty data, you can quite literally get away with cutting the power, as there is no need to even do umount.

    Originally posted by interested View Post
Since systemd initramfs like Dracut are moving to an entirely event-driven system instead of hard coded shell scripts, there will be ever more services needing to communicate with each other and PID1 when they react to hardware and networking events. Kdbus/dbus would be really useful for that, and since there clearly is a need for services to communicate, why not use a standard IPC for it?
    The term "event driven" is perhaps the most misunderstood buzzword ever. An event implies that something else does something. During the boot process, the only things that do something are perhaps device initialization, which is why we have udev for events. So far, no one has proposed a sane use case where anything else is necessary.

    Originally posted by interested View Post
Another, more long term reason for wanting kdbus is complex raid and storage setups. Neither the kernel, udev, systemd etc. knows anything about the organization of such sub-systems; they only know disks. So you have this problem with rootfs residing on a storage system that has no standard way of communicating state to init. A lot of the present day raid assembly relies on crude timeouts, so that a raid array may be incomplete and started in degraded mode just because a disk showed up late to the party. It would be really nice if Linux had a standard way of communicating with storage subsystems so that they could tell init their state of readiness etc. even before rootfs was mounted. kdbus could be a nice solution to this problem, which also occurs when disassembling the storage systems when shutting down.
"A lot of the present day raid assembly relies on crude timeouts, so that a raid array may be incomplete and started in degraded mode just because a disk showed up late to the party." is the first thing you have said that actually makes sense. The problem is independent of the message passing mechanism used, and kdbus makes it no easier or harder than it already is. One way to try to deal with this is to listen to the hardware events from either the hotplug helper (mdev) or the netlink interface (udev), look at the devices in /sys and their status, and wait until things converge. You are not going to avoid a timeout by doing this, and having something in the kernel do it and communicate it over kdbus would place unnecessary complexity into the kernel. Even if you shoved such logic into the kernel, you could very easily just define a file in /proc into which one could echo a timeout and block until either everything converges or it times out, and then another file to check to find the result, without any need for a kdbus interface. The latter would not be compatible with the scripts used for initramfs environments and would require more complexity in an area where you need less.

    Originally posted by interested View Post
initramfs is bound to become more featureful in the future, simply because storage layouts and boot options will become much more flexible, with people wanting to boot their laptops using WiFi with "/" or just /usr being a btrfs snapshot residing on a remote server. Shell scripting and hard coding initramfs for every such permutation of storage and boot options becomes extremely complex, which is why stuff like Dracut is moving to an event-driven model, and why something like kdbus would be very useful to allow all those services to talk to each other before rootfs is mounted.
    The idea of the initramfs was to avoid hard coding anything by allowing it to be flexible. No decent initramfs generator hard codes the configuration into itself and instead learns it from the commandline options. The logic needed to process that must go somewhere and your suggestion that kdbus is some kind of answer to this suggests to me that the logic would go into the kernel. Having spoken to mainline developers about putting new logic into the kernel's boot code (module loading), I am fairly confident that will never happen. The entire point of the initramfs archive was to simplify the kernel's boot code and make debugging easier. Shoving logic into the kernel here is a step backward.

That said, it would be easier for Microsoft to make Windows ME reliable than for anyone to make / over WiFi reliable. I do not think any developers working on these things seriously think that this is something worth attempting. If you are interested in enabling it, I can see why you might think you need some overengineered solution to enable it, because the existing userspace daemons are such horrible hacks that they make it look necessary. However, the answer here is to redesign them to better fulfill their purpose rather than hack around it in the kernel until it seems like it works. Hacking around design issues is the wrong way of doing anything.

    I am not sure what you mean by "Dracut is moving into an event driven model" as events generally do not make sense unless you already have a full operating system started. Mind elaborating on that? You are talking to one of Gentoo's genkernel initramfs generator maintainers and a professional kernel storage stack developer, so I will likely explain why you are wrong after you do.
    Last edited by ryao; 01 November 2015, 02:49 PM.



    • #42
      Originally posted by ryao View Post

I have seen the dracut source code. dracut should not start systemd's init until it pivots into the actual rootfs.
Remember that Dracut is init-agnostic, so not all Dracut implementations are using systemd, but AFAIK all the systemd distros I have seen have been using systemd already in initramfs. So look again: systemd starts as a real init in initramfs. You can also insert breakpoints before rootfs is mounted so boot stops while in initramfs; you will have journald and systemd support, so you can use "systemctl status <service>" etc.
      Dracut with systemd is actually a kind of mini-Linux distro more than a traditional simple initrd.

      Here is the original blog post announcing systemd in Dracut initramfs in 2012:

      There are countless links showing how systemd is integrated in Dracut initramfs, including a source repo on github.

      Here is the pre-rootfs initramfs boot sequence for Dracut with systemd:




      Originally posted by ryao View Post
      As for "disassembling complex raid arrays", this is just overhead that slows down the shutdown process.
No it isn't. This is a real problem that has caused trouble for many years, perhaps not for small home servers, but certainly in the enterprise.
      Look here for the Dracut/systemd shutdown procedure:




      Originally posted by ryao View Post
      The term "event driven" is perhaps the most misunderstood buzzword ever. An event implies that something else does something. During the boot process, the only things that do something are perhaps device initialization, which is why we have udev for events. So far, no one has proposed a sane use case where anything else is necessary.
There are actually a lot of events that may require different actions from different services during boot: there are many different storage options these days, like iSCSI, Fibre Channel, LVM, hardware or software raid, or even NFS etc. Root may reside on many different storage systems and only be available by using NFS/tftp/WiFi. It is really nice if the initramfs is event driven so needed services are started, and so that they are only started if needed, instead of just enabling everything or manually enabling such services by re-configuring initramfs specifically for LVM etc. Initramfs may also need to react to different network events like DHCP, and to whatever hardware coming and going is detected by udev etc.

      This LKML post from Harald Hoyer (working on Dracut) explains why kdbus looks like a very attractive idea for initramfs implementations:



      One can argue whether kdbus is good or not, but the goal of having a good standard IPC available from early boot to late shutdown is hardly controversial.
Same with the idea of reusing a standard IPC like dbus that already has wide support and backing. That way the entire userland doesn't need to change a line of code, and userland will stay compatible with BSD/Unix/OSX/<many other OSes with dbus support> instead of having a NIH Linux-and-systemd-only IPC.

      Sorry for not answering everything directly, but this is getting long.



      • #43
        Originally posted by ryao View Post
        If you are polling a socket, you are doing something wrong. You should be using libev/libevent, which hooks into epoll and will hook into its replacement should someone come up with something better than epoll, which is possible.


        This has already been done multiple times. You can make an anonymous file, mmap it, send its file descriptor over a pipe and mmap it there. If you fork, you can have the child have the same memory. You can do shared memory segments via System V IPC. In addition, you can map buffers into userspace, which V4L and netmap both do. All of these things avoid copying. Here is some documentation on System V IPC:





        If the dbus userland code is polling, it is doing something wrong and needs to be fixed. That is independent of being inside the kernel. As for being faster, no one has demonstrated that dbus is a bottleneck in any real world context or that using it is a good idea in the contrived scenarios where it is a problem.



        You can likely implement it using CUSE. That would eliminate the fat on the kernel side entirely while making it compatible with *many* kernels and cross platform.



        Any IPC mechanism that does #1 is brain dead. As far as I know, only the legacy select()/poll() calls do anything like this and when it is done, it is done for the purpose of improving latencies (like having mutexes be spinlocks under certain conditions). If a programmer were to do it via the POSIX API explicitly, he is doing something wrong.

        Any software that is copying instead of doing zero copy for #2 when large amounts of data need to be passed will need to be rewritten anyway to use something sane for sending large amounts of data and there is nothing stopping it from being rewritten to use one of the existing options.

        Anyone using dbus that needs things to go faster for #3 (which is probably more along the lines of #2) can just use existing IPC mechanisms. The only benefit of using dbus is to minimize reliance on POSIX to be portable with Windows and other esoteric system designs, but the entire idea of kdbus is predicated on throwing that away, so there is no need to worry about being compatible with non-POSIX platforms.
1.) Yes, you are right, though it is still epoll and family (I wasn't giving classes on the kernel API or on which syscall is better to use; I just generalised the concepts for easier understanding. So by "polling" I mean any means of being alerted when something changes in X.)

2.) Yes, we have all used SHM, but it is not even close to solving the issues I named in that item. First, remember we are talking at the kernel level here (maybe I wasn't explicit enough about this; English gives me some trouble from time to time). Second, SHM is anything but fast (there is no guarantee you will always be in RAM), it is very complicated and failure-prone (especially if you need one per system in the kernel), it is neither thread-safe nor context-safe, it is prone to memory leaks, it is not exactly easy to debug, and worst of all there is no easy way to make sure it will always be read-only. But memfds solved this already, so it is moot now.

3.) dbus is userspace, so it is not using those directly as far as I remember. Here I was talking about a theoretical in-kernel plumber implementation (see 1 for what I meant by "polling"), not dbus.

4.) Again, you cannot do real-time event IPC outside the kernel for the kernel (so not for Linux; Mac, iOS and probably Hurd+Mach are a different story), and SysV "IPC" is not event-driven either in the kernel or from userspace. That is why you have to keep checking whether there is new data (regardless of the method), in the same sense that TCP/UDP aren't event-driven.

5.) This is partially true. Sure, SysV provides enough infrastructure for zero copy (it is ugly, though), but the main problem is that you cannot guarantee the state of the file descriptor once it can be read outside its main process. So you need validation infrastructure and a way to deal with context and thread issues to avoid kernel-side memory leaks or security leaks (the file descriptor stays open because one kernel thread/context is waiting but is detached from its parent process; messy, messy stuff) and other issues. But this got fixed by memfds, so it is moot. (As a side note, dbus and others do this, but all require at least 6 to 10 context switches due to the validation layer, where kdbus + memfds can do it in 2 to 4, since you don't need the validation or verification layers.)

6.) I explained that dbus was chosen simply for its popularity in userspace, nothing more eccentric than that. The advantage of dbus is its ease of use for developers and its use of proper modern IPC features (Lennart explains this quite well on YouTube; it is extensive, so watch it). But as I agreed in my previous post, I do believe there is certain fat that should be removed.

As a side note, to make myself clearer: we are talking here about kernel-kernel and kernel-userspace IPC, not userspace-only IPC (of those we have a lot). For example:

The user presses the hibernation button -> the GTK button connects its signal to the broadcast event socket -> KDBUS emits the broadcast signal "Hibernation" -> applications and the kernel receive the signal -> kernel drivers start powering off the hardware and saving state -> the raid driver and X application -> contact KDBUS -> emit the broadcast signal "not ready" + string data -> GNOME puts "string data" on screen -> the kernel stops hibernation until the ready signal is emitted -> the X app finalises its write and emits "ready" -> the kernel waits for another pending ready -> the raid driver ends its mirroring operation and emits "ready" -> no more pending -> continue hibernation.



        • #44
          Originally posted by jrch2k8 View Post
1.) Yes, you are right, though it is still epoll and family (I wasn't giving classes on the kernel API or on which syscall is better to use; I just generalised the concepts for easier understanding. So by "polling" I mean any means of being alerted when something changes in X.)
          What you mean is any sort of blocking communication and whatever you use is going to block on something whether it does a poll or waits for a wakeup. kdbus does not change this.

          Originally posted by jrch2k8 View Post
2.) Yes, we have all used SHM, but it is not even close to solving the issues I named in that item. First, remember we are talking at the kernel level here (maybe I wasn't explicit enough about this; English gives me some trouble from time to time). Second, SHM is anything but fast (there is no guarantee you will always be in RAM), it is very complicated and failure-prone (especially if you need one per system in the kernel), it is neither thread-safe nor context-safe, it is prone to memory leaks, it is not exactly easy to debug, and worst of all there is no easy way to make sure it will always be read-only. But memfds solved this already, so it is moot now.
          It sounds like what you really want are memfds, which you have now.

          Originally posted by jrch2k8 View Post
3.) dbus is userspace, so it is not using those directly as far as I remember. Here I was talking about a theoretical in-kernel plumber implementation (see 1 for what I meant by "polling"), not dbus.
          It could be extended to use memfds.

          Originally posted by jrch2k8 View Post
4.) Again, you cannot do real-time event IPC outside the kernel for the kernel (so not for Linux; Mac, iOS and probably Hurd+Mach are a different story), and SysV "IPC" is not event-driven either in the kernel or from userspace. That is why you have to keep checking whether there is new data (regardless of the method), in the same sense that TCP/UDP aren't event-driven.
Get an out-of-band notification. If you want whatever is running in userspace to be preempted, use signals. Signals are likely what you mean when you say "events".

          Originally posted by jrch2k8 View Post
5.) This is partially true. Sure, SysV provides enough infrastructure for zero copy (it is ugly, though), but the main problem is that you cannot guarantee the state of the file descriptor once it can be read outside its main process. So you need validation infrastructure and a way to deal with context and thread issues to avoid kernel-side memory leaks or security leaks (the file descriptor stays open because one kernel thread/context is waiting but is detached from its parent process; messy, messy stuff) and other issues. But this got fixed by memfds, so it is moot. (As a side note, dbus and others do this, but all require at least 6 to 10 context switches due to the validation layer, where kdbus + memfds can do it in 2 to 4, since you don't need the validation or verification layers.)
          You can pass memfds via sendmsg()/recvmsg() without kdbus. You still need a syscall for each recipient, but the overhead needs to be substantial enough that eliminating it is a win. Otherwise, we end up stuffing everything into the kernel address space and becoming like MSDOS, which would be a disaster.

          Originally posted by jrch2k8 View Post
6.) I explained that dbus was chosen simply for its popularity in userspace, nothing more eccentric than that. The advantage of dbus is its ease of use for developers and its use of proper modern IPC features (Lennart explains this quite well on YouTube; it is extensive, so watch it). But as I agreed in my previous post, I do believe there is certain fat that should be removed.
          If the kernel does not provide the mechanisms (e.g. via CUSE) to implement something generic and useful, that is a generic problem to be solved.

          Originally posted by jrch2k8 View Post
As a side note, to make myself clearer: we are talking here about kernel-kernel and kernel-userspace IPC, not userspace-only IPC (of those we have a lot). For example:

The user presses the hibernation button -> the GTK button connects its signal to the broadcast event socket -> KDBUS emits the broadcast signal "Hibernation" -> applications and the kernel receive the signal -> kernel drivers start powering off the hardware and saving state -> the raid driver and X application -> contact KDBUS -> emit the broadcast signal "not ready" + string data -> GNOME puts "string data" on screen -> the kernel stops hibernation until the ready signal is emitted -> the X app finalises its write and emits "ready" -> the kernel waits for another pending ready -> the raid driver ends its mirroring operation and emits "ready" -> no more pending -> continue hibernation.
          IPC means "interprocess communication" and the kernel is a single process, so IPC between different kernel processes makes no sense unless you are talking about a microkernel architecture where almost everything traditionally inside the kernel is userspace anyway. Linux is not a microkernel, so this discussion of intrakernel IPC is moot.

          As for hibernation, the kernel has platform_suspend_ops to deal with storage subsystems and given that userspace signals it to take over the shutdown process, I do not see what kdbus gives us here that we do not already have. If platform_suspend_ops is limited, it is an internal kernel interface that can be changed.
          Last edited by ryao; 03 November 2015, 10:44 AM.



          • #45
            Originally posted by interested View Post
Remember that Dracut is init-agnostic, so not all Dracut implementations are using systemd, but AFAIK all the systemd distros I have seen have been using systemd already in initramfs. So look again: systemd starts as a real init in initramfs. You can also insert breakpoints before rootfs is mounted so boot stops while in initramfs; you will have journald and systemd support, so you can use "systemctl status <service>" etc.
            Dracut with systemd is actually a kind of mini-Linux distro more than a traditional simple initrd.
Breakpoints are nothing new, and initramfs generators have always generated images containing a kind of mini Linux distribution.

            Originally posted by interested View Post
            Here is the original blog post announcing systemd in Dracut initramfs in 2012:
            https://plus.google.com/104232583922...ts/ZCoGDXNKAoQ
            There are countless links showing how systemd is integrated in Dracut initramfs, including a source repo on github.

            Here is the pre-rootfs initramfs boot sequence for Dracut with systemd:
            https://github.com/haraldh/dracut/bl...t.bootup.7.asc
            I had been unaware that Harald and Lennart had done that. That is an interesting design. Under it, systemd would have no trouble starting dbus for any services that hook into it.

            Originally posted by interested View Post
No it isn't. This is a real problem that has caused trouble for many years, perhaps not for small home servers, but certainly in the enterprise.
            Look here for the Dracut/systemd shutdown procedure:
            https://github.com/haraldh/dracut/bl...ut-on-shutdown
This does not state what the problems with shutdown are.

            Originally posted by interested View Post
There are actually a lot of events that may require different actions from different services during boot: there are many different storage options these days, like iSCSI, Fibre Channel, LVM, hardware or software raid, or even NFS etc. Root may reside on many different storage systems and only be available by using NFS/tftp/WiFi. It is really nice if the initramfs is event driven so needed services are started, and so that they are only started if needed, instead of just enabling everything or manually enabling such services by re-configuring initramfs specifically for LVM etc. Initramfs may also need to react to different network events like DHCP, and to whatever hardware coming and going is detected by udev etc.
            Certain network storage configurations might benefit from a closer coupling between the init system and the initramfs. You just need a protocol for the initramfs to start things and tell the init system the state so that it does not get confused. Just leave something in /dev or /run for init to find and change its behavior. kdbus is not needed.

            Originally posted by interested View Post
            This LKML post from Harald Hoyer (working on Dracut) explains why kdbus looks like a very attractive idea for initramfs implementations:

            https://lkml.org/lkml/2015/4/29/256
I am a developer of a competing initramfs implementation (genkernel) and I do not share Harald's sentiments.

            Originally posted by interested View Post
            One can argue whether kdbus is good or not, but the goal of having a good standard IPC available from early boot to late shutdown is hardly controversial.
Same with the idea of reusing a standard IPC like dbus that already has wide support and backing. That way the entire userland doesn't need to change a line of code, and userland will stay compatible with BSD/Unix/OSX/<many other OSes with dbus support> instead of having a NIH Linux-and-systemd-only IPC.

            Sorry for not answering everything directly, but this is getting long.
            Last edited by ryao; 03 November 2015, 10:59 AM.



            • #46
              Originally posted by ryao View Post
              IPC means "interprocess communication" and the kernel is a single process, so IPC between different kernel processes makes no sense unless you are talking about a microkernel architecture where almost everything traditionally inside the kernel is userspace anyway. Linux is not a microkernel, so this discussion of intrakernel IPC is moot.

              As for hibernation, the kernel has platform_suspend_ops to deal with storage subsystems and given that userspace signals it to take over the shutdown process, I do not see what kdbus gives us here that we do not already have. If platform_suspend_ops is limited, it is an internal kernel interface that can be changed.
I agree with all your points, so let me rephrase my term here: let's not call it "IPC"; let me call it "Intra-Kernel Structured Signal Handler with userspace IPC extensibility".

The suspend case was just an example, but as with many other interfaces in the kernel, these interfaces are not flexible, not all drivers implement or even use them to their full extent, and in many cases they cannot react to userspace due to race conditions. A very common example would be: "&%&%&%&, I pressed shutdown in gnome-3.18 instead of system settings while I was copying a heavy file to my mirrored ZFS volume (it happened with XFS and EXT4 too; sadly this failed hard on me), which uses an LSI RAID controller (with RAM). So GNOME tried to warn me, but Xorg decided to go away and killed my copy in the process while the shutdown continued, and broke the mirroring process, because LSI controllers don't give a shit (I heard Adaptec are way, way better, but well). Luckily I'm using ZFS, so no problem, and it is a workstation, not a server." <-- Ironically, firefox/epiphany don't give a shit, but chromium sometimes properly stops the process, while Bomi player simply kills itself but halts the process, while VLC doesn't give a shit, but LibreOffice is random at it (it seems to have a mood), etc., etc., etc. Also worth noticing: it is not too uncommon to have a halted shutdown because one module died before it should and another parent module panicked the kernel (I shut down my WS and PC at night due to high costs, so by default I always verify). It is probably my hardware combination, but my point is that a bit smarter communication system could make this go away.

Sure, the problem is probably partly GNOME; maybe Xorg is guilty; maybe the LSI controller has a bug; etc. But I strongly believe a simple "structured" data-passing system (I like the signals idea) in the kernel, one that can be extended to support userspace, could make these static kernel interfaces obsolete, providing a more familiar API to developers that will allow better interaction between contexts for more dynamic decision making and coordination. I do believe kdbus is capable of fitting this role (after some fat chopping), but if another solution comes up that fixes or simplifies the kdbus approach, that is fine with me too.



              • #47
                Originally posted by jrch2k8 View Post
I agree with all your points, so let me rephrase my term here: let's not call it "IPC"; let me call it "Intra-Kernel Structured Signal Handler with userspace IPC extensibility".

The suspend case was just an example, but as with many other interfaces in the kernel, these interfaces are not flexible, not all drivers implement or even use them to their full extent, and in many cases they cannot react to userspace due to race conditions. A very common example would be: "&%&%&%&, I pressed shutdown in gnome-3.18 instead of system settings while I was copying a heavy file to my mirrored ZFS volume (it happened with XFS and EXT4 too; sadly this failed hard on me), which uses an LSI RAID controller (with RAM). So GNOME tried to warn me, but Xorg decided to go away and killed my copy in the process while the shutdown continued, and broke the mirroring process, because LSI controllers don't give a shit (I heard Adaptec are way, way better, but well). Luckily I'm using ZFS, so no problem, and it is a workstation, not a server."
                There are multiple things wrong with this description.

                1. ZFS is a replacement for hardware RAID and is superior to it, so it is not clear to me why you are using hardware RAID with it:

                http://open-zfs.org/wiki/Hardware#Ha...ID_controllers

2. Why would the controller care about the system shutting down when it is mirroring things?

                3. How did GNOME try to warn you and Xorg shutdown anyway?

4. An Xorg crash will kill your copy processes that are started inside of X unless you run them in a way that does not cause termination when Xorg dies. Changing how shutdown requests are handled, even in the case of fixing broken handling of unintentional shutdown clicks, is not going to fix this. You should be using something like screen or tmux for your copies. If you like GUIs, you will need to talk to the GUI developers about developing code that survives Xorg being killed.

                5. If you want to send something to userspace, you can do it already via a pipe. If you want to send something in the kernel, you can do it by using kernel primitives. If you want to send something to another kernel thread via a userspace mechanism, Linus Torvalds is likely to go ballistic because you are essentially in "database as IPC"-esque territory:



                If you are interested in a transport independent means of sending data, you can use Sun RPC within a transport:

                https://en.wikipedia.org/wiki/Open_N...Procedure_Call

                ZFSOnLinux and NFS both use it. In the case of ZFSOnLinux, it sends XDR representations between the kernel and userspace via either pipes or ioctls, depending on the data being sent. It is a very convenient way of message passing between userspace and the kernel.

                Originally posted by jrch2k8 View Post
<-- Ironically, firefox/epiphany don't give a shit, but chromium sometimes properly stops the process, while Bomi player simply kills itself but halts the process, while VLC doesn't give a shit, but LibreOffice is random at it (it seems to have a mood), etc., etc., etc.
                This sounds like broken userspace software. No amount of hacking around it will make it work fine without actually fixing it. Talk to the developers of the software where you have issues.

                Originally posted by jrch2k8 View Post
                Also worth noticing: it is not too uncommon to have a halted shutdown because one module died before it should and another parent module panics the kernel (I shut down my WS and PC at night due to high costs, so by default I always verify). It is probably my hardware combination, but my point is that a slightly smarter communication system could make this go away.
                1. The shutdown process does not unload modules and it is not possible to unload a module that is in use by another (at least not without a bug). If you have a panic at shutdown, there is something seriously wrong and that needs to be fixed.

                2. You are hitting a bug. Trying to rearchitect the system shutdown process without understanding what is wrong is unlikely to solve it and is more likely to introduce more bugs. The proper way to fix the bug is to understand what is going wrong and patch it so it stops happening.

                Originally posted by jrch2k8 View Post
                sure, the problem is probably partly GNOME, maybe Xorg is guilty, maybe the LSI controller has a bug, etc., but I strongly believe a simple "structured" data-passing system (I like the signals idea) in the kernel that can be extended to support user space could make these static kernel interfaces obsolete, providing a more familiar API to developers that would allow better interaction between contexts for more dynamic decision making and coordination. I do believe kdbus is capable of fitting this role (after some fat chopping), but if another solution comes up that fixes or simplifies the kdbus approach, that is fine with me too
                Your issue is more likely going to be fixed by accident than it would be by the latest idea in "better interaction between contexts for a more dynamic decision making and coordination".

                Making architectural changes without a clearly defined way in which they make an existing thing better is more likely to cause new issues than to solve existing ones, especially when the causes of the existing issues are not understood.
                Last edited by ryao; 03 November 2015, 01:48 PM.



                • #48
                  Originally posted by ryao View Post

                  There are multiple things wrong with this description.

                  1.) True, but long story short: these were UFS volumes, and for time reasons it was easier to handle them this way. It's not bad, just not recommended, and since it's not a server I can live with it.
                  2.) Some controllers are smarter than others. Some of the dumber but cheaper ones (DELL, I'm looking at you, cheap bastards) don't handle shutdowns well while mirroring massive data and by default come back up degraded, which is very annoying, but as you say, ZFS knows a lot better. In other cases I've seen RAID rebuilds trash XFS and ext4 filesystems too (I'm guessing a driver bug in some kernel versions; it's been a long time since then). Of course this is not too common, because you need massive random writes on the order of several gigabytes.
                  3.) Pretty fast. I saw a glimpse of a popup saying "swoooosh ... shutdown ... ok button" and then tty1 (yeah, that's not very technical, but I really didn't give it much attention).
                  4.) I know, but I shouldn't have to. Mac has its issues, but at this it is really good.
                  5.) Again, I know SysV IPC, have been using it for 10+ years, and for that same reason I don't like it (well, I like it a bit more since memfds and cgroups; those fix lots of the annoying things), and I kind of hate Sun RPC too (I used to love it, though).

                  1.) I agree.
                  2.) I agree.

                  Rationale:

                  We agree on the roots of the problem and on the problems themselves, but the point I'm trying to make is that exactly because these good low-level tools exist (like SysV IPC, Sun RPC and other low-level constructs), this has become a mess in both user space and the kernel, and I agree I may not be right in proposing the current kdbus as a possible solution.

                  Adding a little to that: I'm not saying these tools aren't good or don't work. They are just too low-level (as constructs should be, no problem there), but there is no standard or simple way for developers to use them, and I strongly believe an easier and more standard API would make many of these problems go away, simply because more developers would be willing to use them properly instead of reinventing the wheel every time (I like the K/DBUS nomenclature for this, that's all).

                  For example, let's take a generic example of what an application written in C/C++ has to do now to react to, query, and submit events to the hardware and other processes (a very rough approximation, nothing too API-specific), from the perspective of an average user-space developer (not me, I do this a lot):

                  1.) Read tons of man pages and kernel documentation about every common subsystem for the hardware you want to handle (at least /proc, /sys, /dev, glibc, and in-tree kernel documentation; learn some syscall tricks too, and if you need to pass fds, remember to verify sendmsg 100 times, since it can be tricky)
                  2.) Do 1. again, but for the specific device(s) you have in mind, for custom extra info outside the subsystem defaults
                  3.) Abstract some classes to hell to handle the queries, and test on at least 2 kernel versions
                  4.) Optionally prepare some external executables/pipes/toolkit I/O devices for passing data to the kernel; again, test on at least 2 kernel versions and be careful about paths across distros
                  5.) Enjoy, and hope nobody will ever try to exchange related hardware data with your app, because it won't be easy, I guarantee it
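To illustrate steps 1-4 above, this is roughly what the ad-hoc approach looks like: hand-parsing a pseudo-file, with the path and field layout hard-coded from documentation (a Python sketch; /proc/uptime is just a stand-in for whichever attribute you actually need, and may not exist in every environment):

```python
import os

def read_uptime_seconds(path="/proc/uptime"):
    """Hand-parse one procfs attribute; the path and the field layout are
    knowledge the developer had to dig out of the kernel docs."""
    if not os.path.exists(path):   # not guaranteed on every kernel/config
        return None
    with open(path) as f:
        # format: "<uptime_seconds> <idle_seconds>"
        return float(f.read().split()[0])

up = read_uptime_seconds()
print(up is None or up > 0)        # -> True
```

Every attribute needs its own path, its own parser, and its own fallback when the kernel doesn't expose it, which is exactly the per-device busywork the list above complains about.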

                  How it could be with something similar to kdbus (it doesn't have to be KDBUS as it stands; this is just to express my point):

                  1.) Go to org.kernel.system.subsystem and verify whether your hardware exposes the info you need
                  2.) Check the bus to verify it's what you need
                  2.1.) Use org.kernel.system.common for regularly used functions and for receiving hardware events (for example, switch to a simpler algorithm version when the on-battery event is triggered, or send a shutdown-block event while the algorithm is running, or inform the user that whatever/the FS is broadcasting a write failure and offer to open the logs for action, etc.)
                  3.) Query the subsystem bus to check whether additional info is present from another related source (say, a recently loaded related kernel module)
                  4.) Make a nice abstract class to handle parameters and start to read/write/broadcast
                  5.) Enjoy and don't fear compatibility; everyone does this anyway <-- all this with strongly defined types
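None of these bus names exist; they come from the wish list above. A toy in-process mock of the imagined workflow (everything here, org.kernel.system.* included, is hypothetical):

```python
# A toy in-process stand-in for the hypothetical kernel bus: well-known
# names map to typed attributes, and consumers subscribe to events.
class FakeKernelBus:
    def __init__(self):
        self._attrs = {}    # e.g. "org.kernel.system.power.on_battery" -> bool
        self._subs = {}     # event name -> list of callbacks

    def publish(self, name, value):
        self._attrs[name] = value
        for cb in self._subs.get(name, []):
            cb(value)

    def query(self, name):
        return self._attrs.get(name)

    def subscribe(self, name, callback):
        self._subs.setdefault(name, []).append(callback)

bus = FakeKernelBus()
events = []
# Step 2.1: react to a hardware event with a typed payload.
bus.subscribe("org.kernel.system.power.on_battery",
              lambda v: events.append("simpler-algorithm" if v else "full"))
bus.publish("org.kernel.system.power.on_battery", True)
# Step 1/3: query an attribute another producer exposed.
print(bus.query("org.kernel.system.power.on_battery"))  # -> True
print(events)                                            # -> ['simpler-algorithm']
```

The point of the mock is only the shape of the API: one namespace, one query call, one subscription call, instead of a different parser per pseudo-file.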

                  I know there are other ways and some tools that can be used for this, but in my experience they are rarely less horrible and come with their own set of issues, while option 2 is abstracted and standard enough for any developer/desktop/toolkit/etc. to use and react to smartly. With the current way, most developers probably quit, and the few that try end up with completely different ways of doing it that probably won't work with each other.

                  I do believe, however, that the current kdbus can be extended to do this, since Mach on OS X works with a monolithic kernel (horrible API, though). Maybe, as you correctly imply, not as an IPC, but by altering part of the low-level kernel constructs to act as a central dispatch for a defined set of options (maybe even re-routing and exposing things already in sysfs/proc) and leaving the IPC part to user space.



                  • #49
                    Originally posted by ryao View Post

                    breakpoints are nothing new and initramfs generators have always generated images containing a kind of mini-Linux distribution.
                    I didn't use breakpoints to show anything new in that regard, but to show that you actually have full systemd support in the initramfs from before rootfs is mounted, including logging. A cool thing about the dracut-systemd initramfs is that it uses the exact same tools to boot the initramfs as the full system: "systemd", "udev" and "journald" are simply copies of the versions used on the full distro. That means the tools are much better tested (a long-standing problem with non-standard tools in rescue and initramfs environments is that few people test them).



                    Originally posted by ryao View Post
                    This does not state the problems for shutdown.
                    Well, it at least shows how the initramfs is used for shutdown, with the stated reason that "This ensures, that all devices are disassembled and unmounted cleanly".

                    I can't find the Red Hat engineer's post about this problem, but it is a chicken-and-egg problem somewhat similar to using RAID on rootfs, just in reverse. In the old days you used /boot to assemble software RAID; later, initrd/initramfs. The problem is that when a userspace daemon is somehow involved in disassembling the RAID array, how does the system kill all user-space processes to unmount the root filesystem without killing the RAID daemon too? And user-space daemons are hard to avoid, since the kernel doesn't contain such RAID logic.



                    Originally posted by ryao View Post
                    Certain network storage configurations might benefit from a closer coupling between the init system and the initramfs. You just need a protocol for the initramfs to start things and tell the init system the state so that it does not get confused. Just leave something in /dev or /run for init to find and change its behavior. kdbus is not needed.
                    Such a protocol sounds very much like D-Bus/kdbus. And D-Bus has the benefit of being widely supported across many languages, toolkits, daemons, etc. Using non-standard, NIH, Linux-only and initramfs-only solutions for such a protocol doesn't sound good to me when existing, mature, cross-platform solutions already exist.


                    Originally posted by ryao View Post
                    I am a developer of a competing initramfs implementation (genkernel) and I do not share Harold's sentiments.
                    Well, many people do, which is why they want to move the systemd initramfs in that direction.

                    Personally, I can't see any reasonable objection to using the same standard IPC and the same init and init tools both in the initramfs and on the full system, and to using the same IPC from early boot to late shutdown.



                    • #50
                      Originally posted by interested View Post

                      They can do whatever they want, but we should not be putting bad code into the kernel to hack around design flaws in userspace.

                      As for the quote, I am not convinced that it is sufficient to avoid problems when the design spec says that all processes are killed by systemd before going into the initramfs environment. As pointed out in the other thread, this breaks the rootfs mount when it is on iSCSI or nbd (not that any distribution does that). Just saying that it prevents problems does not make that the case.

                      Also, have you ever seen the system go into the initramfs at shutdown? This is generally not necessary, which is why it does not happen in most of Red Hat's installs. If mkinitcpio is doing this unconditionally, it is needlessly slowing down shutdown.

                      Lastly, that "protocol" is nothing like kdbus. kdbus is a D-Bus replacement for sending messages to things that are running, not for leaving breadcrumbs for things that run after the sender has stopped running. There is a thing known as a filesystem meant for doing that.
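The breadcrumb approach really can be that simple. A Python sketch (the file name and JSON layout are illustrative assumptions, and a temporary directory stands in for /run):

```python
import json
import os
import tempfile

# Stage 1 (say, the initramfs) drops its state on a filesystem and exits...
rundir = tempfile.mkdtemp()        # stand-in for /run
state_file = os.path.join(rundir, "initramfs-state.json")
with open(state_file, "w") as f:
    json.dump({"root_assembled": True, "net_root": "iscsi"}, f)

# ...and stage 2 (init proper) picks it up later; no live bus is needed,
# because the sender does not have to be running when the reader looks.
with open(state_file) as f:
    state = json.load(f)
print(state["root_assembled"])     # -> True
```

This is the key difference from a bus: a file persists after its writer exits, whereas a bus message needs both endpoints alive at the same time.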
                      Last edited by ryao; 03 November 2015, 07:25 PM.

