systemd 250 Is Coming For Christmas With A Boat Load Of New Features


  • #51
    Originally posted by aht0 View Post

    Could have avoided that by using the daemontools suite to augment/expand the functionality of your sysv init.. It's been around for decades. It's not proof that sysv was shit, it's proof that people just couldn't be assed to research the problem and use the alternative tools available.

    Perhaps daemontools has a timeout on the whole shutdown process; it's been far too long since I used it to remember. The point is that sysvinit does not (hence proof that it's shit for this particular use case), and a single shell script can hang the entire sysvinit process indefinitely on both boot and shutdown.



    • #52
      Originally posted by arQon View Post

      Thanks.

      I'm not worried about MY services: it's the crap Ubuntu puts on the box, which I assume is mostly just copypasted from RH to Debian to Ubuntu like the human centipede, but regardless of origin is absolutely braindead. A 90s timeout isn't magically going to work just because you THEN decided to try a 5min timeout instead. ffs...

      I've seen this stupidity so often I'm surprised it's not just a bad default in systemd itself, but nope: apparently "opt in to the worst possible choice for this" is a policy for one of those distros. Go figure... :/

      Well, I don't know which daemon we are talking about here. But a daemon will typically do several things at start, where some of the initial ones can depend on external factors (say, a network connection). So the daemon writer decides that if this part of the process takes more than 90s, something so bad is happening that the daemon should be seen as failed, and he puts 90s as the startup timeout in the unit file.

      Once the daemon is past this initial part, though, there might be other initialization being done that the daemon writer has more control over, so here he uses the sd_notify() interface to inform init that everything is working as intended even if it now takes more than 90s. Say it now sees that it has to load a 500TB database file into RAM before setting the state to running: it tells init that we can take 5 minutes more before shit hits the fan, and if those 5 minutes are nearly up and the database is still loading, it sends "5 minutes more" again to signal that work is being done and the daemon is not hung.

      This is a good thing and not braindead. Braindead is just assuming that a daemon will always take x seconds to load or it has failed, without taking into consideration WHY it takes a longer time in this particular instance. Don't know about you, but I'm only interested in having my services marked as failed if they really have failed, and not just because they happened to hit some randomly determined global timeout value.
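      For reference, the notify handshake described above can be sketched with nothing but the socket protocol: the daemon sends datagrams of "KEY=VALUE" lines to the Unix socket named in $NOTIFY_SOCKET. This is a minimal illustrative sketch, not the libsystemd implementation, and it assumes the service runs with Type=notify:

```python
import os
import socket

def sd_notify(state: str) -> bool:
    """Send a notify state string to systemd, if a notify socket is set.

    Returns False (silent no-op) when not running under a Type=notify unit.
    """
    addr = os.environ.get("NOTIFY_SOCKET")
    if not addr:
        return False
    # A leading '@' means an abstract-namespace socket.
    if addr.startswith("@"):
        addr = "\0" + addr[1:]
    with socket.socket(socket.AF_UNIX, socket.SOCK_DGRAM) as sock:
        sock.sendto(state.encode(), addr)
    return True

def extend_timeout(seconds: int) -> str:
    # EXTEND_TIMEOUT_USEC is expressed in microseconds.
    return f"EXTEND_TIMEOUT_USEC={seconds * 1_000_000}"

# During a long initialization phase the daemon would periodically send
#     sd_notify(extend_timeout(300))   # ask for 5 more minutes
# and, once initialization finishes:
#     sd_notify("READY=1")
```

      EXTEND_TIMEOUT_USEC only has an effect while the unit is starting, stopping, or reloading, which is exactly the window described above.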



      • #53
        Originally posted by F.Ultra View Post
        Don't know about you, but I'm only interested in having my services marked as failed if they really have failed, and not just because they happened to hit some randomly determined global timeout value.

        Agreed, but you and I are clearly in the minority on that, because that's exactly what ISN'T happening here: some clown has basically *every* built-in service set to use a 90s timeout, even for stuff that should be coming up (or shutting down, which is where I see this happening most often) in well under 10s, and then changing that timer to an additional 5min for no reason - and for all I know, then going to 30min, and so on. But in NONE of these cases has even that first 90s timeout ever been a "genuine" window.

        Again, these are ALL "system" services that, without exception, shouldn't even need a 10s timeout, let alone 5+ minutes. If they fail that first timeout, something is BROKEN, and no amount of additional waiting is going to cause them to magically recover. That's the piece I'm objecting to.

        As you say, needing to set a custom timeout for some specialized case is valid - but that's not what I'm talking about. A default timeout that is already nearly an order of magnitude too long, and that then expands itself to being two orders too long, absolutely IS braindead. It tells you that whoever configured the service doesn't know what they're doing, and just copypasted an example from the systemd docs or Stack Overflow or whatever, because they either didn't understand it at all or couldn't be bothered to set it correctly. Every other service then got the same braindead timeouts copied to it, and so now they ALL have timeouts that are at best inappropriate, and at worst just hopelessly wrong.

        Which is exactly the "randomly determined global timeout value" you called out. So, despite what you wrote, I think we're actually in agreement on this.



        • #54
          Originally posted by CochainComplex View Post
          Since everything tends to be woke, the Borg could be less toxic with their wording. G=1/R:
          "resistance is futile" could be rephrased in a more positive framing: "conductance is welcome".

          After getting zapped by static electricity just often enough, one will respond to "conductance is welcome" with a "thanks, but no thanks". ;-)



          • #55
            Originally posted by arQon View Post

            Agreed, but you and I are clearly in the minority on that, because that's exactly what ISN'T happening here: some clown has basically *every* built-in service set to use a 90s timeout even for stuff that should be coming up (or shutting down) in well under 10s. [...]

            Do you have an example of a daemon that does this, and on which distro? I would just like to see what it is that they have done; hopefully it's something that I can fix.



            • #56
              Originally posted by F.Ultra View Post
              Do you have an example of a daemon that does this and on which distro?

              Ubuntu 20.04 (MATE, specifically, though I expect it's inherited from the base Ubuntu), and, erm, whatever the Pi is running, which I think is still 21.04. The "user session" at shutdown is the one that trips most often, but there are at least two others that crop up from time to time. (I have a vague feeling unattended-upgrades was one of them, but I've disabled that on both those machines since then.)

              As you say, I'm sure the configs might be fixable - but it's not something I have the patience to keep doing every time the package updates, and I don't trust Ubuntu to handle that sensibly. I'll probably fix one of them the next time it fails, though, and see if it survives updates or not. The problem is intermittent, but in none of those cases is there any external resource further away than my own LAN, and even the NAS will always respond within a few seconds, so by the time things reach the second-stage timeout it's clearly because something is outright broken, and no amount of waiting will magically change that.



              • #57
                Originally posted by arQon View Post

                Ubuntu 20.04 (MATE, specifically, though I expect it's inherited from the base Ubuntu), and, erm, whatever the Pi is running, which I think is still 21.04. [...]

                If it's the "user session", then it's not a single service but the user session as a whole. So it looks more like the first timeout you see is for a daemon that takes a long time to shut down, and the second timeout you see is the total timeout for shutting down the entire user session.

                E.g. if I do "systemctl cat user@.service" I can see that "TimeoutStopSec=120s", so shutting down my user session has a timeout of 2 minutes, but individual services started by my user have their own timeouts as well. Probably this is what you see.



                • #58
                  Originally posted by F.Ultra View Post
                  E.g. if I do "systemctl cat user@.service" I can see that "TimeoutStopSec=120s", so shutting down my user session has a timeout of 2 minutes, but individual services started by my user have their own timeouts as well. Probably this is what you see.

                  Thanks. I'm sure you've got the correct understanding of things, but the point is, that timeout can say anything it wants, because that's not what happens. In reality, the timeout will simply extend itself indefinitely until I get tired of waiting for it and either CAD-hammer it into rebooting or power it off by force.

                  Again, there's nothing on that machine connected to any "distant" server, no huge amount of data to be flushed, or anything like that - so even at only 10s it would already be taking far longer than it "should". But reducing that timeout to 15s (which I'm sure I've done on at least one of those machines, if not both) just doesn't matter, because once the timer expires it just restarts with an even longer countdown, instead of actually honoring the timeout and shutting down the user session (and thus allowing the machine to finish its shutdown).

                  Unfortunately, since I can't get it to happen on demand, I haven't been able to dig into it - generally, if I'm shutting the machine down it's because I need it to actually shut down. systemd doesn't show what it's waiting FOR, other than that generic "user session" message, and on the few occasions I have been in a position to chase it down, most of the time I can't log in on a new VT to look at anything because the machine is trying (but failing) to shut down. So, yeah...

                  I see that /lib/systemd/system/user@.service.d/timeout.conf on this machine actually has it as *5* seconds, and the comment there isn't written in my style, so it has to be part of the Ubuntu stock configuration.

                  Although it's possible I could fix it by editing the configuration for every service to 10-15s rather than the default 2min, this insane behavior of "okay, the timeout's been exceeded, but fkit, let's just start a whole new larger timeout instead of actually moving on" suggests that would be a lot of work for no real gain.

                  I don't see much point in having a timeout on the user session if it's just going to be ignored and instead use whatever the largest timeout of any running service is, and/or especially wait for each of those failing timeouts in sequence - that's pretty drastically broken. But you've given me some leads to follow next time it happens, so thanks a lot for all the help.
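                  For reference, the kind of per-service override discussed above would normally go in a drop-in under /etc/, which takes precedence over the copy the package ships in /lib/ and survives package updates. The unit name and values here are purely illustrative:

```ini
# /etc/systemd/system/example.service.d/timeout.conf
# "example.service" is a placeholder unit name; after editing, run:
#     systemctl daemon-reload
[Service]
TimeoutStartSec=15s
TimeoutStopSec=10s
```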



                  • #59
                    Originally posted by arQon View Post

                    Thanks. I'm sure you've got the correct understanding of things, but the point is, that timeout can say anything it wants, because that's not what happens. In reality, the timeout will simply extend itself indefinitely until I get tired of waiting for it and either CAD-hammer it into rebooting or power it off by force. [...]

                    Well, the next time it happens, try "journalctl -b -1 -e" on the next boot to get the last page of the journal from the previous boot, and see if there is something in there that shows what was taking a long time.



                    • #60
                      Originally posted by arQon View Post

                      Technically true, but that's a very large "if".

                      Yes, indeed - "technically true" on technical matters just means "it is true".

                      Originally posted by arQon View Post

                      > If you need to ssh into a server just to look at why a service is failing, that is a failure of logging

                      orly? Because the logs are going to one or both of two places: local, in which case, yeah, duh, ssh is the only way you're going to see them if there's a problem; or remote, which, as you say, isn't going to be happening if whatever's failing is related to any network logging.

                      Yes, really - there are a number of networking issues that can allow network logging while preventing ssh.

                      Originally posted by arQon View Post
                      No, I'm replying to a comment that was *explicitly about* a server getting locked at boot. That being a Bad Thing is kind of the whole point.

                      Feel free to quote him on where he said this.

