**gens** · 18 November 2014, 04:35 PM

Originally posted by oleid

Systemd's watchdog has absolutely nothing to do with hardware watchdogs. The daemon has to update a timestamp in the main loop, which systemd checks. Thus, if the daemon is stuck in an infinite loop, the timestamp won't get updated.

ah, so the program must be hacked for that feature to work
instead of... you know... the program doing it itself
note that a program can itself handle SIGSEGV and almost all other signals, except for SIGKILL
(note on the note: firefox and enlightenment do it)
it can also reexec itself if it can't recover itself

and it can still fail to deliver the website (for example the hypothetical "wont accept connection" bug)

also why is it called "watchdog" if it has nothing to do with the watchdog mechanism
(hint: i don't care, it's stupid)

**oleid** · 18 November 2014, 05:11 PM

Originally posted by gens

ah, so the program must be hacked for that feature to work
instead of... you know... the program doing it itself

How would a hung (as in stuck) service restart itself? That's the idea of a service manager.

Originally posted by gens

note that a program can itself handle SIGSEGV and almost all other signals, except for SIGKILL
(note on the note: firefox and enlightenment do it)

Sure, I use signal handling in my own code.

You can handle SIGSEGV in the service, then it won't have to be restarted by the service manager.
If you don't want to do this on your own, the service manager can do it for you.

In your case, you'd probably want to send SIGTERM to the service if wget fails, and then provide a SIGTERM handler that restarts the service. But I guess adding one line of code (two, if you count the include directive for the header) is simpler than implementing SIGTERM and SIGSEGV handlers for *every* service you want to be self-restarting. That is especially true if you need to keep track of how often the service has already been restarted, since your SIGSEGV handler might get stuck in an infinite loop if the crash occurs directly after service start. That's exactly why such complexity should be outsourced.
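A rough sketch of the self-restart bookkeeping described above, in Python for brevity. The names `MAX_RESTARTS` and `SELF_RESTART_COUNT` are made up for this example, and it assumes the service was started as `python script.py`. The counter is carried across the re-exec in the environment so that a crash loop eventually stops.

```python
import os
import signal
import sys

MAX_RESTARTS = 3                    # hypothetical restart budget
COUNT_VAR = "SELF_RESTART_COUNT"    # hypothetical environment variable name


def restart_count(env=os.environ) -> int:
    """Number of restarts so far, as recorded in the environment."""
    try:
        return int(env.get(COUNT_VAR, "0"))
    except ValueError:
        return 0


def next_environment(env=os.environ):
    """Return the environment for the next incarnation, or None to give up."""
    count = restart_count(env)
    if count >= MAX_RESTARTS:
        return None
    new_env = dict(env)
    new_env[COUNT_VAR] = str(count + 1)
    return new_env


def on_sigterm(signum, frame) -> None:
    """Re-exec the running program unless the restart budget is used up."""
    new_env = next_environment()
    if new_env is None:
        sys.exit(1)
    os.execve(sys.executable, [sys.executable] + sys.argv, new_env)


def install_handler() -> None:
    """Register the self-restart handler for SIGTERM."""
    signal.signal(signal.SIGTERM, on_sigterm)
```

Even this small version leaves cases open: the counter never resets after a long healthy run, and the re-exec skips normal cleanup such as flushing buffers.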

Originally posted by gens

and it can still fail to deliver the website (for example the hypothetical "wont accept connection" bug)

As the bug is hypothetical, it's hard to speculate about what causes it. Whether the watchdog can detect the error probably depends on where the sd_notify call is inserted.

Originally posted by gens

also why is it called "watchdog" if it has nothing to do with the watchdog mechanism
(hint: i don't care, it's stupid)

It's a software watchdog.

Edit:
Oh, before you start asking: No, the service has no runtime dependency on systemd, only a compile-time dependency. And as it's fewer than a handful of lines of code, it can easily be #ifdef'ed to avoid even the compile-time dependency. Furthermore, sd_notify simply sends a signal via DBUS, so any service manager can listen to it.

**gens** · 18 November 2014, 05:41 PM

Originally posted by oleid

How would a hung (as in stuck) service restart itself? That's the idea of a service manager.

and the whole point of all this is to show that there are things a "service manager" just can not know are happening
but a special 4 line script (or a very small C program, whatever) can

if a "service manager" would do such things, it would need a special function for EVERY "service" ever written

on the other hand if the people writing these programs would think that it was a good idea they would just put it in their program
init system / service manager / whatever independent and simple (it would even work on windows)

**oleid** · 18 November 2014, 06:04 PM

Originally posted by gens

and the whole point of all this is to show that there are things a "service manager" just can not know are happening
but a special 4 line script (or a very small C program, whatever) can

No, you didn't get the point. sd_notify can, if correctly placed, inform the service manager whether the service is still alive and kicking. Two lines of extra code. Correctly placed means, e.g., in the main loop, which answers the socket connections and (e.g.) forks off the worker processes.
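For illustration, a minimal sketch of such a main-loop keep-alive, written in Python for brevity rather than with the C library. It assumes the protocol described in the sd_notify(3) man page: a short text datagram sent to the Unix socket named in the `NOTIFY_SOCKET` environment variable. `main_loop_iteration` is a hypothetical stand-in for the service's real loop body.

```python
import os
import socket


def notify(message: str) -> bool:
    """Send one notification datagram; return False if nobody is listening."""
    path = os.environ.get("NOTIFY_SOCKET")
    if not path:
        # Not started by a notify-aware manager: do nothing.
        return False
    if path.startswith("@"):
        # A leading "@" denotes a Linux abstract socket (leading NUL byte).
        path = "\0" + path[1:]
    with socket.socket(socket.AF_UNIX, socket.SOCK_DGRAM) as sock:
        sock.sendto(message.encode("ascii"), path)
    return True


def main_loop_iteration() -> None:
    """Hypothetical body of one main-loop pass: do the work, then ping."""
    # ... accept connections, hand them to workers ...
    notify("WATCHDOG=1")
```

When `NOTIFY_SOCKET` is unset the call does nothing, so the same program runs unchanged outside systemd. For the pings to be checked, the unit would also need `WatchdogSec=` set.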

Originally posted by gens

if a "service manager" would do such things, it would need a special function for EVERY "service" ever written

Not if using the generic sd_notify.

Originally posted by gens

on the other hand if the people writing these programs would think that it was a good idea they would just put it in their program
init system / service manager / whatever independent and simple (it would even work on windows)

Signal handlers won't work on Windows -- at least not the way you implement them on POSIX.

According to http://stackoverflow.com/questions/3...lt-under-linux the cleanest way to make your daemon self-restartable is to create a kind of hypervisor process of your own. But isn't that exactly what a generic service manager is for?

If you don't trust the software watchdog, nobody stops you from additionally calling your wget shell script from e.g. cron, simply sending SIGTERM to the service and letting systemd auto-restart it.

I guess you surely can construct a case which can't be detected by the software watchdog but can be detected by your wget script, but I doubt it's the kind of error you find in the wild. That's what service test suites are for, but that's way beyond a simple service manager.

**gens** · 18 November 2014, 06:28 PM

Originally posted by oleid

No, you didn't get the point. sd_notify can, if correctly placed, inform the service manager whether the service is still alive and kicking. Two lines of extra code. Correctly placed means, e.g., in the main loop, which answers the socket connections and (e.g.) forks off the worker processes.

...

I guess you surely can construct a case which can't be detected by the software watchdog but can be detected by your wget script, but I doubt it's the kind of error you find in the wild. That's what service test suites are for, but that's way beyond a simple service manager.

...
......
idk what to say
a modern server program does most of the work in worker threads, meaning that the main loop would work fine
it can do work but send a 500 HTTP message
it can work absolutely normally but due to some router or caching node not working properly not provide the service that would be expected from a server
it can have thread management problems and thus send garbage to the client (modern servers do their own memory management instead of relying on sendfile())
and many more that i can't think of (as it is with bugs)
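One way a keep-alive could be made to cover worker threads as well as the main loop, sketched in Python under the assumption that each worker can report progress. The `Heartbeats` class and its method names are invented for this example. It would not catch the other cases listed above (500 responses, broken routing, garbage output).

```python
import threading
import time


class Heartbeats:
    """Tracks when each worker last reported progress."""

    def __init__(self, max_age: float, clock=time.monotonic):
        self.max_age = max_age
        self.clock = clock
        self.last_seen = {}
        self.lock = threading.Lock()

    def beat(self, worker: str) -> None:
        """Called by a worker each time it finishes a unit of work."""
        with self.lock:
            self.last_seen[worker] = self.clock()

    def all_alive(self) -> bool:
        """True if at least one worker is known and none has gone quiet."""
        now = self.clock()
        with self.lock:
            seen = list(self.last_seen.values())
        return bool(seen) and all(now - t <= self.max_age for t in seen)
```

The main loop would then send its watchdog ping only while `all_alive()` returns true, so a stuck worker stops the pings even though the main loop itself keeps running.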

you don't make assumptions
you don't debate about "opinions"
it either works or it doesn't
if you want to know if there is power in a socket, you take this and stick it in
there is no philosophy behind it

a simple check like the one you speak of, that can be done by the program itself (from a clone() or fork()), is not 100% accurate
if you point firefox at the webpage, that is
if you point curl/wget at a webpage you get the same certainty but without doing it yourself
(ofc from a computer outside of the local network)

as for desktop processes, where these things aren't as important
if something like firefox hangs, you will notice
if your window manager fails, you will notice
if you tell your window manager to close a window that is hanging it will tell you that it is not responding and give you an option to kill it
and so on

now go tell someone else that they don't get the point

**erendorn** · 18 November 2014, 06:38 PM

Originally posted by gens

...
it is the difference between a banana written as "baeana" and as "apple"
if a binary log is corrupt you have to rely on the tool to decode the rest of it properly
if an ASCII text log is corrupt all you have to do is pass it through strings (or use a text editor that won't go nuts on unprintable characters, and that is most of them on many platforms)

But the text part of the binary log is written as ASCII text, so is there any actual difference?

**oleid** · 18 November 2014, 06:50 PM

Originally posted by gens

a modern server program does most of the work in worker threads, meaning that the main loop would work fine

This was merely a suggestion. I don't have a test case for what gets stuck and I didn't write the hypothetical webserver, so I don't know exactly where to put it.

Originally posted by gens

...
a simple check like the one you speak of, that can be done by the program itself (from a clone() or fork()), is not 100% accurate
if you point firefox at the webpage, that is
if you point curl/wget at a webpage you get the same certainty but without doing it yourself
(ofc from a computer outside of the local network)

Sure, if a valid HTTP header is sent, wget helps. But your wget won't notice if you get a 200 with a blank page, or some garbage loaded from an arbitrary memory area.

But that is not what you were talking about. We talked about infinite loops. Of course a watchdog can't help if there is no infinite loop. The 500-ish errors could also be extracted from the journal (it marks the priority of each message, such as ERROR, WARNING, etc.), as such errors typically get logged. If you change the subject mid-discussion without saying so, the discussion is pointless. No wonder people think you don't get the point.
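To make the point about blank pages concrete, here is a sketch of an external probe that checks the body as well as the status code, in Python using only the standard library. The `probe` function and its `marker` parameter are made up for this example; the marker would be some string the real page is known to contain.

```python
import http.client
import urllib.error
import urllib.request


def probe(url: str, marker: bytes, timeout: float = 5.0) -> bool:
    """True only if the URL answers 200 and the body contains the marker."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            body = response.read()
            return response.status == 200 and marker in body
    except (urllib.error.URLError, http.client.HTTPException,
            OSError, ValueError):
        # Refused connection, timeout, 4xx/5xx status, malformed URL, ...
        return False
```

A cron job could call `probe("http://example.org/", b"</html>")` from a machine outside the local network and send SIGTERM to the service when it returns false. Like the wget script, it only sees what one request sees.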

**gens** · 19 November 2014, 08:26 AM

Originally posted by erendorn

But text part of the binary log is written as ASCII text, so is there any actual difference?

yes, many differences

take for example the name of the program that sent the log msg
it is written once and given an "id"
an id is just a number and in the rest of the log it is used instead of the name
so if one of those entries has the wrong id, the whole line is more or less meaningless
the rest is similar

it's kind of like simple lossless compression in that every string that repeats itself is replaced by an index
so worst case that index gets corrupted and the whole log is worthless
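A toy model of the indexing scheme described above, in Python. It is not the actual journal file format, only an illustration of the argument: names are stored once in a table and each record refers to its name by index, so one damaged index silently changes which program a message appears to come from.

```python
def encode(entries):
    """Replace each program name by an index into a shared string table."""
    table = []
    records = []
    for name, message in entries:
        if name not in table:
            table.append(name)
        records.append((table.index(name), message))
    return table, records


def decode(table, records):
    """Resolve indices back to names; an out-of-range index becomes '?'."""
    return [
        (table[i] if 0 <= i < len(table) else "?", message)
        for i, message in records
    ]
```

In a plain text log the same damage would show up as a visibly mangled name, whereas here a wrong but valid index decodes without any sign of trouble.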

reasoning for a binary log was faster indexing, that was shown to not be entirely true
also who gives a f about how fast a log is parsed (grep can parse thousands of lines a sec)

**gens** · 19 November 2014, 08:37 AM

Originally posted by oleid

This was merely a suggestion. I don't have a test case for what gets stuck and I didn't write the hypothetical webserver, so I don't know exactly where to put it.

i just realized this morning
you are suggesting a systemd specific mechanism (sd_notify) that uses the OO dbus with the reasoning that it is just a couple lines of code
thus making it dependent on having two things running that you don't need on a specialized server (dbus and a compliant process tracker aka systemd)
and a library (or two)

while the "program does it itself" solution is also short (cca 7-10 simple lines of C) and is not just init/process tracker independent but also OS independent and needs only the kernel

not to even go in the discussion that process tracking does not necessarily need to be part of the init
(i'm making a process tracker, for fun)

let's have another quote on complicating things:
"Any intelligent fool can make things bigger, more complex, and more violent. It takes a touch of genius -- and a lot of courage -- to move in the opposite direction."
-Albert Einstein

edit: don't get me wrong
it would work the same, it's just overall way more complicated

**oleid** · 19 November 2014, 09:32 AM

Originally posted by gens

i just realized this morning
you are suggesting a systemd specific mechanism (sd_notify) that uses the OO dbus with the reasoning that it is just a couple lines of code
thus making it dependent on having two things running that you don't need on a specialized server (dbus and a compliant process tracker aka systemd)
and a library (or two)

Sure, you'd need DBUS and the process tracker. But since dbus is now even part of the kernel, I wouldn't count that as a huge dependency.

Originally posted by gens

while the "program does it itself" solution is also short (cca 7-10 simple lines of C) and is not just init/process tracker independent but also OS independent and needs only the kernel

I doubt that it's only 10 lines of C code. You'd need to handle a few cases to get it right. Maybe it would make sense to put process restarting into a shared library, as it really wouldn't make sense to repeat mostly the same code over and over in every daemon out there. But then you could just as well put this code into a process hypervisor, if you have one anyway.

Originally posted by gens

not to even go in the discussion that process tracking does not necessarily need to be part of the init
(i'm making a process tracker, for fun)

I'd love to see the code, as I find this topic interesting.

Originally posted by gens

let's have another quote on complicating things:
"Any intelligent fool can make things bigger, more complex, and more violent. It takes a touch of genius -- and a lot of courage -- to move in the opposite direction."
-Albert Einstein

I'm all for simple solutions. If somebody comes up with a simpler solution than systemd, that solves the same problems, I'm all for it.

Debian Developer Resigns From The Systemd Maintainership Team