systemd 253 RC1 Released With New "ukify" Tool


  • #31
    Originally posted by oiaohm View Post
    https://man7.org/tlpi/code/online/di...id_file.c.html
    The historic Unix-style PIDFILE is beyond dumb. You take a process's current PID value, write it into a file, and later use that value to kill the process or similar, hoping nothing bad has happened in the middle. Yes, the only information about the process in the pidfile is the PID number.

    Yes, this dumb form of PIDFILE is what sysvinit uses.

    The old historic PIDFILE system is very much blind faith: if a process still exists at the PID value, it must be the right process; no need to perform any checks that it is the right process, and no need to record any extra information to make sure it is the right process.

    The historic PIDFILE is in fact not linked to /proc/$pid, because the /proc directory did not exist when Unix was invented. The historic PIDFILE is just a file that you wrote a PID number into and then, in most implementations, blindly trusted. Yes, this includes sysvinit and many of the old alternatives.

    The reason systemd uses cgroups is that if a service's first process starts another process, the cgroup tracks it; the historic PIDFILE is of course not going to track this.
    Now that you mention it, I remember seeing pidfile support in many daemons (nginx among them, I think), enabled via a flag.

    It's certainly racy: not only can the process disappear and the PID be recycled at any point, you also need to create a tempfile to avoid multiple daemons writing into the same file.
    pidfd is clearly the elegant solution here.
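
    To make the race concrete, here is a minimal sketch of the classic pattern being criticized (the pidfile path is hypothetical): nothing ties the number read back to the daemon that originally wrote it.
    Code:
    #include <signal.h>
    #include <stdio.h>
    #include <sys/types.h>

    /* Classic racy pidfile usage: read a bare PID and signal it. If the
       daemon died and its PID was recycled, this kills an unrelated
       process. */
    int main(void)
    {
        FILE *f = fopen("/run/mydaemon.pid", "r");   /* hypothetical path */
        int pid;

        if (!f)
            return 1;
        if (fscanf(f, "%d", &pid) != 1) {
            fclose(f);
            return 1;
        }
        fclose(f);

        kill(pid, SIGTERM);   /* blind faith: no check that this is still our daemon */
        return 0;
    }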

    Originally posted by oiaohm View Post
    pidfd_open allows you to open a file handle, then proceed to check the /proc/$pid directory, and if this matches what you are expecting, send a terminate signal or equivalent. If termination happens in that window, the right process was checked and the signal sent, and the signal will not go through to a replacement process. Yes, one of the issues with PIDFILE usage is the fun case of killing the wrong process just because it was unlucky enough to start up on PID value X.
    Yes, the man page of pidfd_open https://man7.org/linux/man-pages/man2/pidfd_open.2.html clearly states that it can only be used when the following conditions are satisfied:

    • the disposition of SIGCHLD has not been explicitly set to SIG_IGN (see sigaction(2));

    • the SA_NOCLDWAIT flag was not specified while establishing a handler for SIGCHLD or while setting the disposition of that signal to SIG_DFL (see sigaction(2)); and

    • the zombie process was not reaped elsewhere in the program (e.g., either by an asynchronously executed signal handler or by wait(2) or similar in another thread).
    It's clear that using CLONE_PIDFD in clone is the best solution, race-free in every possible situation.
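
    A minimal sketch of that approach, assuming Linux 5.3+; clone3() has no glibc wrapper, so it goes through syscall(2). The pidfd is created atomically with the child, so there is never a window in which the PID could be recycled out from under us.
    Code:
    #define _GNU_SOURCE
    #include <linux/sched.h>     /* struct clone_args, CLONE_PIDFD (Linux >= 5.3) */
    #include <signal.h>
    #include <stdint.h>
    #include <stdio.h>
    #include <sys/syscall.h>
    #include <unistd.h>

    int main(void)
    {
        int pidfd = -1;
        struct clone_args args = {
            .flags       = CLONE_PIDFD,                  /* store a pidfd for the child... */
            .pidfd       = (uint64_t)(uintptr_t)&pidfd,  /* ...into this variable */
            .exit_signal = SIGCHLD,
        };

        long pid = syscall(SYS_clone3, &args, sizeof(args));
        if (pid < 0)
            return 1;
        if (pid == 0) {                                  /* child */
            execlp("sleep", "sleep", "5", (char *)NULL);
            _exit(127);
        }
        printf("child pid %ld, pidfd %d\n", pid, pidfd);
        return 0;
    }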

    Originally posted by oiaohm View Post
    pidfd_open and the signal commands are about creating a window in which to check that you have the right process.
    Yes, having the pidfd open does not prevent the PID from being recycled, but you can poll on it to know whether the process is still alive after checking /proc/$pid/.
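
    A sketch of that liveness probe, assuming pidfd came from an earlier pidfd_open() call; a pidfd polls as readable once the process exits.
    Code:
    #include <poll.h>

    /* Returns 1 while the process behind pidfd is still running; a pidfd
       reports POLLIN once the process has terminated. */
    static int pidfd_alive(int pidfd)
    {
        struct pollfd pfd = { .fd = pidfd, .events = POLLIN };

        return poll(&pfd, 1, 0) == 0;   /* timeout 0: non-blocking probe */
    }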

    Originally posted by oiaohm View Post
    Of course, since historic PIDFILEs only contain the PID value, they don't contain any information to confirm that.

    Let's say the PIDFILE also included the time the targeted PID started, so instead of "PID > PIDFILE" it was "PID mtime_of_/proc/$pid > PIDFILE". That is enough information to check whether something has been recycled under Linux, because the mtime of /proc/$pid does not change over a process's life under Linux. Note that I wrote under Linux.

    The big problem here is that there is no cross-Unix standard for checking whether a PID has been recycled.

    That systemd, launchd, and SMF all need platform-specific code should start making sense as well.
    IMO they should just adopt pidfd_open.
    Unix should grow up and fix APIs that worked in the old days but no longer work now.

    fork + exec is another example of an inefficient API, with vfork + exec being a broken and racy replacement.
    I remember a proposal at last year's Linux conference for a new API closely modeled after posix_spawn; I hope it lands in the kernel soon, as it would also make implementing a seccomp sandbox easier if adopted everywhere.
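
    For reference, here is roughly what the posix_spawn-style userspace API looks like today; this is a minimal sketch, and the note about glibc internals reflects current glibc behavior rather than anything guaranteed by POSIX.
    Code:
    #include <spawn.h>
    #include <stdio.h>
    #include <sys/types.h>
    #include <sys/wait.h>

    extern char **environ;

    /* posix_spawnp() starts a child without duplicating the parent's
       address space the way fork() + exec does; glibc implements it
       with CLONE_VM | CLONE_VFORK internally while avoiding vfork's
       sharp edges. */
    int main(void)
    {
        pid_t pid;
        char *argv[] = { "echo", "spawned", NULL };
        int rc = posix_spawnp(&pid, "echo", NULL, NULL, argv, environ);

        if (rc != 0) {
            fprintf(stderr, "posix_spawnp failed: %d\n", rc);
            return 1;
        }
        waitpid(pid, NULL, 0);
        return 0;
    }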

    Comment


    • #32
      Originally posted by NotMine999 View Post
      My own view of systemd is, and has been for a while, that the more distros converge on common concepts like systemd, the harder they must work to differentiate themselves in order to retain their raison d'être. It's like the days of the old Model T Ford... you could have any color you wanted so long as it was black. Seriously, all original Model T Fords were painted black.
      Original Ford Model T automobiles came in a choice of many colours, exactly none of which were black. At launch on October 1st 1908, it was available in red, blue, grey, and green, but not black. It was between 1914 and 1925 that it was available only in black.

      Comment


      • #33
        Originally posted by Xake View Post
        How is this any different from how the Linux kernel handles OOM when there is no systemd-oomd? That is: kill a random process, and if you are lucky, it is not Xorg that it deemed the process to kill.
        Or do you prefer the Windows way? That is: lock up your computer until you hard reset.
        ( remember I was talking specifically about the desktop. )
        While I'm a fan of userspace OOM managers, we have to be real. And I don't know about you,
        but to me, being "not any different" from the lowest common denominator doesn't make it
        any less of a clusterfuck. ( - again - talking about the desktop ).

        You make it sound like it kills long before anyone notices any problem with regard to system load. That is not my experience. At all.
        No. Not even from afar. Never said anything of the sort, nor to that end.

        When was the last time you actually oom-ed?
        Testing falls among my hobbies and my duties.
        I try not to do -that- on my personal workstation, but on one that's not mine???
        What about five minutes ago, and again?

        I myself have done it a couple of times lately due to a somewhat unstable dev environment. And for me, it starts to show that I am running out of memory long before systemd-oomd does anything.
        Good fake example out of your arse... Please, don't make up things just to make your points. Anybody can smell it.

        Also, on my home server running a daemon with a known memory leak, if I have forgotten to restart that daemon for about two months, ssh will stop working for a short while until systemd-oomd has done its thing.
        I don't believe you. Why would you even mention such a generically named "a daemon", "with a known memory leak"—no less—, if not for a cheap counterpoint?
        Moreover, why the hell would you run such a thing anyway? And even if it were true (which I'm not conceding),
        why would you then rely on such a clumsy mitigation strategy? Restarting manually, really? Not even a cronjob???
        After all, you could just shove it in a cgroup with memory limits, and forget about it... Instead you choose to Machiavelli it??? No, my friend...

        And while we are at it, why are you bending over backwards and fabricating claims,
        when you actually just quoted me saying it works like a charm on servers.
        Didn't you notice?

        The only thing I can think of that leads to you getting OOMs without any system impact beforehand is [...] swap on their systems.
        That's bullshit. A proper swap will move the boundaries for sure, but when memory
        starvation hits, it hits. And you'll feel it even more, 'cause that's the weak point of Linux swap.

        It's not good under excessive memory pressure. It tends to start swap/reclaim
        tempests that slow everything down to a crawl well before the kernel OOM has anything to say.
        Hence the birth of userspace OOMMs... Whaddayaknow. That's why they made it???

        is if you are one of those who, due to loads of misconceptions, has turned off swap on their systems.
        Wooow! That's... you're so misguided it's cute.
        Misconceptions are what lead to fads like disabling swap. Not the other way around.

        Comment


        • #34
          Originally posted by _ReD_ View Post

          ( remember I was talking specifically about the desktop. )
          While I'm a fan of userspace OOM managers, we have to be real. And I don't know about you,
          but to me, being "not any different" from the lowest common denominator doesn't make it
          any less of a clusterfuck. ( - again - talking about the desktop ).
          And this is what I am talking about. You are calling oomd a clusterfuck. You did so before, and you said it again. The only difference is that now you did (because of me) add the context that it was not worse than the previously existing handler, the kernel's internal OOM routine.
          But that said, I personally will not place it on the same level as the kernel's OOM routine just because you cannot guide it with a GUI.

          For me, oomd is different from the kernel OOM routine: using a desktop workstation, I can actually make oomd kill what I want, just not with the help of a pretty GUI.
          So I will personally place oomd a bit above the kernel's integrated OOM routine. And to me you are not being "real" when you say they are on the same level of clusterfuck.
          And to be honest, I do not think a regular user should be able to tweak oomd in a pretty GUI. They tend to shoot themselves in the foot enough on their own, then bug-report their own faulty configuration and try to blame it on others, just because someone said something fun on a forum once. Like the people telling everyone to "turn off your swap".

          Originally posted by _ReD_ View Post

          Testing falls among my hobbies and my duties.
          I try not to do -that- on my personal workstation, but on one that's not mine???
          What about five minutes ago, and again?

          Good fake example out of your arse... Please, don't make up things just to make your points. Anybody can smell it.
          Just as good a fake example as you testing on your not-personal workstation, or whatever it was, that hit an OOM five minutes ago? I am trying to figure out what you are saying about your personal workstation, but cannot really comprehend it.

          And let's just say that the first time you spin up a local Kafka instance, a local mysqld, and some other containers, with code fetching test data from a remote topic with an unknown amount of data and manipulating that data locally, things have a tendency to OOM before you remember to set memory limits for the containers in question.

          Originally posted by _ReD_ View Post
          I don't believe you. Why would you even mention such a generically named "a daemon", "with a known memory leak"—no less—, if not for a cheap counterpoint?
          Moreover, why the hell would you run such a thing anyway? And even if it were true (which I'm not conceding),
          why would you then rely on such a clumsy mitigation strategy? Restarting manually, really? Not even a cronjob???
          After all, you could just shove it in a cgroup with memory limits, and forget about it... Instead you choose to Machiavelli it??? No, my friend...
          The generic daemon is transmission-daemon, which has a tendency to eat memory. I do not usually restart transmission-daemon manually as such; however, I do tend to install updates, including kernel updates needing reboots, which, you know, kinda restarts transmission-daemon at the same time.
          And that is the reason I have yet to do any other kind of workaround: I had for some reason never hit this memory leak hard enough that it OOM-ed for me before. Now I have had three OOMs since November, but no spare time to do any ugly workaround like automatic restarts in a cronjob.

          And that is the reason I did not indulge in details about the "generic daemon with known memory leaks" (look into the git history of transmissionbt and you will see a couple of them plugged since version 3.00, so yeah, known memory leaks).
          As far as I can see, these details did not really add anything to the discussion, which in this case was an example of how a memory-starved system behaves, and that you _will_ notice it, no matter whether it is a server or a desktop.

          Originally posted by _ReD_ View Post
          And while we are at it, why are you bending over backwards and fabricating claims,
          when you actually just quoted me saying it works like a charm on servers.
          Didn't you notice?
          I noticed. But I would say it works like a charm on desktops as well. For me, OOM has been more predictable on my desktop since systemd-oomd was introduced, and even better after I tweaked it.

          Originally posted by _ReD_ View Post
          That's bullshit. A proper swap will move the boundaries for sure, but when memory
          starvation hits, it hits. And you'll feel it even more, 'cause that's the weak point of Linux swap.
          You are misreading what I am saying.
          The thing is, your stance sounded like systemd-oomd was shit on desktops, and the way you worded it sounded like you meant it was worse than previous implementations (that is, the kernel's internal OOM routine). So you made it sound like people got hit by systemd-oomd out of the blue on the desktop, without any indicators (like a sluggish system).
          So yes, swap moves the boundary. And yes, just like any system with swap, under memory starvation it will start to be sluggish.
          However, if you do not use swap at all, your system will not be sluggish: it will OOM directly, before it really has time to become sluggish.
          That was and is my point.

          Originally posted by _ReD_ View Post
          It's not good under excessive memory pressure. It tends to start swap/reclaim
          tempests that slow everything down to a crawl well before the kernel OOM has anything to say.
          Hence the birth of userspace OOMMs... Whaddayaknow. That's why they made it???
          Yes?
          Originally posted by _ReD_ View Post
          Wooow! That's... you're so misguided it's cute.
          Misconceptions are what lead to fads like disabling swap. Not the other way around.
          And again you are misreading me. Again, what I said was based on how you worded it: it sounded like you got hit by systemd-oomd out of the blue, without any bit of sluggishness beforehand. And I said the only way a system can behave that way is if you are one of those who have fallen for the exact misconceptions you linked to. Which is something I have seen a lot of people do, and recommend others to do.
          To me, turning off swap on your Linux system is just as brain-dead as turning off the antivirus on your Windows PC because "they are not interested in my computer anyway".

          Comment


          • #35

            Originally posted by NobodyXu View Post
            Yes, the man page of pidfd_open https://man7.org/linux/man-pages/man2/pidfd_open.2.html clearly states that it can only be used when the following conditions are satisfied:

            It's clear that using CLONE_PIDFD in clone is the best solution, race-free in every possible situation.
            Even if the child has already terminated by the time of the
            pidfd_open() call, its PID will not have been recycled and the
            returned file descriptor will refer to the resulting zombie
            process. Note, however, that this is guaranteed only if the
            following conditions hold true:
            The bit above is important: it says that if the following points are not true, you are not getting a file descriptor to the zombie. All three points are conditions under which the zombie process state has ceased to exist, so there is nothing to open a file descriptor against unless something else has taken its place. Yes, if those conditions are not satisfied, pidfd_open might open a handle connected to the wrong process, or return no file descriptor.

            CLONE_PIDFD really does not work much better. All it means is that you have a file handle open that points to nothing once the zombie is gone, so you don't have to perform any extra checks that you have the right handle.

            Originally posted by NobodyXu View Post
            Yes, having the pidfd open does not prevent the PID from being recycled, but you can poll on it to know whether the process is still alive after checking /proc/$pid/.
            To be correct: in a lot of cases you can avoid checking /proc/$pid if you have the mtime value, because you can perform fstat on the fd given back by pidfd_open to get the mtime value.

            Yes, if the PIDFILE contained PID and mtime, you perform pidfd_open then follow with fstat, and then you know you have the right process. If pidfd_open returns no file handle, there was no active or zombie process at that PID number at the time. Using fstat this way is more race-condition proof than checking /proc/$pid/.
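
            A minimal sketch of that enriched PIDFILE, under the assumption stated above that the mtime of /proc/$pid is stable for a process's lifetime on Linux; the helper name and file format are hypothetical.
            Code:
            #include <stdio.h>
            #include <sys/stat.h>
            #include <sys/types.h>

            /* Hypothetical enriched pidfile: "<pid> <mtime>", where <mtime> is
               the mtime of /proc/<pid>, constant for the life of the process
               on Linux. */
            int write_enriched_pidfile(const char *path, pid_t pid)
            {
                char proc[64];
                struct stat st;
                FILE *f;

                snprintf(proc, sizeof(proc), "/proc/%d", (int)pid);
                if (stat(proc, &st) < 0)
                    return -1;

                f = fopen(path, "w");
                if (!f)
                    return -1;
                fprintf(f, "%d %lld\n", (int)pid, (long long)st.st_mtime);
                return fclose(f);
            }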

            Having a file descriptor for the PID gives you something to work with to check that you have the right process before sending a kill signal to it.


            Originally posted by NobodyXu View Post
            IMO they should just adopt pidfd_open.
            Unix should grow up and fix APIs that worked in the old days but no longer work now.
            https://www.freebsd.org/cgi/man.cgi?...fork&sektion=2
            Code:
            int
            pdgetpid(int fd, pid_t *pidp);

            int
            pdkill(int fd, int signum);
            FreeBSD's equivalent of the Linux pidfd_open stuff. Slightly different platform handling to check that it is still the right file handle.
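
            A minimal FreeBSD-side sketch, assuming typical usage of the process-descriptor API linked above.
            Code:
            #include <sys/procdesc.h>   /* FreeBSD: pdfork, pdkill, pdgetpid */
            #include <signal.h>
            #include <unistd.h>

            /* pdfork() is fork() plus an atomically created process
               descriptor, so a later pdkill() can never hit a recycled PID. */
            int spawn_and_later_kill(void)
            {
                int pd;
                pid_t pid = pdfork(&pd, 0);

                if (pid < 0)
                    return -1;
                if (pid == 0) {                      /* child */
                    execlp("sleep", "sleep", "60", (char *)NULL);
                    _exit(127);
                }
                /* ... later, in the parent ... */
                return pdkill(pd, SIGTERM);          /* signal via the descriptor */
            }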

            So what is needed here is agreement and something added to the POSIX standard, or we just accept that we need platform-specific code for process management.

            I am not sure whether on FreeBSD you can fstat the fd to get an mtime matching the start time of the process. Yes, PID + start time equals a unique ID.

            There was a big problem in the past where people were pushing for init/service-management systems to be written in pure POSIX, and the result is that there are a lot of alternative init and service-management solutions that don't in fact work correctly due to the problem of PID recycling.

            pidfd_open is not 100 percent problem-free if used alone. But with fstat on the fd, or the other checks that pidfd_open makes possible, you can detect process recycling, and once you detect that a PID has been recycled you can avoid sending kill signals to the wrong process.

            Yes: if the pidfd file descriptor is confirmed as the right one, and between that confirmation and the moment you send the kill signal through the pidfd the process is recycled and replaced, the signal is not going to the wrong process, because the pidfd will have been disconnected. Used right, pidfd_open is as race-condition free as using clone(2).
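
            A sketch of that final step, assuming a pidfd already validated as described above; signalling through the descriptor rather than the raw PID means a recycled PID cannot be hit (recent glibc also ships a pidfd_send_signal() wrapper).
            Code:
            #define _GNU_SOURCE
            #include <signal.h>
            #include <sys/syscall.h>
            #include <unistd.h>

            /* If the original process exited after validation, the kernel
               returns ESRCH here instead of signalling whatever process now
               owns the PID. */
            int kill_via_pidfd(int pidfd)
            {
                return syscall(SYS_pidfd_send_signal, pidfd, SIGTERM, NULL, 0);
            }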

            Something to remember: processes under Linux have a maximum count of file descriptors they can have open. The clone(2) method could completely run you out of file handles. pidfd_open means taking extra care, but you avoid the risk of using up all the file descriptors, because you only hold handles open while you need to interface with a particular process. Of course, to be sure with pidfd_open that you have the right process, you need to do something else, like fstat to check the mtime, to make sure it has not been replaced.

            The reality is that neither clone(2) nor pidfd_open should result in signals being sent to the wrong process if used correctly. pidfd_open can be used incorrectly by not performing any check that you have the right process. pidfd_open is way better than raw PID numbers, where you cannot safely perform the check at all.
            Last edited by oiaohm; 26 January 2023, 09:03 PM.

            Comment


            • #36
              Originally posted by oiaohm View Post
              The bit above is important: it says that if the following points are not true, you are not getting a file descriptor to the zombie. All three points are conditions under which the zombie process state has ceased to exist, so there is nothing to open a file descriptor against unless something else has taken its place. Yes, if those conditions are not satisfied, pidfd_open might open a handle connected to the wrong process, or return no file descriptor.

              CLONE_PIDFD really does not work much better. All it means is that you have a file handle open that points to nothing once the zombie is gone, so you don't have to perform any extra checks that you have the right handle.
              CLONE_PIDFD works even when these conditions cannot be satisfied, and it is guaranteed always to work, since it creates the pidfd when creating the process, before that process can even run a single instruction.

              Originally posted by oiaohm View Post
              To be correct: in a lot of cases you can avoid checking /proc/$pid if you have the mtime value, because you can perform fstat on the fd given back by pidfd_open to get the mtime value.

              Yes, if the PIDFILE contained PID and mtime, you perform pidfd_open then follow with fstat, and then you know you have the right process. If pidfd_open returns no file handle, there was no active or zombie process at that PID number at the time. Using fstat this way is more race-condition proof than checking /proc/$pid/.
              I'm not sure about the mtime as it is not documented in the pidfd_open man page.

              I would probably do the following to verify it's the right process:
              - pidfd_open on $pid
              - check /proc/$pid/
              - poll on pidfd to ensure that it's still alive

              It uses more syscalls and inspects /proc/$pid, but the behavior of using select/poll/epoll on a pidfd is well documented in its man page.
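
              A sketch of those three steps; the /proc/$pid comparison is elided since what to match against is application-specific, and the helper name is made up.
              Code:
              #define _GNU_SOURCE
              #include <poll.h>
              #include <sys/syscall.h>
              #include <sys/types.h>
              #include <unistd.h>

              /* Returns a validated pidfd, or -1 if the process is gone or did
                 not match. */
              int open_verified_pidfd(pid_t pid)
              {
                  int pidfd = syscall(SYS_pidfd_open, pid, 0);

                  if (pidfd < 0)
                      return -1;                  /* no process (or already reaped) */

                  /* ... step 2: read /proc/<pid>/cmdline etc. and compare ... */

                  struct pollfd pfd = { .fd = pidfd, .events = POLLIN };
                  if (poll(&pfd, 1, 0) != 0) {    /* readable: already exited */
                      close(pidfd);
                      return -1;
                  }
                  return pidfd;                   /* alive, and it matched */
              }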

              Originally posted by oiaohm View Post
              Having a file descriptor for the PID gives you something to work with to check that you have the right process before sending a kill signal to it.
              That's true.

              Originally posted by oiaohm View Post
              FreeBSD's equivalent of the Linux pidfd_open stuff. Slightly different platform handling to check that it is still the right file handle.

              So what is needed here is agreement and something added to the POSIX standard, or we just accept that we need platform-specific code for process management.
              Would definitely be good to have it in POSIX for portable code, but it seems to be moving slowly nowadays.

              Originally posted by oiaohm View Post
              There was a big problem in the past where people were pushing for init/service-management systems to be written in pure POSIX, and the result is that there are a lot of alternative init and service-management solutions that don't in fact work correctly due to the problem of PID recycling.
              Broken Unix API again... Along with fork/vfork + exec, where other OSes have fixes for it but Linux hasn't yet.

              Originally posted by oiaohm View Post
              Something to remember: processes under Linux have a maximum count of file descriptors they can have open. The clone(2) method could completely run you out of file handles. pidfd_open means taking extra care, but you avoid the risk of using up all the file descriptors, because you only hold handles open while you need to interface with a particular process. Of course, to be sure with pidfd_open that you have the right process, you need to do something else, like fstat to check the mtime, to make sure it has not been replaced.

              The reality is that neither clone(2) nor pidfd_open should result in signals being sent to the wrong process if used correctly. pidfd_open can be used incorrectly by not performing any check that you have the right process. pidfd_open is way better than raw PID numbers, where you cannot safely perform the check at all.
              I would argue that CLONE_PIDFD would not cause such issues, as you would only use it if you are something like systemd, where you know you will be reaping processes created by others.

              And using pidfd is always better than a raw PID, as it is simply impossible to use a PID in a race-free manner unless you are 100% sure the zombie won't be reaped by waitpid before you perform any operation on it.
              Last edited by NobodyXu; 27 January 2023, 12:18 AM. Reason: correction

              Comment


              • #37
                Ok, I shall refrain from delving into too much detail because apparently we have a very big language barrier here: you find meanings in my writings that have nothing to do with what's in the ink, while —often— writing yourself exactly the opposite of what you really mean. That's tough...

                Originally posted by Xake View Post
                And this is what I am talking about. You are calling oomd a clusterfuck.
                No. No. No. Not OOMD, not even any of the existing userspace OOM-managers (because I didn't even write OOMD).
                I always talked about the whole situation of OOM-killing stuff on the desktop. (I will further explain the concept below.)

                Originally posted by Xake View Post
                You did so before, and you said it again. The only difference is that now you did (because of me) add the context that it was not worse than the previously existing handler, the kernel's internal OOM routine.
                No. I did not just "add the context" as you say. I specifically quoted your exact words in an effort to clarify that the point is NOT whether systemd-oomd is equal, or better, or worse, or even much better than the alternatives. That comparison is completely irrelevant. That was—never—the point.

                I crafted my point carefully, defining a very real and recurrent situation: (eminently among Fedora and Ubuntu users)
                "a situation where something may—silently—murder your programs or even your whole desktop"
                and then I specifically called—that situation—"a clusterfuck".

                I agree that for interactivity, and from the system's point of view, THAT may certainly be better than just hanging forever or crashing. OF COURSE!!!! But that is —NOT— the point.

                And it is NOT the point, whether someone—supposedly watching the PC very attentively—could, or could not,
                judge that the system is circling the drain... and therefore ascertain that, whatever "magically disappeared"
                from their desktop, was indeed just oom-killed.

                The point is (and always was) that all this happens (ON THE DESKTOP) in a way that it never should.

                Killing boogies silently is good for the system, but it's also extremely dumbfounding
                to the desktop user.
                And on the desktop, the mythical "average desktop user" *must* be king.

                Now, is this whole situation fixable?
                YES! I hypothesized a host of possible enhancements right in the original post.

                Is it possible to mitigate or minimize the clusterfuck—now—?
                YES, but the average desktop user can't do it.
                He'd have to experiment with ratios and cgroups, and force or simulate many different OOM scenarios.

                I do exactly THAT. And that's why I told you that I experienced an OOM in the last five minutes.
                I do often reproduce and test many scenarios, for fun and profit.
                But I really have real customers, hit in the face by "the clusterfuck situation", who, now and again,
                waste very long—very expensive—hours trying to understand how their (usually) 2-3 hour
                analysis job keeps crashing at different random stages... And why sometimes it brings along
                the whole desktop... with not even the cold comfort of a useless blue screen.
                To them, the old "the task is hung and the system is thrashing like mad" is a thousand times better.
                Even if it sucks more at the system level. They can grok it. They can look at the aftermath, instead of an empty screen.
                They can freaking recover partial jobs... Because they still have a visual. They can decide to kill <this>
                instead of <that> and just bide their time out of the thrashing, if they decide that starting from
                scratch would take longer...

                To each his own. But don't call it OK just because it worked for you in a couple of instances.


                To be clear, ALL of the above can-and-will happen OUT OF THE BLUE, under the right
                conditions, because the worst offenders, some major distros, did —zilch— testing and YOLO-ed some
                very funny OOM defaults into their golden images, while also doing ungodly things to their other
                defaults (like pushing no-swap or zram-alone, regardless of physical RAM configuration, etc.)

                ---------

                Was any of the reasoning above specific to systemd-oomd? No!

                Does ANY of the above demean systemd-oomd? No!

                Do systemd-oomd (and earlyoom) play a role in both the problem and the solution?
                YES OF COURSE!!!
                Last edited by _ReD_; 26 January 2023, 11:57 PM.

                Comment


                • #38
                  Originally posted by NobodyXu View Post
                  CLONE_PIDFD works even when these conditions cannot be satisfied, and it is guaranteed always to work, since it creates the pidfd when creating the process, before that process can even run a single instruction.
                  If you want a safe replacement for the kill command or similar, CLONE_PIDFD is not going to work.

                  Originally posted by NobodyXu View Post
                  I'm not sure about the mtime as it is not documented in the pidfd_open man page.

                  I would probably do the following to verify it's the right process:
                  - pidfd_open on $pid
                  - check /proc/$pid/
                  - poll on $pid to ensure that it's still alive

                  It uses more syscalls and inspects /proc/$pid, but the behavior of using select/poll/epoll on a pidfd is well documented in its man page.
                  It's noted in the Linux kernel patch that added pidfd_open that fstat could be used. The process you described works if, in the third step, you meant a poll on the file handle, not a poll on the $pid.

                  Now, the thing here is that if pidfd_open fails to create a file descriptor, the process is in a state where you cannot send signals to it anyhow. That is why I don't see pidfd_open's failure conditions as a problem.

                  - Attempt pidfd_open(pid) to get a file handle.
                  - Check the result: if the call failed, bail out, because there is no process or zombie at that PID to interact with.
                  - Check that /proc/$pid contains what you are expecting.
                  - Check that the pidfd file handle still exists and is valid/not deleted. fstat can tell you whether the underlying object is deleted, and so can your epoll/select. If fstat says it is deleted, the check of /proc/$pid might have been reading some other process's data: either try again from the start or bail. Checking fstat at this point also lets you pick up the mtime.
                  - pidfd_send_signal (sketched below).
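
                  A sketch of the fstat re-check in the fourth step; whether fstat on a pidfd exposes these fields this way is, as discussed in this thread, not in the man page, so treat it as implementation-dependent.
                  Code:
                  #include <sys/stat.h>

                  /* Re-validate the pidfd after reading /proc/$pid. A failed
                     fstat, or a link count of zero (deleted), means the /proc
                     read may have hit some other process: retry from the
                     start, or bail. Implementation-dependent behaviour. */
                  static int pidfd_still_valid(int pidfd)
                  {
                      struct stat st;

                      if (fstat(pidfd, &st) < 0)
                          return 0;
                      if (st.st_nlink == 0)
                          return 0;
                      /* st.st_mtime can also be picked up here, per the
                         discussion above. */
                      return 1;
                  }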

                  (Link: the stress-ng upstream project git repository.)

                  fstat is meant to work against a pidfd. epoll/poll/select are for when you want to wait on the pidfd; fstat is more "I want to know, right at this point, whether everything is still good."

                  There is more overhead doing this, but it is safe.

                  Originally posted by NobodyXu View Post
                  I would argue that CLONE_PIDFD would not cause such issues, as you would only use it if you are something like systemd, where you know you will be reaping processes created by others.
                  In service management you are always reaping processes created by others. pidfd_open, with fstat after checking the /proc/$pid information, gives you a pidfd file descriptor where you can be sure what process you are dealing with.

                  A little due care with pidfd_open, using poll/fstat in the right places and performing the right checks, and the result is a safe solution.

                  Originally posted by NobodyXu View Post
                  And using pidfd is always better than a raw PID, as it is simply impossible to use a PID in a race-free manner unless you are 100% sure the zombie won't be reaped by waitpid before you perform any operation on it.
                  Absolutely agree. With clone's pidfd it is simple to be 100 percent race-free. For pidfd_open to be race-free requires the little extra step of making sure that race-prone operations are not leading you up the garden path; the epoll/poll/select or fstat options, applied after operations like checking the contents of /proc/$pid, achieve this end.

                  Raw PID is how the POSIX standard tells you to do it. The POSIX standard is basically useless for process management and has been for the entire time it has existed.

                  Yes, I would say the pidfd documentation does need to be updated with the fstat usages that the kernel implementation supports. There are cases where the fstat path makes far more sense than epoll/poll/select.

                  Comment


                  • #39
                    Originally posted by oiaohm View Post
                    If you want a safe replacement for the kill command or similar, CLONE_PIDFD is not going to work.
                    Yes, I know, but for an init system like systemd this is still very useful.

                    Originally posted by oiaohm View Post
                    It's noted in the Linux kernel patch that added pidfd_open that fstat could be used.
                    So long as it is not documented in the man page, I won't use it, as it could be an implementation detail.

                    Originally posted by oiaohm View Post
                    The process you described works if, in the third step, you meant a poll on the file handle, not a poll on the $pid.
                    Thanks for spotting, I've fixed it.

                    Originally posted by oiaohm View Post
                    Raw PID is how the POSIX standard tells you to do it. The POSIX standard is basically useless for process management and has been for the entire time it has existed.
                    Agreed; the POSIX standard is useless for any complex system like that, as you have to use OS-specific features: pidfd to fix the race, sendmmsg/recvmmsg to improve performance for UDP sockets (HTTP/3/QUIC), or a better process-creation API without the vfork + exec mess or the performance penalty that comes with fork + exec.

                    Originally posted by oiaohm View Post
                    Yes, I would say the pidfd documentation does need to be updated with the fstat usages that the kernel implementation supports. There are cases where the fstat path makes far more sense than epoll/poll/select.
                    It definitely needs to; I really don't want to depend on implementation details that could break my app at any time.

                    Comment


                    • #40
                      Originally posted by _ReD_ View Post
                      Do systemd-oomd (and earlyoom) play a role in both the problem and the solution?
                      YES OF COURSE!!!


                      For some users the problem is oomd or the out-of-memory killer itself. In some use cases you should have overcommit disabled, so in some cases systemd-oomd and earlyoom are not part of the solution.

                      With overcommit off, when something goes to allocate memory that does not exist, the Linux kernel will just return to the application that there is no memory to get.

                      Originally posted by _ReD_ View Post
                      I crafted my point carefully, defining a very real and recurrent situation: (eminently among Fedora and Ubuntu users)
                      "a situation where something may—silently—murder your programs or even your whole desktop"
                      and then I specifically called—that situation—"a clusterfuck".
                      This is not all users. The problem here is that by the time any form of OOM killer is going after your processes, things are already on the path to going horribly wrong.

                      Originally posted by _ReD_ View Post
                      like pushing no-swap or zram-alone, regardless of physical RAM configuration, etc.
                      This one has a horrible catch: no-swap is not a good idea with the Linux kernel. Pushing anonymous pages to swap allows them to be relocated in memory, so the Linux kernel can defragment RAM and large allocations can happen. zram-alone is fine.

                      Yes, even with overcommit disabled, having no swap is not a good idea. No swap causes increased memory fragmentation, and with overcommit disabled that makes programs stop being able to allocate memory sooner than they should, because the sizes of memory they are asking for cannot be created due to fragmentation. Instead of the process being told it cannot have memory, with overcommit disabled this ends up triggering the kernel-level OOM killer. No swap is bad. With zram swap alone you don't have this absolute nightmare, because the zram will be used to defragment memory.

                      The out-of-memory killer exists in Linux because the Linux kernel allows applications to allocate more memory than the system really has. What triggers the OOM killer is processes using more memory than the system has, by too large a margin.

                      Yes, you can set the Linux kernel to the Linux equivalent of a blue/red screen of death in the out-of-memory case: set /proc/sys/vm/panic_on_oom to 1 and the kernel will panic instead of running the OOM killer.

                      Comment
