Linux READFILE System Call Revived Now That It Might Have A User


  • #31
    Originally posted by indepe View Post
    I'd expect there will be io_uring support for readfile, and that asynchronous reading of multiple sysfs values will show an increased benefit for using readfile, and certainly be easier to use.
    Intel quoted their internal benchmark from 4-5 years ago, which means it has no relevance at all to a discussion happening today. Basically it might have been a regression, or just "bad performance" in general, that has already been optimized away in all this time.

    Originally posted by indepe View Post
    Maybe that's not where the header should be, but other performance-related APIs (io_uring) are also Linux-specific. The proposed use for sysfs certainly is Linux-specific in the first place.
    Reading sysfs in parallel on a 64-core platform is also such a tiny, meaningless use case that I would not even consider it. It falls into the same category as synthetic benchmarks. If your app is spending most of its time reading sysfs, then it probably isn't doing anything worthwhile.

    Comment


    • #32
      Originally posted by curfew View Post
      Intel quoted their internal benchmark from 4-5 years ago, which means it has no relevance at all to a discussion happening today. Basically it might have been a regression, or just "bad performance" in general, that has already been optimized away in all this time.
      I suppose "5.8.14" refers to the kernel version, which was released last month.

      Comment


      • #33
        Originally posted by indepe View Post
        Excuse my ignorance: why is it necessary to use two reads or even a loop, if you use a pre-allocated buffer large enough to hold the maximum expected size?

        EDIT: Or to put it differently, why would there be no internal API, and why is there no user-space API, that allows doing this with a single read call? Or why doesn't a result code allow distinguishing EOF from an incomplete read?

        EDIT 2: Comparing the above source for readfile by Greg KH with read/ksys_read in read_write.c suggests that read on Linux (in the current implementation) will always read the whole content of a file, even though the official semantics are that it doesn't necessarily do so. Is it perhaps valid to take advantage of this in Linux-only code? (Or to have two versions depending on platform?) Specifically, sysfs code would tend to be Linux-only.

        Let's say you want to implement readfile in user space for older Linux kernels, or for non-Linux systems.

        Let's say your file is 4096 bytes, but you don't know that.

        1) You open the file.

        2) You have a buffer, let's say 65536 bytes.

        3) You do a read with max of 65536 bytes to this buffer.

        4) read returns and tells you that 4096 bytes were copied to the buffer.

        Now what?

        You don't know whether you read the whole file, or whether the kernel just decided to fill only part of the buffer. In general you can't tell. Sometimes there are markers in the file that tell you, or you can try to parse the result and check that it looks OK, but often there aren't. I.e. if you read "1234" from sysfs, you don't know whether that is the full content. Usually there should be a newline at the end, but not all sysfs files have one.

        You need to do another read:

        5) You issue another read, with the same or another buffer, and now read returns 0.

        6) You know you have read everything, and you can close the file. (There is a sketch of this loop below.)
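
        A minimal sketch of that fallback in C: the function name and exact behavior are my own assumptions, not the proposed syscall. It simply loops until read returns 0 (EOF) or the buffer is full:

        Code:

        /* User-space readfile fallback: read until EOF or the buffer is full. */
        #include <errno.h>
        #include <fcntl.h>
        #include <unistd.h>

        ssize_t readfile_fallback(const char *path, char *buf, size_t bufsize)
        {
            int fd = open(path, O_RDONLY);
            if (fd < 0)
                return -1;

            size_t total = 0;
            while (total < bufsize) {
                ssize_t n = read(fd, buf + total, bufsize - total);
                if (n < 0) {
                    if (errno == EINTR)
                        continue;          /* interrupted by a signal: retry */
                    close(fd);
                    return -1;
                }
                if (n == 0)                /* EOF: now we know we have it all */
                    break;
                total += (size_t)n;
            }

            close(fd);
            return (ssize_t)total;         /* number of bytes actually read */
        }

        Note the cost being discussed: even when the first read already returned all 4096 bytes, the loop still has to call read one more time to get the 0 that proves that was everything.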

        Comment


        • #34
          Originally posted by markg85 View Post
          Making system monitoring less taxing

          We all know the irony in system monitoring: when you open the monitor, it itself is often the top CPU user. That's mainly because it needs to continuously load and parse tiny process files.
          This is just wrong. System monitoring is not taxing. At all.

          The reason why the system monitor process always shows up as the top cpu user is because it is active (using cpu / scheduled by the kernel) at that exact moment. Not because it uses a lot of CPU. It uses very little CPU. It's just that the process statistics are for one specific moment in time, and during that exact moment, the system monitor is the active CPU user (as it is literally the one obtaining the statistics), so clearly it will always be "active". It does it very fast and goes back to sleep, but you (obviously) never see what's going on while it is sleeping, it can only show you the data from when it was running.

          If you open multiple system monitors, they will refresh at different times, and each one will show itself as the top cpu user and the others as 0% / sleeping.

          Comment


          • #35
            Originally posted by baryluk View Post
            You don't know whether you read the whole file, or whether the kernel just decided to fill only part of the buffer. In general you can't tell. Sometimes there are markers in the file that tell you, or you can try to parse the result and check that it looks OK, but often there aren't. I.e. if you read "1234" from sysfs, you don't know whether that is the full content. Usually there should be a newline at the end, but not all sysfs files have one.
            Yes. However, why is that so? Why does the API which is most commonly used for this purpose not let you know that you have reached EOF after the first read? Why is there no alternative API? Why is there no additional variable like errno that gives you more information about why the read returned before the requested size was reached? Just because that is how it was done, and not done, maybe 50 years ago?

            Comment


            • #36
              Originally posted by tajjada View Post

              This is just wrong. System monitoring is not taxing. At all.

              The reason why the system monitor process always shows up as the top cpu user is because it is active (using cpu / scheduled by the kernel) at that exact moment. Not because it uses a lot of CPU. It uses very little CPU. It's just that the process statistics are for one specific moment in time, and during that exact moment, the system monitor is the active CPU user (as it is literally the one obtaining the statistics), so clearly it will always be "active". It does it very fast and goes back to sleep, but you (obviously) never see what's going on while it is sleeping, it can only show you the data from when it was running.

              If you open multiple system monitors, they will refresh at different times, and each one will show itself as the top cpu user and the others as 0% / sleeping.
              What you wrote is 100% true.

              Still, readfile will in fact provide an additional performance improvement that makes things like top, ps, htop, and collectd a bit faster and less taxing. It is always good for CPU power usage to be able to go back to sleep faster, or to get through big lists of processes more quickly. It is hard to say how important it will be, but it will be a nonzero amount.
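
              To make that concrete, here is a hedged sketch of how a monitoring tool might call the proposed syscall. It assumes a kernel and headers patched with Greg KH's readfile series, so __NR_readfile and the argument order below are assumptions taken from my reading of the posted patch; anywhere else it just reports ENOSYS:

              Code:

              /* Sketch only: readfile() is a proposed syscall, so __NR_readfile
               * will only exist on a patched kernel/headers. One call replaces
               * the usual open()/read()/close() sequence per procfs/sysfs file. */
              #include <errno.h>
              #include <fcntl.h>      /* AT_FDCWD */
              #include <stdio.h>
              #include <unistd.h>
              #include <sys/syscall.h>

              static ssize_t sys_readfile(int dirfd, const char *path,
                                          char *buf, size_t bufsize, int flags)
              {
              #ifdef __NR_readfile
                  return syscall(__NR_readfile, dirfd, path, buf, bufsize, flags);
              #else
                  (void)dirfd; (void)path; (void)buf; (void)bufsize; (void)flags;
                  errno = ENOSYS;
                  return -1;
              #endif
              }

              int main(void)
              {
                  char buf[4096];
                  /* e.g. one process's stat line, as top/ps would read per PID */
                  ssize_t n = sys_readfile(AT_FDCWD, "/proc/self/stat", buf, sizeof buf, 0);
                  if (n < 0) {
                      perror("readfile");
                      return 1;
                  }
                  printf("%.*s\n", (int)n, buf);
                  return 0;
              }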

              Comment


              • #37
                Originally posted by indepe View Post

                Yes. However, why is that so? Why does the API which is most commonly used for this purpose not let you know that you have reached EOF after the first read? Why is there no alternative API? Why is there no additional variable like errno that gives you more information about why the read returned before the requested size was reached? Just because that is how it was done, and not done, maybe 50 years ago?
                Because it is an extremely niche topic, not worth caring about in the majority of cases. Feel free to propose a new API for the kernel or glibc; nobody is stopping you.

                And yes, it is how things were done in Unix 50 years ago. Even DOS used a similar method, I think, and it probably borrowed it from other systems like CP/M or Unix. It would be nice to research. My guess is that on very early systems, machines had very little memory and very few registers, so providing an extra "variable" or return value to indicate these things was really costly.

                Comment


                • #38
                  Originally posted by baryluk View Post

                  Because it is an extremely niche topic, not worth caring about in the majority of cases. Feel free to propose a new API for the kernel or glibc; nobody is stopping you.

                  And yes, it is how things were done in Unix 50 years ago. Even DOS used a similar method, I think, and it probably borrowed it from other systems like CP/M or Unix. It would be nice to research. My guess is that on very early systems, machines had very little memory and very few registers, so providing an extra "variable" or return value to indicate these things was really costly.
                  In any case, I'd like to clarify that I do like the readfile API and functionality.

                  Sure, limited resources 50 years ago could be the reason. How about this: maybe the difficulty is coming up with an available variable name, so to speak. Otherwise it could be a variable like errno, for example operation_reached_eof, which has the values TRUE, FALSE, and UNKNOWN. For files it would be obvious, and for TCP/IP connections, for example, it would be TRUE if the connection was terminated. Missing implementations would always return UNKNOWN.

                  It seems (as I suggested above) that on current Linux, a read on a file would in effect set it to TRUE simply whenever there is no error yet the returned size is smaller than the requested size. (A rough sketch is below.)
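
                  Purely as an illustration of the idea (operation_reached_eof does not exist in any libc; the names and semantics here are hypothetical), such an indicator could be layered over read like this:

                  Code:

                  /* Hypothetical sketch: an errno-like EOF indicator on top of read(). */
                  #include <unistd.h>

                  enum eof_state { EOF_UNKNOWN, EOF_FALSE, EOF_TRUE };

                  /* thread-local, like errno */
                  static __thread enum eof_state operation_reached_eof = EOF_UNKNOWN;

                  ssize_t read_with_eof(int fd, void *buf, size_t count)
                  {
                      ssize_t n = read(fd, buf, count);
                      if (n == 0)
                          operation_reached_eof = EOF_TRUE;     /* definitely at end of file */
                      else if (n > 0 && (size_t)n < count)
                          operation_reached_eof = EOF_TRUE;     /* short read: EOF, per the current
                                                                   Linux regular-file behavior above */
                      else
                          operation_reached_eof = EOF_UNKNOWN;  /* error, or buffer exactly filled */
                      return n;
                  }

                  The short-read-implies-EOF rule only matches the regular-file behavior discussed above; for pipes and sockets a short read means no such thing, and EOF_FALSE can never be reported from user space at all, which is exactly why a real version of this would need help from the kernel.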
                  Last edited by indepe; 26 November 2020, 09:02 PM.

                  Comment


                  • #39
                    Originally posted by baryluk View Post

                    What you wrote is 100% true.

                    Still, readfile will in fact provide an additional performance improvement that makes things like top, ps, htop, and collectd a bit faster and less taxing. It is always good for CPU power usage to be able to go back to sleep faster, or to get through big lists of processes more quickly. It is hard to say how important it will be, but it will be a nonzero amount.
                    Oh yes, the new syscall would be very applicable to system monitoring tools and could be used to micro-optimize them (although I doubt this performance matters much in practice; it's quite negligible to begin with).

                    I just wanted to correct the other poster's misconception that system monitoring is taxing on the CPU. They seemed convinced that a system monitor is somehow using a lot of their CPU.

                    Comment


                    • #40
                      Originally posted by tajjada View Post

                      This is just wrong. System monitoring is not taxing. At all.

                      The reason why the system monitor process always shows up as the top cpu user is because it is active (using cpu / scheduled by the kernel) at that exact moment. Not because it uses a lot of CPU. It uses very little CPU. It's just that the process statistics are for one specific moment in time, and during that exact moment, the system monitor is the active CPU user (as it is literally the one obtaining the statistics), so clearly it will always be "active". It does it very fast and goes back to sleep, but you (obviously) never see what's going on while it is sleeping, it can only show you the data from when it was running.

                      If you open multiple system monitors, they will refresh at different times, and each one will show itself as the top cpu user and the others as 0% / sleeping.
                      Reading your response I'm like... WTF? You probably just totally misunderstood me.
                      I know that the monitoring software isn't visible when it's not open.
                      I'm not sure what you are trying to describe there.

                      When you open your monitor application, it itself is - at that point while it's open - a noticeable resource user. It is still very low, but you'll definitely notice it popping up.
                      Now with readfile, that statistics gathering just plain and simply takes fewer cycles.

                      Comment
