Announcement

**tildearrow** · 28 September 2022, 03:52 PM

Originally posted by phoronix

Google engineer Daeho Jeong has been working on the new F2FS_IOC_START_ATOMIC_REPLACE ioctl for atomically replacing the entire coments of a file.

So files are forum threads.

**sinepgib** · 28 September 2022, 04:42 PM

It's unclear to me what's exactly the advantage of this. Typically we'd just create a new file and move it as with any other filesystem. Is the intent to avoid creating extra metadata in the logs?

**zcansi** · 28 September 2022, 04:55 PM

Originally posted by sinepgib

It's unclear to me what's exactly the advantage of this. Typically we'd just create a new file and move it as with any other filesystem. Is the intent to avoid creating extra metadata in the logs?

I have the same question. One thing which might be different is the behavior of existing open file handles; using the rename method existing file handles would keep pointing to the old file. But... if an existing process is halfway through reading the file, naively "atomically updating" the file contents would still result in the process seeing half of one version and then the other. So I would hope it has some other behavior than that.

Edit: I've long thought it would be nice to have a filesystem API which is explicitly snapshot-based, so you can ask for either "keep this exact version open even if the file on disk is replaced or overwritten" or "invalidate this handle if the file is changed", with an API to explicitly upgrade the handle to the latest version. Sadly, no matter how many times I read "atomic" or "snapshot" in fs patches, it never seems to be that.
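To make the two behaviours discussed above concrete, here is a minimal sketch (not taken from the F2FS patches) of what a reader that is halfway through a file sees when the file is updated by rename versus overwritten in place. The file names, the sample contents, and the helper `read_half_then_update` are made up for illustration, and the sketch assumes a POSIX filesystem.

```python
import os
import tempfile

OLD = b"old-old-old"
NEW = b"NEW-NEW-NEW"


def read_half_then_update(directory, use_rename):
    """Read part of a file, update it, then finish reading on the same handle.

    Returns (bytes seen through the old handle, bytes seen by a fresh open).
    """
    path = os.path.join(directory, "data.txt")
    with open(path, "wb") as f:
        f.write(OLD)

    # Unbuffered descriptor, so Python does not read ahead behind our back.
    fd = os.open(path, os.O_RDONLY)
    try:
        seen = os.read(fd, 4)  # the reader is now "halfway through"
        if use_rename:
            # Rename method: a new inode takes over the name, while the
            # open descriptor keeps pointing at the old inode.
            tmp_path = path + ".new"
            with open(tmp_path, "wb") as f:
                f.write(NEW)
            os.replace(tmp_path, path)
        else:
            # In-place overwrite: same inode, so the reader sees the change.
            with open(path, "r+b") as f:
                f.write(NEW)
        seen += os.read(fd, len(OLD))
    finally:
        os.close(fd)

    with open(path, "rb") as f:
        return seen, f.read()
```

With `use_rename=True` the old handle finishes reading the old contents; with `use_rename=False` it ends up with a mix of both versions, which is the "half of one version and then the other" case. Whether the new ioctl behaves like either of these for existing handles is exactly the open question here.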

**andreano** · 28 September 2022, 05:40 PM

Originally posted by sinepgib

It's unclear to me what's exactly the advantage of this. Typically we'd just create a new file and move it as with any other filesystem. Is the intent to avoid creating extra metadata in the logs?

On the other hand, the lack of transactions in filesystems is a bit glaring since databases have had them for ages. Current filesystem semantics even make it superhard to make transactions on top of. Avoiding journal/log writes sounds like a nice benefit – interesting that this happens in F2FS.

The procedure is not precisely "write a new file and move it over", but "write a new file, remember to fdatasync it (otherwise it may still be an empty file after a crash), and only then move it over". And this "correct" procedure is far from ideal: what we need is a write barrier that specifically guards against reordering, but what we have is a blocking flush-everything-all-the-way-down-to-nonvolatile-memory-right-now call that punishes doing it correctly, to the point where users put their workloads on tmpfs and implement their own syncing to work around it.
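For readers who have not written it out before, the "correct" procedure looks roughly like the sketch below. The helper name `replace_file_atomically` is hypothetical, and `os.fsync` stands in for fdatasync because it is available on more platforms; the final directory sync is an extra step commonly added so that the rename itself is durable.

```python
import os
import tempfile


def replace_file_atomically(path, data):
    """Replace the contents of `path` via write-temp, sync, rename."""
    directory = os.path.dirname(os.path.abspath(path))
    # The temporary file lives in the same directory so that the rename
    # stays within a single filesystem.
    fd, tmp_path = tempfile.mkstemp(dir=directory, prefix=".tmp-")
    try:
        with os.fdopen(fd, "wb") as tmp:
            tmp.write(data)
            tmp.flush()
            # The blocking flush complained about above: without it, the
            # rename may become durable before the data does.
            os.fsync(tmp.fileno())
        # Atomically point the name at the new inode.
        os.replace(tmp_path, path)
    except BaseException:
        # Do not leave a stray temporary file behind on failure.
        try:
            os.unlink(tmp_path)
        except FileNotFoundError:
            pass
        raise

    # Sync the directory so the rename itself survives a crash.
    dir_fd = os.open(directory, os.O_RDONLY)
    try:
        os.fsync(dir_fd)
    finally:
        os.close(dir_fd)
```

Note that there is no way to express "order these writes, but do not wait for them" in this sketch; the only tool available is the blocking sync.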

**Compholio** · 28 September 2022, 10:48 PM

Originally posted by sinepgib

It's unclear to me what's exactly the advantage of this. Typically we'd just create a new file and move it as with any other filesystem. Is the intent to avoid creating extra metadata in the logs?

The "typical" append logfile behavior involves leaving the log file open and appending when necessary. When you log this way and then logrotate comes by and rotates your logs it needs to copy the contents and then truncate the original file so that your append operation doesn't append to the rotated log. If between the copy and the truncate there is a log message added to the log then that message will be lost.
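The lost-message window can be reproduced deterministically. The sketch below is only a simulation under stated assumptions: it does not run logrotate, the file names and the function `copytruncate_with_late_write` are made up, and the "late" write is placed between the copy and the truncate by hand instead of by a real race.

```python
import os
import shutil
import tempfile


def copytruncate_with_late_write(directory):
    """Simulate copy-then-truncate rotation with a write landing in between.

    Returns (contents of the rotated log, contents of the live log).
    """
    log_path = os.path.join(directory, "app.log")
    rotated_path = os.path.join(directory, "app.log.1")

    # The logger keeps the file open in append mode for its whole lifetime.
    logger = open(log_path, "a")
    logger.write("message 1\n")
    logger.flush()

    # Rotation step 1: copy the current contents aside.
    shutil.copyfile(log_path, rotated_path)

    # A message arrives after the copy but before the truncate.
    logger.write("message 2\n")
    logger.flush()

    # Rotation step 2: truncate the original file.
    os.truncate(log_path, 0)

    # Later messages land in the now-empty file, since the handle appends.
    logger.write("message 3\n")
    logger.flush()
    logger.close()

    with open(rotated_path) as f:
        rotated = f.read()
    with open(log_path) as f:
        current = f.read()
    return rotated, current
```

After the call, "message 2" appears in neither file: it was written too late for the copy and then wiped by the truncate.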

**sinepgib** · 28 September 2022, 11:44 PM

Originally posted by Compholio

The "typical" append logfile behavior involves leaving the log file open and appending when necessary. When you log this way and then logrotate comes by and rotates your logs it needs to copy the contents and then truncate the original file so that your append operation doesn't append to the rotated log. If between the copy and the truncate there is a log message added to the log then that message will be lost.

I'm talking about logs as in F2FS being a log-structured filesystem...
Regarding logfiles+logrotate, there's little you can do about the brokenness of having those functionalities separate.

**yump** · 29 September 2022, 04:04 AM

Originally posted by sinepgib

I'm talking about logs as in F2FS being a log-structured filesystem...
Regarding logfiles+logrotate, there's little you can do about the brokenness of having those functionalities separate.

Terrible no-good yucky idea, inspired by catp: use ptrace or an ebpf helper to intercept and buffer writes to the log file while the copy&truncate is conducted.

**cl333r** · 29 September 2022, 06:48 AM

Originally posted by zcansi

I have the same question. One thing which might be different is the behavior of existing open file handles; using the rename method existing file handles would keep pointing to the old file. But... if an existing process is halfway through reading the file, naively "atomically updating" the file contents would still result in the process seeing half of one version and then the other. So I would hope it has some other behavior than that.

Not an expert, but I think it's an unsolvable issue. Barely any app detects that a file has been renamed or updated while it is being read, because in most cases it's not practical to bother. When you really need that, you use a database instead, which has an API and all the guarantees. But I like the idea behind the atomic replace nonetheless.

Originally posted by zcansi

Edit: I've long thought it would be nice to have a filesystem API which is explicitly snapshot-based, so you can ask for either "keep this exact version open even if the file on disk is replaced or overwritten" or "invalidate this handle if the file is changed", with an API to explicitly upgrade the handle to the latest version. Sadly, no matter how many times I read "atomic" or "snapshot" in fs patches, it never seems to be that.

Imho it's too much to ask, a db hybrid. Let's first fix the broken inotify (Linux's filesystem notification system), where it's impossible to properly deal with renamed files in user space: the rename operation is atomic, yet inotify ships it as two separate messages (events), ships them in any order, and they only share a common cookie. Sometimes you don't get the starting or ending event at all, because the file was moved to or from another directory that you're not watching.
Ironically, the Linux kernel deals with renames internally in one shot, yet splits the rename in two when broadcasting it through the inotify API, an incredible API design blunder if you ask me. This makes it incredibly hard to deal with file renames properly, and probably not possible at all.

So before undertaking a huge task like fixing the POSIX file system, let's first try fixing something that should be much easier (inotify) and see how it goes.
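To illustrate the cookie matching, here is a rough sketch of what user space has to do. It does not talk to inotify at all: it assumes the events have already been decoded into hypothetical `(mask, cookie, name)` tuples, and `pair_moves` is a made-up helper. The two mask values are the ones defined in `linux/inotify.h`.

```python
IN_MOVED_FROM = 0x00000040  # file was moved out of a watched directory
IN_MOVED_TO = 0x00000080    # file was moved into a watched directory


def pair_moves(events):
    """Match move events by cookie, in whatever order they arrive.

    `events` is an iterable of (mask, cookie, name) tuples.
    Returns (renames, moved_out, moved_in), where `renames` is a list of
    (old_name, new_name) pairs and the other two lists hold the names of
    events whose other half never showed up.
    """
    pending = {}  # cookie -> (kind, name), waiting for the other half
    renames = []
    for mask, cookie, name in events:
        if mask & IN_MOVED_FROM:
            kind = "from"
        elif mask & IN_MOVED_TO:
            kind = "to"
        else:
            continue  # not a move event
        other = pending.pop(cookie, None)
        if other is None or other[0] == kind:
            # First half of a pair (or an unexpected repeat): keep waiting.
            pending[cookie] = (kind, name)
        elif kind == "to":
            renames.append((other[1], name))
        else:
            renames.append((name, other[1]))

    moved_out = [n for k, n in pending.values() if k == "from"]
    moved_in = [n for k, n in pending.values() if k == "to"]
    return renames, moved_out, moved_in
```

The sketch gets to cheat by looking at a finished batch of events. A real watcher cannot know whether the missing half is merely late or will never come because the other directory is not watched, which is the core of the complaint.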

**linuxgeex** · 29 September 2022, 10:18 PM

Originally posted by sinepgib

It's unclear to me what's exactly the advantage of this. Typically we'd just create a new file and move it as with any other filesystem. Is the intent to avoid creating extra metadata in the logs?

I came here meaning to ask the same question, but a few seconds in, it occurred to me that this ioctl should be part of the VFS, and each filesystem should implement it in the most efficient manner for that FS.

TL;DR: Yes, I think this avoids unwanted metadata updates, but also: it can avoid unwanted data updates, especially in COW scenarios where blocks can be shared between the original and updated file data; it's a simpler API for userspace than traditional file atomics; and performance can be much higher.

Background for other readers: Normally we do file atomics by creating a new inode, modifying its data to the new state we want, then making the directory entry point to the new inode, usually by renaming the new file over the old one (or with locks and barriers, but the question is re: renames). This requires the FS to do a lot of work preparing the (temporary) inode. If the changes exceed the VFS/mount's write-back caching interval, or some API forces unwanted fsyncs on the temporary file, it can hurt performance and cause write amplification. And then there's the COW scenario, where you might be modifying just a few bytes of a terabyte-sized file, and using rename would be pathological.

F2FS Preparing Support For Atomic Replace