Announcement

**oiaohm** · 28 October 2017, 03:05 AM

Originally posted by coder View Post

The promise of a pure-userspace approach is no context switches or other syscall overhead. That becomes very significant, as the IOPS rate increases.

That said, I'm completely unfamiliar with the specifics of these implementations, so I can only opine in broad platitudes and generalities.

pmemfile traps syscalls. Trapping syscalls has a price: overhead on every syscall request of that type, including the ones you don't want to trap.

GitHub - pmem/syscall_intercept: The system call intercepting library

https://github.com/pmem/syscall_intercept


The intercept can end up interfering with CPU instruction caching and with the compiler optimisations that make code work well in the instruction cache. So the intercept can be more costly than doing a syscall with a context switch.

https://schd.ws/hosted_files/osseu17/01/OSS17_pmemfile-user_space_filesystem_on_persistent_memory.pdf

You also have multiple libraries in the mix, as documented on page 19. That means multiple jumps and more code that needs to be cached by the CPU.

This is one of those things: IOPS can go up by avoiding the kernel, but at times your IOPS will go down by avoiding the kernel, because you are forcing more low-efficiency paths and causing the CPU cache to have issues. This is why we need a proper benchmark between pmemfile and NOVA. It might turn out that the pmemfile method is harmful and of no benefit.

Userspace solutions are a double-edged sword, particularly when you are talking about hooking into things. Userspace also starts losing its shine when you start talking about multi-thread support, which pmemfile does not support.

**coder** · 28 October 2017, 10:49 AM

Originally posted by oiaohm View Post

pmemfile traps syscalls. Trapping syscalls has a price: overhead on every syscall request of that type, including the ones you don't want to trap.

The main cost should be during process initialization, which is where they scan for & modify all the libc syscalls. After that, there's a small cost, for unintercepted syscalls. Should be just a jump to their routine and a lookup to see whether to intercept it.

A bigger downside is they still save the architecture state, in the event of an intercept. This is not ideal, because it's getting quite big.
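To make the per-call cost concrete, here is a minimal Python sketch of the dispatch step described above. It is a toy model, not pmemfile or syscall_intercept code: the `InterceptingDispatcher` class, its `lookups` counter and the `/mnt/pmem/` prefix are hypothetical names chosen for illustration. What it shows is that once the call sites are patched, every call pays for the lookup, including the ones that end up forwarded to the kernel.

```python
class InterceptingDispatcher:
    """Toy stand-in for a patched syscall entry point (hypothetical, for illustration)."""

    def __init__(self, prefix):
        self.prefix = prefix   # paths under this prefix are handled in userspace
        self.lookups = 0       # counts how many calls paid the lookup cost

    def open(self, path):
        # Every call lands here first, whether or not it is intercepted.
        self.lookups += 1
        if path.startswith(self.prefix):
            return ("userspace", path)   # handled by the userspace file system
        return ("kernel", path)          # forwarded unchanged to the real syscall


dispatcher = InterceptingDispatcher("/mnt/pmem/")
```

Calling `dispatcher.open("/etc/hosts")` is forwarded to the kernel path but still increments `lookups`, which is the small fixed cost being discussed. The sketch says nothing about how large that cost is on real hardware.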

Originally posted by oiaohm View Post

The intercept can end up interfering with CPU instruction caching and with the compiler optimisations that make code work well in the instruction cache. So the intercept can be more costly than doing a syscall with a context switch.

You're really blowing that out of proportion, IMO. Is this based on any experimental data, or why do you think that?

Originally posted by oiaohm View Post

This is why we need a proper benchmark between pmemfile and NOVA

If someone has 3D XPoint DIMMs, then absolutely show us the data! My main point is that, until then, we can't see the full impact.

Originally posted by oiaohm View Post

Userspace solutions are a double-edged sword, particularly when you are talking about hooking into things.

Userspace also starts losing its shine when you start talking about multi-thread support, which pmemfile does not support.

Careful. Don't focus too much on this one hack. There are other ways to use their filesystem, such as by directly linking to libpmemfile-posix. In the future, perhaps they could provide a modified libc, or libc could be modified specifically to support hooks for userspace filesystems.

It's early days, yet.

BTW, I do have security concerns about a 100% userspace solution of any form. In the long run, perhaps you cannot have a fully userspace solution that's truly secure. But perhaps it's still possible to avoid the majority of syscalls made today.

**oiaohm** · 28 October 2017, 12:46 PM

Originally posted by coder View Post

The main cost should be during process initialization, which is where they scan for & modify all the libc syscalls. After that, there's a small cost, for unintercepted syscalls. Should be just a jump to their routine and a lookup to see whether to intercept it.

A bigger downside is they still save the architecture state, in the event of an intercept. This is not ideal, because it's getting quite big.

You're really blowing that out of proportion, IMO. Is this based on any experimental data, or why do you think that?

These intercept-style file systems predate FUSE, so the idea is not new and neither are the issues. Yes, FUSE, where you switch to the kernel and back to userspace, had less overhead than some of the intercept-style file systems that came before it. Messing up the CPU caches and taking a performance hit from that is a known problem of this style.

Originally posted by coder View Post

Careful. Don't focus too much on this one hack. There are other ways to use their filesystem, such as by directly linking to libpmemfile-posix. In the future, perhaps they could provide a modified libc, or libc could be modified specifically to support hooks for userspace filesystems.

It's early days, yet.

It's not really. This is all stuff that was done prior to 2005, before FUSE. The problem you are missing is how often file structures are used. Directly linking against libpmemfile-posix, and designing for it, is closer to using a direct memory management solution and would avoid putting extra load on a very hot path.

memfd_create(2) - Linux manual page

http://man7.org/linux/man-pages/man2/memfd_create.2.html

Linux does not just use files for disk structures. It also uses them for memory structures you pass between applications when doing multi-threading/multi-process work.
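As a small illustration of a file descriptor that is pure memory, here is a sketch using Python's `os.memfd_create`, which wraps the `memfd_create(2)` call linked above (Linux only, Python 3.8 or later). The helper name `roundtrip_via_memfd` is made up for this example. The descriptor it creates is a normal kernel file descriptor with no file on disk behind it. Handing it to another process is not shown here.

```python
import os


def roundtrip_via_memfd(payload: bytes) -> bytes:
    """Write payload into an anonymous in-memory file and read it back through its fd."""
    fd = os.memfd_create("demo")      # kernel-backed fd, no path on any file system
    try:
        os.write(fd, payload)         # ordinary file syscalls work on it
        os.lseek(fd, 0, os.SEEK_SET)  # rewind before reading
        return os.read(fd, len(payload))
    finally:
        os.close(fd)
```

The calls involved are the same `write`, `lseek` and `read` that a file system hook would sit in front of, which is the hot path the post is worried about.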

Prior to 2005, file usage under Linux was nowhere near as heavy as it is today. Because of those other uses, the rate at which you can call the Linux kernel for file operations can be just as critical as the speed of the file system itself.

So yes, the idea of a modified libc or included hooks is going to cause some trouble, because it will still make a very hot path longer. Gaining IOPS for one action while taking IOPS away from other actions that use the same interfaces, as will be the case here, means you may not have any total performance gain at all, and at worst you have gone backwards.

So the path that does not risk going backwards would be coding to use libpmemfile-posix directly.

Originally posted by coder View Post

BTW, I do have security concerns about a 100% userspace solution of any form. In the long run, perhaps you cannot have a fully userspace solution that's truly secure. But perhaps it's still possible to avoid the majority of syscalls made today.

https://www.snia.org/sites/default/files/PM-Summit/2017/presentations/Coughlan_Tom_PM_in_Linux.pdf

For pmem hardware, there is an emulation in RAM that you can use. That is in fact faster than a 3D XPoint DIMM.

The problem here is that file descriptors are used for securely passing data between applications. Hooking syscalls for file actions by any means, be it a modified libc or syscall intercept, can have a lot of unintended effects. It is playing on a very hot code path, with features that are used for lots of things.

For example: it's a file descriptor, so I can send this block to be displayed straight on screen. Oops, it's a pmemfile item that the Linux kernel does not know about. So this is really playing with fire. You can see it happening: add libpmemfile for disk access speed, then add OpenCL for processing speed, and then watch both sides fight like cats.

I guess that is another thing most would not consider: accessing accelerated processing will in a lot of cases lead back to requiring file descriptors that the Linux kernel knows about, and libpmemfile will be harming that code path by hooking syscalls or even through a modified libc. Even using libpmemfile-posix is not without its risk of issues.

The idea of "hey, you can use this with unmodified code" is asking for trouble. If you have to modify the code anyhow, go with libpmemfile-posix, avoid messing with the hot path, and have a chance of getting compiler errors when you use the wrong file handle with the wrong thing.

I see no place for intercept-style userspace file systems. They are trouble, and developers have not learnt from history on this point and repeat the same basic set of mistakes. FUSE was not made for no reason: most of the prior ways of doing userspace file systems had been gone for over 10 years before someone, in this case, got the idea of bringing one of those designs back.

I can see where compiler-based userspace file systems could have a place, because they can avoid messing with the hot paths for file actions into kernel space and can detect where you have crossed userspace-generated file handles with kernel-generated file handles before bad things happen.

**mslusarz** · 28 October 2017, 04:01 PM

pmemfile developer here.

There's not much data about applications and performance comparisons because we just barely got to the state when it makes sense to try running applications. And pmemfile is not even that optimized yet.

Originally posted by starshipeleven View Post

But it is a "dumb" way of using such new technology, as it still basically uses the thing as storage only, not as RAM/storage at the same time.

It's not "dumb". It's just another way to approach the problem of transitioning from a world built on block storage to a world where storage is byte-addressable.

Originally posted by oiaohm View Post

Userspace solutions are a double-edged sword, particularly when you are talking about hooking into things. Userspace also starts losing its shine when you start talking about multi-thread support, which pmemfile does not support.

Pmemfile supports multi-threading. It's multi-process support that is missing.
And because of that, pmemfile is more of an application accelerator than a general-purpose file system. We do our best to make it general purpose, but multi-process is a pretty big obstacle.

Originally posted by coder View Post

BTW, I do have security concerns about a 100% userspace solution of any form. In the long run, perhaps you cannot have a fully userspace solution that's truly secure. But perhaps it's still possible to avoid the majority of syscalls made today.

Pmemfile is definitely not secure. We implement chmod, chown & stuff, but it's only there for applications that absolutely need those syscalls. If you know pmemfile is underneath, you can do whatever you want.

Originally posted by oiaohm View Post

These intercept-style file systems predate FUSE, so the idea is not new and neither are the issues. Yes, FUSE, where you switch to the kernel and back to userspace, had less overhead than some of the intercept-style file systems that came before it. Messing up the CPU caches and taking a performance hit from that is a known problem of this style.

We have pmemfile-fuse in tree (which calls libpmemfile-posix), so you can see for yourself how much slower FUSE is. I don't have the data in front of me, but IIRC in some benchmarks the FUSE version is MUCH slower (~10 times slower on DRAM-emulated pmem, and this factor shouldn't change much on real hw). FUSE is only for toys.

**coder** · 28 October 2017, 08:46 PM

Originally posted by mslusarz View Post

pmemfile developer here.

Thanks for your comments.

pmemfile is clever, but seems mostly transitional. I look forward to seeing which direction you take, in the longer term. I like the idea of a filesystem interface to persistent memory, however - even if that's not the primary usage model.

**mslusarz** · 29 October 2017, 10:01 AM

Originally posted by coder View Post

Thanks for your comments.

pmemfile is clever, but seems mostly transitional. I look forward to seeing which direction you take, in the longer term. I like the idea of a filesystem interface to persistent memory, however - even if that's not the primary usage model.

Yes, it is transitional.

At Intel we want to make this transition as easy as possible. That's why we provide helper libraries (like libpmemobj - IMO the best library that helps with writing native persistent memory applications), examples (there's a lot of them in the NVML repo) and sometimes reimplement backends using those helper libraries (pmemfile, pmemkv, pmse, experimental Redis port). Anything more advanced will have to be done by the community and ISVs.

All of this is needed because writing against persistent memory is not easy. Not because of Intel's particular hw implementation, but because there are new challenges when dealing with memory that is persistent. Some examples to think about: What do you do when your data structures are in persistent memory and there's power loss in the middle of an algorithm which requires updating multiple pointers? What happens to allocated, but not yet linked anywhere, memory? How do you even allocate memory in a fail-safe way? All of this (and more) is solved by libpmemobj.
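To give a feel for the multi-pointer problem, here is a toy Python model of an undo-log transaction. This is only a sketch of the general technique and makes no claim about how libpmemobj is implemented or what its API looks like. The `ToyPersistentHeap` class and its method names are invented for the example, and a plain dict stands in for persistent memory.

```python
_MISSING = object()  # marks "this key did not exist before the transaction"


class ToyPersistentHeap:
    """Dict standing in for persistent memory, with an undo log kept next to it."""

    def __init__(self):
        self.cells = {}
        self.undo_log = []  # (key, old_value) pairs, assumed to survive a crash

    def tx_write(self, key, value):
        # Record the old value first, then modify the cell in place.
        self.undo_log.append((key, self.cells.get(key, _MISSING)))
        self.cells[key] = value

    def commit(self):
        # Once the log is dropped, the new values are the official state.
        self.undo_log.clear()

    def recover(self):
        # Run after a crash: roll back uncommitted writes, newest first.
        while self.undo_log:
            key, old = self.undo_log.pop()
            if old is _MISSING:
                self.cells.pop(key, None)  # also discards never-linked new cells
            else:
                self.cells[key] = old
```

If a list insert has written the new node and redirected one pointer when power is lost, calling `recover()` puts both cells back to their state at the last commit, so the structure is never left half-updated.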

**name99** · 29 October 2017, 01:49 PM

Originally posted by schmidtbag View Post

So if I understand this correctly, it basically creates a RAM disk with persistent storage?

Saying this omits everything that makes this a non-trivial project.

(a) RAM disks (by definition!) do not have to worry about persistence. They don't have to care about ACID and what happens if power is lost partway through an update.
Persistent storage does have to care about this, and that makes all the difference. In particular it means that every write to persistent storage has to be structured on the assumption that it might fail partway, meaning that you need to implement some sort of "backup" mechanism. One solution to this might be logs, a different solution might be to construct a secondary tree of changes then, as the very last step in a change, swap the roots. There are various options, but they're all complicated and nevertheless have to be implemented.
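The second option mentioned above, building the changes on the side and swapping the root as the very last step, can be sketched in a few lines of Python. This is a toy model under simplifying assumptions: `ShadowStore` is a made-up name, a dict stands in for the tree, and the `crash_before_publish` flag simulates power loss before the final write.

```python
class ShadowStore:
    """Toy shadow-copy update: readers only ever see a fully built version."""

    def __init__(self, data):
        self.root = dict(data)  # the currently published version

    def update(self, changes, crash_before_publish=False):
        shadow = dict(self.root)  # build the new version off to the side
        shadow.update(changes)
        if crash_before_publish:
            return                # simulated power loss: old root is untouched
        self.root = shadow        # one final write publishes every change at once
```

Either all of `changes` become visible or none do, because the only write that matters to readers is the last assignment to `root`.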

OK, so you have your carefully structured writes. Even so, what makes them work is the ordering of the writes, so that the very last write (the one that "publishes" the changes to the rest of the universe) only occurs after every structural backup write has been committed. Which means you need a way to enforce write ordering all the way to persistence.
Clearly this is a poorly understood issue (witness the hash most SATA disks have made of this, and the constant arguments about how macOS has tried to deal with that hash).

Anyway, in the case of disks, this write ordering is performed through the use of OS calls (which hopefully can get translated into some sort of disk controller call...) which force waiting till persistence is guaranteed.
In the case of persistent RAM this model no longer works because there is no OS mediating between the CPU and writes to the nvRAM --- by definition the nvRAM looks like RAM!
Which means that writes to the data structures of the file system on that nvRAM would, by default, be pushed out to the nvRAM at random times, in random order, as the various levels of cache decide to flush various lines. And so your careful write ordering is lost.
So you have to design a write ordering that is now appropriate for this new transport model (cache lines going out to nvRAM, rather than disk sectors going through a driver to disk) and you need to use the appropriate barriers provided by the CPU (demands that certain lines get flushed to persistence NOW).
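The ordering problem can be modelled with a toy write-back cache in Python. Everything here is hypothetical and simplified: `ToyCache`, `publish` and the `"data"`/`"valid"` addresses are names made up for the sketch, and `flush` stands in for whatever flush-plus-fence sequence the CPU provides (on x86, instructions such as CLWB followed by SFENCE play this role).

```python
class ToyCache:
    """Stores land in a volatile cache; only flushed lines reach persistence."""

    def __init__(self):
        self.persistent = {}  # what survives power loss
        self.dirty = {}       # cached stores not yet written back

    def store(self, addr, value):
        self.dirty[addr] = value

    def flush(self, addr):
        # Stand-in for a cache-line flush plus fence: force one line to persistence.
        if addr in self.dirty:
            self.persistent[addr] = self.dirty.pop(addr)

    def crash(self):
        # Power loss: anything still in the cache is gone.
        self.dirty.clear()
        return dict(self.persistent)


def publish(cache, payload):
    cache.store("data", payload)
    cache.flush("data")        # data must be durable before the flag is set
    cache.store("valid", True)
    cache.flush("valid")       # only now is the update published
```

Without the first flush, the cache is free to write back `"valid"` before `"data"`, and a crash in between leaves a flag that points at data which never arrived.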

(b) The above is required for correctness, but could be retrofitted to any existing file system. BUT nvRAM differs from disks in that the minimum addressable unit is the byte (or, if you prefer, the minimum practical unit of manipulation is the cache line). Either way, the fact that you can write data on a much smaller granularity means that it's worth reconsidering the fundamental data structures of the file system, skewing very much more towards "in-RAM" type data structures rather than "on-disk" type data structures (so small B-tree nodes rather than large B-tree nodes, or perhaps even something more like binary trees than B-trees, or even hash tables rather than trees.)
This is the second side of the problem. RAM disks, of course, could likewise have used optimal data structures, but never did --- no-one redesigned an FS from scratch for RAM, rather they just wrote a simple block driver that treated a block of RAM as a disk, and used that RAM as an array of sector-sized blocks.

The point is not to create a RAM disk. It's to create an FS that
- is faster because it makes optimal use of the target HW (which, for Intel, is of course Optane)
- is correct (ACID) in the face of any possible failure mode.

**name99** · 29 October 2017, 02:03 PM

Originally posted by oiaohm View Post

These intercept-style file systems predate FUSE, so the idea is not new and neither are the issues. Yes, FUSE, where you switch to the kernel and back to userspace, had less overhead than some of the intercept-style file systems that came before it. Messing up the CPU caches and taking a performance hit from that is a known problem of this style.

It's not really. This is all stuff that was done prior to 2005, before FUSE. The problem you are missing is how often file structures are used. Directly linking against libpmemfile-posix, and designing for it, is closer to using a direct memory management solution and would avoid putting extra load on a very hot path.

memfd_create(2) - Linux manual page

http://man7.org/linux/man-pages/man2/memfd_create.2.html

Linux does not just use files for disk structures. It also uses them for memory structures you pass between applications when doing multi-threading/multi-process work.

Prior to 2005, file usage under Linux was nowhere near as heavy as it is today. Because of those other uses, the rate at which you can call the Linux kernel for file operations can be just as critical as the speed of the file system itself.

So yes, the idea of a modified libc or included hooks is going to cause some trouble, because it will still make a very hot path longer. Gaining IOPS for one action while taking IOPS away from other actions that use the same interfaces, as will be the case here, means you may not have any total performance gain at all, and at worst you have gone backwards.

So the path that does not risk going backwards would be coding to use libpmemfile-posix directly.

https://www.snia.org/sites/default/files/PM-Summit/2017/presentations/Coughlan_Tom_PM_in_Linux.pdf

For pmem hardware, there is an emulation in RAM that you can use. That is in fact faster than a 3D XPoint DIMM.

The problem here is that file descriptors are used for securely passing data between applications. Hooking syscalls for file actions by any means, be it a modified libc or syscall intercept, can have a lot of unintended effects. It is playing on a very hot code path, with features that are used for lots of things.

For example: it's a file descriptor, so I can send this block to be displayed straight on screen. Oops, it's a pmemfile item that the Linux kernel does not know about. So this is really playing with fire. You can see it happening: add libpmemfile for disk access speed, then add OpenCL for processing speed, and then watch both sides fight like cats.

I guess that is another thing most would not consider: accessing accelerated processing will in a lot of cases lead back to requiring file descriptors that the Linux kernel knows about, and libpmemfile will be harming that code path by hooking syscalls or even through a modified libc. Even using libpmemfile-posix is not without its risk of issues.

The idea of "hey, you can use this with unmodified code" is asking for trouble. If you have to modify the code anyhow, go with libpmemfile-posix, avoid messing with the hot path, and have a chance of getting compiler errors when you use the wrong file handle with the wrong thing.

I see no place for intercept-style userspace file systems. They are trouble, and developers have not learnt from history on this point and repeat the same basic set of mistakes. FUSE was not made for no reason: most of the prior ways of doing userspace file systems had been gone for over 10 years before someone, in this case, got the idea of bringing one of those designs back.

I can see where compiler-based userspace file systems could have a place, because they can avoid messing with the hot paths for file actions into kernel space and can detect where you have crossed userspace-generated file handles with kernel-generated file handles before bad things happen.

You seem to be missing the point.
As I understand it, the point of this work is to investigate the issues I described above --- ACID and performance for nvRAM. The wrapper mechanism they have chosen for this (intercepting FS calls and operating in user space) is clearly far easier than writing experimental code in kernel space, while still achieving these particular goals.

Obviously there are a whole set of ADDITIONAL goals that a final solution will need, including things like a security model and a sharing model. One can imagine a variety of ways these might develop, based on a starting point that looks a lot more like IPC, shared pages, and page permissions, than the current FS security models. But that's an orthogonal problem, and you don't fault someone trying to solve problem A by complaining that they're not solving the very different problem B.

**oiaohm** · 29 October 2017, 02:56 PM

Originally posted by name99 View Post

You seem to be missing the point.
As I understand it, the point of this work is to investigate the issues I described above --- ACID and performance for nvRAM. The wrapper mechanism they have chosen for this (intercepting FS calls and operating in user space) is clearly far easier than writing experimental code in kernel space, while still achieving these particular goals.

Obviously there are a whole set of ADDITIONAL goals that a final solution will need, including things like a security model and a sharing model. One can imagine a variety of ways these might develop, based on a starting point that looks a lot more like IPC, shared pages, and page permissions, than the current FS security models. But that's an orthogonal problem, and you don't fault someone trying to solve problem A by complaining that they're not solving the very different problem B.

FUSE providing a kernel-mode wrapper for prototype file systems was meant to prevent a stack of nightmare issues. Intercepting FS calls in userspace is in fact a nightmare, because glibc and GCC/LLVM are free to break internal backwards compatibility at any time.

Just because something is easier does not mean it's safe to do it. The issues with a userspace file system, performance overhead or performance overhead combined with unstable connectivity, make that path not very viable in most cases. Also, when you move to an IPC model, the inter-process switching means you are crossing the kernel anyhow. You have to go to an IPC model to enable multi-process/thread support. So all those context switches saved by blocking syscalls quickly start disappearing as well.

None of the problems I listed apply to NOVA, a kernel-space driver. NOVA is also experimental code written in kernel space; it's smaller and more compact than pmemfile and does more. In their development, the NOVA authors exploited the fact that User-mode Linux exists. That is a Linux kernel built to run in userspace, so you can debug something that uses a pmem device in userspace even though it's a kernel driver.

Having to write more lines of code to perform a task is not taking the easy path. So your assumption that writing a kernel-mode driver is hard is wrong. When it comes to file systems, being in the Linux kernel gives you a lot of generic parts you can use straight away, reducing your lines of code and your risk of errors.

name99, given the known list of faults that userspace drivers suffer from, I would have liked to have seen proper benchmarks against NOVA, which is a file system designed for pmem devices, not ext4-dax, which still contains code to deal with older, slower hard drives and so is not going to perform as well.

Given that what they are doing is the harder path, you have to ask to see a properly demonstrated benefit. If pmemfile's file intercept is faster than NOVA, spending more on that path in the future might be worth it. If pmemfile's file intercept is slower than NOVA, it would be wise to drop that path now.

It is very important with file systems to benchmark like against like. ext4-dax is a generic file system for multiple device types; pmemfile is not. NOVA is a specialist file system for pmem devices, and pmemfile is a specialist file system for pmem devices. So NOVA vs pmemfile would be like vs like.

name99, you are using "if I build it, they will come" logic. Yes, there is a long list of additional goals, but they are absolutely worthless if the basic foundation cannot demonstrate that it is properly faster than a kernel-mode driver of the same type. What was presented did not demonstrate that.

That brings up two questions. Did they not do the benchmark? Or, worse, did they do the benchmark, find that it underperforms, and, because admitting the path was a failure would have meant trouble with the boss, were they deceptive in the presentation?

name99, it is not safe to give people an easy pass when they have not performed the right benchmarks.

**coder** · 30 October 2017, 03:02 AM

Originally posted by oiaohm View Post

Also, when you move to an IPC model, the inter-process switching means you are crossing the kernel anyhow. You have to go to an IPC model to enable multi-process/thread support. So all those context switches saved by blocking syscalls quickly start disappearing as well.

Not really. If we're talking about persistent storage that's fully mapped into the physical address range, then you can use page tables to grant concurrent access to different subtrees of the filesystem. Occasionally, as with dynamically-allocated memory, a process might need more memory or to access structures outside its current arena. That's when the kernel might get involved.

If you're talking about legitimate IPC (i.e. communicating via the same filesystem structures from multiple processes), rather than simply overcoming the current single-process limitation, then you might be right that the kernel could still get involved the same number of times. But that's more of a corner case and not a good reason to route all filesystem operations through the kernel.

IMO, the long-term goal shouldn't be to completely remove the kernel from the picture. If you want a solution that's fast and secure, you just need to make persistent memory access work a bit more like dynamic memory access.

Intel Has Been Working On A New User-Space File-System For Persistent Memory