Intel Has Been Working On A New User-Space File-System For Persistent Memory


  • oiaohm
    replied
    Originally posted by mslusarz View Post
    You are not paying attention - I already told you that multi-threading works fine. And if it's not clear - multi threading means that data structures are shared and you need locks. If I count correctly "write" in pmemfile takes 3 locks.
    Adding multi-process support doesn't mean more locking, it means sharing runtime state between processes. And that's something the underlying library (libpmemobj) currently does not support.
    I said multi-threaded and multi-process. Remember that both ext4-dax and nova have multi-process support enabled, and that is what hurts them on writes. So nova and ext4-dax already share run-time state between processes.



    Sharing state between CPUs/processes under Linux, like it or not, means running into a stack of extra locking inside the kernel that you never see. You may not have to implement that locking yourself and can pretend it is not there, but the performance cost of all those locks still shows up. Locking hurts writes far more than reads; reads are almost transparent.

    This leads us back to an awkward fact: how well a file system that supports multiple processes performs, whether it lives in user space or kernel space, depends on how well the kernel itself handles memory management and the state that goes with it.

    The lower write performance of nova and ext4-dax points to an area that needs optimisation if you want a multi-process file system, user space or kernel space, to work well. So finding that multi-process support costs a lot of write performance is no surprise.

    The high read and slow write numbers of nova are not because nova is in kernel space; they are because it has multi-process support enabled. Improving multi-process speed will require working on the kernel, whether the file system is implemented in user space or kernel space.



  • oiaohm
    replied
    Originally posted by coder View Post
    You're stuck in the old way of thinking about devices and device drivers. The way to enable fast, concurrent, safe access to persistent memory devices is to use the page table - not locks. This leverages the CPU hardware to detect & intercept collisions.
    Nova's kernel-mode file system driver already does that with the current syscall interface.

    Originally posted by coder View Post
    BTW, your own link says syscalls no longer (necessarily) trigger context switches. So, I guess you mean syscall overhead.
    The reality is that the techniques for avoiding context switches in syscalls have pushed syscall overhead so close to the cost of a library call that it is not funny. So staying in user space does not really give you a performance advantage any more.
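    As a rough illustration of what I mean (a generic micro-benchmark sketch, not code from any of the projects discussed; results vary a lot with the CPU and with mitigations), compare a loop of raw syscalls against a loop that stays entirely in user space:

    /* Build: gcc -O2 -o sysbench sysbench.c */
    #define _GNU_SOURCE
    #include <stdio.h>
    #include <sys/syscall.h>
    #include <time.h>
    #include <unistd.h>

    #define N 1000000

    static long nsec_since(const struct timespec *t0)
    {
        struct timespec t1;
        clock_gettime(CLOCK_MONOTONIC, &t1);
        return (t1.tv_sec - t0->tv_sec) * 1000000000L
             + (t1.tv_nsec - t0->tv_nsec);
    }

    int main(void)
    {
        struct timespec t0;
        volatile long sink = 0;

        clock_gettime(CLOCK_MONOTONIC, &t0);
        for (int i = 0; i < N; i++)
            sink += syscall(SYS_getpid);   /* always enters the kernel */
        printf("raw syscall: %ld ns/call\n", nsec_since(&t0) / N);

        clock_gettime(CLOCK_MONOTONIC, &t0);
        for (int i = 0; i < N; i++)
            sink += i;                     /* never leaves user space */
        printf("user space:  %ld ns/iter (sink=%ld)\n",
               nsec_since(&t0) / N, (long)sink);
        return 0;
    }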



  • coder
    replied
    Originally posted by oiaohm View Post
    So pmemfile modified to allow safe multi process writing would slow down a lot in all the write tests.
    You're stuck in the old way of thinking about devices and device drivers. The way to enable fast, concurrent, safe access to persistent memory devices is to use the page table - not locks. This leverages the CPU hardware to detect & intercept collisions.

    I already mentioned this, which is sowing doubts in my mind about the value of continuing this exchange. That, and your apparent fixation on the syscall interception, which I think is clearly intended as a short-term measure that only a small number of users would ever utilize.
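    To make the page-table point concrete, here is a minimal generic sketch (not pmemfile code, and not any real driver's code): map a region read-only, let the MMU fault on the first write, and upgrade the page's protection in the fault handler. A real system would also track ownership and dirtiness and worry about async-signal-safety; this only shows the mechanism the hardware gives you for free:

    /* Build: gcc -o pgfault pgfault.c */
    #include <signal.h>
    #include <stdint.h>
    #include <stdio.h>
    #include <sys/mman.h>
    #include <unistd.h>

    static long pagesize;

    static void on_fault(int sig, siginfo_t *info, void *ctx)
    {
        (void)sig; (void)ctx;
        /* Round the faulting address down to its page and grant write
         * access; a real system would record the page as dirty/owned. */
        uintptr_t page = (uintptr_t)info->si_addr & ~(uintptr_t)(pagesize - 1);
        mprotect((void *)page, pagesize, PROT_READ | PROT_WRITE);
    }

    int main(void)
    {
        pagesize = sysconf(_SC_PAGESIZE);

        struct sigaction sa = { 0 };
        sa.sa_flags = SA_SIGINFO;
        sa.sa_sigaction = on_fault;
        sigaction(SIGSEGV, &sa, NULL);

        /* Stand-in for a pmem mapping: anonymous memory, mapped read-only. */
        char *buf = mmap(NULL, pagesize, PROT_READ,
                         MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
        if (buf == MAP_FAILED)
            return 1;

        buf[0] = 'x';   /* faults once; the handler upgrades the page and
                           the CPU retries the write */
        printf("first byte: %c\n", buf[0]);

        munmap(buf, pagesize);
        return 0;
    }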


    Originally posted by oiaohm View Post
    http://blog.tsunanet.net/2010/11/how...e-context.html

    ...

    The reality is that when you start looking at the benchmark numbers, particularly the read numbers, it becomes clear that context switching is not a major performance issue.
    BTW, your own link says syscalls no longer (necessarily) trigger context switches. So, I guess you mean syscall overhead.
    Last edited by coder; 04 November 2017, 03:18 AM.



  • coder
    replied
    Originally posted by mslusarz View Post
    I'm not sure I understand the question about randread64.
    In read64, nova wins. In randread64, pmemfile wins. I wondered if this could be due to some caching or read-ahead that nova is doing, or how else would you explain that?



  • mslusarz
    replied
    Originally posted by oiaohm View Post
    Please be aware that all the reads to nova and ext4+dax carry context-switch overhead, which as you can see is basically non-existent. Both nova and ext4+dax take locks on write so that multiple processes and threads can write without corrupting anything. So the write speed of the in-kernel file systems is held back by lock acquisition, not context switching.

    So pmemfile modified to allow safe multi process writing would slow down a lot in all the write tests.
    You are not paying attention - I already told you that multi-threading works fine. And if it's not clear - multi threading means that data structures are shared and you need locks. If I count correctly "write" in pmemfile takes 3 locks.
    Adding multi-process support doesn't mean more locking, it means sharing runtime state between processes. And that's something the underlying library (libpmemobj) currently does not support.
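    To illustrate what that involves (a generic sketch, not pmemfile or libpmemobj code; the shared-memory name and fields are made up for this example), the shared structure and its lock have to live in memory that every process maps, and the lock has to be created process-shared:

    /* Build: gcc -o shared_state shared_state.c -lpthread -lrt */
    #include <fcntl.h>
    #include <pthread.h>
    #include <stdio.h>
    #include <sys/mman.h>
    #include <unistd.h>

    struct shared_state {
        pthread_mutex_t lock;  /* must be PTHREAD_PROCESS_SHARED */
        long mtime;            /* example of shared runtime metadata */
    };

    int main(void)
    {
        /* "/demo_state" is a hypothetical name for this example. */
        int fd = shm_open("/demo_state", O_CREAT | O_RDWR, 0600);
        if (fd < 0 || ftruncate(fd, sizeof(struct shared_state)) < 0)
            return 1;

        struct shared_state *st = mmap(NULL, sizeof(*st),
                                       PROT_READ | PROT_WRITE,
                                       MAP_SHARED, fd, 0);
        if (st == MAP_FAILED)
            return 1;

        /* In real code only one process would initialize the mutex. */
        pthread_mutexattr_t attr;
        pthread_mutexattr_init(&attr);
        pthread_mutexattr_setpshared(&attr, PTHREAD_PROCESS_SHARED);
        pthread_mutex_init(&st->lock, &attr);

        pthread_mutex_lock(&st->lock);
        st->mtime++;           /* any cross-process update goes here */
        pthread_mutex_unlock(&st->lock);

        printf("mtime = %ld\n", st->mtime);
        munmap(st, sizeof(*st));
        close(fd);
        shm_unlink("/demo_state");
        return 0;
    }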



  • mslusarz
    replied
    Originally posted by coder View Post
    Thanks for posting! Is this even with the syscall interception, or was the benchmark natively compiled against pmemfile? Some impressive numbers, for sure.
    These results are with syscall interception.

    Originally posted by coder View Post
    I'm surprised by the difference between write64 and randwrite64, however. Is the difference possibly due to the actual random number generation?
    Nope. Random I/O requires finding the correct block, and that's what takes the time.

    Originally posted by coder View Post
    I'm also a bit surprised by the difference between write64 and read64, in two respects. First, why is pmemfile so affected by reads vs. writes? Second, is the difference vs. randread64 showing us that nova is doing some caching?
    Reads are mostly bounded by memcpy speed, because there are no metadata updates. Writes require metadata updates (at least mtime, sometimes ctime, size and actual block metadata) and this is where pmemfile shines. I'm not sure I understand the question about randread64.

    In general, optimizing a file system is about removing code from the hot path.
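    As a schematic sketch of the point above (not pmemfile source; the names and structures are invented for illustration), the read path is essentially a single memcpy, while the write path carries the extra metadata updates that also have to be made durable:

    #include <stddef.h>
    #include <string.h>
    #include <time.h>

    struct inode_meta {
        size_t size;
        struct timespec mtime;
    };

    /* Read: one copy out of the mapped persistent memory. */
    static size_t file_read(const char *pmem_base, size_t off,
                            void *dst, size_t len)
    {
        memcpy(dst, pmem_base + off, len);
        return len;
    }

    /* Write: the same copy, plus metadata updates (and, in a transactional
     * design, the logging needed to keep them crash-consistent). */
    static size_t file_write(char *pmem_base, struct inode_meta *meta,
                             size_t off, const void *src, size_t len)
    {
        memcpy(pmem_base + off, src, len);
        clock_gettime(CLOCK_REALTIME, &meta->mtime);
        if (off + len > meta->size)
            meta->size = off + len;
        /* A real pmem file system would also flush/persist these updates. */
        return len;
    }

    int main(void)
    {
        char pmem[4096] = { 0 };        /* stand-in for a DAX mapping */
        struct inode_meta meta = { 0 };
        char buf[8];

        file_write(pmem, &meta, 0, "hello", 5);
        file_read(pmem, 0, buf, 5);
        return 0;
    }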



  • oiaohm
    replied
    Originally posted by coder View Post
    Thanks for posting! Is this even with the syscall interception, or was the benchmark natively compiled against pmemfile? Some impressive numbers, for sure.

    I'm surprised by the difference between write64 and randwrite64, however. Is the difference possibly due to the actual random number generation?

    I'm also a bit surprised by the difference between write64 and read64, in two respects. First, why is pmemfile so affected by reads vs. writes? Second, is the difference vs. randread64 showing us that nova is doing some caching?
    Small reads below a sector size (512 bytes) can take an extra hit in the Linux kernel from memory-protection handling; this is a legacy hangover in the stack.

    Please be aware that all the reads to nova and ext4+dax carry context-switch overhead, which as you can see is basically non-existent. Both nova and ext4+dax take locks on write so that multiple processes and threads can write without corrupting anything. So the write speed of the in-kernel file systems is held back by lock acquisition, not context switching.

    So pmemfile modified to allow safe multi process writing would slow down a lot in all the write tests.

    coder, as soon as you add multi-process/thread support, the ability to write to the pmem device freely goes out the window. You do have to ask how much faster nova and ext4-dax could go if there were an option to lock writing to a single thread, with no lock acquisition on write.

    http://blog.tsunanet.net/2010/11/how...e-context.html


    The reality here is nasty. If doing everything possible to avoid context switches leaves the CPU caches working poorly, we can be talking about being 1000 times slower than just taking the context-switch path.

    Realistically, how many workloads are going to be happy with slower reads across the board, except for the one case, small reads, where there is an issue?

    DAX allows Linux applications to write straight to the device through mapped files, so there is no device overhead.
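    As a minimal sketch of what that looks like from an application (generic code, not from pmemfile; /mnt/pmem/example is a placeholder path and assumes a file system mounted with -o dax on a pmem device):

    #include <fcntl.h>
    #include <string.h>
    #include <sys/mman.h>
    #include <unistd.h>

    int main(void)
    {
        int fd = open("/mnt/pmem/example", O_CREAT | O_RDWR, 0600);
        if (fd < 0 || ftruncate(fd, 4096) < 0)
            return 1;

        char *p = mmap(NULL, 4096, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
        if (p == MAP_FAILED)
            return 1;

        /* With a DAX mapping, stores land in persistent memory without a
         * write() syscall or a page-cache copy. */
        const char msg[] = "written through the mapping";
        memcpy(p, msg, sizeof(msg));

        /* msync() is the portable way to request durability; on DAX this
         * flushes CPU caches rather than dirty page-cache pages. */
        msync(p, 4096, MS_SYNC);

        munmap(p, 4096);
        close(fd);
        return 0;
    }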

    Now, if DAX were extended to let a single thread lock the write path of a file system (or an area of one) and then use faster, lock-free write functions while it holds that lock, the speed difference would close a heck of a lot. Basically, a way to say: this file system is read-only except for this one thread, which is allowed to write using the fastest functions. I am fairly sure that if we could do this, pmemfile and others like it would lose the benchmarks, because a modern context switch is so cheap.

    The reality is that when you start looking at the benchmark numbers, particularly the read numbers, it becomes clear that context switching is not a major performance issue. CPU cache behaviour and the locking needed for multi-thread/process support are both more lethal to performance.

    Yes, a lot of people presume that pmem storage would make the context-switch cost a factor again. The reality is they are wrong. Locking yourself into user space is not going to help with the worst performance-hindering problems; in fact, it might be exactly what causes you to run straight into them.



  • coder
    replied
    Originally posted by mslusarz View Post
    Ok, I spent some time today and benchmarked ext4+dax, nova (with inplace_data_updates=1, because it would be unfair for nova with default settings) and pmemfile on DRAM-emulated pmem with regular Skylake CPU.
    ...
    Thanks for posting! Is this even with the syscall interception, or was the benchmark natively compiled against pmemfile? Some impressive numbers, for sure.

    I'm surprised by the difference between write64 and randwrite64, however. Is the difference possibly due to the actual random number generation?

    I'm also a bit surprised by the difference between write64 and read64, in two respects. First, why is pmemfile so affected by reads vs. writes? Second, is the difference vs. randread64 showing us that nova is doing some caching?



  • coder
    replied
    Originally posted by oiaohm View Post
    Data needs to move from pmem to other devices as well, and that will involve the kernel. Implementing a file system in user space without the kernel is a path to trouble.
    I'm not sure if you're saying what I think you are, but the idea is that persistent memory devices are memory-mapped, so there's no extra step involved in writing the data to the device.

    Originally posted by oiaohm View Post
    Intercepting syscalls should pretty much be restricted to diagnostics; I do not see that path as sane, because you start messing with the hot paths of other things.
    I can't speak for the developers, but I think it was mainly done for prototyping, experimentation, and a way that a tiny number of users could start to benefit from faster hardware using legacy applications (with caveats, as noted). I assume they don't imagine a future where this method factors prominently in the software stack.
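    For anyone unfamiliar with the general technique, here is a minimal LD_PRELOAD-style sketch of interposing on a libc call. Pmemfile's actual interception hooks at the syscall level rather than wrapping libc, so treat this only as a simplified stand-in for the idea of letting legacy binaries run unmodified:

    /* Build: gcc -shared -fPIC -o intercept.so intercept.c -ldl
     * Run:   LD_PRELOAD=./intercept.so some_legacy_program */
    #define _GNU_SOURCE
    #include <dlfcn.h>
    #include <unistd.h>

    typedef ssize_t (*write_fn)(int, const void *, size_t);

    ssize_t write(int fd, const void *buf, size_t count)
    {
        static write_fn real_write;
        if (!real_write)
            real_write = (write_fn)dlsym(RTLD_NEXT, "write");

        /* A pmem-aware library would check here whether fd belongs to a
         * file it manages and, if so, service the write in user space. */
        return real_write(fd, buf, count);
    }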



  • oiaohm
    replied
    Originally posted by coder View Post
    Not really. If we're talking about persistent storage that's fully mapped into the physical address range, then you can use page tables to grant concurrent access to different subtrees of the filesystem. Occasionally, as with dynamically-allocated memory, a process might need more memory or to access structures outside its current arena. That's when the kernel might get involved.

    If you're talking about legitimate IPC (i.e. communicating via the same filesystem structures from multiple processes), rather than simply overcoming the current single-process limitation, then you might be right that the kernel could still get involved the same number of times. But that's more of a corner case and not a good reason to route all filesystem operations through the kernel.

    IMO, the long-term goal shouldn't be to completely remove the kernel from the picture. If you want a solution that's fast and secure, you just need to make persistent memory access work a bit more like dynamic memory access.
    https://www.kernel.org/doc/Documenta...ystems/dax.txt DAX attempts to make file systems behave more like dynamic memory access. Nova is a DAX-based file system designed for pmem.

    So instead of trapping syscalls: improve the in-kernel DAX support, let more work be done in user space from what DAX exposes, and reduce the syscalls needed to set everything up.

    Remember, the kernel is in charge of where processes can and cannot do dynamic memory access. It is too simple to think only of IPC. Data needs to move from pmem to other devices as well, and that will involve the kernel. Implementing a file system in user space without the kernel is a path to trouble.

    I don't see the need for pure user-space file systems unless they can prove they perform well enough. When you talk about taking a user-space file system across multiple threads and processes, my question is: why not expand DAX and keep the file system in kernel space, with user-space helpers reducing the syscalls?

    I do think there is a performance problem with doing all the processing in kernel space, or all of it in user space, and there are security issues with doing everything in user space. Intercepting syscalls should pretty much be restricted to diagnostics; I do not see that path as sane, because you start messing with the hot paths of other things.

    I still have not changed my view that valid benchmarking is required; we should not give someone a pat on the back when they have not done valid benchmarking for a presentation.

