**mslusarz** · 30 October 2017, 04:58 PM

Ok, I spent some time today and benchmarked ext4+dax, nova (with inplace_data_updates=1, because it would be unfair for nova with default settings) and pmemfile on DRAM-emulated pmem with a regular Skylake CPU.

The first result is always for ext4+dax, the second for nova, the third for pmemfile. This is bandwidth (in kB/s), so higher is better. FIO configurations can be found here: https://github.com/pmem/pmemfile/tree/master/utils/fio/. The best result is in bold.

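| Test | ext4+dax | nova | pmemfile |
|---|---:|---:|---:|
| append64 | 57588 | 49168 | **113653** |
| append512 | 259494 | 381994 | **870187** |
| append64k | 4562332 | **16341444** | 13559172 |
| write64 | 57474 | 47966 | **114046** |
| write512 | 309203 | 367440 | **892196** |
| write64k | 16701553 | 16347012 | **22529019** |
| randwrite64 | 43316 | 40589 | **80603** |
| randwrite512 | 352866 | 336198 | **679860** |
| randwrite64k | 16415930 | 15891286 | **21115274** |
| read64 | 129922 | **182207** | 181075 |
| read512 | 996425 | **1395789** | 1183149 |
| read64k | 18469602 | **19011876** | 18456036 |
| randread64 | 86956 | 88777 | **100491** |
| randread512 | 718048 | **739612** | 733759 |
| randread64k | **17795568** | 17568962 | 16862292 |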
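
For anyone who wants to compare the numbers without eyeballing them, here is a small Python sketch that picks the best file system per test and computes bandwidth relative to ext4+dax. The figures are copied from the results above; the names `RESULTS`, `best` and `ratio_vs_ext4` are made up for this illustration and are not part of fio or pmemfile.

```python
# Bandwidth in kB/s, copied from the results above: (ext4+dax, nova, pmemfile).
RESULTS = {
    "append64": (57588, 49168, 113653),
    "append512": (259494, 381994, 870187),
    "append64k": (4562332, 16341444, 13559172),
    "write64": (57474, 47966, 114046),
    "write512": (309203, 367440, 892196),
    "write64k": (16701553, 16347012, 22529019),
    "randwrite64": (43316, 40589, 80603),
    "randwrite512": (352866, 336198, 679860),
    "randwrite64k": (16415930, 15891286, 21115274),
    "read64": (129922, 182207, 181075),
    "read512": (996425, 1395789, 1183149),
    "read64k": (18469602, 19011876, 18456036),
    "randread64": (86956, 88777, 100491),
    "randread512": (718048, 739612, 733759),
    "randread64k": (17795568, 17568962, 16862292),
}
FILESYSTEMS = ("ext4+dax", "nova", "pmemfile")


def best(test):
    """Return the name of the file system with the highest bandwidth."""
    values = RESULTS[test]
    return FILESYSTEMS[values.index(max(values))]


def ratio_vs_ext4(test):
    """Return (nova, pmemfile) bandwidth as a multiple of ext4+dax."""
    ext4, nova, pmemfile = RESULTS[test]
    return nova / ext4, pmemfile / ext4


if __name__ == "__main__":
    for test in RESULTS:
        nova_x, pmemfile_x = ratio_vs_ext4(test)
        print(f"{test:13s} best={best(test):9s} "
              f"nova={nova_x:.2f}x pmemfile={pmemfile_x:.2f}x")
```

Ratios like these only describe this one run on DRAM-emulated pmem, so treat them as a reading aid rather than a general ranking.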

**oiaohm** · 30 October 2017, 07:12 PM

Originally posted by coder

Not really. If we're talking about persistent storage that's fully mapped into the physical address range, then you can use page tables to grant concurrent access to different subtrees of the filesystem. Occasionally, as with dynamically-allocated memory, a process might need more memory or to access structures outside its current arena. That's when the kernel might get involved.

If you're talking about legitimate IPC (i.e. communicating via the same filesystem structures from multiple processes), rather than simply overcoming the current single-process limitation, then you might be right that the kernel could still get involved the same number of times. But that's more of a corner case and not a good reason to route all filesystem operations through the kernel.

IMO, the long-term goal shouldn't be to completely remove the kernel from the picture. If you want a solution that's fast and secure, you just need to make persistent memory access work a bit more like dynamic memory access.

https://www.kernel.org/doc/Documenta...ystems/dax.txt DAX attempts to make file systems work more like dynamic memory access. Nova is a DAX-based filesystem designed for pmem.

So instead of trapping syscalls, improve the in-kernel DAX and allow more to be done in user space from what the DAX system exposes, reducing the syscalls it takes to set everything up.

Remember that the kernel is in charge of where processes can and cannot do dynamic memory access. It's too simple to think only of IPC. Items need to go from pmem to devices as well, and that will involve the kernel. Implementing a file system in userspace without the kernel is a path to trouble.

I don't see a need for pure user-space file systems unless they can prove they perform well enough. When you talk about extending a user-space file system across multiple threads and processes, I have to ask: why not expand DAX and keep the file system in kernel space, with user-space helpers reducing syscalls?

I do think doing all processing in kernel space, or all of it in userspace, is a problem for performance. Doing everything in userspace also raises security issues. Intercepting syscalls should pretty much be restricted to diagnostics; I do not see that path as sane, because you start messing with the hot paths of other things.

I still hold that valid benchmarking is required. We should not give someone a pat on the back when they have not done valid benchmarking for a presentation.

**coder** · 01 November 2017, 11:25 PM

Originally posted by oiaohm

Items need to go from pmem to devices as well, and that will involve the kernel. Implementing a file system in userspace without the kernel is a path to trouble.

I'm not sure if you're saying what I think you are, but the idea is that persistent memory devices are memory-mapped, so there's no extra step involved in writing the data to the device.

Originally posted by oiaohm

Intercepting syscalls should pretty much be restricted to diagnostics; I do not see that path as sane, because you start messing with the hot paths of other things.

I can't speak for the developers, but I think it was mainly done for prototyping, experimentation, and a way that a tiny number of users could start to benefit from faster hardware using legacy applications (with caveats, as noted). I assume they don't imagine a future where this method factors prominently in the software stack.

**coder** · 01 November 2017, 11:47 PM

Originally posted by mslusarz

Ok, I spent some time today and benchmarked ext4+dax, nova (with inplace_data_updates=1, because it would be unfair for nova with default settings) and pmemfile on DRAM-emulated pmem with a regular Skylake CPU.
...

Thanks for posting! Is this even with the syscall interception, or was the benchmark natively compiled against pmemfile? Some impressive numbers, for sure.

I'm surprised by the difference between write64 and randwrite64, however. Is the difference possibly due to the actual random number generation?

I'm also a bit surprised by the difference between write64 and read64, in two respects. First, why is pmemfile so affected by reads vs. writes? Second, is the difference vs. randread64 showing us that nova is doing some caching?

**oiaohm** · 02 November 2017, 12:50 AM

Originally posted by coder

Thanks for posting! Is this even with the syscall interception, or was the benchmark natively compiled against pmemfile? Some impressive numbers, for sure.

I'm surprised by the difference between write64 and randwrite64, however. Is the difference possibly due to the actual random number generation?

I'm also a bit surprised by the difference between write64 and read64, in two respects. First, why is pmemfile so affected by reads vs. writes? Second, is the difference vs. randread64 showing us that nova is doing some caching?

Small reads under the sector size (512 bytes) can take an extra hit from memory protection actions in the Linux kernel; this is a legacy hangover in the stack.

Please be aware that all the reads to nova and ext4+dax carry context switch overhead, which as you can see is basically nonexistent. Both nova and ext4+dax take locks on write to allow multiple processes and threads to write without stuffing things up. So the write speed of in-kernel file systems is held up by lock acquisition, not context switching.

So pmemfile modified to allow safe multi process writing would slow down a lot in all the write tests.

coder, as soon as you add multi-process/thread support, being able to write to the pmem device freely goes out the window. You do have to ask how much faster nova and ext4+dax could go if there were an option to restrict writing to one thread only, so that no lock acquisition is needed on write.

How long does it take to make a context switch?

http://blog.tsunanet.net/2010/11/how-long-does-it-take-to-make-context.html

The reality here is nasty. If doing everything possible to avoid context switches results in the CPU caches not working effectively, we can be talking 1000 times slower than taking the context switch path.

The real question is how many workloads are going to be happy with the slower read speed in every case except the one with small reads, where there is an issue.

DAX allows applications under Linux to write straight to the device through mapped files, so there is no device overhead.
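
To make the mapped-file write path concrete, here is a minimal Python sketch using the standard `mmap` module. It runs against an ordinary temporary file as a stand-in (an assumption for illustration); on a file that lives on a DAX-mounted filesystem, the same stores would be expected to reach the persistent memory without going through the page cache. The helper names `write_through_mapping` and `demo` are made up for this example.

```python
import mmap
import os
import tempfile


def write_through_mapping(path, offset, payload):
    """Store payload into the file at offset using a shared memory mapping."""
    with open(path, "r+b") as f:
        # Length 0 maps the whole file; ACCESS_WRITE gives a shared,
        # write-through mapping of the file contents.
        with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_WRITE) as m:
            # A plain memory store: no write() syscall per update.
            m[offset:offset + len(payload)] = payload
            # Ask the kernel to make the stores durable (msync on Linux).
            m.flush()


def demo():
    # Assumption: a temp file stands in for a file on a DAX mount.
    with tempfile.NamedTemporaryFile(delete=False) as tmp:
        tmp.write(b"\0" * 4096)  # mmap cannot map an empty file
        path = tmp.name
    try:
        write_through_mapping(path, 128, b"hello pmem")
        with open(path, "rb") as f:
            f.seek(128)
            return f.read(10)
    finally:
        os.unlink(path)


if __name__ == "__main__":
    print(demo())
```

Note that setting up the mapping still goes through the kernel (`open`, `mmap`); only the individual stores afterwards avoid syscalls.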

Now, if DAX were extended to allow a single thread to lock writing for a file system (or an area within one), and then to use faster lock-free write functions while that thread holds the lock, the speed difference would close a heck of a lot. Basically, it would be a means of saying this filesystem is read-only except for this one thread, which is allowed to write using the fastest functions. I am fairly sure that if we could do this, pmemfile and others like it would lose the benchmark, because a modern-day context switch is so cheap.

The reality is that when you start looking at benchmark numbers, particularly read numbers, it becomes clear that context switching is not a major performance issue. CPU cache behavior and locking for multi-thread/process support are both more lethal to performance.

Yes, a lot of people presume that pmem storage would make context switch cost a factor again. The reality is they are wrong. Locking yourself into userspace is not going to help with the worst performance-hindering problems; in fact, it might be what causes you to run straight into them.

**mslusarz** · 02 November 2017, 06:49 PM

Originally posted by coder

Thanks for posting! Is this even with the syscall interception, or was the benchmark natively compiled against pmemfile? Some impressive numbers, for sure.

These results are with syscall interception.

Originally posted by coder

I'm surprised by the difference between write64 and randwrite64, however. Is the difference possibly due to the actual random number generation?

Nope. Random IO requires finding the correct block, and that's what takes time.

Originally posted by coder

I'm also a bit surprised by the difference between write64 and read64, in two respects. First, why is pmemfile so affected by reads vs. writes? Second, is the difference vs. randread64 showing us that nova is doing some caching?

Reads are mostly bounded by memcpy speed, because there are no metadata updates. Writes require metadata updates (at least mtime, sometimes ctime, size and actual block metadata) and this is where pmemfile shines. I'm not sure I understand the question about randread64.

In general, optimizing a file system is about removing code from the hot path.
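
As a rough illustration of why the read and write paths differ, here is a toy in-memory model in Python. It is only a sketch under stated assumptions and not pmemfile's actual design: the `ToyFile` class, its fields, and the decision to skip atime updates on read are all made up for the example. The read path is just a copy, while every write also has to touch metadata.

```python
import threading
import time


class ToyFile:
    """Toy in-memory file; a hypothetical model, not pmemfile code."""

    def __init__(self):
        self._lock = threading.Lock()
        self._data = bytearray()
        self.size = 0
        self.mtime = 0.0
        self.metadata_updates = 0  # counts extra work done on the write path

    def write(self, offset, payload):
        with self._lock:
            end = offset + len(payload)
            if end > len(self._data):
                # Growing the file stands in for block allocation.
                self._data.extend(b"\0" * (end - len(self._data)))
            self._data[offset:end] = payload
            # Metadata updates that a read never needs.
            self.size = len(self._data)
            self.mtime = time.time()
            self.metadata_updates += 1
        return len(payload)

    def read(self, offset, length):
        with self._lock:
            # Only a copy: the cost is bounded by how fast bytes move.
            return bytes(self._data[offset:offset + length])


if __name__ == "__main__":
    f = ToyFile()
    f.write(0, b"abc")
    f.write(3, b"def")
    print(f.read(0, 6), f.size, f.metadata_updates)
```

The less of that extra bookkeeping sits on the write path, the closer writes get to plain copy speed.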

**mslusarz** · 02 November 2017, 07:26 PM

Originally posted by oiaohm

Please be aware that all the reads to nova and ext4+dax carry context switch overhead, which as you can see is basically nonexistent. Both nova and ext4+dax take locks on write to allow multiple processes and threads to write without stuffing things up. So the write speed of in-kernel file systems is held up by lock acquisition, not context switching.

So pmemfile modified to allow safe multi process writing would slow down a lot in all the write tests.

You are not paying attention - I already told you that multi-threading works fine. And if it's not clear - multi-threading means that data structures are shared and you need locks. If I count correctly, "write" in pmemfile takes 3 locks.
Adding multi-process support doesn't mean more locking, it means sharing runtime state between processes. And that's something the underlying library (libpmemobj) currently does not support.

**coder** · 04 November 2017, 02:57 AM

Originally posted by mslusarz

I'm not sure I understand the question about randread64.

In read64, nova wins. In randread64, pmemfile wins. I wondered if this could be due to some caching or read-ahead that nova is doing, or how else would you explain that?

**coder** · 04 November 2017, 03:10 AM

Originally posted by oiaohm

So pmemfile modified to allow safe multi process writing would slow down a lot in all the write tests.

You're stuck in the old way of thinking about devices and device drivers. The way to enable fast, concurrent, safe access to persistent memory devices is to use the page table - not locks. This leverages the CPU hardware to detect & intercept collisions.

I already mentioned this, which is seeding doubts in my mind about the value of continuing this exchange. That, and your apparent fixation on the syscall interception, which I think it's clear is intended as a short-term measure that only a small number of users would ever utilize.

Originally posted by oiaohm

http://blog.tsunanet.net/2010/11/how...e-context.html

...

The reality is that when you start looking at benchmark numbers, particularly read numbers, it becomes clear that context switching is not a major performance issue.

BTW, your own link says syscalls no longer (necessarily) trigger context switches. So, I guess you mean syscall overhead.

**oiaohm** · 04 November 2017, 10:52 PM

Originally posted by coder

You're stuck in the old way of thinking about devices and device drivers. The way to enable fast, concurrent, safe access to persistent memory devices is to use the page table - not locks. This leverages the CPU hardware to detect & intercept collisions.

The Nova kernel-mode file system driver, with the current syscall system, does that.

Originally posted by coder

BTW, your own link says syscalls no longer (necessarily) trigger context switches. So, I guess you mean syscall overhead.

The reality is that the methods of avoiding context switches in syscalls have got syscall overhead down so close to the cost of calling a library that it's not funny. So staying in userspace does not really give you a performance advantage any more.

Intel Has Been Working On A New User-Space File-System For Persistent Memory