Intel Has Been Working On A New User-Space File-System For Persistent Memory


  • #21
    Ok, I spent some time today and benchmarked ext4+dax, nova (with inplace_data_updates=1, because it would be unfair to nova to use its default settings) and pmemfile on DRAM-emulated pmem with a regular Skylake CPU.

    Results are for ext4+dax, nova and pmemfile, in that order. This is bandwidth (in kB/s), so higher is better. FIO configurations can be found here: https://github.com/pmem/pmemfile/tree/master/utils/fio/. The best result in each row is marked with *.

    test           ext4+dax         nova     pmemfile
    append64          57588        49168       113653*
    append512        259494       381994       870187*
    append64k       4562332     16341444*    13559172
    write64           57474        47966       114046*
    write512         309203       367440       892196*
    write64k       16701553     16347012     22529019*
    randwrite64       43316        40589        80603*
    randwrite512     352866       336198       679860*
    randwrite64k   16415930     15891286     21115274*
    read64           129922       182207*      181075
    read512          996425      1395789*     1183149
    read64k        18469602     19011876*    18456036
    randread64        86956        88777       100491*
    randread512      718048       739612*      733759
    randread64k    17795568*    17568962     16862292
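
    To make the numbers above easier to compare, here is a small sketch that picks the winner for each test and computes how pmemfile compares to ext4+dax. The figures are copied from this post; the names RESULTS, NAMES and summarize are made up for this illustration and are not part of fio or pmemfile.

```python
# Bandwidth in kB/s, copied from the results in this post.
# Tuple order: (ext4+dax, nova, pmemfile). All names here are illustrative.
RESULTS = {
    "append64": (57588, 49168, 113653),
    "append512": (259494, 381994, 870187),
    "append64k": (4562332, 16341444, 13559172),
    "write64": (57474, 47966, 114046),
    "write512": (309203, 367440, 892196),
    "write64k": (16701553, 16347012, 22529019),
    "randwrite64": (43316, 40589, 80603),
    "randwrite512": (352866, 336198, 679860),
    "randwrite64k": (16415930, 15891286, 21115274),
    "read64": (129922, 182207, 181075),
    "read512": (996425, 1395789, 1183149),
    "read64k": (18469602, 19011876, 18456036),
    "randread64": (86956, 88777, 100491),
    "randread512": (718048, 739612, 733759),
    "randread64k": (17795568, 17568962, 16862292),
}
NAMES = ("ext4+dax", "nova", "pmemfile")


def summarize(results):
    """Return {test: (winner, pmemfile bandwidth / ext4+dax bandwidth)}."""
    summary = {}
    for test, values in results.items():
        winner = NAMES[values.index(max(values))]
        ratio = round(values[2] / values[0], 2)
        summary[test] = (winner, ratio)
    return summary


if __name__ == "__main__":
    for test, (winner, ratio) in summarize(RESULTS).items():
        print(f"{test:<13} best={winner:<9} pmemfile/ext4+dax={ratio:.2f}x")
```

    A ratio above 1 means pmemfile was faster than ext4+dax in that test. This only restates the single run posted here and says nothing about variance between runs.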


    • #22
      Originally posted by coder View Post
      Not really. If we're talking about persistent storage that's fully mapped into the physical address range, then you can use page tables to grant concurrent access to different subtrees of the filesystem. Occasionally, as with dynamically-allocated memory, a process might need more memory or to access structures outside its current arena. That's when the kernel might get involved.

      If you're talking about legitimate IPC (i.e. communicating via the same filesystem structures from multiple processes), rather than simply overcoming the current single-process limitation, then you might be right that the kernel could still get involved the same number of times. But that's more of a corner case and not a good reason to route all filesystem operations through the kernel.

      IMO, the long-term goal shouldn't be to completely remove the kernel from the picture. If you want a solution that's fast and secure, you just need to make persistent memory access work a bit more like dynamic memory access.
      https://www.kernel.org/doc/Documenta...ystems/dax.txt DAX attempts to make file system access work more like dynamic memory access. Nova is a filesystem designed for pmem that uses DAX.

      So instead of trapping syscalls, improve the in-kernel DAX: allow more to be done in user space through what DAX exposes, and reduce the number of syscalls needed to set everything up.

      Remember, the kernel is in charge of where processes can and cannot do dynamic memory access. It's simple to think only about IPC. Items need to go from pmem to devices as well, and that will involve the kernel. Implementing a file system in userspace without the kernel is a path to trouble.

      I don't see a need for pure user-space file systems unless they can prove they perform well enough. When you talk about extending a user-space file system across multiple threads and processes, my question is: why not expand DAX and keep the file system in kernel space, with user-space helpers reducing syscalls?

      I do think that, for performance, there is a problem with doing all processing in kernel space or all processing in userspace. For security, there are issues with doing everything in userspace. Intercepting syscalls should pretty much be restricted to diagnostics; I do not see that path as sane, because you start messing with the hot paths of other things.

      I still hold my point of view that valid benchmarking is required. We should not give someone a pat on the back when they have not done valid benchmarking for a presentation.


      • #23
        Originally posted by oiaohm View Post
        Items need to go from pmem to devices as well, and that will involve the kernel. Implementing a file system in userspace without the kernel is a path to trouble.
        I'm not sure if you're saying what I think you are, but the idea is that persistent memory devices are memory-mapped, so there's no extra step involved in writing the data to the device.
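
        As a rough illustration of what "memory-mapped" means here, the sketch below maps an ordinary temporary file and updates it with plain memory stores instead of write() calls. It is only a stand-in: the file is not on a DAX filesystem or on real persistent memory, and the names store_via_mapping and demo are invented for this example. On a regular file, flush() asks the kernel to write the pages back; on real pmem the analogous step would be making the stores durable from user space.

```python
import mmap
import os
import tempfile


def store_via_mapping(path, offset, payload):
    """Write payload into the file at offset using memory stores on a mapping."""
    with open(path, "r+b") as f:
        # Length 0 maps the whole file into this process's address space.
        with mmap.mmap(f.fileno(), 0) as mapping:
            # Slice assignment is a memory copy, not a write() syscall.
            mapping[offset:offset + len(payload)] = payload
            # Ask for the modified pages to be written back to the file.
            mapping.flush()


def demo():
    """Create a one-page stand-in file, store into it, and read the bytes back."""
    with tempfile.TemporaryDirectory() as d:
        path = os.path.join(d, "pmem-standin.bin")
        with open(path, "wb") as f:
            f.write(b"\0" * mmap.PAGESIZE)
        store_via_mapping(path, 128, b"hello pmem")
        with open(path, "rb") as f:
            f.seek(128)
            return f.read(10)


if __name__ == "__main__":
    print(demo())
```

        The kernel is involved when the mapping is set up and torn down, but not for each individual store, which is the property being discussed.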

        Originally posted by oiaohm View Post
        Intercepting syscalls should pretty much be restricted to diagnostics; I do not see that path as sane, because you start messing with the hot paths of other things.
        I can't speak for the developers, but I think it was mainly done for prototyping, experimentation, and a way that a tiny number of users could start to benefit from faster hardware using legacy applications (with caveats, as noted). I assume they don't imagine a future where this method factors prominently in the software stack.


        • #24
          Originally posted by mslusarz View Post
          Ok, I spent some time today and benchmarked ext4+dax, nova (with inplace_data_updates=1, because it would be unfair to nova to use its default settings) and pmemfile on DRAM-emulated pmem with a regular Skylake CPU.
          ...
          Thanks for posting! Is this even with the syscall interception, or was the benchmark natively compiled against pmemfile? Some impressive numbers, for sure.

          I'm surprised by the difference between write64 and randwrite64, however. Is the difference possibly due to the actual random number generation?

          I'm also a bit surprised by the difference between write64 and read64, in two respects. First, why is pmemfile so affected by reads vs. writes? Second, is the difference vs. randread64 showing us that nova is doing some caching?


          • #25
            Originally posted by coder View Post
            Thanks for posting! Is this even with the syscall interception, or was the benchmark natively compiled against pmemfile? Some impressive numbers, for sure.

            I'm surprised by the difference between write64 and randwrite64, however. Is the difference possibly due to the actual random number generation?

            I'm also a bit surprised by the difference between write64 and read64, in two respects. First, why is pmemfile so affected by reads vs. writes? Second, is the difference vs. randread64 showing us that nova is doing some caching?
            Small reads under the sector size (512) can take an extra hit from memory protection actions in the Linux kernel; this is a legacy hangover in the stack.

            Please be aware that all the reads on nova and ext4+dax carry context switch overhead, which, as you can see, is basically nonexistent. Both nova and ext4+dax take locks on write so that multiple processes and threads can write without corrupting anything. So the write speed of in-kernel file systems is held up by lock acquisition, not context switching.

            So pmemfile, modified to allow safe multi-process writing, would slow down a lot in all the write tests.

            coder, as soon as you add multi-process/thread support, being able to write to the pmem device freely goes out the window. You have to ask how much faster nova and ext4+dax could go if there were an option to restrict writing to one thread only, so that no lock acquisition is needed on write.

            http://blog.tsunanet.net/2010/11/how...e-context.html

            The reality here is nasty. If doing everything to avoid context switches results in the CPU caches not working effectively, we can be talking 1000 times slower than the context switch path.

            The real question is how many workloads are going to be happy with the slower read speed in every case except the one with small reads, where there is an issue.

            DAX allows applications under Linux to write straight to the device through mapped files. So there is no device overhead.

            Now, if DAX were extended to let a single thread lock the writing process of a file system (or an area of a file system) and then use faster lock-free write functions while it holds that lock, the speed difference would close a heck of a lot. Basically, a means to say: this filesystem is read-only except for this one thread, which is allowed to write using the fastest functions. I am fairly sure that if we could do this, pmemfile and others like it would lose in benchmarks, because a modern-day context switch is so cheap.

            The reality is that when you start looking at benchmark numbers, particularly the read numbers, it becomes clear that context switching is not a major performance issue. CPU cache behaviour and locking for multi-thread/process support are both more lethal to performance.

            Yes, a lot of people presume that pmem storage would make the context switch cost a factor again. The reality is they are wrong. Locking yourself to userspace is not going to help with the worst performance-hindering problems; in fact, it might be what causes you to run straight into them.


            • #26
              Originally posted by coder View Post
              Thanks for posting! Is this even with the syscall interception, or was the benchmark natively compiled against pmemfile? Some impressive numbers, for sure.
              These results are with syscall interception.

              Originally posted by coder View Post
              I'm surprised by the difference between write64 and randwrite64, however. Is the difference possibly due to the actual random number generation?
              Nope. Random IO requires finding the correct block, and that's what takes the time.
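
              A toy model of why block lookup penalizes random offsets more than sequential ones: the sketch below keeps a cursor to the last block used, so sequential accesses hit a cheap fast path while random ones fall back to a search. This is not pmemfile's actual data structure; BlockIndex and its fields are hypothetical names chosen for the illustration.

```python
import bisect


class BlockIndex:
    """Toy block map: sorted block start offsets plus a last-used cursor."""

    def __init__(self, block_starts):
        self.starts = sorted(block_starts)
        self.cursor = 0      # index of the block used by the previous access
        self.searches = 0    # how many times the slow path was taken

    def _contains(self, index, offset):
        """True if block number index covers the given file offset."""
        if index >= len(self.starts):
            return False
        start = self.starts[index]
        last = index + 1 >= len(self.starts)
        end = float("inf") if last else self.starts[index + 1]
        return start <= offset < end

    def find(self, offset):
        """Return the index of the block containing offset."""
        # Fast path: the same block as last time, or the one right after it.
        for candidate in (self.cursor, self.cursor + 1):
            if self._contains(candidate, offset):
                self.cursor = candidate
                return candidate
        # Slow path: search the whole index for the right block.
        self.searches += 1
        self.cursor = bisect.bisect_right(self.starts, offset) - 1
        return self.cursor


if __name__ == "__main__":
    index = BlockIndex(range(0, 100 * 4096, 4096))
    for off in range(0, 100 * 4096, 64):   # sequential 64-byte accesses
        index.find(off)
    print("searches after sequential pass:", index.searches)
```

              In this model a sequential pass never takes the slow path, while a jump to an arbitrary offset does, which is one plausible shape for the extra cost described above.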

              Originally posted by coder View Post
              I'm also a bit surprised by the difference between write64 and read64, in two respects. First, why is pmemfile so affected by reads vs. writes? Second, is the difference vs. randread64 showing us that nova is doing some caching?
              Reads are mostly bounded by memcpy speed, because there are no metadata updates. Writes require metadata updates (at least mtime, sometimes ctime, size and actual block metadata) and this is where pmemfile shines. I'm not sure I understand the question about randread64.
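
              The metadata point can be observed on any ordinary filesystem. The sketch below is a generic illustration using os.stat on a temporary file, not a pmemfile measurement; metadata_after, do_read, do_append and demo are names invented for it. It reports which of size, mtime and ctime change after a 64-byte read versus a 64-byte append.

```python
import os
import tempfile

FIELDS = ("st_size", "st_mtime_ns", "st_ctime_ns")


def metadata_after(path, action):
    """Run action(path) and return the set of stat fields that changed."""
    before = os.stat(path)
    action(path)
    after = os.stat(path)
    return {f for f in FIELDS if getattr(before, f) != getattr(after, f)}


def do_read(path):
    """Read 64 bytes; the data is copied out and no tracked field changes."""
    with open(path, "rb") as f:
        f.read(64)


def do_append(path):
    """Append 64 bytes; at least the file size has to be updated."""
    with open(path, "ab") as f:
        f.write(b"x" * 64)


def demo():
    """Return (fields changed by a read, fields changed by an append)."""
    with tempfile.NamedTemporaryFile(delete=False) as f:
        f.write(b"x" * 4096)
        path = f.name
    try:
        return metadata_after(path, do_read), metadata_after(path, do_append)
    finally:
        os.unlink(path)


if __name__ == "__main__":
    read_changed, append_changed = demo()
    print("read changed:", sorted(read_changed))
    print("append changed:", sorted(append_changed))
```

              The size always changes on the append. Whether the timestamps visibly change depends on the filesystem's timestamp granularity, so they may not show up if the two stat calls land within the same tick.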

              In general, optimizing a file system is about removing code from the hot path.


              • #27
                Originally posted by oiaohm View Post
                Please be aware that all the reads on nova and ext4+dax carry context switch overhead, which, as you can see, is basically nonexistent. Both nova and ext4+dax take locks on write so that multiple processes and threads can write without corrupting anything. So the write speed of in-kernel file systems is held up by lock acquisition, not context switching.

                So pmemfile, modified to allow safe multi-process writing, would slow down a lot in all the write tests.
                You are not paying attention - I already told you that multi-threading works fine. And if it's not clear - multi threading means that data structures are shared and you need locks. If I count correctly "write" in pmemfile takes 3 locks.
                Adding multi-process support doesn't mean more locking, it means sharing runtime state between processes. And that's something the underlying library (libpmemobj) currently does not support.
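
                A minimal sketch of the multi-threading point, assuming a toy in-memory "file": the size and data are shared between threads, so an append has to take a lock to reserve its offset and update the size together. It uses a single lock for simplicity rather than the three counted above for pmemfile's write path, and SharedFile and run are hypothetical names, not pmemfile or libpmemobj APIs.

```python
import threading


class SharedFile:
    """Toy in-memory file whose data and size are shared by all threads."""

    def __init__(self):
        self._lock = threading.Lock()
        self.data = bytearray()
        self.size = 0

    def append(self, payload):
        """Append payload and return the offset it was written at."""
        # Reserving the offset, copying the data and updating the size must
        # happen as one step, otherwise two threads could pick the same offset.
        with self._lock:
            offset = self.size
            self.data[offset:offset + len(payload)] = payload
            self.size = offset + len(payload)
            return offset


def run(threads=4, appends=1000, chunk=b"x" * 64):
    """Append from several threads at once and return the resulting file."""
    shared = SharedFile()

    def worker():
        for _ in range(appends):
            shared.append(chunk)

    pool = [threading.Thread(target=worker) for _ in range(threads)]
    for t in pool:
        t.start()
    for t in pool:
        t.join()
    return shared


if __name__ == "__main__":
    result = run()
    print(result.size)
```

                This works because all threads see the same lock object in the same address space. Sharing that kind of runtime state across separate processes is the part described above as unsupported.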


                • #28
                  Originally posted by mslusarz View Post
                  I'm not sure I understand the question about randread64.
                  In read64, nova wins. In randread64, pmemfile wins. I wondered if this could be due to some caching or read-ahead that nova is doing, or how else would you explain that?


                  • #29
                    Originally posted by oiaohm View Post
                    So pmemfile, modified to allow safe multi-process writing, would slow down a lot in all the write tests.
                    You're stuck in the old way of thinking about devices and device drivers. The way to enable fast, concurrent, safe access to persistent memory devices is to use the page table - not locks. This leverages the CPU hardware to detect & intercept collisions.

                    I already mentioned this, which is seeding doubts in my mind about the value of continuing this exchange. That, and your apparent fixation on the syscall interception, which I think it's clear is intended as a short-term measure that only a small number of users would ever utilize.


                    Originally posted by oiaohm View Post
                    http://blog.tsunanet.net/2010/11/how...e-context.html

                    ...

                    The reality is that when you start looking at benchmark numbers, particularly the read numbers, it becomes clear that context switching is not a major performance issue.
                    BTW, your own link says syscalls no longer (necessarily) trigger context switches. So, I guess you mean syscall overhead.
                    Last edited by coder; 11-04-2017, 03:18 AM.


                    • #30
                      Originally posted by coder View Post
                      You're stuck in the old way of thinking about devices and device drivers. The way to enable fast, concurrent, safe access to persistent memory devices is to use the page table - not locks. This leverages the CPU hardware to detect & intercept collisions.
                      The Nova kernel-mode file system driver, with the current syscall system, does that.

                      Originally posted by coder View Post
                      BTW, your own link says syscalls no longer (necessarily) trigger context switches. So, I guess you mean syscall overhead.
                      The reality is that the methods for avoiding context switches in syscalls have brought syscall overhead so close to that of calling a library function that it's not funny. So staying in userspace does not really give you a performance advantage any more.
