Redox: A Rust-Written, Microkernel Open-Source OS

  • #51
    Originally posted by Zoxc View Post
    Actually there is a way around that, but not for arbitrary x86 code. My microkernel project (WIP: https://github.com/AveryOS/avery) uses software-fault isolation on processes ensuring that they are isolated. The processes all run in a single address space in kernel mode. This software-fault isolation also happens to prevent control-flow hijacking attacks, but it does have a overhead compared to regular x86 code.
    Now that sounds much more interesting, although your project seems to be "light" on documentation too. Which approach are you using? I guess it would be too early to ask about the performance impact it has?

    Comment


    • #52
      Originally posted by Zoxc View Post
      Actually there is a way around that, but not for arbitrary x86 code. My microkernel project (WIP: https://github.com/AveryOS/avery) uses software fault isolation on processes, ensuring that they are isolated. The processes all run in a single address space in kernel mode. This software fault isolation also happens to prevent control-flow hijacking attacks, but it does have an overhead compared to regular x86 code.


      What do you mean by address generation? A key benefit of having caches indexed by virtual memory (like the Mill, but unlike x86) is that you don't have to look up the physical address in the TLB before going to the caches.
      You must know that's not how x86 works, though? I'm pretty certain all x86 processors use the TLB to store cache line tags for exactly that reason. If you aren't looking up the address in the TLB, wouldn't that necessarily mean a pipeline stall while it waits out the latency?

      EDIT: That's what I mean when I said a backside cache may be useful, but a front side cache would hurt.

      EDIT:https://en.wikipedia.org/wiki/CPU_cache

      Virtual memory requires the processor to translate virtual addresses generated by the program into physical addresses in main memory. The portion of the processor that does this translation is known as the memory management unit (MMU). The fast path through the MMU can perform those translations stored in the translation lookaside buffer (TLB), which is a cache of mappings from the operating system's page table, segment table or both.
      So I guess my question is: where is the virtual memory index? It can't possibly be faster than a TLB, because that's pretty much what it already is, in the lowest-latency location it can possibly be.
      Last edited by duby229; 22 March 2016, 12:04 PM.

      Comment


      • #53
        Originally posted by duby229 View Post

        You must know that's not how x86 works, though? I'm pretty certain all x86 processors use the TLB to store cache line tags for exactly that reason. If you aren't looking up the address in the TLB, wouldn't that necessarily mean a pipeline stall while it waits out the latency?

        EDIT: That's what I mean when I said a backside cache may be useful, but a front side cache would hurt.

        EDIT:https://en.wikipedia.org/wiki/CPU_cache



        So I guess my question is: where is the virtual memory index? It can't possibly be faster than a TLB, because that's pretty much what it already is, in the lowest-latency location it can possibly be.
        This just made me more confused. You didn't define address generation either. I assume that is whatever generates the virtual address we want to access in memory.

        I assume that x86 processors use a virtually indexed, physically tagged cache, so the cache needs to wait for the tag from the TLB. The Mill uses a virtually indexed, virtually tagged cache, so it does not have to wait for the TLB (which is a good thing, since the TLB is far away and is only consulted before requests go out on the memory bus).
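        The index/tag distinction can be illustrated with a small sketch of how an L1 set index is computed. The cache geometry below (32 KiB, 8-way, 64-byte lines) is a typical x86 L1 layout assumed for illustration, not taken from any post in this thread; with those numbers, every index bit falls inside the 4 KiB page offset, so the set lookup can begin before the TLB translation finishes.

```rust
// Sketch: a 32 KiB, 8-way cache with 64-byte lines has
// 32 KiB / (8 ways * 64 B) = 64 sets, so the set index uses address
// bits 6..11, which all lie inside the 4 KiB page offset (bits 0..11).
const LINE_BITS: u32 = 6; // 64-byte lines
const SET_COUNT: usize = 64;

fn set_index(addr: usize) -> usize {
    (addr >> LINE_BITS) % SET_COUNT
}

fn main() {
    // Because the index bits are page-offset bits, a virtual address and
    // any physical address it maps to pick the same set: indexing never
    // needs the TLB; only the final tag compare might (for a PIPT tag).
    let vaddr = 0x0000_1234; // page offset 0x234
    let paddr = 0xABCD_E234; // same page offset 0x234
    assert_eq!(set_index(vaddr), set_index(paddr));
    println!("ok");
}
```

        A virtually tagged cache like the Mill's goes one step further: even the tag compare uses virtual bits, so the TLB is off the load critical path entirely.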

        Comment


        • #54
          Originally posted by Zoxc View Post
          This just made me more confused. You didn't define address generation either. I assume that is whatever generates the virtual address we want to access in memory.

          I assume that x86 processors use a virtually indexed, physically tagged cache, so the cache needs to wait for the tag from the TLB. The Mill uses a virtually indexed, virtually tagged cache, so it does not have to wait for the TLB (which is a good thing, since the TLB is far away and is only consulted before requests go out on the memory bus).
          I'm not really familiar with that processor, honestly. I just read the Wikipedia page on it. I can assure you x86 is much different. Even though internally they are generally OoO RISC pipelines, externally they still execute complex instructions, which will never map well to VLIW architectures. I don't think you can get very good performance on x86 processors like that.

          Comment


          • #55
            Originally posted by log0 View Post

             Now that sounds much more interesting, although your project seems to be "light" on documentation too. Which approach are you using? I guess it would be too early to ask about the performance impact it has?
            My project is pretty light on progress and documentation.

            I have an LLVM pass which turns memory accesses into the form gs:[ptr & mask], which means they are cheaply restricted to a linear region of memory. I did some simple benchmarks on tinyrb (a simple Ruby interpreter written in C) which showed a code-size overhead of 9% and a runtime overhead of 7%. Note that this is just the overhead of protecting memory accesses (other protections are not included). These overheads will go up for programs that access memory more heavily.
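            The masking scheme described above can be sketched in plain Rust. The region base, mask width, and function names below are illustrative inventions, not taken from the Avery code:

```rust
// Hypothetical sketch of SFI address masking: every untrusted pointer is
// clamped into a fixed 4 GiB sandbox region before being dereferenced.
// REGION_BASE and REGION_MASK are made-up values for illustration; the
// gs:[ptr & mask] form does the same thing with the gs segment base.
const REGION_BASE: usize = 0x1_0000_0000;
const REGION_MASK: usize = 0xFFFF_FFFF; // only the low 32 bits survive

fn sandboxed(ptr: usize) -> usize {
    // The masked offset can only land inside
    // [REGION_BASE, REGION_BASE + 4 GiB), whatever ptr held.
    REGION_BASE | (ptr & REGION_MASK)
}

fn main() {
    // An in-range offset is preserved; a wild pointer is forced in-range.
    assert_eq!(sandboxed(0x1234), 0x1_0000_1234);
    assert_eq!(sandboxed(0xDEAD_BEEF_0000_1234), 0x1_0000_1234);
    println!("ok");
}
```

            The appeal of the approach is that isolation costs one AND (and a free segment-base add) per access, instead of a page-table switch per process.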

            I have not yet decided how I will restrict function calls. I could use an array of code addresses, where instances of function pointers are indices into this array; calling a function would look the address up in the array before actually calling it. I could also use a large bitmap which marks function entry points as valid; before calling a function, the bitmap would be consulted to check that the target is a valid destination. This is the approach Windows uses (and a ~1% runtime overhead is typically quoted).
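            The first option, an index-based function table, might look like this minimal sketch (the table layout and names are illustrative, not from any real implementation):

```rust
// Hypothetical sketch: function "pointers" are indices into a trusted
// table of valid entry points, so an indirect call can never land on an
// arbitrary address.
type Func = fn(i32) -> i32;

fn double(x: i32) -> i32 { x * 2 }
fn negate(x: i32) -> i32 { -x }

// The trusted table; only these entry points are legal call targets.
static TABLE: [Func; 2] = [double, negate];

// Look the target up in the table before the indirect call; an
// out-of-range "pointer" is rejected instead of being jumped to.
fn safe_call(index: usize, arg: i32) -> Option<i32> {
    TABLE.get(index).map(|f| f(arg))
}

fn main() {
    assert_eq!(safe_call(0, 21), Some(42));
    assert_eq!(safe_call(1, 5), Some(-5));
    assert_eq!(safe_call(99, 0), None); // invalid target rejected
    println!("ok");
}
```

            The bitmap alternative trades the extra indirection for one bit of metadata per possible entry point, checked before each indirect call.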

            To deal with returning from functions I will use the same approach as SafeStack; that is, there will be separate data and call stacks. The call stack will be outside the region of memory processes can access, so overwriting return addresses will be impossible. SafeStack also has a ~1% overhead.

            Code will be proven to be isolating when loaded by the linker, by running a data-flow analysis on all functions. (You can read section 5 of this paper for details on how this can be done: http://www.cse.lehigh.edu/~gtan/pape...Sandboxing.pdf)

            I don't have any plans for dealing with jump tables yet, but I hope that some simple pattern matching will suffice.

            Comment


            • #56
              Originally posted by atomsymbol

              Fully safe software fault isolation (such as: software checking of every memory access) doesn't require garbage collection.

              Why do you believe it would be next to impossible without GC or unsafety?

              Note: Of course, by definition, "fully safe" does not account for the safety of the underlying hardware and software on top of which the fault isolation is implemented.
              What I mean is that you couldn't use just any programming language (e.g. C or platform assembly) when you run SIPs. I probably used bad wording.

              I am aware that you can do it with a language without GC. In fact, I think that if you forbade unsafe blocks in Rust code for applications, you could have such a system. But then you have massive restrictions on pointers (even inside processes) because of borrow checking.

              Comment


              • #57
                Originally posted by andreano View Post
                SystemCrasher: The so called "borrow checker" in Rust runs during compilation — you are forbidden from compiling most kinds of memory bugs. I think you will have to add Rust to your list of “unique in this regard” languages regarding performance.
                When it comes to kernel development and the like, I guess it could turn into an annoyance instead. Drivers and other system-level things are actually supposed to do "strange" memory accesses to deal with their devices, especially if they want to be at all fast. Furthermore, drivers could and would do "indirect" memory accesses thanks to DMA; shuffling large chunks of data yourself using the CPU is lame. To make it even more fun, devices can also access memory on their own. Do you understand what a "bus master" is? One wrongly programmed register and the device does DMA into the WRONG address. Some devices effectively have a will of their own, thanks to firmware deciding what to do. Needless to say, a wrong DMA write by a device tends to kill the OS really quickly. Good luck tracking that down in the compiler, or even catching it in the kernel. The only thing known to be effective is the IOMMU, a purely hardware feature.

                This said, I wouldn't be against seeing something like this as an optional flag for a C compiler, just as happens with asan/ubsan and the like. Sure, it could be good to have, if it turns out to be unobtrusive and optional and does not require a code rewrite. Otherwise it just isn't worth it.

                Comment


                • #58
                  Originally posted by pal666 View Post
                  don't hold your breath, chances to switch kernel to c++ are order of magnitude higher
                  An order of magnitude higher? Last I checked, zero minus zero still equaled zero.

                  Comment


                  • #59
                    Originally posted by SystemCrasher View Post
                    I guess it could turn into annoyance instead.
                    Yes. You gave me more examples than I was aware of, but you're supposed to be able to do all of that in Rust: in any case where the borrow checker can't prove your code correct, you can mark the block of code with the unsafe keyword, which really means "trust me".

                    Comment


                    • #60
                      Originally posted by Hi-Angel View Post
                      I'm curious whether at some point Linus will allow Rust to find its way into Linux. He wasn't friendly toward C++; I hope he will change his opinion toward Rust.
                      Me too. He is known to care about performance, and I can see how C++03 is an obvious no-go for that purpose:

                      A thought experiment: try writing a class in C++03 that manages any kind of resource, e.g. heap memory. For objects of your class to be storable in any kind of STL container (except by pointer indirection, which you don't want for performance), you must implement the copy constructor. This is, by the way, a dilemma, because you can choose between deep copying and reference counting, each with its own tradeoffs. Now, consider how your copy constructor is used: your object is constructed first as a temporary on the stack, then *copied* to its destination in the container's memory, just before the immediate destruction of your temporary. All that copying (and programmer headache) for nothing!

                      C++11 solves this specific nightmare, and many more, with stuff like rvalue references and unique_ptr, which I'd say, finally makes it meaningful to try to write performant C++.

                      Now, consider that Rust is like C++11 *minus* C++ — a language entirely based on rvalue references and unique_ptr aka. the borrow checker, encouraging all the good-for-performance practices from the start…
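                      As a minimal illustration of that last point (my own sketch, not code from any project in this thread): in Rust, moving is the default, so storing a resource-owning value in a container never goes through a copy constructor at all.

```rust
// Sketch: Rust moves owned values by default, so pushing a heap-owning
// value into a container transfers ownership with no deep copy and no
// reference counting.
fn main() {
    let v = vec![1, 2, 3]; // v owns a heap allocation
    let mut container = Vec::new();
    container.push(v); // v is *moved* in; the heap data is not copied
    // println!("{:?}", v); // would not compile: v was moved out of

    assert_eq!(container[0], vec![1, 2, 3]);
    println!("ok");
}
```

                      The compile error on use-after-move is the borrow checker doing statically what C++11's moved-from states and unique_ptr discipline only enforce by convention.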
                      Last edited by andreano; 24 March 2016, 03:39 PM.

                      Comment
