Linux 5.15 Working Towards Comprehensive Compile-Time & Run-Time Detection Of Buffer Overflows


  • oiaohm
    replied
    Originally posted by mdedetrich View Post
    Data61 didn't even exist back then.
    CSIRO’s Digital Productivity flagship first used the Data61 name in 1997.

    Originally posted by mdedetrich View Post
    Data61 is a new organisation that was created after the Australian government reduced funding to NICTA/CSIRO, and hence Data61 is the result of a merger between the two,
    This is true, but it's also a reused name. It's a really good way of hiding something that should not have been done.

    Originally posted by mdedetrich View Post
    Right, but then it's not standard C across different compilers such as LLVM. Of course the C standard doesn't forbid you from redefining said behavior, but then you are only validating the spec against a specific compiler.
    Sparse enforcing behaviour that restricts C functionality does work as a standard across different compilers, like having the proper types Sparse adds.

    There is a lot of undefined functionality you can define on top of C without having to alter compiler behaviour, like adding type data to define what is user space and what is kernel space.

    Sparse is an interesting solution as a static analysis tool because it allows you to generally extend the C language. If the compiler has plugins to support Sparse then you can do it in one pass. If the compiler does not, you have to run Sparse to validate the code before you build it.

    Sparse demonstrates that you can extend C and apply that extension across multiple C compilers.
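    To give a rough idea (a standalone sketch, not copied from the kernel; get_flag() is an invented helper), this is the style of address-space typing Sparse lays on top of plain C:

    Code:
    /* A plain C compiler sees empty macros, so this is still standard C;
     * running `sparse` on the file is what enforces the extra type. */
    #ifdef __CHECKER__
    # define __user __attribute__((noderef, address_space(1)))  /* newer kernels use a named space */
    #else
    # define __user
    #endif

    /* In the real kernel this prototype comes from <linux/uaccess.h>. */
    unsigned long copy_from_user(void *to, const void __user *from, unsigned long n);

    /* Invented helper: read one int that lives in user space. */
    int get_flag(const int __user *uptr, int *out)
    {
            /* *out = *uptr;   <-- sparse: "dereference of noderef expression" */
            return copy_from_user(out, uptr, sizeof(*out)) ? -14 /* -EFAULT */ : 0;
    }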



  • mdedetrich
    replied
    Originally posted by oiaohm View Post

    This is the second stage. I know the people who did the first stage at Data61. The starting Haskell is not human generated but instead computer generated from the starting L4 code bases. The seL4 code base does not start from nothing; anyone with the original Data61 git can prove this.
    No, it really is not: the Isabelle/HOL proofs are autogenerated (from the Haskell, the C and the generated binary). The Haskell code is not auto generated, that doesn't make any sense.

    Can you please read the paper at https://wwwnipkow.in.tum.de/teaching...s/14-final.pdf; what you are saying is blatantly wrong. It even states there that the first prototype was written in Haskell:

    As a first step, an Haskell prototype was implemented. Using Haskell has several advantages over low-level languages like C. Firstly, functions in Haskell are pure, meaning that they are stateless. Consequently, they behave much like mathematical functions, in that calling a function with the same arguments always yields the same result. Secondly, partly owing to the purity of Haskell functions, the semantics of Haskell is well-defined and simple compared to C.
    It is then further stated that they generate the proofs automatically from the Haskell into Isabelle/HOL:
    Thirdly, due to the similarity between Haskell and Isabelle/HOL, it is possible to automatically translate the Haskell implementation of the kernel into the language of the theorem prover, provided only a restricted subset of Haskell is used [Der+06; KDE09]. As a consequence, we obtain an executable specification of the kernel in the sense that the translated functions are still executable in Isabelle/HOL, but it also becomes possible to prove facts about them in Isabelle/HOL. Lastly, Haskell already has a mature ecosystem, which, among other things, made it possible to use the Haskell prototype as a hardware simulator [Kle+09], essentially turning it into a virtual machine. This step enables the kernel developers to test user applications against early versions of the kernel, which is crucial for rapid prototyping. Using the hardware simulator, the developers were able to test out different ideas and eventually change the requirements accordingly [Der+06]. As another advantage, no time must be spent to rewrite low-level, hardware dependent code when using a hardware simulation. But what about running the kernel on real hardware?
    Originally posted by oiaohm View Post
    Yes, there is a question whether what Data61 originally did is in fact legal. Yes, UNSW was a good party to whitewash over the questionable action. Yes, UNSW was used as the clean room over Data61's dirty-room work.
    What are you going on about? I was merely stating that I knew people who were writing (i.e. not "auto generating") the Haskell spec for seL4, that is all.

    Originally posted by oiaohm View Post
    Data61 did the first stage of how you start making a formally validated OS from an existing code base. The people you know from the university training program were doing the second stage of how to complete that process. Yes, the first stage is completely missing from the university repository.
    Data61 didn't even exist back then. Data61 is a new organisation that was created after the Australian government reduced funding to NICTA/CSIRO, and hence Data61 is the result of a merger between the two, see https://en.wikipedia.org/wiki/NICTA

    What you are saying doesn't make any sense.

    Originally posted by oiaohm View Post
    What you described is mandating duck typing. You don't have to have duck typing to have a safely typed language.
    No, that's not what I am saying. I was merely illustrating that C is ill-specified for type safety, whereas ironically dynamic languages (such as JS/Python) are strongly typed and C is not.

    Originally posted by oiaohm View Post
    A lot of Sparse typing is hard typing, as in once you set a Sparse type on something there is no changing it, and when something is Sparse typed it will only work with something of the same Sparse type.
    That's not what type safety means.

    Originally posted by oiaohm View Post
    This is where everyone screws up. Undefined behaviour in the C standard does not forbid you from defining that behaviour.
    Right, but then it's not standard C across different compilers such as LLVM. Of course the C standard doesn't forbid you from redefining said behavior, but then you are only validating the spec against a specific compiler.

    Furthermore, as far as I can tell the proof is not validated formally, unlike seL4 or Rust's borrow checker. In other words, there isn't a formal mathematical proof stating that Sparse follows a spec, hence why (as they even advertise) it can only detect potential problems.



  • oiaohm
    replied
    Originally posted by mdedetrich View Post

    Umm, the Haskell code is not auto generated. Have a look at the commit history that I linked earlier, it's clearly human written and not auto generated https://github.com/seL4/l4v/commits/master/spec/haskell

    What you just linked is the parser that parses human-written C code into Isabelle/HOL to validate that it works against the Haskell specification.


    Except there is a problem: that repository does not contain the original Data61 repository history. There are two stages to seL4 development.

    The first stage was done purely by Data61.

    Originally posted by mdedetrich View Post
    Again I knew people that literally worked on sel4 as part of the university training program where I studied (UNSW) when learning formal verification/OS development.
    This is the second stage. I know the people who did the first stage at Data61. The starting Haskell is not human generated but instead computer generated from the starting L4 code bases. The seL4 code base does not start from nothing; anyone with the original Data61 git can prove this.

    Yes, there is a question whether what Data61 originally did is in fact legal. Yes, UNSW was a good party to whitewash over the questionable action. Yes, UNSW was used as the clean room over Data61's dirty-room work.

    Data61 did the first stage of how you start making a formally validated OS from an existing code base. The people you know from the university training program were doing the second stage of how to complete that process. Yes, the first stage is completely missing from the university repository.


    Originally posted by mdedetrich View Post
    As for Sparse that you linked earlier, I can't comment too much on it because I am not that familiar with it, but C is not a safely typed language when using the formal definition, i.e.
    • All states within the type system can be fully evaluated into another type
    • All states as a result of that evaluation are also well typed (this is the property of preservation)
    This also means dynamic languages are also safely typed by this definition; they are just mono-typed (since they have one global type). This guarantees preservation automatically due to having one type, and if there is a type mismatch they throw an exception, which is well specified.

    Is the Linux kernel using a version of C that matches the formal C definition? The answer is no, it is not.

    All states within the type system can be fully evaluated into another type

    This is in fact false for the Linux kernel, for a different reason. You cannot directly mix something that Sparse types as user space with something that Sparse types as kernel space in Linux, as it will not build in a fully validated build. This is where the Linux kernel's version of C starts going away from the C standard and away from what people call safely typed.

    What you described is mandating duck typing. You don't have to have duck typing to have a safely typed language.

    A lot of Sparse typing is hard typing, as in once you set a Sparse type on something there is no changing it, and when something is Sparse typed it will only work with something of the same Sparse type.

    Originally posted by mdedetrich View Post
    C does not have this property because of undefined behavior which is in the language spec. This means that Sparse cannot provide the same formally verified guarantees that seL4 does by corresponding proofs with an alternate representation unless they are using some sort of magic.
    This is where everyone screws up. Undefined behaviour in the C standard does not forbid you from defining that behaviour.

    Originally posted by mdedetrich View Post
    Says it can only detect potential problems.
    Yes, how Sparse is able to find problems is that it defines, with hard typing, what would normally be undefined C behaviours. Hard typing is typing that you cannot convert from one type to another: it will refuse to build if you attempt to cast from one hard type to another.

    The C of the Linux kernel does not have the same areas of undefined behaviour as the C standard, because a lot of that behaviour has been defined.

    The hard reality here is that the Linux kernel is not using standard C. The Linux kernel is using an extended version of C where many things that are undefined behaviours in the C standard have solid, defined behaviours in the Linux kernel source. Enforcement of this Linux extended C is done by compiler plugins and Sparse, resulting in the C in Linux not having the same areas of undefined behaviour, because a lot of those behaviours are defined.
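    As a rough standalone illustration of that hard typing (the type name here is invented; the kernel's real restricted types are things like __le32 and gfp_t):

    Code:
    /* To a normal C compiler these are plain unsigned ints; to sparse they are
     * distinct, unmixable "restricted" types. */
    #ifdef __CHECKER__
    # define __bitwise __attribute__((bitwise))
    # define __force   __attribute__((force))
    #else
    # define __bitwise
    # define __force
    #endif

    typedef unsigned int __bitwise le32_t;   /* invented name; the kernel uses __le32 */

    unsigned int payload_len(le32_t wire)
    {
            /* return wire + 1;   <-- sparse: restricted type degrades to integer */
            return ((__force unsigned int)wire) + 1;   /* explicit, auditable escape hatch */
    }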




  • mdedetrich
    replied
    Originally posted by oiaohm View Post

    That is in fact wrong, and it shows that you have not worked with the tools seL4 used. The active code in those Haskell files is 90+% generated.

    This is where you are so far wrong. The original seL4 specifications in Haskell are in fact generated from the original L4 C files, processed by an early form of autocorres. The majority of the seL4 Haskell specification was never written by a human and is based off the original C files. That the copyright on those Haskell files is missing a human author is no mistake.
    Umm, the Haskell code is not auto generated. Have a look at the commit history that I linked earlier, it's clearly human written and not auto generated https://github.com/seL4/l4v/commits/master/spec/haskell

    What you just linked is the parser that parses human-written C code into Isabelle/HOL to validate that it works against the Haskell specification.

    It even says on the main README of the page that both codebases are kept in sync at https://github.com/seL4/l4v#overview, i.e.

    • haskell: Haskell model of the seL4 kernel, kept in sync with the C code
    And if you still don't believe me, you can compare the commit messages of both repositories. It's very obvious that the Haskell code is written by a person. There is also the paper at https://wwwnipkow.in.tum.de/teaching...s/14-final.pdf; it clearly states there that the Haskell code was written first, and if possible they would not have wanted to rewrite it in C, however since they are writing a kernel they needed a language that doesn't have a runtime (Haskell has a GC runtime). So instead they have the current solution which I described earlier.

    Again, I knew people that literally worked on seL4 as part of the university training program where I studied (UNSW) when learning formal verification/OS development. The Haskell code along with the C are both manually written. They then generate proofs from both codebases and basically see if the proofs are equivalent (greatly simplifying here). They have a tool (as described in the linked paper) that lets them generate Isabelle/HOL from Haskell, and they also have a tool that lets them generate Isabelle/HOL from C code (which is what you linked earlier). They compare that, along with a proof generated from disassembling the binary, to verify that there wasn't a bug in the compiler.

    So using seL4 is a terrible example of wanting to typecheck C in the Linux compiler. As for Sparse that you linked earlier, I can't comment too much on it because I am not that familiar with it, but C is not a safely typed language when using the formal definition, i.e.
    • All states within the type system can be fully evaluated into another type
    • All states as a result of that evaluation are also well typed (this is the property of preservation)
    This also means dynamic languages are also safely typed by this definition; they are just mono-typed (since they have one global type). This guarantees preservation automatically due to having one type, and if there is a type mismatch they throw an exception, which is well specified.
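    To make that concrete, here is a deliberately silly toy (not from any real code base). It type-checks as C, yet the cast lets a value escape its type entirely, which is exactly the preservation property C does not give you; a mono-typed dynamic language would raise a well-defined runtime error instead:

    Code:
    #include <stdio.h>

    int main(void)
    {
            double d = 3.14;

            /* Accepted by the type checker, but the cast reinterprets the bytes of a
             * double as a pointer; the resulting state is no longer well typed in any
             * meaningful sense, and dereferencing it is undefined behaviour. */
            int *p = *(int **)&d;
            printf("%d\n", *p);   /* undefined behaviour: p is garbage */
            return 0;
    }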

    C does not have this property because of undefined behavior, which is in the language spec. This means that Sparse cannot provide the same formally verified guarantees that seL4 does by corresponding proofs with an alternate representation, unless they are using some sort of magic. Even from briefly reading the Sparse page that you linked, there is no formal proof that this tool is correct; even the intro at https://www.kernel.org/doc/html/v4.1...se.html#sparse

    Sparse is a semantic checker for C programs; it can be used to find a number of potential problems with kernel code. See https://lwn.net/Articles/689907/ for an overview of sparse; this document contains some kernel-specific sparse information.
    Says it can only detect potential problems.
    Last edited by mdedetrich; 07 September 2021, 03:03 AM.



  • krzyzowiec
    replied
    Originally posted by xhustler
    I don't claim to be a master Jedi programmer but all this railing against C is bogus in my book - learn to write better code. STOP being lazy, once you allocate memory, clean it up. All this Java and Rust BS (pardon the language) simply encourages lazy programmers.
    Where are these programmers? Google, Microsoft, and basically every major company out there has had security vulnerabilities from mismanaging memory. Microsoft said something like 70% of their security issues are memory related, and it’s not like they can’t afford good programmers.

    The human mind is not good at reasoning when you have mutable state along with aliasing. With those two features, it just becomes too complex to track what is going on. Now if you have immutable data, then aliasing is no problem. Or if you disallow shared ownership of mutable state, then again you have no problem. That’s when you have Rust.
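    A tiny C illustration (names made up) of why mutable state plus aliasing defeats local reasoning:

    Code:
    /* Reads as "dst = src*k + src" -- and it is, unless dst and src alias. */
    void scale_and_add(int *dst, const int *src, int k)
    {
            *dst = *src * k;   /* if dst == src, this also changes *src */
            *dst += *src;      /* ...so this reads the already-scaled value */
    }
    /* scale_and_add(&x, &x, 2) yields x*2 + x*2, not x*2 + x.  Rust's borrow
     * checker refuses to hand out a mutable and a shared reference to the same
     * x at once, which is the "no shared ownership of mutable state" rule above. */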

    Personally, I prefer immutable data and a managed runtime like Clojure because it’s so much cleaner to program in, but I recognize it’s not suitable for something like systems programming.



  • oiaohm
    replied
    Originally posted by mdedetrich View Post
    You kind of misquoted me: seL4 is written in a subset of C that has been formally verified against the original specification, which is written in Haskell and proven via Isabelle/HOL.
    That is in fact wrong, and it shows that you have not worked with the tools seL4 used. The active code in those Haskell files is 90+% generated.

    This is where you are so far wrong. The original seL4 specifications in Haskell are in fact generated from the original L4 C files, processed by an early form of autocorres. The majority of the seL4 Haskell specification was never written by a human and is based off the original C files. That the copyright on those Haskell files is missing a human author is no mistake.

    Originally posted by mdedetrich View Post
    BTW as a tidbit, I studied at the university which did a joint venture with CSIRO that developed seL4 and I know some people that worked on seL4; it's definitely not "writing in a subset of C that runs through a theorem prover" like you are implying.
    There was more than one team.

    Also, most of the stuff that was added by humans to the seL4 specification already exists for the Linux kernel.


    seL4 is kind of horrible because the memory model and the validation code are all mixed in with each other.

    With seL4, what point of the project you saw alters what you see. Not seeing the start, with the auto generation of the Haskell, but being there when they were adding the memory model into the Haskell, and the resulting need to fix the C, gives the appearance that you are working from Haskell to C when in reality it was done the other way.

    Originally posted by mdedetrich View Post
    In order to get formal verification to work automatically in C (without having to rewrite it in another language like Haskell) you would have to put a type system on top of C to properly track properties such as memory, and you are basically describing Rust right now.

    Welcome to fun facts: the Linux kernel has Sparse, which does put a type system on top of C. Sparse is older than Rust.

    mdedetrich, this is the thing: Sparse's extended types are already part of the Linux kernel C source code. So Rust for the Linux kernel is not up against stock C; Linux kernel C is already extended with types by the Sparse system. The fact that Linux kernel C has types on top of the standard C types, via Sparse, means a lot of formal validation can automatically work with the Sparse-added type information. Then you have the extra information from the formal memory model of the Linux kernel that you can also provide to those tools.

    The reality that the Linux kernel is not standard C but Sparse-extended C is why comprehensive compile-time buffer overflow detection is truly possible, in a formal verification way, with the Linux kernel.
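    As a rough userspace sketch of the mechanism (this is glibc's FORTIFY_SOURCE rather than Kees Cook's actual kernel patches, but it is the same idea the kernel work builds on): once the compiler can see the destination's size, an oversized memcpy() is flagged at build time and trapped at run time.

    Code:
    /* Build with: gcc -O2 -D_FORTIFY_SOURCE=2 overflow.c
     * GCC reports the oversized copy at compile time ("writing 16 bytes into a
     * region of size 8") and the fortified memcpy aborts if it runs anyway. */
    #include <stdio.h>
    #include <string.h>

    struct msg {
            char hdr[8];
            char body[24];
    };

    int main(void)
    {
            struct msg m;

            /* __builtin_object_size(m.hdr, 1) == 8, so the 16-byte copy below is
             * provably an overflow of the hdr member. */
            printf("size the compiler sees: %zu\n", __builtin_object_size(m.hdr, 1));
            memcpy(m.hdr, "0123456789ABCDEF", 16);   /* overflow: 16 > 8 */
            return 0;
    }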

    The reality here is that Sparse and Rust do very much the same kinds of things. Linux kernel C is not pure C. How far Sparse-extended C can go is a very interesting question. Remember, the more that can be validated as Sparse-extended C, the less code the Linux kernel has to rework.

    mdedetrich, I guess you never thought to ask whether Linux kernel C in fact has a type system extended on top of it, the solution being C plus Sparse.

    Remember, over 95% of the Linux kernel code is already typed with extra types from Sparse. Linux kernel C is not your stock standard C. Since the Linux kernel is not stock standard C, it should be able to have a lot of the properties of Rust without being recoded in Rust.

    The Linux kernel is also a rare program with a full formal mathematical memory model that includes threading and other things. Between Sparse and its formal memory model, the Linux kernel technically has most of the parts to become a very secure OS; the hard part is getting all the code that makes that work written.

    The reality here is that Kees Cook, doing this work, likes Rust like you do, but he is also being realistic. Recoding large sections of the Linux kernel from C to Rust is most likely going to take too many hours to be done. This results in Kees Cook looking at how much really does need to be recoded, and how much of Rust's functionality can be brought to the Linux kernel not by recoding but by taking advantage of the features the Linux kernel's version of C already has.

    It is still an open question whether Rust is better security-wise. The reality is Linux kernel C and Rust may end up security equal; in the worst case Linux kernel C might even end up more secure than Rust. Of course the Rust code could be tidier and more developer friendly, because you don't have to write the extra type annotations that Sparse mandates you add with Linux kernel C.

    Remember, when you use a Sparse type you still have to use a standard C type as well. This does make your code a little more horrible with the Linux kernel version of C than with Rust.

    Sparse giving the means to add extra typing to C does open up the question of how much of Rust could be done using C files with Sparse-style type extensions, requiring less code rewriting and allowing the system to keep using standard C compilers.



  • darkonix
    replied
    Originally posted by xhustler View Post

    I don't claim to be a master Jedi programmer but all this railing against C is bogus in my book - learn to write better code. STOP being lazy, once you allocate memory, clean it up. All this Java and Rust BS (pardon the language) simply encourages lazy programmers.
    That's an argument that comes up many times. The reality is that it hasn't worked in the 49 years since the first version of C; why would it start working now? C makes it difficult to avoid making mistakes, and every analysis shows the same. Even if we assume it is a matter of lazy programmers, how do you propose to turn tens or maybe hundreds of thousands of C programmers into better programmers? I don't think that claim holds.



  • mdedetrich
    replied
    Originally posted by oiaohm View Post

    Go back and read 3.1 again.

    C is not a formal language; in order to allow reasoning about a C program in the theorem prover (we use Isabelle/HOL), it has to be transformed into mathematical logic (HOL). This is done by a C parser written in Isabelle. The parser defines the semantics of the C program, and gives it meaning in HOL according to this semantics. It is this formalisation which we prove to be a refinement of the mathematical (abstract) model.

    seL4 is in fact written in C. There is a parser that turns C into HOL for mathematical validation. The HOL does not generate the C program, it just yells at you if you have it coded wrong. Yes, look at page 10: you start at the C code for both the compiler and the validator.



    This is something that seL4 proved could be done with C. Yes, you would need a processing engine on top of C. This is the comprehensive compile-time buffer overflow detection that is now being implemented.

    Do GCC and LLVM have a theorem prover? The answer is in fact yes. The work is in getting from Isabelle into the GCC and LLVM versions of a theorem prover. How do GCC and LLVM optimise code? By converting whatever language you code in into an internal language that they can run provers on for optimisation and validation.

    So in GCC's case, C comes in, gets turned into GIMPLE, and that is what the provers are using.

    seL4, like it or not, demoed that you could still code C, have a validation tool on top, and produce solid code. This can of course be changed to integrate the validator into the C compiler, so doing the same thing with one tool.

    mdedetrich, yes, your claim that seL4 is written in HOL/Isabelle is invalid. seL4 is itself written in C, a limited subset, with its validator written in HOL/Isabelle; the validator cannot generate the C code itself, it can only error if you have something coded wrong.

    So there are two ways to solve this problem. seL4 demos that the C compiler can be improved, and with GCC- and LLVM-based tools adding extra metadata to C via __attribute__() stuff, they are limiting themselves to only the C standard. Yes, Sparse demos that this can be pushed even further than it currently is.

    mdedetrich, parties like you don't want to admit the possibility that C could be fixed. Yes, this will most likely result in features of C being disabled from use. There is already a list of C things that are allowed under the C standard but that you are not allowed to submit to Linux. Yes, the issue they talked about with C not having enough information is also being addressed, as compilers catch up in validation to what the seL4 validation can do by extending C with features like __attribute__() that are not part of the C standard.

    There could come a day where the sales pitch that Rust will fix buffer overflows and other memory issues will mean nothing to the Linux kernel, because the version of C the Linux kernel will accept will not have these issues. Like it or not, seL4 implemented the equal of "Rust's borrow checker" as a static analysis tool on top of project-written C with C functionality restricted. Nothing says that the Linux kernel code base has to use every C feature.
    You kind of misquoted me: seL4 is written in a subset of C that has been formally verified against the original specification, which is written in Haskell and proven via Isabelle/HOL. As noted in the diagram/text you quoted, they had to write a parser for a subset of C so that they can formally verify it, i.e. see https://github.com/seL4/l4v/tree/master/spec/haskell, more specifically as an example. So it's not as simple as just writing in a subset of C: you also have to write the Haskell, then the formal proof in Isabelle first, and THEN manually write the C equivalent in a subset of C. Finally you use theorem solvers to prove that the manually written C is equivalent to the Haskell specification.

    In order for this to work with the Linux kernel you would have to re-implement the Linux kernel as a specification in Haskell along with formal proofs. BTW as a tidbit, I studied at the university which did a joint venture with CSIRO that developed seL4 and I know some people that worked on seL4; it's definitely not "writing in a subset of C that runs through a theorem prover" like you are implying.

    What you are saying about shoving Isabelle/HOL into C/GCC doesn't work, and it's also not how seL4 works. The C language by itself doesn't have the language constructs to allow this because it's loosely typed, unlike Haskell, which is a strongly typed language (hence why the original implementation of seL4 is written in Haskell). The closest thing you can verify in C is that the specific C compiler in question is verified to follow the C spec, i.e. https://compcert.org/compcert-C.html, but the C spec itself does allow for many of these insecure (i.e. undefined) behaviors.
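    For example (a deliberately broken toy, not from any real code base), this is perfectly well-typed C and every conforming compiler, including a formally verified one like CompCert, will accept it, because the problem is undefined behaviour that the spec leaves open rather than a type error:

    Code:
    #include <stdio.h>
    #include <stdlib.h>

    int main(void)
    {
            int *p = malloc(sizeof *p);
            if (!p)
                    return 1;
            *p = 41;
            free(p);

            /* Use after free: p is still a perfectly well-typed int *, no typing
             * rule is violated, and the C standard simply calls this undefined. */
            *p += 1;
            printf("%d\n", *p);
            return 0;
    }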

    In order to get formal verification to work automatically in C (without having to rewrite it in another language like Haskell) you would have to put a type system on top of C to properly track properties such as memory, and you are basically describing Rust right now.
    Last edited by mdedetrich; 06 September 2021, 06:55 PM.



  • tildearrow
    replied
    Originally posted by xhustler View Post

    I don't claim to be a master Jedi programmer but all this railing against C is bogus in my book - learn to write better code. STOP being lazy, once you allocate memory, clean it up. All this Java and Rust BS (pardon the language) simply encourages lazy programmers.
    I am NOT lazy. I know this extremely well.



  • xhustler
    replied
    Originally posted by tildearrow View Post
    This is where C doesn't shine. It is a very simple language and allows compilation to nearly every computer architecture out there, but memory management is a total mess and there are absolutely no checks, which makes the language hard to master.

    Prepare for the Rust comments...
    I don't claim to be a master Jedi programmer but all this railing against C is bogus in my book - learn to write better code. STOP being lazy, once you allocate memory, clean it up. All this Java and Rust BS (pardon the language) simply encourages lazy programmers.

