Linux Kernel Patches Posted For Bringing Up Tesla's Full Self-Driving SoC


  • mppix
    replied
    Originally posted by filssavi View Post

    While there are a number of possible safety certifications depending on exactly which sub-field of transportation we are talking about, the most popular and widely used for automobiles is the Automotive Safety Integrity Level (ASIL) defined in ISO 26262 ("Road vehicles – Functional safety"), which is in turn derived from IEC 61508.

    It has four levels from A to D, with A being the weakest, used for items such as exterior lighting, where the risk in case of failure is relatively limited, while D is the strongest, used for things like ABS and the ECU, where a failure is very likely to result in injuries or even death; think of the ECU randomly accelerating at full power during a turn, or the ABS preventing you from braking at all during an emergency stop on dry pavement.
    Most automotive electronics fall somewhere in the middle, at ASIL B or C.

    The first hurdle to clear for ASIL (at least at the highest levels) is that the certification covers not only the item itself but also the whole development process. This requires formally documenting how and why requirements are formulated, how the development process is carried out to fulfil those requirements, and so on.
    All of this is very, very far from how the Linux kernel in particular, and most general-purpose software in general, is developed, so a full certification of the entire kernel "as is" is basically unthinkable.

    The second problem is that the higher ASIL levels (namely C and D) require extensive failure analysis: identifying all the possible failure modes, how they could affect the safety of the entire vehicle, and what mitigations are in place so that such failures do not lead to catastrophic consequences. Doing such a detailed analysis on a project the scale of the entire kernel (upwards of 20 million lines of code) is basically unthinkable.

    The final issue (and probably the biggest) is that all of these analyses and documentation requirements apply not to the kernel itself (or to any single component) but to the complete product being developed (the entire ECU, for example, not just the processor running the firmware). They therefore need to be tailored to each specific case and cannot just be done once.

    All of this work is also only valid for a single point in time (a specific kernel release, for example); a significant chunk of the process must be redone for every subsequent release, as it must be shown that the changes do not interact negatively with the rest of the design.

    The Linux kernel could at some point meet all the requirements for use in an ASIL-certified product at the lower levels (A or B), but I don't believe there is any incentive to adopt it in products with higher ratings, as it will be much, much cheaper, safer and faster to develop something from scratch that includes just the necessary functionality, minimizing the potential failure points.
    I think this does not capture the whole picture on self-driving or ISO 26262.
    Yes, ISO 26262 is painful for rather simple projects, and doing it for the Linux kernel would be "more painful" (understatement of the year). However, ISO 26262 is for components, e.g. that the motor applies the correct torque, or that your brakes stop the car.
    In this context, ISO 26262 does not allow adaptive control or machine learning in any shape or form for high ASIL. Hence, these two concepts don't mix.

    I don't know anyone who tries to certify an autonomous-driving "kernel" through functional safety certification, because, by definition, autonomous driving admits unexplored failures (accidents).
    Separately, there is no such thing as an autonomous car today; we are just discussing levels of autonomy (despite the attention-grabbing headlines).


    PS: to my knowledge, ISO 26262 is also voluntary in the sense that you can sell a car without it. Car OEMs just require it from their Tier 1 and Tier 2 suppliers.
    Last edited by mppix; 25 March 2022, 02:53 PM.



  • discordian
    replied
    Originally posted by coder View Post
    Sure you can. You can run a collision-detection test to see if the control inputs from the algorithm are predicted to hit anything, and engage emergency measures if so. The test must necessarily be lower-level and more stable than the algorithm, obviously. That should also make it cheaper to do.
    If you have some alternative algorithm, this helps a lot in shielding you from systematic errors. However, the safety concept is rather vague about such complicated algorithms. Most of the time, certification requires you to prove you can detect "simple errors".
    The highest certification levels still require things like a periodic RAM self-test: flipping bits and testing the whole memory to see whether other bits got flipped. That is something you can do with 32 KB of RAM.
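    As a rough illustration of the idea (not taken from any standard or product; the region size and patterns below are placeholders), such a periodic march-style test boils down to writing known patterns and reading them back:

        #include <stdint.h>
        #include <stdbool.h>

        /* Hypothetical test region; real implementations walk all of RAM in
         * slices and save/restore the contents they touch. */
        #define TEST_WORDS 64
        static volatile uint32_t test_region[TEST_WORDS];

        /* Write a pattern to every word, then read it back and verify.
         * Returns true if no bit flips were observed for this pattern. */
        static bool march_pass(uint32_t pattern)
        {
            for (unsigned i = 0; i < TEST_WORDS; i++)
                test_region[i] = pattern;
            for (unsigned i = 0; i < TEST_WORDS; i++)
                if (test_region[i] != pattern)
                    return false;
            return true;
        }

        /* Periodic self-test: alternating patterns catch stuck-at bits and
         * coupling between neighbouring cells. Called from a low-priority task. */
        bool ram_self_test(void)
        {
            return march_pass(0x55555555u) &&
                   march_pass(0xAAAAAAAAu) &&
                   march_pass(0x00000000u) &&
                   march_pass(0xFFFFFFFFu);
        }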

    Or you have multiple less reliable/safe systems and test whether they agree. I know of several industrial safety-relevant systems using Linux in their components; even if some are only SIL 1, or merely adhere to parts of the standard, the whole plant can be SIL 2 because of redundancy.
    Originally posted by coder View Post
    If the inputs are unknown or the objective is unclear, then you can't evaluate the quality of its solution. However, certain inputs are knowable, as are certain constraints and when they're about to be violated.
    Yeah, and to check the constraints you need similar hardware. AI is still new; I doubt there's a conclusion yet on how to tackle it in safety-critical applications. Tesla tries to argue with having gobs of recorded input to thoroughly test the trained AI.

    Originally posted by coder View Post
    I think whatever OS is hosting a VM with a safety-critical guest must also be certified for safety-critical applications, as a bug or failure in the host/hypervisor can invalidate the testing & assumptions made by the guest.
    This block alone will never be safe; it's just a matter of generating a response in a given timeframe. It can still be part of a safety-critical system, you would just have to argue how unlikely an operational fault is. That's the case with everything, down to the chance of flipped bits in a Cat cable.
    Last edited by discordian; 15 January 2022, 12:45 PM.



  • coder
    replied
    Originally posted by discordian View Post
    What I mean by "not supervisable" (I am not a native English speaker, so I probably picked a bad word) is that there is no easy way to decide whether the data coming out makes sense, other than running another instance and seeing whether they agree.
    Sure you can. You can run a collision-detection test to see if the control inputs from the algorithm are predicted to hit anything, and engage emergency measures if so. The test must necessarily be lower-level and more stable than the algorithm, obviously. That should also make it cheaper to do.
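    As a hypothetical sketch of such a lower-level check (the constant-deceleration stopping-distance model, the margins and the types are all invented for illustration; nothing here reflects any real stack):

        #include <stdbool.h>

        /* Hypothetical types; a real system gets these from sensor fusion
         * and the planner. */
        struct control_cmd { double speed_mps; double steer_rad; };
        struct obstacle    { double distance_m; };  /* nearest obstacle along path */

        #define MAX_DECEL_MPS2  6.0   /* assumed worst-case braking capability */
        #define SAFETY_MARGIN_M 2.0

        /* Independent, lower-level check: would executing this command still
         * leave enough distance to stop before the nearest obstacle? */
        bool command_is_safe(const struct control_cmd *cmd,
                             const struct obstacle *nearest)
        {
            double v = cmd->speed_mps;
            double stopping_distance = (v * v) / (2.0 * MAX_DECEL_MPS2);
            return stopping_distance + SAFETY_MARGIN_M < nearest->distance_m;
        }

        /* If the check fails, a supervisor overrides the planner output and
         * requests an emergency stop instead of passing the command through. */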

    Originally posted by discordian View Post
    with machine learning you can make few assumptions about what you get as an answer,
    If the inputs are unknown or the objective is unclear, then you can't evaluate the quality of its solution. However, certain inputs are knowable, as are certain constraints and when they're about to be violated.

    Originally posted by discordian View Post
    in both cases Linux support would help.
    I think whatever OS is hosting a VM with a safety-critical guest must also be certified for safety-critical applications, as a bug or failure in the host/hypervisor can invalidate the testing & assumptions made by the guest.



  • discordian
    replied
    Originally posted by coder View Post
    Wow. That's a lot.

    First, a deep learning network is basically a systolic processing graph, which typically can be evaluated in deterministic time. I presume there are some deep learning networks with feedback that can take a non-deterministic amount of time to converge, but that doesn't seem like a practical architecture for realtime control applications.

    Second, it doesn't follow that algorithms that are "not supervisable" in terms of their answers therefore don't need an RTOS, if that's what you're saying. The RTOS exists to ensure that threads get their requisite amount of processing time & resources, so they maintain the necessary degree of responsiveness.
    The RTOS is a part outside of the analysis and decision making. What I mean by "not supervisable" (I am not a native English speaker, so I probably picked a bad word) is that there is no easy way to decide whether the data coming out makes sense, other than running another instance and seeing whether they agree.
    Compare that to most math problems, where you can easily check whether a solution is valid or highly likely to be valid; with machine learning you can make few assumptions about what you get as an answer, even if it's deterministic.

    Yeah, an RTOS would need to check whether the system responds and doesn't report some error it can self-diagnose. But the high-level stuff is pretty much a black box.
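    For illustration only, the liveness/self-diagnosis side of that supervision can be as simple as the following sketch (the status block and thresholds are hypothetical):

        #include <stdint.h>
        #include <stdbool.h>

        /* Hypothetical status block the high-level (black-box) task updates
         * every cycle; the supervisor only checks liveness and self-reported
         * health, not whether the decisions themselves are good. */
        struct task_status {
            uint32_t counter;     /* incremented each completed cycle */
            bool     error_flag;  /* task's own self-diagnosis */
        };

        #define MAX_MISSED_CYCLES 3

        /* Returns false if the task reports a fault or appears hung;
         * degrading or stopping is then up to the caller. */
        bool supervisor_check(const struct task_status *now,
                              uint32_t *last_counter, unsigned *missed)
        {
            if (now->error_flag)
                return false;                    /* task reports its own fault */
            if (now->counter == *last_counter) {
                if (++(*missed) > MAX_MISSED_CYCLES)
                    return false;                /* repeated deadline misses */
            } else {
                *missed = 0;
                *last_counter = now->counter;
            }
            return true;
        }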

    Originally posted by coder View Post
    As for what to do when your self-driving pilot produces some bad output or otherwise outright fails, there ought to be a lower-level obstacle avoidance system that tries to safely bring the vehicle to a stop. Even with redundancy, you still need that for an algorithm operating safety-critical machinery with an unconstrained set of inputs.
    You won't be able to detect "bad output", in the sense that "bad decisions" are practically baked into the AI. You can only test whether the system is free of operational hiccups like bad memory, overheating or instability.
    The safety-critical part is then quite similar to classical cars: steering, brakes and so on are tested and redundant. No one is safe from a human driver making bad decisions; you could at most check some operational status like heart rate, sleep or pulse.

    Originally posted by coder View Post
    That's basically what I'm talking about. Maybe these chips aren't used directly for training, but at least for testing new algorithms and deep learning models.
    I would expect that those are the same chips. BTW, you can already run RT below Linux (Xenomai) or isolate cores and HW for RT later (Jailhouse); in both cases Linux support would help.



  • coder
    replied
    Originally posted by discordian View Post
    Some parts of the self-driving, i.e. the high-level stuff, are:

    - not what you called hard realtime
    - not really feasible to predict what it should do, and thus to supervise

    i.e. you get away without an RTOS, and you can't use safety measures except the most generic ones, like running 2-3 separate hardware instances and cross-checking.
    Wow. That's a lot.

    First, a deep learning network is basically a systolic processing graph, which typically can be evaluated in deterministic time. I presume there are some deep learning networks with feedback that can take a non-deterministic amount of time to converge, but that doesn't seem like a practical architecture for realtime control applications.
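    To illustrate why the evaluation time is deterministic, here is a toy fixed-size dense layer (sizes and activation chosen arbitrarily): the loop bounds are compile-time constants and there are no data-dependent branches, so the worst-case execution time is straightforward to bound.

        #define IN_DIM  8
        #define OUT_DIM 4

        /* One dense layer with a ReLU activation. Every execution performs
         * exactly IN_DIM * OUT_DIM multiply-adds, regardless of the input
         * values, so the execution time does not depend on the data. */
        void dense_relu(const float in[IN_DIM],
                        const float weights[OUT_DIM][IN_DIM],
                        const float bias[OUT_DIM],
                        float out[OUT_DIM])
        {
            for (int o = 0; o < OUT_DIM; o++) {
                float acc = bias[o];
                for (int i = 0; i < IN_DIM; i++)
                    acc += weights[o][i] * in[i];
                out[o] = acc > 0.0f ? acc : 0.0f;   /* ReLU */
            }
        }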

    Second, it doesn't follow that algorithms that are "not supervisable" in terms of their answers therefore don't need an RTOS, if that's what you're saying. The RTOS exists to ensure that threads get their requisite amount of processing time & resources, so they maintain the necessary degree of responsiveness.

    As for what to do when your self-driving pilot produces some bad output or otherwise outright fails, there ought to be a lower-level obstacle avoidance system that tries to safely bring the vehicle to a stop. Even with redundancy, you still need that for an algorithm operating safety-critical machinery with an unconstrained set of inputs.

    Originally posted by discordian View Post
    (And of course, they want to run their hardware in some server racks for AI training and testing)
    That's basically what I'm talking about. Maybe these chips aren't used directly for training, but at least for testing new algorithms and deep learning models.



  • filssavi
    replied
    fl1pm Lockstep execution (the arrangement you are describing) is mandatory for high levels of ASIL; however, it is nowhere near enough. It is in fact just a starting point.

    Redundant execution does protect you from hardware failure (power supply sags, radiation and cosmic rays, etc.), but it does nothing against logic errors (which are actually the most dangerous), whether due to bugs or, even more importantly, wrong requirements (this is why ASIL is so heavy on documentation).

    In these types of systems you typically have multiple separate clusters: one might run Linux and manage the high-level features, connectivity, etc., while another, running an RTOS (hard real-time, not the soft real-time that PREEMPT_RT provides), manages the safety-critical aspects of the design.

    The two then talk via very well-defined protocols, if they must.
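    As a purely hypothetical sketch of what such a "well-defined protocol" often amounts to in practice (the frame layout here is made up and is not any real ECU interface): a fixed-format frame with a sequence counter and a CRC, so the safety side can reject lost, duplicated or corrupted messages coming from the Linux cluster.

        #include <stdint.h>
        #include <stdbool.h>
        #include <stddef.h>

        /* Hypothetical frame sent from the Linux cluster to the safety MCU. */
        struct cluster_msg {
            uint8_t  seq;       /* wraps around; detects lost/duplicated frames */
            uint8_t  msg_id;
            uint16_t payload;   /* e.g. requested torque, scaled */
            uint16_t crc;       /* over all preceding bytes */
        };

        /* CRC-16-CCITT, bit by bit (slow but obvious). */
        static uint16_t crc16(const uint8_t *data, size_t len)
        {
            uint16_t crc = 0xFFFF;
            for (size_t i = 0; i < len; i++) {
                crc ^= (uint16_t)data[i] << 8;
                for (int b = 0; b < 8; b++)
                    crc = (crc & 0x8000) ? (uint16_t)((crc << 1) ^ 0x1021)
                                         : (uint16_t)(crc << 1);
            }
            return crc;
        }

        /* Receiver side: accept the frame only if the CRC matches and the
         * sequence number is the expected successor. */
        bool msg_accept(const struct cluster_msg *m, uint8_t *expected_seq)
        {
            uint16_t c = crc16((const uint8_t *)m,
                               offsetof(struct cluster_msg, crc));
            if (c != m->crc || m->seq != *expected_seq)
                return false;
            *expected_seq = (uint8_t)(*expected_seq + 1);
            return true;
        }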




  • discordian
    replied
    Originally posted by coder View Post
    So, are those "neural processing units" of Jim Keller's design, or some derivative thereof?

    I'm honestly surprised by this, since I'd read someone claim that Linux isn't capable of being certified for self-driving. So, I wonder why they'd bother with Linux support, unless it's just for the convenience of internal software development (e.g. testing the deep learning models and algorithms).
    Some parts of the self-driving, i.e. the high-level stuff, are:

    - not what you called hard realtime
    - not really feasible to predict what it should do, and thus to supervise

    i.e. you get away without an RTOS, and you can't use safety measures except the most generic ones, like running 2-3 separate hardware instances and cross-checking.

    (And of course, they want to run their hardware in some server racks for AI training and testing)
    Last edited by discordian; 14 January 2022, 05:55 AM.



  • coder
    replied
    Originally posted by Linuxxx View Post
    Most people seem to think that completely autonomous "killer" drones need to be incredibly sophisticated machines to be useful in real-world combat scenarios.

    As a matter of fact, I know how a certain class of autonomous "suicide bomber" drones, already put to very effective, tide-turning use in a recent real war, are built by a certain country:
    Terrorists and combatants in various conflicts are most certainly doing this, already.

    BTW, the concept of "suicide drones" arguably goes back as far as Germany's WW2-era V2 rockets. In more recent years, cruise missiles should fall in this category, if not also other types of guided missiles.



  • coder
    replied
    Originally posted by fl1pm View Post
    that makes it even more bizarre that Tesla is able to run its so-called "Full Self-Driving" software on Linux
    I don't think it's weird at all. They can do a lot of development, prototyping, and evaluation using Linux, and only deploy the finished deep learning models, software, and algorithms on the proper self-driving OS. There's a lot you need for testing models and algorithms that you wouldn't want to build into the RTOS used in the final product: debugging & visualization tools, and even media pipelines for streaming simulator or pre-recorded video into the video analysis code.
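    A minimal sketch of that kind of development harness (every name, frame size and entry point here is made up for illustration) would just replay recorded frames from disk through the same inference entry point the vehicle code would use:

        #include <stdio.h>
        #include <stdint.h>
        #include <stdlib.h>

        #define FRAME_BYTES (1280 * 960 * 3)   /* assumed raw RGB frame size */

        /* Hypothetical entry point into the vision stack under test. */
        extern void analyze_frame(const uint8_t *frame, size_t len);

        /* Replay a raw recording through the analysis code on a desktop Linux
         * box -- exactly the kind of tooling that is convenient under Linux and
         * irrelevant to the deployed, safety-certified target. */
        int main(int argc, char **argv)
        {
            if (argc < 2) {
                fprintf(stderr, "usage: %s recording.raw\n", argv[0]);
                return 1;
            }
            FILE *f = fopen(argv[1], "rb");
            if (!f) { perror("fopen"); return 1; }

            uint8_t *frame = malloc(FRAME_BYTES);
            while (frame && fread(frame, 1, FRAME_BYTES, f) == FRAME_BYTES)
                analyze_frame(frame, FRAME_BYTES);

            free(frame);
            fclose(f);
            return 0;
        }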

    What they gain by upstreaming their changes is eliminating the burden of maintaining out-of-tree support for their hardware. Maybe they could even provide dev kits with their SoC to robotics labs at leading universities.



  • fl1pm
    replied
    Thanks filssavi for clarifying the exact ISO norms that one needs to meet, but that makes it even more bizarre that Tesla is able to run its so-called "Full Self-Driving" software on Linux, as it would seem from the article. Or are they doing something like this: each of the three clusters runs its own independent Linux with the software on it, and the additional IP blocks mentioned in the article merge them into one output (e.g. all agree -> make the decision; exactly two agree -> go into emergency mode, but still able to do it fairly safely; all disagree -> some sort of emergency stop, which hopefully never happens), and then argue that the random variables corresponding to each cluster failing have very little correlation with each other (that seems like a big ask)?
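    For what it's worth, the voting scheme described above is essentially classic 2-out-of-3 (triple modular redundancy) voting; a toy voter, with types and agreement threshold invented purely for illustration, might look like this:

        #include <stdbool.h>
        #include <math.h>

        /* Hypothetical per-cluster output: e.g. a steering command in radians. */
        #define AGREE_EPS 0.01   /* how close two clusters must be to "agree" */

        enum vote_result { VOTE_OK, VOTE_DEGRADED, VOTE_EMERGENCY_STOP };

        static bool agree(double a, double b) { return fabs(a - b) < AGREE_EPS; }

        /* 2-out-of-3 voter: all agree -> act on the value; exactly two agree ->
         * act on the majority but enter a degraded mode; none agree -> stop. */
        enum vote_result vote(double a, double b, double c, double *out)
        {
            if (agree(a, b) && agree(b, c)) { *out = b; return VOTE_OK; }
            if (agree(a, b)) { *out = (a + b) / 2.0; return VOTE_DEGRADED; }
            if (agree(a, c)) { *out = (a + c) / 2.0; return VOTE_DEGRADED; }
            if (agree(b, c)) { *out = (b + c) / 2.0; return VOTE_DEGRADED; }
            return VOTE_EMERGENCY_STOP;
        }

    Whether the three clusters fail independently enough for this to help is, as noted, the hard part.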

