Abstract: The Electronic Throttle Control (ETC) unit was a major step towards reducing key parameters such as fuel consumption and exhaust emissions. Because the unit is safety-critical, the manufacturer Siemens Automotive initiated a dependability study. Its most important result is a set of quantitative estimates for the safety-critical and the availability-relevant cases. The study was based on the existing safety concept; following it, a slightly changed system architecture of the ECU was proposed to the VDA (Verband der Automobilindustrie), which significantly enhances the availability and safety of the ECU at about the same cost. For this study, a new kind of Markov evaluation method was used, called TEFT (Time-Extended Fault Trees). This was necessary to deal with concepts like multiple faults, faulty states, and time. This paper describes the questions raised by the Electronic Throttle Control system, together with our way of solving them.

Key Words: Dependability, Availability, Safety, Markovian methods, TEFT (Time-Extended Fault Trees), Cars, Powertrain, ECU (Electronic Control Unit), Electronic Throttle Control (ETC)

1 Problem Description

Rising demands from the public and from legislation to improve fuel consumption, exhaust emissions, and comfort drove the development of the Electronic Throttle Control (ETC) system. This solution removes the traditional Bowden cable and instead uses the torque request as the relevant parameter [see Fig. 1]. The torque request can be generated by the driver or by car comfort systems.

Fig. 1: Traditional Bowden cable solution versus Electronic Throttle Control

This approach makes it possible to give the car engine, at every moment, exactly the necessary mixture of air, fuel, and ignition angle. In the Bowden cable solution, by contrast, fuel and ignition angle try to follow the air mass provided by the throttle. Thanks to this exact and consistent engine control, ETC reduces fuel consumption and meets low-emission requirements. In addition, this concept eases the integration of comfort functions like cruise control or anti-ice measures, and of environmental measures, e.g. heating the catalyst.

The ETC unit, however, is considered to be safety-critical. For this reason, a dependability study was initiated by the ETC manufacturer Siemens AG, Automotive. The goals were to scrutinise the safety and availability of the considered Electronic Throttle Control system solutions, in order to detect weaknesses and to select the best architecture. The considered architectures were compared to the existing ETC system proposal. With our approach, it was possible

to make quantitative statements about the safety and availability of the system, also considering degrading states,
to compare the considered architectures quantitatively,
to detect weaknesses of the system already in the design phase,
to provide a basis for system tests.

2 Considered Faults

As a first step, possible faults and their effects on the system were analysed by an FMEA (Failure Mode and Effects Analysis, [3]). The fault sources were roughly structured as faults appearing in sensors, actuators, and the ECU (Electronic Control Unit). In the following subsections, we highlight the differences in the structure of these faults. By using our Markovian method TEFT, it became easier to describe the influence of any fault on the system behavior, also integrating system states and time.

2.1 Sensor Faults

From the beginning, the pedal sensor was considered to be safety-critical. For this reason, the sensors are generally built with redundancy, so that faults can be detected and masked. Note that common-mode failures such as a mechanical sensor break have the same effect as with Bowden cables, namely a signal stuck at an arbitrary sensor value, e.g. at the highest value. The measures to avoid faults in the A/D converters were not clear at the beginning of the project: the standard architecture only checks the A/D converters regularly, whereas in the other architectures we also considered alternatives with redundant components.

Other sensors like the oil temperature sensor were not considered safety-critical. Here, the goal of this study was to check the plausibility of these assumptions.

Other "sensor" signals come via the CAN bus. These signals were also considered, and their influence on the system behavior turned out to have been underestimated.

2.2 Actuator Faults

Actuators directly influence the engine behavior. Examples are the ignition angle and the fuel injection mass. The throttle appeared to be the most significant actuator: here, both actuator faults and sensor faults may occur, and the device is vulnerable to high temperatures and to ice.

Actuator faults have in common that they are an availability problem rather than a safety problem: they tend to decrease the engine performance rather than increase it.

2.3 Faults of the Electronic Control Unit (ECU)

The remaining faults appear in the ECU. Examples are processor faults, computing faults caused by hardware defects, or driver faults. These faults cause random values. Some of them are transient, so measures (like plausibility checks) can be taken to detect and possibly mask them.

Most of the failure-avoidance concepts of the ETC system are found in the ECU. These will be considered next.

3 Monitoring Concepts

To make sure that the system functions correctly even in the presence of faults, several checking strategies were designed. The overall goal was to discover as many faults as possible and to trigger an adequate reaction, in order to avoid critical states. In particular, the runaway case, i.e. undesired high engine performance, has to be avoided.

To obtain the desired torque (performance), the ECU controls the parameters air mass, fuel mass, and ignition angle. The ECU also controls the checking concepts. Because of this central role, the main processor is considered particularly critical. The software of the ECU was assumed to be correct, but hardware faults that lead to software errors were taken into account:

Faults that lead to a partial processor failure.

Memory faults that influence critical variables.

I/O errors, e.g. given by the A/D converters.

3.1 Processor Monitoring

Processor monitoring comprises measures that ensure the correct function of the ECU processors. Examples are watchdog timers. Thanks to the two-processor concept, each ECU processor can be checked by the other. The implemented function tests are:

Instruction set tests execute representative instruction sequences and check the result.
Memory tests detect stuck-at faults in the processor memory.
Test computations check the complete function of a software part by executing tests and comparing the results.

These monitoring measures efficiently check processor functions. Note that most processor faults lead to a total system failure causing an engine halt.

3.2 Process Monitoring

Process monitoring deals with the complete function of the ETC system. This is done by executing all safety-critical computations twice. To guarantee the efficiency of this checking strategy, the system has to be redundant ("two channels"). The processors get input values from different sensors, they have their own A/D converters, and they use diverse software. These measures avoid identical faults appearing in both channels.
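The two-channel idea can be illustrated by a minimal sketch. The function names, the linear pedal-to-torque mapping, the lookup table, and the tolerance are hypothetical and are not taken from the ETC software; the sketch only shows two diverse implementations of the same computation being cross-checked.

```python
def torque_channel_a(pedal_percent: float) -> float:
    """Channel A (hypothetical): direct linear map from pedal position (0..100 %) to a torque request (0..1)."""
    return pedal_percent / 100.0


def torque_channel_b(pedal_percent: float) -> float:
    """Channel B (hypothetical): diverse implementation using a lookup table with linear interpolation."""
    table = [(0.0, 0.0), (50.0, 0.5), (100.0, 1.0)]
    for (x0, y0), (x1, y1) in zip(table, table[1:]):
        if x0 <= pedal_percent <= x1:
            return y0 + (y1 - y0) * (pedal_percent - x0) / (x1 - x0)
    raise ValueError("pedal value outside table range")


def process_check(pedal_a: float, pedal_b: float, tolerance: float = 0.02) -> bool:
    """Return True if both channels, fed from separate sensors, agree within the (assumed) tolerance."""
    return abs(torque_channel_a(pedal_a) - torque_channel_b(pedal_b)) <= tolerance
```

A disagreement between the channels, e.g. `process_check(40.0, 90.0)`, would be treated as a detected fault and passed on to the safety reactions.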

3.3 Plausibility Monitoring and Signal Comparison

All input values are checked for plausibility. Short circuits and wire breaks are detected because the signal leaves the allowed value range. All safety-critical input values are measured by two sensors and compared with each other. The two sensors often use different transfer functions.

These checks deal with the complete input stream, including sensors, signal wires, and A/D converters.
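A range check combined with a comparison of two redundant sensors can be sketched as follows. The voltage limits, the two opposing transfer functions, and the permitted difference are assumptions chosen for illustration only.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SensorRange:
    valid_min: float  # lowest voltage still considered plausible (assumed value)
    valid_max: float  # highest voltage still considered plausible (assumed value)


def in_range(voltage: float, limits: SensorRange) -> bool:
    """Range check: a short circuit or wire break drives the signal out of range."""
    return limits.valid_min <= voltage <= limits.valid_max


def pedal_from_sensor_1(voltage: float) -> float:
    """Assumed rising transfer function: 0.5 V .. 4.5 V maps to 0 .. 100 %."""
    return (voltage - 0.5) / 4.0 * 100.0


def pedal_from_sensor_2(voltage: float) -> float:
    """Assumed falling transfer function: 4.5 V .. 0.5 V maps to 0 .. 100 %."""
    return (4.5 - voltage) / 4.0 * 100.0


def pedal_plausible(v1: float, v2: float, max_diff: float = 5.0) -> bool:
    """Both signals must be in range and must agree after conversion to pedal position."""
    limits = SensorRange(0.5, 4.5)
    if not (in_range(v1, limits) and in_range(v2, limits)):
        return False
    return abs(pedal_from_sensor_1(v1) - pedal_from_sensor_2(v2)) <= max_diff
```

With opposing transfer functions, a common disturbance that shifts both voltages in the same direction makes the converted values diverge, so the comparison can detect it.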

4 Safety Reactions

In order to guarantee the safety of the ETC system, a variety of safety reactions with different severity grades were implemented. The overall goal was to keep the system running as long as possible, e.g. to be able to maintain the heating function of the motor in very cold regions.

Thus, the following system modes were implemented (in rising severity):

intact: Although an error occurred, the system reacts according to its specification. In this mode, the built-in redundancy avoids a system error. This state is possible as long as no safety-critical fault occurs.
extended limp home: In this mode, the car still reacts to the pedal request, but acceleration and torque are reduced in order to avoid hazardous situations. Thus, the driver is able to react within reasonable time.
limp home: The engine runs at a slightly raised idle speed. The goal is to maintain heating and servo functions. Possibly, the car can still be moved.
safety stop: In this mode, the car is stopped. This mode is selected if no other means is able to maintain safety.

The safety of the system modes increases in the sense that the probability of a runaway decreases from mode to mode. These modes are implemented as internal states. The selection of the appropriate state as a fault reaction is an important part of the safety concept.

For the modeling approach, it is important that this system has an internal memory. The ECU reaction depends on the current mode, i.e. on the errors that occurred in the past.
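The role of the internal memory can be sketched with a small state holder. The mode names follow the list above; the class name and the rule that the mode can only escalate (never fall back to a less severe mode without repair) are assumptions made for this illustration.

```python
from enum import IntEnum


class Mode(IntEnum):
    """System modes in order of rising severity."""
    INTACT = 0
    EXTENDED_LIMP_HOME = 1
    LIMP_HOME = 2
    SAFETY_STOP = 3


class ModeMemory:
    """Remembers the current mode; assumed to escalate only, never to recover while driving."""

    def __init__(self) -> None:
        self.mode = Mode.INTACT

    def react(self, requested: Mode) -> Mode:
        # The reaction depends on the current mode: a less severe request
        # cannot undo a more severe mode reached by an earlier fault.
        self.mode = max(self.mode, requested)
        return self.mode
```

For example, after `react(Mode.LIMP_HOME)` a later `react(Mode.EXTENDED_LIMP_HOME)` leaves the system in limp home, which is why the same fault can have different effects depending on the fault history.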

5 Modeling Technique

We took an "extended fault tree approach" to model this system. To motivate this, let us briefly consider some features of standard fault trees:

Fault trees are a system function that maps sets of component faults onto a specified undesired event.
Stochastic dependencies have to be taken into account explicitly. Many fault tree tools offer extensions to describe them. A similar problem is raised by common-mode failures.
In highly redundant or re-configurable systems, the fault sequence sometimes plays a role: it makes a difference whether a component fails first or its checking function does (cf. silent errors). This kind of fault requires the introduction of states. Unfortunately, fault trees only consist of (combinatorial) system functions. Furthermore, fault trees are usually defined over sets of faults, not over ordered sets.
For the system analysis, only relevant fault sequences have to be considered. Thus, we have to "cut" the evaluation when the system enters the safety stop mode.
Fault rates may be state-dependent. A typical example is warm redundancy.

Note that these obstacles need not cause problems if they are properly taken into account. For complex systems, however, it can be rather difficult to model them using fault trees. Furthermore, a naive evaluation of redundant systems can cause significant errors.
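The dependence on the fault order can be made concrete with a toy example. The two faults and the effects assigned to them are purely illustrative and are not results of the study; the point is only that the same set of faults yields different effects for different orders, which a function over unordered sets cannot express.

```python
from itertools import permutations
from typing import Tuple


def effect_of_sequence(sequence: Tuple[str, ...]) -> str:
    """Toy model: a sensor fault is caught only while its monitor is still working."""
    monitor_ok = True
    for fault in sequence:
        if fault == "monitor":
            monitor_ok = False  # silent fault: no visible effect by itself
        elif fault == "sensor":
            # Detected fault leads to a degraded mode, undetected fault to a runaway.
            return "limp home" if monitor_ok else "runaway"
    return "intact"


# Both orders contain the same set of faults, yet the effects differ.
outcomes = {seq: effect_of_sequence(seq) for seq in permutations(["sensor", "monitor"])}
```

Here `outcomes` maps the order (sensor, monitor) to "limp home" and the order (monitor, sensor) to "runaway".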

Generalised Stochastic Petri Nets (GSPNs, [1]) are able to cope with the aspects mentioned above. This formalism uses a Petri Net extension for the description of the system failure behavior and its interdependencies, and maps this net to Markov chains for an evaluation, as pointed out in [1]. Thus, concurrent subsystems, synchronisation, system states, and exponential failure rates can be modeled consistently.

5.1 The TEFT Approach

For modeling the ETC system, we used our Markovian technique TEFT (Time-Extended Fault Trees). In this approach, the system function is modeled using fault trees, and the temporal behavior is modeled by GSPNs. The state transitions caused by faults are described by fault trees in a functional way. This description comprises the component faults, distinguished by the state in which they may occur, and the subsequent state. TEFT can only be used for acyclic Markov chains. This usage is motivated by the repair strategy for car-based systems: as soon as a fault occurs, cars are expected to be repaired in such a way that they behave like new ones, i.e. the Markov chain is reset to its initial state.

The evaluation is done by state space exploration: all fault sequences are generated up to a given depth, i.e. number of errors. This reachability graph, together with the exponential fault rates of the components, spans a Markov chain. The evaluation of this Markov chain gives the probability of the considered failures. For the performance of the Markov chain solver, it is very important that the reachability graph does not need to be held in memory, thus avoiding the main obstacle of state space explosion. The TEFT evaluation algorithm evaluates 200,000 states within one minute, so that one billion states can be reached within reasonable computing time. Furthermore, time-consuming iteration is avoided. The solution can be given as a closed-form expression.
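The exploration scheme can be sketched as a depth-first generator that yields ordered fault sequences one at a time, so the reachability graph is never stored. This is not the TEFT algorithm, whose details are not given here: the closed-form Markov solution is replaced by a leading-order (rare-event) approximation, each fault is assumed to occur at most once, and all names and rates are hypothetical.

```python
from math import factorial
from typing import Callable, Dict, Iterator, Tuple

FaultSequence = Tuple[str, ...]


def fault_sequences(
    rates: Dict[str, float],
    depth: int,
    is_cut: Callable[[FaultSequence], bool],
    prefix: FaultSequence = (),
) -> Iterator[FaultSequence]:
    """Yield ordered fault sequences depth-first; the graph itself is never stored."""
    for fault in rates:
        if fault in prefix:
            continue  # assumption: each fault occurs at most once per sequence
        sequence = prefix + (fault,)
        yield sequence
        # Stop extending at the depth limit or once the cut condition holds
        # (e.g. the system has entered the safety stop mode).
        if len(sequence) < depth and not is_cut(sequence):
            yield from fault_sequences(rates, depth, is_cut, sequence)


def sequence_probability(sequence: FaultSequence, rates: Dict[str, float], hours: float) -> float:
    """Leading-order probability that these faults occur in this order within `hours`.

    Valid only for rate * hours << 1; the factor 1/k! accounts for the fixed order.
    """
    probability = 1.0
    for fault in sequence:
        probability *= rates[fault] * hours
    return probability / factorial(len(sequence))
```

Because the generator yields one sequence at a time, the probabilities can be accumulated per effect while exploring, and memory use grows with the depth rather than with the number of states.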

This TEFT approach enables the modeler to ask very detailed questions about the system by computing several fault sequences, e.g. to measure how long the system can be used after the first fault.

5.2 System Components

The first modeling step is to break the system down into its "components" or "logical blocks". This dissection is determined by the appropriate abstraction level.

The ETC system was broken down into 40 logical blocks, comprising sensors, actuators, processors, and software components. If several logical blocks form a serial system, they can be aggregated into one single logical block, to ease modeling and reduce the evaluation complexity. For each component, the failure modes are listed; this work is done on the basis of an FMEA. Note that these failure modes only describe the faults of a single component. Several failure modes per component are possible, e.g. the pedal faults can be described as "Runaway/Middle/Idle" rather than as a Boolean "Ok/Faulty". For ETC, we considered 72 failure modes. All failure modes are assigned (exponential) fault rates, expressed in FIT (Failures In Time, i.e. failures per 10⁹ hours).
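The meaning of a FIT value can be made concrete by converting it into a failure probability over a given operating time, using the exponential distribution mentioned above. The function names are hypothetical, and any numbers passed in are examples, not values from the study.

```python
import math

HOURS_PER_FIT_UNIT = 1e9  # 1 FIT = 1 failure per 10^9 hours


def fit_to_rate_per_hour(fit: float) -> float:
    """Convert a FIT value into a failure rate per hour."""
    return fit / HOURS_PER_FIT_UNIT


def failure_probability(fit: float, hours: float) -> float:
    """P(failure within `hours`) = 1 - exp(-rate * hours) for an exponential lifetime."""
    # expm1 keeps precision for the very small exponents typical of FIT rates.
    return -math.expm1(-fit_to_rate_per_hour(fit) * hours)
```

For small exponents the result is close to `rate * hours`, so a component with 100 FIT has a failure probability of roughly 0.1 % over 10,000 operating hours.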

5.3 Functional Part

The fault effects, i.e. the system behavior that can be observed from outside, were classified as follows:

intact: The car reacts in the specified way.
acceleration drift: Slight undesired accelerations are noticed by the driver. The driver is usually able to compensate for these system reactions; they are comparable to effects like sudden wind changes.
limited performance: The car still reacts to the pedal request, but acceleration and torque are reduced. This reaction appears in system mode extended limp home.
idle performance: The engine runs at slightly increased idling speed. This reaction appears in system mode limp home.
engine stop: The engine is turned off.
runaway: Undesired high engine performance appears, e.g. high accelerations.

These effects are treated like "top events" in fault trees, i.e. undesired resulting events. Each combination of faults falls into exactly one of these categories; this property can be used for model consistency checks. The modeling has to be done manually. In our case, it required an extensive modeling specification; in this work package, many useful discussions with our customers yielded a lot of highly relevant knowledge, which is represented in the resulting model. In the modeling phase, the multi-valued faults and the system states supported by TEFT turned out to be extremely useful.
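The consistency check mentioned above (every fault combination falls into exactly one effect category) can be sketched by enumerating all combinations and counting the matching effects. The two components, the predicates, and their assignment to effects are hypothetical; only the multi-valued pedal failure modes are taken from the text.

```python
from itertools import product
from typing import Callable, Dict, List

FaultState = Dict[str, str]

# Hypothetical failure modes per component ("ok" means no fault).
FAILURE_MODES = {
    "pedal": ["ok", "runaway", "middle", "idle"],
    "monitor": ["ok", "faulty"],
}

# Hypothetical model: one predicate per externally visible effect.
EFFECTS: Dict[str, Callable[[FaultState], bool]] = {
    "intact": lambda s: s["pedal"] == "ok",
    "idle performance": lambda s: s["pedal"] != "ok" and s["monitor"] == "ok",
    "acceleration drift": lambda s: s["pedal"] in ("middle", "idle") and s["monitor"] == "faulty",
    "runaway": lambda s: s["pedal"] == "runaway" and s["monitor"] == "faulty",
}


def inconsistent_states() -> List[FaultState]:
    """Return every fault combination matched by no effect or by several effects."""
    bad = []
    names = list(FAILURE_MODES)
    for values in product(*(FAILURE_MODES[name] for name in names)):
        state = dict(zip(names, values))
        matches = [effect for effect, applies in EFFECTS.items() if applies(state)]
        if len(matches) != 1:
            bad.append(state)
    return bad
```

An empty result means the model is complete and unambiguous for the enumerated combinations; any returned state points to a gap or an overlap in the manual modeling.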

5.4 Temporal Behavior

Time appeared in two aspects:

Some faults were only possible after primary faults, i.e. not in every case and particularly not in the initial state of the system. TEFT describes this by using state-dependent activation conditions.
The internal states, i.e. the system memory, have to be modeled. TEFT implements this by references to more than one state.

6 Model Validation

A big advantage of TEFT is that every fault sequence and its effect can be listed. This list is comparable to the minimal cut sets of fault trees, but it represents ordered sets. It can also be sorted by the resulting FIT rate. This representation may be reminiscent of an FMECA, but the criticality is computed rather than estimated. (There are also some formal differences, so we avoid a naive comparison.)

This list can also be used as a basis for generating test cases, which allow the real system to be compared directly with the modeled one. This comparison can be made systematically for every system that leaves a production line.

The length of the list produced by TEFT is not always a pure advantage: several hundred thousand list entries can no longer be handled with standard spreadsheet tools. Furthermore, sorting the results may become difficult. On the other hand, the list really reflects the system complexity, and with postprocessors the handling can become very easy, without losing information. We often do this by writing a "question-sensitive postprocessor". This makes the captured results very valuable.

The evaluation speed also makes it possible to vary parameters, so that the modeler can study their effect on the complete system. This can be used both for varying component values and for components whose FIT rates are unknown, e.g. very new components.

7 Modeling Results

We modeled and evaluated several system alternatives, to study several architectural changes that were in discussion.

All systems have in common that redundant pedal and throttle sensors are used. The system differences are:

System #1: This is a one-processor solution. The processor performs process checking. We modeled this system to compare the availability of the redundant solutions with that of a naive solution that does not exist in practice.
System #2a: This system uses two processors, which perform processor checks. No process checking is done.
System #2b: The only difference from #2a is the added process checking.
System #3: This is a kind of true 2-channel system. The processors perform processor checks; in addition, the 2nd processor performs process checking. Both processors have A/D converters, and the redundant pedal and throttle signals are delivered to both processors.

The following table gives an overview of the main system differences [see Table 1]. Note: "p1" means processor #1.

System      A/D converters  Comparing redundant signals  Processor check p1  Processor check p2  Process checking
System #1   p1              p1                           none                irrelevant          p1
System #2a  p1              p1                           p2                  p1                  none
System #2b  p1              2x p1                        p2                  p1                  p1
System #3   p1 and p2       p1 and p2                    p2                  p1                  p2

Table 1: Significant differences of the considered system alternatives
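For automated comparisons, the content of Table 1 can be held as a small data structure. The cell values are copied from the table; the dictionary keys and the helper function are hypothetical names introduced for this sketch.

```python
from typing import Dict, List

# Cell values as in Table 1; the key names are chosen for this sketch.
TABLE_1: Dict[str, Dict[str, str]] = {
    "System #1": {"adc": "p1", "signal_comparison": "p1",
                  "processor_check_p1": "none", "processor_check_p2": "irrelevant",
                  "process_checking": "p1"},
    "System #2a": {"adc": "p1", "signal_comparison": "p1",
                   "processor_check_p1": "p2", "processor_check_p2": "p1",
                   "process_checking": "none"},
    "System #2b": {"adc": "p1", "signal_comparison": "2x p1",
                   "processor_check_p1": "p2", "processor_check_p2": "p1",
                   "process_checking": "p1"},
    "System #3": {"adc": "p1 and p2", "signal_comparison": "p1 and p2",
                  "processor_check_p1": "p2", "processor_check_p2": "p1",
                  "process_checking": "p2"},
}


def systems_without(feature: str) -> List[str]:
    """List the systems in which the given monitoring feature is marked 'none'."""
    return [name for name, row in TABLE_1.items() if row[feature] == "none"]
```

Queries such as `systems_without("process_checking")` then identify the architectures that lack a given monitoring measure.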

7.1 Single Faults

Since cars are expected to be maintained regularly, single faults are expected to play the most important role in the system. All the systems turned out to be rather similar; the exception was the safety-critical runaway case. For the runaway probability we obtained the following values [see Fig. 2]:

Fig. 2: Results for single faults (scale is linear, using arbitrary units)

Much less than 1% of the single faults lead to a runaway. The fault cases are:

The greatest part of the resulting probability is caused by faulty torque requests via the CAN bus. Note that in such cases the ETC system has no possibility to recognize these torque requests as faulty. These requests can be induced by several car components; examples are the gear switch signal of automatic transmissions, the so-called deceleration slip control, or a faulty reference voltage at the pedal sensor. Moreover, in system #3 these torque requests are the only single faults that lead to the runaway case.
Faults in the A/D converters may cause a runaway if both pedal sensor values are wrong. This is possible if both analogue signals from the pedal are multiplexed to a single A/D converter; only system #3 avoids this fault constellation. A/D converter errors can be recognized with a rather high probability, so the resulting effect is rather small. However, apart from the external torque requests, this is the only single fault that leads to the runaway case in system #2b.
Systems without process check, e.g. system #2a, may trigger the runaway case by faulty torque computations.
Systems without processor check, e.g. system #1, may trigger the runaway case by a processor error.

The modeling result was that only systems #3 and #2b are safe enough for a real-world implementation. Another result was the need to take a closer look at the external torque requests: their plausibility cannot be checked by means internal to the ETC. For this reason, these comfort (!) signals have to be considered safety-critical.

7.2 Double Faults

The main goal of the double fault examination was to confirm the single fault results. Another goal was to determine whether it is critical to keep running the car after certain single faults. Fortunately, double faults turned out to lead mainly to idle performance or engine stop. This result emphasises the relevance of single faults as the main safety criterion [see Fig. 3].

Fig. 3: Results for double faults (scale is linear, using arbitrary units)

7.3 Common-Mode Failures

The following common-mode failures lead to a runaway:

Common pedal value error, e.g. if the pedal breaks.
Common throttle value error. This fault can, however, be recognized by using the air mass sensor, and possibly other engine parameters.
Common A/D converter error. The most likely cause is a faulty reference voltage.
Common processor errors. They are mainly caused by software faults.

These failures can be avoided by a diversity approach.

8 Conclusions

The TEFT modeling method turned out to be able to scrutinise even complex systems, in order to make them safer by targeted architecture optimisation. This guided optimisation is a sound basis for designing systems efficiently and for saving the cost of unnecessary components, which can themselves be an additional cause of failure.

References

[1] Ajmone Marsan, M.; Balbo, G.; Conte, G.: "Performance Models of Multiprocessor Systems". MIT Press 1986

[2] DIN 25424: "Fehlerbaumanalyse. Methode und Bildzeichen". 1981 (part 1) and 1990 (part 2)

[3] DIN 25448: "Ausfalleffektanalyse (Fehler-Möglichkeits- und -Einfluß-Analyse)". 1990

[4] Jurgen, R.: "Automotive Electronics Handbook". McGraw-Hill 1995

Electronic Throttle Control - A Dependability Case Study