Research Project: Finesse
The ability to accurately diagnose and recover from faults in complex systems such as Oce copiers is a crucial element in achieving higher system dependability. As effective recovery (or repair) depends entirely on the accuracy with which the fault diagnostic process determines the root cause of failure, fault diagnosis (FD) is the key determining factor. Apart from the operational phase, FD is also beneficial in the development phase, where many system faults result from improper design and/or integration.
The increasing complexity of multidisciplinary systems that integrate mechanical, electronic, and embedded software components, such as paper handling systems, creates a growing tension between the effort of developing (correct) embedded FD software and the FD accuracy required to improve system dependability. For these types of complex systems, the effort to realize FD mechanisms with sufficient diagnostic accuracy to be practically useful for dependability enhancement is becoming prohibitive.
In the FINESSE project, we develop and investigate an improved FD strategy, based on a novel FD method within a model-based approach. The method provides the diagnostic accuracy required to meet the challenges posed by the complex application carrier. The model-based approach reduces the embedded FD software development effort, since the embedded code is generated directly from the model. As the model-based approach is relatively well established, the FD method is the central theme in FINESSE.
For reasons explained later on, diagnostic models of complex systems usually allow many diagnostic solutions, ordered by probability, while only one of the solutions reflects the actual system health (e.g., the combination of HW component X and SW component Y is unambiguously at fault). To radically improve FD accuracy compared to the current state of the art, we propose to (1) improve the quality of the probabilistic diagnosis ranking process, and (2) significantly decrease the number of diagnosis solutions. To address the former, we develop an improved fault probability modeling method to estimate the a priori probability of faults in software components, a task much more complex than modeling hardware component fault probabilities. To address the latter, we develop an improved FD algorithm that can reason over time at low cost and automatically generate test vectors as part of the diagnostic reasoning process. The combination of enhanced (SW) fault probability analysis with concepts known from the automatic test pattern generation and sequential diagnosis disciplines within a model-based FD algorithm has not been proposed before.
The above FD approach will be implemented in an existing model-based tool set based on TUD's system modeling language Lydia, and validated by means of a demonstrator on a paper handling system (PHS) of Oce. The issues to be investigated include the adequacy of the new FD approach for improving system dependability during operation, the modeling effort, and the computational cost of the FD approach, all compared to traditional techniques, as well as architectural development topics such as the added (dependability) value of improved sensor placement and improved testability features.
Beyond Oce and LogicaCMG, the impact of the research is very high. Many manufacturers that produce complex hardware-software artifacts performing functions with high economic added value and/or which are life-critical are facing tremendous problems with respect to system dependability, and have traditionally spent a huge effort on devising FD mechanisms. On a national scale, examples besides Oce can be found at industries such as ASML and Philips (Medical Systems, Consumer Electronics, Semiconductors). The FD solution provided by the FINESSE project is directly applicable to these domains. In general, the range of potential applications is virtually unlimited, potentially leading to autonomic dependability in many so-called intelligent products, ranging from PDAs to TVs, satellites, medical devices, wafer scanners, and process plants.
A PHS comprises a complex system of electro-mechanical moving parts, actuators, and sensors, controlled with very high accuracy by real-time embedded software. The dependability problems associated with this type of system are primarily related to the timely prediction and unambiguous determination of faults that are not operator-recoverable and therefore require company service effort, comprising FD and repair. To minimize service effort and machine downtime, it is also extremely important, from an economic point of view, to automate FD and to improve diagnostic resolution up to the point where, e.g., it is clear which spare part will need to be replaced (preventive maintenance). Apart from service-level feedback, FD can also be applied within a compensatory feedback loop (e.g., providing more motor power) to restore proper timing until the root-cause problem is remedied. In a PHS, the causal relationships between the degradation of (transfer) functionality (e.g., a worn-out motor) and the symptoms that are actually sensed (noisy signals, e.g., a later event timeout elsewhere) are extremely complex. As there is little room for misdiagnosis, especially in condition-based servicing and autonomic control loops, the diagnostic challenge is formidable, and dramatically improved FD algorithms are essential to advance the technology to a higher level of dependability and autonomy.
The above carrier application is typical of the dependability problems that next-generation embedded systems will face. With the increasing complexity of embedded systems, equipment malfunction is becoming a major factor in the cost of ownership. As more and more embedded-systems functionality is shifted to software, faults no longer originate from hardware alone but increasingly from closely integrated hardware and software components. Despite the impressive advances in development-phase techniques for improving dependability, such as model checking and testing, these techniques cannot sufficiently or cost-effectively achieve dependability in the operational phase, due to the huge, dynamic behavioral state space of today's complex systems. The growing appreciation that systems with non-zero fault probability (especially software) are simply an economic fact of life, to be coped with adequately, has recently sparked new interest in FD and recovery (commonly denoted fault detection, isolation, and recovery, or FDIR), aimed at minimizing the impact of the inevitable residual software defects and hardware failures during field operation. In this new paradigm, FD is seen as the key factor. FD is increasingly making its way into complex systems with high dependability requirements (the NASA Deep Space One probe, the JSF Prognostic Health Management system, Airbus, automotive X-by-wire). In a run-time context, accurate diagnostic (or even prognostic) information is provided so that subsequent recovery (or prevention) is also accurate. This may avoid, e.g., generating a service call or even shutting down an entire system as the result of a diagnosis that indicated a serious fault where, in fact, there was none. Furthermore, FD can be regarded as an advanced, automatic debugging tool that can significantly accelerate the fault analysis and repair process, which nowadays forms an increasingly important (human) resource bottleneck.
Despite the advantages of FD, implementing hand-crafted FD mechanisms in complex systems is extremely expensive because of their application-specific nature, the complex interplay between hardware and/or software faults, and the high dependability requirements on the (hand-crafted) FD software itself. Moreover, the coverage of hand-crafted FD systems is usually limited to those syndromes that have been anticipated and understood at design time. The enabling technology nowadays used to automatically generate embedded code that dependably performs these complex tasks is often referred to as model-based: embedded FD is generated from a model of the system. Consequently, the FD software effort shifts from explicit FD coding to systems modeling, a paradigm that is more accessible to application domain experts and avoids the need for highly specialized software engineering effort. In summary, FD is a promising approach to enhancing system dependability, and the automatic generation of FD logic removes some important hurdles that have traditionally limited the mainstream introduction of FD into industrial practice.
As mentioned earlier, the practical benefit of FD in the PHS is critically related to the ability of FD to accurately infer the health state of the components. The challenge in FD is to infer maximum diagnostic information on the operational status of software and hardware components from a typically limited amount of (noisy) observations. While current state-of-the-art diagnosis algorithms can handle state spaces of practical size, in an industrial context such as the application carrier their diagnostic accuracy is still quite limited. This is because for many systems the associated diagnostic models are severely underconstrained (weak models). Such models provide robustness when the system's fault behavior cannot (yet) be fully captured, or when the system has dynamically changing functional properties that are difficult to model. The weak models, in combination with the limited amount of observations and the measurement uncertainty, may lead to hundreds of possible diagnosis solutions (component 14 fault mode 1 (probability 0.25), or component 71 fault mode 2 (probability 0.21), or component 211 fault mode 1 and component 34 fault mode 2 (probability 0.05), or ...). Moreover, depending on the system's interconnection topology and the weakness of the model, the solution with the highest probability need not be the actual root cause of failure. As inaccurate root-cause information can easily trigger inappropriate service calls or recovery actions and/or provide erroneous feedback to the design and integration process, there is a great and ever-increasing need for diagnosis algorithms that deliver sufficient accuracy to adequately improve system dependability.
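As a concrete illustration of why weak models yield many ranked candidates, the following sketch (a hypothetical toy circuit with invented fault priors, not the Lydia tool set) enumerates every health assignment consistent with an observation under a weak fault model, in which a faulty component may produce any output, and ranks the assignments by their a priori probability:

```python
from itertools import product

# Hypothetical weak-model diagnosis of a tiny circuit: two inverters feeding
# an AND gate. A healthy component follows its function; a faulty one may
# output anything (the "weak" fault model: no assumed faulty behaviour).
PRIOR = {"and1": 0.02, "inv1": 0.01, "inv2": 0.01}  # assumed a priori fault rates
COMPONENTS = sorted(PRIOR)

def consistent(faulty, x1, x2, observed_out):
    """True if some behaviour of the faulty components matches the observation."""
    for y1, y2, out in product([0, 1], repeat=3):
        if "inv1" not in faulty and y1 != 1 - x1:
            continue  # healthy inverter must invert its input
        if "inv2" not in faulty and y2 != 1 - x2:
            continue
        if "and1" not in faulty and out != (y1 & y2):
            continue  # healthy AND gate must conjoin its inputs
        if out == observed_out:
            return True
    return False

def diagnoses(x1, x2, observed_out):
    """All consistent health assignments, ranked by prior probability."""
    result = []
    for bits in product([False, True], repeat=len(COMPONENTS)):
        faulty = {c for c, b in zip(COMPONENTS, bits) if b}
        if consistent(faulty, x1, x2, observed_out):
            p = 1.0
            for c in COMPONENTS:
                p *= PRIOR[c] if c in faulty else 1 - PRIOR[c]
            result.append((p, sorted(faulty)))
    return sorted(result, reverse=True)

# Inputs (0, 0) should yield out = 1; observing 0 implicates every single
# component on its own, plus all their supersets -- 7 candidate diagnoses.
for p, faulty in diagnoses(0, 0, 0)[:3]:
    print(faulty, round(p, 5))
```

Even this three-component example produces seven consistent diagnoses from one observation; in a weakly modeled system with hundreds of components, the candidate set grows accordingly, which is exactly the ambiguity the project targets.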
The accuracy of FD is critically related to (1) the accuracy of the a priori fault probability information of each system component, and (2) the amount of observations that can be exploited in the diagnostic inference process. The former determines the probabilistic ranking of the diagnostic solutions, while the latter determines the number of diagnostic solutions. Especially when the number of solutions is large, a small change (error) in the a priori probabilities can easily alter the ultimate ranking, with potentially dramatic consequences for diagnostic accuracy. In the FINESSE project we improve on the state of the art in both respects. With respect to fault probability analysis, we develop a model for predicting the fault probability of individual software components; failure models and probabilities are much better understood for hardware components than for software components, where the margin for error is much greater. This model integrates data obtained from the design, implementation, and testing stages of development. The prediction technique will be based on Bayesian belief networks (BBNs), among the most promising techniques for combining probabilities from heterogeneous sources. This research is carried out by the TUE-SAN group.
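A minimal sketch of the BBN idea, with an assumed network structure and invented numbers (the actual indicators and probability tables are themselves a subject of the research): the fault probability of one software component is obtained by summing the network's joint distribution over two development-phase indicators, and can be sharpened by conditioning on observed evidence.

```python
# Hypothetical two-indicator Bayesian network for one software component:
# code complexity and test coverage are parents of the "fault" node.

# Assumed marginals of the indicators
P_CPLX_HIGH = 0.3   # P(complexity = high)
P_COV_HIGH = 0.7    # P(test coverage = high)

# Assumed conditional probability table P(fault | complexity, coverage)
P_FAULT = {
    (True,  True):  0.10,  # complex code, well tested
    (True,  False): 0.40,  # complex code, poorly tested
    (False, True):  0.02,  # simple code, well tested
    (False, False): 0.15,  # simple code, poorly tested
}

def fault_probability(cplx_high=None, cov_high=None):
    """Marginal P(fault), optionally conditioned on observed indicators."""
    total = 0.0
    for c in [True, False]:
        if cplx_high is not None and c != cplx_high:
            continue  # evidence fixes this indicator
        for t in [True, False]:
            if cov_high is not None and t != cov_high:
                continue
            pc = P_CPLX_HIGH if c else 1 - P_CPLX_HIGH
            pt = P_COV_HIGH if t else 1 - P_COV_HIGH
            total += pc * pt * P_FAULT[(c, t)]
    # renormalise by the probability of the evidence
    ev = 1.0
    if cplx_high is not None:
        ev *= P_CPLX_HIGH if cplx_high else 1 - P_CPLX_HIGH
    if cov_high is not None:
        ev *= P_COV_HIGH if cov_high else 1 - P_COV_HIGH
    return total / ev

print(fault_probability())                # prior, no evidence observed
print(fault_probability(cplx_high=True))  # component known to be complex
```

In practice a BBN of realistic size would be evaluated with a dedicated inference engine rather than by explicit enumeration, but the per-component output plays the same role: it supplies the a priori fault probabilities that drive the diagnosis ranking.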
To minimize the solution space, more observations need to be exploited in the diagnostic reasoning than just those present in the momentary state. Our approach is twofold. First, we exploit the observation history through an incremental technique that probabilistically updates the previous solution set based on the solutions inferred from the current observations. Second, apart from the observations passively collected in a routine operational context, we actively generate specific test patterns to further reduce the solution space. This approach exploits the fact that systems often accept control input (test vectors) in certain operational modes (e.g., an idle PHS may autonomously execute particular test sequences to further narrow down the solution space). Based on the current diagnosis of the system, the inputs are computed by the FD algorithm so as to provide the greatest increase in diagnostic information (entropy loss). Standard diagnosis techniques (using information entropy as a measure of information gain) also determine additional observations to be made; however, as they require a human to perform the additional measurements, they are very impractical for many applications. Our approach does not rely on additional observations within the context of the current input, but delivers test patterns that bring the system into a state that removes the causes of diagnostic ambiguity. This part of the research is carried out by the TUD-PDS group.
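The test-selection principle can be sketched as follows, with a hypothetical diagnosis distribution and invented test outcomes: each candidate test vector is scored by the expected entropy of the diagnosis distribution after its outcome is observed, and the test with the lowest expected entropy (greatest expected information gain) is chosen.

```python
import math

# Hypothetical current diagnosis distribution (candidate -> probability).
diagnoses = {"motor_worn": 0.5, "sensor_stuck": 0.3, "belt_slip": 0.2}

# Predicted observable outcome of each candidate test vector under each
# diagnosis (purely illustrative; a real system derives these from the model).
predicted = {
    "run_motor_fast": {"motor_worn": "timeout", "sensor_stuck": "ok", "belt_slip": "ok"},
    "pulse_sensor":   {"motor_worn": "ok", "sensor_stuck": "flat", "belt_slip": "ok"},
}

def entropy(dist):
    """Shannon entropy (bits) of a probability distribution."""
    return -sum(p * math.log2(p) for p in dist.values() if p > 0)

def expected_entropy(test):
    """Expected posterior entropy of the diagnosis after observing `test`."""
    # Group diagnoses by the outcome they predict for this test.
    by_outcome = {}
    for d, p in diagnoses.items():
        by_outcome.setdefault(predicted[test][d], {})[d] = p
    exp = 0.0
    for group in by_outcome.values():
        p_outcome = sum(group.values())
        posterior = {d: p / p_outcome for d, p in group.items()}
        exp += p_outcome * entropy(posterior)
    return exp

# Pick the test vector that most reduces diagnostic ambiguity on average.
best = min(predicted, key=expected_entropy)
print(best, entropy(diagnoses), expected_entropy(best))
```

Here the motor test is preferred because its outcome splits the probability mass more evenly than the sensor test, so observing it removes more uncertainty on average; the same criterion, applied over model-derived predictions, drives the automatic test vector generation described above.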
The combination of (1) enhanced fault probability analysis (which improves the probability ordering of solutions) and (2) the integration of concepts known from the automatic test pattern generation and sequential diagnosis disciplines within a model-based FD algorithm (which decreases the number of solutions) is a unique feature of the proposed FD research. The application to the PHS provides the essential feedback on whether the algorithmic approach meets the stringent FD requirements found in practice.
The main research question that will be addressed is: what is the added value of the FINESSE approach (advanced probability modeling and FD algorithm) in terms of PHS dependability? In this context, dependability is directly related to diagnostic accuracy (fault coverage, e.g., in terms of entropy loss, false-positive rate, and related metrics), given the key role of FD in system dependability engineering (downtime minimization, better spare-part prognosis, better sensor placement, design for testability/diagnosability). Hence, the derived question is how diagnostic accuracy is improved compared to the state of the art (FD based on, e.g., CDA*). In addition, we address related research and engineering issues directly pertinent to the Oce case, such as the effect of variance in a priori probabilities on diagnostic accuracy, the proper prediction models for fault probability based on case studies of LogicaCMG software, diagnostic accuracy versus cost, the impact of diagnostic accuracy on system dependability, the proper metrics for diagnostic accuracy, the effects of sensor placement on diagnostic accuracy, and the effects of various algorithm parameters on diagnostic performance.