The goal of constructing reliable programs has led to the introduction of transaction (action) software into programming environments. The further goal of constructing reliable programs in a distributed environment has led to the extension of transaction systems to operate in a more decentralized environment. We present the design of a transaction manager that is integrated within the kernel of a decentralized operating system, the Clouds kernel. This action management system supports nested actions, action-based locking, and efficient facilities for supporting recovery. The recovery facilities have been designed to support a systems programming language which recognizes the concept of an action. Orphans, disjoint parts of actions that will later abort, are managed using a time-driven orphan detection scheme which requires a clock synchronization protocol. We present the facilities to generate a system-wide global clock. In addition to the design of facilities necessary to support a decentralized action manager, we present a search protocol to locate objects in the distributed environment. The design goal of this implementation has been to achieve the performance necessary to support an experimental testbed, which serves as the basis for further work in the area of decentralized systems. We feel this prototype implementation meets this goal.
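To make the time-driven idea concrete, the sketch below gives a minimal, hypothetical version of orphan detection in Python: each locally executing piece of an action carries a lease stamped against a (synchronized) clock, and pieces whose leases lapse without renewal are presumed orphaned and aborted locally. The names and structure are ours for illustration, not those of the Clouds kernel.

    import time

    # Hypothetical sketch: each piece of an action holds a lease expressed in
    # the time of a clock standing in for the system-wide global clock.  A
    # piece whose lease is not refreshed before it expires is presumed to be
    # an orphan and is aborted.
    class ActionPiece:
        def __init__(self, action_id, lease_seconds, clock=time.monotonic):
            self.action_id = action_id
            self.clock = clock
            self.expires_at = clock() + lease_seconds

        def refresh(self, lease_seconds):
            # Called when the action's coordinator extends the lease.
            self.expires_at = self.clock() + lease_seconds

        def is_orphan(self):
            return self.clock() > self.expires_at

    def reap_orphans(pieces):
        # Abort every local piece whose lease has lapsed; keep the rest.
        live = [p for p in pieces if not p.is_orphan()]
        for p in pieces:
            if p.is_orphan():
                print(f"aborting orphaned piece of action {p.action_id}")
        return live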
Developing effective debugging strategies to guarantee the reliability of software is important. By analyzing the debugging process used by experienced programmers, we have found that four distinct tasks are consistently performed: (1) determining statements involved in program failures, (2) selecting suspicious statements that might contain faults, (3) making hypotheses about suspicious faults (variables and locations), and (4) restoring program state to a specific statement for verification. This dissertation focuses on the second task, reducing the search domain for faults, referred to as fault localization. A new approach to enhancing the process of fault localization is explored based on dynamic program slicing and mutation-based testing. In this new scenario, a set of heuristics was developed to enable debuggers to highlight suspicious statements and thus to confine the search domain to a small region. A prototype debugging tool, SPYDER, was previously constructed to support the first task by using dynamic program slicing and the fourth task by backward execution; some of the heuristics were integrated into SPYDER to demonstrate our approach. An experiment confirms the effectiveness and feasibility of the proposed heuristics. A new debugging paradigm equipped with our proposed fault localization strategies is expected to reduce human interaction time significantly and aid in the debugging of complex software.
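As an illustration of the flavor of such heuristics (not the specific heuristics integrated into SPYDER), the following sketch ranks statements by how often they appear in the dynamic slices of failing test cases; statements implicated by more failures are marked as more suspicious, confining the region the debugger asks the user to examine.

    from collections import Counter

    # Illustrative fault-localization heuristic: statements appearing in the
    # dynamic slices of many failing test cases are ranked as more suspicious.
    def rank_suspicious_statements(failing_slices):
        # failing_slices: one set of statement numbers per failing test case.
        counts = Counter()
        for dynamic_slice in failing_slices:
            counts.update(dynamic_slice)
        return [stmt for stmt, _ in counts.most_common()]

    if __name__ == "__main__":
        slices = [{3, 7, 12}, {7, 12, 18}, {7, 12}]
        print(rank_suspicious_statements(slices))  # statements 7 and 12 lead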
Reliable software testing is a time-consuming operation. In addition to the time spent by the tester in identifying, locating, and correcting bugs, a significant amount of time is spent in the execution of the program under test and its instrumented or fault-induced variants. When using mutation-based testing to achieve high reliability, the number of such variants can be large. Providing a software testing tool that can efficiently exploit the architecture of a parallel machine gives the software tester more computing power and hence an opportunity to improve the reliability of the product being developed. In this thesis, we consider the problem of utilizing high-performance computers to improve the quality of software. We describe three approaches to the parallelization of mutant execution on three architectures: MIMD, Vector, and MIMD with vector processors. We describe the architecture of the PMothra system, designed to provide the tester a transparent interface to parallel machines. A prototype, constructed by interfacing the Mothra system to an Ncube through a scheduler, was used to conduct the experiments reported in this dissertation. Analysis of the algorithms developed and experimental results obtained on these three architectures are presented. Our results enable us to conclude that the MIMD machine, as typified by the Ncube, is superior to the other architectures considered for mutation-based software testing.
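The MIMD scheme amounts to farming independent mutants out to processors. The sketch below, with hypothetical names and a stand-in for mutant execution, conveys the host-side scheduling role that a system in the spirit of PMothra plays between the testing environment and the parallel machine.

    from multiprocessing import Pool

    # Illustrative host-side scheduler: independent mutants are distributed to
    # worker processes, each of which runs the mutant against the test data and
    # reports whether some test killed it.  run_mutant is a stand-in; the real
    # system dispatches compiled mutants to Ncube processors.
    def run_mutant(args):
        mutant_id, tests = args
        killed = any((mutant_id + len(t)) % 3 == 0 for t in tests)  # stand-in
        return mutant_id, killed

    def schedule_mutants(mutant_ids, tests, workers=8):
        with Pool(processes=workers) as pool:
            return dict(pool.map(run_mutant, [(m, tests) for m in mutant_ids]))

    if __name__ == "__main__":
        status = schedule_mutants(range(100), ["t1", "t2", "t3"])
        print(sum(status.values()), "of", len(status), "mutants killed")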
The objective of this research was to investigate the planning of information assurance in agent-based workflow systems. Two objectives were set: (1) defining information assurance in distributed information systems, such as ERP, and (2) planning agents for effective execution of assurance tasks in workflow systems. To achieve the first objective, a TQM-based definition of information assurance was developed after a literature review, MICSS lab experiments, and an industry survey. A list of requirements for information assurance was created. The results of the lab experiments show the differing impact of information assurance failures in an ERP system on the profit and due-date performance of a company. The fact that certain information failures are more critical than others motivated the implementation of variable assurance in workflow systems. The responses to the industry survey showed that companies are facing Information Significance problems in their ERP systems, and that decision-makers are willing and able to wait for better-assured information. To achieve the second objective, planning effective execution of assurance tasks in agent-based workflow systems, two research tasks were performed: development of a model for variable assurance implementation, and investigation of different variable assurance protocols and agent modes. For implementing variable assurance in agent-based workflow systems, the concepts of request analysis and context analysis for risk assessment were introduced. These features were added to a recently developed multi-agent framework for process monitoring called the Agent-based Integration Model of Information Systems (AIMIS). For investigating assurance protocols and models, experiments were performed with AutoMod to simulate the execution of assurance and production tasks by agents. The main findings are: (1) compared to systematic total information assurance, significant request-processing time can be saved by executing assurance tasks on an assurance-needs basis; the needs-based variable assurance protocol reduces the mean processing time of requests without decreasing the proportion of trusted requests; (2) polyvalent agents are a recommended solution for assuring production requests when assurance tasks are serialized.
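A toy version of the needs-based decision is sketched below: a request passes through assurance tasks only when its assessed risk, derived from request analysis and context analysis, exceeds a threshold. The scoring, threshold, and field names are purely illustrative and are not those used in AIMIS.

    # Toy needs-based (variable) assurance decision: assure a request only when
    # its assessed risk warrants it, instead of assuring every request.
    def assess_risk(request, context):
        score = 0.0
        score += 0.5 if request.get("value", 0) > 10000 else 0.1            # request analysis
        score += 0.4 if context.get("source_reliability", 1.0) < 0.8 else 0.0  # context analysis
        return score

    def route(request, context, threshold=0.5):
        if assess_risk(request, context) >= threshold:
            return "assure, then process"
        return "process directly"

    print(route({"value": 25000}, {"source_reliability": 0.6}))  # assure, then process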
Decentralization of computing systems has several attractions: performance enhancements due to increased parallelism; resource sharing; and the increased reliability and availability of data due to redundant copies of the data. Providing these characteristics in a decentralized system requires proper organization of the system. With respect to increasing the reliability of a system, one model which has proven successful is the object/action model, where tasks performed by the system are organized as sequences of atomic operations. The system can determine which operations have been performed completely and so maintain the system in a consistent state. This dissertation describes the design and a prototype implementation of a storage management system for an object-oriented, action-based decentralized kernel. The storage manager is responsible for providing reliable secondary storage structures. First, the dissertation shows how the object model is supported at the lowest levels in the kernel by the storage manager. It also describes how storage management facilities are integrated into the virtual memory management provided by the kernel to support the mapping of objects into virtual memory. All input and output to secondary storage is done via virtual memory management. This dissertation discusses the role of the storage management system in locating objects, and a technique intended to short-circuit searches whenever possible by avoiding unnecessary secondary storage queries at each site. It also presents a series of algorithms which support two-phase commit of atomic actions and then argues that these algorithms do indeed provide consistent recovery of object data. These algorithms make use of virtual memory management information to provide recovery, and relieve the action management system of the maintenance of stable storage.
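The commit algorithms follow the usual two-phase pattern. The minimal sketch below shows only that pattern and omits the dissertation's use of virtual memory management information for recovery; the participant interface is hypothetical.

    # Minimal two-phase commit sketch (participant interface is hypothetical).
    class Participant:
        def prepare(self, action_id):
            # Force modified object pages to stable storage; return the vote.
            return True
        def commit(self, action_id):
            pass  # make the prepared changes permanent
        def abort(self, action_id):
            pass  # discard prepared changes, restoring prior object state

    def two_phase_commit(action_id, participants):
        # Phase 1: collect votes; a single "no" (or failure) aborts the action.
        if all(p.prepare(action_id) for p in participants):
            # Phase 2: unanimous "yes", so commit at every site.
            for p in participants:
                p.commit(action_id)
            return "committed"
        for p in participants:
            p.abort(action_id)
        return "aborted"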
This dissertation presents a new debugging assistant called a Debugging Critic. As an alternative to an automated debugging oracle, the debugging critic evaluates hypotheses about fault locations. If it cannot confirm that the code at the hypothesized location contains a fault, it formulates an alternative hypothesis about the location of a faulty statement or the location of omitted statements. The debugging critic derives knowledge of possible locations of a fault that manifested itself under a given test case from failure symptoms. Therefore, it can operate without a line-by-line specification and a knowledge base of faults. A prototype of our debugging critic has been implemented as an extension of an existing debugger, Spyder. An experiment with Spyder shows that programmers debug two to four times faster on average with the debugging critic than without it. Ninety-two percent of the critic's users recommend the critic as an extension of conventional debuggers. The research in this dissertation contributes to debugging and critic systems. In debugging, our experiment shows that an active debugging assistant can effectively improve debugging performance. Another contribution is our approach to evaluating and formulating hypotheses about fault locations, especially the locations of omitted statements. In critic systems, our contribution is the use of questions to present information and non-insulting criticism.
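The sketch below conveys the evaluate-or-counter-propose loop in a much simplified form, using membership in a dynamic slice of the failing output as the only evidence; the actual critic derives candidate locations, including omission points, from failure symptoms, and all names here are invented.

    # Simplified critic loop: confirm a hypothesized fault location, or pose a
    # question that redirects the programmer to an alternative location.
    def critique(hypothesis_line, failing_slice, ranked_candidates):
        if hypothesis_line in failing_slice:
            return f"Line {hypothesis_line} affected the failure; examine it for a fault."
        for candidate in ranked_candidates:
            if candidate != hypothesis_line:
                return (f"Line {hypothesis_line} did not affect the failure. "
                        f"Could the fault be at line {candidate}, or is a "
                        f"statement missing near it?")
        return "No alternative location could be derived from the failure symptoms."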
Traditionally, compilers available to the software developer/tester have supported only two software testing techniques, statement and branch coverage. However, during compilation, sufficient syntactic and semantic information is available to provide support for additional testing techniques. This dissertation presents an approach to integrating support for program mutation, a well-known and effective software testing technique, directly into a compiler. The paradigm permits the intermediate states of computation within a machine-executable program to be monitored or modified subsequent to compilation, without recompiling the program. Program mutations are performed efficiently on native machine code, and direct support is provided for effective mutant execution on MIMD architectures. As such, the paradigm provides facilities for the development of practical tools that support program mutation while improving the cost-effectiveness of both experimental and testing applications. The paradigm is based upon program patch generation and application. A prototype patch-generating C compiler and a mutation-based software testing environment utilizing this paradigm have been constructed to demonstrate the approach. The prototype implementation supports the manipulation of separately compiled programs and therefore permits potentially large software systems to be tested. A set of experimental results compares the effectiveness of the compiler-integrated approach, employed by the prototype, to that employed by existing mutation-based software testing environments in providing support for program mutation.
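The patch idea can be pictured as follows (the patch format here is invented for illustration; the prototype operates on native machine code produced by its patch-generating C compiler): each mutant is a small patch to the compiled image, applied before a run and reverted afterward, so no recompilation is needed.

    from dataclasses import dataclass

    # Illustrative patch application: a mutant is a byte-level edit to the
    # compiled code image, applied for one run and then reverted.
    @dataclass
    class Patch:
        offset: int         # byte offset into the compiled code image
        original: bytes     # bytes expected at that offset
        replacement: bytes  # mutated bytes (e.g., a different operator's opcode)

    def apply_patch(image: bytearray, patch: Patch) -> None:
        end = patch.offset + len(patch.original)
        assert bytes(image[patch.offset:end]) == patch.original, "unexpected image contents"
        image[patch.offset:end] = patch.replacement

    def revert_patch(image: bytearray, patch: Patch) -> None:
        end = patch.offset + len(patch.replacement)
        image[patch.offset:end] = patch.original

    # Usage: for each mutant, apply its patches to a shared code image, run the
    # tests against the patched image, then revert before the next mutant.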
Concurrent software systems are more difficult to design and analyze than sequential systems. Considerable research has been done to help the testing, analysis, and verification of concurrent systems. Reachability analysis is a family of analysis techniques involving exhaustive examination of the whole state space. Reachability analysis is attractive because it is simple and relatively straightforward to automate, and can be used in conjunction with model-checking procedures to check for application-specific properties as well as general properties like freedom from deadlocks and race conditions. However, practical application of reachability analysis to real systems has been stymied by the state explosion problem, an exponential growth in the number of states to be explored as the number of processes increases. State explosion can be substantially controlled if the analysis approach is “compositional.” An analysis technique is compositional if the results of analyzing subsystems can be “composed” to obtain analysis results for the complete system. We propose using process algebra to achieve this goal in automated analysis tools. The algebraic structure of process algebra and its rich abstraction capabilities provide the means to achieve compositional (divide-and-conquer) analysis. The primary contribution of this dissertation is a demonstration and evaluation of the usefulness of process algebra in controlling state explosion in reachability analysis of models close to the model of concurrency found in real programs. We propose a two-level approach to controlling state explosion: process algebra at the coarse-grained level and simplification strategies based on a process-sleep mechanism at the fine-grained level. Although the process algebra-based compositional approach is not guaranteed to solve the state explosion problem, with a suitably modular design that provides good service abstraction, state explosion can be controlled sufficiently for application of reachability analysis to be practical.
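The cost that the compositional approach attacks can be seen in a toy reachability search over the parallel composition of two labeled transition systems: the composed state space grows toward the product of the component spaces, so minimizing each component (for example, modulo an observational equivalence) before composing keeps the search tractable. The encoding below is ours and is not tied to any particular process algebra.

    from collections import deque

    # Toy reachability over the parallel composition of two labeled transition
    # systems: actions in sync_actions must be taken jointly, all others
    # interleave.  Each lts maps state -> list of (action, next_state).
    def compose_and_explore(lts_a, lts_b, sync_actions, start):
        seen = {start}
        queue = deque([start])
        while queue:
            sa, sb = queue.popleft()
            successors = []
            for act, na in lts_a.get(sa, []):
                if act in sync_actions:
                    successors += [(na, nb) for b_act, nb in lts_b.get(sb, [])
                                   if b_act == act]
                else:
                    successors.append((na, sb))
            for act, nb in lts_b.get(sb, []):
                if act not in sync_actions:
                    successors.append((sa, nb))
            for nxt in successors:
                if nxt not in seen:
                    seen.add(nxt)
                    queue.append(nxt)
        return seen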
Debugging, which entails locating the program faults responsible for a program failure, is more difficult in concurrent programs than in sequential programs because of the inherent difficulty of understanding task interactions. In addition to sequential debugging facilities for individual threads of control, specialized facilities are needed to debug interactions among tasks. Integration of the two kinds of debugging facilities is motivated by the observation that a program fault can be manifested alternately as a sequence of erroneous program states and erroneous coordination among tasks before causing an observable failure. Localization of a fault may require facilities to trace the causes of a failure through both sequential computation and interactions. In this dissertation, we propose a two-dimensional model for debugging concurrent programs. Different, specialized facilities are provided for sequential debugging and concurrent debugging, respectively. These two “dimensions” are integrated in a way that supports tracing the effect of a program fault both within and across tasks. We present a design of concurrent debugging facilities based on the model. As the key component of the design, we introduce and define a customized dynamic dependence analysis called augmented concurrent dynamic slicing. Because many concurrent programs are nondeterministic, program instrumentation may perturb a program to the extent that it fails to reproduce the error-revealing behavior. Our analysis requires less execution information than prior analyses and provides flexible control of the tradeoff between precise analysis and behavior perturbation. Thus, our analysis is applicable to time-sensitive programs.
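The slicing idea can be pictured as a backward traversal over a dynamic dependence graph whose edges record both intra-task (data and control) and inter-task (synchronization) dependences. The sketch below is only that picture; the augmented analysis defined in the dissertation deliberately records less execution information than such a full graph.

    # Schematic backward traversal over a dynamic dependence graph.
    # dep_graph: event -> list of (predecessor_event, kind), where kind is
    # 'data', 'control', or 'sync' (the inter-task dimension).
    def dynamic_slice(dep_graph, failure_event):
        worklist = [failure_event]
        in_slice = set()
        while worklist:
            event = worklist.pop()
            if event in in_slice:
                continue
            in_slice.add(event)
            for predecessor, _kind in dep_graph.get(event, []):
                worklist.append(predecessor)
        return in_slice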
In this thesis, we prove that well-designed communication support can make the use of address spaces and high-level interfaces a practical solution to the structuring problems of complex distributed transaction processing systems. Such support includes efficient local interprocess communication, transaction-oriented multicasting and naming schemes, and specialized scheduling and synchronization policies. We take an implementation and experimental approach to prove our thesis. First, we identify the communication needs in a distributed transaction processing system. Second, we design a new communication facility that addresses those needs. Finally, we implement a prototype of the new communication facility and measure its impact on transaction processing. The new communication facility features shared-memory ports, a simple naming scheme, and a transaction-oriented multicasting mechanism. Local and remote communication is through ports. Ports can be accessed directly by the kernel and by user-level processes. The naming schemes used at the application and network levels avoid the use of name-resolution protocols. The multicasting mechanism is CPU- and network-efficient. The new communication facility reduces kernel overhead during transaction processing by up to 70%. To conduct some of the experiments in this thesis, we developed a system called Push, which allows the modification of operating system services at run time. Push is based on an extension language interpreted within the kernel. It enables experimentation that would be difficult and time-consuming in current environments. The overhead of the Push implementation can be factored out to give a good estimate of the performance of a native kernel implementation. We have used Push to implement kernel-resident communication services. A multicasting implementation in Push has an inherent overhead of 0.32 milliseconds per additional site. The corresponding overhead for a direct kernel-level implementation is 0.5 milliseconds and for a user-level implementation 0.57 milliseconds.
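A toy model of the port and multicast abstractions (names and structure are illustrative only, not the thesis's implementation): a transaction's participants are named as a group, and a single send is fanned out to every member port, with a flat name space standing in for the naming scheme that avoids resolution protocols.

    # Toy port-based communication with a transaction-oriented multicast.
    class Port:
        def __init__(self):
            self.queue = []
        def deliver(self, message):
            self.queue.append(message)

    ports = {}        # flat name -> Port (no name-resolution protocol needed)
    txn_groups = {}   # transaction id -> list of participant port names

    def multicast(txn_id, message):
        for name in txn_groups.get(txn_id, []):
            ports[name].deliver(message)

    ports["siteA"], ports["siteB"] = Port(), Port()
    txn_groups["T1"] = ["siteA", "siteB"]
    multicast("T1", ("prepare", "T1"))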
The main objective of replication in distributed database systems is to increase data availability. However, the overhead associated with replication may impair the performance of transaction processing. Moreover, in the presence of changing failure and transaction characteristics, static replication schemes are so restrictive that they may actually decrease availability. The purpose of this research is to show how adaptability and data reconfiguration can be used in conjunction with static replication schemes to achieve and maintain higher levels of availability. The basis of this research is an integrated study of the availability and performance of replication methods. The integrated study was suggested as future work in [Pu85]. The availability analysis part of the study is performed through an analytical model. The performance evaluation part is conducted on the second version of the RAID distributed database system developed at Purdue. We classify availability into two categories: algorithmic and operational. While algorithmic availability measures the fault tolerance provided by replication methods against component failures, operational availability examines the effect of performance failure on the validity of the algorithmic measure. Performance failure can result from an inefficient implementation of an expensive fault-tolerant replication method. Algorithmic availability is studied through an analytical model that encompasses transaction and database parameters, site and communication link failures, and replication methods' parameters. Operational availability is examined experimentally, through near-saturation performance measurements of an actual implementation of replication methods. We implement a variety of replication mechanisms and an infrastructure for adaptability to detected and predicted failures. The implementation includes off-line replication management, a stand-alone replication control server, a quorum-based interface to a library of replication methods, quorum selection heuristics, a surveillance facility, and a dynamic data reconfiguration protocol. The effectiveness and performance of our implementation, methods, and ideas are tested through experimental measurements of transaction performance. The experiments use an extended version of the standard DebitCredit benchmark. Using the availability model and the RAID system combined, a series of experiments is conducted. We study static replication schemes and develop local policies for their efficient use. We then examine how to adapt the use of these schemes to perturbations in parameters, including the transaction read/write mix and site and communication link reliabilities. The experiments give a number of insights about how to adapt replication in order to increase the availability of fault-intolerant replication methods and reduce the performance penalties of methods which are highly fault-tolerant.
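For the quorum-based interface, the familiar quorum-consensus constraints give a concrete picture of what a replication control server can enforce and what reconfiguration can adjust; the code below uses generic parameter names and is a sketch, not RAID's implementation.

    # Quorum-consensus constraints: with n copies, read quorum r and write
    # quorum w must satisfy r + w > n (every read quorum intersects every
    # write quorum, so reads see the latest write) and 2*w > n (no two writes
    # can proceed on disjoint sets of copies).
    def valid_quorums(n, r, w):
        return r + w > n and 2 * w > n

    def read_latest(replies):
        # replies: list of (version, value) pairs gathered from r copies.
        return max(replies, key=lambda reply: reply[0])[1]

    # Reconfiguration trades read cost against write availability by shifting
    # r and w while preserving the two inequalities above.
    print(valid_quorums(n=5, r=2, w=4))  # True
    print(valid_quorums(n=5, r=1, w=3))  # False: r + w does not exceed n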
The virtual memory abstraction aids in the design and implementation of portable applications. Virtual memory allows applications to execute in an arbitrarily large memory space, independent of the physical memory size of the underlying machine. Conventional virtual memory operating systems use magnetic disks for backing storage. Magnetic disks provide high data transfer rates, large storage capacity, and the ability to randomly access data, making them an appealing backing storage medium. However, the average seek time of a magnetic disk is several orders of magnitude slower than memory access times. Recent advances in CPU speeds, network bandwidth, and memory sizes have made new types of backing storage with improved performance and greater functionality feasible. This thesis investigates a new model for virtual memory in which dedicated, large-memory machines provide backing storage to virtual memory systems executing on a set of client machines in a distributed environment. Dedicated memory servers provide clients with a large, shared memory resource. Each memory server machine is capable of supporting heterogeneous client machines executing a wide variety of operating systems. Clients that exceed the capacity of their local memory access remote memory servers across a high-speed network to obtain additional storage space. Clients use a highly efficient, special-purpose, reliable, data-streaming, network-architecture-independent communication protocol to transfer data to and from the memory server. To reduce the delay associated with accessing remote memory, memory servers use efficient algorithms and data structures to retrieve data, on average, in constant time. In addition to providing a highly efficient backing store, the model allows data sharing between clients and improves file system performance by offloading the file server. This thesis also describes the design and implementation of a prototype system. Measurements obtained from the prototype implementation clearly demonstrate the viability of systems based on the model. The prototype shows that remote memory systems offer performance competitive with, and in some cases better than, existing virtual memory systems. Moreover, rapid advances in network bandwidth, CPU speeds, and memory sizes make the model an attractive basis for the design of future distributed systems.
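The client-server exchange can be sketched as a simple page-granularity protocol over a reliable byte stream; the actual system uses a specialized, network-architecture-independent data-streaming protocol, so the message layout below is purely illustrative.

    import struct

    PAGE_SIZE = 4096
    GET, PUT = 0, 1

    # Illustrative page-granularity exchange between a client pager and a
    # remote memory server; sock is a connected, reliable stream socket.
    def put_page(sock, asid, page_no, data):
        assert len(data) == PAGE_SIZE
        sock.sendall(struct.pack("!BII", PUT, asid, page_no) + data)

    def get_page(sock, asid, page_no):
        sock.sendall(struct.pack("!BII", GET, asid, page_no))
        buf = b""
        while len(buf) < PAGE_SIZE:
            chunk = sock.recv(PAGE_SIZE - len(buf))
            if not chunk:
                raise ConnectionError("memory server closed the connection")
            buf += chunk
        return buf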
The availability of low-cost, high-bandwidth optical fiber will change the technology and use of wide-area networks by allowing lower-cost, higher-speed communication. Higher transmission rates combined with wide geographical distribution will require new protocols for dynamic resource allocation in wide-area computer networks. This thesis analyzes a scheme that approaches the problem of network congestion for high-speed packet switches providing datagram services. The analyzed algorithm, which was proposed by Yavatkar [Yava89], avoids congestion by restricting excessive traffic at the entry of the network and by operating the network adaptively, on a rate basis, at the “knee point” of the delay-throughput curve. At this point a possible increase in throughput is small but would cause a relatively large increase in transmission delay. The algorithm is analyzed using basic, simple topologies to gain an understanding of its stationary and dynamic behavior. The thesis analyzes the characteristics, problems, and compromises of the algorithm. The congestion avoidance and control mechanism essentially trades resource efficiency for a small transmission delay inside the network and higher reliability. A low signal-to-noise ratio in the traffic measurements limits the effectiveness of the rate control. The analysis is based on computer simulation experiments using the simulation language CSIM.
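One way to picture operating at the knee is a rate controller that tracks "power" (throughput divided by delay): the rate is increased while power keeps improving and reduced once it falls, since beyond the knee the added throughput is small relative to the added delay. This generic knee-seeking loop is illustrative only and is not Yavatkar's algorithm.

    # Generic knee-seeking rate adjustment based on measured throughput and delay.
    def adjust_rate(rate, prev_power, throughput, delay, step=0.05):
        power = throughput / delay if delay > 0 else 0.0
        if power >= prev_power:
            new_rate = rate * (1 + step)       # still below the knee: probe upward
        else:
            new_rate = rate * (1 - 2 * step)   # past the knee: back off more sharply
        return new_rate, power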
This dissertation describes selected software issues in mapping tasks onto parallel processing systems that are shown to have a strong effect on performance. The work can be divided into two related categories: (1) parallel mapping studies, and (2) an analysis of dynamic task migration for fault tolerance, load balancing, and various other administrative reasons. The first category consists of two application case studies, specifically, the computation of multiple quadratic forms and image correlation. The goal of this part of the work is to understand the relationship between parallel system features (e.g., mode of parallelism supported, number of processors available, communication speed, computation speed) and parallel algorithm characteristics (e.g., amount of data parallelism available, amount of functional parallelism available, amount of scalar computation present). The knowledge obtained from the parallel mapping studies provided the foundation necessary to investigate the second category, the task migration work. This research involved developing a method to dynamically migrate a task between a SIMD (single instruction stream - multiple data stream) machine and an SPMD (single program - multiple data stream) machine. It is assumed that the SIMD and SPMD machines differ only in their support for the different modes of parallelism, and that the program was coded in a mode-independent programming language. This area of research targets systems that are either a network of different types of computers or a single system that can support multiple modes of parallelism.
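The migration step can be pictured as a checkpoint/restart pair over the task's logical, machine-independent state, captured at designated migration points in the mode-independent program. Everything in the sketch below is a stand-in for illustration rather than the method developed in the dissertation.

    import pickle

    # Schematic checkpoint/restart for mode-independent task migration.
    def checkpoint(task_state, migration_point):
        # task_state: per-processing-element data plus scalar variables,
        # recorded in logical (machine-independent) form.
        return pickle.dumps({"point": migration_point, "state": task_state})

    def restart(blob, resume_table):
        # resume_table maps a migration point to the code that resumes there
        # on the target machine, in its mode of parallelism.
        snapshot = pickle.loads(blob)
        return resume_table[snapshot["point"]](snapshot["state"])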