Reports and Papers Archive - Reports & Papers

Static specification inference using predicate mining

Murali Krishna Ramanathan, Ananth Grama, Suresh Jagannathan

The reliability and correctness of complex software systems can be significantly enhanced through well-defined specifications that dictate the use of various units of abstraction (e.g., modules, or procedures). Often times, however, specifications are either missing, imprecise, or simply too complex to encode within a signature, necessitating specification inference. The process of inferring specifications from complex software systems forms the focus of this paper. We describe a static inference mechanism for identifying the preconditions that must hold whenever a procedure is called. These preconditions may reflect both data flow properties (e.g., whenever p is called, variable x must be non-null) as well as control-flow properties (e.g., every call to p must bepreceded by a call to q). We derive these preconditions using a ninter-procedural path-sensitive dataflow analysis that gathers predicates at each program point. We apply mining techniques to these predicates to make specification inference robust to errors. This technique also allows us to derive higher-level specifications that abstract structural similarities among predicates (e.g., procedure p is called immediately after a conditional test that checks whether some variable v is non-null.) We describe an implementation of these techniques, and validate the effectiveness of the approach on a number of large open-source benchmarks. Experimental results confirm that our mining algorithms are efficient, and that the specifications derived are both precise and useful-the implementation discovers several critical, yet previously, undocumented preconditions for well-tested libraries.

Added 2008-05-19

MOBY-a mobile peer-to-peer service and data network

T. Horozov, A. Grama, V. Vasudevan, S. Landis

This paper describes the design and implementation of MOBY, a network for mobile peer-to-peer exchange of services and data. Constraints on computing power of mobile devices, limited hardware, networking, and software resources, and ad-hoc nature of mobile clients pose considerable challenges from the points of view of supporting performance goals, ease of service integration, and adaptation. These challenges are addressed in MOBY by dynamic service location and client mapping, surrogates for mobile clients, and standardized interfaces built upon off-the-shelf software components.

Added 2008-05-15

Inferring functional information from domain co-evolution

Yohan Kim, Mehmet Koyuturk, Umut Topkara, Ananth Grama, Shankar Subramaniam

Co-evolution is a powerful mechanism for understanding protein function. Prior work in this area has shown that co-evolving proteins are more likely to share the same function than those that do not because of functional constraints. Many of the efforts founded on this observation, however, are at the level of entire sequences, implicitly assuming that the complete protein sequence follows a single evolutionary trajectory. Since it is well known that a domain can exist in various contexts, this assumption is not valid for numerous multi-domain proteins. Motivated by these observations, we introduce a novel technique called Coevolutionary-Matrix that captures co-evolution between regions of two proteins. Instead of using existing domain information, the method exploits residue-level conservation to identify co-evolving regions that might correspond to domains.

Added 2008-05-15

Randomized Protocols for Duplicate Elimination in Peer-to-Peer Storage Systems

Ronaldo A. Ferreira, Murali K. Ramanathan, Ananth Grama, Suresh Jagannathan

Distributed peer-to-peer systems rely on voluntary participation of peers to effectively manage a storage pool. In such systems, data is generally replicated for performance and availability. If the storage associated with replication is not monitored and provisioned, the underlying benefits may not be realized. Resource constraints, performance scalability, and availability present diverse considerations. Availability and performance scalability, in terms of response time, are improved by aggressive replication, whereas resource constraints limit total storage in the network. Identification and elimination of redundant data pose fundamental problems for such systems. In this paper, we present a novel and efficient solution that addresses availability and scalability with respect to management of redundant data. Specifically, we address the problem of duplicate elimination in the context of systems connected over an unstructured peer-to-peer network in which there is no a priori binding between an object and its location. We propose two randomized protocols to solve this problem in a scalable and decentralized fashion that does not compromise the availability requirements of the application. Performance results using both large-scale simulations and a prototype built on PlanetLab demonstrate that our protocols provide high probabilistic guarantees while incurring minimal administrative overheads.

Added 2008-05-15

When being Weak is Brave: Privacy Issues in Recommender Systems

N. Ramakrishnan, B.J. Keller, B.J. Mirza, A.Y. Grama

Added 2008-05-15

Path-Sensitive Inference of Function Precedence Protocols

Murali Krishna Ramanathan, Ananth Grama, Suresh Jagannathan

Function precedence protocols define ordering relations among function calls in a program. In some instances, precedence protocols are well-understood (e.g., a call to pthread mutex init must always be present on all program paths before a call to pthread mutex lock ). Oftentimes, however, these protocols are neither well-documented, nor easily derived. As a result, protocol violations can lead to subtle errors that are difficult to identify and correct. In this paper, we present CHRONICLER, a tool that applies scalable inter-procedural path-sensitive static analysis to automatically infer accurate function precedence protocols. CHRONICLER computes precedence relations based on a program’s control-flow structure, integrates these relations into a repository, and analyzes them using sequence mining techniques to generate a collection of feasible precedence protocols. Deviations from these protocols found in the program are tagged as violations, and represent potential sources of bugs. We demonstrate CHRONICLER’s effectiveness by deriving protocols for a collection of benchmarks ranging in size from 66K to 2M lines of code. Our results not only confirm the existence of bugs in these programs due to precedence protocol violations, but also highlight the importance of path sensitivity on accuracy and scalability.

Added 2008-05-15

Conquest: A Distributed Tool for Constructing Summaries of High-Dimensional Discrete Attributed Datasets

Jie Chi, Mehmet Koyuturk, Ananth Grama

Added 2008-05-15

Bounded-Error Compression of Particle Data from Hierarchical Approximate Methods

Dow-Yung Yang, Ananth Y. Grama, Vivek Sarin

This paper presents an analytical and computational framework for the compression of particle data resulting from hierarchical approximate treecodes such as the Barnes-Hut and Fast Multipole Methods. Due to the approximations introduced by hierarchical methods, the position (as well as velocity and acceleration) of a particle can be bounded by a distortion radius. We develop storage schemes that maintain this distortion radii while maximizing compression. Our schemes make extensive use of spatial and temporal coherence of particle behavior and yield compression ratios higher than 12:1 over raw data, and 6:1 over gzipped (LZ78) raw data. We demonstrate that for uniform distributions with 100K particles, storage requirements can be reduced from 1200KB (100K Ã— 12B) to about 99KB (under 1 byte per particle per timestep). This is significant because it enables faster storage/retrieval, better temporal resolution, and improved analysis. Our results are shown to scale from small systems (2K particles) to much larger systems (over 100K particles). The associated algorithm is optimal (O(n)) in both storage and computation with small constants.

Added 2008-05-15

Multipole-based preconditioners for large sparse linear systems

Sreekanth R. Sambavaram, Vivek Sarin, Ahmed Sameh, Ananth Grama

Dense operators for preconditioning sparse linear systems have traditionally been considered infeasible due to their excessive computational and memory requirements. With the emergence of techniques such as block low-rank approximations and hierarchical multipole approximations, the cost of computing and storing these preconditioners has reduced dramatically. This paper describes the use of multipole operators as parallel preconditioners for sparse linear systems. Hierarchical multipole approximations of explicit Greenâ€™s functions are effective preconditioners due to their bounded-error properties. By enumerating nodes in proximity preserving order, one can achieve high parallel efficiency in computing matrixâ€“vector products with these dense preconditioners. The benefits of the approach are illustrated on the Poisson problem and the generalized Stokes problem arising in incompressible fluid flow simulations. Numerical experiments show that the multipole-based techniques are effective preconditioners that can be parallelized efficiently on multiprocessing platforms.

Added 2008-05-15

Architecture Independent Analysis of Parallel Programs

Ananth Grama, Vipin Kumar, Sanjay Ranka, Vineet Singh

The presence of a universal machine model for serial algorithm design, namely the von Neumann model, has been one of the key ingredients of the success of uniprocessors. The presence of such a model makes it possible for an algorithm to be ported across a wide range of uniprocessors efficiently. Universal algorithm design models proposed for parallel computers however tend to be limited in the range of parallel platforms they can efficiently cover. Consequently, portability of parallel programs is attained at the expense of loss of efficiency. In this paper, we explore desirable and attainable properties of universal models of architecture independent parallel program design. We study various models that have been proposed, classify them based on important machine parameters and study their limitations.

Added 2008-05-15

Randomized leader election

Murali Krishna Ramanathan, Ronaldo A. Ferreira, Suresh Jagannathan, Ananth Grama, Wojciech Szpankowski

We present an efficient randomized algorithm for leader election in large-scale distributed systems. The proposed algorithm is optimal in message complexity (O(n) for a set of n total processes), has round complexity logarithmic in the number of processes in the system, and provides high probabilistic guarantees on the election of a unique leader. The algorithm relies on a balls and bins abstraction and works in two phases. The main novelty of the work is in the first phase where the number of contending processes is reduced in a controlled manner. Probabilistic quorums are used to determine a winner in the second phase. We discuss, in detail, the synchronous version of the algorithm, provide extensions to an asynchronous version and examine the impact of failures.

Added 2008-05-15

Plethora: An Efficient Wide-Area Storage System

Ronaldo A. Ferreira, Ananth Grama, Suresh Jagannathan

Trends in conventional storage infrastructure motivate the development of foundational technologies for building a wide-area read-write storage repository capable of providing a single image of a distributed storage resource. The overarching design goals of such an infrastructure include client performance, global resource utilization, system scalability (providing a single logical view of larger resource and user pools) and application scalability (enabling single applications with large resource requirements). Such a storage infrastructure forms the basis for second generation data-grid efforts underlying massive data handling in high-energy physics, nanosciences, and bioinformatics, among others. This paper describes some of the foundational technologies underlying such a repository, Plethora, for semi-static peer-to-peer (P2P) networks implemented on a wide-area Internet testbed. In contrast to many current efforts that focus entirely on unstructured dynamic P2P environments, Plethora focuses on semi-static peers with strong network connectivity and a partially persistent network state. In a semi-static P2P network, peers are likely to remain participants in the network over long periods of time (e.g., compute servers), and are capable of providing reasonably high availability and response-time guarantees. The repository integrates novel concepts in locality enhancing overlay networks, transactional semantics for read-write data coupled with hierarchical versioning, and novel erasure codes for robustness. While mentioning approaches taken by Plethora to other problems, this paper focuses on the problem of routing data request to blocks, while integrating caching and locality enhancing overlays into a single framework. We show significant performance improvements resulting from our routing techniques.

Added 2008-05-15

Extended Consistent Hashing: A Framework for Distrubted Servers

S. Lei, A. Grama

Added 2008-05-15

Compression of particle data from hierarchical approximate methods

Dow-Yung Yang, Ananth Grama, Vivek Sarin, Naren Ramakrishnan

This article presents an analytical and computational framework for the compression of particle data resulting from hierarchical approximate treecodes such as the Barnes—Hut and Fast Multipole Methods. Due to approximations introduced by hierarchical methods, various parameters (such as position, velocity, acceleration, potential) associated with a particle can be bounded by distortion radii. Using this distortion radii, we develop storage schemes that guarantee error bounds while maximizing compression. Our schemes make extensive use of spatial and temporal coherence of particle behavior and yield compression ratios higher than 12:1 over raw data, and 6:1 over gzipped (LZ) raw data for selected simulation instances. We demonstrate that for uniform distributions with 2M particles, storage requirements can be reduced from 24 MB to about 1.8 MB (about 7 bits per particle per timestep) for storing particle positions. This is significant because it enables faster storage/retrieval, better temporal resolution, and improved analysis. Our results are shown to scale from small systems (2K particles) to much larger systems (over 2M particles). The associated algorithm is asymptotically optimal in computation time (O(n)) with a small constant. Our implementations are demonstrated to run extremely fast—-much faster than the time it takes to compute a single time-step advance. In addition, our compression framework relies on a natural hierarchical representation upon which other analysis tasks such as segmented and window retrieval can be built.

Added 2008-05-15

An IP address based caching scheme for peer-to-peer networks

R.A. Ferreira, A. Grama, S. Jagannathan

Distributed hash tables (DHTs), used in a number of current peer-to-peer systems, provide efficient mechanisms for resource location. Systems such as Chord, Pastry, CAN, and Tapestry provide strong guarantees that queries in the overlay network can be resolved in a bounded number of overlay hops, while preserving load balance among the peers. A key distinction in these systems is the way they handle locality in the underlying network. Topology-based node identifier assignment, proximity routing, and proximity neighbor selection are examples of heuristics used to minimize message delays in the underlying network. We investigate the use of source IP addresses to enhance locality in overlay networks based on DHTs. We first show that a naive use of source IP address potentially leads to severe resource imbalance due to nonuniformity of peers over the IP space. We then present an effective caching scheme that combines a segment of the source IP with the queried hash-code to localize access and affect replication effectively. Using detailed experiments, we show that this scheme achieves performance gains of up to 41%, when compared to Pastry in combination with the proximity neighbor selection heuristic.

Added 2008-05-15