Assured Cloud-based Big Data Analysis


Principal Investigator: Patrick Eugster

Institutions across industry and government increasingly rely on dedicated data centers (which can be viewed as in-house clouds) for their computational needs, in particular for large-scale data analysis. Clouds are in fact considered one of the key ingredients to solving big data problems. A major roadblock to faster adoption of cloud computing techniques, in particular of cost-effective public clouds, is the current lack of assurance: malicious programs, faulty hardware, or software defects can corrupt data, cause services to fail, and hamper availability. In-house clouds are not immune to these problems either, and the boundaries between in-house and public clouds blur in large corporations and in government. Consolidation of data centers across institutions, both data- and hardware-wise, is a top priority in the DoD and a good illustration of the emerging "cloud-of-clouds" paradigm. While the three-letter agencies involved trust each other, no data center is immune to attacks or failures, and the corresponding issues need to be contained. In this setup, cloud resource providers can be trusted, but clients can exhibit malicious or benign failures. Our proposal aims to address confidentiality, integrity, and availability issues of cloud-based large-scale big data processing applications.
Our solution is centered around Byzantine fault-tolerant (BFT) replication. BFT systems can produce correct output in the presence of failures by replicating computation onto multiple nodes and ensuring that at least f + 1 nodes agree on the final result (where f is the maximum number of faulty nodes tolerated). Faulty components manifest themselves by producing erroneous results. Such replication protects computation from both benign failures and malicious activities (which are often hard, if not impossible, to distinguish at first) and can point to faulty components, which aids attribution and auditing while jobs still complete. BFT techniques can be applied at a higher level in the protocol stack, here typically at the level of data-flow program execution, which allows them to be deployed easily across different cloud platforms and infrastructures, thus supporting interoperability and portability. Using BFT techniques for assured processing gains further significance today because of the widespread adoption of address space layout randomization by operating systems (e.g., many GNU/Linux flavors, Windows 7, Mac OS X Lion), which reduces the dependence on N-version programming.
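To make the agreement rule concrete, the following Python sketch (illustrative only; names such as agree_on_result are ours and not part of any existing system) accepts a result only if at least f + 1 replicas return outputs with a matching digest:

    import hashlib
    from collections import Counter

    def digest(output: bytes) -> str:
        """Hash a replica's output so that only small digests need to be compared."""
        return hashlib.sha256(output).hexdigest()

    def agree_on_result(replica_outputs: list[bytes], f: int) -> bytes:
        """Accept a result only if at least f + 1 replicas produced it.

        replica_outputs -- raw outputs collected from the replicas
        f               -- maximum number of faulty replicas tolerated
        """
        votes = Counter(digest(out) for out in replica_outputs)
        winner, count = votes.most_common(1)[0]
        if count < f + 1:
            raise RuntimeError("no result is backed by f + 1 matching replicas")
        # Return any output whose digest matches the agreed-upon one.
        return next(out for out in replica_outputs if digest(out) == winner)

    # Example: 4 replicas, tolerating f = 1 fault; one replica is corrupted.
    outputs = [b"sum=42", b"sum=42", b"sum=41", b"sum=42"]
    print(agree_on_result(outputs, f=1))  # b'sum=42'

Comparing digests rather than full outputs keeps the agreement step cheap even when replica outputs are large.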
Using BFT techniques in data centers, however, poses challenges of its own: (a) applications in the cloud process terabytes of data, and naively replicating entire applications would considerably increase resource usage and overhead; (b) existing BFT replication algorithms focus on securing monolithic servers; (c) BFT does not address confidentiality. Furthermore, current BFT solutions do not address practical requirements such as trading security guarantees against operation cost based on the threat level.
We propose to address these challenges by exploiting the unique traits of cloud data analysis applications:  
1. Variable Grain Clustering: The components of a typical cloud-based data analysis application and their interactions can be modeled as a directed acyclic graph. One component C1 may invoke components C2 and C3. If we replicate C1 and allow every replica of C1 to invoke C2 and C3, with each instance of C2 and C3 in turn being replicated, the overall overhead becomes prohibitively high, in particular if intermediate data transferred between components is encrypted. We thus identify sub-graphs in the acyclic execution graph that can be replicated as a whole; digests of the outputs of such a sub-graph are compared and agreed upon only at its boundary (see the first sketch after this list). Different granularities are possible within a graph, and concrete requirements for a given job execution can be used to trade overhead against guarantees.
2. Sampling Replication: If the output of a cluster deviates from that of its replicas, we only know that one or more nodes in the cluster are faulty. We thus augment variable grain clustering by identifying and tagging suspicious nodes based on deviant output or other activities. This can be done with a small, strongly secured deployment tier responsible for allocating worker nodes to clusters, and by carefully overlapping the replica sets of different applications; such overlap happens inherently in cloud-based execution, and the runtime system can reinforce it by intentionally making replica sets overlap or even by running dummy jobs on suspicious nodes (see the second sketch after this list).
3. Program Analysis: A confined programming model, such as that provided by data-flow based programming and query languages, allows static program analysis to be leveraged to various ends, including: breaking programs down into individual operations, which allows for meaningful identification of clusters and reshuffling of (commutative) operations; the use of specific operators, which allows for easy application of information dispersal, i.e., breaking data sets down to a finer granularity and storing the pieces separately; and knowledge of the operations performed in a given phase of a program, which allows partially homomorphic encryption schemes to be identified and applied between phases, i.e., schemes homomorphic with respect to the operation(s) performed at that stage (see the third sketch after this list).
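First sketch (variable grain clustering): the toy Python fragment below runs a whole chain C1 -> C2 -> C3 as a single replicated unit and exposes only one digest per replica for agreement. The component functions, the chain structure, and the name run_cluster are placeholders; a real execution graph would be a general DAG rather than a chain.

    import hashlib
    from typing import Callable, Sequence

    Component = Callable[[bytes], bytes]  # toy stand-in for a data-flow component

    def run_cluster(chain: Sequence[Component], data: bytes) -> str:
        """Execute a whole sub-graph (simplified here to a chain) on one replica
        and return only a digest of its final output for agreement."""
        for component in chain:
            data = component(data)
        return hashlib.sha256(data).hexdigest()

    # Toy components C1 -> C2 -> C3, replicated as a single cluster on 3 replicas.
    c1 = lambda d: d.upper()
    c2 = lambda d: d[::-1]
    c3 = lambda d: d + b"!"

    digests = [run_cluster([c1, c2, c3], b"input records") for _ in range(3)]
    # Agreement compares one digest per replica instead of one per component,
    # so far less (possibly encrypted) intermediate data crosses replica boundaries.
    print(len(set(digests)) == 1)  # True when no replica misbehaves

Choosing larger sub-graphs lowers agreement overhead but also delays the point at which a faulty replica can be detected, which is exactly the overhead/guarantee trade-off mentioned above.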
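Second sketch (sampling replication): a minimal model, assuming a small trusted deployment tier, that counts how often a node's digest deviates from the majority across overlapping replica sets and flags nodes that deviate repeatedly. The class name DeploymentTier and the threshold are illustrative choices.

    from collections import defaultdict

    class DeploymentTier:
        """Toy model of the small, strongly secured tier that tags suspicious nodes."""

        def __init__(self, threshold: int = 2):
            self.suspicion = defaultdict(int)
            self.threshold = threshold

        def report(self, digests_by_node: dict[str, str]) -> None:
            """Record per-node digests for one replicated cluster execution;
            nodes whose digest deviates from the majority become more suspicious."""
            digests = list(digests_by_node.values())
            majority = max(set(digests), key=digests.count)
            for node, d in digests_by_node.items():
                if d != majority:
                    self.suspicion[node] += 1

        def suspicious_nodes(self) -> list[str]:
            return [n for n, s in self.suspicion.items() if s >= self.threshold]

    tier = DeploymentTier(threshold=2)
    tier.report({"node-a": "d1", "node-b": "d1", "node-c": "d2"})  # node-c deviates
    tier.report({"node-a": "d3", "node-c": "d4", "node-d": "d3"})  # overlapping replica set
    print(tier.suspicious_nodes())  # ['node-c'] after deviating twice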
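Third sketch (phase-specific partially homomorphic encryption): unpadded textbook RSA is multiplicatively homomorphic, so a program phase that only multiplies values could in principle run on ciphertexts. The tiny parameters below are insecure and serve purely to illustrate the idea of matching an encryption scheme to the operation(s) of a phase; the proposal does not prescribe this particular scheme.

    # Textbook-RSA toy: multiplying ciphertexts multiplies the underlying plaintexts.
    p, q = 61, 53                        # toy primes; real keys are far larger
    n = p * q                            # 3233
    e = 17                               # public exponent
    d = pow(e, -1, (p - 1) * (q - 1))    # private exponent

    def enc(m: int) -> int:
        return pow(m, e, n)

    def dec(c: int) -> int:
        return pow(c, d, n)

    a, b = 7, 6
    # An untrusted worker multiplies ciphertexts without ever seeing a or b.
    c_product = (enc(a) * enc(b)) % n
    assert dec(c_product) == (a * b) % n
    print(dec(c_product))  # 42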

Personnel

Students: Julian Stephen, Chamikara Jayalath

Keywords: assurance, big data, cloud, distributed applications, secure group communication