Panel #2: Big Data Analytics (Panel Summary)
Tuesday April 3, 2012
- William S. Cleveland, Purdue University
- Marc Brooks, MITRE Corporation
- Jamie Van Randwyk, Sandia National Laboratories
- Alok R. Chaturvedi, Professor, Purdue University
Panel Summary by Nabeel Mohamed
The panel was moderated by Joel Rasmus, CERIAS, Purdue University.
A quick review on Big Data:
Big Data represents a new era in data analysis where the volume of the data to analyze is so big that it does not work with current traditional database technologies and algorithms. The size of the data set needs to be collected, stored, shared, analyzed and/or visualized continue to grow as the information has been produced at an unprecedented rate from ubiquitous mobile devices, RFID technologies, sensor networks, web logs, surveillance records, search queries, social networks and so on. Increasing volume of the data is only one challenge of big data, and there are other challenges. In fact, Gartner analyst, Doug Laney, defined big data challenges/ opportunities as 3V’s:
Volume - it refers to the increasing volume of data as mentioned above.
Velocity - it refers to the time constraints in collecting, processing and using the data. A traditional algorithm which can process a small set of data quickly may take days to process a large set and give the results. However, if there is a real-time need such as national security, surveillance, and health care, taking days is not good enough any more.
Variety - it refers to the increasing array of data types that need to be handled. It includes all kinds of structured and unstructured data including audio, video, image data, transaction logs, web logs, web pages, emails, text messages and so on.
First, each of the panelists gave their perspective and their experience with big data analytics.
William S. Cleveland, Shanti S. Gupta Professor of Statistics, Purdue University, mentioning the challenges and experience in handling large volume of data in their research group, described their divide and recombine (D&R) approach to parallelize the processing by dividing the data into small subsets and applying traditional numeric and visualization algorithms on such subsets. They exploit the parallelization exhibits by the data itself. Cleveland described their tool called RHIPE built based on this concept. It is available to the public at www.rhipe.org. RHIPE is a merger of R, a free statistical analysis software and Apache Hadoop, an open source MapReduce framework.
Marc Brooks, Lead Information Security Researcher, MITRE Corporation, mainly focused on anomaly detection in large data sets. He raised the question of how one can detect an anomaly without sufficient test data sets. Further, in his opinion, it is expensive to create such data sets. Brooks sees the trend of moving from supervised learning to unsupervised learning such as clustering due to the above reason. Most of the big data sources provide large amount of unstructured data. We know well to handle structured data as we already have a schema of it. He raised the question of what are the effective ways of handling unstructured data and thinks that there should be a fundamental change in the way we model such data. He also touched on the subject of what it takes to be a data scientist which is becoming an attractive career path these days. He thinks that the skill set is a mixture of software engineering, statistics and distributed systems.
Jamie Van Randwyk, Technical R&D Manager, Sandia National Laboratories, started off with the idea of relativity behind the term “big data”. In his opinion, for different organizations big data means different sizes and complexities. Specially the volume of the data which can be called as big data. Randwyk mentioned that while most commercial entities such as Amazon, Microsoft, Rackspace and so on, handle the big data needs of the industry, Sandia mainly focus on US government agencies. He raised the question that we use Hadoop and other technologies to perform analytics and visualizations on large volume of data, however, we still don’t know how to secure such data in these big data environments. Randwyk and his team deal mainly with cyber data which is mostly unstructured. He pointed out the challenge of analyzing large volumes of unstructured data due to the lack of schema.
Alok R. Chaturvedi, Professor, Krannert Graduate School of Management, Purdue University, started his perspective with the idea that one has to collect as much information possible from multiple sources and make actual information stand out. Chaturvedi briefly explained their big data analytics work involving real time monitoring of multiple markets and multiple assets. A challenge in doing so in the real world is that data is often inconsistent and fragmented. They build behavioral models based on the data feeds from sensors, news feeds, surveys, political, economical and social channels. Based on such models they perform macro market assessment by regions in order to spot opportunities to invest. Chaturvedi thinks that big data analytics is continue to going to play a key role in doing such analysis.
After the initial perspective short talks by each panelist, the floor was open to the questions from the audience.
Q: Is behavioral modelling effective? What are the challenges involved?
A: Panelists identified two ways in which the behavior would change: adversarial activities and perturbation of data or the business itself. It is important to understand these two aspects and build behavioral model accordingly. Also, if the behavioral model does not keep up with the changes, it is going to be less effective in identifying behaviors that one wants to look for. Some of the challenges involved are deciding what matrices to use, defining such matrices, understanding the context (data perturbation vs. malicious activities) and keeping updating the model. It is also important to put the correct causality to the event. For example, 9/11 is due to a security failure not anything else.
Q: Do you need to have some expertise in the field in order to better utilize big data technologies to identify anomalies?
A: Yes, big data analytics will point to some red flags, you need be knowledgeable in the subject matter in order to dig deep and get more information.
Q: Is it practical to do host based modeling using big data technologies?
A: Yes, you have to restrict your domain of monitoring. For example, it may not be practical to do host based monitoring for the whole Purdue network.
Q: How do you do packet level monitoring if the data is encrypted?
A: Cleveland is of the view that one cannot do effective packet level monitoring if the data is encrypted. In their work, they assume that the packets are transmitted in cleartext.
Q: To what extent intelligence response being worked out? Can you do it without the intervention of humans?
A: Even with big data analytics, there will be false positives. Therefore, we still need human in the loop in order to pinpoint the incident accurately. These people should have background in computer security, algorithms, analysis, etc.
A challenge in current big data technologies like Hadoop is that it is difficult to do near real time analysis yet.
Q: (panel to audience) What are your big data problems?
A: (An audience) Our problem is scalability. There is nothing off the shelf that we can buy to meet our need. We have to put a lot of effort to build these system by putting various component together. Instead of spending time on defending attacks, we have to spend a lot of time on operational tasks.
Q: Is it better to have a new framework for big data for scientific data?
A: It is not the science per se that you have to look at; you have to look at the complexity and size of the data in order to decide. From an operational perspective, a definition/framework may not be important, but from a marketing perspective, it may be important. For example, defining the size of the data set could be potentially useful.
Q: We want to manage EHRs (electronic health records) for 60m people. Can these people be re-identified using big data technologies?
A: Even EHR data confirming to safe harbor rules where 18-19 elements are not there may be re-identified. Safe harbor rules are not sufficient, neither they are necessary. They will protect most people, but not all. You can protect even without safe harbor. This is a very challenging problem and CERIAS has an ongoing research project.
Q: Have you seen adversaries intentionally trying to manipulate big data so that they go undetected? Specifically have you seen adversaries that damage the system slowly to stay below the threshold level of detection and that damage very fast to overwhelm the system?
A: We have seen that adversaries understand your protocols, whether your packets are encrypted or not, etc., so that they can behave like legitimate users. I have heard anecdotal stories of manipulating data in bank and other financial institutions, but can’t point to any specific incident.
Q: Often times, we have to reduce the scope, when many parameters are to be analyzed due to the sheer volume of data. How do you ensure that you still detect an anomaly (no false negatives)?
A: You have to analyze all the data otherwise it may result in false negatives.