28 February 2017

Big Data: the devil’s in the detail


Michael Chi

As the government’s review of the Australian Intelligence Community (AIC) picks up steam, one of the key challenges is to identify and resolve growing gaps in the AIC’s technological capabilities. One such capability is the collection and use of Big Data.

The generally accepted definition of big data casts it as a “problem”, because its extreme volume, velocity and variety make collection, management and analysis challenging. The problem stems from a ‘data deluge’ of social media posts, photos, videos, purchases, clicks and a burgeoning wave of sensor data from smarter, interconnected appliances and accessories, known as the ‘Internet of Things’. Those sources generated a staggering 4.4 trillion gigabytes of data in 2013, and that figure is forecast to reach 44 trillion gigabytes by 2020, threatening to overwhelm conventional methods for storing and analysing data.

In response to the “problem” of big data comes the “promise” of big data analytics. Analytics promises not only to manage the data deluge, but also to analyse it with algorithms that uncover hidden correlations, patterns and links of potential analytical value. Techniques to extract those insights fall under various names: ‘data mining’, ‘data analytics’, ‘data science’ and ‘machine learning’, among others. That work is expected to yield new insights into a range of puzzles, from tracking financial fraud to detecting cybersecurity incidents, through the power of parallel processing hardware, distributed software, new analytics tools and a talented workforce of multidisciplinary data scientists.
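To make that concrete, here’s a minimal sketch (in Python, using scikit-learn’s IsolationForest on invented transaction data, not any operational dataset) of the kind of pattern-finding those techniques involve: an unsupervised model sifts a large set of records and flags the handful that warrant an analyst’s attention.

```python
# Minimal sketch: unsupervised anomaly detection of the kind used to flag
# unusual transactions. Data and thresholds are illustrative only.
import numpy as np
from sklearn.ensemble import IsolationForest

rng = np.random.default_rng(0)

# Synthetic 'transactions': [amount, hour of day]. Most are routine...
normal = np.column_stack([rng.normal(60, 15, 1000), rng.normal(13, 3, 1000)])
# ...a few are not (large amounts at odd hours).
odd = np.array([[950.0, 3.0], [1200.0, 2.0], [880.0, 4.0]])
transactions = np.vstack([normal, odd])

# Fit an isolation forest and flag outliers (-1 = anomalous, 1 = normal).
model = IsolationForest(contamination=0.01, random_state=0)
labels = model.fit_predict(transactions)

flagged = transactions[labels == -1]
print(f"Flagged {len(flagged)} of {len(transactions)} transactions for review")
```

The particular algorithm matters less than the workflow: machines do the sifting at a scale no analyst could manage manually, and humans review what gets flagged.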

However, to keep the big data “promise”, the AIC review needs to address the following challenges:

There need to be mechanisms to ensure that data is manageable in terms of the definitional iron triangle of volume (size), velocity (of data flow) and variety (of data types and formats). Those are challenging enough in the big data context, but the analyst must also consider the veracity of the datasets they select: their representativeness, reliability and accuracy. The analyst must then consider how the data will generate insights of value to the end user.

The current framework for privacy protection, based on the idea of ‘notice and consent’, needs updating. It places the burden on individuals to make an informed decision about the risks of sharing their information. Big data complicates that decision, which is, in the words of former President Obama’s council of advisors, ‘defeated by exactly the positive benefits that big data enables: new, non-obvious, unexpectedly powerful uses of data’.

Data-based decisions need to be comprehensible and explicable, which means maintaining transparency and ‘interpretability’. That will become more challenging over time: explaining the processes and reasoning behind a machine learning algorithm isn’t a simple task, especially as algorithms become more complex, merge into composite ensembles, and rely on correlative and statistical reasoning.
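To illustrate the gap, here’s a rough sketch (scikit-learn on synthetic data, with invented feature names): a simple linear model’s reasoning can be read directly from its coefficients, whereas an ensemble of hundreds of trees offers no comparably crisp account of any individual decision.

```python
# Sketch of the interpretability gap: a linear model vs. a tree ensemble.
# Features and data are synthetic and purely illustrative.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(1)
features = ["amount", "hour", "prior_flags"]
X = rng.normal(size=(500, 3))
y = (X[:, 0] + 0.5 * X[:, 2] + rng.normal(scale=0.5, size=500) > 0).astype(int)

# A logistic regression can be 'explained' by reading off its coefficients.
linear = LogisticRegression().fit(X, y)
for name, coef in zip(features, linear.coef_[0]):
    print(f"{name}: weight {coef:+.2f}")

# A random forest offers only aggregate importances; the reasoning behind
# any single prediction is spread across hundreds of trees.
forest = RandomForestClassifier(n_estimators=300, random_state=0).fit(X, y)
for name, imp in zip(features, forest.feature_importances_):
    print(f"{name}: importance {imp:.2f}")
```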

Regardless of whether data analytics decisions are explicable, a fourth challenge is whether the public will accept algorithmic decisions. The ongoing debate about machine learning in autonomous vehicles demonstrates these concerns. As decisions of increasing importance are informed by algorithms, this challenge of ‘Algorithm Aversion’ will intensify. 

The security of big data and analytics needs to be ensured. That isn’t limited to the traditional ‘honeypot’ allure of large datasets; it extends to the distributed, parallel and diverse architecture of big data computing, and to the security of the analytics themselves, with research into ‘adversarial examples’ showing that small changes to inputs can produce serious misclassification errors in machine learning algorithms.
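As a toy illustration of that last point (an invented linear classifier, not any real system): nudging each input feature by a tiny, targeted amount is enough to flip the model’s decision.

```python
# Toy illustration of an adversarial perturbation against a linear classifier.
# Weights and inputs are invented; the point is that a small, targeted change
# to the input flips the classification.
import numpy as np

weights = np.array([0.4, -0.3, 0.2, 0.1])    # hypothetical trained weights
bias = -0.05

def classify(x):
    return "malicious" if x @ weights + bias > 0 else "benign"

x = np.array([0.2, 0.1, -0.1, 0.3])          # sits just above the boundary
print(classify(x))                            # -> "malicious"

# FGSM-style perturbation: shift each feature slightly against the weights.
epsilon = 0.06
x_adv = x - epsilon * np.sign(weights)
print(np.abs(x_adv - x).max())                # each feature moves by only 0.06
print(classify(x_adv))                        # -> "benign"
```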

Big data has both benefited and suffered from fever-pitch hype. According to Gartner’s Hype Cycle, mainstream attention to big data began in 2011, peaked in 2013, and was removed from the cycle in 2015 with the explanation that big data is now ‘normal’ and several of its aspects are ‘no longer exotic. They’re common’. While there’s a growing consensus that big data is the ‘new normal’, some of the early big data promises no longer hold in reality, and the distinction between hype and reality needs to be clear to policymakers before effective big data policy can be developed.

Over the coming months, ASPI, with support from CSC Australia, will undertake an analysis of Big Data in National Security to further explore the policy issues and challenges outlined in this piece, and to stimulate policy discussions around the issue.
