
Can American Intelligence Leverage the Data-Mining Revolution?


Will the United States be left behind?
May 21, 2014

On Friday, Josh Kerbel, the Chief Analytic Methodologist for the Defense Intelligence Agency, said in no uncertain terms that the U.S. intelligence community (IC) needs to change to keep pace with its changing environment. Its historical fascination with collecting secrets, he argued, is dangerously outmoded given the overwhelming availability of unguarded information on the internet. He is absolutely correct. That said, the abundance of unclassified, useful data hardly simplifies the task before the intelligence community: such data far exceeds the IC’s current analytic capacity. To address this problem, data mining has become a much-discussed keystone of the future intelligence system, and the current administration is certainly working to capitalize on cross-sector research in data mining.

Part of the rush of interest in data mining stems from changes in the state of the art that vastly increase the technology’s potential. In previous years, data-mining tools analyzed massive data sets looking for connections, especially connections that tie together social networks like terrorist organizations. This style of data analytics underpins widely used national-security data-management programs like Palantir and the Army’s Distributed Common Ground System, and it will no doubt remain a critical tool for national security.

Nonetheless, change is coming; government officials have clearly signaled a movement towards “non-predicated, or pattern-based, searches – using data to find patterns that reveal new insights.” While current tools can connect a user’s query with points of information pulled from otherwise uselessly large and complex data sets, future tools will be able to generate genuinely novel intelligence from patterns that computerized statistical modeling uncovers in massive data sets.

While this emerging style of data mining is still in its formative stages, it behooves technologists and policy planners alike to keep in mind that developing and deploying powerful data-mining technologies in national security raises a number of problems particular to that field:

1. The Black Box Problem

Intelligence analysts need to know and report precisely how they arrived at a conclusion; unfortunately, data-mining solutions often lend themselves to black-box-style calculation. That is to say, the user feeds data into a program, and the computer spits out a response. All the machine learning, pattern identification, statistical modeling, and extrapolation needed to generate a conclusion passes unobserved. This degree of opacity may be acceptable in sectors where conclusions are experimental, are not intended to be actionable, or are verified by consensus across many programs run in parallel, but opacity is tricky in intelligence. That is not to say that the intelligence community cannot run multiple competing programs or use data mining for nonactionable conclusions. But, by and large, explicitly knowing the source of and rationale for a conclusion is critical to the creation of an intelligence product.

There are both technical and systematic means of addressing this problem. Technologically, a data-mining tool could be designed to generate an audit trail as it executes. If the program can provide an accounting of its data, statistical model, and conclusions that is comprehensible to the analysts and subject-matter experts using it (as opposed to the programmers who created it), users gain a window into the black box that allows them to monitor its conclusions.
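As a minimal sketch of what such a self-auditing tool might look like, the Python snippet below wraps a simple classifier so that every conclusion is logged alongside the sequence of tests that produced it. The decision-tree model, the wrapper class, and the log format are all illustrative assumptions, not features of any actual IC system.

```python
# A hypothetical self-auditing classifier: every prediction is logged
# together with the decision rules that produced it, giving analysts
# a window into the "black box."
import json
import logging

from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

logging.basicConfig(level=logging.INFO, format="%(asctime)s %(message)s")
log = logging.getLogger("audit")

class AuditedClassifier:
    """Wraps a decision tree and records the rationale for each conclusion."""

    def __init__(self, feature_names):
        self.model = DecisionTreeClassifier(max_depth=3, random_state=0)
        self.feature_names = feature_names

    def fit(self, X, y):
        self.model.fit(X, y)
        return self

    def predict_with_audit(self, X):
        tree = self.model.tree_
        paths = self.model.decision_path(X)  # nodes visited per sample
        predictions = self.model.predict(X)
        for i, pred in enumerate(predictions):
            steps = []
            for node in paths.indices[paths.indptr[i]:paths.indptr[i + 1]]:
                if tree.children_left[node] == -1:  # skip leaf nodes
                    continue
                name = self.feature_names[tree.feature[node]]
                went_left = X[i, tree.feature[node]] <= tree.threshold[node]
                op = "<=" if went_left else ">"
                steps.append(f"{name} {op} {tree.threshold[node]:.2f}")
            # The audit trail: conclusion plus human-readable rationale.
            log.info(json.dumps({"sample": i, "conclusion": int(pred),
                                 "rationale": steps}))
        return predictions

data = load_iris()
clf = AuditedClassifier(data.feature_names).fit(data.data, data.target)
clf.predict_with_audit(data.data[:2])
```

A real system would face the harder problem of explaining far more opaque models, but the principle is the same: the rationale is captured at prediction time, in terms an analyst can read.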

Systematically, a program could be designed to keep the human in the loop, involving the intelligence analyst in the execution of the program itself. By remaining engaged with the tool (for example, by confirming the model or selecting the data it uses; see the sketch below), the user gains some insight into how the program functions, although such involvement greatly increases the odds of the user unconsciously biasing the outcome. The black box problem is not insurmountable, but the IC would do well to keep it at the forefront as the technology develops.
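As a toy illustration of such a human-in-the-loop gate, the sketch below keeps any machine-found pattern out of the pipeline until an analyst explicitly confirms it; the data structures and console prompt are stand-ins for whatever review interface a real system would use.

```python
# A toy human-in-the-loop gate: no machine-generated pattern advances
# until an analyst explicitly confirms it. All names are illustrative.
from dataclasses import dataclass

@dataclass
class CandidatePattern:
    description: str  # plain-language summary shown to the analyst
    support: float    # fraction of records matching the pattern

def analyst_confirms(pattern: CandidatePattern) -> bool:
    """Stand-in for a review UI; here, a simple console prompt."""
    answer = input(f"Confirm '{pattern.description}' "
                   f"(support={pattern.support:.1%})? [y/n] ")
    return answer.strip().lower() == "y"

def review_queue(candidates):
    # Only analyst-approved patterns move forward for further analysis.
    return [p for p in candidates if analyst_confirms(p)]
```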

2. Anonymizing and Encrypting Data

In order to build a strong statistical model, a data-mining program obviously needs a large dataset. These datasets often describe individuals and their personal information, and that information must be safeguarded. Anonymizing and encrypting the data are two favored methods of doing so. For example, if a healthcare professional were doing research on a dataset listing tens of thousands of patients along with information on their health and lifestyle, the researcher might sanitize the dataset by removing identifying information like social security numbers, names, and zip codes. They might then encrypt the database and calculations to prevent outsiders from gaining access to the data.
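A minimal sketch of that first sanitization step, assuming an invented patient table (the column names here are made up for the example):

```python
# Minimal sanitization sketch: strip direct identifiers from a dataset
# before handing it to a data-mining pipeline. Column names are invented.
import pandas as pd

patients = pd.DataFrame({
    "name": ["A. Smith", "B. Jones"],
    "ssn": ["123-45-6789", "987-65-4321"],
    "zip_code": ["20500", "10001"],
    "age": [54, 37],
    "smoker": [True, False],
    "blood_pressure": [142, 118],
})

IDENTIFIERS = ["name", "ssn", "zip_code"]  # direct identifiers to drop

anonymized = patients.drop(columns=IDENTIFIERS)
print(anonymized)  # only health and lifestyle fields remain for modeling
```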

Unfortunately, these methods are imperfect in the national-security world. For example, if a law-enforcement program identifies individuals who are potentially dangerous outliers to a pattern, anonymous data points are not useful: investigators really do need to know which individuals the data points represent so that they can act on their conclusions, insofar as the law allows. There are certainly conceivable cases where acting on such information would be illegal, and in those cases anonymized data may be an excellent option.

Encrypting datasets to keep subjects’ information safe remains useful in an intelligence context, but less so than elsewhere. Among air-gapped, highly protected computer systems, the outsider threat exists but is relatively contained. Encryption, meanwhile, does not protect against the insider threat: if the Snowdens of the world have access to the encryption keys, its utility is limited.
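To make the key-access point concrete, the sketch below uses off-the-shelf symmetric encryption (the Python `cryptography` library’s Fernet primitive, chosen here purely for illustration): the ciphertext is opaque to outsiders, but anyone holding the key, insider or not, recovers the plaintext in a single call.

```python
# Encryption protects data at rest from outsiders, but anyone who holds
# the key -- including a malicious insider -- can decrypt it trivially.
from cryptography.fernet import Fernet

key = Fernet.generate_key()          # in practice, held in a key store
cipher = Fernet(key)

record = b"subject-id: 4471; pattern-match score: 0.93"
ciphertext = cipher.encrypt(record)  # useless to anyone without the key

# An insider with access to the key needs only one call.
print(Fernet(key).decrypt(ciphertext))
```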

The intelligence community has peculiarities that could make these problems somewhat more manageable from a technological standpoint. Access to its datasets is, at least in theory, easier to monitor because users can be held accountable: analysts presumably have unique log-ins they must use to access the data, so access to personally identifiable information can be limited to a small, highly regulated group. Naturally, the success of this system is contingent on the effectiveness of the regulators. Ultimately, protecting privacy is much more a problem of oversight than of technology. But technology designed with accountability and oversight in mind from the outset has a much greater chance of protecting privacy over the long term.
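A toy sketch of what that accountability layer might look like, with invented user names and a deliberately simple policy: queries run under an analyst’s unique login, personally identifiable fields are gated to a small authorized group, and every access attempt is written to an audit log for the regulators.

```python
# Toy accountability layer: PII fields are limited to a small, regulated
# group, and every access attempt is logged under the analyst's unique
# login. User names, fields, and policy are illustrative assumptions.
import logging
from datetime import datetime, timezone

logging.basicConfig(filename="access_audit.log", level=logging.INFO)

PII_AUTHORIZED = {"analyst_chen", "analyst_okafor"}  # small, regulated group
PII_FIELDS = {"name", "ssn", "address"}

def fetch_fields(login: str, record: dict, fields: set) -> dict:
    wants_pii = bool(fields & PII_FIELDS)
    allowed = (not wants_pii) or (login in PII_AUTHORIZED)
    # Every attempt -- allowed or denied -- leaves a trail for oversight.
    logging.info("%s user=%s fields=%s allowed=%s",
                 datetime.now(timezone.utc).isoformat(),
                 login, sorted(fields), allowed)
    if not allowed:
        raise PermissionError(f"{login} is not cleared for PII fields")
    return {key: record[key] for key in fields}
```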

3. Transparency and Public Trust

At a recent conference on big data and privacy, healthcare researcher Professor John Guttag stated boldly that “medical data is special . . . because progress in healthcare is too important and too urgent to wait for privacy [problems] to be solved. I’m in favor of privacy, but not at the cost of avoidable pain and suffering and death.” While arguably a sensible perspective in medical research, such a statement would never be acceptable in national security. The immediacy of the threat of pain, suffering, and death in national security is real, but the public fear of abuse of power is too great to allow such an outlook. In other words, the national security industry and the IC in particular are held to a different standard. Regardless of whether or not this is a justifiable perspective, it is a popular one that the IC must accept.

The easy solution to suggest is transparency, and it certainly has been discussed frequently. But for all that it is easy to suggest, it is hard to implement. The American intelligence community has to protect its sources and methods; if our global competitors know the exact capabilities of our collection and analytical systems, we lose our competitive edge and our intelligence system is vulnerable to exploitation.

Nonetheless, without transparency and public trust, the intelligence community quickly loses the resources and support it needs to operate. As data-mining tools are developed and deployed, the IC must think very critically about which aspects of its programs truly need to be classified, and which can be discussed with outside researchers, academics, and the public.

Through openness and candor whenever possible, the IC can not only create an opportunity for public trust, but can advance the state of the art by joining the wider scientific discussion on data mining. For all the problems unique to data mining in national security, there are many problems common to the wider scientific community. Policy makers and technologists would do well to keep an eye out for both if they are to take full advantage of this emerging opportunity.

Laura K. Bate is a program associate at The Center for the National Interest.

Image: Wikimedia Commons
