Loren Thompson
Earlier this year, federal contractor Leidos and machine-learning company Domino Data Lab conducted a digital wargame that simulated a maritime combat scenario. The scenario envisioned attacks on a naval fleet that needed to be defeated without harming nearby commercial vessels.
Over 60 artificial intelligence researchers from Leidos participated. The winning team achieved a perfect score, using a novel approach to machine learning that mathematically mimics the way human beings learn.
The approach is called Reinforcement Learning, or RL, and the basic idea is to generate software that can enable a machine to learn from its environment through trial and error—in much the same way that young children do.
The concept is operationalized by creating algorithms that guide a machine in achieving desired outcomes, learning over time from its environment so that the machine’s decisions become increasingly subtle and precise.
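To make that idea concrete, here is a minimal sketch of trial-and-error learning in Python. It is my own illustration, not code from the Leidos exercise: an agent repeatedly chooses among a few actions with hidden payoffs, observes a reward, and refines its estimate of which action works best.

```python
import random

# Illustrative only: three possible actions with hidden payoff probabilities the
# agent does not know. It learns which one is best purely by trial and error.
true_payoffs = [0.2, 0.5, 0.8]
estimates = [0.0, 0.0, 0.0]    # the agent's learned value for each action
counts = [0, 0, 0]
epsilon = 0.1                  # fraction of the time the agent explores at random

for step in range(10_000):
    if random.random() < epsilon:
        action = random.randrange(3)                  # explore a random action
    else:
        action = estimates.index(max(estimates))      # exploit what it has learned
    reward = 1.0 if random.random() < true_payoffs[action] else 0.0
    counts[action] += 1
    # nudge the running estimate toward the observed reward (incremental average)
    estimates[action] += (reward - estimates[action]) / counts[action]

print(estimates)   # the estimates converge toward the hidden payoffs
```

Even this simple loop captures the essentials: exploration, feedback from the environment, and estimates that grow more subtle and precise with experience.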
Leidos, a contributor to my think tank, invited me to a virtual roundtable with two experts who participated in the exercise—Kevin Albarado, an AI and autonomy branch chief at Leidos subsidiary Dynetics, and Thomas Robinson, chief operating officer of Domino Data Lab.
They laid out an intriguing story of how, using a computational approach similar to human learning, it is possible to achieve desired outcomes in unstructured situations more efficiently than with other approaches to machine learning.
Much of machine learning, whether supervised or unsupervised, involves teaching an agent how to respond in specific circumstances.
However, as Richard Sutton and Andrew Barto point out in Reinforcement Learning: An Introduction (MIT Press, 2018), “In interactive problems it is often impractical to obtain examples of desired behavior that are both correct and representative of all the situations in which the agent has to act.” In such circumstances, RL enables the machine to learn from interacting with its environment.
Albarado and Robinson favor the widely used analogy of rewards and punishments to describe how software can be created that enables the machine to maximize favorable outcomes.
But the software doesn’t just maximize the likelihood that the next decision the machine makes will be the desired one; it structures the learning process so that the machine seeks to maximize rewards over the long term. The machine accumulates knowledge from experience that, over time, sharpens its performance.
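In standard RL terms, that long-term focus shows up in the update rule itself: the learning target combines the immediate reward with a discounted estimate of future value, so a decision is judged by where it leads, not just by what it earns right away. The sketch below shows a textbook one-step Q-learning update; the action names are hypothetical, and nothing here is claimed to be what the Leidos team actually used.

```python
ACTIONS = ["hold", "turn", "engage"]   # hypothetical action labels, purely for illustration

def q_update(Q, state, action, reward, next_state, alpha=0.1, gamma=0.95):
    """One-step Q-learning update; Q maps (state, action) pairs to estimated long-term value."""
    # The target is not just the immediate reward: it adds gamma times the value of the
    # best action available afterward, which is how long-term payoff enters the learning.
    best_next = max(Q.get((next_state, a), 0.0) for a in ACTIONS)
    target = reward + gamma * best_next
    old = Q.get((state, action), 0.0)
    Q[(state, action)] = old + alpha * (target - old)
```

The discount factor gamma is the knob that trades immediate rewards against future ones; set it near zero and the machine becomes short-sighted, set it near one and it plans for the long haul.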
If this all sounds eerily familiar, it is because RL derives much of its insight into how to teach machines from human psychology, neuroscience, economics and other fields that provide models of human behavior.
Conceptually, the process resembles the stimulus-response mechanism that was at the center of psychologist B.F. Skinner’s theory of operant conditioning. Skinner’s theory was controversial in the last century because some felt it produced too mechanistic a view of human behavior, but it is a good match for how machines can be taught to “think” in novel situations.
Wartime provides many such situations, because it is inherently dynamic. It is probably not possible to pre-program a computer so it has ready responses to the full range of wartime contingencies. A machine that learns from interacting with its environment won’t always deliver the optimum response, but over time it may become faster and more efficient than any human would be.
That’s a good thing, because warfare among advanced military powers is accelerating to a point where continuous human intervention in tactical decisions might easily lead to defeat. Some facets of military operations will have to be automated in order to provide a reasonable chance of victory.
For example, managing large swarms of drones, as depicted in the movie Ender’s Game, will eventually be quite beyond the mental capacity of human beings within the tight timelines required for combat success.
More prosaically, every advanced combat system consists of myriad subsystems that can incorporate RL to enhance their support of the overall system. Nikolay Manchev offers the example of a smart, self-learning thermostat to illustrate that RL can be used in many ways.
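A toy version of that thermostat idea might look like the following sketch, where the details (the comfort target, the two actions, the reward) are my own assumptions rather than Manchev’s: the state is the room temperature, the agent can heat or idle, and the reward penalizes distance from the target, so it gradually learns when heating pays off over the long run.

```python
import random

TARGET = 21                      # assumed comfort temperature, in degrees Celsius
ACTIONS = ["heat", "idle"]
Q = {}                           # (temperature, action) -> learned long-term value

temp = 15
for step in range(50_000):
    if random.random() < 0.1:
        action = random.choice(ACTIONS)                              # occasional exploration
    else:
        action = max(ACTIONS, key=lambda a: Q.get((temp, a), 0.0))   # exploit learned values
    next_temp = min(temp + 1, 30) if action == "heat" else max(temp - 1, 10)
    reward = -abs(next_temp - TARGET)        # the closer to the comfort target, the better
    best_next = max(Q.get((next_temp, a), 0.0) for a in ACTIONS)
    old = Q.get((temp, action), 0.0)
    Q[(temp, action)] = old + 0.1 * (reward + 0.95 * best_next - old)
    temp = next_temp
```

The same pattern scales up: swap the thermostat’s temperature for a subsystem’s sensor readings and its two actions for whatever that subsystem controls, and the learning loop is unchanged.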
And its appearance in combat may be only two or three years away. For that reason, Albarado and Robinson are attuned to the issue of governance when deploying self-learning machines on the battlefield. Once it becomes clear that RL-driven systems are faster and more efficient than mechanisms driven by people, somebody will have to decide when relying on them is not the best tactical solution.
It's a concern that already has the attention of ethicists, and Reinforcement Learning adds a new dimension to the challenge. But RL’s time has arrived, and the potential benefits of applying it to every facet of commerce and culture are too great to ignore.
So an approach to human learning that made B.F. Skinner one of the most influential psychologists of the 20th century is now poised to have a renewed impact in the present century, thanks to the way the digital revolution enables machines to learn like people.