26 April 2025

The Highs and Lows of LLMs in Strategy Games

Baptiste Alloui-Cros

In my last piece, I introduced the idea that different Large Language Models may exhibit distinct, coherent, and complex ‘playing styles’ that remain consistent across iterations of a given game. Granted, this applied to a very simple game (the Prisoner’s Dilemma) with a very limited game space. Still, it raises questions about what the latest generation of LLMs can and cannot do in strategy games. As mentioned last time, benchmarking LLMs by pitting them against each other in various games has become somewhat trendy lately, providing nuances that traditional benchmarks often fail to capture. Consequently, virtual arenas for games such as Connect 4, Codenames, chess, and even Street Fighter quickly emerged.

Diplomacy, in particular, has been the subject of several tests, from my own work to Sam Paech’s EQBench (Emotional Intelligence Benchmarks for LLMs) and SPINBench, developed by a talented team of researchers from Princeton and the University of Texas. The reason Diplomacy is such an interesting theatre in which to test AI (so much so, in fact, that two pieces on this Substack have already been devoted to it) is that success at it demands a perfect blend of strategic reasoning, negotiation skills, and tactical prowess.