14 July 2024

Why the Data Ocean Is Being Sectioned Off

Gary McGraw, Dan Geer, Harold Figueroa

Welcome to the era of data feudalism. Large language model (LLM) foundation models require huge oceans of data for training: the more data trained upon, the better the result. But while the massive data collections began as a straightforward harvesting of public observables, those collections are now being sectioned off. To describe this situation, consider a land analogy: The first settlers coming into what was a common wilderness are stringing that wilderness with barbed wire. If and when enormous parts of the observable internet (say, Google search data, Twitter/X postings, or GitHub code piles) are cordoned off, it is not clear what hegemony will accrue to those first movers; they are little different from squatters trusting their “open and notorious occupation” will lead to adverse possession. Meanwhile, originators of large data sets (for example, the New York Times) have come to realize that their data are valuable in a new way and are demanding compensation even after those data have become part of somebody else’s LLM foundation model. Who gets to control access to the internet’s publicly reachable data pool, and why? Lock-in for early LLM foundation model movers is a very real risk.

Below, we define and discuss data feudalism, providing context by determining where the data needed to create the latest generation of machine learning (ML) models come from, how much we need, who owns them, and who should own them. We describe the data ocean and its constituent parts. We discuss recursive pollution. We wonder if less can be more.

First, some definitions.

Machine learning: “We” (meaning computer scientists and practitioners) have been building computer programs for a long time, and we’re pretty good at it. When we know HOW to describe something programmatically, we write a program to do that. Machine learning is what you end up doing when you don’t know HOW to do something in clear enough terms to write a program to do it. After all, if we knew how to solve a certain problem, we would just write a program to do so!
