16 May 2023

The open-source AI boom is built on Big Tech’s handouts. How long will it last?

Will Douglas Heaven

Last week a leaked memo reported to have been written by Luke Sernau, a senior engineer at Google, said out loud what many in Silicon Valley must have been whispering for weeks: an open-source free-for-all is threatening Big Tech’s grip on AI.

New open-source large language models—alternatives to Google’s Bard or OpenAI’s ChatGPT that researchers and app developers can study, build on, and modify—are dropping like candy from a piñata. These are smaller, cheaper versions of the best-in-class AI models created by the big firms that (almost) match them in performance—and they’re shared for free.

Companies like Google—which revealed at its annual product showcase this week that it is throwing generative AI at everything it has, from Gmail to Photos to Maps—were too busy looking over their shoulders to see the real competition coming, writes Sernau: “While we’ve been squabbling, a third faction has been quietly eating our lunch.”

In many ways, that’s a good thing. Greater access to these models has helped drive innovation—it can also help catch their flaws. AI won't thrive if just a few mega-rich companies get to gatekeep this technology or decide how it is used.

But this open-source boom is precarious. Most open-source releases still stand on the shoulders of giant models put out by big firms with deep pockets. If OpenAI and Meta decide they’re closing up shop, a boomtown could become a backwater.

For example, many of these models are built on top of LLaMA, an open-source large language model released by Meta AI. Others use a massive public data set called the Pile, which was put together by the open-source nonprofit EleutherAI. But EleutherAI exists only because OpenAI’s openness meant that a bunch of coders were able to reverse-engineer how GPT-3 was made, and then create their own in their free time.

“Meta AI has done a really great job training and releasing models to the research community,” says Stella Biderman, who divides her time between EleutherAI, where she is executive director and head of research, and the consulting firm Booz Allen Hamilton. Sernau, too, highlights Meta AI’s crucial role in his Google memo. (Google confirmed to MIT Technology Review that the memo was written by one of its employees but notes that it is not an official strategy document.)

All that could change. OpenAI is already reversing its previous open policy because of competition fears. And Meta may start wanting to curb the risk that upstarts will do unpleasant things with its open-source code. “I honestly feel it’s the right thing to do right now,” says Joelle Pineau, Meta AI’s managing director, of opening the code to outsiders. “Is this the same strategy that we’ll adopt for the next five years? I don’t know, because AI is moving so quickly.”

If the trend toward closing down access continues, then not only will the open-source crowd be cut adrift—but the next generation of AI breakthroughs will be entirely back in the hands of the biggest, richest AI labs in the world.

The future of how AI is made and used is at a crossroads.

Open-source bonanza

Open-source software has been around for decades. It’s what the internet runs on. But the cost of building powerful models meant that open-source AI didn’t take off until a year or so ago. It has fast become a bonanza.

Just look at the last few weeks. On April 25, Hugging Face, a startup that champions free and open access to AI, unveiled the first open-source alternative to ChatGPT, the viral chatbot released by OpenAI in November.

Hugging Face’s chatbot, HuggingChat, is built on top of an open-source large language model fine-tuned for conversation, called Open Assistant, that was trained with the help of around 13,000 volunteers and released a month ago. But Open Assistant itself is built on Meta’s LLaMA.

And then there’s StableLM, an open-source large language model released on April 19 by Stability AI, the company behind the hit text-to-image model Stable Diffusion. A week later, on April 28, Stability AI released StableVicuna, a version of StableLM that, like Open Assistant or HuggingChat, is optimized for conversation. (Think of StableLM as Stability’s answer to GPT-4 and StableVicuna its answer to ChatGPT.)

These new open-source models join a string of others released in the last few months, including Alpaca (from a team at Stanford University), Dolly (from the software firm Databricks), and Cerebras-GPT (from AI firm Cerebras). Most of these models are built on LLaMA or on data sets and models from EleutherAI; Cerebras-GPT follows a template set by DeepMind. You can bet more will come.

For some, open-source is a matter of principle. “This is a global community effort to bring the power of conversational AI to everyone … to get it out of the hands of a few big corporations,” says AI researcher and YouTuber Yannic Kilcher in a video introducing Open Assistant.

“We will never give up the fight for open source AI,” tweeted Julien Chaumond, cofounder of Hugging Face, last month.

For others, it is a matter of profit. Stability AI hopes to repeat the same trick with chatbots that it pulled with images: fuel and then benefit from a burst of innovation among developers that use its products. The company plans to take the best of that innovation and roll it back into custom-built products for a wide range of clients. “We stoke the innovation, and then we pick and choose,” says Emad Mostaque, CEO of Stability AI. “It’s the best business model in the world.”

Either way, the bumper crop of free and open large language models puts this technology into the hands of millions of people around the world, inspiring many to create new tools and explore how they work. “There’s a lot more access to this technology than there really ever has been before,” says Biderman.

“The incredible number of ways people have been using this technology is frankly mind-blowing,” says Amir Ghavi, a lawyer at the firm Fried Frank who represents a number of generative AI companies, including Stability AI. “I think that's a testament to human creativity, which is the whole point of open-source.”

Melting GPUs

But training large language models from scratch—rather than building on or modifying them—is hard. “It’s still beyond the reach of the vast majority of people,” says Mostaque. “We melted a bunch of GPUs building StableLM.”

Stability AI’s first release, the text-to-image model Stable Diffusion, worked as well as—if not better than—closed equivalents such as Google’s Imagen and OpenAI’s DALL-E. Not only was it free to use, but it also ran on a good home computer. Stable Diffusion did more than any other model to spark the explosion of open-source development around image-making AI last year.

This time, though, Mostaque wants to manage expectations: StableLM does not come close to matching GPT-4. “There’s still a lot of work that needs to be done,” he says. “It’s not like Stable Diffusion, where immediately you have something that’s super usable. Language models are harder to train.”

Another issue is that models are harder to train the bigger they get. That’s not just down to the cost of computing power. The training process breaks down more often with bigger models and needs to be restarted, making those models even more expensive to build.

In practice there is an upper limit to the number of parameters that most groups can afford to train, says Biderman. This is because large models must be trained across multiple different GPUs, and wiring all that hardware together is complicated. “Successfully training models at that scale is a very new field of high-performance computing research,” she says.

The exact number changes as the tech advances, but right now Biderman puts that ceiling roughly in the range of 6 to 10 billion parameters. (In comparison, GPT-3 has 175 billion parameters; LLaMA has 65 billion.) It’s not an exact correlation, but in general, larger models tend to perform much better.

Biderman expects the flurry of activity around open-source large language models to continue. But it will be centered on extending or adapting a few existing pretrained models rather than pushing the fundamental technology forward. “There’s only a handful of organizations that have pretrained these models, and I anticipate it staying that way for the near future,” she says.

That’s why many open-source models are built on top of LLaMA, which was trained from scratch by Meta AI, or releases from EleutherAI, a nonprofit that is unique in its contribution to open-source technology. Biderman says she knows of only one other group like it—and that’s in China.

EleutherAI got its start thanks to OpenAI. Rewind to 2020 and the San Francisco–based firm had just put out a hot new model. “GPT-3 was a big change for a lot of people in how they thought about large-scale AI,” says Biderman. “It’s often credited as an intellectual paradigm shift in terms of what people expect of these models.”

Excited by the potential of this new technology, Biderman and a handful of other researchers wanted to play with the model to get a better understanding of how it worked. They decided to replicate it.

OpenAI had not released GPT-3, but it did share enough information about how it was built for Biderman and her colleagues to figure it out. Nobody outside of OpenAI had ever trained a model like it before, but it was the middle of the pandemic, and the team had little else to do. “I was doing my job and playing board games with my wife when I got involved,” says Biderman. “So it was relatively easy to dedicate 10 or 20 hours a week to it.”

Their first step was to put together a massive new data set, containing billions of passages of text, to rival what OpenAI had used to train GPT-3. EleutherAI called its data set the Pile and released it for free at the end of 2020.

EleutherAI then used this data set to train its first open-source model. The largest model EleutherAI trained took three and a half months and was sponsored by a cloud computing company. “If we’d paid for it out of pocket, it would have cost us about $400,000,” she says. “That’s a lot to ask for a university research group.”

Helping hand

Because of these costs, it's far easier to build on top of existing models. Meta AI’s LLaMA has fast become the go-to starting point for many new open-source projects. Meta AI has leaned into open-source development since it was set up by Yann LeCun a decade ago. That mindset is part of the culture, says Pineau: “It’s very much a free-market, ‘move fast, build things’ kind of approach.”

Pineau is clear on the benefits. “It really diversifies the number of people who can contribute to developing the technology,” she says. “That means that not just researchers or entrepreneurs but civil governments and so on can have visibility into these models.”

Like the wider open-source community, Pineau and her colleagues believe that transparency should be the norm. “One thing I push my researchers to do is start a project thinking that you want to open-source,” she says. “Because when you do that, it sets a much higher bar in terms of what data you use and how you build the model.”

But there are serious risks, too. Large language models spew misinformation, prejudice, and hate speech. They can be used to mass-produce propaganda or power malware factories. “You have to make a trade-off between transparency and safety,” says Pineau.

For Meta AI, that trade-off might mean some models do not get released at all. For example, if Pineau’s team has trained a model on Facebook user data, then it will stay in house, because the risk of private information leaking out is too great. Otherwise, the team might release the model with a click-through license that specifies it must be used only for research purposes.

This is the approach it took for LLaMA. But within days of its release, someone posted the full model and instructions for running it on the internet forum 4chan. “I still think it was the right trade-off for this particular model,” says Pineau. “But I’m disappointed that people will do this, because it makes it harder to do these releases.”

“We’ve always had strong support from company leadership all the way to Mark [Zuckerberg] for this approach, but it doesn’t come easily,” she says.

The stakes for Meta AI are high. “The potential liability of doing something crazy is a lot lower when you’re a very small startup than when you’re a very large company,” she says. “Right now we release these models to thousands of individuals, but if it becomes more problematic or we feel the safety risks are greater, we’ll close down the circle and we’ll release only to known academic partners who have very strong credentials—under confidentiality agreements or NDAs that prevent them from building anything with the model, even for research purposes.”

If that happens, then many darlings of the open-source ecosystem could find that their license to build on whatever Meta AI puts out next has been revoked. Without LLaMA, open-source models such as Alpaca, Open Assistant, or HuggingChat would not be nearly as good. And the next generation of open-source innovators won’t get the leg up the current batch have had.

In the balance

Others are weighing up the risks and rewards of this open-source free-for-all as well.

Around the same time that Meta AI released LLaMA, Hugging Face rolled out a gating mechanism so that people must request access—and be approved—before downloading many of the models on the company’s platform. The idea is to restrict access to people who have a legitimate reason—as determined by Hugging Face—to get their hands on the model.

“I’m not an open-source evangelist,” says Margaret Mitchell, chief ethics scientist at Hugging Face. “I do see reasons why being closed makes a lot of sense.”

Mitchell points to nonconsensual pornography as one example of the downside to making powerful models widely accessible. It’s one of the main uses of image-making AI, she says.

Mitchell, who previously worked at Google and cofounded its Ethical AI team, understands the tensions at play. She favors what she calls “responsible democratization”—an approach similar to Meta AI’s, where models are released in a controlled way according to their potential risk of causing harm or being misused. “I really appreciate open-source ideals, but I think it’s useful to have in place some sort of mechanisms for accountability,” she says.

OpenAI is also shutting off the spigot. Last month when it announced GPT-4, the company’s new version of the large language model that powers ChatGPT, there was a striking sentence in the technical report: “Given both the competitive landscape and the safety implications of large-scale models like GPT-4, this report contains no further details about the architecture (including model size), hardware, training compute, dataset construction, training method, or similar.”

These new restrictions are partly driven by the fact that OpenAI is now a profit-driven company competing with the likes of Google. But they also reflect a change of heart. Cofounder and chief scientist Ilya Sutskever has said in an interview with The Verge that his company’s openness in the past was a mistake.

OpenAI has definitely shifted strategies when it comes to what is and isn’t safe to make public, says Sandhini Agarwal, a policy researcher at OpenAI: “Previously, if something was open-source maybe a small group of tinkerers might care. Now, the whole environment has changed. Open-source can really accelerate development and lead to a race to the bottom.”

But it wasn’t always like this. If OpenAI had felt this way three years ago when it published details about GPT-3, there would be no EleutherAI.

Today, EleutherAI plays a pivotal role in the open-source ecosystem. It has since built several large language models, and the Pile has been used to train numerous open-source projects, including Stability AI’s StableLM (Mostaque is on EleutherAI’s board).

None of this would have been possible if OpenAI had shared less information. Like Meta AI, EleutherAI enables a great deal of open-source innovation.

But with GPT-4 (and 5 and 6) locked down, the open-source crowd could be left to tinker in the wake of a few large companies again. They might produce wild new versions, and maybe even threaten some of Google’s products. But they will be stuck with last-generation models. The real progress, the next leaps forward, will happen behind closed doors.

Does this matter? How you think about big tech firms shutting down access, and the effect that will have on open source, depends a lot on what you think about how AI should be made and who should make it.

“AI is likely to be a driver of how society organizes itself in the coming decades,” says Ghavi. “I think having a broader system of checks and transparency is better than concentrating power in the hands of a few.”

Biderman agrees: “I definitely don’t think that there is some kind of moral necessity that everyone do open-source,” she says. “But at the end of the day, it's pretty important to have people developing and doing research on this technology who are not financially invested in its commercial success.”

OpenAI, on the other hand, claims it is just playing it safe. “It’s not that we think transparency is not good,” says Dave Willner, head of OpenAI’s trust and safety teams. “It’s more that we’re trying to figure out how to reconcile transparency with safety. And as these technologies get more powerful, there is some amount of tension between those things in practice.”

“A lot of norms and thinking in AI have been formed by academic research communities, which value collaboration and transparency so that people can build on each other’s work,” says Willner. “Maybe that needs to change a little bit as this technology develops.”
