Alex Engler
Open-source software quietly affects nearly every issue in AI policy, yet it is largely absent from AI policy discussions. Policymakers need to consider OSS’s role in AI more actively.
Open-source software (OSS), software that is free to access, use, and change without restrictions, plays a central role in the development and use of artificial intelligence (AI). Across open-source programming languages such as Python, R, C++, Java, Scala, JavaScript, Julia, and others, there are thousands of implementations of machine learning algorithms. OSS frameworks for machine learning, including tidymodels in R and scikit-learn in Python, have helped consolidate many diverse algorithms into a consistent machine learning process and enabled far easier use for the everyday data scientist. There are also OSS tools specific to the especially important subfield of deep learning, which is dominated by Google’s TensorFlow and Facebook’s PyTorch. Manipulation and analysis of big data (data sets too large for a single computer) were also revolutionized by OSS, first by the Hadoop ecosystem and later by projects like Spark. These are not simply some of the AI tools; they are the best AI tools. While proprietary data analysis software may sometimes enable machine learning without the need to write code, it does not enable analytics that are as well developed as those in modern OSS.
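To make the idea of a "consistent machine learning process" concrete, here is a minimal sketch using only the Python standard library. The two toy classifiers and the `accuracy` helper are hypothetical illustrations, not code from any framework. They only mirror the fit/predict convention that scikit-learn popularized, under which very different algorithms can be swapped into the same workflow.

```python
from collections import Counter
from statistics import mean


class MajorityClassifier:
    """Toy model: always predicts the most common training label."""

    def fit(self, X, y):
        self.label_ = Counter(y).most_common(1)[0][0]
        return self

    def predict(self, X):
        return [self.label_ for _ in X]


class NearestMeanClassifier:
    """Toy model: predicts the label whose training mean is closest (1-D inputs)."""

    def fit(self, X, y):
        self.means_ = {
            label: mean([x for x, t in zip(X, y) if t == label])
            for label in set(y)
        }
        return self

    def predict(self, X):
        return [min(self.means_, key=lambda c: abs(x - self.means_[c])) for x in X]


def accuracy(model, X_train, y_train, X_test, y_test):
    """Works for any object that follows the fit/predict convention."""
    predictions = model.fit(X_train, y_train).predict(X_test)
    correct = sum(p == t for p, t in zip(predictions, y_test))
    return correct / len(y_test)


# Made-up data for illustration only.
X_train, y_train = [1.0, 1.2, 0.9, 3.1, 2.9], ["a", "a", "a", "b", "b"]
X_test, y_test = [1.1, 3.0], ["a", "b"]

for model in (MajorityClassifier(), NearestMeanClassifier()):
    print(type(model).__name__, accuracy(model, X_train, y_train, X_test, y_test))
```

Because the evaluation code never needs to know which algorithm it is handed, a data scientist can try a new method by changing one line, which is a large part of why such frameworks lower the barrier to use.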
That the most advanced tools for machine learning are largely free and publicly available matters for policymakers, and thus the OSS world deserves more attention. The United States government has gotten better at supporting OSS broadly, notably through the Federal Source Code Policy, which encourages agencies to release more of the code they write and procure. Yet the relationship between OSS and AI policy is less often acknowledged. Trump administration documents on AI regulation and using AI at federal agencies mention OSS only in passing. The Obama administration’s AI strategy notes the important role of OSS in AI innovation, but does not mention its relevance to other issues. A new European Parliament report states that European OSS policies lack “a clear link to the AI policies and strategies… for most countries.” In fact, the recently proposed European AI regulation does not address the role of OSS at all.
Generally speaking, analyses and international comparisons of AI capacity often include talent, funding, data, semiconductors, and compute access, but often lack a discussion of the role of OSS. This is an unfortunate oversight since OSS quietly affects nearly every issue in AI policy. AI tools built in OSS enable the faster adoption of AI in science and industry, while also speeding proliferation of ethical AI practices. At the same time, OSS is playing a complex role in markets—powering innovation in many areas, while also further empowering Google and Facebook and challenging the traditional role of standards bodies.
1. OSS SPEEDS AI ADOPTION
OSS enables and increases AI adoption by reducing the level of mathematical and technical knowledge necessary to use AI. Implementing the complex math of algorithms into code is difficult and time-consuming, which means that if an open-source alternative already exists, it can be a huge benefit for any individual data scientist. Open-source developers often work on projects to build skills and get community feedback, but there is also prestige inherent in building popular OSS. Often, several different versions of the same algorithm are developed in OSS, with the best code winning out (perhaps due to its speed, versatility, or documentation). In addition to this competitive element, OSS can also be highly collaborative. Since OSS code is all public, it can be cross-examined and interrogated for bugs or possible improvements. With collaborative development and an engaged community, as often arises around popular OSS, this collaborative-competitive environment can frequently result in accessible, robust, and high-quality code.
This is especially important because many data scientists may not have the mathematical training necessary to implement especially complex algorithms. This is not meant as a criticism of data scientists, but the work of the average data scientist requires them to be more of a data explorer and programmatic problem solver than a pure mathematician. Generally, data scientists are focused on interpreting the results of their data analyses and trying to appropriately fit their algorithms into a digital service or product. This means that well-written open-source AI code significantly expands the capacity of the average data scientist, letting them use more current machine learning algorithms and functionality. Much attention has been paid to training and retaining AI talent, but making AI easier to use, which open-source code does, may have a similarly significant impact in enabling economic growth. Of course, this is undeniably a double-edged sword, as easier-to-use OSS AI also enables innovation in pernicious applications of AI, including cyberattacks and deepfakes.
2. OSS HELPS REDUCE AI BIAS
Similarly, open-source AI tools can enable the broader and better use of ethical AI. Open-source tools like IBM’s AI Fairness 360, Microsoft’s Fairlearn, and the University of Chicago’s Aequitas ease technical barriers to detecting and mitigating AI bias. There are also open-source tools for interpretable and explainable AI, such as IBM’s AI Explainability 360 or Christoph Molnar’s interpretable machine learning tool and book, which make it easier for data scientists to interrogate the inner workings of their models. This is critical since data scientists and machine learning engineers at private companies are often time-constrained and operating in competitive markets. In order to keep their jobs, they must work hard on developing models and building products, without necessarily facing the same pressure to thoroughly examine models for biases. Academic researchers and journalists have done a remarkable job generating broad public awareness of the potential harms of AI bias, and so many data scientists understand these concerns and are personally invested in building ethical AI systems. For those engaged, but busy, data scientists, open-source code can be incredibly helpful in discovering and mitigating discriminatory aspects of machine learning.
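As a rough illustration of the kind of check these toolkits automate, the sketch below computes one common bias measure, the gap in favorable-outcome rates between groups (often called the demographic parity difference), in plain Python. The function names and the data are hypothetical and chosen for illustration. The libraries named above provide far more complete and carefully tested versions of checks like this.

```python
def selection_rate(predictions):
    """Fraction of cases that received the favorable outcome (1)."""
    return sum(predictions) / len(predictions)


def demographic_parity_difference(predictions, groups):
    """Return the largest gap in selection rate between groups, plus the rates."""
    by_group = {}
    for pred, group in zip(predictions, groups):
        by_group.setdefault(group, []).append(pred)
    rates = {g: selection_rate(p) for g, p in by_group.items()}
    return max(rates.values()) - min(rates.values()), rates


# Hypothetical model outputs: 1 = approved, 0 = denied.
predictions = [1, 1, 1, 0, 1, 0, 0, 0]
groups = ["A", "A", "A", "A", "B", "B", "B", "B"]

gap, rates = demographic_parity_difference(predictions, groups)
print(rates)  # {'A': 0.75, 'B': 0.25}
print(gap)    # 0.5
```

A gap of zero would mean both groups receive the favorable outcome at the same rate. A single number like this is only a starting point, since no single metric settles whether a model is fair.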
While more government oversight of AI is certainly necessary, policymakers should also more frequently consider investing in OSS for ethical AI as a different lever to improve AI’s role in society. At present, government funding tends to support code development only in the pursuit of academic research. The Chan Zuckerberg Initiative, which funds critical OSS projects, writes that OSS “is crucial to modern scientific research… yet even the most widely-used research software lacks dedicated funding.” This problem is similarly true in the ethical AI space, where government funding exists only for OSS used in early-stage research. For instance, in collaboration with Amazon, the National Science Foundation (NSF) is funding tens of millions of dollars in grants for further academic research into AI fairness. This research is very likely to produce highly valuable OSS, but even the most successful projects will be challenged to find continued funding for development, support, documentation, and dissemination. Funders who are interested in ethical AI, including both government agencies and private foundations, should consider OSS as a necessary component of ethical AI, and look to support its sustainable development and widespread adoption.
3. OSS AI TOOLS ADVANCE SCIENCE
Perhaps even more than technology companies, scientific researchers from many domains gain tremendously from open-source AI. For instance, a series of responses to a tweet by François Chollet, developer of the open-source AI software Keras, demonstrates how his OSS is being used to identify subcomponents of mRNA molecules and build neural interfaces to better help visually impaired people see. The separation of these roles, the developer and the scientist, is common and generally enables both better tools and better science. Most scientific researchers cannot be expected to produce new knowledge within their fields while also constantly implementing cutting-edge statistical tools. Of course, the value of OSS to science was established long before the modern re-emergence of machine learning. It is not uncommon for entire community ecosystems of OSS to grow around specific scientific endeavors. Take, for instance, the OSS project Bioconductor, which was founded in 2001 and now contains over two thousand OSS tools for genomic analysis.
Yet the fact that scientific OSS is not new should not distract from its incredible value, nor should it mislead one into thinking that the proliferation of OSS AI tools was a certain outcome. In 2007, a group of researchers argued that “the lack of openly available algorithmic implementations is a major obstacle to scientific progress” in a paper entitled “The Need for Open Source Software in Machine Learning.” Certainly, the lack of OSS is not as prevalent a problem today, although there are still efforts to raise the share of academic papers that publicly release their code (currently around 50% at the Neural Information Processing Systems conference and 70% at the International Conference on Machine Learning). Recognizing this value, policymakers should continue to encourage OSS code in the sciences (as through the NSF Fairness in AI program), and certainly avoid inhibiting it, as in the analogous case of the unfortunate consequences of the EU’s data protection law on the sharing of scientific data.
OSS also makes research more reproducible, enabling scientists to check and confirm one another’s results at a time when much of science still faces an ongoing replication crisis. OSS is most directly helpful to reproducible research because the same OSS is available to many different researchers. Without knowing precisely how an experiment or analysis was done, critically evaluating the results of scientific papers can be difficult or impossible. Even small changes in how a mathematical algorithm was implemented can lead to different results, but using the same OSS code can greatly mitigate this source of uncertainty. This general accessibility also means that the OSS commonly used in a field will be better understood within that field, leading to easier interrogation of its use.
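A small, hypothetical example shows how an implementation detail can change a reported result. Two researchers who both say they "computed the variance" of the same measurements will report different numbers if one divides by n (population variance) and the other by n - 1 (sample variance). The data below are made up for illustration; both functions come from Python's standard `statistics` module.

```python
import statistics

# Made-up measurements, for illustration only.
measurements = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]

# Population variance: sum of squared deviations divided by n.
population_variance = statistics.pvariance(measurements)

# Sample variance: the same sum divided by n - 1.
sample_variance = statistics.variance(measurements)

print(population_variance)
print(sample_variance)
print(sample_variance / population_variance)  # ratio is n / (n - 1)
```

With eight data points the two answers differ by a factor of 8/7, roughly 14 percent. When both researchers call the same named function from the same open-source library, this ambiguity disappears, and a reader can check exactly which definition was used.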
4. OSS AI HELPS AND HINDERS TECHNOLOGY SECTOR COMPETITION
OSS has significant ramifications for competition policy, too. At first glance, one might be inclined to think that open-source code enables more market competition, yet this is not clearly the case. On the one hand, the public release of machine learning code broadens and better enables its use. In many industries, this is likely a net boon, and enables more AI adoption with less AI talent, as discussed above. However, OSS AI tools are unlikely to check the growing influence and anti-competitive behavior of the largest technology companies. In terms of their online platforms, it is predominantly the proprietary data and network effects that keep companies like Google, Facebook, and Amazon a step above the competition. The ability to use the same algorithms does not really factor into why competing with these large companies is so difficult.
In fact, for Google and Facebook, the open sourcing of their deep learning tools (TensorFlow and PyTorch, respectively) may have the exact opposite effect, further entrenching them in their already fortified positions. While OSS is often associated with community involvement and more distributed influence, Google and Facebook appear to be holding on tightly to their software. Although TensorFlow was open-sourced in 2015, the overwhelming majority of its most prolific contributors are Google employees, and Google pays for administrative staff to run the project. Similarly, almost all of the core developers for PyTorch are Facebook employees. This isn’t surprising, but it is noteworthy. Even in open sourcing them, Google and Facebook are not actually relinquishing any control over the development of these deep learning tools. So, while these tools are certainly more accessible to the public, and their release creates more transparency about how they function, the oft-stated goal of ‘democratizing’ technology through OSS is, in this case, euphemistic.
Conversely, these companies are gaining influence over the AI market through OSS, while the OSS AI tools not backed by companies, such as Caffe and Theano, seem to be losing significance in both AI research and industry. By making their tools the most common in industry and academia, Google and Facebook benefit from the public research conducted with those tools, and, further, they cultivate a pipeline of data scientists and machine learning engineers trained in their systems. In a sector with fierce competition for AI talent, TensorFlow and PyTorch also help Google and Facebook bolster their reputation as the leading companies to work on cutting-edge AI problems. Other open-source developers have even added functionality and created more approachable ways to use the AI tools, as is the case with Fast.ai for PyTorch and Keras for TensorFlow. Collectively, these benefits are significant enough that creating leading open-source tools is clearly part of the competitive strategy for these companies. Google and Facebook have also done so in web development, releasing Angular.js and React.js, respectively. All told, the benefits to Google and Facebook of dominating OSS deep learning are significant, and this should be accounted for in any discussions of technology sector competition.
5. OSS CREATES DEFAULT AI STANDARDS
OSS AI also has important implications for some mainstays of international policy discussions, especially standards bodies. A range of standards bodies, such as IEEE, ISO/IEC JTC 1, the European Union’s CEN-CENELEC, the U.S.’s NIST, and many others, all seek to influence the rapidly emerging world of AI. Yet, in addition to competing with one another for prominence, these bodies have to navigate a field primarily driven by OSS whose default settings have become the de facto standards.
In other industries, standards bodies have sought to disseminate best practices and enable interoperable technology. For much of the machine learning world, this entails trying to encourage consistency and interoperability in a diverse ecosystem of OSS. However, the diversified use of operating systems, programming languages, and specific tools means that AI interoperability challenges have already received substantial attention. This has led to extensive work on technical solutions to interoperability that do not require making consistent coding choices, such as containerization software and cloud-based microservices. These advances, now well used throughout the industry, make the interoperability appeal of standards less obvious. Further, the data science community is somewhat informal, with many practices and standards disseminated through Twitter, blog posts, and OSS documentation. Standards bodies may have to make a significant investment to entice this community into participating in their processes, and so far it is not clear that OSS developers are extensively involved in the ongoing AI standards discussions.
For deep learning specifically, the absence of diversity may also pose a challenge for standards bodies. The apparent dominance of TensorFlow and PyTorch means that Google and Facebook have outsized influence over the development and common use of deep learning methods, influence they may be reluctant to cede to consensus-driven organizations. Still, the large technology companies, including Google, IBM, and Microsoft, are engaged and exerting influence through the standards bodies, suggesting they believe these standards may come into meaningful effect. It’s unclear how precisely the interaction between OSS and international standards for AI will unfold, but OSS developments and developers will certainly play an important role in the way that AI is used, and they should be more involved in these debates.
AI POLICY IS INTRINSICALLY TIED TO OSS
From research to ethics, and from competition to innovation, open-source code is playing a central role in the developing use of AI. This makes the consistent absence of open-source developers from policy discussions quite notable, since they wield meaningful influence over, and highly specific knowledge of, the direction of AI. Involving more OSS AI developers can help AI policymakers more routinely consider the influence of OSS on the outcomes we aspire to—the equitable, just, and prosperous use of AI. This may lead to asking different, important questions. Are we comfortable with an AI world dependent on open source, but entirely corporate controlled, software? How can government funding best enable and encourage the beneficial use of AI? What is the right role for standards in a world powered by OSS algorithms? Certainly, the goals and challenges of AI governance are tied to AI’s open-source code. By involving more OSS AI developers, AI policymakers can better consider the influence of OSS in the pursuit of the just and equitable development of AI.