Solomon Klappholz
The StarCoder code generation tool has received a massive update that could position it as a leading open source alternative to services such as GitHub Copilot.
Initially launched in May 2023 as part of a collaboration between Hugging Face and ServiceNow, the latest iteration, StarCoder2, now also has major industry backing in the form of Nvidia.
The code generation tool supports developers by automating code completion, similar to GitHub Copilot or Amazon CodeWhisperer. It's also capable of summarizing existing code and generating original snippets.
StarCoder2 is available in three different model sizes, each trained by a different member of the partnership.
The smallest version is a three billion-parameter model trained by ServiceNow, while Hugging Face trained the seven billion-parameter model.
Nvidia was responsible for the largest iteration of StarCoder2 with a 15 billion-parameter model built using its NeMo generative AI platform and trained on Nvidia’s accelerated AI infrastructure.
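For readers who want to experiment, here is a minimal sketch of loading one of the models for code completion with the Hugging Face transformers library. The bigcode/starcoder2-3b checkpoint name is an assumption based on the Hub listing (the 7B and 15B variants follow the same naming pattern), and even the smallest model needs several gigabytes of memory:

from transformers import AutoModelForCausalLM, AutoTokenizer

# Assumed checkpoint name; starcoder2-7b and starcoder2-15b follow the same pattern.
checkpoint = "bigcode/starcoder2-3b"

tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModelForCausalLM.from_pretrained(checkpoint)

# Ask the model to complete a function signature, Copilot-style.
prompt = "def fibonacci(n):"
inputs = tokenizer(prompt, return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=64)
print(tokenizer.decode(outputs[0]))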
All three StarCoder2 models support a significantly expanded array of programming languages.
The original StarCoder tool was trained on over 80 different programming languages, whereas StarCoder2 boasts the ability to generate code in 619 languages.
StarCoder2 is underpinned by the Stack v2 dataset, the largest open code dataset suitable for LLM pretraining, according to Hugging Face. The AI company said this latest dataset is seven times larger than the original Stack v1.
The trio believes that, paired with new training techniques, this expanded dataset will help the models understand low-resource programming languages, mathematics, and discussions of program source code.
Each of the new LLMs also performs far better, with the three billion-parameter StarCoder2 matching Hugging Face's original 15 billion-parameter StarCoder model.
StarCoder2 could be a game changer for devs
StarCoder2 is a huge step forward for open source AI code generation. In opening the door to competition within the open source community for the title of ‘best AI pair programmer’ and putting the heat on Meta’s Code Llama, it has ensured that developers have a future of solid, open options to look forward to.
In the paper accompanying the launch, the team behind StarCoder2 presented evidence that the model can go toe-to-toe with Code Llama even at its largest, 34 billion-parameter size.
In MBPP, a benchmark that pits a coding model against approximately 1,000 entry-level Python programming problems, StarCoder2's 15 billion-parameter model scored 66.2 against Code Llama 34B's 65.4.
The fact that the training data for StarCoder2 is openly available through the Stack v2 will also be a relief to many organizations.
Future legal battles will be fought over who owns the data used to train AI, and any company that discovers its source code was generated from scraped proprietary data could face a difficult and costly replacement process down the line.
In contrast, the openness of StarCoder2 is a crowning achievement. To credit the developers whose code formed the basis of StarCoder2, users can enter outputs into a dataset search on Hugging Face to check whether the code the tool has produced is ‘original’ or a verbatim copy from its immense training data.
Alternatively, teams can freely search the dataset themselves.
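For teams that want to do that inspection programmatically, here is a hedged sketch using the Hugging Face datasets library. The bigcode/the-stack-v2 repository name is an assumption from the Hub listing; the dataset is gated, so access must be requested first, and a language-specific config or data_dir may be required:

from datasets import load_dataset

# Stream the gated dataset instead of downloading terabytes up front.
# The repository name is an assumption; a language-specific config or
# data_dir argument may be needed depending on the dataset layout.
ds = load_dataset("bigcode/the-stack-v2", split="train", streaming=True)

# Print the metadata fields of a few records to see what is searchable
# (field names vary between dataset versions, so inspect rather than hard-code).
for i, row in enumerate(ds):
    print(sorted(row.keys()))
    if i >= 2:
        break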
It’s in the interest of all developers to have strong options like this on the market, as innovation and competition in the sector will only drive models to become more accurate. But the precedent StarCoder2 sets in terms of responsible AI model creation through open source may be its lasting legacy.