Itamar Golan 🤓

@ItakGol

11 Tweets 147 reads May 05, 2023
GitHub Copilot RIP? 🕊🪦
Introducing StarCoder🌟
All you need to Know (+Demo+Extension+Model+Data)⤵️⤵️⤵️
1/ 🙈 Introduction
StarCoder and StarCoderBase are Large Language Models for Code trained on GitHub data. They outperform existing open Code LLMs on programming benchmarks and match or surpass closed models such as OpenAI's code-cushman-001, which powered early versions of GitHub Copilot.
2/🤖 With a context length of over 8,000 tokens, they can process more input than any other open LLM. They can act as a technical assistant, autocomplete code, modify code and explain code in natural language.
3/ 📚 The models are released under an improved OpenRAIL license, making it easier for companies to integrate them into their products. The models are expected to serve as a solid foundation for the community to use and adapt for their use-cases and products.
4/🤓 Evaluation
StarCoder was evaluated on the HumanEval benchmark for Python and outperforms larger models such as PaLM, LaMDA and LLaMA, despite being smaller!
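For context on what a HumanEval score measures: results on this benchmark are usually reported as pass@k, the chance that at least one of k sampled completions passes the unit tests. Here is a minimal sketch of the standard unbiased estimator, assuming n samples per problem of which c pass. The function name `pass_at_k` is my own, and this is not taken from the StarCoder evaluation harness.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Estimate pass@k from n samples, c of which passed the tests.

    Computes 1 - C(n - c, k) / C(n, k): one minus the probability that
    a random draw of k samples contains no passing sample.
    """
    if not 0 <= c <= n or not 1 <= k <= n:
        raise ValueError("need 0 <= c <= n and 1 <= k <= n")
    if n - c < k:
        # Fewer than k failing samples, so every draw of k includes a pass.
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)


if __name__ == "__main__":
    # Hypothetical numbers for illustration only, not StarCoder results.
    print(pass_at_k(n=10, c=5, k=1))
```

The benchmark score is this value averaged over all problems.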
5/ 📈 It appears that adding a specific prompt significantly increased the HumanEval score of StarCoder from 34% to over 40%, setting a new state-of-the-art result for open models.
StarCoder is multilingual and performs well on MultiPL-E and DS-1000 benchmarks.
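The prompt trick mentioned in 5/ amounts to prepending a short hint to each HumanEval problem before asking the model to complete it. A minimal sketch, assuming the prefix is the one I recall from the StarCoder announcement; treat the exact string as an assumption and check it against the official release. The helper name `build_prompt` is hypothetical.

```python
# Assumed prefix (recalled from the StarCoder announcement, not verified here).
PROMPT_PREFIX = (
    "<filename>solutions/solution_1.py\n"
    "# Here is the correct implementation of the code exercise\n"
)


def build_prompt(problem: str, prefix: str = PROMPT_PREFIX) -> str:
    """Prepend the hint prefix to a HumanEval-style problem stub."""
    return prefix + problem


if __name__ == "__main__":
    # A made-up stub in the style of a HumanEval problem.
    stub = 'def add(a, b):\n    """Return the sum of a and b."""\n'
    print(build_prompt(stub))
```

The resulting string is what would be sent to the model for completion. Nothing else about the evaluation changes.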
6/ 🧬 Training data
The model was trained on a subset of The Stack 1.2.
The dataset consists only of permissively licensed code and includes an opt-out process so that code contributors can remove their data from the dataset (see Am I in The Stack).
7/ 📧 In addition, they removed Personally Identifiable Information (PII) such as names, passwords and email addresses from the training data.
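To give a feel for what PII removal means on source code, here is a toy sketch that masks email addresses with a regular expression. This is a stand-in for illustration, not the StarCoder pipeline. The pattern, the `<EMAIL>` placeholder and the function name `redact_emails` are all my own assumptions, and a regex like this will miss names and passwords entirely.

```python
import re

# Deliberately simple email pattern; real-world detection needs more care.
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")


def redact_emails(source: str, placeholder: str = "<EMAIL>") -> str:
    """Replace every email-like substring in source with a placeholder."""
    return EMAIL_RE.sub(placeholder, source)


if __name__ == "__main__":
    snippet = "# Maintainer: jane.doe@example.com\nprint('hello')\n"
    print(redact_emails(snippet))
```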
8/ 👨‍💻 Additional Resources
- Model weights and intermediate checkpoints are released under the OpenRAIL license.
- All code for data preprocessing and training is released under the Apache 2.0 license.
- A comprehensive evaluation harness for code models is available.
9/ 📚 Also-
- A new PII dataset for training and evaluating PII removal is provided.
- The fully preprocessed dataset used for training is also included.
- A code attribution tool for finding generated code in the dataset is available.
