2/ 🙈 Introduction
StarCoder and StarCoderBase are large language models for code (Code LLMs) trained on GitHub data. They outperform existing open Code LLMs on programming benchmarks and match or surpass closed models such as Copilot.
2/ 🤖 With a context length of over 8,000 tokens, they can process more input than any other open LLM. They can act as a technical assistant, autocomplete and modify code, and explain code in natural language.
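The context length caps how many tokens the model sees at once, so long files have to be trimmed before asking for a completion. A minimal sketch, assuming the 8,000-token figure above as a conservative budget; `fit_to_context` and its parameters are hypothetical names, and the token IDs would come from the model's tokenizer, which is not shown here:

```python
from typing import List

# Assumption: conservative budget taken from the "over 8,000 tokens" figure.
MAX_CONTEXT = 8000


def fit_to_context(token_ids: List[int], max_new_tokens: int = 256,
                   max_context: int = MAX_CONTEXT) -> List[int]:
    """Trim a tokenized prompt so prompt + generated tokens fit the window."""
    budget = max_context - max_new_tokens
    if budget <= 0:
        raise ValueError("max_new_tokens leaves no room for the prompt")
    # For autocomplete, the code closest to the cursor matters most,
    # so drop tokens from the start of the prompt rather than the end.
    return token_ids[-budget:]
```

Prompts shorter than the budget pass through unchanged; longer ones keep only their most recent tokens.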
3/ 📚 The models are released under an improved OpenRAIL license, making it easier for companies to integrate them into their products. They are expected to serve as a solid foundation for the community to use and adapt for its own use cases and products.
5/ 📈 Adding a specific prompt raised StarCoder's HumanEval score from 34% to over 40%, a new state-of-the-art result for open models.
StarCoder is multilingual and performs well on MultiPL-E and DS-1000 benchmarks.
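Scores like the HumanEval figure above are typically reported as pass@k: the chance that at least one of k sampled solutions passes the unit tests. A short sketch of the standard unbiased estimator, assuming n samples are generated per problem and c of them pass; the function name is illustrative and this is not taken from the StarCoder evaluation code:

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimate: n samples per problem, c of them correct."""
    if not 0 <= c <= n or not 1 <= k <= n:
        raise ValueError("need 0 <= c <= n and 1 <= k <= n")
    if n - c < k:
        # Every size-k subset contains at least one correct sample.
        return 1.0
    # 1 minus the probability that all k drawn samples are incorrect.
    return 1.0 - comb(n - c, k) / comb(n, k)
```

With k = 1 this reduces to c / n, the fraction of samples that pass. A benchmark score is the average of this value over all problems.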
6/ 🧬 Training data
The model was trained on a subset of The Stack 1.2.
The dataset consists only of permissively licensed code and includes an opt-out process so that code contributors can remove their data from the dataset (see Am I in The Stack).
huggingface.co
7/ 📧 In addition, Personally Identifiable Information (PII) such as names, passwords, and email addresses was removed from the training data.
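As a rough illustration of what PII redaction does to a source file, here is a regex-based sketch that handles email addresses only. It is not the method used for the training data; the pattern, the `<EMAIL>` placeholder, and the function name are assumptions chosen for the example:

```python
import re

# Simplified pattern for illustration only: it will miss unusual addresses.
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")


def redact_emails(source: str, placeholder: str = "<EMAIL>") -> str:
    """Replace anything that looks like an email address with a placeholder."""
    return EMAIL_RE.sub(placeholder, source)
```

Names and passwords are much harder to catch with patterns alone, which is one reason a dedicated PII dataset for training and evaluating detectors is useful.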
8/ 👨💻 Additional Resources
- Model weights and intermediate checkpoints are released under the OpenRAIL license.
- All code for data preprocessing and training is released under the Apache 2.0 license.
- A comprehensive evaluation harness for code models is available.
9/ 📚 Also:
- A new PII dataset for training and evaluating PII removal is provided.
- The fully preprocessed dataset used for training is also included.
- A code attribution tool for finding generated code in the dataset is available.
10/ 📚 Links
Paper - huggingface.co
GitHub - github.com
StarCoder Model - huggingface.co
StarCoderBase Model - huggingface.co
VS Code Extension - marketplace.visualstudio.com
Data - huggingface.co
marketplace.visualstudio.com/items?itemName…
huggingface.co/datasets/bigco…
github.com/bigcode-projec…
huggingface.co/bigcode/starco…
huggingface.co/bigcode/starco…
huggingface.co/blog/starcoder…
It's hard to stay up-to-date with AI. I'll do that job for you ⤵️
tinyurl.com