Today, we're excited to release Command R7B, the smallest, fastest, and final model in our R series of enterprise-focused large language models (LLMs). Command R7B provides state-of-the-art performance in its class of open-weights models across real-world tasks that matter for users. The model is designed for developers and businesses that need to optimize for the speed, cost-performance, and compute resources of their use cases.
Like our other models in the R series, Command R7B offers a context length of 128k and excels in capabilities important for a wide range of business applications. It delivers a powerful combination of multilingual support, citation-verified retrieval-augmented generation (RAG), reasoning, tool use, and agentic behavior. Thanks to its compact size and efficiency, it can be served on low-end GPUs, a MacBook, or even CPUs, drastically lowering the cost of deploying AI applications into production.
High performance in a small package
A well-rounded model
Command R7B excels on standardized and externally verifiable benchmarks such as the HuggingFace Open LLM Leaderboard. Compared to other similarly sized open-weights models, Command R7B ranks first on average with strong performance across all tasks.
HuggingFace Leaderboard evaluation results. Competitor numbers are taken from the official leaderboard. Command R7B results are calculated by us using the official HuggingFace prompts and evaluation code.
Enhanced efficiency in math, code, and reasoning tasks
A major area of focus for Command R7B has been improving performance on math and reasoning, code, and multilingual tasks. In particular, the model matches or exceeds leading open-weights models in its class across common math and code benchmarks while using fewer parameters.
Model performance on math and code benchmarks. All numbers are from internal evaluations, except those marked with an asterisk, which are taken from externally reported results where these are higher. We use the base version of MBPPPlus; LBPP is the average across six languages; SQL is the average of three datasets (Spider Dev and Test, hard and extra-hard only; BirdBench; and an internal dataset); and COBOL is an internally developed dataset.
Document translation quality evaluated with corpus spBLEU on the NTREX dataset.
Command R7B outperforms other similarly sized open-weights models when it comes to core business use cases such as RAG, tool use, and AI agents. It is an ideal choice for enterprises looking for a cost-efficient model grounded in their internal documents and data. As with our other R series models, its RAG capability delivers native in-line citations that significantly reduce hallucinations and make fact-checking easier.
Performance evaluated across ChatRAGBench (10-dataset average), BFCL-v3, StrategyQA, Bamboogle, and ToolTalk-hard. Methodology and further details are provided in footnote [1] at the bottom of this post.
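To illustrate what grounded, citation-backed generation looks like in practice, here is a minimal sketch using the Cohere Python SDK's v2 chat endpoint. The model identifier, API key handling, and document snippets are illustrative assumptions rather than canonical values; check the platform documentation for the exact parameters.

```python
# Minimal RAG sketch with the Cohere Python SDK (v2 chat API).
# The model id, documents, and key handling below are illustrative assumptions.
import cohere

co = cohere.ClientV2(api_key="YOUR_API_KEY")  # assumed key setup

documents = [
    {"id": "policy-1", "data": {"text": "Employees accrue 1.5 vacation days per month."}},
    {"id": "policy-2", "data": {"text": "Unused vacation days expire at the end of Q1."}},
]

response = co.chat(
    model="command-r7b-12-2024",  # assumed platform model id
    messages=[{"role": "user", "content": "How many vacation days do I earn per month?"}],
    documents=documents,
)

print(response.message.content[0].text)

# In-line citations point back to the grounding documents, which is what
# makes fact-checking the response straightforward.
for citation in response.message.citations or []:
    print(citation)
```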
For tool use, we see stronger overall performance than models of similar size on the industry-standard Berkeley Function-Calling Leaderboard. This shows that Command R7B is particularly effective at tool use in real-world, diverse, and dynamic environments, and that it avoids calling tools unnecessarily, an important aspect of tool use in practical applications. Command R7B's multi-step tool use capabilities allow it to power fast and capable AI agents.
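As a rough sketch of how a single tool-use turn might look against the same chat endpoint (the tool schema, model id, and lookup function here are hypothetical, for illustration only):

```python
# Sketch of a single tool-use turn with the Cohere v2 chat API.
# Tool schema, model id, and the lookup tool are illustrative assumptions.
import json
import cohere

co = cohere.ClientV2(api_key="YOUR_API_KEY")

tools = [{
    "type": "function",
    "function": {
        "name": "lookup_order_status",  # hypothetical tool
        "description": "Returns the shipping status for an order id.",
        "parameters": {
            "type": "object",
            "properties": {"order_id": {"type": "string"}},
            "required": ["order_id"],
        },
    },
}]

response = co.chat(
    model="command-r7b-12-2024",  # assumed platform model id
    messages=[{"role": "user", "content": "Where is order 8817?"}],
    tools=tools,
)

# The model either answers directly or emits structured tool calls; in a full
# agent loop, tool results would be appended and the model called again.
for call in response.message.tool_calls or []:
    print(call.function.name, json.loads(call.function.arguments))
```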
Optimized for enterprise use cases
Our models are optimized for the capabilities enterprises need for real-world deployment of AI systems, and the R series delivers an unmatched balance of efficiency and strong performance. That means ensuring they excel in human evaluation, the gold standard for quality assessment. Command R7B outperforms similarly sized open-weights models in blind head-to-head evaluations by human raters on RAG use cases our customers care about when building AI assistants for functions like customer service, HR, compliance, and IT support.
Head-to-head human evaluation of Command R7B vs. Gemma 2 9B on a collection of 949 examples of enterprise RAG use cases. All examples are at least 3-way blind-annotated by specially trained human annotators, assessing fluency, faithfulness, and response utility.
Efficient and fast
Command R7B's compact size offers a reduced serving footprint that is ideal for rapid prototyping and iteration. It excels at high-throughput, real-time use cases like chatbots and code assistants. It also enables dramatically cheaper deployment infrastructure, such as consumer GPUs and CPUs, unlocking on-device inference.
We achieve this without compromising on our enterprise-grade security and privacy standards to protect customers' data.
Get started
Command R7B is available today on the Cohere Platform and on HuggingFace. We're excited to release this model's weights to give the AI research community greater access to cutting-edge technology.
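For local experimentation with the open weights, a minimal Hugging Face Transformers sketch along these lines should work; the Hub model id and generation settings are assumptions to adapt to your setup.

```python
# Minimal local-inference sketch with Hugging Face Transformers.
# The Hub model id and generation parameters are assumptions; adjust as needed.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "CohereForAI/c4ai-command-r7b-12-2024"  # assumed Hub identifier
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype="auto", device_map="auto")

messages = [{"role": "user", "content": "Summarize why small models lower deployment costs."}]
inputs = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)

outputs = model.generate(inputs, max_new_tokens=256)
print(tokenizer.decode(outputs[0][inputs.shape[-1]:], skip_special_tokens=True))
```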
| Cohere API Pricing | Input Tokens | Output Tokens |
|---|---|---|
| Command R7B | $0.0375 / 1M | $0.15 / 1M |
[1] Conversational RAG: Average performance over the 10-dataset ChatRAGBench benchmark, which tests the ability to generate responses in a wide range of settings, including conversational tasks, attending over long inputs, analyzing tables, and extracting and manipulating numerical information in financial settings. We improve the evaluation methodology with a PoLL judge ensemble (Verga et al., 2024) of Haiku, GPT-3.5, and Command R, which provides higher agreement with human annotators (Fleiss' kappa = 0.74 vs. 0.57 for the original, calculated over 20k human judgements).
Tool use: Performance on the BFCL-v3 benchmark on 12 Dec 2024. Where available, scores are taken from the public leaderboard; otherwise we use a best-effort internal evaluation with the official codebase. For competitors, we report the higher of their BFCL "prompted" or "function-calling" score. We report the Overall score, the Live subset score, which tests tool use in real-world, diverse, and dynamic environments, and the Irrelevance subset score, which tests how well models avoid calling tools unnecessarily.
REACT Agent/Multi-step: We assess the abilities of LangChain REACT agents connected to the internet to break down complex questions and formulate and successfully carry out a research plan to answer them, using Bamboogle and StrategyQA. Bamboogle is evaluated using a PoLL ensemble, and StrategyQA is judged by assessing whether the model follows a formatting instruction to end its answer with either "Yes" or "No". We use the test sets from Chen et al. (2023) and Press et al. (2023).
ToolTalk challenges a model to perform complex reasoning and actively seek information from users in order to execute complex user tasks, in settings such as account management, sending emails, and updating calendars.
ToolTalk-hard is evaluated using the soft success rate from the official ToolTalk repository. ToolTalk requires models to expose a function-calling API, which is not available for Gemma 2 9B.