CodeSearchNet by GitHub

Code retrieval using natural language. A benchmark created with Weights & Biases.


Introduction

Searching for code is one of the most common tasks for software developers, but search engine results are often frustrating. Unlike natural language processing, which has standard benchmarks such as GLUE, code search has no standard dataset suitable for evaluation. GitHub is partnering with Weights & Biases to release a large labeled dataset and baseline models. Our leaderboard uses an annotated dataset of queries to evaluate the quality of code search models.

Project Overview

CodeSearchNet is a collection of datasets and benchmarks that explore the problem of code retrieval using natural language. This research is a continuation of some ideas presented in this blog post and is a joint collaboration between GitHub and the Deep Program Understanding group at Microsoft Research - Cambridge. We aim to provide a platform for community research on semantic code search via the following:

1. Instructions for obtaining large corpora of relevant data
2. Open source code for a range of baseline models, along with pre-trained weights
3. Baseline evaluation metrics and utilities
4. Mechanisms to track progress in this community benchmark, hosted by Weights & Biases

We hope that CodeSearchNet is a step towards engaging with the broader machine learning and NLP community regarding the relationship between source code and natural language. We describe a specific task here, but we expect and welcome other uses of our dataset. More context regarding the motivation for this problem is in this technical report.

Dataset

The primary dataset consists of 2 million (comment, code) pairs from open source libraries. Concretely, a comment is a top-level function or method comment (e.g. docstrings in Python), and code is an entire function or method. Currently, the dataset contains Python, JavaScript, Ruby, Go, Java, and PHP code. Throughout this repo, we refer to the terms docstring and query interchangeably. We partition the data into train, validation, and test splits such that code from the same repository can only exist in one partition. Currently this is the only dataset on which we train our model. Summary statistics about this dataset can be found in this notebook.
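
For illustration, here is a minimal sketch of how the (comment, code) pairs might be inspected once the pre-processed files have been downloaded (see Setup below). The file path and the field names `docstring` and `code` are assumptions based on the description above; consult the data documentation in the repository for the authoritative schema.

```python
import gzip
import json

# Hypothetical path: adjust to wherever script/setup placed the extracted
# data on your machine (one .jsonl.gz shard per language and split).
DATA_FILE = "resources/data/python/final/jsonl/train/python_train_0.jsonl.gz"

def iter_records(path):
    """Yield one JSON record per line of a gzipped JSONL file."""
    with gzip.open(path, "rt", encoding="utf-8") as f:
        for line in f:
            yield json.loads(line)

# Peek at the first few pairs. The "docstring" field is the comment (query)
# and "code" is the full function or method body.
for i, record in enumerate(iter_records(DATA_FILE)):
    print(record["docstring"][:80])   # start of the comment
    print(record["code"][:120])       # start of the function body
    print("-" * 60)
    if i == 2:
        break
```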

Evaluation

The metric we use for evaluation is Normalized Discounted Cumulative Gain (NDCG). Please reference this paper for further details regarding model evaluation.
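
As a concrete illustration of the metric, the sketch below computes NDCG for a single query from a ranked list of graded relevance labels. It is a generic reference implementation of the standard formula, not the exact evaluation script used for the leaderboard.

```python
import math

def dcg(relevances):
    """Discounted cumulative gain: rel_i / log2(i + 1) summed over ranks i = 1..k."""
    return sum(rel / math.log2(rank + 2)          # enumerate is 0-indexed, hence +2
               for rank, rel in enumerate(relevances))

def ndcg(ranked_relevances):
    """Normalized DCG: DCG of the actual ranking divided by the DCG of the ideal ranking."""
    ideal_dcg = dcg(sorted(ranked_relevances, reverse=True))
    return dcg(ranked_relevances) / ideal_dcg if ideal_dcg > 0 else 0.0

# Example: a query whose top-5 results received these (hypothetical) relevance labels.
print(ndcg([3, 0, 2, 1, 0]))  # ~0.93 - close to, but below, the ideal ordering [3, 2, 1, 0, 0]
```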

Annotations

We manually annotated retrieval results for the six languages from 99 general queries. This dataset is used as ground-truth data for evaluation only. Please refer to this paper for further details on the annotation process.

Setup

You should only have to perform the setup steps once to download the data and prepare the environment.

1. Due to the complexity of installing all dependencies, we prepared Docker containers to run this code. You can find instructions on how to install Docker in the official docs. Additionally, you must install Nvidia-Docker to satisfy GPU-compute related dependencies. For those who are new to Docker, this blog post provides a gentle introduction focused on data science.
2. After installing Docker, you need to download the pre-processed datasets, which are hosted on S3. You can do this by running script/setup.

How to participate

Submitting results from the baseline repository

Follow the “Quickstart” instructions in the CodeSearchNet GitHub repository. This will guide you through running and submitting our baseline model end-to-end.

Submitting results from custom models

If you’ve created your own model without using the GitHub CodeSearchNet repo, please refer to the full submission instructions in the CodeSearchNet GitHub repository.