CuBERT: BERT for Source Code Understanding

Pre-trained contextual embeddings such as BERT have significantly advanced natural-language understanding. BERT is essentially a Transformer encoder pre-trained on unlabeled data. Because its attention mechanism is bidirectional, the representation of each token is computed from the context on both sides of it. BERT-based models can be fine-tuned for downstream tasks with less labeled data and a smaller training budget, while achieving better accuracy. We present the first attempt at obtaining a high-quality contextual embedding of source code, called Code Understanding BERT, or CuBERT for short. CuBERT is trained on a massive corpus of 7.4M Python files from GitHub. We also create an open-sourced benchmark that comprises five classification tasks and one program-repair task, and demonstrate that CuBERT achieves the best results compared to multiple baselines. CuBERT is finding applications in several types of software engineering tasks, such as automated bug fixing, code completion, code search, and transfer learning.
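The role of bidirectional attention can be illustrated with a small sketch. This is not CuBERT's implementation: it is a single attention head with identity projections in place of learned query/key/value matrices, and the token embeddings and the helper names (`self_attention`, `first_token_changes`) are made up for illustration. It only shows that, without a causal mask, the representation of the first token depends on tokens that appear later in the code.

```python
import math

def softmax(scores):
    # Numerically stable softmax over a list of floats.
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def self_attention(vectors, causal=False):
    """Single-head scaled dot-product self-attention.

    Assumption: queries, keys and values are the input vectors themselves
    (identity projections), which a real Transformer would learn instead.
    """
    dim = len(vectors[0])
    outputs = []
    for i, query in enumerate(vectors):
        # Bidirectional: every position is visible.
        # Causal (left-to-right): only positions up to and including i.
        visible = list(range(i + 1)) if causal else list(range(len(vectors)))
        scores = [
            sum(q * k for q, k in zip(query, vectors[j])) / math.sqrt(dim)
            for j in visible
        ]
        weights = softmax(scores)
        outputs.append([
            sum(w * vectors[j][d] for w, j in zip(weights, visible))
            for d in range(dim)
        ])
    return outputs

# Hypothetical 2-d embeddings for the tokens of `x = y + 1`.
embeddings = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5], [0.2, 0.8], [0.9, 0.1]]

def first_token_changes(causal):
    # Replace the LAST token and check whether the FIRST token's output moves.
    altered = [row[:] for row in embeddings]
    altered[-1] = [-1.0, 2.0]
    before = self_attention(embeddings, causal)[0]
    after = self_attention(altered, causal)[0]
    return before != after

print(first_token_changes(causal=False))
print(first_token_changes(causal=True))
```

With the mask disabled, editing the last token changes the output for the first token; with a left-to-right mask it does not, because the first token can only attend to itself.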

References:

Learning and Evaluating Contextual Embedding of Source Code. Aditya Kanade, Petros Maniatis, Gogul Balakrishnan, Kensen Shi. In the 37th International Conference on Machine Learning (ICML), 2020.

https://github.com/google-research/google-research/tree/master/cubert

Faculty: Aditya Kanade, CSA
