Original article
2006.03511 Unsupervised Translation of Programming Languages
News
Research
Lample et al.
Researcher
Guillaume Lample
glample@fb.com

## Fully unsupervised neural transcompiler

• Converts languages obtained from GitHub
• BigQuery
• unsupervised
• it looks for previously undetected patterns in data sets without labels and with a minimal amount of human supervision
• outperforms rule-based baselines by a “significant” margin.

### Model of TransCoder

• seq2seq with attention

• composed of
• encoder
• decoder
• transformer architecture
• the same model is used for all programming languages.
• Trained using the three principles of unsupervised machine translation

• initialization,
• LMing, and
• back-translation.

In this section, we summarize these principles and detail how we instantiate them to translate programming languages.

### Cross Programming Language Model pretraining

 1 2 3 4 5 6 7 8  Pretraining A key ingredient of unsupervised machine translation. It ensures that sequences with a similar meaning are mapped to the same latent representation, regardless of their languages.

Originally, pretraining was done by initializing the model with cross-lingual word representations.

In the context of unsupervised English-French translation, the embedding of the word “cat” will be close to the embedding of its French translation “chat”.

Cross-lingual word embeddings can be obtained by training monolingual word embeddings and aligning them in an unsupervised manner.

Subsequent work showed that pretraining the entire model (and not only word representations) in a cross-lingual way could lead to significant improvements in unsupervised machine translation.

### The pretraining strategy of Lample and Conneau

In particular, we follow the pretraining strategy of Lample and Conneau, where a XLM is pretrained with a masked LMing objective on monolingual source code datasets.

First principle: initialization
The first principle initializes the model with

As a result, pieces of code that express the
same instructions are mapped to the same
representation, regardless of the programming
language.

Second principle: denoising
Train the decoder to always generate valid
sequences, even when fed with noisy data, and
increases the encoder robustness to input
noise.
Third (and last) principle: back-translation
Allows the model to generate parallel data
which can be used for training.

Whenever the Python C++ model becomes better,
it generates more accurate data for the C++
Python model, and vice versa.

We obtain cross-lingual embeddings after training.

The cross-lingual nature of the resulting model comes from the significant number of common tokens (anchor points) that exist across languages.

In the context of English-French translation, the anchor points consists essentially of digits and city and people names.

In programming languages, these anchor points come from common keywords (e.g. for, while, if, try), and also digits, mathematical operators, and English strings that appear in the source code.

For the masked LMing (MLM) objective, at each iteration we consider an input stream of source code sequences, randomly mask out some of the tokens, and train TransCoder to predict the tokens that have been masked out based on their contexts.

We alternate between streams of batches of different languages.

This allows the model to create high quality, cross-lingual sequence representations.

Denoising auto-encoding

We initialize the encoder and decoder of the seq2seq model with the XLM model that was pretrained.

The initialization is straightforward for the encoder, as it has the same architecture as the XLM model.

The transformer decoder, however, has extra parameters related to the source attention mechanism .

Following Lample and Conneau, we initialize these parameters randomly.

XLM pretraining
allows the seq2seq model to
generate high quality representations of input
sequences.

However, the decoder lacks the capacity to translate, as it has never been trained to decode a sequence based on a source representation.

To address this issue, we train the model to encode and decode sequences with a Denoising Auto-Encoding (DAE) objective .

The DAE objective operates like a supervised machine translation algorithm, where the model is trained to predict a sequence of tokens given a corrupted version of that sequence.

To corrupt a sequence, we use the same noise model as the one described in Lample et al. .

Namely, we randomly mask, remove and shuffle input tokens.

In practice, the “cross-linguality” of the model highly depends on the amount of anchor points across languages.

As a result, a XLM model trained on English- French will provide better cross-lingual representations than a model trained on English-Chinese, because of the different alphabet which reduces the number of anchor points.

In programming languages, the majority of strings are composed of English words, which results in a fairly high number of anchor points, and the model naturally becomes cross- lingual.

## Arxiv Summary

  1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54  A transcompiler, also known as source-to- source translator, is a system that converts source code from a high-level programming language (such as C++ or Python) to another. Transcompilers are primarily used for interoperability, and to port codebases written in an obsolete or deprecated language (e.g. COBOL, Python 2) to a modern one. They typically rely on handcrafted rewrite rules, applied to the source code abstract syntax tree. Unfortunately, the resulting translations often lack readability, fail to respect the target language conventions, and require manual modifications in order to work properly. The overall translation process is timeconsuming and requires expertise in both the source and target languages, making code- translation projects expensive. Although neural models significantly outperform their rule-based counterparts in the context of NL translation, their applications to transcompilation have been limited due to the scarcity of parallel data in this domain. In this paper, we propose to leverage recent approaches in unsupervised machine translation to train a fully unsupervised neural transcompiler. We train our model on source code from open source GitHub projects, and show that it can translate functions between C++, Java, and Python with high accuracy. Our method relies exclusively on monolingual source code, requires no expertise in the source or target languages, and can easily be generalized to other programming languages. We also build and release a test set composed of 852 parallel functions, along with unit tests to check the correctness of translations. We show that our model outperforms rule-based commercial baselines by a significant margin.

## GPT-3 summary of Arxiv summary

We train a neural transcompiler using monolingual source code from GitHub, and show that it can translate functions between C++, Java, and Python with high accuracy.