mOthello: When Do Cross-Lingual Representation Alignment and Cross-Lingual Transfer Emerge in Multilingual Models?

Tianze Hua*
Brown University
Tian Yun*
Brown University
Ellie Pavlick
Brown University


Abstract

Many pretrained multilingual models exhibit cross-lingual transfer ability, which is often attributed to a learned language-neutral representation during pretraining. However, it remains unclear what factors contribute to the learning of a language-neutral representation, and whether the learned language-neutral representation suffices to facilitate cross-lingual transfer. We propose a synthetic task, Multilingual Othello (mOthello), as a testbed to delve into these two questions.

We find that:
  1. models trained with naive multilingual pretraining fail to learn a language-neutral representation across all input languages;
  2. the introduction of "anchor tokens" (i.e., lexical items that are identical across languages) helps cross-lingual representation alignment; and
  3. the learning of a language-neutral representation alone is not sufficient to facilitate cross-lingual transfer.
Based on our findings, we propose a novel approach -- multilingual pretraining with a unified output space -- that both induces the learning of a language-neutral representation and facilitates cross-lingual transfer.


Multilingual Othello

An illustration of the mechanism of mapping functions that map Othello game moves into tokens of specific languages.

We use Multilingual Othello (mOthello), a sequence modeling task based on the Othello board game, to investigate which factors are essential for learning language-neutral representations and whether such representations are sufficient to facilitate the cross-lingual transfer ability of multilingual models.

In mOthello, a model is given a sequence of game moves in a specific "language", and the task is to predict the next legal move in the same "language". This environment is appropriate for our purposes, since it separates the ground-truth "world" (i.e., the game state), which is assumed to be singular, from the language used to describe it, which can take any number of forms (languages). We later formulate our measure of language-neutral representations around these ground-truth underlying world states. The figures above show how mOthello instances can be used to generate a corpus of multilingual data for training and evaluation.
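For intuition, the following is a minimal, self-contained sketch of the language-independent game layer: it plays random legal Othello moves and emits a move transcript that the mapping functions can then translate into any mOthello language. This is our own illustration; the paper's data pipeline is built on the Othello World codebase rather than this code.

  import random

  # Minimal Othello move-sequence generator (our own sketch, not the paper's
  # data pipeline). 0 = empty, 1 = black, -1 = white.
  SIZE = 8
  DIRS = [(-1, -1), (-1, 0), (-1, 1), (0, -1), (0, 1), (1, -1), (1, 0), (1, 1)]

  def initial_board():
      board = [[0] * SIZE for _ in range(SIZE)]
      board[3][3], board[4][4] = -1, -1   # d4, e5: white
      board[3][4], board[4][3] = 1, 1     # e4, d5: black
      return board

  def flips(board, r, c, player):
      """Opponent discs flipped if `player` plays at (r, c); empty if illegal."""
      if board[r][c] != 0:
          return []
      flipped = []
      for dr, dc in DIRS:
          line, rr, cc = [], r + dr, c + dc
          while 0 <= rr < SIZE and 0 <= cc < SIZE and board[rr][cc] == -player:
              line.append((rr, cc))
              rr, cc = rr + dr, cc + dc
          if line and 0 <= rr < SIZE and 0 <= cc < SIZE and board[rr][cc] == player:
              flipped.extend(line)
      return flipped

  def legal_moves(board, player):
      return [(r, c) for r in range(SIZE) for c in range(SIZE) if flips(board, r, c, player)]

  def random_game():
      """Play random legal moves until neither side can move; return the transcript."""
      board, player, moves, passes = initial_board(), 1, [], 0
      while passes < 2:
          options = legal_moves(board, player)
          if not options:
              passes += 1
          else:
              passes = 0
              r, c = random.choice(options)
              for rr, cc in flips(board, r, c, player):
                  board[rr][cc] = player
              board[r][c] = player
              moves.append("abcdefgh"[c] + str(r + 1))  # e.g. "d3"
          player = -player
      return moves

  print(random_game()[:10])  # first ten moves of one random game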

To test the generalizability of our findings beyond one simple form of mOthello language, we introduce three variants of mOthello languages that mirror features of natural languages (a minimal mapping sketch in code follows the list):
  1. Atomic language maps each game move to a single (atomic) language-specific token. For example, moves [a1, a2, b1] are mapped to [a1, a2, b1] in an atomic language.
  2. Split language simulates the scenario when a semantic unit is represented by one or more tokens. In the context of mOthello, this means that each game move can be mapped to one or more tokens in a split language. For example, moves [a1, a2, b1] are mapped to [a11, a12, a21, b11, b12, b13] in a split language. The number of tokens each move is split into is sampled randomly from 1 to 3.
  3. Compositional language represents moves by decomposing each of them into its horizontal and vertical location on the board. In this type of language, tokens are reused to represent different moves in a compositional way. For example, moves [a1, a2, b1] are mapped to [a, 1, a, 2, b, 1] in a compositional language.
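As a concrete sketch of the mapping functions, the code below builds toy vocabularies for the three language types and translates a move sequence into each of them. The token naming scheme (suffixing moves with a language id) is our own illustrative convention, not the paper's actual vocabulary.

  import random

  # The 60 playable Othello squares (the four center squares are occupied at the start).
  MOVES = [f + str(r) for f in "abcdefgh" for r in range(1, 9)
           if f + str(r) not in {"d4", "d5", "e4", "e5"}]

  def atomic_language(lang_id):
      """Each move maps to a single language-specific token."""
      return {m: [f"{m}_{lang_id}"] for m in MOVES}

  def split_language(lang_id, max_pieces=3, seed=0):
      """Each move maps to a randomly chosen number (1-3) of language-specific tokens."""
      rng = random.Random(seed)
      return {m: [f"{m}{i}_{lang_id}" for i in range(1, rng.randint(1, max_pieces) + 1)]
              for m in MOVES}

  def compositional_language(lang_id):
      """Each move maps to its column token and row token; tokens are reused across moves."""
      return {m: [f"{m[0]}_{lang_id}", f"{m[1]}_{lang_id}"] for m in MOVES}

  def translate(moves, mapping):
      """Flatten a move sequence into the token sequence of one language."""
      return [tok for m in moves for tok in mapping[m]]

  game = ["a1", "a2", "b1"]
  print(translate(game, atomic_language("L1")))         # ['a1_L1', 'a2_L1', 'b1_L1']
  print(translate(game, compositional_language("L2")))  # ['a_L2', '1_L2', 'a_L2', '2_L2', 'b_L2', '1_L2']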


Multilingual Pretraining & mOthelloGPT

Illustration of three multilingual training approaches. Blue and green blocks represent contexts in 2 different languages, and tokens from the same language have the same color.

With training data generated via mOthello, we train GPT-2 models (mOthelloGPTs) with different multilingual pretraining approaches, as illustrated in the figure above, where block "M" represents an mOthelloGPT.
  1. Naive Multilingual Pretraining: A model is trained on multilingual corpora, with the objective of predicting next tokens in each language's own output vocabulary.
  2. Multilingual Pretraining with Anchor Tokens: A model is trained on multilingual corpora in which some tokens are shared across language pairs; these shared tokens are called anchor tokens. The objective is still to predict next tokens in each language's own output vocabulary.
  3. Multilingual Pretraining with Unified Output Space: A model is trained on multilingual corpora, with the objective of predicting next tokens in a single output space shared by all languages (see the sketch after this list).
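To make the difference between the output spaces concrete, here is a toy sketch of how the next-token training targets could be constructed under the language-specific and unified settings. The mapping MAP_L1 and the token naming scheme are hypothetical illustrations, not the paper's actual vocabulary.

  # Toy atomic mapping for one hypothetical language (illustration only).
  MAP_L1 = {"a1": "a1_L1", "a2": "a2_L1", "b1": "b1_L1"}

  def lang_specific_targets(moves, mapping):
      """Naive / anchor-token pretraining: next-token targets live in the
      same language-specific vocabulary as the inputs."""
      toks = [mapping[m] for m in moves]
      return list(zip(toks[:-1], toks[1:]))

  def unified_targets(moves, mapping):
      """Unified-output-space pretraining: inputs stay language-specific,
      but every language predicts the same language-neutral move ids."""
      toks = [mapping[m] for m in moves]
      return list(zip(toks[:-1], moves[1:]))

  game = ["a1", "a2", "b1"]
  print(lang_specific_targets(game, MAP_L1))  # [('a1_L1', 'a2_L1'), ('a2_L1', 'b1_L1')]
  print(unified_targets(game, MAP_L1))        # [('a1_L1', 'a2'), ('a2_L1', 'b1')]

Under the unified output space, the output head is shared across languages, so supervision from any language updates the same prediction targets.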


Measuring Representation Alignment and Cross-lingual Transfer


To measure to what extent the hidden representations of semantically similar tokens across languages align with one another, we propose cross-lingual alignment probes: a probe Psrc is trained to recover board states from input sequences in a source language Lsrc, and is then applied, zero-shot, to recover board states from input sequences in another language Ltgt. If the cross-lingual alignment probe reconstructs board states in the other language accurately, this reflects that there is a shared latent space for languages Lsrc and Ltgt.
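As a rough illustration of how such a probe could be evaluated, the sketch below fits one probe per board square on hidden states extracted from source-language sequences and then scores it, zero-shot, on hidden states from the same games rendered in the target language. The use of scikit-learn linear probes and the array names (H_src, H_tgt, Y) are our own simplifying assumptions, not necessarily the probe architecture used in the paper.

  # Sketch of a cross-lingual alignment probe (simplified; linear probes are an
  # assumption, not necessarily the paper's probe architecture).
  # H_src: [N, d] hidden states for sequences in the source language Lsrc
  # H_tgt: [N, d] hidden states for the same games rendered in language Ltgt
  # Y:     [N, 64] board state per example (e.g., 0 empty, 1 black, 2 white)
  import numpy as np
  from sklearn.linear_model import LogisticRegression

  def train_alignment_probe(H_src, Y):
      """Fit one probe per board square on source-language hidden states."""
      return [LogisticRegression(max_iter=1000).fit(H_src, Y[:, sq])
              for sq in range(Y.shape[1])]

  def zero_shot_probe_accuracy(probes, H_tgt, Y):
      """Apply the source-trained probes to target-language states, zero-shot."""
      return float(np.mean([p.score(H_tgt, Y[:, sq]) for sq, p in enumerate(probes)]))

  # Usage with pre-extracted arrays (hypothetical):
  # probes = train_alignment_probe(H_src, Y)
  # print(zero_shot_probe_accuracy(probes, H_tgt, Y))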

Cross-lingual transfer ability is the ability to improve task performance in a target language when the model is finetuned exclusively on labeled data for the same task in a different (source) language. To measure the cross-lingual transfer ability of mOthelloGPTs, we first pretrain them on a prefix-filtered subset of the mOthello corpus translated into M languages; we then finetune the pretrained model on a non-prefix-filtered subset of the entire mOthello corpus, but in only one of the languages. We save checkpoints during the finetuning process and measure alignment and performance for each checkpoint, where performance is the top-1 accuracy of legal-move prediction in each language. If a model achieves better performance in a target language when finetuned solely on the source language, this reflects that the model has good cross-lingual transfer ability.
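The transfer metric itself is simple to state in code. Below is a sketch of top-1 legal-move accuracy, assuming an atomic language (one token per move) and hypothetical helpers: model(input_ids) returning next-token logits of shape [1, T, vocab], tokenize mapping a move sequence to token ids, and legal_next_moves returning the legal moves for a game prefix.

  import torch

  @torch.no_grad()
  def top1_legal_move_accuracy(model, games, tokenize, legal_next_moves):
      """Fraction of positions where the argmax next token is a legal move.
      `model`, `tokenize`, and `legal_next_moves` are hypothetical helpers."""
      correct, total = 0, 0
      for moves in games:
          for t in range(1, len(moves)):
              input_ids = torch.tensor([tokenize(moves[:t])])
              logits = model(input_ids)[0, -1]  # assume model returns [1, T, vocab] logits
              legal_ids = {tokenize([m])[0] for m in legal_next_moves(moves[:t])}
              correct += int(logits.argmax().item() in legal_ids)
              total += 1
      return correct / total

Tracking this accuracy in both the source and target languages across finetuning checkpoints yields the transfer curves discussed in the results below.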

Results

1. Models Trained with Naive Multilingual Pretraining Fail to Learn a Language-neutral Representation, But the Introduction of Anchor Tokens Helps
The first column of the table above shows the pairwise cross-lingual alignment probe accuracy in mOthelloGPTs trained on different language pairs using the naive training approach (with no anchor tokens). We observe a lack of strong alignment in the representations across all pairs of languages, implying that naive bilingual pretraining without any inductive biases may not yield representation alignment across languages.

Multilingual Othello allows us to introduce anchor tokens, i.e., tokens shared across languages. We observe that as the number of anchor tokens shared between two languages increases, the alignment of representations improves. More specifically, with 4 shared anchor tokens, the representations already reach nearly perfect alignment for all three language-pair types. This suggests that the introduction of anchor tokens can help induce the learning of language-neutral representations.

2. Learning Language-neutral Representations Is Not Sufficient for Cross-lingual Transfer
Next, we study whether aligned cross-lingual representations lead to cross-lingual transfer ability for mOthelloGPTs. The first and second columns in the figure above present cross-lingual transfer results of mOthelloGPTs trained with or without anchor tokens. We observe that:
  1. when cross-lingual representations do not align well, an mOthelloGPT finetuned on one language does not improve in another language, which means the model lacks cross-lingual transfer ability; and
  2. even when cross-lingual representation alignment is high for an mOthelloGPT, cross-lingual transfer still does not occur.
This finding goes against the common belief that cross-lingual representation alignment is a sufficient condition for the emergence of cross-lingual transfer ability in multilingual models.

3. Multilingual Pretraining with Unified Output Space Brings Both Representation Alignment and Cross-lingual Transfer
The third column in the figure above shows representation alignment and cross-lingual transfer results under multilingual pretraining with a unified output space. We observe that pretraining with a unified output space brings mOthelloGPTs not only cross-lingual alignment, but also cross-lingual transfer ability. Specifically, for the mOthelloGPT pretrained on Atomic language pairs, the cross-lingual alignment probe accuracy remains at around 90%, indicating that the FT source language and the target language are well aligned. Moreover, despite not encountering any sequences from the target language during finetuning, this mOthelloGPT still improves its next-legal-move prediction in the target language to the same extent as in the FT source language, indicating that it achieves cross-lingual transfer under the unified output space approach. The cross-lingual transfer ability of mOthelloGPTs trained with Split or Compositional language pairs is slightly weaker, but the pattern that finetuning on the FT source language benefits next-move prediction in the target language still holds, especially in the early finetuning phase.

The improvement in target-language performance across the three structurally different language pairs implies that multilingual pretraining with a unified output space is an effective approach for inducing cross-lingual alignment and cross-lingual transfer ability, and that it is robust to structural differences across languages.

4. Multilingual Training with More Than Two Languages
Here, we explore whether our findings hold for multilingual models that are trained with more than two languages. The above figure shows the cross-lingual representation alignment and cross-lingual transfer performance of mOthelloGPTs trained with 4 languages consisting of different language types. We find that the results are consistent with our findings on bilingual mOthelloGPTs:
  1. While anchor tokens improve representation alignment across languages, they do not help the model achieve cross-lingual transfer ability;
  2. With the introduction of the unified output token space during multilingual pretraining, both cross-lingual representation alignment and cross-lingual transfer are achieved.
This result suggests that the unified output space approach also generalizes to scenarios when a multilingual model is trained on more than two languages.


Paper and BibTex

Tianze Hua*, Tian Yun*, Ellie Pavlick.
mOthello: When Do Cross-Lingual Representation Alignment and Cross-Lingual Transfer Emerge in Multilingual Models?
In NAACL 2024 (Findings).




  @inproceedings{hua2024mothello,
	title={mOthello: When Do Cross-Lingual Representation Alignment and Cross-Lingual Transfer Emerge in Multilingual Models?},
	author={Tianze Hua and Tian Yun and Ellie Pavlick},
	booktitle={Findings of the Association for Computational Linguistics: NAACL 2024},
	year={2024}
  }


Acknowledgements

We would like to thank the anonymous reviewers for their detailed and constructive comments. Our implementation of GPT2 is based on minGPT, and the overall codebase is mostly built on Othello World. Many thanks to Andrej Karpathy and Kenneth Li for open-sourcing their projects!

This template was originally made by Phillip Isola and Richard Zhang for a colorful ECCV project; the code can be found here.