Project Details
Machine Translation of German without Parallel Corpora
Applicant
Professor Dr. Alexander Fraser
Subject Area
Image and Language Processing, Computer Graphics and Visualisation, Human Computer Interaction, Ubiquitous and Wearable Computing
Term
from 2020 to 2024
Project identifier
Deutsche Forschungsgemeinschaft (DFG) - Project number 433312382
Data-driven approaches are the state of the art in machine translation; they are widely pursued in the academic community and extensively deployed in industry. Data-driven approaches such as phrase-based statistical machine translation and neural machine translation rely strongly on available parallel corpora, e.g., documents and their translations, from which parallel sentences are extracted. Learning a translation system is carried out as a supervised learning problem in which a classifier is automatically learned that, given a source language sentence as input, produces a target language sentence as output. The availability of sufficient parallel text is the main bottleneck.
We will create neural machine translation systems without the use of parallel corpora. The key to our approach is to use mappings between word embedding spaces. Monolingual word embeddings are widely used in modern natural language processing research. Given automatically learned monolingual word embeddings, we can learn such a mapping either from a short list of word translation pairs or by aligning the two embedding spaces in an unsupervised fashion. Using such a mapping, we will iteratively create better and better pseudo-parallel corpora which are useful for translation from L1 to L2 and from L2 to L1.
There has been very limited work on this so far, and the work has been disappointing in four major ways. (i) Most of the previous work has focused on initial bilingual lexicons implemented using bilingual word embeddings, which are limited to 1-to-1 translation, while the two works trying to overcome this have predefined the multi-word units of interest rather than learning them during training. (ii) Previous models try to ensure that the system is robust to new types of input by creating data through a simple process called "noisification" and training an autoencoder, but this has important limitations which we will overcome.
(iii) Previous attempts to use phrase-based statistical machine translation have suffered from the overly simplistic use of just a few feature functions rather than the full phrase-based SMT model. (iv) Finally, the previous neural models used to implement the final model are simplistic and fail to model adequacy, fluency and coverage using separate losses.
We will produce systems that overcome these problems and whose final translation quality is better than the previous unsupervised state of the art.
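The mapping between word embedding spaces described above can be illustrated with a minimal sketch. It assumes the mapping is constrained to be orthogonal and is learned from a short seed list of translation pairs by solving the orthogonal Procrustes problem, which is one common choice and not necessarily the method used in the project. The function names, the toy dimensions and the synthetic embeddings (the "target" space is simply a rotated copy of the "source" space) are all hypothetical.

```python
import numpy as np


def learn_orthogonal_map(src_seed, tgt_seed):
    """Return the orthogonal W minimising ||src_seed @ W - tgt_seed||_F.

    Row i of src_seed and row i of tgt_seed are the embeddings of one
    seed translation pair. This is the closed-form Procrustes solution.
    """
    u, _, vt = np.linalg.svd(src_seed.T @ tgt_seed)
    return u @ vt


def induce_lexicon(src_emb, tgt_emb, w):
    """For each source word, return the index of the nearest target word
    (by cosine similarity) after mapping the source embeddings with w."""
    mapped = src_emb @ w
    mapped /= np.linalg.norm(mapped, axis=1, keepdims=True)
    tgt = tgt_emb / np.linalg.norm(tgt_emb, axis=1, keepdims=True)
    return (mapped @ tgt.T).argmax(axis=1)


# Synthetic toy data (an assumption for illustration only):
# the target space is an exact rotation of the source space.
rng = np.random.default_rng(0)
dim, vocab, n_seed = 5, 50, 20
src_emb = rng.normal(size=(vocab, dim))
true_rotation, _ = np.linalg.qr(rng.normal(size=(dim, dim)))
tgt_emb = src_emb @ true_rotation

# Learn the mapping from the first n_seed word pairs only,
# then induce translations for the whole vocabulary.
seed = np.arange(n_seed)
w = learn_orthogonal_map(src_emb[seed], tgt_emb[seed])
lexicon = induce_lexicon(src_emb, tgt_emb, w)
```

In this idealised setting the seed pairs are enough to recover the rotation, so every source word is mapped to its own counterpart. Real monolingual embedding spaces are only approximately related by such a mapping, which is why the induced lexicon would serve as a starting point for the iteratively improved pseudo-parallel corpora rather than as a finished dictionary.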
DFG Programme
Research Grants
Co-Investigator
Privatdozent Dr. Helmut Schmid