Statistical methods for machine translation of written language
Final Report Abstract
The research project "Statistical Methods for Translation of Written Language" at the Lehrstuhl für Informatik 6, RWTH Aachen University, focused on the development of new statistical models for machine translation (MT) of written text, with an emphasis on introducing structural information such as syntax and morphology. Statistical machine translation has improved dramatically in recent years since the introduction of phrase-based translation models. Lehrstuhl für Informatik 6 was one of the first research groups to develop a competitive phrase-based statistical MT system. In this project, this baseline system was improved substantially, so that high-quality translations can now be generated even for tasks with large vocabularies. The main achievements of the project can be summarized as follows:
• To automatically extract source language phrases and their target translation candidates, a word alignment has to be determined for each source-target sentence pair observed in training. The existing word alignment techniques were improved by considering word context in the lexicon model of the iterative alignment training procedure, as well as through interpolation of lexicon models from the source-to-target and target-to-source translation directions. In addition, morphological information was introduced in the form of a hierarchical lexicon, which helped in word disambiguation and reduced the data sparseness problem.
• Different translation models were implemented, compared, and combined, including the phrase-based model with probabilities estimated using relative frequencies, a language model based on bilingual tuples, and a context-dependent bilingual lexicon. We also implemented a parsing-like search for translation with hierarchical phrases. The phrase-based, word-based and other statistical models were combined in a log-linear translation model framework, so that each model was assigned a weight reflecting its predictive power. Novel features such as sentence-type-specific language model scores and phrase count features were added to this model. This approach resulted in notable improvements in translation quality.
• Significantly better translation output was also obtained by introducing phrase orientation probabilities and reordering rules learned automatically from the word alignment. Depending on the particular language pair, the reordering was performed using specific (e.g. local) constraints on word or phrase movements, or rules based on syntactic chunks or part-of-speech tags. In the particular case of German<->English translation, manually derived rules for reordering of verb groups were shown to be especially useful. The developed hierarchical phrase-based MT system provided the basis for future improvements in the syntactic structure of the produced translations using explicit syntactic concepts.
• Linguistic and other structural information was introduced at all stages of building a statistical MT system. Morphological information was used to improve word alignment and the generation of correct word forms for translation into a morphologically rich language like German. Shallow syntactic features were used to enhance the modeling of the target language. Part-of-speech information and syntactic chunks were used to learn reordering rules for the source language so that long-distance dependencies could be considered.
As the main challenge for future work, we see the task of finding better techniques for phrase extraction and better estimation of phrase translation probabilities.
The statistical MT system developed at Lehrstuhl für Informatik 6 participated in a number of important international evaluations, with both automatic and human scoring of translation results, including the NIST MT evaluation (http://www.nist.gov/speech/tests/mt/2006/doc/mt06eval_official_results.html). The algorithmic and modeling improvements achieved during this project contributed significantly to the top rankings that the MT system of Lehrstuhl für Informatik 6 has consistently obtained in these evaluations.
The MT system of Lehrstuhl für Informatik 6 was featured in the article "Selbst Arabisch lernt der PC (fast) ganz alleine", Aachener Zeitung, 29.12.2006. http://www.aachener-zeitung.de/sixcms/detail.php?template=az_detail&_wo=Suche:Onlinearchiv&id=107563
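The log-linear combination of models mentioned in the abstract can be illustrated with a small sketch. It is not the project's implementation: the feature names, weights, and candidate scores below are hypothetical and chosen only to show how a weighted sum of model scores ranks translation candidates.

```python
import math

# Hypothetical feature names and weights, chosen only for illustration.
WEIGHTS = {"phrase_tm": 1.0, "language_model": 0.5, "phrase_count": -0.1}


def loglinear_score(features, weights):
    """Return the weighted sum of feature values for one hypothesis."""
    return sum(weights[name] * value for name, value in features.items())


def best_hypothesis(hypotheses, weights):
    """Pick the hypothesis with the highest log-linear score."""
    return max(hypotheses, key=lambda h: loglinear_score(h["features"], weights))


def posteriors(hypotheses, weights):
    """Normalize exponentiated scores over the given hypothesis list."""
    scores = [loglinear_score(h["features"], weights) for h in hypotheses]
    top = max(scores)  # subtract the maximum for numerical stability
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


# Made-up candidates: log-probabilities for the two models, a raw phrase count.
HYPOTHESES = [
    {"text": "candidate A",
     "features": {"phrase_tm": -2.0, "language_model": -6.0, "phrase_count": 3}},
    {"text": "candidate B",
     "features": {"phrase_tm": -1.5, "language_model": -8.0, "phrase_count": 2}},
]

if __name__ == "__main__":
    print(best_hypothesis(HYPOTHESES, WEIGHTS)["text"])
    print(posteriors(HYPOTHESES, WEIGHTS))
```

In a real system such weights are usually tuned on held-out data; the abstract does not state which tuning procedure was used, so fixed values are assumed here.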
Publications
-
Integrated Chinese Word Segmentation in Statistical Machine Translation. In International Workshop on Spoken Language Translation (IWSLT), pp. 141-147, Pittsburgh, PA, USA, October 2005
J. Xu, E. Matusov, R. Zens, and H. Ney
-
Statistical Machine Translation of European Parliamentary Speeches. In Machine Translation Summit (MT Summit), pp. 259-266, Phuket, Thailand, September 2005
D. Vilar, E. Matusov, S. Hasan, R. Zens, and H. Ney
-
The RWTH Phrase-based Statistical Machine Translation System. In Proceedings of IWSLT, pp. 155-162, Pittsburgh, PA, October 2005
R. Zens, O. Bender, S. Hasan, S. Khadivi, E. Matusov, J. Xu, Y. Zhang, and H. Ney
-
Discriminative Reordering Models for Statistical Machine Translation. In Human Language Technology Conf. / North American Chapter of the Assoc. for Computational Linguistics Annual Meeting (HLT-NAACL), Workshop on Statistical Machine Translation, pp. 55-63, New York City, June 2006
R. Zens and H. Ney
-
POS-based Word Reorderings for Statistical Machine Translation. In International Conference on Language Resources and Evaluation (LREC), pp. 1278-1283, Genoa, Italy, May 2006
M. Popovic and H. Ney
-
Reranking Translation Hypotheses Using Structural Properties. In Proceedings of the EACL06 Workshop on Learning Structured Information in Natural Language Applications, pp. 41-48, Trento, Italy, April 2006
S. Hasan, O. Bender, and H. Ney
-
Statistical Machine Translation with a Small Amount of Training Data. In 5th LREC SALTMIL Workshop on Minority Languages, pp. 25-29, Genoa, Italy, May 2006
M. Popovic and H. Ney
-
Training a Statistical Machine Translation System Without GIZA++. In International Conference on Language Resources and Evaluation, pp. 715-720, Genoa, Italy, May 2006
A. Mauser, E. Matusov, and H. Ney
-
Analysis and System Combination of Phrase- and N-Gram-Based Statistical Machine Translation Systems. In Human Language Technologies 2007: The Conference of the North American Chapter of the Association for Computational Linguistics; Companion Volume, Short Papers, pages 137-140, Rochester, New York, April 2007
M. R. Costa-jussa, J. M. Crego, D. Vilar, J. A. R. Fonollosa, J. B. Marino, and H. Ney
-
Are Very Large N-Best Lists Useful for SMT? In Human Language Technology Conf. / North American Chapter of the Assoc. for Computational Linguistics Annual Meeting (HLT-NAACL SSST), pp. 57-60, Rochester, NY, April 2007
S. Hasan, R. Zens, and H. Ney
-
Can We Translate Letters? In Second Workshop on Statistical Machine Translation, pp. 33-39, Prague, Czech Republic, June 2007
D. Vilar, J. Peter, and H. Ney
-
Chunk-Level Reordering of Source Language Sentences with Automatically Learned Rules for Statistical Machine Translation. In Human Language Technology Conf. / North American Chapter of the Assoc. for Computational Linguistics Annual Meeting, Rochester, NY, 2007
Y. Zhang, R. Zens, and H. Ney
-
Improved Chunk-level Reordering for Statistical Machine Translation. In International Workshop on Spoken Language Translation, Trento, Italy, 2007
Y. Zhang, R. Zens, and H. Ney
-
Using a Bilingual Context in Word-Based Statistical Machine Translation. In International Workshop on Pattern Recognition in Information Systems (PRIS), pp. 144-153, Barcelona, Spain, June 2008
C. Schmidt, D. Vilar, and H. Ney