Grammar Formalisms Beyond Context-Free Grammars and Their Use in Statistical Machine Learning
Summary of Project Results
The BeyondCFG project addressed the question of how to deal with discontinuous constituents in parsing and machine translation, with a particular focus on approaches based on mildly context-sensitive grammar formalisms. We developed new models and algorithms for probabilistic constituency parsing and for statistical machine translation, using formalisms such as Linear Context-Free Rewriting Systems (LCFRS) and variants of Tree Adjoining Grammar (TAG), extensions of context-free grammars (CFG) that combine aspects of synchronous grammars with the capacity to describe discontinuities. The project developed new mildly context-sensitive (MCS) grammar formalisms, investigated their formal properties, and developed both symbolic and statistical parsers. The latter yield transparent, grammar-based characterizations of syntactic structure while achieving state-of-the-art parsing accuracy. The project also developed the first approach to grammar-less, transition-based parsing of discontinuous constituents. Linked to discontinuous constituency parsing, BeyondCFG also developed several methods for treebanking, combining approaches such as active learning with an intuitive annotation interface. Finally, BeyondCFG developed a grammar-based statistical machine translation system that allows for discontinuous constituents and complex types of alignment.
One topic that was not planned at the outset was morpho-syntactic processing of Arabic. Because Arabic constituency treebanks of sufficiently high quality were not available at the time, our focus moved from constituency parsing to morphology. Arabic is interesting in this respect since it displays discontinuous units in morphology. An additional complication was that many Arabic texts involve code switching between a dialect and Modern Standard Arabic. In the context of morpho-syntactic processing of Arabic, the project contributed important results to segmentation, language identification and POS tagging for Arabic NLP.
The project has produced several implementations, comprising several parsers, tools for processing discontinuous constituency trees, and tools for Arabic NLP, that are publicly available and still in use.
Project-Related Publications (Selection)
- Maier, W. 2015. Discontinuous Incremental Shift-reduce Parsing. In Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics and the 7th International Joint Conference on Natural Language Processing (Volume 1: Long Papers), 1202–1212. Beijing, China: Association for Computational Linguistics.
- Kaeshammer, M. 2015. Hierarchical Machine Translation With Discontinuous Phrases. In Proceedings of the Tenth Workshop on Statistical Machine Translation, 228–238. Lisbon, Portugal: Association for Computational Linguistics.
- Kallmeyer, L. 2015. On the Mild Context-Sensitivity of k-Tree Wrapping Grammar. In Proceedings of the 20th and 21st International Conferences on Formal Grammar - Volume 9804, 77–93. Berlin, Heidelberg: Springer-Verlag.
- van Cranenburgh, A., R. Scha & R. Bod. 2016. Data-oriented parsing with discontinuous constituents and function tags. Journal of Language Modelling 4(1). 57–111.
- Maier, W. & T. Lichte. 2016. Discontinuous parsing with continuous trees. In Proceedings of the Workshop on Discontinuous Structures in Natural Language Processing, 47–57. San Diego, California: Association for Computational Linguistics.
- Kallmeyer, L. & W. Maier. 2016. LR Parsing for LCFRS. Algorithms 9(3).
- Samih, Y., S. Maharjan, M. Attia, L. Kallmeyer & T. Solorio. 2016. Multilingual Code-switching Identification via LSTM Recurrent Neural Networks. In Proceedings of the Second Workshop on Computational Approaches to Code Switching, 50–59. Austin, Texas: Association for Computational Linguistics.
- Samih, Y., M. Eldesouki, M. Attia, K. Darwish, A. Abdelali, H. Mubarak & L. Kallmeyer. 2017. Learning from Relatives: Unified Dialectal Arabic Segmentation. In Proceedings of the 21st Conference on Computational Natural Language Learning (CoNLL 2017), 432–441. Vancouver, Canada: Association for Computational Linguistics.
- van Cranenburgh, A. 2018. Active DOP: A constituency treebank annotation tool with online learning. In Proceedings of COLING system demonstrations, 38–42.
- Waszczuk, J., R. Ehren, R. Stodden & L. Kallmeyer. 2019. A Neural Graph-based Approach to Verbal MWE Identification. In Proceedings of the Joint Workshop on Multiword Expressions and WordNet (MWE-WN 2019), 114–124. Florence, Italy: Association for Computational Linguistics.
- Bladier, T., J. Waszczuk, L. Kallmeyer & J. Janke. 2019. From partial neural graph-based LTAG parsing towards full parsing. Computational Linguistics in the Netherlands Journal 9. 3–26.
- Bladier, T., J. Waszczuk & L. Kallmeyer. 2020. Statistical Parsing of Tree Wrapping Grammars. In Proceedings of the 28th International Conference on Computational Linguistics, 6759–6766. Barcelona, Spain (Online): International Committee on Computational Linguistics.