
Grammar formalisms beyond context-free grammars and their use in statistical machine learning

Subject area General and Comparative Linguistics, Experimental Linguistics, Typology, Non-European Languages
Funding Funded from 2010 to 2021
Project identifier Deutsche Forschungsgemeinschaft (DFG) - Project number 183821958
 
Year of creation 2023

Summary of project results

The project BeyondCFG addressed the question of how to deal with discontinuous constituents in parsing and machine translation. A particular focus was on approaches based on mildly context-sensitive grammar formalisms. We developed new models and algorithms for probabilistic constituency parsing and for statistical machine translation, using formalisms such as Linear Context-Free Rewriting Systems (LCFRS) and variants of Tree Adjoining Grammar (TAG), extensions of context-free grammars (CFG) that combine aspects of synchronous grammars with the capacity to describe discontinuities. The project developed new mildly context-sensitive (MCS) grammar formalisms, investigated their formal properties, and developed both symbolic and statistical parsers. The statistical parsers yield transparent, grammar-based characterizations of syntactic structure while achieving state-of-the-art parsing accuracy. The project also developed the first approach to grammar-less, transition-based parsing of discontinuous constituents. Linked to discontinuous constituency parsing, BeyondCFG developed several methods for treebanking, combining approaches such as active learning with an intuitive annotation interface. Finally, BeyondCFG developed a grammar-based statistical machine translation system that allows for discontinuous constituents and complex types of alignment. One topic that was not planned at the outset was morpho-syntactic processing of Arabic. Because Arabic constituency treebanks of sufficiently high quality were lacking at the time, our focus moved from constituency parsing to morphology. Arabic is interesting in this respect since it displays discontinuous units in morphology. An additional complication was that many Arabic texts involve code switching between a dialect and Modern Standard Arabic. In the context of morpho-syntactic processing of Arabic, the project contributed important results on segmentation, language identification, and POS tagging for Arabic NLP.
The project has produced several implementations, including several parsers, tools for processing discontinuous constituency trees, and tools for Arabic NLP, which are publicly available and still in use.

