Project Details

Grammar Formalisms beyond Context-Free Grammars and their use for Machine Learning Tasks

Subject Area General and Comparative Linguistics, Experimental Linguistics, Typology, Non-European Languages
Term from 2010 to 2021
Project identifier Deutsche Forschungsgemeinschaft (DFG) - Project number 183821958
 
Final Report Year 2023

Final Report Abstract

The project BeyondCFG addressed the question of how to deal with discontinuous constituents in parsing and machine translation. A particular focus was on approaches based on mildly context-sensitive grammar formalisms. We developed new models and algorithms for probabilistic constituency parsing and for statistical machine translation, using formalisms such as Linear Context-Free Rewriting Systems (LCFRS) and variants of Tree Adjoining Grammar (TAG), extensions of context-free grammars (CFG) that combine aspects of synchronous grammars with the capacity to describe discontinuities. The project developed new mildly context-sensitive (MCS) grammar formalisms, investigated their formal properties and developed both symbolic and statistical parsers. The latter yield transparent, grammar-based characterizations of syntactic structure while achieving state-of-the-art parsing accuracy. The project also developed the first approach to grammar-less, transition-based parsing of discontinuous constituents. Linked to discontinuous constituency parsing, BeyondCFG also developed several methods for treebanking, combining approaches such as active learning with an intuitive annotation interface. Finally, BeyondCFG developed a grammar-based statistical machine translation system that allows for discontinuous constituents and complex types of alignment. One topic that was not planned at the outset was morpho-syntactic processing of Arabic. Due to the lack of Arabic constituency treebanks of sufficiently high quality at the time, our focus moved from constituency parsing to morphology. Arabic is interesting in this respect since it displays discontinuous units in morphology. An additional complication was that many Arabic texts involve code switching between a dialect and Modern Standard Arabic. In the context of morpho-syntactic processing of Arabic, the project contributed important results to segmentation, language identification and POS tagging for Arabic NLP.
The project has produced several publicly available implementations that are still in use: several parsers, tools for processing discontinuous constituency trees, and tools for Arabic NLP.

Publications

  • 2015. Discontinuous Incremental Shift-reduce Parsing. In Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics and the 7th International Joint Conference on Natural Language Processing (Volume 1: Long Papers), 1202–1212. Beijing, China: Association for Computational Linguistics
    Maier, W.
    (See online at https://doi.org/10.3115/v1/P15-1116)
  • 2015. Hierarchical Machine Translation With Discontinuous Phrases. In Proceedings of the Tenth Workshop on Statistical Machine Translation, 228–238. Lisbon, Portugal: Association for Computational Linguistics
    Kaeshammer, M.
    (See online at https://doi.org/10.18653/v1/W15-3028)
  • 2015. On the Mild Context-Sensitivity of k-Tree Wrapping Grammar. In Proceedings of the 20th and 21st International Conferences on Formal Grammar - Volume 9804, 77–93. Berlin, Heidelberg: Springer-Verlag
    Kallmeyer, L.
    (See online at https://doi.org/10.1007/978-3-662-53042-9_5)
  • 2016. Data-oriented parsing with discontinuous constituents and function tags. Journal of Language Modelling 4(1). 57–111
    van Cranenburgh, A., R. Scha & R. Bod
    (See online at https://doi.org/10.15398/jlm.v4i1.100)
  • 2016. Discontinuous parsing with continuous trees. In Proceedings of the Workshop on Discontinuous Structures in Natural Language Processing, 47–57. San Diego, California: Association for Computational Linguistics
    Maier, W. & T. Lichte
    (See online at https://doi.org/10.18653/v1/W16-0906)
  • 2016. LR Parsing for LCFRS. Algorithms 9(3)
    Kallmeyer, L. & W. Maier
    (See online at https://doi.org/10.3390/a9030058)
  • 2016. Multilingual Code-switching Identification via LSTM Recurrent Neural Networks. In Proceedings of the Second Workshop on Computational Approaches to Code Switching, 50–59. Austin, Texas: Association for Computational Linguistics
    Samih, Y., S. Maharjan, M. Attia, L. Kallmeyer & T. Solorio
    (See online at https://doi.org/10.18653/v1/W16-5806)
  • 2017. Learning from Relatives: Unified Dialectal Arabic Segmentation. In Proceedings of the 21st Conference on Computational Natural Language Learning (CoNLL 2017), 432–441. Vancouver, Canada: Association for Computational Linguistics
    Samih, Y., M. Eldesouki, M. Attia, K. Darwish, A. Abdelali, H. Mubarak & L. Kallmeyer
    (See online at https://doi.org/10.18653/v1/K17-1043)
  • 2018. Active DOP: A constituency treebank annotation tool with online learning. In Proceedings of COLING system demonstrations, 38–42
    van Cranenburgh, A.
  • 2019. A Neural Graph-based Approach to Verbal MWE Identification. In Proceedings of the Joint Workshop on Multiword Expressions and WordNet (MWE-WN 2019), 114–124. Florence, Italy: Association for Computational Linguistics
    Waszczuk, J., R. Ehren, R. Stodden & L. Kallmeyer
    (See online at https://doi.org/10.18653/v1/W19-5113)
  • 2019. From partial neural graph-based LTAG parsing towards full parsing. Computational Linguistics in the Netherlands Journal 9. 3–26
    Bladier, T., J. Waszczuk, L. Kallmeyer & J. Janke
  • 2020. Statistical Parsing of Tree Wrapping Grammars. In Proceedings of the 28th International Conference on Computational Linguistics, 6759–6766. Barcelona, Spain (Online): International Committee on Computational Linguistics
    Bladier, T., J. Waszczuk & L. Kallmeyer
    (See online at https://doi.org/10.18653/v1/2020.coling-main.595)
 
 
