Project Details

Efficient statistical parsing and decoding for expressive grammar formalisms based on tree automata

Subject Area: Image and Language Processing, Computer Graphics and Visualisation, Human Computer Interaction, Ubiquitous and Wearable Computing
General and Comparative Linguistics, Experimental Linguistics, Typology, Non-European Languages
Term: from 2014 to 2023
Project identifier: Deutsche Forschungsgemeinschaft (DFG) - Project number 252303250
 
The aim of this project is to develop efficient algorithms for expressive grammar formalisms. Such grammar formalisms describe string languages that are not context-free; languages of more complex objects, such as trees or graphs; and relations between such objects. They can thus handle linguistic representations, and capture linguistic generalizations, that probabilistic context-free grammars (PCFGs) cannot. This is useful for many emerging NLP tasks, such as semantic parsing of strings into graph-based semantic representations.

The key idea of the project is to encode a wide variety of expressive grammar formalisms as Interpreted Regular Tree Grammars (IRTGs), and to specify algorithms for IRTGs in general; they will then apply directly to all the more specific formalisms. In the first phase, we have made significant progress in widening the range of formalisms which can be captured by IRTGs, including grammars for graph languages and for languages of sets. We also improved the performance of IRTG parsing algorithms drastically: parsing for PCFGs encoded as IRTGs is now 1000x faster than before (and roughly on par with dedicated PCFG parsers), and our parser for graph grammars is over 1000x faster than the previously best dedicated graph parser. On a theoretical level, we have clarified the formal relationships between expressive grammar formalisms; and on a practical level, researchers working with such grammar formalisms can directly utilize our generic algorithms and their open-source implementation, Alto.

In the second phase, we want to scale Alto to datasets of realistic size and complexity on NLP tasks such as parsing, translation, and generation. Even with the theoretical and foundational advances of the first phase, a number of challenges became visible as we applied Alto to increasingly complex domains. These challenges are common to all grammar-based approaches, and include the induction of grammars from corpora in which grammatical information is only incompletely observable, as well as scaling the speed of our parsing and translation algorithms to real-world data. We will tackle these challenges generally, by developing new algorithms or adapting existing ones to IRTGs. We will complement this grammar-based perspective with neural methods for parsing, which we will combine with the specific perspective on language offered by IRTGs.

The overall outcome of the project will be an end-to-end toolchain in which a user only needs to specify an expressive grammar formalism in terms of IRTGs and provide some data, and can then directly use our algorithms and implementations to induce and train a statistical grammar and use it for efficient parsing and translation.
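
To make the IRTG idea more concrete, the following is a minimal illustrative sketch in Python. It is not Alto's API or file format; the toy grammar, the rule labels, and all names (`Tree`, `RULES`, `STRING_INTERP`, `SEMANTIC_INTERP`, `nonterminal_of`, `evaluate`) are assumptions made up for this example. It shows a regular tree grammar that licenses derivation trees, and two interpretations that map one and the same derivation tree to a string and to a simple predicate-argument tuple.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Tree:
    """A derivation tree node: a rule label plus its child subtrees."""
    label: str
    children: tuple = ()


# Hypothetical regular tree grammar: rule label -> (parent nonterminal, child nonterminals).
RULES = {
    "r_s": ("S", ("NP", "VP")),
    "r_vp": ("VP", ("V", "NP")),
    "r_john": ("NP", ()),
    "r_mary": ("NP", ()),
    "r_loves": ("V", ()),
}

# Interpretation 1 (assumed for illustration): rule label -> operation on strings.
STRING_INTERP = {
    "r_s": lambda np, vp: f"{np} {vp}",
    "r_vp": lambda v, np: f"{v} {np}",
    "r_john": lambda: "John",
    "r_mary": lambda: "Mary",
    "r_loves": lambda: "loves",
}

# Interpretation 2 (assumed for illustration): rule label -> operation on semantic tuples.
SEMANTIC_INTERP = {
    "r_s": lambda subj, vp: (vp[0], subj, vp[1]),  # (predicate, subject, object)
    "r_vp": lambda v, obj: (v, obj),               # (predicate, object)
    "r_john": lambda: "john",
    "r_mary": lambda: "mary",
    "r_loves": lambda: "love",
}


def nonterminal_of(tree):
    """Return the nonterminal that derives `tree`, or raise if no rule licenses it."""
    parent, expected = RULES[tree.label]
    actual = tuple(nonterminal_of(child) for child in tree.children)
    if actual != expected:
        raise ValueError(f"rule {tree.label} expects {expected}, got {actual}")
    return parent


def evaluate(tree, interp):
    """Evaluate a derivation tree bottom-up under one interpretation."""
    args = [evaluate(child, interp) for child in tree.children]
    return interp[tree.label](*args)


# One derivation tree, interpreted in two different ways.
derivation = Tree(
    "r_s",
    (Tree("r_john"), Tree("r_vp", (Tree("r_loves"), Tree("r_mary")))),
)
```

Calling `evaluate(derivation, STRING_INTERP)` and `evaluate(derivation, SEMANTIC_INTERP)` on the same tree yields a sentence and its semantic tuple, which is the sense in which such a grammar describes a relation between objects. Parsing runs in the opposite direction: given a string, find the derivation trees whose string interpretation evaluates to it. That step is where the efficiency work described above applies, and this sketch does not attempt it.
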
DFG Programme: Research Grants
International Connection: Australia
Cooperation Partner: Professor Dr. Mark Johnson
 
 
