Foundations of a corpus-based typology of clause linkage: an analytical framework and case studies on non-local dependencies
Summary of project results
Human languages vary in many ways, but that variation is not random. Linguistic typology deals with the patterns and limits of variation observed in languages all over the world. It aims to determine what properties a language is likely to have, given other properties, e.g. in its sound system (phonology), in the form and order of words (morphology, syntax), in the types of concepts that it encodes (lexicon), etc. This research program reaches back to the 19th century and has been pursued systematically since the pioneering work of Stanford linguist Joseph Greenberg in the 1960s. So far, linguistic typology has largely relied on data from grammatical descriptions and dictionaries, i.e., abstractions over linguistic data. The availability of computational resources has made it possible to carry out the program of linguistic typology on the basis of textual data, i.e., written texts as well as transcripts of spoken conversation and other types of communicative output. Using language data, rather than abstractions over such data, has the obvious advantage that it allows us to capture the probabilistic aspects of language use more precisely.
Collections of texts that are amenable to computational analysis are called ‘corpora’. The research program sketched above is therefore called ‘corpus-based typology’. Corpus-based typology can be done in one of two ways: we can use raw text as the primary data source, or we can use annotations. Annotations are classifications of raw text data. For example, words can be classified in terms of their lexical class as ‘nouns’, ‘verbs’, ‘adjectives’, etc. (‘part-of-speech tagging’). If textual data is annotated for parts of speech, we can determine the frequencies of specific patterns such as ‘noun-verb-noun’, ‘verb-noun-noun’, etc. In syntactic annotations, words are grouped into hierarchical tree structures, alternatively represented using bracketing, e.g. ‘[this [[very old] man]]’.
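The pattern frequencies mentioned above can be illustrated with a minimal sketch. It assumes a toy corpus of three hand-tagged English sentences; the data, the tag labels, and the function names `pos_ngrams` and `pattern_frequencies` are hypothetical and are not taken from the project's corpora or software.

```python
from collections import Counter

# Toy corpus of hand-tagged sentences (hypothetical data, for illustration only).
tagged_sentences = [
    [("dogs", "noun"), ("chase", "verb"), ("cats", "noun")],
    [("children", "noun"), ("like", "verb"), ("sweets", "noun")],
    [("give", "verb"), ("Mary", "noun"), ("books", "noun")],
]


def pos_ngrams(sentence, n=3):
    """Return all part-of-speech sequences of length n in one tagged sentence."""
    tags = [tag for _, tag in sentence]
    return ["-".join(tags[i:i + n]) for i in range(len(tags) - n + 1)]


def pattern_frequencies(sentences, n=3):
    """Count how often each part-of-speech pattern of length n occurs."""
    counts = Counter()
    for sentence in sentences:
        counts.update(pos_ngrams(sentence, n))
    return counts


frequencies = pattern_frequencies(tagged_sentences)
for pattern, count in frequencies.most_common():
    print(pattern, count)
```

On real corpora the tags would come from an annotation layer rather than from a hand-written list, but the counting step would look much the same.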
Semantic annotations contain information about the meaning of a word (or other syntactic unit). For example, ‘word sense disambiguation’ assigns meanings to polysemous words such as ‘hot’ (‘very warm’, ‘spicy’, etc.). When texts carry annotations at various levels of linguistic analysis (morphology, syntax, semantics, attitudinal meaning, etc.), we speak of ‘multi-level annotations’. Our approach to corpus-based typology relies on multi-level annotations. Given that this approach is entirely new, specifically in linguistic typology, a crucial prerequisite for our project was to develop both ways of annotating data (at a theoretical level) and software for creating such annotations. Both have been achieved.
The area of investigation we focused on was the combination of clauses into sentences. Languages vary greatly in the ways they link clauses to each other in order to express such meanings as simultaneity (‘x while y’), causality (‘x because y’), concession (‘x although y’), relativization (‘the x that y’), etc. Such combinations of clauses are established not only by specific lexical elements (such as the subordinators ‘while’, ‘because’, ‘although’, ‘that’, etc.), but also by many other properties of the clauses in question, e.g. their level of ‘finiteness’ (finite, infinitive, participial, etc.), tense/aspect configurations, the semantic relationships of linguistic elements within a sentence (scope), etc. In other words, inter-clausal semantic relations are a multi-faceted problem and, in terms of quantitative methodology, a multivariate one, which can only be approached on the basis of multi-level annotations and appropriate methods of statistical analysis. In our project we annotated considerable amounts of text from various languages in order to arrive at generalizations about the typology of clause linkage in a corpus-based way.
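To make the multivariate character of the problem concrete, the following sketch represents each clause combination as a record with several annotation levels and cross-tabulates two of them. The records, the field names, and the helper `cross_tabulate` are invented for illustration; they do not reproduce the project's annotation scheme or its data.

```python
from collections import Counter, defaultdict

# Hypothetical multi-level annotations of clause combinations (invented examples).
clause_links = [
    {"language": "English", "relation": "concession",
     "subordinator": "although", "finiteness": "finite"},
    {"language": "English", "relation": "simultaneity",
     "subordinator": "while", "finiteness": "finite"},
    {"language": "English", "relation": "simultaneity",
     "subordinator": None, "finiteness": "participial"},
    {"language": "Latin", "relation": "causality",
     "subordinator": "cum", "finiteness": "finite"},
    {"language": "Latin", "relation": "concession",
     "subordinator": "cum", "finiteness": "finite"},
]


def cross_tabulate(records, row_key, column_key):
    """Count co-occurrences of the values of two annotation levels."""
    table = defaultdict(Counter)
    for record in records:
        table[record[row_key]][record[column_key]] += 1
    return table


relation_by_finiteness = cross_tabulate(clause_links, "relation", "finiteness")
for relation, counts in relation_by_finiteness.items():
    print(relation, dict(counts))
```

A contingency table of this kind is only a starting point: with more annotation levels and more data, it would feed into the multivariate statistical methods referred to above.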
We proceeded in both a data-driven and a hypothesis-driven way, thus arriving at a multitude of results on specific topics of clause linkage and related matters, e.g. concerning the clause-internal organization of linguistic operators. We investigated the factors establishing clause linkage in a data-driven way, identifying both universal tendencies and language-specific preferences (for instance, the participial constructions of English). Moreover, we carried out various hypothesis-driven case studies, e.g. on the highly underspecified Latin conjunction cum, on the ways in which concession is expressed in European languages, and on so-called ‘wh-cleft sentences’ (‘What I want to say is …’). The main contribution of our project, however, is methodological: it opens up new ways of investigating cross-linguistic correlations on the basis of corpora annotated at various levels of linguistic analysis.
Project-related publications (selection)
- Gast, V. & M. Schäfer (2012). Relative clauses with adverbial meaning: A quantitative investigation of hybrid adjunct clauses in Latin. In V. Gast & H. Diessel (eds.), Clause Linkage in Cross-Linguistic Perspective. Data-Driven Approaches to Cross-Clausal Syntax, 363–391. Berlin: de Gruyter Mouton
- Gast, V. & H. Diessel (2012). The typology of clause linkage: Status quo, challenges, prospects. In V. Gast & H. Diessel (eds.), Clause Linkage in Cross-Linguistic Perspective. Data-Driven Approaches to Cross-Clausal Syntax, 1–38. Berlin: de Gruyter Mouton
- Gast, V. & J. van der Auwera (2013). Scalar additive operators in Transeurasian languages: A comparison with Europe. In M. Robbeets & H. Cuyckens (eds.), Shared Grammaticalization. With Special Focus on the Transeurasian Languages, 113–145. Amsterdam: Benjamins
- Druskat, S., L. Bierkandt, V. Gast, C. Rzymski & F. Zipser (2014). Atomic: An open-source software platform for multi-level corpus annotation. In J. Ruppert & G. Faaß (eds.), Proceedings of the 12th Konferenz zur Verarbeitung natürlicher Sprache (KONVENS 2014), October 2014, 228–234
- Gast, V. & N. Levshina (2014). Motivating w(h)-clefts in English and German: A hypothesis-driven parallel corpus study. In A.-M. de Cesare (ed.), Frequency, Forms and Functions of Cleft Constructions in Romance and Germanic. Contrastive, Corpus-Based Studies, 377–414. Berlin: de Gruyter Mouton
- Gast, V., L. Bierkandt & C. Rzymski (2015). Annotating modals with GraphAnno, a configurable lightweight tool for multi-level annotation. In M. Nissim & P. Pietrandrea (eds.), Proceedings of the Workshop on Models for Modality Annotation, 19–28. Stroudsburg, PA: Association for Computational Linguistics (ACL)
- Gast, V., L. Bierkandt & C. Rzymski (2015). Creating and retrieving tense and aspect annotations with GraphAnno, a lightweight tool for multi-level annotation. In H. Bunt (ed.), Proceedings of the 11th Joint ACL-ISO Workshop on Interoperable Annotation, 23–28. Tilburg: Tilburg Center for Cognition and Communication
- Gast, V. & C. Rzymski (2015). Towards a corpus-based analysis of evaluative scales associated with even. Linguistik Online 71 (available online at https://doi.org/10.13092/lo.71.1782)
- Druskat, S., V. Gast, T. Krause & F. Zipser (2016). corpus-tools.org: An interoperable generic software tool set for multi-layer linguistic corpora. In Nicoletta Calzolari et al. (eds.), Proceedings of the 10th International Conference on Language Resources and Evaluation. European Language Resources Association
- Gast, V., L. Bierkandt, S. Druskat & C. Rzymski (2016). Enriching TimeBank: Towards a more precise annotation of temporal relations in a text. In Nicoletta Calzolari et al. (eds.), Proceedings of the Tenth International Conference on Language Resources and Evaluation (LREC). European Language Resources Association (available online at https://dx.doi.org/10.5281/zenodo.53772)