Segmentation of oral corpora
General and Comparative Linguistics, Experimental Linguistics, Typology, Non-European Languages
Final Report Abstract
On the practical question of corpus construction, the project demonstrated empirically that, for larger, diverse oral corpora, a segmentation based primarily on syntax is superior to approaches based on prosody or pragmatics, because it makes the segmentation process more robust, more reliable and more efficient. The syntactic segmentation process was then operationalized in two steps: first through guidelines for a detailed annotation of the categories underlying syntactic segmentation, then through simpler guidelines that focus on the salient characteristics for determining syntactic segment boundaries and classifying the resulting segments into major categories. Both sets of guidelines can be implemented with widely used annotation tools from the EXMARaLDA family, and a transfer to other annotation environments would be feasible without complex adaptations. The project has thus created the envisaged solid basis for future manual segmentation and annotation of oral corpora. At the IDS, this manual annotation process will be integrated into the workflow of the FOLK project, improving the usability and extending the application domains of this reference corpus.
The outcomes of the experiments on automatic segmentation exceeded our initial expectations. Fully automatic segmentation turned out to be possible at a quality which, while not matching that of manual annotation, can be sufficient for some analysis and processing purposes, or can at least serve as a solid basis for considerably reducing the effort of manual segmentation. The automatic segmentation process has already been integrated into the workflow for the FOLK corpus and will be applied and maintained by the FOLK project in the future. It will also form the basis of future efforts to enrich FOLK with detailed automatic annotation of syntactic dependency relations, adding a further dimension to FOLK's corpus-linguistic exploitability.
With its results, the project has created the desired basis for further developing and analysing oral corpora, especially the FOLK corpus. The segmentation methods developed in SegCor will serve as a new reference point for improving the visualisation and query mechanisms for the corpus, and they will be the basis for future work on syntactic parsing of oral language.
Concerning the cross-language aspect, SegCor has established or further developed a number of methodological instruments for dealing with the segmentation of oral corpora in French and German. Besides the comparable corpus samples, these include a harmonization of corpus formats, common approaches to measuring inter-annotator agreement and a cross-language inventory of segmentation problems. Given the language-specific nature of syntax theories, it was unavoidable that the developed guidelines and annotated data remain language-specific up to a point. The project has nevertheless contributed to a better understanding of methodological issues in segmentation across the two languages.
Possible follow-up research could further investigate the interplay of syntactic segments with other principles of structuring talk-in-interaction. Findings from the project indicate that prosodic units ("intonation phrases") and units based on interactional structure (turn-constructional units, TCUs, or Actions) will, as a general rule, either coincide with syntactic units or be fully subordinate to them: a syntactic unit can be composed of one or several such units, but these units will usually not cross major syntactic boundaries. If this could be further confirmed, corresponding segmentation schemes could be developed as refinements of the syntactic segmentation on additional levels.
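The relation between syntactic segments and other units described above can be made concrete with a small check. The following Python sketch is only an illustration under assumptions not taken from the project: segments are represented as hypothetical (start, end) time intervals in seconds, the syntactic segments are assumed to cover the transcript without gaps, and the function name classify_unit is invented for this example. It does not reproduce the project's own tooling or data format.

```python
from typing import List, Tuple

# Hypothetical representation: a segment is a (start, end) pair in seconds.
Interval = Tuple[float, float]


def classify_unit(unit: Interval, syntactic: List[Interval]) -> str:
    """Relate one prosodic or interactional unit to the syntactic segments.

    Returns "coincides" if the unit has exactly the span of a syntactic
    segment, "nested" if it lies inside one syntactic segment, and
    "crosses" otherwise. The last label assumes that the syntactic
    segments cover the transcript without gaps.
    """
    start, end = unit
    if unit in syntactic:
        return "coincides"
    if any(s_start <= start and end <= s_end for s_start, s_end in syntactic):
        return "nested"
    return "crosses"


# Invented example data, not taken from any corpus.
syntactic_segments = [(0.0, 2.0), (2.0, 5.0)]
intonation_phrases = [(0.0, 2.0), (2.0, 3.5), (1.5, 2.5)]

labels = [classify_unit(u, syntactic_segments) for u in intonation_phrases]
print(labels)
```

If the finding holds, units labelled "crosses" should be rare in annotated data, and counting them per recording would be one simple way to test it.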
Publications
- Schmidt, Thomas; Westpfahl, Swantje (2018): A Study on Gaps and Syntactic Boundaries in Spoken Interaction. KONVENS 2018. Vienna, Austria, 19-21 September 2018.
- Westpfahl, Swantje; Gorisch, Jan (2018): A Syntax-Based Scheme for the Annotation and Segmentation of German Spoken Language Interactions. In: Proceedings of the Joint Workshop on Linguistic Annotation, Multiword Expressions and Constructions (LAW-MWE-CxG-2018), pp. 109-120. Workshop at COLING 2018. Santa Fe, New Mexico, 25-26 August 2018.
- Westpfahl, Swantje; Proske, Nadine; Hobich, Melanie; Borlinghaus, Anton; Strub, Hanna (2018): Syntactic Annotation and Segmentation in the SegCor project Version 1.0. Working paper. Mannheim: Institut für Deutsche Sprache.
- Ruppenhofer, Josef; Rehbein, Ines (2019): Detecting the boundaries of sentence-like units on spoken German. In: Preliminary proceedings of the 15th Conference on Natural Language Processing (KONVENS 2019), October 9-11, 2019. München [et al.]: German Society for Computational Linguistics & Language Technology and Friedrich-Alexander-Universität Erlangen-Nürnberg, 2019, pp. 130-139.
- Westpfahl, Swantje; Schmidt, Thomas; Borlinghaus, Anton; Strub, Hanna (2019): Guideline. Syntaktische Segmentierung in FOLKER. Mannheim: Leibniz-Institut für Deutsche Sprache (IDS).
- Rehbein, Ines; Ruppenhofer, Josef; Schmidt, Thomas (2020): Improving Sentence Boundary Detection for Spoken Language Transcripts. Proceedings of the Language Resource and Evaluation Conference (LREC) 2020: Marseille.