Learning Tonal Representations of Music Signals Using Deep Neural Networks
Summary of Project Results
With the growing impact of technology, musicological research is undergoing a fundamental transformation. Digitized data and specialized algorithms enable corpus analyses, which can be scaled up significantly when using music audio recordings. For the tonal analysis of such recordings (e.g., regarding pitches, chords, or keys), the central objective of this project was to use deep neural networks for learning tonal representations that are interpretable, robust, and invariant with respect to timbre, instrumentation, and acoustic conditions. For training and evaluating the networks, the project built on complex scenarios of classical music where scores and multiple recorded performances of a work are available.

During the project, we exploited these scenarios to train convolutional neural networks (CNNs) for extracting pitch-class (chroma) and multi-pitch representations in a supervised fashion, making use of synchronized score–audio pairs of classical music. We found that these tasks can be approached successfully with lightweight CNNs comprising only 30–50k parameters when exploiting musically motivated structures for the input representation and the network architecture. As a central application, we tested the representations as a basis for analyzing chords and local keys. These experiments showed that tonal representations relying on deep learning clearly improve upon their traditional counterparts and help to close the gap between audio- and score-based tonal analysis.

As an alternative to supervised training with aligned scores, we proposed a strategy for learning tonal representations of polyphonic music recordings that requires only weakly aligned score–audio pairs. To this end, we adapted a multi-label extension of the connectionist temporal classification loss (MCTC) and performed systematic experiments to analyze its behavior. We found that this strategy performs similarly to training with strongly aligned scores, works even for longer segments of up to one minute, and allows for scaling up training datasets more easily.

Beyond conventional CNNs, we explored advanced architectures such as U-nets, which are inspired by hierarchical musical structures, and proposed several extensions including self-attention components and multi-task strategies. In an in-depth evaluation, we compared variants of these architectures at different sizes and made several surprising observations. Most results depend substantially on randomization effects and on the choice of the training–test split, which calls into question claims of superiority (“state of the art”) for particular architectures that yield only small improvements. As a major contribution, we therefore proposed a more robust evaluation strategy and suggest conducting cross-dataset experiments to reliably measure progress in music analysis tasks.

In summary, we conclude that deep-learning strategies provide promising possibilities for extracting tonal information from music audio recordings. Even small convolutional architectures can be very effective when they exploit musically motivated structures. Further progress can be made with larger and more elaborate network types, which, however, introduces challenges for evaluation, since the improvements are often small compared to the variation across training runs and test sets. In this context, cross-version and cross-dataset scenarios can be employed to develop pitch representations that generalize across different instruments, performances, and acoustic conditions.
Such representations allow for developing robust tonal analysis methods, thus paving the way towards a new level of computational music research.
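To make the kind of lightweight, musically informed CNN mentioned above more concrete, the following is a minimal illustrative sketch in PyTorch for frame-wise pitch-class (chroma) estimation. The class name ChromaCNN, the input format (a log-frequency, CQT-like spectrogram with three bins per semitone), and all layer sizes are assumptions chosen for illustration, not the project's actual architecture.

# Illustrative sketch only (assumed setup, not the project's actual model): a lightweight CNN
# mapping a log-frequency spectrogram patch to frame-wise pitch-class (chroma) activations,
# as could be trained on pitch-class labels derived from scores aligned to the audio.
import torch
import torch.nn as nn

class ChromaCNN(nn.Module):
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            # Kernels spanning roughly an octave along the log-frequency axis are one way
            # to encode musically motivated (octave/harmonic) structure.
            nn.Conv2d(1, 16, kernel_size=(36, 5), padding=(18, 2)),
            nn.ReLU(),
            nn.MaxPool2d(kernel_size=(3, 1)),        # pool from 3 bins per semitone to semitone resolution
            nn.Conv2d(16, 32, kernel_size=(24, 3), padding=(12, 1)),
            nn.ReLU(),
            nn.Conv2d(32, 12, kernel_size=1),        # 12 output channels = 12 pitch classes
        )

    def forward(self, x):                 # x: (batch, 1, frequency_bins, n_frames), e.g., 216 bins
        y = self.net(x)                   # (batch, 12, reduced_bins, n_frames)
        y = y.mean(dim=2)                 # aggregate the remaining frequency axis
        return torch.sigmoid(y)           # multi-label chroma activations per frame

model = ChromaCNN()
print(sum(p.numel() for p in model.parameters()))   # about 40k parameters, i.e., within the 30-50k range
# Training would minimize a multi-label loss such as binary cross-entropy
# (torch.nn.BCELoss) against score-derived pitch-class targets per frame.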