Project Details

Combining Text Mining and Multivariate Time Series Modelling

Subject Area Statistics and Econometrics
Term from 2019 to 2023
Project identifier Deutsche Forschungsgemeinschaft (DFG) - Project number 426470111
 
Final Report Year 2024

Final Report Abstract

The project focuses on the joint modelling of text-based indicators and multivariate economic time series. The text-based indicators are obtained using the Latent Dirichlet Allocation (LDA) model, an unsupervised statistical model that uncovers latent topics in a text collection based on word co-occurrence statistics. Of specific interest are the application of the generated indicators in economics and the factors influencing the estimation procedure. On the application side, we use scientific publications from Germany and Poland to estimate time series of topic importance in both countries. We propose methods for cross-corpora comparison, so-called topic matching. We demonstrate that the proposed topic matching approach makes it possible to compare the topic-word distributions of two different LDA models and to identify suitable topic pairs across text corpora. This is useful when comparing topic trends in different countries or when analysing the emergence and evolution of topic trends over time for sub-samples of one corpus. Moreover, we combine LDA and vector autoregressive (VAR) modelling by including the extracted topic trend time series in VAR models for both countries. We find significant links between topics in the scientific literature and real developments in the corresponding economic indicators. These first insights into the link between economic science and economic reality in a cross-country comparison form the basis for further research. On the methodological side, which concerns the estimation procedure, we investigate the sensitivity of the LDA algorithm to different parameter settings. First, we suggest selecting the number of topics in each corpus based on the singular Bayesian information criterion (sBIC).
We further conduct a comprehensive Monte Carlo (MC) study to compare different metrics often used in the literature for topic number selection, which yields practical recommendations for LDA model selection in text-as-data applications. A further analysis focuses on the impact of text pre-processing on LDA model estimation. In particular, we focus on the consequences of removing infrequent terms, as these are believed not to contribute to the overall meaning of a text. The preliminary results of this analysis indicate that removing infrequent words can considerably reduce the dimensionality of the data corpus, and consequently the estimation time, without a major effect on the resulting topics.
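
To make the topic matching idea from the abstract concrete, the following is a minimal sketch and not the project's actual procedure. It assumes that each topic is given as a word-probability mapping, that similarity is measured by cosine similarity over the joint vocabulary, and that each topic of the first model is paired with its most similar topic in the second model if the similarity exceeds a threshold. The similarity measure, the threshold value, the function names and the toy topics are all illustrative assumptions.

```python
import math


def cosine_similarity(p, q):
    """Cosine similarity between two equally long numeric vectors."""
    dot = sum(a * b for a, b in zip(p, q))
    norm_p = math.sqrt(sum(a * a for a in p))
    norm_q = math.sqrt(sum(b * b for b in q))
    if norm_p == 0.0 or norm_q == 0.0:
        return 0.0
    return dot / (norm_p * norm_q)


def match_topics(topics_a, topics_b, threshold=0.5):
    """Pair each topic in topics_a with its most similar topic in topics_b.

    Each topic is a dict mapping words to probabilities. The two models may
    have different vocabularies, so every topic is first expanded to the
    joint vocabulary, with probability 0.0 for unseen words.
    Returns a list of (index_a, index_b, similarity) tuples.
    """
    vocab = sorted({word for topic in topics_a + topics_b for word in topic})
    vecs_a = [[t.get(w, 0.0) for w in vocab] for t in topics_a]
    vecs_b = [[t.get(w, 0.0) for w in vocab] for t in topics_b]

    pairs = []
    for i, vec_a in enumerate(vecs_a):
        sims = [cosine_similarity(vec_a, vec_b) for vec_b in vecs_b]
        j = max(range(len(sims)), key=sims.__getitem__)
        # Keep only pairs that are similar enough (assumed cut-off).
        if sims[j] >= threshold:
            pairs.append((i, j, sims[j]))
    return pairs


# Hypothetical toy topics standing in for two estimated LDA models.
topics_de = [
    {"inflation": 0.6, "prices": 0.3, "wages": 0.1},
    {"trade": 0.7, "exports": 0.3},
]
topics_pl = [
    {"exports": 0.5, "trade": 0.4, "tariffs": 0.1},
    {"inflation": 0.5, "prices": 0.4, "money": 0.1},
]

pairs = match_topics(topics_de, topics_pl)
for i, j, sim in pairs:
    print(f"topic {i} (corpus A) <-> topic {j} (corpus B), similarity {sim:.2f}")
```

This greedy pairing can assign several topics of the first model to the same topic of the second; a one-to-one assignment or a different distance between distributions would be equally plausible choices.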

Publications

 
 
