Combining Text Mining and Multivariate Time Series Modelling
Final Report Abstract
The project focuses on the joint modelling of text-based indicators and multivariate economic time series. The text-based indicators are obtained using the Latent Dirichlet Allocation (LDA) model. LDA is an unsupervised statistical model that uncovers latent topics in a text collection based on word co-occurrence statistics. Of specific interest are the application of the generated indicators in economics and the factors influencing the estimation procedure. On the one hand (application side), we use scientific publications from Germany and Poland to estimate time series of the importance of topics in both countries. We propose methods for cross-corpora comparisons, the so-called topic matching. We demonstrate that the proposed topic matching approach makes it possible to compare the topic-word distributions of two different LDA models and to identify suitable topic pairs across text corpora. This is useful when comparing topic trends in different countries or when analysing the emergence and evolution of topic trends over time for sub-samples of one corpus. Moreover, we combine LDA and vector autoregressive (VAR) modelling by including the extracted topic trend time series in VAR models for both countries. Significant links between topics in the scientific literature and real developments in the corresponding economic indicators are found. These first insights into the link between economic science and economic reality in a cross-country comparison form the basis for further research. On the other hand (methodological side, estimation procedure), we investigate the sensitivity of the LDA algorithm to different parameter settings. First, we suggest selecting the number of topics in each corpus based on the singular Bayesian information criterion (sBIC).
We further conduct a comprehensive Monte Carlo (MC) study to compare different metrics often used in the literature for topic number selection, which provides valuable practical recommendations for LDA model selection in text-as-data applications. A further analysis focuses on the impact of text pre-processing on LDA model estimation. In particular, we focus on the consequences of removing infrequent terms, as these are believed not to contribute to the overall meaning of a text. The preliminary results of this analysis suggest that removing infrequent words can considerably reduce the dimensionality of the data corpus, and consequently the estimation time, without a major effect on the resulting topics.
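The topic matching idea mentioned in the abstract, comparing the topic-word distributions of two LDA models and identifying suitable topic pairs, can be illustrated with a minimal Python sketch. The abstract does not state which similarity measure or pairing rule the project uses, so everything here is an assumption for illustration only: cosine similarity as the measure, a greedy one-to-one pairing, the hypothetical names `cosine_similarity` and `match_topics`, the `min_similarity` threshold, and the toy topic weights. The sketch also assumes that both corpora share a comparable vocabulary.

```python
import math


def cosine_similarity(p, q):
    """Cosine similarity between two topic-word weight vectors stored as dicts."""
    dot = sum(weight * q.get(term, 0.0) for term, weight in p.items())
    norm_p = math.sqrt(sum(w * w for w in p.values()))
    norm_q = math.sqrt(sum(w * w for w in q.values()))
    if norm_p == 0.0 or norm_q == 0.0:
        return 0.0
    return dot / (norm_p * norm_q)


def match_topics(topics_a, topics_b, min_similarity=0.0):
    """Greedily pair topics from two models, most similar pairs first.

    Assumption: a greedy one-to-one pairing; this is an illustrative rule,
    not necessarily the matching procedure proposed in the project.
    """
    scored = sorted(
        (
            (cosine_similarity(pa, pb), i, j)
            for i, pa in enumerate(topics_a)
            for j, pb in enumerate(topics_b)
        ),
        key=lambda item: (-item[0], item[1], item[2]),
    )
    used_a, used_b, pairs = set(), set(), []
    for similarity, i, j in scored:
        if similarity < min_similarity:
            break  # all remaining candidate pairs are even less similar
        if i in used_a or j in used_b:
            continue  # each topic may appear in at most one pair
        pairs.append((i, j, similarity))
        used_a.add(i)
        used_b.add(j)
    return pairs


# Toy topic-word distributions (hypothetical values, not project results).
corpus_a_topics = [
    {"inflation": 0.5, "prices": 0.3, "policy": 0.2},
    {"labour": 0.6, "wages": 0.3, "policy": 0.1},
]
corpus_b_topics = [
    {"wages": 0.5, "labour": 0.4, "unemployment": 0.1},
    {"prices": 0.4, "inflation": 0.4, "growth": 0.2},
]

pairs = match_topics(corpus_a_topics, corpus_b_topics, min_similarity=0.5)
for i, j, similarity in pairs:
    print(f"topic {i} (corpus A) <-> topic {j} (corpus B): {similarity:.3f}")
```

A greedy rule keeps the sketch short and dependency-free. An optimal one-to-one assignment or a different distance between distributions could be substituted without changing the overall structure.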
Publications
-
Choosing the Number of Topics in LDA Models - A Monte Carlo Comparison of Selection Criteria.
Bystrov, V., Naboka, V., Staszewska-Bystrova, A. & Winker, P.
-
Cross-Corpora Comparisons of Topics and Topic Trends. Jahrbücher für Nationalökonomie und Statistik, 242(4), 433-469.
Bystrov, Victor; Naboka, Viktoriia; Staszewska-Bystrova, Anna & Winker, Peter
-
Dataset for Cross-corpora comparisons of topics and topic trends. Version: 1. ZBW Journal Data Archive
Bystrov, V., Naboka, V., Staszewska-Bystrova, A. & Winker, P.
-
Visualizing Topic Uncertainty in Topic Modelling
Winker, P.
-
Comparing Links between Topic Trends and Economic Indicators in the German and Polish Academic Literature. Comparative Economic Research. Central and Eastern Europe, 27(2), 7-28.
Bystrov, Victor; Naboka‑Krell, Viktoriia; Staszewska‑Bystrova, Anna & Winker, Peter
-
Analysing the Impact of Removing Infrequent Terms on Topic Quality in Latent Dirichlet Allocation Models. Central European Journal of Economic Modelling and Econometrics, 61-85.
Bystrov, Victor; Naboka-Krell, Viktoriia; Staszewska-Bystrova, Anna & Winker, Peter
