Project Details

Data Quality of Textual, User-Generated Content

Subject Area: Operations Management and Computer Science for Business Administration
Term: since 2022
Project identifier: Deutsche Forschungsgemeinschaft (DFG) - Project number 494840328
 
As the world becomes increasingly digital, the amount and relevance of textual user-generated content (UGC) continue to grow, both for publicly available UGC such as online consumer reviews, wiki articles, and social media posts, and for UGC within companies. In this context, the data quality (DQ) of textual UGC is highly relevant, in particular the data value-oriented DQ dimensions such as the accuracy and currency of textual statements. Indeed, analyses of large amounts of textual UGC using contemporary machine learning (ML) methods (e.g., transformer models), and the resulting outcomes, are only valid and valuable if the quality of the underlying data is assured.
Existing methods for the assessment and improvement of the DQ of textual UGC suffer from critical limitations. In the previous project DQNGI, we recognized the importance and potential of events causing DQ defects when developing methods for the assessment and improvement of DQ. However, the innovative idea of an event-based DQ assessment and improvement needs to be substantially advanced. Moreover, contemporary ML methods operate under the assumption of high-quality data, which leads to lower performance and robustness on UGC with DQ defects. In DQNGI, we developed initial promising methods for the methodical processing of DQ-annotated input data in ML models. Yet this work needs to be substantially advanced and extended regarding both propagation and training.
In sum, the proposed renewal project DQUGC focuses on the following research questions:
1) How can events that cause DQ defects in textual UGC be conceptualized and identified in general to enable event-based approaches for the assessment and improvement of DQ?
2) How can DQ-annotated textual UGC be methodically processed in contemporary ML models (e.g., transformer models)?
To address these research questions, DQUGC comprises two subprojects, S1 and S2.
With respect to research methodology, DQUGC applies analytical, mathematical modeling as well as experimental evaluation based on real-world data.
S1 addresses the conceptualization and identification of events causing DQ defects in textual UGC and the event-based assessment and improvement of DQ. S1 results in a theoretical and systematic conceptualization of events causing DQ defects as well as new approaches for the automated identification of DQ-related events in textual UGC, including implementations and evaluations of these approaches. Moreover, approaches for an event-based assessment and improvement of DQ are designed, implemented, and evaluated.
S2 addresses the question of how textual UGC with DQ annotations can be methodically processed in contemporary ML models. S2 results in new approaches for ML models that process DQ-annotated data within propagation and training, as well as findings on the validity, reliability, (improved) performance, and robustness of the results of these approaches.
DFG Programme: Research Grants
 
 
