LOD Link Discovery - Learning-based, Scalable Link Discovery for the Web of Data
Summary of Project Results
The constant growth of the amount of information available on the Web calls for more intelligent processing of Web data. The Semantic Web is an extension of the Web in which knowledge is represented at Web scale in machine-readable form. The Linked Open Data Cloud is a compendium of knowledge bases (KBs) that relies on the knowledge representation mechanisms of the Semantic Web to make knowledge available. As on the Web of documents, the creation of KBs is a decentralized process. Answering complex information needs, however, requires gathering information across sources of knowledge. The goal of Link Discovery is to provide links across KBs with the aim of improving information processing based on these KBs. Two major challenges arise when addressing this problem. First, a naive implementation of this process leads to impractical runtimes on large KBs. Second, the discovery of accurate links is non-trivial and often demands applying machine learning techniques. The aim of this project was to tackle both challenges.
To address the scalability challenge, we employed massively parallel infrastructure such as graphics processing units (GPUs) and distributed computing solutions. GPUs significantly improved the runtime of algorithms that rely on numerical data (e.g., geospatial data) and were second to distributed solutions only when very large data sets were to be processed. By using distributed processing frameworks such as Apache Flink, we improved the performance of the holistic clustering so that it can process data sets with up to 10^7 resources. The results were compared with existing clustering schemes, e.g., correlation clustering and star clustering. To support a continuous integration of entities in the Web of Data, incremental clustering approaches for distributed processing were introduced. All proposed approaches were evaluated for multiple data domains to show effectiveness and scalability.
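The runtime problem of a naive implementation can be illustrated with a minimal sketch. It is not one of the project's algorithms: the function names, the label-based similarity, the 0.9 threshold, the first-character blocking key, and the example resources are all assumptions chosen for illustration. The naive variant compares every source resource with every target resource, whereas the blocked variant only compares resources that share a blocking key. Note that such crude blocking can miss links whose labels start with different characters.

```python
from collections import defaultdict
from difflib import SequenceMatcher


def similarity(a: str, b: str) -> float:
    """Case-insensitive string similarity in [0, 1] (stand-in for a link specification)."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def naive_links(source, target, threshold=0.9):
    """Compare all pairs: needs len(source) * len(target) comparisons."""
    links, comparisons = set(), 0
    for s_uri, s_label in source.items():
        for t_uri, t_label in target.items():
            comparisons += 1
            if similarity(s_label, t_label) >= threshold:
                links.add((s_uri, t_uri))
    return links, comparisons


def blocked_links(source, target, threshold=0.9):
    """Compare only pairs whose labels share the first character (hypothetical blocking key)."""
    blocks = defaultdict(list)
    for t_uri, t_label in target.items():
        blocks[t_label[:1].lower()].append((t_uri, t_label))

    links, comparisons = set(), 0
    for s_uri, s_label in source.items():
        for t_uri, t_label in blocks.get(s_label[:1].lower(), ()):
            comparisons += 1
            if similarity(s_label, t_label) >= threshold:
                links.add((s_uri, t_uri))
    return links, comparisons


# Hypothetical resources from two knowledge bases (URI -> label).
source = {"a:Leipzig": "Leipzig", "a:Berlin": "Berlin", "a:Paderborn": "Paderborn"}
target = {"b:leipzig": "leipzig", "b:berlin": "Berlin", "b:bonn": "Bonn", "b:paris": "Paris"}

naive_result, naive_count = naive_links(source, target)
blocked_result, blocked_count = blocked_links(source, target)
print(naive_count, blocked_count)
```

On this toy input both variants find the same two links, but the blocked variant needs far fewer comparisons. The gap grows quadratically with the size of the KBs, which is why the naive approach becomes impractical at scale.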
To address the accuracy challenge, we developed machine learning approaches for discovering accurate links across KBs while moving from stochastic methods (e.g., genetic programming) to deterministic algorithms. We determined the most important features of resources in KBs and explored configurations for linking. We presented the first approach able to deal with the open-world assumption behind the Web. The algorithm achieved the goal of abolishing the stochastic nature of previous approaches, thus saving time and resources, while achieving the same accuracy. Surprisingly, we were able to deal well with small amounts of training data and still achieve state-of-the-art performance. Learning on several KBs simultaneously was carried out using rule-based link prediction algorithms. Here, simple scoring functions often outperformed elaborate inferencing mechanisms. To improve the availability of published links and resources, we provided web services such as the link repository LinkLion to access, upload and maintain links, as well as a ranked index for URIs to retrieve the respective RDF data source. LinkLion data can be used to feed the proposed holistic clustering approach in order to improve and create links between multiple LOD data sources. Overall, the research associates Tommaso Soru and Markus Nentwig contributed to 13 international publications in proceedings and journals.
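How pairwise links from a repository could feed a clustering step can be sketched as follows. This is not the project's holistic clustering algorithm. It only shows the general idea of grouping resources from several sources by treating links as edges and taking connected components, which is a deliberate simplification. The function name `cluster_by_links` and the example identifiers are hypothetical.

```python
def cluster_by_links(links):
    """Group resources into clusters given as connected components of the link graph."""
    parent = {}

    def find(x):
        # Union-find lookup with path halving.
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    # Merge the components of the two resources joined by each link.
    for a, b in links:
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_b] = root_a

    # Collect the members of each component under its root.
    clusters = {}
    for node in list(parent):
        clusters.setdefault(find(node), set()).add(node)
    return sorted(sorted(members) for members in clusters.values())


# Hypothetical equivalence links between resources of three knowledge bases.
links = [
    ("kb1:Leipzig", "kb2:Leipzig"),
    ("kb2:Leipzig", "kb3:Leipzig"),
    ("kb1:Berlin", "kb2:Berlin"),
]
clusters = cluster_by_links(links)
print(clusters)
```

A cluster obtained this way contains resources that were never linked directly (here `kb1:Leipzig` and `kb3:Leipzig`), which is how clustering can create new links. A single wrong link, however, merges two unrelated clusters, so a practical approach needs additional checks that this sketch omits.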
Project-Related Publications (Selection)
-
ROCKER: A Refinement Operator for Key Discovery. In Proc. WWW (2015), pp. 1025–1033
Soru, T., Marx, E. and Ngonga Ngomo, A.C.
-
Holistic Entity Clustering for Linked Data. In ICDM Workshops (2016), IEEE, pp. 194–201
Nentwig, M., Gross, A., and Rahm, E.
-
CEDAL: Time-efficient Detection of Erroneous Links in Large-scale Link Repositories. In WI (August 2017), ACM, pp. 106–113
Valdestilhas, A., Soru, T. and Ngonga Ngomo, A.C.
-
Wombat – A Generalization Approach for Automatic Link Discovery. In ESWC (May 2017), Springer, Cham, pp. 103–119
Sherif, M.A., Ngonga Ngomo, A.C. and Lehmann, J.
-
A Survey of Current Link Discovery Frameworks. Semantic Web 8, 3 (2017), 419–436
Nentwig, M., Hartung, M., Ngonga Ngomo, A., and Rahm, E.
-
Distributed Holistic Clustering on Linked Data. In OTM (2017), pp. 371–382
Nentwig, M., Groß, A., Möller, M., and Rahm, E.
-
Dynamic Planning for Link Discovery. In ESWC (2018), Springer, Cham, pp. 240–255
Georgala, K., Obraczka, D. and Ngonga Ngomo, A.C.
-
Incremental Clustering on Linked Data. In: 2018 IEEE International Conference on Data Mining Workshops (ICDMW)
Nentwig, M., and Rahm, E.
-
Scalable Matching and Clustering of Entities with FAMER. CSIMQ, 16:61–83, 2018
Saeedi, A., Nentwig, M., Peukert, E., and Rahm, E.