LOD Link Discovery - Learning-based scalable link discovery for the data web

Applicants Professor Dr. Axel-Cyrille Ngonga Ngomo; Professor Dr.-Ing. Erhard Rahm

Subject Area Security and Dependability, Operating-, Communication- and Distributed Systems

Term from 2012 to 2018

Project identifier Deutsche Forschungsgemeinschaft (DFG) - Project number 210434127

Final Report Year 2018

Final Report Abstract

The constant growth of the amount of information available on the Web brings about the need for more intelligent processing of Web data. The Semantic Web is an extension of the Web within which knowledge is represented at Web scale in a machine-readable form. The Linked Open Data Cloud is a compendium of knowledge bases (KBs) which relies on the knowledge representation mechanisms of the Semantic Web to make knowledge available. As on the Web of documents, the creation of KBs is a decentralized process. Answering complex information needs, however, demands gathering information across sources of knowledge. The goal of Link Discovery is to provide links across KBs with the aim of improving information processing based on these KBs. Two major challenges come about when addressing this problem. First, a naive implementation of this process leads to impractical runtimes on large KBs. Second, the discovery of accurate links is non-trivial and often demands applying machine learning techniques. The aim of this project was to tackle both challenges. To address the scalability challenge, we employed massively parallel infrastructure such as graphics processing units (GPUs) and distributed computing solutions. GPUs significantly improved the runtime of algorithms which rely on numerical data (e.g., geospatial data) and were second to distributed solutions only when very large data sets were to be processed. By using distributed processing frameworks such as Apache Flink, we could improve the performance of the holistic clustering to process data sets with up to 10^7 resources. The results were compared with existing clustering schemes such as correlation clustering and star clustering. To support a continuous integration of entities in the Web of Data, incremental clustering approaches for distributed processing were introduced. All proposed approaches were evaluated for multiple data domains to show effectiveness and scalability.
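The runtime problem of the naive approach can be illustrated with a small sketch. It is not one of the algorithms developed in the project; it is a generic token-blocking example, and the resource URIs, labels, function names and the threshold of 0.5 are all assumptions made for illustration. The naive variant compares every source resource with every target resource, while the blocked variant only compares resources whose labels share at least one token.

```python
from collections import defaultdict
from itertools import product


def tokens(label):
    """Lower-cased whitespace tokens of a label."""
    return set(label.lower().split())


def jaccard(a, b):
    """Jaccard similarity of two labels; defined as 0.0 if both are empty."""
    ta, tb = tokens(a), tokens(b)
    union = ta | tb
    return len(ta & tb) / len(union) if union else 0.0


def naive_links(source, target, threshold):
    """Compare all |source| * |target| pairs."""
    links, comparisons = set(), 0
    for (s_uri, s_label), (t_uri, t_label) in product(source.items(), target.items()):
        comparisons += 1
        if jaccard(s_label, t_label) >= threshold:
            links.add((s_uri, t_uri))
    return links, comparisons


def blocked_links(source, target, threshold):
    """Compare only pairs that share a token (lossless for threshold > 0)."""
    index = defaultdict(set)  # token -> URIs of target resources containing it
    for t_uri, t_label in target.items():
        for tok in tokens(t_label):
            index[tok].add(t_uri)
    links, comparisons = set(), 0
    for s_uri, s_label in source.items():
        candidates = set()
        for tok in tokens(s_label):
            candidates |= index.get(tok, set())
        for t_uri in candidates:
            comparisons += 1
            if jaccard(s_label, target[t_uri]) >= threshold:
                links.add((s_uri, t_uri))
    return links, comparisons


# Hypothetical toy knowledge bases: URI -> label.
source = {
    "ex:a1": "Leipzig University",
    "ex:a2": "Paderborn University",
    "ex:a3": "Apache Flink",
}
target = {
    "ex:b1": "University of Leipzig",
    "ex:b2": "University of Paderborn",
    "ex:b3": "Flink",
}

naive_result, naive_count = naive_links(source, target, 0.5)
blocked_result, blocked_count = blocked_links(source, target, 0.5)
print(naive_result == blocked_result, naive_count, blocked_count)
```

On this toy input both variants return the same three links, but the blocked variant needs 5 instead of 9 comparisons. The gap widens as the KBs grow, since the naive variant is always quadratic in the number of resources.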
To address the accuracy challenge, we developed machine learning approaches for discovering accurate links across KBs while moving from stochastic methods (e.g., genetic programming) to deterministic algorithms. We determined the most important features of resources in KBs and explored configurations for linking. We presented the first approach able to deal with the open-world assumption behind the Web. The algorithm achieved the goal of abolishing the stochastic nature of previous approaches, thus saving time and resources, while achieving the same accuracy. Surprisingly, we were able to deal well with small amounts of training data and still achieve state-of-the-art performance. Learning on several KBs simultaneously was carried out using rule-based link prediction algorithms. Here, simple scoring functions often outperformed elaborate inferencing mechanisms. To improve the availability of published links and resources, we provided web services such as the link repository LinkLion to access, upload and maintain links, as well as a ranked index for URIs to retrieve the respective RDF data source. LinkLion data can be used to feed the proposed holistic clustering approach in order to improve and create links between multiple LOD data sources. Overall, the research associates Tommaso Soru and Markus Nentwig contributed towards 13 international publications in proceedings and journals.

Publications

ROCKER: A Refinement Operator for Key Discovery. Proc. WWW (pp. 1025-1033)
Soru, Tommaso; Marx, Edgard & Ngonga Ngomo, Axel-Cyrille
A Survey of Current Link Discovery Frameworks. Semantic Web 8, 3 (2017), 419–436
Nentwig, Markus; Hartung, Michael; Ngonga Ngomo, Axel-Cyrille & Rahm, Erhard
Holistic Entity Clustering for Linked Data. In ICDM Workshops (2016), IEEE, pp. 194–201
Nentwig, Markus; Groß, Anika & Rahm, Erhard
CEDAL: Time-efficient Detection of Erroneous Links in Large-scale Link Repositories. In WI (pp. 106-113). ACM
Valdestilhas, André; Soru, Tommaso & Ngonga Ngomo, Axel-Cyrille
Distributed Holistic Clustering on Linked Data. In OTM, pages 371–382, 2017
Nentwig, Markus; Groß, Anika; Möller, Maximilian & Rahm, Erhard
Wombat – A Generalization Approach for Automatic Link Discovery. In ESWC (pp. 103-119). Springer, Cham
Sherif, Mohamed Ahmed; Ngonga Ngomo, Axel-Cyrille & Lehmann, Jens
Dynamic Planning for Link Discovery. ESWC (pp. 240-255). Springer, Cham. 2018
Georgala, Kleanthi; Obraczka, Daniel & Ngonga Ngomo, Axel-Cyrille
Incremental Clustering on Linked Data. In: 2018 IEEE International Conference on Data Mining Workshops (ICDMW)
Nentwig, Markus & Rahm, Erhard
Scalable Matching and Clustering of Entities with FAMER. CSIMQ, 16:61–83, 2018
Saeedi, Alieh; Nentwig, Markus; Peukert, Eric & Rahm, Erhard
