
LOD Link Discovery - Learning-Based, Scalable Link Discovery for the Web of Data

Subject area: Security and Dependability; Operating, Communication and Distributed Systems
Funding: Funded from 2012 to 2018
Project identifier: Deutsche Forschungsgemeinschaft (DFG) - Project number 210434127
 
Year of creation: 2018

Summary of Project Results

The constant growth of the amount of information available on the Web brings about the need for more intelligent processing of Web data. The Semantic Web is an extension of the Web within which knowledge is represented at Web scale in a machine-readable form. The Linked Open Data Cloud is a compendium of knowledge bases (KBs) that relies on the knowledge representation mechanisms of the Semantic Web to make knowledge available. As on the Web of documents, the creation of KBs is a decentralized process. Answering complex information needs, however, demands gathering information across sources of knowledge. The goal of Link Discovery is to provide links across KBs with the aim of improving information processing based on these KBs. Two major challenges arise when addressing this problem. First, a naive implementation of this process leads to impractical runtimes on large KBs. Second, the discovery of accurate links is non-trivial and often demands applying machine learning techniques. The aim of this project was to tackle both challenges.
To address the scalability challenge, we employed massively parallel infrastructure such as graphics processing units (GPUs) and distributed computing solutions. GPUs significantly improved the runtime of algorithms that rely on numerical data (e.g., geospatial data) and were outperformed by distributed solutions only when very large data sets had to be processed. By using distributed processing frameworks such as Apache Flink, we improved the performance of holistic clustering so that it can process data sets with up to 10^7 resources. The results were compared with existing clustering schemes, e.g., correlation clustering and star clustering. To support the continuous integration of entities in the Web of Data, incremental clustering approaches for distributed processing were introduced. All proposed approaches were evaluated on multiple data domains to show their effectiveness and scalability.
To address the accuracy challenge, we developed machine learning approaches for discovering accurate links across KBs while moving from stochastic methods (e.g., genetic programming) to deterministic algorithms. We determined the most important features of resources in KBs and explored configurations for linking. We presented the first approach able to deal with the open-world assumption behind the Web. The algorithm achieved the goal of eliminating the stochastic nature of previous approaches, thus saving time and resources, while achieving the same accuracy. Surprisingly, it coped well with small amounts of training data and still achieved state-of-the-art performance. Learning on several KBs simultaneously was carried out using rule-based link prediction algorithms. Here, simple scoring functions often outperformed elaborate inference mechanisms. To improve the availability of published links and resources, we provided web services such as the link repository LinkLion, which allows users to access, upload and maintain links, as well as a ranked index for URIs to retrieve the respective RDF data source. LinkLion data can be fed into the proposed holistic clustering approach to improve and create links between multiple LOD data sources. Overall, the research associates Tommaso Soru and Markus Nentwig contributed to 13 international publications in proceedings and journals.
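
The runtime problem of naive link discovery mentioned in the summary can be illustrated with a small sketch. It is not the project's GPU- or Flink-based method: the sample URIs and labels, the Jaccard token similarity, the threshold, and the token-based blocking step are all assumptions chosen for illustration. The sketch only shows why comparing every source resource with every target resource scales quadratically, and how a simple candidate filter reduces the number of comparisons without losing links.

```python
from collections import defaultdict
from itertools import product

# Hypothetical sample data: URI -> label for two small knowledge bases.
SOURCE = {
    "ex:a/Leipzig": "Leipzig",
    "ex:a/Paderborn": "Paderborn University",
    "ex:a/Berlin": "Berlin",
}
TARGET = {
    "ex:b/1": "University of Paderborn",
    "ex:b/2": "Leipzig",
    "ex:b/3": "Hamburg",
}


def tokens(label):
    """Return the lower-cased token set of a label."""
    return frozenset(label.lower().split())


def jaccard(a, b):
    """Jaccard similarity of two token sets."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def naive_links(source, target, threshold):
    """Compare every source resource with every target resource (quadratic)."""
    links, comparisons = set(), 0
    for (s_uri, s_label), (t_uri, t_label) in product(source.items(), target.items()):
        comparisons += 1
        if jaccard(tokens(s_label), tokens(t_label)) >= threshold:
            links.add((s_uri, t_uri))
    return links, comparisons


def blocked_links(source, target, threshold):
    """Compare only pairs sharing at least one token.

    For any threshold > 0 this finds the same links as naive_links,
    because a Jaccard similarity above zero requires a shared token.
    """
    index = defaultdict(set)  # token -> URIs of target resources containing it
    for t_uri, t_label in target.items():
        for tok in tokens(t_label):
            index[tok].add(t_uri)

    links, comparisons = set(), 0
    for s_uri, s_label in source.items():
        s_tok = tokens(s_label)
        candidates = set().union(*(index.get(tok, set()) for tok in s_tok))
        for t_uri in candidates:
            comparisons += 1
            if jaccard(s_tok, tokens(target[t_uri])) >= threshold:
                links.add((s_uri, t_uri))
    return links, comparisons


if __name__ == "__main__":
    print(naive_links(SOURCE, TARGET, 0.5))
    print(blocked_links(SOURCE, TARGET, 0.5))
```

On this toy input both functions return the same two links, but the naive variant performs nine comparisons while the blocked variant performs two. On KBs with millions of resources, that gap is what makes filtering, parallel hardware, or distributed processing necessary.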

Project-Related Publications (Selection)

  • 2015. ROCKER: A Refinement Operator for Key Discovery. Proc. WWW (pp. 1025-1033)
    Soru, T., Marx, E. and Ngonga Ngomo, A.C.
    (See online at https://doi.org/10.1145/2736277.2741642)
  • Holistic Entity Clustering for Linked Data. In ICDM Workshops (2016), IEEE, pp. 194–201
    Nentwig, M., Gross, A., and Rahm, E.
    (See online at https://doi.org/10.1109/ICDMW.2016.0035)
  • 2017, August. CEDAL: Time-efficient Detection of Erroneous Links in Large-scale Link Repositories. In WI (pp. 106-113). ACM
    Valdestilhas, A., Soru, T. and Ngonga Ngomo, A.C.
    (See online at https://doi.org/10.1145/3106426.3106497)
  • 2017, May. Wombat – A Generalization Approach for Automatic Link Discovery. In ESWC (pp. 103-119). Springer, Cham
    Sherif, M.A., Ngonga Ngomo, A.C. and Lehmann, J.
    (See online at https://doi.org/10.1007/978-3-319-58068-5_7)
  • A Survey of Current Link Discovery Frameworks. Semantic Web 8, 3 (2017), 419–436
    Nentwig, M., Hartung, M., Ngonga Ngomo, A., and Rahm, E.
    (See online at https://doi.org/10.3233/SW-150210)
  • Distributed Holistic Clustering on Linked Data. In OTM, pages 371–382, 2017
    Nentwig, M., Groß, A., Möller, M., and Rahm, E.
    (See online at https://doi.org/10.1007/978-3-319-69459-7_25)
  • Dynamic Planning for Link Discovery. ESWC (pp. 240-255). Springer, Cham. 2018
    Georgala, K., Obraczka, D. and Ngonga Ngomo, A.C.
    (See online at https://doi.org/10.1007/978-3-319-93417-4_16)
  • Incremental Clustering on Linked Data. In: 2018 IEEE International Conference on Data Mining Workshops (ICDMW)
    Nentwig, M., and Rahm, E.
    (See online at https://doi.org/10.1109/ICDMW.2018.00084)
  • Scalable Matching and Clustering of Entities with FAMER. CSIMQ, 16:61–83, 2018
    Saeedi, A., Nentwig, M., Peukert, E., and Rahm, E.
    (See online at https://doi.org/10.7250/csimq.2018-16.04)
 
 
