Project Details

LOD Link Discovery - Learning-based scalable link discovery for the data web

Subject Area Security and Dependability, Operating-, Communication- and Distributed Systems
Term from 2012 to 2018
Project identifier Deutsche Forschungsgemeinschaft (DFG) - Project number 210434127
 
Final Report Year 2018

Final Report Abstract

The constant growth of the amount of information available on the Web brings about the need for more intelligent processing of Web data. The Semantic Web is an extension of the Web within which knowledge is represented at Web scale in a machine-readable form. The Linked Open Data Cloud is a compendium of knowledge bases (KBs) which relies on the knowledge representation mechanisms of the Semantic Web to make knowledge available. As on the Web of documents, the creation of KBs is a decentralized process. Answering complex information needs, however, demands gathering information across sources of knowledge. The goal of Link Discovery is to provide links across KBs with the aim of improving information processing based on these KBs. Two major challenges arise when addressing this problem. First, a naive implementation of this process leads to impractical runtimes on large KBs. Second, the discovery of accurate links is non-trivial and often demands applying machine learning techniques. The aim of this project was to tackle both challenges.
To address the scalability challenge, we employed massively parallel infrastructure such as graphics processing units (GPUs) and distributed computing solutions. GPUs significantly improved the runtime of algorithms which rely on numerical data (e.g., geospatial data) and were second to distributed solutions only when very large data sets were to be processed. By using distributed processing frameworks such as Apache Flink, we could improve the performance of the holistic clustering to process data sets with up to 10^7 resources. The results were compared with existing clustering schemes, e.g., correlation clustering and star clustering. To support a continuous integration of entities in the Web of Data, incremental clustering approaches for distributed processing were introduced. All proposed approaches were evaluated on multiple data domains to show effectiveness and scalability.
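The runtime problem of naive link discovery mentioned above can be illustrated with a minimal Python sketch. It is not the project's implementation: the sample data, the function names, the string similarity from the standard library and the token-based blocking step are all assumptions chosen for illustration. The naive variant compares every source resource with every target resource, while the blocked variant only compares resources whose labels share at least one token.

```python
from collections import defaultdict
from difflib import SequenceMatcher

# Hypothetical sample data: URI -> label for two small knowledge bases.
SOURCE = {"kb1:Leipzig": "Leipzig", "kb1:Dresden": "Dresden", "kb1:Halle": "Halle"}
TARGET = {"kb2:leipzig": "leipzig", "kb2:dresden": "dresden", "kb2:halle": "Halle"}


def similarity(a: str, b: str) -> float:
    """Case-insensitive string similarity in [0, 1] (illustrative measure only)."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def naive_links(source, target, threshold=0.8):
    """Compare all pairs: the number of comparisons is |source| * |target|."""
    links, comparisons = [], 0
    for s_uri, s_label in source.items():
        for t_uri, t_label in target.items():
            comparisons += 1
            if similarity(s_label, t_label) >= threshold:
                links.append((s_uri, t_uri))
    return sorted(links), comparisons


def blocked_links(source, target, threshold=0.8):
    """Compare only pairs whose labels share a token (simple token blocking)."""
    index = defaultdict(set)
    for t_uri, t_label in target.items():
        for token in t_label.lower().split():
            index[token].add(t_uri)

    links, comparisons = [], 0
    for s_uri, s_label in source.items():
        candidates = set()
        for token in s_label.lower().split():
            candidates |= index.get(token, set())
        for t_uri in candidates:
            comparisons += 1
            if similarity(s_label, target[t_uri]) >= threshold:
                links.append((s_uri, t_uri))
    return sorted(links), comparisons


if __name__ == "__main__":
    print(naive_links(SOURCE, TARGET))
    print(blocked_links(SOURCE, TARGET))
```

On this toy input both variants find the same links, but the blocked variant needs far fewer comparisons. Token blocking can miss links between labels that share no token, which is one reason why more elaborate filtering and parallelization strategies are needed in practice.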
To address the accuracy challenge, we developed machine learning approaches for discovering accurate links across KBs while moving from stochastic methods (e.g., genetic programming) to deterministic algorithms. We determined the most important features of resources in KBs and explored configurations for linking. We presented the first approach able to deal with the open-world assumption behind the Web. The algorithm achieved the goal of abolishing the stochastic nature of previous approaches, thus saving time and resources, while achieving the same accuracy. Surprisingly, we were able to deal well with small amounts of training data and still achieve state-of-the-art performance. Learning on several KBs simultaneously was carried out using rule-based link prediction algorithms. Here, simple scoring functions often outperformed elaborate inferencing mechanisms. To improve the availability of published links and resources, we provided web services such as the link repository LinkLion to access, upload and maintain links, as well as a ranked index for URIs to retrieve the respective RDF data source. LinkLion data can be fed into the proposed holistic clustering approach to improve and create links between multiple LOD data sources. Overall, the research associates Tommaso Soru and Markus Nentwig contributed to 13 international publications in proceedings and journals.

