Reference Understanding in the Social Sciences (OUTCITE)

Applicants Dr. Philipp Mayr-Schlegel; Professor Dr. Steffen Staab

Subject Area Data Management, Data-Intensive Systems, Computer Science Methods in Business Informatics

Term from 2016 to 2024

Project identifier Deutsche Forschungsgemeinschaft (DFG) - Project number 293069437

Making bibliographic data available to researchers, scholars and others is important in all disciplines, because it ensures easy and fast access to the literature and to other scientific resources such as research datasets. To this end, many publishers strive to index their publications in bibliographic databases, which enables publications to be linked in a citation graph. Still, a significant part of the citation data in disciplines such as the social sciences is not accessible via bibliographic databases.

Our previous project EXCITE addressed this problem and successfully narrowed the gap between the availability of citation data in the social sciences and in other disciplines. EXCITE researched, developed, and deployed powerful tools (http://excite.west.uni-koblenz.de) that localize, extract and segment reference strings in PDF documents and then match them against bibliographic databases. EXCITE also integrated the extracted citation data from social science publications into the Open Citations Corpus (OCC). One of the main conclusions derived from EXCITE is that the metadata of 60% of the cited papers and other scientific resources lie outside the available bibliographic databases. The extracted reference strings (items) that could not be matched are called non-source items. Non-source items include incomplete or erroneous references as well as references that genuinely do not exist in the available bibliographic databases, especially references to datasets, websites and other material.

The main goal of OUTCITE is to research, develop and deploy a toolchain that follows up on the output produced by the EXCITE pipeline in order to link non-source items to their sources. We will apply the knowledge and expertise gained in EXCITE to overcome the challenges foreseen in OUTCITE. Specifically, we will develop a set of algorithms dedicated to understanding non-source items (challenge C1) and to resolving their duplicate occurrences by gathering them into clusters (C2). Subsequently, new algorithms and methods will be developed to derive correct and complete representations from these clusters (C3). These representations will then be looked up with web search engines, so that the existence of the publication is confirmed and the corresponding source is retrieved (C4).

To ensure a high-quality result at the end of the project, we will use, adapt and extend the state-of-the-art technologies reviewed so far. Machine learning techniques will be actively used to reach a satisfactory level of quality. To this end, each phase will pass on not only its outcome but also an estimate of that outcome's quality. At the end of the project, as in EXCITE, the developed techniques, tools and enriched reference index will be made available under open-source licenses, integrated into the GESIS Search infrastructure and ingested into the OCC.
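
The clustering and representation steps (C2, C3), together with the idea of passing a quality estimate from one phase to the next, can be illustrated with a small sketch. This is not the OUTCITE implementation: the function names, the similarity threshold, the choice of the longest string as representative and the sample reference strings are all assumptions made for illustration, and plain string similarity from the Python standard library stands in for the machine learning techniques the project describes.

```python
from difflib import SequenceMatcher


def normalize(ref: str) -> str:
    """Lower-case a reference string and replace punctuation with spaces."""
    cleaned = "".join(c.lower() if c.isalnum() else " " for c in ref)
    return " ".join(cleaned.split())


def similarity(a: str, b: str) -> float:
    """Similarity in [0, 1] between two normalized reference strings."""
    return SequenceMatcher(None, normalize(a), normalize(b)).ratio()


def cluster_references(refs, threshold=0.8):
    """Greedy clustering: attach each string to the first cluster whose
    seed is similar enough, otherwise open a new cluster.
    The threshold is a hypothetical value, not one taken from the project."""
    clusters = []
    for ref in refs:
        for cluster in clusters:
            if similarity(ref, cluster[0]) >= threshold:
                cluster.append(ref)
                break
        else:
            clusters.append([ref])
    return clusters


def representative(cluster):
    """Pick the longest string as a stand-in for the 'complete' form and
    return it with a simple agreement score (mean similarity of all
    members to it), which a later phase could use as a quality estimate.
    A singleton cluster trivially scores 1.0."""
    rep = max(cluster, key=len)
    score = sum(similarity(rep, member) for member in cluster) / len(cluster)
    return rep, score


# Made-up reference strings, for illustration only.
refs = [
    "Smith, J. (2010). Social capital and trust. Journal of Sociology, 12(3), 45-67.",
    "Smith J (2010) Social capital and trust, Journal of Sociology 12(3): 45-67",
    "Statistisches Bundesamt (2015). Mikrozensus dataset.",
]
clusters = cluster_references(refs)
results = [representative(c) for c in clusters]
```

In this sketch the two differently punctuated variants of the first reference end up in the same cluster, while the dataset reference forms its own cluster. A real pipeline would presumably compare segmented fields (authors, year, title) rather than whole strings.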

DFG Programme Research data and software (Scientific Library Services and Information Systems)

Co-Investigator Dr.-Ing. Zeyd Boukhers
