Reference Understanding in the Social Sciences (OUTCITE)
Final Report Abstract
The Outcite project set out to improve the accessibility and linking of citation data, particularly in the social sciences. It extends the earlier EXCITE project, which identified gaps in bibliographic databases, and focuses on references that are hard to find in existing databases, such as incomplete citations and web resources. The project developed tools to process these "non-source items" and match them to their original sources, making the citation records available for research more complete.
The core objective was a scalable toolchain that accurately links non-source items to their corresponding sources. The toolchain comprises four steps:
(i) Extraction: references in academic full-text documents are segmented and their metadata extracted with existing state-of-the-art tools such as GROBID, CERMINE, and AnyStyle.
(ii) Matching and linking: the extracted references are matched against existing open bibliographic collections such as SSOAR, GESIS Search, the DNB collection, sowiport, arXiv, EconBiz, Crossref, and OpenAlex.
(iii) Deduplication: references are deduplicated to reduce redundancy and improve the completeness of the reference records.
(iv) Provisioning and distribution: a cron job runs the pipeline on SSOAR documents, and a live demonstrator makes the results publicly available.
By the end of the project, Outcite had processed over 73,000 PDF documents from the SSOAR repository and ingested more than 3.4 million references into the GESIS Search database. About 1.74 million of these references, roughly half, were successfully linked to their online sources. The citation data has been shared with the OpenCitations initiative for further processing. The research outcomes were published in papers and presented at workshops and conferences organized or attended during the project.
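The matching and deduplication steps, (ii) and (iii), can be illustrated with a minimal sketch. This is not the Outcite implementation: the function names, the record fields (`id`, `title`, `year`), the title-similarity approach, and the 0.9 threshold are all assumptions chosen for illustration, and the sample records are invented.

```python
import re
from difflib import SequenceMatcher


def normalize(title: str) -> str:
    """Lowercase, replace punctuation with spaces, collapse whitespace."""
    return " ".join(re.sub(r"[^\w\s]", " ", title.lower()).split())


def similarity(a: str, b: str) -> float:
    """Similarity in [0, 1] between two normalized titles."""
    return SequenceMatcher(None, normalize(a), normalize(b)).ratio()


def link_reference(ref: dict, records: list, threshold: float = 0.9):
    """Return the id of the best-matching record, or None.

    Hypothetical rule: candidates with a conflicting year are skipped,
    and the best title similarity must reach the (assumed) threshold.
    """
    best_id, best_score = None, 0.0
    for rec in records:
        if ref.get("year") and rec.get("year") and ref["year"] != rec["year"]:
            continue
        score = similarity(ref["title"], rec["title"])
        if score > best_score:
            best_id, best_score = rec["id"], score
    return best_id if best_score >= threshold else None


def deduplicate(refs: list) -> list:
    """Keep the first reference seen for each (normalized title, year) key."""
    seen = {}
    for ref in refs:
        key = (normalize(ref["title"]), ref.get("year"))
        seen.setdefault(key, ref)
    return list(seen.values())


# Invented sample data, for illustration only.
records = [
    {"id": "rec-1", "title": "A Study of Citation Linking", "year": 2020},
    {"id": "rec-2", "title": "Open Bibliographic Data in Practice", "year": 2019},
]
extracted = {"title": "A study of citation linking.", "year": 2020}
print(link_reference(extracted, records))
```

A reference whose title matches no record closely enough yields `None`; in the terms of the abstract, it remains an unlinked non-source item. A production pipeline would likely compare more fields (authors, venue, identifiers) and use an index rather than a linear scan.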
Publications
-
Data for: NILK, entity linking dataset targeting NIL-linking cases. DaRUS
Iurshina, Anastasiia; Pan, Jiaxin; Boutalbi, Rafika & Staab, Steffen
-
Extracting bibliographic references from footnotes with EXcite-docker. ULITE workshop at JCDL 2022: 26-33
Boulanger, Christian & Iurshina, Anastasiia
-
Extracting literature references in German Speaking Geography – the GEOcite project. In Proceedings of the Workshop on Understanding LIterature references in academic full TExt (pp. 34–41)
Birkeneder, B.; Aufenvenne, P.; Haase, C.; Mayr, P. & Steinbrink, M.
-
Lattice-based progressive author disambiguation. Information Systems, 109, 102056.
Backes, Tobias & Dietze, Stefan
-
NILK: Entity Linking Dataset Targeting NIL-linking Cases. Proceedings of the 31st ACM International Conference on Information & Knowledge Management, 4069-4073.
Iurshina, Anastasiia; Pan, Jiaxin; Boutalbi, Rafika & Staab, Steffen
-
Proceedings of the Workshop on Understanding LIterature references in academic full TExt. CEUR-WS.org
Backes, T.; Iurshina, A. & Mayr, P.
-
Tensor-based Graph Modularity for Text Data Clustering. Proceedings of the 45th International ACM SIGIR Conference on Research and Development in Information Retrieval, 2227-2231.
Boutalbi, Rafika; Ait-Saada, Mira; Iurshina, Anastasiia; Staab, Steffen & Nadif, Mohamed
-
Towards hierarchical affiliation resolution: framework, baselines, dataset. International Journal on Digital Libraries, 23(3), 267-288.
Backes, Tobias; Hienert, Daniel & Dietze, Stefan
-
Investigating the performance of GROBID and OUTCITE (Version v1). Zenodo.
Pagnotta, O.
-
Partial Orders and Progressive Blocking: A Matching-based Framework for Large-scale Entity Resolution in Bibliographic Data [PhD Thesis, Heinrich-Heine-Universität, Düsseldorf, Germany]
Backes, T.
-
Comparing free reference extraction pipelines. International Journal on Digital Libraries, 25(4), 841-853.
Backes, Tobias; Iurshina, Anastasiia; Shahid, Muhammad Ahsan & Mayr, Philipp
-
Connected Components for Scaling Partial-order Blocking to Billion Entities. Journal of Data and Information Quality, 16(1), 1-29.
Backes, Tobias & Dietze, Stefan
-
olgagolgan/RefEx: RefEx project code (scripts). Zenodo.
Pagnotta, Olga
-
NeOn-GPT: A Large Language Model-Powered Pipeline for Ontology Learning. Lecture Notes in Computer Science, 36-50.
Fathallah, Nadeen; Das, Arunav; De Giorgis, Stefano; Poltronieri, Andrea; Haase, Peter & Kovriguina, Liubov
