
Multilingual Knowledge-Enhanced Information Extraction for Pharmacovigilance

Subject Area Image and Language Processing, Computer Graphics and Visualisation, Human Computer Interaction, Ubiquitous and Wearable Computing
Funding Funded from 2020 to 2025
Project Identifier Deutsche Forschungsgemeinschaft (DFG) - Project number 442445488
 
Year of Creation 2024

Summary of Project Results

Scientific knowledge is now published digitally in many forms and sources: encyclopedias, scientific papers, and regulatory documents, as well as structured knowledge sources such as ontologies and knowledge bases. News articles, blog posts, forums, and social media can also contain relevant information and be used for research. All of this content is published every day in a large number of languages. In some domains, digital content is now produced in such volume and at such speed that humans can no longer keep up with it and maintain an up-to-date view of current scientific evidence. This project aimed to design Artificial Intelligence methods that automatically digest these different text sources and jointly extract the knowledge and observations they contain to populate existing knowledge bases. The project showcases these methods in the domain of pharmacovigilance, which endeavors to maintain up-to-date knowledge of adverse drug reactions (ADRs) for the benefit of public health. We focused on detecting mentions of ADRs in social media and scientific journals.
An ADR is an appreciably harmful or unpleasant reaction, resulting from an intervention related to the use of a medicinal product, which predicts hazard from future administration and warrants prevention or specific treatment, or alteration of the dosage regimen, or withdrawal of the product [6]. From a natural language processing point of view, an ADR is the expression of a causal relation that links a drug to a medical problem: the drug and the medical problem are two entity mentions, and the causal link between them is a relation. This therefore calls for two information extraction tasks: entity detection and relation extraction. In our context, little annotated data was available to train information extraction systems.
Moreover, since social media users express their views in their own languages, we needed to process text in the relevant languages, i.e., Japanese, German, and French. These requirements set the main goals for the project and led to the achievements below.
A foundational asset jointly created by the three-country team is an annotation schema for adverse drug reactions. The schema was designed iteratively through a series of consortium-wide online meetings, with feedback from each team. It was then used to annotate text in each language: DFKI led the annotation of German and French, and the Japanese partners managed the annotation of Japanese, resulting in the KEEPHA tri-lingual annotated corpus. Such annotations provide the basis for training deep neural network systems that analyze text content.
All teams prepared and made available methods and deep learning tools for information extraction. The initial experiments were performed on pre-existing annotated biomedical corpora. Once the tri-lingual annotated corpus was available, we also began testing our methods on it. The information extraction methods created by the team include entity detection (all partners), attribute classification (NAIST, LISN), relation extraction (RIKEN, LISN), and discourse dependency parsing (RIKEN). More specific methods were also developed, including joint detection of entities and relations (RIKEN), cross-language detection (DFKI, LISN), prompt-based relation extraction (LISN), zero-shot relation extraction (RIKEN, LISN), and message classification (DFKI), among others. Additionally, RIKEN created Japanese language models pre-trained for the medical domain; these models improve Japanese information extraction in that domain.
To reach out to the wider natural language processing community, NAIST organized a shared task (challenge) on the detection of adverse drug reactions in multilingual text, hosted by the NTCIR conference (2023) and prepared together with all KEEPHA partners.
The shared task attracted the attention of a large number of teams worldwide, eight of which went on to build ADR detection systems and submit results. We have also planned another shared task, led by LISN and DFKI with contributions from all partners, that will use the KEEPHA tri-lingual annotated corpus. These shared tasks help disseminate our work more widely and allow a better assessment of the current state of the art on the tasks of interest to our project.
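
The summary above frames an ADR as a causal relation between two entity mentions, a drug and a medical problem. The following is a minimal sketch of how such annotations might be represented and filtered. It is an illustration only, not the KEEPHA annotation schema: the labels "DRUG", "DISORDER", and "CAUSED", the helper names, and the example sentence are all invented for this sketch.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class EntityMention:
    """A span of text labeled with an entity type (labels are hypothetical)."""
    text: str
    label: str  # assumed labels: "DRUG" or "DISORDER"
    start: int  # character offset where the mention begins
    end: int    # character offset just past the mention


@dataclass(frozen=True)
class Relation:
    """A typed, directed link between two entity mentions."""
    head: EntityMention
    tail: EntityMention
    label: str  # assumed label: "CAUSED"


def find_mention(sentence: str, surface: str, label: str) -> EntityMention:
    """Stand-in for entity detection: locate a known surface form by string search."""
    start = sentence.index(surface)
    return EntityMention(surface, label, start, start + len(surface))


def adverse_drug_reactions(relations: list[Relation]) -> list[tuple[str, str]]:
    """Keep only causal relations that go from a drug to a medical problem."""
    return [
        (r.head.text, r.tail.text)
        for r in relations
        if r.label == "CAUSED"
        and r.head.label == "DRUG"
        and r.tail.label == "DISORDER"
    ]


# Invented example sentence in the style of a social media post.
sentence = "I got a bad headache after taking ibuprofen."
drug = find_mention(sentence, "ibuprofen", "DRUG")
problem = find_mention(sentence, "headache", "DISORDER")

# Stand-in for relation extraction: the causal link is asserted by hand here.
relations = [Relation(drug, problem, "CAUSED")]
adrs = adverse_drug_reactions(relations)
print(adrs)
```

In a real system, the two hand-written steps (string search and the manually asserted relation) would be replaced by trained entity detection and relation extraction models; the sketch only shows how their outputs fit together.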
