Project Details

Paraphrase Types: A New Paradigm for Paraphrase Generation and Detection

Subject Area: Image and Language Processing, Computer Graphics and Visualisation, Human Computer Interaction, Ubiquitous and Wearable Computing
Term: since 2025
Project identifier: Deutsche Forschungsgemeinschaft (DFG) - Project number 564661959
 
Paraphrases are texts conveying the same meaning using different words or grammatical structures. Humans naturally understand that changing the grammatical structure or even a single word, e.g., a negation, can change the meaning of a sentence completely. Current automated systems for paraphrase generation and detection (PGD) produce and identify semantically similar content reliably. However, they only perform binary assessments of whether sentence pairs share the same meaning and fail to understand the linguistic characteristics and syntactic or semantic changes that make two texts alike.

Defining and recognizing paraphrase types, i.e., different linguistic forms of paraphrases, allows us to understand what changes make two texts similar. A technique that generates and identifies paraphrase types would open up many use cases. It could identify and differentiate authors more granularly, create linguistic profiles of authors, or characterize machine-generated text to improve plagiarism detection systems. Further, this technology could enhance language learning platforms, e.g., by providing learners with personalized variations of structures they struggle with, such as practice sentences with different modal verbs for learners who find modal verbs difficult. Because current methods do not consider paraphrase types in their architectures, they struggle with these tasks.

This project will design, implement, and evaluate an approach to learn paraphrase types in large language models (LLMs) by completing three research tasks. We will assess the handling of paraphrase types in paraphrase models (WP1), integrate paraphrase types into training objectives and datasets (WP2), and develop a PGD system that incorporates these insights (WP3).

In WP1, we will propose a unified taxonomy for paraphrase types and explore current LLMs' limitations in PGD. In WP2, we will conduct human studies to assess paraphrase types and propose tasks and datasets for training automated systems. Next, we will formulate training tasks and compose datasets for training new models. We will propose a new metric that considers paraphrase types to evaluate models' abilities to handle specific linguistic changes. In WP3, we will implement specific LLMs for generation and detection. We will use the newly created datasets to test architecture variations and scale the best-performing models.

To ensure the project's long-term success, we will develop strategies for incorporating new paraphrase types and enhancing our models with efficient computational methods. All project outputs will be made available and maintained as open source on GitHub to ensure long-term accessibility for further research and development.
DFG Programme: Research Grants
 
 

