Semantic Duplicate Detection Using Textual Entailment
Summary of Project Results
Duplicate detection is becoming ever more important in the information society, as the number of available text documents grows rapidly (e.g., with the growth of the web) and the number of duplicates increases at a similar pace. Prior work on duplicate detection relied on shallow checkers. They are called shallow because they operate only on surface-oriented factors or features and do not consider the semantics of words, sentences, paragraphs, or whole texts. Surface-oriented features of a given text are derived from n-grams, rare words, spelling errors, etc. However, two texts can be semantic duplicates, i.e. express the same content, without sharing many words or word sequences and hence without having similar values for shallow features. Shallow checkers can easily be tricked by experienced users who employ advanced paraphrase techniques. Therefore, a deep (semantic) approach that compares full semantic representations of two given texts was designed, implemented, and evaluated in the Semantic Duplicate Detection project (SemDupl).

To achieve this goal, the following tasks were accomplished according to the project proposal:

(A1) Conceptual work: In the preparation phase, existing shallow duplicate recognizers were compared, with the aim of finding one recognizer that could serve as a baseline and as a starting point for combining shallow and deep methods for duplicate detection.

(A2) Knowledge acquisition: This task turned out to be the most ambitious part of the project, since the knowledge acquisition had to be done automatically, and its results are a precondition for the successful operation of the deep duplicate recognizer. To this end, thousands of basic relations (subordination relations, meronymy relations, and semantic entailments, so-called meaning postulates) were found automatically by means of statistical learning methods and validated with logical methods. In total, we extracted 391,153 hyponymy hypotheses, 265,938 synonymy hypotheses, 1,449,406 meronymy hypotheses, and 426 entailment hypotheses. Each hypothesis is assigned a confidence score indicating its likelihood of being correct.

(A3) Creation of suitable test beds: As a basis for checking whether the duplicate recognizer works well, sufficiently large corpora of duplicate and non-duplicate text pairs are needed. These corpora were partially handcrafted and partially gathered from existing collections. Together, these corpora form the SemDupl corpus, a valuable resource produced in the SemDupl project.

(A4) Textual entailment: During the work on the project, it turned out that the theorem prover developed in the LogAnswer project is also well suited for textual entailment. The main work therefore consisted in generating textual entailment patterns, both deep and shallow.

(A5) Technical realization of the duplicate recognizer.
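The weakness of shallow checkers described above can be illustrated with a minimal sketch of a surface-oriented check based on character n-gram overlap (a hypothetical illustration, not the SemDupl implementation): two texts with near-identical wording score high, while a paraphrased semantic duplicate shares few n-grams and scores low, evading the checker.

```python
def ngrams(text, n=3):
    """Set of character n-grams of a whitespace-normalized, lowercased text."""
    t = " ".join(text.lower().split())
    return {t[i:i + n] for i in range(len(t) - n + 1)}

def shallow_similarity(a, b, n=3):
    """Jaccard overlap of character n-gram sets (a typical surface feature)."""
    ga, gb = ngrams(a, n), ngrams(b, n)
    return len(ga & gb) / len(ga | gb) if ga | gb else 0.0

# Near-identical wording: high overlap, flagged as a duplicate.
near = shallow_similarity("The cat sat on the mat.", "The cat sat on a mat.")

# Semantic duplicate via paraphrase: low overlap, missed by the shallow check.
para = shallow_similarity("The cat sat on the mat.", "A feline rested upon the rug.")

print(near > para)  # True: the paraphrase evades the surface-oriented feature
```

A deep approach instead compares semantic representations of the two texts, so the paraphrased pair above would still be recognized as expressing the same content.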
Project-related Publications (Selection)
- vor der Brück, Tim (2009). Hypernymy extraction based on shallow and deep patterns. In: From Form To Meaning: Processing Texts Automatically, Proceedings of the Biennial GSCL Conference 2009 (edited by Christian Chiarcos and Richard Eckart de Castilho), pp. 41–52. Potsdam, Germany.
- Hartrumpf, Sven; Tim vor der Brück; and Christian Eichhorn (2010). Detecting duplicates with shallow and parser-based methods. In: Proceedings of the 6th IEEE International Conference on Natural Language Processing and Knowledge Engineering (NLPKE). Beijing, China.
- vor der Brück, Tim (2010). Hypernymy extraction using a semantic network representation. International Journal of Computational Linguistics and Applications, Vol. 1, No. 1/2, pp. 105–119.
- vor der Brück, Tim (2010). Learning deep semantic patterns for hypernymy extraction following the minimum description length principle. In: Proceedings of the 29th International Conference on Lexis and Grammar (LGC), pp. 39–49. Belgrade, Serbia.
- vor der Brück, Tim (2010). Learning semantic network patterns for hypernymy extraction. In: Proceedings of the 6th Workshop on Ontologies and Lexical Resources (OntoLex), pp. 38–47. Beijing, China.
- vor der Brück, Tim and Holger Stenzhorn (2010). Logical ontology validation using an automatic theorem prover. In: Proceedings of the 19th European Conference on Artificial Intelligence (ECAI), pp. 491–496. Lisbon, Portugal.
- Hartrumpf, Sven; Tim vor der Brück; and Christian Eichhorn (2010). Semantic duplicate identification with parsing and machine learning. In: Proceedings of the 13th International Conference on Text, Speech and Dialogue (TSD), volume 6231 of Lecture Notes in Artificial Intelligence, pp. 84–92. Brno, Czech Republic: Springer.
- vor der Brück, Tim and Hermann Helbig (2010). Validating meronymy hypotheses with support vector machines and graph kernels. In: Proceedings of the 9th International Conference on Machine Learning and Applications (ICMLA), pp. 243–250. Washington, DC, USA.
- vor der Brück, Tim and Yu-Fang Wang (2012). Synonymy extraction from semantic networks using string and graph kernel methods. In: Proceedings of the 20th European Conference on Artificial Intelligence (ECAI), Frontiers in Artificial Intelligence and Applications, Vol. 242, pp. 822–827. Montpellier, France. (Available online at https://dx.doi.org/10.3233/978-1-61499-098-7-822)
- vor der Brück, Tim (2012). Wissensakquisition mithilfe maschineller Lernverfahren auf tiefen semantischen Repräsentationen. PhD thesis, FernUniversität in Hagen, Germany. Published 2013: Springer Vieweg, XVII, 323 pp., ISBN 978-3-8348-2502-5.
- vor der Brück, Tim (2014). Automatically generating and evaluating a full-form lexicon for English. In: Proceedings of the Language Resources and Evaluation Conference (LREC). Reykjavik, Iceland.