Learning Table Similarity Measures
Final Report Abstract
Information is increasingly managed and exchanged in digital form. Tables are one type of digital information that has so far received comparatively little research attention as first-class objects for retrieval. Tables are a universal means of structuring information in two dimensions; they are used extensively in scientific articles, business reports, product descriptions, web pages, etc. Searching for (similar) tables requires suitable table similarity measures (TSMs) and algorithms. Tables differ from plain text mainly in that they have structure; searching tables as first-class objects also differs from the problems studied in traditional database research because it addresses entire tables, not tuples within tables. In this project, we developed novel measures of table similarity that focus on automatically derived "semantic" information about a table as a whole, its columns and rows, and its content.
The major project aims were captured in the following research questions:
• How can we automatically learn table orientation using deep learning?
• How do we map the meaning of a table's element or column into a fixed-length vector representation and learn a similarity metric for TSMs?
• How can we measure and merge different similarity scores for pairs of tables using deep learning?
The work performed in the project was structured in three main areas:
• Classification of tables by orientation
• A deep neural network for estimating semantic table similarity (TabSim)
• Faster and more accurate TabSim
While research on the first two topics was completed successfully, each resulting in a high-profile publication, work on the third topic is still ongoing. The main reason for this delay is the Corona pandemic, which led to frequent absences of, and considerable personal problems for, the researcher working on the project at that time.
Publications
-
A Permutation Invariant Neural Network for Table Orientation Classification. Data Mining and Knowledge Discovery 34(6).
Habibi, Maryam; Starlinger, Johannes & Leser, Ulf
-
TabSim: A Siamese Neural Network for Accurate Estimation of Table Similarity. 2020 IEEE International Conference on Big Data (Big Data), 930-937. IEEE.
Habibi, Maryam; Starlinger, Johannes & Leser, Ulf
-
HunFlair: an easy-to-use tool for state-of-the-art biomedical named entity recognition. Bioinformatics, 37(17), 2792-2794.
Weber, Leon; Sänger, Mario; Münchmeyer, Jannes; Habibi, Maryam; Leser, Ulf & Akbik, Alan
