International Conference on Very Large Data Bases (VLDB)
Jianbin Qin¹, Wei Wang², Chuan Xiao³, Ying Zhang⁴
¹ Shenzhen University & Shenzhen Institute of Computing Sciences; ² University of New South Wales, Osaka University and Nagoya University; ³ Osaka University & Nagoya University; ⁴ University of Technology Sydney
ABSTRACT
Similarity query processing has been an active research topic for several decades. It is an essential procedure in a wide range of applications. Recently, embedding and auto-encoding methods as well as pre-trained models have gained popularity. These methods typically operate on high-dimensional data, and this trend brings new opportunities and challenges to similarity query processing for high-dimensional data. Meanwhile, new techniques have emerged to tackle this long-standing problem both theoretically and empirically. In this tutorial, we summarize existing solutions, especially recent advancements from both the database (DB) and machine learning (ML) communities, and analyze their strengths and weaknesses. We review exact and approximate methods such as cover trees, locality sensitive hashing, product quantization, and proximity graphs. We also discuss the selectivity estimation problem and show how researchers are bringing in state-of-the-art ML techniques to address it. By highlighting the strong connections between DB and ML, we hope that this tutorial provides an impetus towards new ML for DB solutions and vice versa.
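To make one of the approximate methods named above concrete, the following is a minimal sketch of locality sensitive hashing for cosine similarity using random hyperplanes. It is an illustration only and is not taken from the tutorial: the function names (`make_hyperplanes`, `signature`, `build_index`, `query`), the single hash table, and the parameter choices are all assumptions made for this example. Practical systems typically use several tables and tuned signature lengths.

```python
import random


def make_hyperplanes(num_bits, dim, seed=0):
    """Draw num_bits random hyperplane normals in dim dimensions (illustrative helper)."""
    rng = random.Random(seed)
    return [[rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(num_bits)]


def signature(vec, hyperplanes):
    """Hash a vector to a bit tuple: one bit per hyperplane, set by the side it falls on."""
    return tuple(
        1 if sum(h_i * v_i for h_i, v_i in zip(h, vec)) >= 0 else 0
        for h in hyperplanes
    )


def build_index(vectors, hyperplanes):
    """Group vector ids into buckets keyed by signature (a single hash table)."""
    buckets = {}
    for idx, vec in enumerate(vectors):
        buckets.setdefault(signature(vec, hyperplanes), []).append(idx)
    return buckets


def cosine(a, b):
    """Cosine similarity of two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = sum(x * x for x in a) ** 0.5
    norm_b = sum(y * y for y in b) ** 0.5
    return dot / (norm_a * norm_b)


def query(q, vectors, hyperplanes, buckets):
    """Return the id of the most similar vector in q's bucket, or None if the bucket is empty.

    Only the colliding candidates are compared exactly, so the result is
    approximate: the true nearest neighbor may sit in a different bucket.
    """
    candidates = buckets.get(signature(q, hyperplanes), [])
    if not candidates:
        return None
    return max(candidates, key=lambda i: cosine(q, vectors[i]))


if __name__ == "__main__":
    rng = random.Random(42)
    data = [[rng.gauss(0.0, 1.0) for _ in range(16)] for _ in range(1000)]
    planes = make_hyperplanes(num_bits=8, dim=16, seed=1)
    index = build_index(data, planes)
    print(query(data[0], data, planes, index))
```

Vectors separated by a small angle are likely to fall on the same side of most hyperplanes and therefore to share a bucket, which is why the exact comparison can be restricted to a small candidate set. Longer signatures shrink the buckets and speed up queries at the cost of recall.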
Future Opportunities
We highlight a number of promising directions for future research: (1) It is interesting to explore ML models as solutions to query processing (e.g., learned indexing or sampling). (2) Whilst many existing studies target search queries, we expect that join queries will be explored, especially for the cold-start case. (3) Answering composite queries (e.g., conjunctive queries) over multiple attributes will receive more attention, since many DB tasks deal with multi-attribute data and the advancement of deep learning methods will enable us to embed more attributes for semantic comparison. (4) Another direction is to develop efficient algorithms for query processing in data science platforms such as Pandas or R dataframes.
Acknowledgments
This work was supported by JSPS 17H06099, 18H04093, and 19K11979, NSFC 61702409, Guangdong Basic and Applied Basic Research Foundation 2019A1515111047, 2019A1515011064, Guangdong Project 2017B030314073 and 2018B030325002, ARC DP170103710, DP180103096, DP180103411, and FT170100128, and D2D CRC DC25002 and DC25003. We thank Yaoshu Wang (SICS) for his kind advice.
BibTeX
@article{DBLP:journals/pvldb/Qin0X020,
author = {Jianbin Qin and
Wei Wang and
Chuan Xiao and
Ying Zhang},
title = {Similarity Query Processing for High-Dimensional Data},
journal = {Proc. {VLDB} Endow.},
volume = {13},
number = {12},
pages = {3437--3440},
year = {2020},
}