High-Dimensional Similarity Query Processing for Data Science - National High Performance Computing Center, Shenzhen Branch


High-Dimensional Similarity Query Processing for Data Science

ACM SIGKDD Conference on Knowledge Discovery and Data Mining (KDD)

Published: 22-09-17 21:29


Jianbin Qin¹, Wei Wang², Chuan Xiao³, Ying Zhang⁴, Yaoshu Wang¹

¹Shenzhen University & Shenzhen Institute of Computing Sciences; ²Hong Kong University of Science and Technology; ³Osaka University & Nagoya University; ⁴University of Technology Sydney

Abstract

Similarity query (a.k.a. nearest neighbor query) processing has been an active research topic for several decades. It is an essential procedure in a wide range of applications (e.g., classification & regression, deduplication, image retrieval, and recommender systems). Recently, representation learning and auto-encoding methods as well as pre-trained models have gained popularity. They basically deal with dense high-dimensional data, and this trend brings new opportunities and challenges to similarity query processing. Meanwhile, new techniques have emerged to tackle this long-standing problem theoretically and empirically. This tutorial aims to provide a comprehensive review of high-dimensional similarity query processing for data science. It introduces solutions from a variety of research communities, including data mining (DM), database (DB), machine learning (ML), computer vision (CV), natural language processing (NLP), and theoretical computer science (TCS), thereby highlighting the interplay between modern computer science and artificial intelligence technologies. We first discuss the importance of high-dimensional similarity query processing in data science applications, and then review query processing algorithms such as cover tree, locality sensitive hashing, product quantization, proximity graphs, as well as recent advancements such as learned indexes. We analyze their strengths and weaknesses and discuss the selection of algorithms in various application scenarios. Moreover, we consider the selectivity estimation of high-dimensional similarity queries, and show how researchers are bringing in state-of-the-art ML techniques to address this problem. We expect that this tutorial will provide an impetus towards new technologies for data science.
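As a rough illustration of the kind of trade-off the abstract refers to, the sketch below contrasts an exact nearest neighbor query by linear scan with a simple random-hyperplane variant of locality sensitive hashing for cosine similarity. It is not taken from the tutorial: the class name `HyperplaneLSH`, the function names, the synthetic data, and the parameters (32 dimensions, 8 hash bits, 1000 points) are all assumptions chosen for illustration.

```python
import math
import random


def cosine_distance(a, b):
    """1 - cosine similarity between two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)


def knn_exact(data, query, k):
    """Exact k-nearest-neighbor query by linear scan over all points."""
    order = sorted(range(len(data)), key=lambda i: cosine_distance(data[i], query))
    return order[:k]


class HyperplaneLSH:
    """Hypothetical single-table LSH index using random hyperplanes."""

    def __init__(self, dim, num_bits, seed=0):
        rng = random.Random(seed)
        # Each hyperplane is a random Gaussian normal vector.
        self.planes = [[rng.gauss(0.0, 1.0) for _ in range(dim)]
                       for _ in range(num_bits)]
        self.buckets = {}

    def signature(self, v):
        # One bit per hyperplane: the side of the plane the vector falls on.
        return tuple(int(sum(p * x for p, x in zip(plane, v)) >= 0.0)
                     for plane in self.planes)

    def index(self, data):
        for i, v in enumerate(data):
            self.buckets.setdefault(self.signature(v), []).append(i)

    def candidates(self, query):
        # Only the query's own bucket is examined, so true neighbors
        # that hashed elsewhere can be missed (approximate answer).
        return self.buckets.get(self.signature(query), [])


# Synthetic dense vectors standing in for learned embeddings (assumption).
rng = random.Random(42)
data = [[rng.gauss(0.0, 1.0) for _ in range(32)] for _ in range(1000)]
query = list(data[7])  # query with a copy of a stored point

exact = knn_exact(data, query, k=5)

lsh = HyperplaneLSH(dim=32, num_bits=8)
lsh.index(data)
cands = lsh.candidates(query)
# Re-rank the (much smaller) candidate set with the true distance.
approx = sorted(cands, key=lambda i: cosine_distance(data[i], query))[:5]

print("exact top-5:", exact)
print("candidates scanned:", len(cands), "of", len(data))
print("approximate top-5:", approx)
```

The linear scan always returns the true neighbors but touches every point, while the hashed index scans only one bucket and may return fewer or different neighbors. Practical systems typically use multiple hash tables or one of the other index families named in the abstract to recover recall.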

Acknowledgements

This work was supported by Guangdong Basic and Applied Basic Research Foundation 2020B1515120028, Guangdong Pearl River Recruitment Program of Talents 2019ZT08X603, Shenzhen Continuous Support Grant 20200811104054002, HKUST Red Bird Visiting Scholar Program, JSPS Kakenhi 17H06099, 18H04093, and 19K11979, and ARC FT170100128 and DP210101393.

BibTeX

@article{2021High,
  title={High-Dimensional Similarity Query Processing for Data Science},
  author={Qin, Jianbin and Wang, Wei and Xiao, Chuan and Zhang, Ying and Wang, Yaoshu},
  year={2021},
}

