This is a source code for the algorithm described in the paper [DB-LSH: Locality-Sensitive Hashing with Query-based Dynamic Bucketing (submitted to ICDE 2022 Second round)]. We call it as DB-LSH project.
DB-LSH project is written by C++ and is simple and easy to use. It can be complied by g++ in Linux and MSVC in Windows.
We can use Visual Studio 2019 (Other version of Visual Studio should also work but remains untested) to build the project with importing all the files in the directory ./dbLSH/src/
.
cd ./dbLSH
make
The excutable file is then in dbLSH directory, called as dblsh
dblsh datasetName c k L K beta R_min(optinal)
(the first parameter specifies the procedure be executed and change)
- datasetName : dataset name
- c : >1, a float number, the approximate ratio
- k : 1-100, an integer, the number of returned points
- L : a positive integer, the number of indexes
- K : a positive integer, the dimensionality of an index
- beta : 0-1. a float number, the maximum ratio of the number of returned points to the total number of dataset
- R_min : a float number, the inital radius
when `beta>0` , DB-LSH finishes query with `beta` and other input parameters
when `-10<beta<=0`, DB-LSH finishes query with a series of different `beta` to generate results for recall/ratio-time curves
when `beta<-10` , DB-LSH finishes query with `beta=0.1` and varying 'k'
FOR EXAMPLE, YOU CAN RUN THE FOLLOWING CODE IN COMMAND LINE AFTER BUILD ALL THE TOOLS:
cd ./dbLSH
dblsh audio 1.5 50 5 10 0.1 0.3 # a single query
dblsh audio 1.5 50 5 10 -5 0.3 # vary beta
dblsh audio 1.5 50 5 10 -20 0.3 # vary k
In our project, the format of the input file (such as audio.data_new
, which is in float
data type) is the same as that in LSHBOX. It is a binary file but not a text file, because binary file has many advantages. The binary file is organized as the following format:
{Bytes of the data type (int)} {The size of the vectors (int)} {The dimension of the vectors (int)} {All of the binary vector, arranged in turn (float)}
For your application, you should also transform your dataset into this binary format, then rename it as [datasetName].data_new
and put it in the directory ./dataset
.
A sample dataset audio.data_new
has been put in the directory ./dataset
.
Also, you can get it, audio.data
, from here(if so, rename it as audio.data_new
). If the link is invalid, you can also get it from data.
For other dataset we use, you can get the raw data from following links: MNIST, Cifar, Trevi, NUS(Extraction code: hpxg), Deep1M, GIST, TinyImages80M, SIFT. Next, you should transform your raw dataset into the mentioned binary format, then rename it is [datasetName].data_new
and put it in the directory ./dataset
.
The experimental result is saved in the directory ./dataset/ANN
as the file DB-LSH_result.csv
DB-LSH project is developed by referring to LSHBOX and RStarTree. Great appreciation to the contributors.
Please use the following bibtex to cite this work when you use DB-LSH in your paper.
@inproceedings{DBLP:conf/icde/TianZZ22,
author = {Yao Tian and
Xi Zhao and
Xiaofang Zhou},
title = {{DB-LSH:} Locality-Sensitive Hashing with Query-based Dynamic Bucketing},
booktitle = {{ICDE}},
pages = {2250--2262},
publisher = {{IEEE}},
year = {2022}
}
If you meet any issue on the code or take interest in our work, please feel free to contact me ([email protected]) or YaoTian ([email protected]). Thank you.