From a3e26cbd0a07709e5ae65a8e50b3d5b4103e69c6 Mon Sep 17 00:00:00 2001 From: David Myriel Date: Wed, 21 May 2025 10:22:29 +0200 Subject: [PATCH 1/3] add vectors document --- .../vectors}/embedding_intro.png | Bin docs/src/concepts/vectors.md | 44 +++++++++++++----- docs/src/guides/search/vector-search.md | 2 +- 3 files changed, 34 insertions(+), 12 deletions(-) rename docs/src/assets/{ => concepts/vectors}/embedding_intro.png (100%) diff --git a/docs/src/assets/embedding_intro.png b/docs/src/assets/concepts/vectors/embedding_intro.png similarity index 100% rename from docs/src/assets/embedding_intro.png rename to docs/src/assets/concepts/vectors/embedding_intro.png diff --git a/docs/src/concepts/vectors.md b/docs/src/concepts/vectors.md index a3165fe..0d6328b 100644 --- a/docs/src/concepts/vectors.md +++ b/docs/src/concepts/vectors.md @@ -3,28 +3,50 @@ title: Vector Embeddings in LanceDB | Complete Guide to Vector Representations description: Master vector embeddings in LanceDB with our comprehensive guide. Learn how to convert raw data into vector representations and understand the power of semantic similarity in vector space. --- -# Vector Embeddings in LanceDB +# Types of Vector Embeddings Supported by LanceDB -## Understanding Vector Representations +## Vectors -Modern machine learning models can be trained to convert raw data into embeddings, represented as arrays (or vectors) of floating point numbers of fixed dimensionality. What makes embeddings useful in practice is that the position of an embedding in vector space captures some of the semantics of the data, depending on the type of model and how it was trained. Points that are close to each other in vector space are considered similar (or appear in similar contexts), and points that are far away are considered dissimilar. +Modern machine learning models convert raw data into vector embeddings, which are fixed-length arrays of floating-point numbers. These embeddings capture semantic meaning through their position in vector space, where proximity indicates similarity. Points close together represent similar concepts, while distant points represent dissimilar ones. -Large datasets of multi-modal data (text, audio, images, etc.) can be converted into embeddings with the appropriate model. Projecting the vectors' principal components in 2D space results in groups of vectors that represent similar concepts clustering together, as shown below. +Multimodal data (text, audio, images, etc.) can be transformed into embeddings using appropriate models. -![](../assets/embedding_intro.png) +**Figure 1:** When projected into 2D space, semantically similar items will naturally cluster together. -## Multivector Type +![](../assets/concepts/vectors/embedding_intro.png) + +## Multivectors LanceDB natively supports multivector data types, enabling advanced search scenarios where a single data item is represented by multiple embeddings (e.g., using models like ColBERT -or CoLPali). In this framework, documents and queries are encoded as collections of +or CoLPali). + +In this framework, documents and queries are encoded as collections of contextualized vectors—precomputed for documents and indexed for queries. Key features: -- Indexing on multivector column: store and index multiple vectors per row. -- Supporint query being a single vector or multiple vectors -- Optimized search performance with [XTR](https://arxiv.org/abs/2501.17788) with improved recall. 
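To make the multivector workflow concrete, here is a minimal sketch of what creating and querying a multivector column can look like in the Python SDK. The paths, table name, dimensions, and random data are illustrative assumptions, and exact parameter names may vary slightly between SDK versions:

```python
import lancedb
import numpy as np
import pyarrow as pa

db = lancedb.connect("/tmp/multivector-demo")  # illustrative local path

dim = 128
schema = pa.schema([
    pa.field("id", pa.int64()),
    # each row stores a variable-length list of 128-d float32 vectors
    pa.field("vector", pa.list_(pa.list_(pa.float32(), dim))),
])
tbl = db.create_table("docs", schema=schema)

# every "document" is represented by several contextualized vectors
tbl.add([
    {"id": i, "vector": np.random.rand(8, dim).astype(np.float32)}
    for i in range(500)
])

# only the cosine metric is supported for multivector columns
tbl.create_index(
    metric="cosine",
    num_partitions=2,      # illustrative tuning values, not recommendations
    num_sub_vectors=16,
    vector_column_name="vector",
)

# the query can be a single vector or a collection of vectors
query = np.random.rand(4, dim).astype(np.float32)
results = tbl.search(query).limit(5).to_pandas()
```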
+ +- Indexing on multivector column: store and index multiple vectors per row +- Supporting queries with either a single vector or multiple vectors +- Optimized search performance with [XTR](https://arxiv.org/abs/2501.17788) with improved recall !!! info "Multivector Search Limitations" Currently, only the `cosine` metric is supported for multivector search. - The vector value type can be `float16`, `float32`, or `float64`. \ No newline at end of file + The vector value type can be `float16`, `float32`, or `float64`. + +## Binary Vectors + +LanceDB supports binary vectors - vectors composed solely of 0s and 1s that are commonly used to represent categorical data or presence/absence information in a compact way. Binary vectors in LanceDB are stored efficiently as uint8 arrays, with every 8 bits packed into a single byte. + +!!! note "Dimension Requirements" + Binary vector dimensions must be multiples of 8. For example, a 128-dimensional binary vector is stored as a uint8 array of size 16 (128/8 = 16 bytes). + +LanceDB provides specialized support for searching binary vectors using Hamming distance, which measures the number of positions at which two binary vectors differ. This makes binary vectors particularly efficient for: + +- Document fingerprinting and deduplication +- Binary hash codes for image similarity search +- Compressed vector representations for large-scale retrieval +- Categorical feature encoding + +While LanceDB's primary focus is on dense floating-point vectors for semantic search, its Apache Arrow-based architecture and hardware acceleration optimizations make it equally well-suited for binary vector operations. The compact nature of binary vectors combined with efficient Hamming distance calculations enables fast similarity comparisons while minimizing storage requirements. + diff --git a/docs/src/guides/search/vector-search.md b/docs/src/guides/search/vector-search.md index d025881..28155b4 100644 --- a/docs/src/guides/search/vector-search.md +++ b/docs/src/guides/search/vector-search.md @@ -350,7 +350,7 @@ an ANN search means that using an index often involves a trade-off between recal See the [IVF_PQ index](./concepts/index_ivfpq.md) for a deeper description of how `IVF_PQ` indexes work in LanceDB. -## Binary vector +## Search with Binary Vectors LanceDB supports binary vectors as a data type, and has the ability to search binary vectors with hamming distance. 
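As a rough illustration of the packing described here, the sketch below stores 256-bit binary vectors as 32-byte `uint8` arrays with `numpy.packbits` and ranks results by Hamming distance. The path, table, and column names are placeholders, and the distance-selection method may be named differently (for example `metric`) in older SDK versions:

```python
import lancedb
import numpy as np
import pyarrow as pa

db = lancedb.connect("/tmp/binary-demo")  # illustrative local path

ndims = 256  # binary vector dimensions must be a multiple of 8
schema = pa.schema([
    pa.field("id", pa.int64()),
    # 256 bits packed into 256 / 8 = 32 bytes
    pa.field("vector", pa.list_(pa.uint8(), ndims // 8)),
])
tbl = db.create_table("binary_demo", schema=schema)

# pack each 0/1 vector into a compact uint8 array
tbl.add([
    {"id": i, "vector": np.packbits(np.random.randint(0, 2, size=ndims))}
    for i in range(1024)
])

# query with a packed binary vector and rank by Hamming distance
query = np.packbits(np.random.randint(0, 2, size=ndims))
results = tbl.search(query).distance_type("hamming").limit(10).to_pandas()
```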
The binary vectors are stored as uint8 arrays (every 8 bits are stored as a byte): From 4fccf703b28540b4f920159d76dd28d244c8ff2e Mon Sep 17 00:00:00 2001 From: David Myriel Date: Wed, 21 May 2025 10:30:57 +0200 Subject: [PATCH 2/3] update storage document --- .../storage}/lancedb_storage_tradeoffs.png | Bin docs/src/concepts/storage.md | 52 +++++++++--------- 2 files changed, 26 insertions(+), 26 deletions(-) rename docs/src/assets/{ => concepts/storage}/lancedb_storage_tradeoffs.png (100%) diff --git a/docs/src/assets/lancedb_storage_tradeoffs.png b/docs/src/assets/concepts/storage/lancedb_storage_tradeoffs.png similarity index 100% rename from docs/src/assets/lancedb_storage_tradeoffs.png rename to docs/src/assets/concepts/storage/lancedb_storage_tradeoffs.png diff --git a/docs/src/concepts/storage.md b/docs/src/concepts/storage.md index b23e297..d8c3328 100644 --- a/docs/src/concepts/storage.md +++ b/docs/src/concepts/storage.md @@ -1,84 +1,84 @@ --- -title: LanceDB Storage Guide | Complete Guide to Data Persistence +title: LanceDB Storage Guide description: Master LanceDB's storage architecture with our comprehensive guide. Learn about local storage, cloud storage options, and best practices for efficient vector data management and persistence. --- -# Storage Architecture in LanceDB +# Choosing the Right Storage Backend for LanceDB -LanceDB is among the only vector databases built on top of multiple modular components designed from the ground-up to be efficient on disk. This gives it the unique benefit of being flexible enough to support multiple storage backends, including local NVMe, EBS, EFS and many other third-party APIs that connect to the cloud. +LanceDB is among the only vector databases built on top of multiple modular components designed from the ground up to be efficient on disk. This gives it the unique benefit of being flexible enough to support multiple storage backends, including local NVMe, EBS, EFS, and many other third-party APIs that connect to the cloud. It is important to understand the tradeoffs between cost and latency for your specific application and use case. This section will help you understand the tradeoffs between the different storage backends. ## Storage Backend Selection Guide -We've prepared a simple diagram to showcase the thought process that goes into choosing a storage backend when using LanceDB OSS, Cloud or Enterprise. +We've prepared a simple diagram to showcase the thought process that goes into choosing a storage backend when using LanceDB OSS, Cloud, or Enterprise. -![](../assets/lancedb_storage_tradeoffs.png) +![](../assets/concepts/storage/lancedb_storage_tradeoffs.png) When architecting your system, you'd typically ask yourself the following questions to decide on a storage option: -1. **Latency**: How fast do I need results? What do the p50 and also p95 look like? +1. **Latency**: How fast do I need results? What do the p50 and p95 look like? 2. **Scalability**: Can I scale up the amount of data and QPS easily? -3. **Cost**: To serve my application, what's the all-in cost of *both* storage and serving infra? +3. **Cost**: To serve my application, what's the all-in cost of *both* storage and serving infrastructure? 4. **Reliability/Availability**: How does replication work? Is disaster recovery addressed? ## Storage Backend Comparison -This section reviews the characteristics of each storage option in four dimensions: latency, scalability, cost and reliability. 
+This section reviews the characteristics of each storage option in four dimensions: latency, scalability, cost, and reliability. -**We begin with the lowest cost option, and end with the lowest latency option.** +**We begin with the lowest cost option and end with the lowest latency option.** ### 1. Object Storage (S3 / GCS / Azure Blob) !!! tip "Lowest cost, highest latency" - - **Latency** ⇒ Has the highest latency. p95 latency is also substantially worse than p50. In general you get results in the order of several hundred milliseconds - - **Scalability** ⇒ Infinite on storage, however, QPS will be limited by S3 concurrency limits + - **Latency** ⇒ Has the highest latency. p95 latency is also substantially worse than p50. In general, you get results in the order of several hundred milliseconds + - **Scalability** ⇒ Infinite on storage; however, QPS will be limited by S3 concurrency limits - **Cost** ⇒ Lowest (order of magnitude cheaper than other options) - - **Reliability/Availability** ⇒ Highly available, as blob storage like S3 are critical infrastructure that form the backbone of the internet. + - **Reliability/Availability** ⇒ Highly available, as blob storage like S3 is critical infrastructure that forms the backbone of the internet -Another important point to note is that LanceDB is designed to separate storage from compute, and the underlying Lance format stores the data in numerous immutable fragments. Due to these factors, LanceDB is a great storage option that addresses the _N + 1_ query problem. i.e., when a high query throughput is required, query processes can run in a stateless manner and be scaled up and down as needed. +Another important point to note is that LanceDB is designed to separate storage from compute, and the underlying Lance format stores the data in numerous immutable fragments. Due to these factors, LanceDB is a great storage option that addresses the _N + 1_ query problem, i.e., when high query throughput is required, query processes can run in a stateless manner and be scaled up and down as needed. ### 2. File Storage (EFS / GCS Filestore / Azure File) !!! info "Moderately low cost, moderately low latency (<100ms)" - - **Latency** ⇒ Much better than object/blob storage but not as good as EBS/Local disk; < 100ms p95 achievable - - **Scalability** ⇒ High, but the bottleneck will be the IOPs limit, but when scaling you can provision multiple EFS volumes - - **Cost** ⇒ Significantly more expensive than S3 but still very cost effective compared to in-memory dbs. Inactive data in EFS is also automatically tiered to S3-level costs. - - **Reliability/Availability** ⇒ Highly available, as query nodes can go down without affecting EFS. However, EFS does not provide replication / backup - this must be managed manually. + - **Latency** ⇒ Much better than object/blob storage but not as good as EBS/Local disk; <100ms p95 achievable + - **Scalability** ⇒ High, but the bottleneck will be the IOPS limit; when scaling, you can provision multiple EFS volumes + - **Cost** ⇒ Significantly more expensive than S3 but still very cost-effective compared to in-memory databases. Inactive data in EFS is also automatically tiered to S3-level costs + - **Reliability/Availability** ⇒ Highly available, as query nodes can go down without affecting EFS. However, EFS does not provide replication/backup - this must be managed manually A recommended best practice is to keep a copy of the data on S3 for disaster recovery scenarios. 
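For reference, switching between these backends is mostly a matter of the URI passed to `lancedb.connect`. The paths and bucket name below are placeholders, and object-store credentials and region are assumed to come from the environment:

```python
import lancedb

# block or local storage (EBS, NVMe): point at a filesystem path
local_db = lancedb.connect("/mnt/nvme/lancedb")

# shared file storage (EFS and similar) looks like any other mounted path
shared_db = lancedb.connect("/mnt/efs/lancedb")

# object storage: point at an S3 / GCS / Azure URI instead
object_db = lancedb.connect("s3://my-bucket/lancedb")  # hypothetical bucket
```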
If any downtime is unacceptable, then you would need another EFS with a copy of the data. This is still much cheaper than EC2 instances holding multiple copies of the data. ### 3. Third-party Storage Solutions -Solutions like [MinIO](https://blog.min.io/lancedb-trusted-steed-against-data-complexity/), WekaFS, etc. that deliver S3 compatible API with much better performance than S3. +Solutions like [MinIO](https://blog.min.io/lancedb-trusted-steed-against-data-complexity/), WekaFS, etc., that deliver S3-compatible API with much better performance than S3. !!! info "Moderately low cost, moderately low latency (<100ms)" - **Latency** ⇒ Should be similar latency to EFS, better than S3 (<100ms) - **Scalability** ⇒ Up to the solutions architect, who can add as many nodes to their MinIO or other third-party provider's cluster as needed - **Cost** ⇒ Definitely higher than S3. The cost can be marginally higher than EFS until you get to maybe >10TB scale with high utilization - - **Reliability/Availability** ⇒ These are all shareable by lots of nodes, quality/cost of replication/backup depends on the vendor + - **Reliability/Availability** ⇒ These are all shareable by lots of nodes; quality/cost of replication/backup depends on the vendor ### 4. Block Storage (EBS / GCP Persistent Disk / Azure Managed Disk) !!! info "Very low latency (<30ms), higher cost" - **Latency** ⇒ Very good, pretty close to local disk. You're looking at <30ms latency in most cases - - **Scalability** ⇒ EBS is not shareable between instances. If deployed via k8s, it can be shared between pods that live on the same instance, but beyond that you would need to shard data or make an additional copy - - **Cost** ⇒ Higher than EFS. There are some hidden costs to EBS as well if you're paying for IO. - - **Reliability/Availability** ⇒ Not shareable between instances but can be shared between pods on the same instance. Survives instance termination. No automatic backups. + - **Scalability** ⇒ EBS is not shareable between instances. If deployed via k8s, it can be shared between pods that live on the same instance, but beyond that, you would need to shard data or make an additional copy + - **Cost** ⇒ Higher than EFS. There are some hidden costs to EBS as well if you're paying for I/O + - **Reliability/Availability** ⇒ Not shareable between instances but can be shared between pods on the same instance. Survives instance termination. No automatic backups -Just like EFS, an EBS or persistent disk setup requires more manual work to manage data sharding, backups and capacity. +Just like EFS, an EBS or persistent disk setup requires more manual work to manage data sharding, backups, and capacity. ### 5. Local Storage (SSD/NVMe) !!! danger "Lowest latency (<10ms), highest cost" - **Latency** ⇒ Lowest latency with modern NVMe drives, <10ms p95 - - **Scalability** ⇒ Difficult to scale on cloud. Also need additional copies / sharding if QPS needs to be higher + - **Scalability** ⇒ Difficult to scale on cloud. Also need additional copies/sharding if QPS needs to be higher - **Cost** ⇒ Highest cost; the main issue with keeping your application and storage tightly integrated is that it's just not really possible to scale this up in cloud environments - - **Reliability/Availability** ⇒ If the instance goes down, so does your data. You have to be _very_ diligent about backing up your data + - **Reliability/Availability** ⇒ If the instance goes down, so does your data. 
You have to be *very* diligent about backing up your data -As a rule of thumb, local disk should be your storage option if you require absolutely *crazy low* latency and you're willing to do a bunch of data management work to make it happen. +As a rule of thumb, local disk should be your storage option if you require absolutely **the lowest possible** latency and you're willing to do a bunch of data management work to make it happen. From 5b95796a1d1fc3f4574c8e9afcf1a398cf36db2f Mon Sep 17 00:00:00 2001 From: David Myriel Date: Wed, 21 May 2025 16:44:28 +0200 Subject: [PATCH 3/3] add index docs --- docs/src/concepts/indexing.md | 139 +++++++---------------- docs/src/guides/indexing/scalar-index.md | 29 ----- 2 files changed, 43 insertions(+), 125 deletions(-) diff --git a/docs/src/concepts/indexing.md b/docs/src/concepts/indexing.md index c476624..c8efbc8 100644 --- a/docs/src/concepts/indexing.md +++ b/docs/src/concepts/indexing.md @@ -1,130 +1,77 @@ --- -title: Vector Indexing in LanceDB | IVF-PQ & HNSW Index Guide +title: Vector Indexes in LanceDB description: Master LanceDB's vector indexing with our comprehensive guide. Learn about IVF-PQ and HNSW indexes, product quantization, clustering, and performance optimization for large-scale vector search. --- -# Vector Indexing in LanceDB +# Indexing Data in LanceDB -## Understanding IVF-PQ Index +An **index** makes embeddings searchable by organizing them in efficient data structures for fast lookups. Without an index, searching through embeddings would require scanning through every single vector in the dataset, which becomes prohibitively slow as the dataset grows. -An ANN (Approximate Nearest Neighbors) index is a data structure that represents data in a way that makes it more efficient to search and retrieve. Using an ANN index is faster, but less accurate than kNN or brute force search because, in essence, the index is a lossy representation of the data. +LanceDB offers a number of indexes, including **IVF, HNSW, Scalar Index and Full-Text Index**. However, a key distinguishing feature of LanceDB is it uses a composite index: **IVF_PQ**, which is a variant of the **Inverted File Index (IVF) that uses Product Quantization (PQ)** to compress the embeddings. -LanceDB is fundamentally different from other vector databases in that it is built on top of [Lance](https://github.com/lancedb/lance), an open-source columnar data format designed for performant ML workloads and fast random access. Due to the design of Lance, LanceDB's indexing philosophy adopts a primarily *disk-based* indexing philosophy. +!!! note "Disk-Based Indexing" + LanceDB is fundamentally different from other vector databases in that it is built on top of [Lance](https://github.com/lancedb/lance), an open-source columnar data format designed for performant ML workloads and fast random access. Due to the design of Lance, LanceDB's indexing philosophy adopts a primarily *disk-based* indexing philosophy. -## IVF-PQ +## Inverted File Index -IVF-PQ is a composite index that combines inverted file index (IVF) and product quantization (PQ). The implementation in LanceDB provides several parameters to fine-tune the index's size, query throughput, latency and recall, which are described later in this section. +**IVF Flat Index**: This index stores raw vectors. These vectors are grouped into partitions of similar vectors. Each partition keeps track of a centroid which is the average value of all vectors in the group. 
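As a minimal sketch of how an IVF-based vector index is created through the Python SDK (the path, table, and column names are illustrative, and the tuning values are examples rather than recommendations):

```python
import lancedb

db = lancedb.connect("/tmp/index-demo")  # illustrative path
tbl = db.open_table("docs")              # assumes a table with a "vector" column

# IVF_PQ: partition the vectors, then compress each one with product quantization
tbl.create_index(
    metric="cosine",
    num_partitions=256,   # number of IVF partitions (centroids)
    num_sub_vectors=16,   # number of PQ sub-vectors per embedding
    vector_column_name="vector",
)
```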
-### Product Quantization +**IVF-PQ Index**: IVF-PQ is a composite index that combines the Inverted File iIndex (IVF) and product quantization (PQ). The implementation in LanceDB provides several parameters to fine-tune the index's size, query throughput, latency and recall. -Quantization is a compression technique used to reduce the dimensionality of an embedding to speed up search. +## HNSW Index -Product quantization (PQ) works by dividing a large, high-dimensional vector of size into equally sized subvectors. Each subvector is assigned a "reproduction value" that maps to the nearest centroid of points for that subvector. The reproduction values are then assigned to a codebook using unique IDs, which can be used to reconstruct the original vector. +HNSW (Hierarchically Navigable Small Worlds) is a graph-based algorithm. All graph-based search algorithms rely on the idea of a k-nearest neighbor (or k-approximate nearest neighbor) graph. HNSW also combines this with the ideas behind a classic 1-dimensional search data structure: the skip list. -![](../assets/ivfpq_pq_desc.png) +HNSW is one of the most accurate and fastest ANN search algorithms, It's beneficial in high-dimensional spaces where finding the same nearest neighbor would be too slow and costly. -It's important to remember that quantization is a *lossy process*, i.e., the reconstructed vector is not identical to the original vector. This results in a trade-off between the size of the index and the accuracy of the search results. +LanceDB currently supports HNSW_PQ and HNSW_SQL -As an example, consider starting with 128-dimensional vector consisting of 32-bit floats. Quantizing it to an 8-bit integer vector with 4 dimensions as in the image above, we can significantly reduce memory requirements. +**HNSW-PQ** is a variant of the HNSW algorithm that uses product quantization to compress the vectors. -!!! example "Effect of quantization" +**HNSW_SQ** is a variant of the HNSW algorithm that uses scalar quantization to compress the vectors. - Original: `128 × 32 = 4096` bits - Quantized: `4 × 8 = 32` bits +## Scalar Index - Quantization results in a **128x** reduction in memory requirements for each vector in the index, which is substantial. +Scalar indexes organize data by scalar attributes (e.g. numbers, categorical values), enabling fast filtering of vector data. In vector databases, scalar indices accelerate the retrieval of scalar data associated with vectors, thus enhancing the query performance when searching for vectors that meet certain scalar criteria. -### Inverted File Index (IVF) Implementation +Similar to many SQL databases, LanceDB supports several types of scalar indices to accelerate search +over scalar columns. -While PQ helps with reducing the size of the index, IVF primarily addresses search performance. The primary purpose of an inverted file index is to facilitate rapid and effective nearest neighbor search by narrowing down the search space. +- `BTREE`: The most common type is BTREE. The index stores a copy of the + column in sorted order. This sorted copy allows a binary search to be used to + satisfy queries. +- `BITMAP`: this index stores a bitmap for each unique value in the column. It + uses a series of bits to indicate whether a value is present in a row of a table +- `LABEL_LIST`: a special index that can be used on `List` columns to + support queries with `array_contains_all` and `array_contains_any` + using an underlying bitmap index. + For example, a column that contains lists of tags (e.g. 
`["tag1", "tag2", "tag3"]`) can be indexed with a `LABEL_LIST` index. -In IVF, the PQ vector space is divided into *Voronoi cells*, which are essentially partitions that consist of all the points in the space that are within a threshold distance of the given region's seed point. These seed points are initialized by running K-means over the stored vectors. The centroids of K-means turn into the seed points which then each define a region. These regions are then are used to create an inverted index that correlates each centroid with a list of vectors in the space, allowing a search to be restricted to just a subset of vectors in the index. +!!! tips "Which Scalar Index to Use?" -![](../assets/ivfpq_ivf_desc.webp) + `BTREE`: This index is good for scalar columns with mostly distinct values and does best when the query is highly selective. + + `BITMAP`: This index works best for low-cardinality numeric or string columns, where the number of unique values is small (i.e., less than a few thousands). + + `LABEL_LIST`: This index should be used for columns containing list-type data. -During query time, depending on where the query lands in vector space, it may be close to the border of multiple Voronoi cells, which could make the top-k results ambiguous and span across multiple cells. To address this, the IVF-PQ introduces the `nprobe` parameter, which controls the number of Voronoi cells to search during a query. The higher the `nprobe`, the more accurate the results, but the slower the query. +| Data Type | Filter | Index Type | +| --------------------------------------------------------------- | ----------------------------------------- | ------------ | +| Numeric, String, Temporal | `<`, `=`, `>`, `in`, `between`, `is null` | `BTREE` | +| Boolean, numbers or strings with fewer than 1,000 unique values | `<`, `=`, `>`, `in`, `between`, `is null` | `BITMAP` | +| List of low cardinality of numbers or strings | `array_has_any`, `array_has_all` | `LABEL_LIST` | -![](../assets/ivfpq_query_vector.webp) +## Full-Text Index (FTS) -## HNSW Index Implementation -Approximate Nearest Neighbor (ANN) search is a method for finding data points near a given point in a dataset, though not always the exact nearest one. HNSW is one of the most accurate and fastest Approximate Nearest Neighbour search algorithms, It's beneficial in high-dimensional spaces where finding the same nearest neighbor would be too slow and costly. -### Types of ANN Search Algorithms - -Approximate Nearest Neighbor (ANN) search is a method for finding data points near a given point in a dataset, though not always the exact nearest one. HNSW is one of the most accurate and fastest Approximate Nearest Neighbour search algorithms, It's beneficial in high-dimensional spaces where finding the same nearest neighbor would be too slow and costly - - -There are three main types of ANN search algorithms: - -* **Tree-based search algorithms**: Use a tree structure to organize and store data points. -* **Hash-based search algorithms**: Use a specialized geometric hash table to store and manage data points. These algorithms typically focus on theoretical guarantees, and don't usually perform as well as the other approaches in practice. -* **Graph-based search algorithms**: Use a graph structure to store data points, which can be a bit complex. - -HNSW is a graph-based algorithm. All graph-based search algorithms rely on the idea of a k-nearest neighbor (or k-approximate nearest neighbor) graph, which we outline below. 
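The HNSW variants described above are selected through the index type at creation time. A hedged sketch follows; the `index_type` strings and other names are assumptions about the Python SDK and may differ by version:

```python
import lancedb

db = lancedb.connect("/tmp/index-demo")  # illustrative path
tbl = db.open_table("docs")              # assumes a table with a "vector" column

# HNSW graph with scalar quantization; "IVF_HNSW_PQ" would select the PQ variant
tbl.create_index(
    metric="cosine",
    index_type="IVF_HNSW_SQ",
    vector_column_name="vector",
)
```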
-HNSW also combines this with the ideas behind a classic 1-dimensional search data structure: the skip list. - -### Understanding k-Nearest Neighbor Graphs - -The k-nearest neighbor graph actually predates its use for ANN search. Its construction is quite simple: - -* Each vector in the dataset is given an associated vertex. -* Each vertex has outgoing edges to its k nearest neighbors. That is, the k closest other vertices by Euclidean distance between the two corresponding vectors. This can be thought of as a "friend list" for the vertex. -* For some applications (including nearest-neighbor search), the incoming edges are also added. - -Eventually, it was realized that the following greedy search method over such a graph typically results in good approximate nearest neighbors: - -* Given a query vector, start at some fixed "entry point" vertex (e.g. the approximate center node). -* Look at that vertex's neighbors. If any of them are closer to the query vector than the current vertex, then move to that vertex. -* Repeat until a local optimum is found. - -The above algorithm also generalizes to e.g. top 10 approximate nearest neighbors. - -Computing a k-nearest neighbor graph is actually quite slow, taking quadratic time in the dataset size. It was quickly realized that near-identical performance can be achieved using a k-approximate nearest neighbor graph. That is, instead of obtaining the k-nearest neighbors for each vertex, an approximate nearest neighbor search data structure is used to build much faster. -In fact, another data structure is not needed: This can be done "incrementally". -That is, if you start with a k-ANN graph for n-1 vertices, you can extend it to a k-ANN graph for n vertices as well by using the graph to obtain the k-ANN for the new vertex. - -One downside of k-NN and k-ANN graphs alone is that one must typically build them with a large value of k to get decent results, resulting in a large index. - -### Hierarchical Navigable Small Worlds (HNSW) - -HNSW builds on k-ANN in two main ways: - -* Instead of getting the k-approximate nearest neighbors for a large value of k, it sparsifies the k-ANN graph using a carefully chosen "edge pruning" heuristic, allowing for the number of edges per vertex to be limited to a relatively small constant. -* The "entry point" vertex is chosen dynamically using a recursively constructed data structure on a subset of the data, similarly to a skip list. - -This recursive structure can be thought of as separating into layers: - -* At the bottom-most layer, an k-ANN graph on the whole dataset is present. -* At the second layer, a k-ANN graph on a fraction of the dataset (e.g. 10%) is present. -* At the Lth layer, a k-ANN graph is present. It is over a (constant) fraction (e.g. 10%) of the vectors/vertices present in the L-1th layer. - -Then the greedy search routine operates as follows: - -* At the top layer (using an arbitrary vertex as an entry point), use the greedy local search routine on the k-ANN graph to get an approximate nearest neighbor at that layer. -* Using the approximate nearest neighbor found in the previous layer as an entry point, find an approximate nearest neighbor in the next layer with the same method. -* Repeat until the bottom-most layer is reached. Then use the entry point to find multiple nearest neighbors (e.g. top 10). - -## Index Management and Maintenance - -Embeddings for a given dataset are made searchable via an **index**. 
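To complement the Full-Text Index (FTS) section above, here is a minimal sketch of building and querying an FTS index in Python. The path, table, and column names are placeholders:

```python
import lancedb

db = lancedb.connect("/tmp/fts-demo")  # illustrative path
tbl = db.create_table("docs", data=[
    {"id": 1, "text": "frodo was a happy puppy"},
    {"id": 2, "text": "there are several kittens playing"},
])

# build a full-text index over the "text" column
tbl.create_fts_index("text")

# keyword search instead of (or alongside) vector search
hits = tbl.search("puppy").limit(5).to_pandas()
```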
The index is constructed by using data structures that store the embeddings such that it's very efficient to perform scans and lookups on them. A key distinguishing feature of LanceDB is it uses a disk-based index: IVF-PQ, which is a variant of the Inverted File Index (IVF) that uses Product Quantization (PQ) to compress the embeddings. - -### Reindexing Process +## Reindexing and Incremental Indexing Reindexing is the process of updating the index to account for new data, keeping good performance for queries. This applies to either a full-text search (FTS) index or a vector index. For ANN search, new data will always be included in query results, but queries on tables with unindexed data will fallback to slower search methods for the new parts of the table. This is another important operation to run periodically as your data grows, as it also improves performance. This is especially important if you're appending large amounts of data to an existing dataset. !!! tip When adding new data to a dataset that has an existing index (either FTS or vector), LanceDB doesn't immediately update the index until a reindex operation is complete. -Both LanceDB OSS and Cloud support reindexing, but the process (at least for now) is different for each, depending on the type of index. - -When a reindex job is triggered in the background, the entire data is reindexed, but in the interim as new queries come in, LanceDB will combine results from the existing index with exhaustive kNN search on the new data. This is done to ensure that you're still searching on all your data, but it does come at a performance cost. The more data that you add without reindexing, the impact on latency (due to exhaustive search) can be noticeable. - -#### Vector Index Reindexing - -* LanceDB Cloud supports incremental reindexing, where a background process will trigger a new index build for you automatically when new data is added to a dataset -* LanceDB OSS requires you to manually trigger a reindex operation -- we are working on adding incremental reindexing to LanceDB OSS as well - -#### FTS Index Reindexing +> Both **LanceDB OSS, Cloud and Enterprise** support reindexing, but the process (at least for now) is different for each, depending on the type of index. -FTS reindexing is supported in both LanceDB OSS and Cloud, but requires that it's manually rebuilt once you have a significant enough amount of new data added that needs to be reindexed. We [updated](https://github.com/lancedb/lancedb/pull/762) Tantivy's default heap size from 128MB to 1GB in LanceDB to make it much faster to reindex, by up to 10x from the default settings. +When a reindex job is triggered in the background, the entire data is reindexed, but in the interim as new queries come in, LanceDB will combine results from the existing index with exhaustive kNN search on the new data. This is done to ensure that you're still searching on all your data, but it does come at a performance cost. The more data that you add without reindexing, the impact on latency (due to exhaustive search) can be noticeable. \ No newline at end of file diff --git a/docs/src/guides/indexing/scalar-index.md b/docs/src/guides/indexing/scalar-index.md index 4d2cf9b..cdba055 100644 --- a/docs/src/guides/indexing/scalar-index.md +++ b/docs/src/guides/indexing/scalar-index.md @@ -237,35 +237,6 @@ LanceDB supports scalar indices on UUID columns (stored as `FixedSizeBinary(16)` OSS______ -Scalar indices organize data by scalar attributes (e.g. 
numbers, categorical values), enabling fast filtering of vector data. In vector databases, scalar indices accelerate the retrieval of scalar data associated with vectors, thus enhancing the query performance when searching for vectors that meet certain scalar criteria. - -Similar to many SQL databases, LanceDB supports several types of scalar indices to accelerate search -over scalar columns. - -- `BTREE`: The most common type is BTREE. The index stores a copy of the - column in sorted order. This sorted copy allows a binary search to be used to - satisfy queries. -- `BITMAP`: this index stores a bitmap for each unique value in the column. It - uses a series of bits to indicate whether a value is present in a row of a table -- `LABEL_LIST`: a special index that can be used on `List` columns to - support queries with `array_contains_all` and `array_contains_any` - using an underlying bitmap index. - For example, a column that contains lists of tags (e.g. `["tag1", "tag2", "tag3"]`) can be indexed with a `LABEL_LIST` index. - -!!! tips "How to choose the right scalar index type" - - `BTREE`: This index is good for scalar columns with mostly distinct values and does best when the query is highly selective. - - `BITMAP`: This index works best for low-cardinality numeric or string columns, where the number of unique values is small (i.e., less than a few thousands). - - `LABEL_LIST`: This index should be used for columns containing list-type data. - -| Data Type | Filter | Index Type | -| --------------------------------------------------------------- | ----------------------------------------- | ------------ | -| Numeric, String, Temporal | `<`, `=`, `>`, `in`, `between`, `is null` | `BTREE` | -| Boolean, numbers or strings with fewer than 1,000 unique values | `<`, `=`, `>`, `in`, `between`, `is null` | `BITMAP` | -| List of low cardinality of numbers or strings | `array_has_any`, `array_has_all` | `LABEL_LIST` | - ### Create a scalar index === "Python"
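    A minimal sketch, assuming a table with these columns; the path, table name, and `index_type` values are illustrative and may vary by SDK version:

    ```python
    import lancedb

    db = lancedb.connect("/tmp/scalar-demo")  # illustrative path
    tbl = db.open_table("books")              # assumes a table with these columns

    # BTREE (the default) suits columns with mostly distinct values
    tbl.create_scalar_index("book_id")

    # BITMAP suits low-cardinality columns such as categories
    tbl.create_scalar_index("genre", index_type="BITMAP")
    ```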