Commit aaa1bc6

[CORE-4523] feat(Vector Search): Merge to 4.2.0 (#177)

* [GLE-8861] feat(vector): built-in TG function for pairwise vector embedding
* [GLE-8861] change euclidean to l2
* [GLE-8861] add missing range for foreach statements
* [GLE-8861] address comments
* [GLE-8861] add OR REPLACE for each GSQL function

---------

Co-authored-by: jue-yuan <[email protected]>
1 parent abd566c commit aaa1bc6

8 files changed: +490 -0 lines changed

gds/vector/cosine_distance.gsql

Lines changed: 67 additions & 0 deletions
@@ -0,0 +1,67 @@
CREATE OR REPLACE FUNCTION gds.vector.cosine_distance(list<double> list1, list<double> list2) RETURNS(float) {

    /*
    First Author: Jue Yuan
    First Commit Date: Nov 27, 2024

    Recent Author: Jue Yuan
    Recent Commit Date: Nov 27, 2024

    Maturity:
        alpha

    Description:
        Calculates the cosine distance between two vectors represented as lists of doubles.
        The cosine distance is derived from the cosine similarity and provides a measure of the angle
        between two non-zero vectors in a multi-dimensional space. A distance of 0 indicates vectors that
        point in the same direction, 1 indicates orthogonal vectors, and 2 indicates opposite vectors.

    Parameters:
        list<double> list1:
            The first vector as a list of double values.
        list<double> list2:
            The second vector as a list of double values.

    Returns:
        float:
            The cosine distance between the two input vectors.
    Exceptions:
        list_size_mismatch (90000):
            Raised when the input lists are not of equal size.
        zero_divisor (90001):
            Raised when either list is all zeros, to avoid division by zero.

    Logic Overview:
        Validates that both input vectors have the same length.
        Computes the inner (dot) product of the two vectors.
        Calculates the magnitudes (Euclidean norms) of both vectors.
        Returns the cosine distance as 1 - (inner product) / (product of magnitudes).

    Use Case:
        This function is commonly used in machine learning, natural language processing,
        and information retrieval tasks to quantify the similarity between vector representations,
        such as word embeddings or document feature vectors.
    */

    EXCEPTION list_size_mismatch (90000);
    EXCEPTION zero_divisor(90001);
    ListAccum<double> @@myList1 = list1;
    ListAccum<double> @@myList2 = list2;

    IF (@@myList1.size() != @@myList2.size()) THEN
        RAISE list_size_mismatch ("Two lists provided for gds.vector.cosine_distance have different sizes.");
    END;

    double inner_p = inner_product(@@myList1, @@myList2);
    double v1_magn = sqrt(inner_product(@@myList1, @@myList1));
    double v2_magn = sqrt(inner_product(@@myList2, @@myList2));
    IF (abs(v1_magn) < 0.0000001) THEN
        // use a small positive float to avoid numeric comparison error
        RAISE zero_divisor ("The elements in the first list are all zero. It will introduce a zero divisor.");
    END;
    IF (abs(v2_magn) < 0.0000001) THEN
        // use a small positive float to avoid numeric comparison error
        RAISE zero_divisor ("The elements in the second list are all zero. It will introduce a zero divisor.");
    END;
    RETURN (1 - inner_p / (v1_magn * v2_magn));
}
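
For cross-checking results outside TigerGraph, the logic overview of `gds.vector.cosine_distance` can be mirrored in a few lines of Python. This is a minimal sketch, not part of the commit: the function name, the `eps` parameter, and the choice of Python exception types are assumptions made for illustration, and it assumes each magnitude is checked against the same small threshold.

```python
import math


def cosine_distance(list1, list2, eps=1e-7):
    """Illustrative Python mirror of the GSQL cosine distance logic."""
    if len(list1) != len(list2):
        # stands in for the list_size_mismatch exception
        raise ValueError("lists have different sizes")

    inner_p = sum(a * b for a, b in zip(list1, list2))
    v1_magn = math.sqrt(sum(a * a for a in list1))
    v2_magn = math.sqrt(sum(b * b for b in list2))

    # stands in for the zero_divisor exception; each vector gets its own check
    if v1_magn < eps:
        raise ZeroDivisionError("first list is all zeros")
    if v2_magn < eps:
        raise ZeroDivisionError("second list is all zeros")

    return 1 - inner_p / (v1_magn * v2_magn)
```

With this sketch, orthogonal inputs such as `[1.0, 0.0]` and `[0.0, 1.0]` give a distance of 1, and opposite inputs such as `[1.0, 0.0]` and `[-1.0, 0.0]` give 2.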

gds/vector/dimension_count.gsql

Lines changed: 37 additions & 0 deletions
@@ -0,0 +1,37 @@
CREATE OR REPLACE FUNCTION gds.vector.dimension_count(list<double> list1) RETURNS(int) {

    /*
    First Author: Jue Yuan
    First Commit Date: Nov 27, 2024

    Recent Author: Jue Yuan
    Recent Commit Date: Nov 27, 2024

    Maturity:
        alpha

    Description:
        Returns the number of dimensions (elements) in a given vector, represented as a list of double values.
        This function is useful for determining the size or dimensionality of input vectors in mathematical
        and data processing operations.

    Parameters:
        list<double> list1:
            The input vector as a list of double values.

    Returns:
        int:
            The number of elements (dimensions) in the input vector.

    Logic Overview:
        Accepts a list of double values as input.
        Calculates the size of the list, which corresponds to the number of dimensions.
        Returns the size as an integer.
    Use Case:
        This function is valuable in vector-based computations, such as machine learning or data analysis tasks,
        where understanding the dimensionality of vectors is crucial for validation, preprocessing, or compatibility checks.
    */

    ListAccum<double> @@myList1 = list1;
    RETURN @@myList1.size();
}

gds/vector/distance.gsql

Lines changed: 99 additions & 0 deletions
@@ -0,0 +1,99 @@
CREATE OR REPLACE FUNCTION gds.vector.distance(list<double> list1, list<double> list2, string metric) RETURNS(float) {

    /*
    First Author: Jue Yuan
    First Commit Date: Nov 27, 2024

    Recent Author: Jue Yuan
    Recent Commit Date: Nov 27, 2024

    Maturity:
        alpha

    Description:
        Calculates the distance between two vectors represented as lists of double values,
        based on a specified distance metric. This function supports multiple metrics,
        allowing for flexible similarity or dissimilarity measurements in various computational tasks.

    Parameters:
        list<double> list1:
            The first vector as a list of double values.
        list<double> list2:
            The second vector as a list of double values.
        string metric:
            The distance metric to use (case-insensitive). Supported metrics are:
                "cosine": Cosine distance
                "l2": Euclidean distance
                "ip": Inner product (dot product)
    Returns:
        float:
            The computed distance between the two input vectors based on the specified metric.

    Exceptions:
        list_size_mismatch (90000):
            Raised when the input vectors are not of equal size.
        zero_divisor (90001):
            Raised when the metric is "cosine" and either list is all zeros, to avoid division by zero.
        invalid_metric_type (90002):
            Raised when an unsupported distance metric is provided.

    Logic Overview:
        Input Validation:
            Ensures both vectors have the same size.
        Metric Handling:
            Cosine Distance:
                Calculated as 1 - (inner product of vectors) / (product of magnitudes).
            L2 Distance:
                Computes the square root of the sum of squared differences between corresponding elements.
            Inner Product:
                Directly computes the dot product of the two vectors.

        Error Handling:
            Raises an exception if the provided metric is invalid.

    Use Case:
        This function is essential for machine learning, data science, and information retrieval applications,
        where distance or similarity calculations between vector representations (such as embeddings or feature vectors) are required.
    */

    EXCEPTION list_size_mismatch (90000);
    EXCEPTION zero_divisor(90001);
    EXCEPTION invalid_metric_type (90002);
    ListAccum<double> @@myList1 = list1;
    ListAccum<double> @@myList2 = list2;

    IF (@@myList1.size() != @@myList2.size()) THEN
        RAISE list_size_mismatch ("Two lists provided for gds.vector.distance have different sizes.");
    END;

    SumAccum<float> @@myResult;
    SumAccum<float> @@sqrSum;

    CASE lower(metric)
        WHEN "cosine" THEN
            double inner_p = inner_product(@@myList1, @@myList2);
            double v1_magn = sqrt(inner_product(@@myList1, @@myList1));
            double v2_magn = sqrt(inner_product(@@myList2, @@myList2));
            IF (abs(v1_magn) < 0.0000001) THEN
                // use a small positive float to avoid numeric comparison error
                RAISE zero_divisor ("The elements in the first list are all zero. It will introduce a zero divisor.");
            END;
            IF (abs(v2_magn) < 0.0000001) THEN
                // use a small positive float to avoid numeric comparison error
                RAISE zero_divisor ("The elements in the second list are all zero. It will introduce a zero divisor.");
            END;
            @@myResult = 1 - inner_p / (v1_magn * v2_magn);
        WHEN "l2" THEN
            FOREACH i IN RANGE [0, @@myList1.size() - 1 ] DO
                @@sqrSum += (@@myList1.get(i) - @@myList2.get(i)) * (@@myList1.get(i) - @@myList2.get(i));
            END;
            @@myResult = sqrt(@@sqrSum);
        WHEN "ip" THEN
            @@myResult = inner_product(@@myList1, @@myList2);
        ELSE
            RAISE invalid_metric_type ("Invalid metric algorithm provided, currently supported: cosine, l2 and ip.");
    END
    ;

    RETURN @@myResult;
}
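
The metric dispatch in `gds.vector.distance` can be sketched in Python to make the three branches easy to compare side by side. This is an illustrative reference only, not code from the commit: the function name, the `eps` parameter, and the Python exception types are assumptions.

```python
import math


def distance(list1, list2, metric, eps=1e-7):
    """Illustrative Python mirror of the GSQL metric dispatch."""
    if len(list1) != len(list2):
        # stands in for list_size_mismatch
        raise ValueError("lists have different sizes")

    name = metric.lower()  # the GSQL code also lower-cases the metric
    dot = sum(a * b for a, b in zip(list1, list2))

    if name == "cosine":
        v1_magn = math.sqrt(sum(a * a for a in list1))
        v2_magn = math.sqrt(sum(b * b for b in list2))
        if v1_magn < eps or v2_magn < eps:
            # stands in for zero_divisor
            raise ZeroDivisionError("a list is all zeros")
        return 1 - dot / (v1_magn * v2_magn)
    if name == "l2":
        # square root of the sum of squared element-wise differences
        return math.sqrt(sum((a - b) ** 2 for a, b in zip(list1, list2)))
    if name == "ip":
        return dot
    # stands in for invalid_metric_type
    raise ValueError("supported metrics: cosine, l2 and ip")
```

Note that only the "cosine" branch divides by the magnitudes, so an all-zero vector is a valid input for "l2" and "ip" in this sketch.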

gds/vector/elements_sum.gsql

Lines changed: 41 additions & 0 deletions
@@ -0,0 +1,41 @@
CREATE OR REPLACE FUNCTION gds.vector.elements_sum(list<double> list1) RETURNS(float) {

    /*
    First Author: Jue Yuan
    First Commit Date: Nov 27, 2024

    Recent Author: Jue Yuan
    Recent Commit Date: Nov 27, 2024

    Maturity:
        alpha

    Description:
        Calculates the sum of all elements in a vector, represented as a list of double values.
        This function is useful for aggregating vector components in mathematical and statistical operations.

    Parameters:
        list<double> list1:
            The input vector as a list of double values.

    Returns:
        float:
            The sum of all elements in the input vector.

    Logic Overview:
        Iterates through each element in the input list.
        Accumulates the sum of all elements.
        Returns the final sum as a floating-point value.

    Use Case:
        This function is valuable in various data processing tasks, such as computing vector norms,
        validating data integrity, or performing aggregations in machine learning and statistical analysis.
    */

    SumAccum<float> @@mySum;

    FOREACH i IN list1 DO
        @@mySum += i;
    END;
    RETURN @@mySum;
}

gds/vector/ip_distance.gsql

Lines changed: 58 additions & 0 deletions
@@ -0,0 +1,58 @@
CREATE OR REPLACE FUNCTION gds.vector.ip_distance(list<double> list1, list<double> list2) RETURNS(float) {

    /*
    First Author: Jue Yuan
    First Commit Date: Nov 27, 2024

    Recent Author: Jue Yuan
    Recent Commit Date: Nov 27, 2024

    Maturity:
        alpha

    Description:
        Calculates the inner product (dot product) between two vectors represented as lists of double values.
        The inner product equals the product of the two magnitudes and the cosine of the angle between the vectors.
        This function provides a similarity measure commonly used in machine learning and data analysis.

    Parameters:
        list<double> list1:
            The first vector as a list of double values.
        list<double> list2:
            The second vector as a list of double values.

    Returns:
        float:
            The inner product (dot product) of the two input vectors.

    Exceptions:
        list_size_mismatch (90000):
            Raised when the input vectors are not of equal size.

    Logic Overview:
        Input Validation:
            Ensures both vectors have the same length.
        Inner Product Calculation:
            Computes the sum of the element-wise products of the two vectors.

    Formula:
        Inner Product = (x1 * y1) + (x2 * y2) + ... + (xn * yn)
        Where xi and yi are elements of list1 and list2, respectively.

    Use Case:
        This function is widely used in:
            Calculating similarity in machine learning models (e.g., recommendation systems).
            Performing vector projections in linear algebra.
            Evaluating similarity between embeddings in natural language processing (NLP).
    */

    EXCEPTION list_size_mismatch (90000);
    ListAccum<double> @@myList1 = list1;
    ListAccum<double> @@myList2 = list2;

    IF (@@myList1.size() != @@myList2.size()) THEN
        RAISE list_size_mismatch ("Two lists provided for gds.vector.ip_distance have different sizes.");
    END;

    RETURN inner_product(@@myList1, @@myList2);
}
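
The formula in the header comment can be worked through with a small Python sketch. It is an illustration rather than code from the commit: `inner_product` here is a plain Python helper, and `normalize` is a hypothetical helper added only to show how the inner product behaves on unit-length vectors.

```python
import math


def inner_product(xs, ys):
    """Sum of element-wise products: (x1 * y1) + ... + (xn * yn)."""
    if len(xs) != len(ys):
        # stands in for list_size_mismatch
        raise ValueError("lists have different sizes")
    return sum(x * y for x, y in zip(xs, ys))


def normalize(xs):
    """Hypothetical helper: scale a non-zero vector to unit length."""
    norm = math.sqrt(inner_product(xs, xs))
    return [x / norm for x in xs]


# Worked example: (1 * 4) + (2 * 5) + (3 * 6) = 4 + 10 + 18 = 32
example = inner_product([1.0, 2.0, 3.0], [4.0, 5.0, 6.0])

# For unit-length vectors the inner product is the cosine similarity,
# so 1 - inner product is the cosine distance.
u = normalize([3.0, 4.0])
v = normalize([4.0, 3.0])
similarity = inner_product(u, v)
```

Because the inner product grows with similarity, a larger value means the vectors are closer, which is the reverse of the cosine and L2 distances.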

gds/vector/kth_element.gsql

Lines changed: 54 additions & 0 deletions
@@ -0,0 +1,54 @@
CREATE OR REPLACE FUNCTION gds.vector.kth_element(list<double> list1, int kth_index) RETURNS(float) {

    /*
    First Author: Jue Yuan
    First Commit Date: Nov 27, 2024

    Recent Author: Jue Yuan
    Recent Commit Date: Nov 27, 2024

    Maturity:
        alpha

    Description:
        Retrieves the k-th element from a vector, represented as a list of double values.
        This function ensures safe access by validating the index against the vector's size,
        preventing out-of-range errors.

    Parameters:
        list<double> list1:
            The input vector as a list of double values.
        int kth_index:
            The zero-based index of the element to retrieve.

    Returns:
        float:
            The value of the element at the specified k-th index in the input vector.

    Exceptions:
        out_of_range (90000):
            Raised when the specified index is negative or is greater than or equal to the size of the input vector.

    Logic Overview:
        Input Validation:
            Checks if the provided index is within the valid range (0 to list size - 1).
            Raises an exception if the index is out of range.
        Element Retrieval:
            Returns the element at the specified index.

    Use Case:
        This function is useful in scenarios where specific elements of a vector need to be accessed programmatically, such as:
            Extracting features from a dataset.
            Implementing custom vector operations in data processing pipelines.
            Accessing indexed components in mathematical computations.
    */

    EXCEPTION out_of_range (90000);

    ListAccum<double> @@myList1 = list1;
    IF (kth_index >= @@myList1.size() OR kth_index < 0) THEN
        RAISE out_of_range("Kth index provided for gds.vector.kth_element is out of the range of this list.");
    END;

    RETURN @@myList1.get(kth_index);
}
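
The bounds check in `gds.vector.kth_element` can be sketched in Python as follows. This is an illustrative mirror only: the function name and the use of `IndexError` in place of the `out_of_range` exception are assumptions. The explicit check matters in Python because a negative index would otherwise silently wrap around to the end of the list.

```python
def kth_element(values, kth_index):
    """Illustrative Python mirror of the zero-based, bounds-checked lookup."""
    # valid indices are 0 .. len(values) - 1; reject everything else
    if kth_index < 0 or kth_index >= len(values):
        # stands in for the out_of_range exception
        raise IndexError("kth index is out of the range of this list")
    return values[kth_index]
```

In this sketch an empty list rejects every index, since no value satisfies `0 <= kth_index < 0`.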
