Why I'm doing:
StarRocks has implemented GIN (Generalized Inverted Index), which works by tokenizing fields into individual tokens and building a dictionary from them. This allows users to perform various semantic searches against the dictionary. However, because several tokenizers are available, tokenization results can differ, and it is not always intuitive for users to see how the original field text is split into specific tokens.
What I'm doing:
Support a tokenize function that lets users apply a specific tokenizer and easily inspect its tokenization results.
Description
Function definition
function tokenize(tokenizer_name: string, content: string) -> list of strings
Input and output
// input
tokenizer_name: must be one of the existing tokenizers. For now, only chinese, english, and standard are supported.
content: the text to tokenize. Note that the expected results are only achieved when the language of the text matches the specified tokenizer.
// output
tokens: the tokens produced by splitting and analyzing the content with the specified tokenizer
Example
// tokenize with english
mysql> SELECT tokenize('english', 'Today is saturday');
+------------------------------------------+
| tokenize('english', 'Today is saturday') |
+------------------------------------------+
| ["today","is","saturday"] |
+------------------------------------------+
1 row in set (0.00 sec)
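For comparison, here is a hypothetical run with the standard tokenizer. The output shown is illustrative only: it assumes the standard tokenizer behaves like a typical standard analyzer, lowercasing the text and splitting on punctuation and whitespace.
// tokenize with standard (output is illustrative)
mysql> SELECT tokenize('standard', 'Hello, StarRocks!');
+-------------------------------------------+
| tokenize('standard', 'Hello, StarRocks!') |
+-------------------------------------------+
| ["hello","starrocks"]                     |
+-------------------------------------------+
1 row in set (0.00 sec)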
// count word frequency
mysql> SELECT unnest, count(*) AS count
    -> FROM t_tokenized_table, unnest(tokenize('english', english_text)) AS unnest
    -> GROUP BY unnest ORDER BY count;
+----------+-------+
| unnest | count |
+----------+-------+
| world | 1 |
| comes | 1 |
| tap | 1 |
| the | 1 |
| from | 1 |
| sea | 1 |
| shanghai | 1 |
| water | 1 |
| hello | 2 |
+----------+-------+
9 rows in set (0.06 sec)
Notice
This function works independently, without building a GIN index on the column. However, it is not advisable to invoke it at query time to tokenize massive amounts of data and construct a dictionary, as the performance would be poor. In fact, there is no need for users to call tokenize explicitly to build the dictionary at write time: building the GIN index applies the same tokenization behavior. This function is therefore most useful for troubleshooting search results that are difficult to understand.
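For instance, assuming GIN-backed full-text search is issued through the MATCH predicate (as with StarRocks' full-text inverted index), tokenize can explain a search that unexpectedly returns nothing. The session below is a sketch reusing the table from the example above; the id column and the empty result are assumed for illustration:
// a hypothetical search that returns no rows
mysql> SELECT id FROM t_tokenized_table WHERE english_text MATCH 'Hello,';
Empty set (0.00 sec)
// inspecting the tokenizer shows why: punctuation is stripped and tokens
// are lowercased, so the dictionary term to search for is "hello"
mysql> SELECT tokenize('english', 'Hello, world');
+-------------------------------------+
| tokenize('english', 'Hello, world') |
+-------------------------------------+
| ["hello","world"]                   |
+-------------------------------------+
1 row in set (0.00 sec)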