Support tokenize function #45145

dujijun007 · 2024-05-06T15:10:46Z

Enhancement

Why I'm doing:

StarRocks implements GIN (Generalized Inverted Index), which tokenizes field values into individual tokens and builds a dictionary from them. Users can then run different kinds of semantic search against this dictionary. However, because several tokenizers exist and each can split the same text differently, it is hard for users to see how the original field text was broken into specific tokens.

What I'm doing:

Support a tokenize function that lets users run a specific tokenizer on a piece of text and easily inspect the resulting tokens.

Description

Function definition

function tokenize(tokenizer_name: string, content: string) -> list of strings

Input and output

// input
tokenizer_name: must be one of the existing tokenizers. For now, only chinese, english, and standard are supported.
content: the text to tokenize. Note that the expected result is only achieved when the language of the text matches the specified tokenizer.

// output
tokens: the content, split and analyzed by the tokenizer
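
To make the contract above concrete, here is a minimal Python sketch of the proposed signature. It is not the StarRocks implementation: the name `tokenize_sketch` is hypothetical, and the "english" branch is an assumed stand-in (lowercase, then keep runs of ASCII letters and digits) chosen only because it reproduces the example in this issue. The real tokenizers may behave differently on other inputs.

```python
import re
from typing import List

# Tokenizer names listed in this proposal; anything else is rejected.
SUPPORTED_TOKENIZERS = ("chinese", "english", "standard")


def tokenize_sketch(tokenizer_name: str, content: str) -> List[str]:
    """Hypothetical model of tokenize(tokenizer_name, content) -> list of strings."""
    if tokenizer_name not in SUPPORTED_TOKENIZERS:
        raise ValueError(f"unsupported tokenizer: {tokenizer_name!r}")
    if tokenizer_name != "english":
        # Only the english stand-in is modelled in this sketch.
        raise NotImplementedError(f"{tokenizer_name} is not modelled here")
    # Assumption: lowercase the text, then keep runs of ASCII letters and digits.
    return re.findall(r"[a-z0-9]+", content.lower())


print(tokenize_sketch("english", "Today is saturday"))
# prints ['today', 'is', 'saturday']
```

The name check comes first so that an unknown tokenizer fails loudly instead of silently falling back to a default, which matches the requirement that tokenizer_name be limited to the existing tokenizers.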

Example

// tokenize with english
mysql> SELECT tokenize('english', 'Today is saturday');
+------------------------------------------+
| tokenize('english', 'Today is saturday') |
+------------------------------------------+
| ["today","is","saturday"]                |
+------------------------------------------+
1 row in set (0.00 sec)

// count word frequency
mysql> select unnest, count(*) as count
    -> from t_tokenized_table, unnest(tokenize('english', english_text)) as unnest
    -> group by unnest order by count;
+----------+-------+
| unnest   | count |
+----------+-------+
| world    |     1 |
| comes    |     1 |
| tap      |     1 |
| the      |     1 |
| from     |     1 |
| sea      |     1 |
| shanghai |     1 |
| water    |     1 |
| hello    |     2 |
+----------+-------+
9 rows in set (0.06 sec)
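
For readers less familiar with unnest, the word-frequency query can be read as two steps: flatten the per-row token arrays, then count each token. The Python sketch below mirrors that logic outside the database. The function name `token_frequencies` and the sample rows are hypothetical, and the tie-break by token is an addition for deterministic output (the SQL query orders by count only).

```python
from collections import Counter
from itertools import chain


def token_frequencies(token_lists):
    """Flatten per-row token lists (the unnest step) and count tokens (the GROUP BY step)."""
    counts = Counter(chain.from_iterable(token_lists))
    # Ascending by count, like ORDER BY count; ties are broken by token (an addition).
    return sorted(counts.items(), key=lambda item: (item[1], item[0]))


# Hypothetical rows, each already tokenized into a list of strings.
rows = [
    ["hello", "world"],
    ["hello", "shanghai"],
]

for token, count in token_frequencies(rows):
    print(token, count)
# prints:
# shanghai 1
# world 1
# hello 2
```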

Notice

This function works independently and does not require a GIN index on the column. However, calling it at query time to tokenize massive amounts of data and build a dictionary is not advisable because of the poor performance. Nor does the user need to call tokenize explicitly to build a dictionary at write time. Because the function tokenizes text the same way the index does, it is better suited to troubleshooting search results that are hard to understand.


github-actions · 2024-11-11T11:00:49Z

We have marked this issue as stale because it has been inactive for 6 months. If this issue is still relevant, removing the stale label or adding a comment will keep it active. Otherwise, we'll close it in 10 days to keep the issue queue tidy. Thank you for your contribution to StarRocks!

dujijun007 added the type/enhancement Make an enhancement to StarRocks label May 6, 2024

dujijun007 mentioned this issue May 6, 2024

[Enhancement] Support tokenize function #45119

Open

24 tasks

github-actions bot added the no-issue-activity label Nov 11, 2024

github-actions bot added the X-stale label Nov 25, 2024

github-actions bot closed this as completed Nov 25, 2024

INNOCENT-BOY mentioned this issue May 14, 2025

[Feature] Support tokenize function #58965

Open

24 tasks
