Support tokenize function #45145

Closed
dujijun007 opened this issue May 6, 2024 · 1 comment · May be fixed by #45119 or #58965
Labels
no-issue-activity type/enhancement Make an enhancement to StarRocks X-stale

dujijun007 commented May 6, 2024

Enhancement

Why I'm doing:

StarRocks has implemented GIN (Generalized Inverted Index), which tokenizes field values into individual tokens and builds a dictionary from them. Users can then run different kinds of searches against this dictionary. However, because different tokenizers produce different tokens, it is not obvious to users how the original field text is actually split into tokens.

What I'm doing:

Support a tokenize function that lets users run a specific tokenizer on a piece of text and easily inspect the resulting tokens.

Description

Function definition

function tokenize(tokenizer_name: string, content: string) -> list of strings

Input and output

// input
tokenizer_name: must be one of the existing tokenizers. For now, only chinese, english, and standard are supported.
content: the text to tokenize. Note that the result is only as expected when the language of the text matches the specified tokenizer.

// output
tokens: the content split and analyzed by the tokenizer

Example

// tokenize with english
mysql> SELECT tokenize('english', 'Today is saturday');
+------------------------------------------+
| tokenize('english', 'Today is saturday') |
+------------------------------------------+
| ["today","is","saturday"]                |
+------------------------------------------+
1 row in set (0.00 sec)

// count word frequency
mysql> select unnest, count(*) as count
    -> from t_tokenized_table, unnest(tokenize('english', english_text)) as unnest
    -> group by unnest order by count;
+----------+-------+
| unnest   | count |
+----------+-------+
| world    |     1 |
| comes    |     1 |
| tap      |     1 |
| the      |     1 |
| from     |     1 |
| sea      |     1 |
| shanghai |     1 |
| water    |     1 |
| hello    |     2 |
+----------+-------+
9 rows in set (0.06 sec)
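
To make the semantics of the two examples concrete, here is a minimal Python sketch of what they compute. It is not the StarRocks implementation: the `tokenize` stand-in below only emulates the `english` case, and it does so under the assumption that tokenizing means lowercasing and splitting on non-alphanumeric characters, which is enough to reproduce the first example. The helper `word_frequency` is a hypothetical name that mirrors the `unnest` plus `group by` query.

```python
import re
from collections import Counter
from typing import Iterable, List

# Tokenizer names listed in this issue. Only "english" is emulated below.
SUPPORTED_TOKENIZERS = {"chinese", "english", "standard"}


def tokenize(tokenizer_name: str, content: str) -> List[str]:
    """Hypothetical stand-in for the proposed SQL function."""
    if tokenizer_name not in SUPPORTED_TOKENIZERS:
        raise ValueError(f"unknown tokenizer: {tokenizer_name!r}")
    if tokenizer_name != "english":
        raise NotImplementedError("this sketch only emulates 'english'")
    # Assumption: lowercase the text, then keep runs of letters and digits.
    return re.findall(r"[a-z0-9]+", content.lower())


def word_frequency(rows: Iterable[str]) -> Counter:
    """Mirror of the unnest + group by query: count tokens across all rows."""
    counts: Counter = Counter()
    for text in rows:
        counts.update(tokenize("english", text))
    return counts
```

A real tokenizer may apply further analysis, so its output can differ from this sketch for other inputs. That difference is exactly what the proposed function is meant to make visible.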

Notice

This function works independently and does not require a GIN to be built for the column. However, calling it at query time to tokenize massive amounts of data and build a dictionary is not advisable, because performance is poor. There is also no need for users to call tokenize explicitly to build a dictionary at write time, since building the GIN performs the same tokenization. The function is therefore best suited to troubleshooting search results that are difficult to understand.

dujijun007 added the type/enhancement label May 6, 2024

We have marked this issue as stale because it has been inactive for 6 months. If this issue is still relevant, removing the stale label or adding a comment will keep it active. Otherwise, we'll close it in 10 days to keep the issue queue tidy. Thank you for your contribution to StarRocks!
