Allow usage of `tokenizedDocument` in BERT tokenization #20

We would like to use these issues to gauge user interest.

The BERT tokenizer is intended as an identical reimplementation of the original BERT tokenization. However, it is possible to replace `bert.tokenizer.internal.BasicTokenizer` with a tokenizer built on `tokenizedDocument`.

We believe this should not affect the model much, because the WordPiece encoding stays the same, and it is the WordPiece-encoded sub-tokens that are the input to the model.
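To illustrate the reasoning, here is a minimal sketch in Python (chosen only to avoid a toolbox dependency; it is not this repository's MATLAB code). The vocabulary, `basic_tokenize`, and `alternative_tokenize` are hypothetical stand-ins for the BERT vocabulary, `BasicTokenizer`, and a `tokenizedDocument`-based tokenizer. The sub-token step is a simple greedy longest-match-first split, assumed here as the WordPiece behavior.

```python
import re

# Hypothetical toy vocabulary; a real BERT vocabulary is far larger.
VOCAB = {"[UNK]", "token", "##ization", "is", "fun", "!"}


def wordpiece(word, vocab=VOCAB, unk="[UNK]"):
    """Greedy longest-match-first split of one word into sub-tokens."""
    pieces, start = [], 0
    while start < len(word):
        end = len(word)
        piece = None
        while start < end:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate  # continuation marker
            if candidate in vocab:
                piece = candidate
                break
            end -= 1
        if piece is None:
            return [unk]  # no prefix matched: the whole word is unknown
        pieces.append(piece)
        start = end
    return pieces


def basic_tokenize(text):
    """Stand-in for BasicTokenizer: lowercase, split words and punctuation."""
    return re.findall(r"\w+|[^\w\s]", text.lower())


def alternative_tokenize(text):
    """Stand-in for a different word-level tokenizer (character scan)."""
    tokens, current = [], ""
    for ch in text.lower():
        if ch.isalnum():
            current += ch
            continue
        if current:
            tokens.append(current)
            current = ""
        if not ch.isspace():
            tokens.append(ch)
    if current:
        tokens.append(current)
    return tokens


def encode(text, tokenizer):
    """Word-level tokenization followed by the shared sub-token step."""
    return [piece for word in tokenizer(text) for piece in wordpiece(word)]


text = "Tokenization is fun!"
print(encode(text, basic_tokenize))
print(encode(text, alternative_tokenize))
```

Both calls give the same sub-tokens for this input because the two word-level tokenizers agree on the word boundaries. Where they disagree (in this sketch, on `snake_case`, which the first keeps whole and the second splits at the underscore), the sub-tokens differ as well, so a replacement tokenizer would need checking on such inputs.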

The advantages are that `tokenizedDocument` is considerably faster than `BasicTokenizer` and may integrate better with Text Analytics Toolbox functionality.
