Allow usage of tokenizedDocument in BERT tokenization #20

@bwdGitHub

We would like to use these issues to gauge user interest.

The BERT tokenizer is intended as an exact reimplementation of the original BERT tokenization. However, it is possible to replace bert.tokenizer.internal.BasicTokenizer with a tokenizer based on tokenizedDocument.

We expect this to have little effect on the model, because the wordpiece encoding is unchanged and the wordpiece-encoded sub-tokens are what the model receives as input.
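To illustrate the reasoning, here is a minimal sketch in Python (not the repository's MATLAB code). It assumes a toy vocabulary and two stand-in word-level tokenizers; `VOCAB`, `basic_tokenize`, `alternative_tokenize`, and `encode` are hypothetical names chosen for this example and are not part of BasicTokenizer or tokenizedDocument. It shows that when two word-level tokenizers agree on the word boundaries, the same wordpiece step yields the same sub-tokens.

```python
import re

# Hypothetical toy vocabulary; a real BERT vocabulary is far larger.
VOCAB = {"[UNK]", "token", "##izer", "the", "is", "fast", "."}


def wordpiece(word, vocab=VOCAB):
    """Greedy longest-match-first wordpiece split of a single word."""
    pieces, start = [], 0
    while start < len(word):
        end, piece = len(word), None
        while start < end:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate  # continuation pieces carry a ## prefix
            if candidate in vocab:
                piece = candidate
                break
            end -= 1
        if piece is None:
            return ["[UNK]"]  # no piece matched, so the whole word is unknown
        pieces.append(piece)
        start = end
    return pieces


def basic_tokenize(text):
    """Stand-in for a basic tokenizer: lowercase, split off every punctuation mark."""
    return re.findall(r"\w+|[^\w\s]", text.lower())


def alternative_tokenize(text):
    """Stand-in for a different word-level tokenizer (an assumption for this sketch):
    it keeps apostrophes and hyphens inside words."""
    return re.findall(r"[\w'-]+|[^\w\s]", text.lower())


def encode(text, tokenizer):
    """Apply a word-level tokenizer, then the shared wordpiece step."""
    return [piece for word in tokenizer(text) for piece in wordpiece(word)]


text = "The tokenizer is fast."
print(encode(text, basic_tokenize))        # ['the', 'token', '##izer', 'is', 'fast', '.']
print(encode(text, alternative_tokenize))  # same list as above
```

In this sketch the outputs differ only for inputs where the two word-level splits disagree, such as hyphenated words, which is where any effect on the model would come from.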

The advantages are that tokenizedDocument is considerably faster than BasicTokenizer and may integrate better with Text Analytics Toolbox functionality.

Labels: enhancement (New feature or request)
