Bpe based tokenizers #2872

DevinTDHa
Member

Description

WIP PR for review.
Changes since last merge:

  • The main BPE class is now abstract; subclasses are instantiated through the companion object
  • SpecialTokens was refactored and now has named properties (e.g. specialTokens.unk)
  • Added MosesTokenizer for XLM (not fully working yet)
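
A rough sketch of how the first two changes could fit together: an abstract BPE base class, a companion-object factory, and special tokens with named properties. Everything except the `specialTokens.unk` access pattern is hypothetical here (`BpeTokenizer`, `WhitespaceBpeTokenizer`, `forModel`, and the field names other than `unk` are illustrative and are not taken from this PR), and the merge loop is a simplified BPE that merges one lowest-ranked adjacent pair at a time.

```scala
// Hypothetical sketch: class, method, and field names are illustrative only.

// Special tokens exposed as named properties instead of positional entries.
case class SpecialTokens(start: String, end: String, unk: String, pad: String)

// Abstract base: shared BPE logic lives here; subclasses supply model-specific steps.
abstract class BpeTokenizer(
    val merges: Map[(String, String), Int],
    val specialTokens: SpecialTokens) {

  // Model-specific splitting of raw text into words.
  protected def preTokenize(text: String): Seq[String]

  // Simplified BPE: repeatedly merge the adjacent pair with the lowest rank.
  protected def bpe(word: String): Seq[String] = {
    var symbols = word.map(_.toString).toVector
    var done = symbols.length < 2
    while (!done) {
      val ranked = symbols.zip(symbols.tail).zipWithIndex.collect {
        case (pair, i) if merges.contains(pair) => (merges(pair), i)
      }
      if (ranked.isEmpty) done = true
      else {
        val (_, i) = ranked.min
        symbols = (symbols.take(i) :+ (symbols(i) + symbols(i + 1))) ++ symbols.drop(i + 2)
        done = symbols.length < 2
      }
    }
    symbols
  }

  def tokenize(text: String): Seq[String] = preTokenize(text).flatMap(bpe)
}

// One concrete subclass; a real model would define its own pre-tokenization.
class WhitespaceBpeTokenizer(merges: Map[(String, String), Int], specialTokens: SpecialTokens)
    extends BpeTokenizer(merges, specialTokens) {
  protected def preTokenize(text: String): Seq[String] =
    text.trim.split("\\s+").filter(_.nonEmpty).toSeq
}

// Companion object: callers never instantiate the abstract class directly.
object BpeTokenizer {
  def forModel(
      name: String,
      merges: Map[(String, String), Int],
      specialTokens: SpecialTokens): BpeTokenizer = name match {
    case "whitespace" => new WhitespaceBpeTokenizer(merges, specialTokens)
    case other => throw new IllegalArgumentException(s"Unknown model type: $other")
  }
}
```

With this shape, a caller would write something like `BpeTokenizer.forModel("whitespace", merges, tokens)` and read the unknown token as `tokenizer.specialTokens.unk`, so adding a new model only needs a new subclass and one more case in the factory.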

How Has This Been Tested?

Modified and added tests for the tokenizers

Types of changes

  • Bug fix (non-breaking change which fixes an issue)
  • Code improvements with no or little impact
  • New feature (non-breaking change which adds functionality)
  • Breaking change (fix or feature that would cause existing functionality to change)

@DevinTDHa added the labels new-feature (Introducing a new feature), on-hold (cannot be merged right away), Requires changes on May 13, 2021
@DevinTDHa force-pushed the BpeBasedTokenizers branch from daf70cc to 75d41e7 on May 14, 2021 06:18
@DevinTDHa force-pushed the BpeBasedTokenizers branch from 259d2ad to 8836a83 on May 14, 2021 10:29
@maziyarpanahi merged commit f009d6a into JohnSnowLabs:feature/saved-model-bundle-auto-wrapper on May 14, 2021

2 participants