310 release candidate #2907

Merged: 194 commits, Jun 7, 2021
Commits
0278e0d
Added TensorFlowWrapper refactoring
wolliq Mar 29, 2021
3af61df
Removing duplicated method to extract TF variables from TFWrapper
wolliq Mar 29, 2021
2b9058d
Added refactoring for SavedModelBundle loader
wolliq Mar 30, 2021
74f6401
Added refactoring for helper methods
wolliq Mar 30, 2021
c6c9d20
Added custom signature for tf model auto wrapping
wolliq Apr 1, 2021
38fde6b
Added BERT py local train file for HuggingFace transformer custom load
wolliq Apr 2, 2021
509fc29
Rebased TensorflowBert
wolliq Apr 7, 2021
2fe547a
Rebased TensorflowWrapper
wolliq Apr 7, 2021
47119e4
Fixed merge conflicts
wolliq Apr 7, 2021
3bee746
Refactored getTFHubSession method
wolliq Apr 8, 2021
fb94281
Added TF model signature extractor
wolliq Apr 10, 2021
4e5ec05
Added signature matches to map
wolliq Apr 11, 2021
62f3309
Added signature handler methods
wolliq Apr 12, 2021
9b7d4dd
Moved option to internal signatures extraction
wolliq Apr 13, 2021
a0c02f4
Added refactoring and doc to extract signature method
wolliq Apr 14, 2021
ae06151
Added refactoring to internalize TF model signature extraction
wolliq Apr 16, 2021
05e9761
Merge branch 'master' into feature/saved-model-bundle-auto-wrapper
wolliq Apr 16, 2021
8e9e444
Added custom signature internalization in TF wrapper
wolliq Apr 18, 2021
4ea4029
Added ADT for signatures key value
wolliq Apr 19, 2021
0a536bc
Refactored accessors for bert mappers
wolliq Apr 19, 2021
0fab928
Refactored signature extraction process in sign package
wolliq Apr 19, 2021
ab90511
Refactoring bert tf info accessors
wolliq Apr 19, 2021
c8b187d
Added fix to write loaded saved model TF2 in a pipe with broken pipe …
wolliq Apr 24, 2021
b79eeb9
Merge branch 'master' into feature/saved-model-bundle-auto-wrapper
maziyarpanahi Apr 26, 2021
ddbed50
Update extracting tf signatures [skip ci]
maziyarpanahi Apr 30, 2021
10e3b89
Adopt to new TensorflowWrapper read returning signatures
maziyarpanahi Apr 30, 2021
7d0017b
Create NerDLModelPythonReader.scala
maziyarpanahi Apr 30, 2021
10b246a
Merge branch 'master' into feature/saved-model-bundle-auto-wrapper
maziyarpanahi May 2, 2021
cf2689c
Leave a todo for setting modelProvider [skip ci]
maziyarpanahi May 2, 2021
a0bb87d
Added signature names reference from HuggingFace TF models
wolliq May 2, 2021
b36fe2d
Added signature names reference from HuggingFace TF models
wolliq May 2, 2021
ff898e1
Added unified patterns for TF v1 and v2 model matcher
wolliq May 3, 2021
b8fda98
Added alignement for getters/setters in saved model signatures
wolliq May 3, 2021
03729ec
Implement DistiBertEmbeddings annotator [skip ci]
maziyarpanahi May 3, 2021
771bec4
Add TF logic DistilBertEmbeddings [skip ci]
maziyarpanahi May 3, 2021
710cb51
Add DistilBertEmbeddings to ResourceDownloader [skip ci]
maziyarpanahi May 3, 2021
b03914a
Merge branch 'feature/T5-generation-parameters' into feature/distilbe…
maziyarpanahi May 4, 2021
aee2977
Add slowTest for DistilBertEmbeddings in Scala [skip ci]
maziyarpanahi May 5, 2021
94f1464
Add DistilBertEmbeddings to Python APIs [skip ci]
maziyarpanahi May 5, 2021
c70aefe
Add pretrained default name to DistilBertEmbeddings [skip ci]
maziyarpanahi May 5, 2021
97aebb7
Added signature map with secondary indexed sign def for input and ouput
wolliq May 6, 2021
8112874
Merge branch 'master' into feature/saved-model-bundle-auto-wrapper
wolliq May 6, 2021
014b23a
Removing old test on signature parser
wolliq May 6, 2021
b0597b5
Merge branch 'feature/saved-model-bundle-auto-wrapper' into feature/d…
maziyarpanahi May 7, 2021
32ba18e
Normalize TF v2 references [skip ci]
maziyarpanahi May 7, 2021
172608c
Merge branch 'feature/saved-model-bundle-auto-wrapper' into feature/d…
maziyarpanahi May 7, 2021
374bb92
Update missing last_hidden_state key [skip ci]
maziyarpanahi May 7, 2021
86c12e1
Merge branch 'feature/saved-model-bundle-auto-wrapper' into feature/d…
maziyarpanahi May 7, 2021
2d62e2a
Adopt to the new ModelSignatureConstants [skip ci]
maziyarpanahi May 7, 2021
bbc48e2
Added saved model auto classifier for TF version
wolliq May 8, 2021
f61891f
Merge branch 'feature/saved-model-bundle-auto-wrapper' of https://git…
wolliq May 8, 2021
e29a8d0
Remove tags from loadSavedModel [skip ci]
maziyarpanahi May 9, 2021
fe3322e
Added TensorFlowWrapper refactoring
wolliq Mar 29, 2021
32f615b
Removing duplicated method to extract TF variables from TFWrapper
wolliq Mar 29, 2021
46e3a02
Added refactoring for SavedModelBundle loader
wolliq Mar 30, 2021
1df5cb8
Added refactoring for helper methods
wolliq Mar 30, 2021
b29ea1d
Added custom signature for tf model auto wrapping
wolliq Apr 1, 2021
b6af65f
Added BERT py local train file for HuggingFace transformer custom load
wolliq Apr 2, 2021
0fb2ac3
Rebased TensorflowBert
wolliq Apr 7, 2021
73cb099
Rebased TensorflowWrapper
wolliq Apr 7, 2021
1e5778d
Refactored getTFHubSession method
wolliq Apr 8, 2021
1233946
Added TF model signature extractor
wolliq Apr 10, 2021
777ddfb
Added signature matches to map
wolliq Apr 11, 2021
e5db57f
Added signature handler methods
wolliq Apr 12, 2021
604e531
Moved option to internal signatures extraction
wolliq Apr 13, 2021
2e5fbaa
Added refactoring and doc to extract signature method
wolliq Apr 14, 2021
9d85d1a
Added refactoring to internalize TF model signature extraction
wolliq Apr 16, 2021
f2067a1
Added custom signature internalization in TF wrapper
wolliq Apr 18, 2021
fc366de
Added ADT for signatures key value
wolliq Apr 19, 2021
d52dc4e
Refactored accessors for bert mappers
wolliq Apr 19, 2021
b01527b
Refactored signature extraction process in sign package
wolliq Apr 19, 2021
15787fc
Refactoring bert tf info accessors
wolliq Apr 19, 2021
00e3d4e
Added fix to write loaded saved model TF2 in a pipe with broken pipe …
wolliq Apr 24, 2021
a10a660
Update extracting tf signatures [skip ci]
maziyarpanahi Apr 30, 2021
84beb12
Adopt to new TensorflowWrapper read returning signatures
maziyarpanahi Apr 30, 2021
e53ef18
Create NerDLModelPythonReader.scala
maziyarpanahi Apr 30, 2021
a532811
Added signature names reference from HuggingFace TF models
wolliq May 2, 2021
7389424
Added unified patterns for TF v1 and v2 model matcher
wolliq May 3, 2021
9344989
Added alignement for getters/setters in saved model signatures
wolliq May 3, 2021
127143e
Added signature map with secondary indexed sign def for input and ouput
wolliq May 6, 2021
f4df661
Removing old test on signature parser
wolliq May 6, 2021
3e69885
Added saved model auto classifier for TF version
wolliq May 8, 2021
9b294aa
Normalize TF v2 references [skip ci]
maziyarpanahi May 7, 2021
0de5a73
Update missing last_hidden_state key [skip ci]
maziyarpanahi May 7, 2021
c75d34a
Remove tags from loadSavedModel [skip ci]
maziyarpanahi May 9, 2021
ffc561b
Fixing conflicts in rebase
wolliq May 10, 2021
4a8be64
WIP Bpe, input IndexToken, Output Array[TokenPiece]
DevinTDHa Mar 27, 2021
35b54f1
Remove resolved FIXME [skip ci]
maziyarpanahi May 11, 2021
6247848
Merge branch 'feature/saved-model-bundle-auto-wrapper' into feature/d…
maziyarpanahi May 11, 2021
5c8ff2f
Add unit test to Scala [skip ci]
maziyarpanahi May 11, 2021
a2fff5d
Fix failed build in ResourceDownloader
maziyarpanahi May 11, 2021
3da311a
Update code styling [skip ci]
maziyarpanahi May 11, 2021
bc08524
Change logger from warn to debug
maziyarpanahi May 11, 2021
5dbc779
Merge branch 'feature/saved-model-bundle-auto-wrapper' into feature/d…
maziyarpanahi May 11, 2021
cb32f5e
Assign types and update styling
maziyarpanahi May 12, 2021
0306f1e
Merge pull request #2867 from hatrungduc/BpeTokenizer
maziyarpanahi May 12, 2021
b534587
Merge branch 'feature/saved-model-bundle-auto-wrapper' into feature/d…
maziyarpanahi May 12, 2021
f01384c
Merge branch 'feature/saved-model-bundle-auto-wrapper' into feature/r…
maziyarpanahi May 12, 2021
071ce6a
bpe refactor; WIP XLM
DevinTDHa May 12, 2021
75d41e7
Special Tokens Refactor, WIP MosesTokenizer
DevinTDHa May 13, 2021
daf5faf
Added fix for ending zero index in single char initial token
wolliq May 14, 2021
8836a83
fix caching issue, SpecialToken refactor
DevinTDHa May 14, 2021
f009d6a
Merge pull request #2872 from hatrungduc/BpeBasedTokenizers
maziyarpanahi May 14, 2021
43384b2
Merge branch 'feature/saved-model-bundle-auto-wrapper' into feature/r…
maziyarpanahi May 14, 2021
a8d7bb1
end Index changed, prepend space for token without one
DevinTDHa May 14, 2021
bc6e694
Merge pull request #2876 from hatrungduc/BpeBasedTokenizers
maziyarpanahi May 14, 2021
453808c
Merge branch 'feature/saved-model-bundle-auto-wrapper' into feature/r…
maziyarpanahi May 14, 2021
bd91aac
Refactor BpeTokenizer
maziyarpanahi May 15, 2021
21fc916
Create RoBertaEmbeddings annotator [skip ci]
maziyarpanahi May 16, 2021
996a7a6
Add RoBertaEmbeddings to annotator import [skip ci]
maziyarpanahi May 16, 2021
6ffd1e8
Add RoBertaEmbeddings to ResourceDownloader [skip ci]
maziyarpanahi May 16, 2021
329139b
Create RoBertaEmbeddings TF backend [skip ci]
maziyarpanahi May 16, 2021
50921cd
Add RoBertaEmbeddings Scala unit test
maziyarpanahi May 16, 2021
2756057
Make BpeTokenizer compatible with Scala 2.11
maziyarpanahi May 16, 2021
cf31cda
Make BpeTokenizerTestSpec compatible with Scala 2.11
maziyarpanahi May 16, 2021
43edf92
Add RoBertaEmbeddings to Python APIs [skip ci]
maziyarpanahi May 16, 2021
8d5a217
Update Scaladocs
maziyarpanahi May 16, 2021
777a9e8
Trigger GA tests
maziyarpanahi May 16, 2021
493160c
Sync loadSavedModel in Python [skip ci]
maziyarpanahi May 17, 2021
69cdacb
MosesTokenizer, WIP XLM
DevinTDHa May 17, 2021
cfadc54
More MosesTokenizer tests, BPE slight refactor, WIP XlmTokenizer
DevinTDHa May 18, 2021
0c7a8ea
Merge pull request #2874 from JohnSnowLabs/bugfix/regex-tok-mask-algo…
maziyarpanahi May 18, 2021
9c6beeb
Merge pull request #2902 from JohnSnowLabs/feature/roberta-init
maziyarpanahi May 18, 2021
b69b87a
Merge branch 'feature/roberta-init' into feature/distilbert-init
maziyarpanahi May 18, 2021
9d8bc37
Merge pull request #2871 from JohnSnowLabs/feature/distilbert-init
maziyarpanahi May 18, 2021
3223efb
Merge pull request #2870 from JohnSnowLabs/feature/saved-model-bundle…
maziyarpanahi May 18, 2021
f3af9e1
Remove unused comments [skip ci]
maziyarpanahi May 18, 2021
cfaedd4
WIP XLM + test
DevinTDHa May 19, 2021
cb43bac
wip functions
albertoandreottiATgmail May 19, 2021
7de5e46
XLM, Optimization WIP
DevinTDHa May 20, 2021
68e5515
Merge branch '310-release-candidate' into BpeBasedTokenizers
maziyarpanahi May 21, 2021
1935bb1
Merge pull request #2903 from hatrungduc/BpeBasedTokenizers
maziyarpanahi May 21, 2021
7438f61
Bump version to 3.1.0 [skip ci]
maziyarpanahi May 21, 2021
b6bf6e8
Make distilbert_base_cased default model [skip ci]
maziyarpanahi May 21, 2021
4ce8635
Update code styling [skip ci]
maziyarpanahi May 21, 2021
3bf8286
Moses Optimization
DevinTDHa May 20, 2021
409ccf2
fix multidot issue
DevinTDHa May 21, 2021
5aae956
Merge pull request #2939 from hatrungduc/BpeBasedTokenizers
maziyarpanahi May 21, 2021
35b62b9
added example
albertoandreottiATgmail May 21, 2021
7f533a3
Create XlmRoBertaEmbeddings annotator [skip ci]
maziyarpanahi May 23, 2021
2abb98e
Implement XlmRoBerta TF backend [skip ci]
maziyarpanahi May 23, 2021
2e1c863
Add new params to SentencepieceEncoder class [skip ci]
maziyarpanahi May 23, 2021
6812d1e
Remove unused comment [skip ci]
maziyarpanahi May 23, 2021
4b89c8a
Add XlmRoBertaEmbeddingsTestSpec [skip ci]
maziyarpanahi May 23, 2021
1e28d22
Add XlmRoBertaEmbeddings to downloader and annotator [skip ci]
maziyarpanahi May 23, 2021
9463a70
Migration to TF 2.4.1 with Java bindings 0.3.1
wolliq May 24, 2021
9e56a0d
MosesPunctNormalizer improvements
DevinTDHa May 25, 2021
6e47d12
Docs: Tokenizer Example
DevinTDHa May 25, 2021
892ec52
Fix pretraind model's language [skip ci]
maziyarpanahi May 25, 2021
849ca8a
Add XlmRoBertaEmbeddings to Python [skip ci]
maziyarpanahi May 25, 2021
f4eb6ce
Merge pull request #2949 from JohnSnowLabs/feature/spknlp310-tf031-mig
maziyarpanahi May 25, 2021
7dc0898
Merge pull request #2947 from JohnSnowLabs/feature/xlm-roberta-init
maziyarpanahi May 25, 2021
015329d
Fix control_dependency in TF v2 error [skip ci]
maziyarpanahi May 25, 2021
810b952
Doc: Added Examples for Chunk2Doc, Doc2Chunk, DocumentAssembler, Toke…
DevinTDHa May 26, 2021
1e22fe4
Doc: fixed Parameter group description
DevinTDHa May 26, 2021
29a0ea3
Add signatures to MarianTransformer [skip ci]
maziyarpanahi May 26, 2021
8dbfde4
Adopt Marian TF backend to TF v2 [skip ci]
maziyarpanahi May 26, 2021
dc36a47
Add Marian model to ModelSignatureConstants [skip ci]
maziyarpanahi May 26, 2021
a2984bd
Add saverDef extractSignatures [skip ci]
maziyarpanahi May 26, 2021
169ec14
Use saveDef from Signatures in save and restore [skip ci]
maziyarpanahi May 26, 2021
321ab0c
Fix adoptedKeys in ModelSignatureConstants [skip ci]
maziyarpanahi May 26, 2021
071bca3
Adopt BertSentenceEmbeddings for ModelSignatureConstants [skip ci]
maziyarpanahi May 26, 2021
6c9abc5
Update DecoderAttentionMask key and ops [skip ci]
maziyarpanahi May 26, 2021
b12415b
Clean up Marian TF backend
maziyarpanahi May 26, 2021
2c24e2c
Merge pull request #2951 from hatrungduc/BpeBasedTokenizers
maziyarpanahi May 26, 2021
c5e61c0
Merge pull request #2960 from JohnSnowLabs/feature/marian_tf2
maziyarpanahi May 27, 2021
555e00d
Doc: Examples for EmbeddingsFinisher, Finisher
DevinTDHa May 27, 2021
443a315
Added slow tests fixes on distilbert and ner crf
wolliq May 28, 2021
64c5bc4
add map columns functions
xusliebana May 28, 2021
d8b1761
add documentation
xusliebana May 28, 2021
ffd0628
Merge pull request #2975 from JohnSnowLabs/map_multiple_annotations
maziyarpanahi May 29, 2021
3c317f1
Merge pull request #2973 from JohnSnowLabs/bugfix/slow-tests-errors
maziyarpanahi May 29, 2021
5784a4d
Remove unused functions in TF wrapper
maziyarpanahi May 30, 2021
282984d
fix test for scala 11
xusliebana May 30, 2021
b014145
Merge pull request #2983 from JohnSnowLabs/map_multiple_annotations
maziyarpanahi May 30, 2021
94f6b92
Revert branch to TF 0.2.2 for 0.3.1 perf degradation
wolliq May 31, 2021
c0bdad8
Docs: Added examples for DocumentNormalizer, Lemmatizer*, Normalizer,…
DevinTDHa Jun 2, 2021
8789841
Revert "Revert branch to TF 0.2.2 for 0.3.1 perf degradation"
maziyarpanahi Jun 2, 2021
aff4cea
Docs: Examples for Chunker, ExternalResource, NGramGenerator, RegexMa…
DevinTDHa Jun 3, 2021
b6ab69b
Adding Devin Ha to developers of this build [skip ci]
maziyarpanahi Jun 3, 2021
f035b3c
Docs: Added examples for DateMatcher, MultiDateMatcher, SentenceDetec…
DevinTDHa Jun 4, 2021
c3a9c06
Make sure initAllTables is false
maziyarpanahi Jun 6, 2021
9b1aad8
Add batchAnnotate to MarianTransformer [skip ci]
maziyarpanahi Jun 7, 2021
5093202
Make sure session has signatures [skip ci]
maziyarpanahi Jun 7, 2021
9961ecf
Add batchSize to MarianTransformer in Python [skip ci]
maziyarpanahi Jun 7, 2021
b706b26
Add signatures to getTFHubSession
maziyarpanahi Jun 7, 2021
e9c4cec
Merge branch '310-release-candidate' into feature/marian-batch-annotate
maziyarpanahi Jun 7, 2021
f927cf5
Merge pull request #5667 from JohnSnowLabs/feature/marian-batch-annotate
maziyarpanahi Jun 7, 2021
c86b82b
Merge pull request #4321 from hatrungduc/doc-examples-update
maziyarpanahi Jun 7, 2021
9a0b175
Update docs for 3.1.0 release [skip ci]
maziyarpanahi Jun 7, 2021
2b643f9
Update CHANGELOG for 3.1.0
maziyarpanahi Jun 7, 2021
fc13dec
Update language in model cards [skip ci]
maziyarpanahi Jun 7, 2021
54c9d48
Update conda to 3.1.0 release [skip ci]
maziyarpanahi Jun 7, 2021
737eed3
Update Scaladoc for 3.1.0 release [skip ci]
maziyarpanahi Jun 7, 2021
@@ -135,8 +135,11 @@ private[nlp] abstract class BpeTokenizer(
  .map {
  case (subWord: String, indexes: (Int, Int)) =>
  val isWordStart = indToken.begin == indexes._1
- val subWordId = if (vocab.contains(subWord)) vocab(subWord) else specialTokens.unk.id // Set unknown id
- TokenPiece(subWord, processedToken, subWordId, isWordStart, indexes._1, indexes._2)
+ val subWordId: Int = if (subWord(0) != 'Ġ' && isWordStart)
+ vocab.getOrElse("Ġ" + subWord, specialTokens.unk.id) // TODO do this for non roberta case
+ else vocab.getOrElse(subWord, specialTokens.unk.id) // Set unknown id
+
+ TokenPiece(subWord, processedToken, subWordId, isWordStart, indexes._1, indexes._2 - 1)
  }
  result
  }
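The lookup rule introduced in this hunk can be tried in isolation. The following is only a minimal sketch, assuming a plain `Map[String, Int]` vocabulary; the object name `WordStartLookup`, the method `pieceId`, and the `unkId` parameter (standing in for `specialTokens.unk.id`) are hypothetical and not part of the PR. The `nonEmpty` guard is an addition of the sketch, since `subWord(0)` would throw on an empty string.

```scala
object WordStartLookup {
  // Resolve the vocabulary id of a sub-word piece.
  // A word-initial piece that lacks the 'Ġ' marker is looked up with the marker
  // prepended (the RoBERTa-style branch of the hunk); every other piece is looked
  // up unchanged. Unknown pieces fall back to unkId (hypothetical parameter).
  def pieceId(vocab: Map[String, Int], subWord: String, isWordStart: Boolean, unkId: Int): Int =
    if (subWord.nonEmpty && subWord(0) != 'Ġ' && isWordStart)
      vocab.getOrElse("Ġ" + subWord, unkId)
    else
      vocab.getOrElse(subWord, unkId)
}
```

With a vocabulary that stores word-initial pieces under their `Ġ`-prefixed form, a bare word-initial `"I"` resolves through the key `"ĠI"`, while a continuation piece such as `"ig"` is looked up as-is.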
@@ -52,7 +52,7 @@ class RobertaTokenizer(
  val splitPattern: Regex = raw"'s|'t|'re|'ve|'m|'ll|'d| ?\p{L}+| ?\p{N}+| ?[^\s\p{L}\p{N}]+|\s+(?!\S)|\s+".r
  splitPattern
  .findAllMatchIn(text)
- .map(tok => IndexedToken(tok.matched, tok.start + indexOffset, tok.end + indexOffset)) // TODO Expected -1?
+ .map(tok => IndexedToken(tok.matched, tok.start + indexOffset, tok.end + indexOffset - 1)) // TODO Expected -1?
  .toArray
  }

@@ -88,14 +88,14 @@ class RobertaTokenizer(
  for (subText <- splitTexts) {
  val subTextIndex = sentence.start + text.indexOf(subText, currentIndex)
  if (!specialTokens.contains(subText)) {
- val splitSubText = splitOnPattern(subText, sentence.start + subTextIndex)
+ val splitSubText = splitOnPattern(subText, subTextIndex)
  result.append(splitSubText: _*)
  } else // subtext is just the special token
  result.append(
  IndexedToken(
  subText,
  begin = subTextIndex,
- end = subTextIndex + subText.length
+ end = subTextIndex + subText.length - 1
  )
  )
  currentIndex = subTextIndex + subText.length
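Both RobertaTokenizer hunks above move token offsets from an exclusive to an inclusive end index. A minimal sketch of that convention follows, assuming a simplified split pattern and a hypothetical `Tok` case class standing in for `IndexedToken`; the names `InclusiveOffsets` and `split` are illustrative only. `Regex.Match.end` is exclusive, so the sketch subtracts 1 to obtain the index of the last matched character.

```scala
import scala.util.matching.Regex

object InclusiveOffsets {
  // Hypothetical, simplified stand-in for IndexedToken: `end` is inclusive.
  final case class Tok(text: String, begin: Int, end: Int)

  // Simplified pattern (an assumption, not the PR's splitPattern):
  // an optional leading space followed by a run of non-space characters.
  val pattern: Regex = raw" ?\S+".r

  // Match.end is exclusive, so `- 1` yields the index of the last character.
  // `offset` shifts the indices from sub-text to full-text coordinates.
  def split(text: String, offset: Int): Array[Tok] =
    pattern
      .findAllMatchIn(text)
      .map(m => Tok(m.matched, m.start + offset, m.end + offset - 1))
      .toArray
}
```

Under this convention the original text is recovered with `text.slice(tok.begin, tok.end + 1)`, which is the same form of check the updated test spec in this PR uses.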
@@ -22,54 +22,57 @@ import com.johnsnowlabs.tags.FastTest
  import org.scalatest.FlatSpec

  class BpeTokenizerTestSpec extends FlatSpec {
- val vocab: Map[String, Int] =
- Array(
- "<s>",
- "</s>",
- "<mask>",
- "I",
- "Ġunamb",
- "ig",
- "ou",
- "os",
- "ly",
- "Ġgood",
- "Ġ3",
- "As",
- "d",
- "!",
- "<unk>",
- "<pad>",
- ).zipWithIndex.toMap
- val merges: Array[String] = Array(
- "o u",
- "l y",
- "Ġ g",
- "a m",
- "i g",
- "Ġ u",
- "o d",
- "u n",
- "o s",
- "Ġg o",
- "Ġu n",
- "o od",
- "A s",
- "m b",
- "g o",
- "o o",
- "n a",
- "am b",
- "s l",
- "n am",
- "b i",
- "b ig",
- "u o",
- "s d",
- "Ġun amb",
- "Ġgo od",
- "Ġ 3",
- )
+ val vocab: Map[String, Int] =
+ Array(
+ "<s>",
+ "</s>",
+ "<mask>",
+ "ĠI",
+ "Ġunamb",
+ "ig",
+ "ou",
+ "os",
+ "ly",
+ "Ġgood",
+ "Ġ3",
+ "ĠAs",
+ "d",
+ "Ġ!",
+ "<unk>",
+ "<pad>",
+ // "ĠI",
+ // "ĠAs",
+ // "Ġ!",
+ ).zipWithIndex.toMap
+ val merges: Array[String] = Array(
+ "o u",
+ "l y",
+ "Ġ g",
+ "a m",
+ "i g",
+ "Ġ u",
+ "o d",
+ "u n",
+ "o s",
+ "Ġg o",
+ "Ġu n",
+ "o od",
+ "A s",
+ "m b",
+ "g o",
+ "o o",
+ "n a",
+ "am b",
+ "s l",
+ "n am",
+ "b i",
+ "b ig",
+ "u o",
+ "s d",
+ "Ġun amb",
+ "Ġgo od",
+ "Ġ 3",
+ )
  val bpeTokenizer: BpeTokenizer = BpeTokenizer.forModel(
  "roberta",
  merges,
@@ -81,13 +84,13 @@ class BpeTokenizerTestSpec extends FlatSpec {
  encoded: Array[TokenPiece],
  expected: Array[String],
  expectedIds: Array[Int]): Unit = {
- // println(encoded.mkString("Array(\n ", ",\n ", "\n)"))
+ println(encoded.mkString("Array(\n ", ",\n ", "\n)"))
  for (i <- encoded.indices) {
  val piece = encoded(i)
  assert(piece.wordpiece == expected(i))
  assert(piece.pieceId == expectedIds(i))

- assert(text.slice(piece.begin, piece.end) == piece.wordpiece.replace("Ġ", " "))
+ assert(text.slice(piece.begin, piece.end + 1) == piece.wordpiece.replace("Ġ", " "))
  }
  }

@@ -160,4 +163,27 @@ class BpeTokenizerTestSpec extends FlatSpec {
  BpeTokenizer.forModel("unsupported", merges, vocab, padWithSentenceTokens = false)
  }
  }
+
+ // "RobertaTokenizer" should "encode 2" taggedAs FastTest in {
+ // val text = "Rare Hendrix song draft sells for almost $17,000"
+ //// val sentence = Sentence(text, 0, text.length - 1, 0)
+ // val indexedTokens = text.split(" ").map(
+ // tok => IndexedToken(tok, text.indexOf(tok), text.indexOf(tok) + tok.length - 1)
+ // )
+ // println(indexedTokens.mkString("Array(\n ", ",\n ", "\n)"))
+ //
+ // val indexedTokSentences: Array[IndexedToken] = indexedTokens
+ // .map(tok => Sentence(tok.token, tok.begin, tok.begin + tok.token.length - 1, 0))
+ // .flatMap(bpeTokenizer.tokenize)
+ // println(indexedTokSentences.mkString("Array(\n ", ",\n ", "\n)"))
+ //
+ // val encoded = bpeTokenizer.encode(indexedTokSentences)
+ // println(encoded.mkString("Array(\n ", ",\n ", "\n)"))
+ // for (i <- encoded.indices) {
+ // val piece = encoded(i)
+ // println("asserting ", text.slice(piece.begin, piece.end + 1), piece.wordpiece.replace("Ġ", " "))
+ // assert(text.slice(piece.begin, piece.end + 1) == piece.wordpiece.replace("Ġ", " "))
+ //
+ // }
+ // }
  }