

[Open in Colab](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp/blob/master/examples/python/transformers/TF%20Hub%20in%20Spark%20NLP%20-%20BERT.ipynb)

## Import BERT models from TF Hub into Spark NLP

Let's keep a few things in mind before we start:

- This feature is only available in `Spark NLP 3.1.x` and later, so please make sure you have upgraded to the latest Spark NLP release
- You can import any BERT model from TF Hub as long as it is a `TF2.0 Saved Model`. This means you cannot use the `BERT models for TF1`, which are `DEPRECATED`
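
As a quick guard for the version requirement above, here is a minimal sketch that compares version strings. The helper names `parse_version` and `meets_minimum` are hypothetical and not part of Spark NLP; in a live session the installed version string would typically come from `sparknlp.version()`.

```python
import re


def parse_version(version: str) -> tuple:
    """Turn a string such as '3.1.2' into (3, 1, 2), ignoring suffixes like 'rc1'."""
    parts = []
    for piece in version.split("."):
        match = re.match(r"\d+", piece)
        if match is None:
            break
        parts.append(int(match.group()))
    return tuple(parts)


def meets_minimum(installed: str, minimum: str = "3.1.0") -> bool:
    """Return True when `installed` is at least `minimum`."""
    return parse_version(installed) >= parse_version(minimum)


# Hard-coded strings keep the sketch runnable without Spark NLP installed.
print(meets_minimum("3.1.2"))  # True
print(meets_minimum("3.0.3"))  # False
```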

## Save TF Hub model

- We install `tensorflow` and `tensorflow-hub` so we can load the model from TF Hub
- We then simply save it in TF SavedModel format with a signature that exposes `sequence_output`
- We'll use the [small_bert/bert_en_uncased_L-2_H-128_A-2](https://tfhub.dev/tensorflow/small_bert/bert_en_uncased_L-2_H-128_A-2/2) model from TF Hub as an example


In [None]:
!rm -rf /content/*

In [None]:
!pip install -q tensorflow==2.4.1 tensorflow-hub

In [None]:
EXPORTED_MODEL = 'bert_en_uncased_L-2_H-128_A-2'
TF_HUB_URL = 'https://tfhub.dev/tensorflow/small_bert/bert_en_uncased_L-2_H-128_A-2/2'

In [None]:
import tensorflow as tf
import tensorflow_hub as hub

# Load the TF Hub encoder as a frozen Keras layer
encoder = hub.KerasLayer(TF_HUB_URL, trainable=False)

# Serving signature that exposes only the token-level `sequence_output`
@tf.function
def my_module_encoder(input_mask, input_word_ids, input_type_ids):
    inputs = {
        'input_mask': input_mask,
        'input_word_ids': input_word_ids,
        'input_type_ids': input_type_ids
    }
    outputs = {
        'sequence_output': encoder(inputs)['sequence_output']
    }
    return outputs

# Export as a TF2 SavedModel with (batch, sequence) int32 inputs
tf.saved_model.save(
    encoder,
    EXPORTED_MODEL,
    signatures=my_module_encoder.get_concrete_function(
        input_mask=tf.TensorSpec(shape=(None, None), dtype=tf.int32),
        input_word_ids=tf.TensorSpec(shape=(None, None), dtype=tf.int32),
        input_type_ids=tf.TensorSpec(shape=(None, None), dtype=tf.int32)
    ),
    options=None
)



INFO:tensorflow:Assets written to: /content/bert_en_uncased_L-2_H-128_A-2/assets




Let's look inside the exported model directory and its `assets` subdirectory to see what we are dealing with:

In [None]:
!ls -l {EXPORTED_MODEL}

total 2076
drwxr-xr-x 2 root root 4096 Jul 12 10:31 assets
-rw-r--r-- 1 root root 2115591 Jul 12 10:31 saved_model.pb
drwxr-xr-x 2 root root 4096 Jul 12 10:31 variables


In [None]:
!ls -l {EXPORTED_MODEL}/assets

total 228
-rw-r--r-- 1 root root 231508 Jul 12 10:31 vocab.txt


- As you can see, everything Spark NLP needs is already here, including `vocab.txt` in the `assets` directory
- We are all set! We can move on to Spark NLP
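
The manual inspection above can also be done programmatically. The following is a minimal sketch, assuming the layout shown in the listings (`saved_model.pb`, `variables`, and `assets/vocab.txt`); the helper `missing_savedmodel_files` is a hypothetical name used only for illustration.

```python
from pathlib import Path

# Entries the listings above show inside the exported directory.
REQUIRED = ("saved_model.pb", "variables", "assets/vocab.txt")


def missing_savedmodel_files(export_dir: str) -> list:
    """Return the required entries that are absent from `export_dir`."""
    root = Path(export_dir)
    return [name for name in REQUIRED if not (root / name).exists()]


missing = missing_savedmodel_files("bert_en_uncased_L-2_H-128_A-2")
print("ready for Spark NLP" if not missing else f"missing: {missing}")
```

Running this check before cleaning up or moving the directory makes a missing `vocab.txt` visible early rather than at import time.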

## Import and Save BERT in Spark NLP


- Let's install and set up Spark NLP in Google Colab
- This part is pretty easy via our simple script

In [None]:
! wget http://setup.johnsnowlabs.com/colab.sh -O - | bash

--2021-07-12 10:34:15-- http://setup.johnsnowlabs.com/colab.sh
Resolving setup.johnsnowlabs.com (setup.johnsnowlabs.com)... 51.158.130.125
Connecting to setup.johnsnowlabs.com (setup.johnsnowlabs.com)|51.158.130.125|:80... connected.
HTTP request sent, awaiting response... 302 Moved Temporarily
Location: https://raw.githubusercontent.com/JohnSnowLabs/spark-nlp/master/scripts/colab_setup.sh [following]
--2021-07-12 10:34:15-- https://raw.githubusercontent.com/JohnSnowLabs/spark-nlp/master/scripts/colab_setup.sh
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.199.108.133, 185.199.109.133, 185.199.110.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|185.199.108.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 1608 (1.6K) [text/plain]
Saving to: ‘STDOUT’

setup Colab for PySpark 3.0.3 and Spark NLP 3.1.2

2021-07-12 10:34:15 (40.0 MB/s) - written to stdout [1608/1608]

Get:1 http://security.ubuntu.com/ubuntu bi

Let's start Spark with Spark NLP included via our simple `start()` function

In [None]:
import sparknlp
# let's start Spark with Spark NLP
spark = sparknlp.start()

- Let's use the `loadSavedModel` function in `BertEmbeddings`, which allows us to load a TensorFlow model in SavedModel format
- Most params can be set later when you load this model in `BertEmbeddings` at runtime, so don't worry about the values you set now
- `loadSavedModel` accepts two params: the first is the path to the TF SavedModel, and the second is the SparkSession, that is, the `spark` variable we previously started via `sparknlp.start()`
- `setStorageRef` is very important. When you train a task like NER or any text classification, we use this reference to bind the trained model to these specific embeddings, so you won't load different embeddings by mistake and see terrible results
- It's up to you what you put in `setStorageRef`, but it cannot be changed later on. We usually use the name of the model to be clear, but you can get creative if you want!
- The `dimension` param is purely cosmetic and won't change anything. It's mostly for you to know later via `.getDimension` what the dimension of your model is, so set it accordingly.
- NOTE: `loadSavedModel` accepts local paths in addition to distributed file systems such as `HDFS`, `S3`, `DBFS`, etc. This feature was introduced in the Spark NLP 4.2.2 release. Keep in mind that the best and recommended way to move/share/reuse Spark NLP models is to use `write.save` so you can use `.load()` from any file system natively.
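
To pick a sensible value for the `dimension` param, the small BERT model names encode the architecture as `L` (layers), `H` (hidden size), and `A` (attention heads), and the hidden size is the embedding dimension. Below is a minimal sketch that reads these numbers from the name; `small_bert_shape` is a hypothetical helper, and it assumes the name follows the `L-*_H-*_A-*` convention.

```python
import re


def small_bert_shape(model_name: str) -> dict:
    """Extract layers, hidden size, and attention heads from a small BERT name."""
    match = re.search(r"L-(\d+)_H-(\d+)_A-(\d+)", model_name)
    if match is None:
        raise ValueError(f"no L-*_H-*_A-* pattern in {model_name!r}")
    layers, hidden, heads = map(int, match.groups())
    return {"layers": layers, "hidden_size": hidden, "attention_heads": heads}


shape = small_bert_shape("bert_en_uncased_L-2_H-128_A-2")
print(shape["hidden_size"])  # 128
```

For models whose names do not follow this pattern, the hidden size would need to come from the model's documentation instead.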


In [None]:
from sparknlp.annotator import *

# H-128 in the model name is the hidden size, so the dimension is 128
bert = BertEmbeddings.loadSavedModel(
    EXPORTED_MODEL,
    spark
    )\
    .setInputCols(["sentence", "token"])\
    .setOutputCol("bert")\
    .setCaseSensitive(False)\
    .setDimension(128)\
    .setStorageRef(EXPORTED_MODEL)

- Let's save it to disk so it is easier to move around and can be loaded later via the `.load` function

In [None]:
bert.write().overwrite().save("./{}_spark_nlp".format(EXPORTED_MODEL))

Let's clean up the files we don't need anymore

In [None]:
!rm -rf {EXPORTED_MODEL}

Awesome!

This is your BERT model from TF Hub, loaded and saved by Spark NLP

In [None]:
! ls -l {EXPORTED_MODEL}_spark_nlp

total 16256
-rw-r--r-- 1 root root 16635088 Jul 12 10:42 bert_tensorflow
drwxr-xr-x 4 root root 4096 Jul 12 10:42 fields
drwxr-xr-x 2 root root 4096 Jul 12 10:42 metadata


Now let's see how we can use it on other machines, clusters, or any place you wish to use your new and shiny BERT model

In [None]:
bert_loaded = BertEmbeddings.load("./{}_spark_nlp".format(EXPORTED_MODEL))\
 .setInputCols(["sentence",'token'])\
 .setOutputCol("bert")\
 .setCaseSensitive(False)

In [None]:
bert_loaded.getStorageRef()

'bert_en_uncased_L-2_H-128_A-2'
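
To actually produce embeddings, the loaded model needs the `sentence` and `token` columns it was configured with. Here is a minimal sketch of one way to wire that up, assuming a standard `DocumentAssembler`, `SentenceDetector`, and `Tokenizer` in front of the embeddings; the function name `build_bert_pipeline` and the `text` input column are illustrative choices, not something the notebook prescribes.

```python
def build_bert_pipeline(model_path):
    """Build a Spark ML pipeline that feeds sentences and tokens into the saved BERT model."""
    # Imported inside the function so that defining it does not require Spark to be installed.
    from pyspark.ml import Pipeline
    from sparknlp.base import DocumentAssembler
    from sparknlp.annotator import BertEmbeddings, SentenceDetector, Tokenizer

    document = DocumentAssembler().setInputCol("text").setOutputCol("document")
    sentence = SentenceDetector().setInputCols(["document"]).setOutputCol("sentence")
    token = Tokenizer().setInputCols(["sentence"]).setOutputCol("token")
    bert = (
        BertEmbeddings.load(model_path)
        .setInputCols(["sentence", "token"])
        .setOutputCol("bert")
        .setCaseSensitive(False)
    )
    return Pipeline(stages=[document, sentence, token, bert])


# Usage, assuming `spark` from sparknlp.start() and the model saved earlier:
# pipeline = build_bert_pipeline("./bert_en_uncased_L-2_H-128_A-2_spark_nlp")
# df = spark.createDataFrame([["Spark NLP can load BERT models from TF Hub."]]).toDF("text")
# pipeline.fit(df).transform(df).select("bert.embeddings").show(1)
```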

That's it! You can now go wild and import BERT models from TF Hub in Spark NLP
