{
 "cells": [
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "![JohnSnowLabs](https://sparknlp.org/assets/images/logo.png)\n",
    "\n",
    "[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp/blob/master/examples/python/transformers/TF%20Hub%20in%20Spark%20NLP%20-%20BERT.ipynb)\n",
    "\n",
    "## Import BERT models from TF Hub into Spark NLP πŸš€\n",
    "\n",
    "Let's keep in mind a few things before we start 😊\n",
    "\n",
    "- This feature is only in `Spark NLP 3.1.x` and after. So please make sure you have upgraded to the latest Spark NLP release\n",
    "- You can import any BERT models from TF Hub but they have to be `TF2.0 Saved Model` models. Meaning, you cannot use `BERT models for TF1` which are `DEPRECATED`"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Save TF Hub model"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "- We do not need to install `tensorflow` nor `tensorflow-hub`\n",
    "- We can simple download the model and extract it\n",
    "- We'll use [small_bert/bert_uncased_L-2_H-128_A-2](https://tfhub.dev/google/small_bert/bert_uncased_L-2_H-128_A-2/2) model from TF Hub as an example\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "!rm -rf /content/*"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "\u001b[K     |β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆ| 394.3MB 39kB/s \n",
      "\u001b[K     |β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆ| 3.8MB 32.3MB/s \n",
      "\u001b[K     |β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆ| 2.9MB 33.9MB/s \n",
      "\u001b[K     |β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆ| 471kB 42.4MB/s \n",
      "\u001b[?25h"
     ]
    }
   ],
   "source": [
    "!pip install -q tensorflow==2.4.1 tensorflow-hub"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "EXPORTED_MODEL = 'bert_en_uncased_L-2_H-128_A-2'\n",
    "TF_HUB_URL = 'https://tfhub.dev/tensorflow/small_bert/bert_en_uncased_L-2_H-128_A-2/2'"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "WARNING:absl:Found untraced functions such as keras_layer_layer_call_and_return_conditional_losses, keras_layer_layer_call_fn, keras_layer_layer_call_fn, keras_layer_layer_call_and_return_conditional_losses, keras_layer_layer_call_and_return_conditional_losses while saving (showing 5 of 170). These functions will not be directly callable after loading.\n",
      "WARNING:absl:Found untraced functions such as keras_layer_layer_call_and_return_conditional_losses, keras_layer_layer_call_fn, keras_layer_layer_call_fn, keras_layer_layer_call_and_return_conditional_losses, keras_layer_layer_call_and_return_conditional_losses while saving (showing 5 of 170). These functions will not be directly callable after loading.\n"
     ]
    },
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "INFO:tensorflow:Assets written to: /content/bert_en_uncased_L-2_H-128_A-2/assets\n"
     ]
    },
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "INFO:tensorflow:Assets written to: /content/bert_en_uncased_L-2_H-128_A-2/assets\n"
     ]
    }
   ],
   "source": [
    "import tensorflow as tf\n",
    "import tensorflow_hub as hub\n",
    "\n",
    "encoder = hub.KerasLayer(TF_HUB_URL, trainable=False)\n",
    "\n",
    "@tf.function\n",
    "def my_module_encoder(input_mask, input_word_ids, input_type_ids):\n",
    "   inputs = {\n",
    "        'input_mask': input_mask,\n",
    "        'input_word_ids': input_word_ids,\n",
    "        'input_type_ids': input_type_ids\n",
    "   }\n",
    "   outputs = {\n",
    "        'sequence_output': encoder(inputs)['sequence_output']\n",
    "   }\n",
    "   return outputs\n",
    "\n",
    "tf.saved_model.save(\n",
    "    encoder, \n",
    "    EXPORTED_MODEL, \n",
    "    signatures=my_module_encoder.get_concrete_function(\n",
    "        input_mask=tf.TensorSpec(shape=(None, None), dtype=tf.int32),\n",
    "        input_word_ids=tf.TensorSpec(shape=(None, None), dtype=tf.int32),\n",
    "        input_type_ids=tf.TensorSpec(shape=(None, None), dtype=tf.int32)\n",
    "    ), \n",
    "    options=None\n",
    ")"
   ]
  },
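  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Before moving on, we can optionally sanity-check what we just exported. The sketch below (not required for the import) reloads the `SavedModel`, prints its `serving_default` signature, and runs a dummy batch through it; the token ids are illustrative, with `101` and `102` being `[CLS]` and `[SEP]` in the standard BERT vocab:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# optional sanity check: reload the exported SavedModel and inspect its signature\n",
    "reloaded = tf.saved_model.load(EXPORTED_MODEL)\n",
    "sig = reloaded.signatures['serving_default']\n",
    "print(sig.structured_input_signature)\n",
    "\n",
    "# run a dummy batch (1 sentence, 8 tokens) through the signature\n",
    "ids = tf.constant([[101, 7592, 2088, 102, 0, 0, 0, 0]], dtype=tf.int32)\n",
    "mask = tf.constant([[1, 1, 1, 1, 0, 0, 0, 0]], dtype=tf.int32)\n",
    "out = sig(input_word_ids=ids, input_mask=mask, input_type_ids=tf.zeros_like(ids))\n",
    "print(out['sequence_output'].shape)  # expected: (1, 8, 128) for this H-128 model"
   ]
  },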
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Let's have a look inside these two directories and see what we are dealing with:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "total 2076\n",
      "drwxr-xr-x 2 root root    4096 Jul 12 10:31 assets\n",
      "-rw-r--r-- 1 root root 2115591 Jul 12 10:31 saved_model.pb\n",
      "drwxr-xr-x 2 root root    4096 Jul 12 10:31 variables\n"
     ]
    }
   ],
   "source": [
    "!ls -l {EXPORTED_MODEL}"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "total 228\n",
      "-rw-r--r-- 1 root root 231508 Jul 12 10:31 vocab.txt\n"
     ]
    }
   ],
   "source": [
    "!ls -l {EXPORTED_MODEL}/assets"
   ]
  },
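  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "As a quick optional check, we can peek at the first few entries of `vocab.txt` (purely illustrative; Spark NLP picks this file up automatically):"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# peek at the vocabulary Spark NLP will read from the assets directory\n",
    "with open('{}/assets/vocab.txt'.format(EXPORTED_MODEL)) as f:\n",
    "    print([next(f).strip() for _ in range(5)])"
   ]
  },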
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "- as you can see, everything needed in Spark NLP is already here, including `vocab.txt` in `assets` directory\n",
    "- we all set! We can got to Spark NLP 😊 "
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Import and Save BERT in Spark NLP\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "- Let's install and setup Spark NLP in Google Colab\n",
    "- This part is pretty easy via our simple script"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "--2021-07-12 10:34:15--  http://setup.johnsnowlabs.com/colab.sh\n",
      "Resolving setup.johnsnowlabs.com (setup.johnsnowlabs.com)... 51.158.130.125\n",
      "Connecting to setup.johnsnowlabs.com (setup.johnsnowlabs.com)|51.158.130.125|:80... connected.\n",
      "HTTP request sent, awaiting response... 302 Moved Temporarily\n",
      "Location: https://raw.githubusercontent.com/JohnSnowLabs/spark-nlp/master/scripts/colab_setup.sh [following]\n",
      "--2021-07-12 10:34:15--  https://raw.githubusercontent.com/JohnSnowLabs/spark-nlp/master/scripts/colab_setup.sh\n",
      "Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.199.108.133, 185.199.109.133, 185.199.110.133, ...\n",
      "Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|185.199.108.133|:443... connected.\n",
      "HTTP request sent, awaiting response... 200 OK\n",
      "Length: 1608 (1.6K) [text/plain]\n",
      "Saving to: β€˜STDOUT’\n",
      "\n",
      "setup Colab for PySpark 3.0.3 and Spark NLP 3.1.2\n",
      "-                   100%[===================>]   1.57K  --.-KB/s    in 0s      \n",
      "\n",
      "2021-07-12 10:34:15 (40.0 MB/s) - written to stdout [1608/1608]\n",
      "\n",
      "Get:1 http://security.ubuntu.com/ubuntu bionic-security InRelease [88.7 kB]\n",
      "Ign:2 https://developer.download.nvidia.com/compute/cuda/repos/ubuntu1804/x86_64  InRelease\n",
      "Ign:3 https://developer.download.nvidia.com/compute/machine-learning/repos/ubuntu1804/x86_64  InRelease\n",
      "Get:4 http://ppa.launchpad.net/c2d4u.team/c2d4u4.0+/ubuntu bionic InRelease [15.9 kB]\n",
      "Get:5 https://developer.download.nvidia.com/compute/cuda/repos/ubuntu1804/x86_64  Release [697 B]\n",
      "Hit:6 https://developer.download.nvidia.com/compute/machine-learning/repos/ubuntu1804/x86_64  Release\n",
      "Hit:7 http://archive.ubuntu.com/ubuntu bionic InRelease\n",
      "Get:8 https://developer.download.nvidia.com/compute/cuda/repos/ubuntu1804/x86_64  Release.gpg [836 B]\n",
      "Get:9 https://cloud.r-project.org/bin/linux/ubuntu bionic-cran40/ InRelease [3,626 B]\n",
      "Get:10 http://archive.ubuntu.com/ubuntu bionic-updates InRelease [88.7 kB]\n",
      "Hit:11 http://ppa.launchpad.net/cran/libgit2/ubuntu bionic InRelease\n",
      "Get:12 http://security.ubuntu.com/ubuntu bionic-security/main amd64 Packages [2,221 kB]\n",
      "Get:13 http://archive.ubuntu.com/ubuntu bionic-backports InRelease [74.6 kB]\n",
      "Get:14 http://ppa.launchpad.net/deadsnakes/ppa/ubuntu bionic InRelease [15.9 kB]\n",
      "Get:15 http://security.ubuntu.com/ubuntu bionic-security/universe amd64 Packages [1,418 kB]\n",
      "Hit:17 http://ppa.launchpad.net/graphics-drivers/ppa/ubuntu bionic InRelease\n",
      "Ign:18 https://developer.download.nvidia.com/compute/cuda/repos/ubuntu1804/x86_64  Packages\n",
      "Get:18 https://developer.download.nvidia.com/compute/cuda/repos/ubuntu1804/x86_64  Packages [637 kB]\n",
      "Get:19 https://cloud.r-project.org/bin/linux/ubuntu bionic-cran40/ Packages [62.2 kB]\n",
      "Get:20 http://ppa.launchpad.net/c2d4u.team/c2d4u4.0+/ubuntu bionic/main Sources [1,780 kB]\n",
      "Get:21 http://archive.ubuntu.com/ubuntu bionic-updates/universe amd64 Packages [2,188 kB]\n",
      "Get:22 http://ppa.launchpad.net/c2d4u.team/c2d4u4.0+/ubuntu bionic/main amd64 Packages [910 kB]\n",
      "Get:23 http://archive.ubuntu.com/ubuntu bionic-updates/main amd64 Packages [2,658 kB]\n",
      "Get:24 http://ppa.launchpad.net/deadsnakes/ppa/ubuntu bionic/main amd64 Packages [40.8 kB]\n",
      "Fetched 12.2 MB in 4s (3,186 kB/s)\n",
      "Reading package lists... Done\n",
      "\u001b[K     |β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆ| 209.1MB 70kB/s \n",
      "\u001b[K     |β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆ| 51kB 6.0MB/s \n",
      "\u001b[K     |β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆ| 204kB 38.2MB/s \n",
      "\u001b[?25h  Building wheel for pyspark (setup.py) ... \u001b[?25l\u001b[?25hdone\n"
     ]
    }
   ],
   "source": [
    "! wget http://setup.johnsnowlabs.com/colab.sh -O - | bash"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Let's start Spark with Spark NLP included via our simple `start()` function"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "import sparknlp\n",
    "# let's start Spark with Spark NLP\n",
    "spark = sparknlp.start()"
   ]
  },
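  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "A quick optional check that everything is up and running:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# confirm the versions we are running with\n",
    "print(\"Spark NLP version: \", sparknlp.version())\n",
    "print(\"Apache Spark version: \", spark.version)"
   ]
  },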
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "- Let's use `loadSavedModel` functon in `BertEmbeddings` which allows us to load TensorFlow model in SavedModel format\n",
    "- Most params can be set later when you are loading this model in `BertEmbeddings` in runtime, so don't worry what you are setting them now\n",
    "- `loadSavedModel` accepts two params, first is the path to the TF SavedModel. The second is the SparkSession that is `spark` variable we previously started via `sparknlp.start()`\n",
    "- `setStorageRef` is very important. When you are training a task like NER or any Text Classification, we use this reference to bound the trained model to this specific embeddings so you won't load a different embeddings by mistake and see terrible results 😊\n",
    "- It's up to you what you put in `setStorageRef` but it cannot be changed later on. We usually use the name of the model to be clear, but you can get creative if you want! \n",
    "- The `dimension` param is is purely cosmetic and won't change anything. It's mostly for you to know later via `.getDimension` what is the dimension of your model. So set this accordingly.\n",
    "- NOTE: `loadSavedModel` accepts local paths in addition to distributed file systems such as `HDFS`, `S3`, `DBFS`, etc. This feature was introduced in Spark NLP 4.2.2 release. Keep in mind the best and recommended way to move/share/reuse Spark NLP models is to use `write.save` so you can use `.load()` from any file systems natively..\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "from sparknlp.annotator import *\n",
    "\n",
    "bert = BertEmbeddings.loadSavedModel(\n",
    "     EXPORTED_MODEL,\n",
    "     spark\n",
    " )\\\n",
    " .setInputCols([\"sentence\",'token'])\\\n",
    " .setOutputCol(\"bert\")\\\n",
    " .setCaseSensitive(False)\\\n",
    " .setDimension(768)\\\n",
    " .setStorageRef(EXPORTED_MODEL) "
   ]
  },
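  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "We can verify the params we just set (optional; `getStorageRef` and `getDimension` are the standard getters on the annotator):"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# confirm the storage reference and dimension we just set\n",
    "print(bert.getStorageRef())\n",
    "print(bert.getDimension())"
   ]
  },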
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "- Let's save it on disk so it is easier to be moved around and also be used later via `.load` function"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "bert.write().overwrite().save(\"./{}_spark_nlp\".format(EXPORTED_MODEL))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Let's clean up stuff we don't need anymore"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "!rm -rf {EXPORTED_MODEL}"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Awesome 😎  !\n",
    "\n",
    "This is your BERT model from HuggingFace πŸ€—  loaded and saved by Spark NLP πŸš€ "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "total 16256\n",
      "-rw-r--r-- 1 root root 16635088 Jul 12 10:42 bert_tensorflow\n",
      "drwxr-xr-x 4 root root     4096 Jul 12 10:42 fields\n",
      "drwxr-xr-x 2 root root     4096 Jul 12 10:42 metadata\n"
     ]
    }
   ],
   "source": [
    "! ls -l {EXPORTED_MODEL}_spark_nlp"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Now let's see how we can use it on other machines, clusters, or any place you wish to use your new and shiny BERT model 😊 "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "bert_loaded = BertEmbeddings.load(\"./{}_spark_nlp\".format(EXPORTED_MODEL))\\\n",
    "  .setInputCols([\"sentence\",'token'])\\\n",
    "  .setOutputCol(\"bert\")\\\n",
    "  .setCaseSensitive(False)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "application/vnd.google.colaboratory.intrinsic+json": {
       "type": "string"
      },
      "text/plain": [
       "'bert_en_uncased_L-2_H-128_A-2'"
      ]
     },
     "execution_count": null,
     "metadata": {
      "tags": []
     },
     "output_type": "execute_result"
    }
   ],
   "source": [
    "bert_loaded.getStorageRef()"
   ]
  },
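  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "As a final end-to-end check, here is a minimal pipeline sketch that feeds a toy sentence through the loaded embeddings. It assumes the same Spark session as above and relies on `SentenceDetector` and `Tokenizer` from the earlier `from sparknlp.annotator import *`; the sample text and column names are illustrative:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "from pyspark.ml import Pipeline\n",
    "from sparknlp.base import DocumentAssembler\n",
    "\n",
    "# raw text -> document -> sentence -> token, then embed with the loaded BERT\n",
    "document_assembler = DocumentAssembler()\\\n",
    "    .setInputCol(\"text\")\\\n",
    "    .setOutputCol(\"document\")\n",
    "\n",
    "sentence_detector = SentenceDetector()\\\n",
    "    .setInputCols([\"document\"])\\\n",
    "    .setOutputCol(\"sentence\")\n",
    "\n",
    "tokenizer = Tokenizer()\\\n",
    "    .setInputCols([\"sentence\"])\\\n",
    "    .setOutputCol(\"token\")\n",
    "\n",
    "pipeline = Pipeline(stages=[document_assembler, sentence_detector, tokenizer, bert_loaded])\n",
    "\n",
    "data = spark.createDataFrame([[\"Spark NLP loves TF Hub models!\"]]).toDF(\"text\")\n",
    "result = pipeline.fit(data).transform(data)\n",
    "result.selectExpr(\"explode(bert.embeddings) as embedding\").show(3)"
   ]
  },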
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "That's it! You can now go wild and import BERT models from TF Hub in Spark NLP πŸš€ \n"
   ]
  }
 ],
 "metadata": {
  "colab": {
   "authorship_tag": "ABX9TyNMrDVCZXsvZYgfF2ZWHz6D",
   "collapsed_sections": [],
   "name": "TF Hub in Spark NLP - BERT.ipynb",
   "provenance": []
  },
  "kernelspec": {
   "display_name": "Python 3 (ipykernel)",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 1
}