{ "cells": [ { "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ "\n", "\n", "[Open in Colab](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp/blob/master/examples/python/transformers/TF%20Hub%20in%20Spark%20NLP%20-%20BERT.ipynb)\n", "\n", "## Import BERT models from TF Hub into Spark NLP\n", "\n", "Let's keep in mind a few things before we start:\n", "\n", "- This feature is only available in `Spark NLP 3.1.x` and later, so please make sure you have upgraded to the latest Spark NLP release\n", "- You can import any BERT model from TF Hub, but it has to be in the `TF2.0 Saved Model` format. This means you cannot use the `BERT models for TF1`, which are `DEPRECATED`" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Save TF Hub model" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "- We install `tensorflow` and `tensorflow-hub` so we can load the model with `hub.KerasLayer`\n", "- We then export it with `tf.saved_model.save`, using a signature that exposes `sequence_output`\n", "- We'll use the [small_bert/bert_en_uncased_L-2_H-128_A-2](https://tfhub.dev/tensorflow/small_bert/bert_en_uncased_L-2_H-128_A-2/2) model from TF Hub as an example\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "!rm -rf /content/*" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "\u001b[K |████████████████████████████████| 394.3MB 39kB/s \n", "\u001b[K |████████████████████████████████| 3.8MB 32.3MB/s \n", "\u001b[K |████████████████████████████████| 2.9MB 33.9MB/s \n", "\u001b[K |████████████████████████████████| 471kB 42.4MB/s \n", "\u001b[?25h" ] } ], "source": [ "!pip install -q tensorflow==2.4.1 tensorflow-hub" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "EXPORTED_MODEL = 'bert_en_uncased_L-2_H-128_A-2'\n", "TF_HUB_URL = 'https://tfhub.dev/tensorflow/small_bert/bert_en_uncased_L-2_H-128_A-2/2'" ] }, { "cell_type": "code", "execution_count": null, 
"metadata": {}, "outputs": [ { "name": "stderr", "output_type": "stream", "text": [ "WARNING:absl:Found untraced functions such as keras_layer_layer_call_and_return_conditional_losses, keras_layer_layer_call_fn, keras_layer_layer_call_fn, keras_layer_layer_call_and_return_conditional_losses, keras_layer_layer_call_and_return_conditional_losses while saving (showing 5 of 170). These functions will not be directly callable after loading.\n", "WARNING:absl:Found untraced functions such as keras_layer_layer_call_and_return_conditional_losses, keras_layer_layer_call_fn, keras_layer_layer_call_fn, keras_layer_layer_call_and_return_conditional_losses, keras_layer_layer_call_and_return_conditional_losses while saving (showing 5 of 170). These functions will not be directly callable after loading.\n" ] }, { "name": "stdout", "output_type": "stream", "text": [ "INFO:tensorflow:Assets written to: /content/bert_en_uncased_L-2_H-128_A-2/assets\n" ] }, { "name": "stderr", "output_type": "stream", "text": [ "INFO:tensorflow:Assets written to: /content/bert_en_uncased_L-2_H-128_A-2/assets\n" ] } ], "source": [ "import tensorflow as tf\n", "import tensorflow_hub as hub\n", "\n", "encoder = hub.KerasLayer(TF_HUB_URL, trainable=False)\n", "\n", "@tf.function\n", "def my_module_encoder(input_mask, input_word_ids, input_type_ids):\n", " inputs = {\n", " 'input_mask': input_mask,\n", " 'input_word_ids': input_word_ids,\n", " 'input_type_ids': input_type_ids\n", " }\n", " outputs = {\n", " 'sequence_output': encoder(inputs)['sequence_output']\n", " }\n", " return outputs\n", "\n", "tf.saved_model.save(\n", " encoder, \n", " EXPORTED_MODEL, \n", " signatures=my_module_encoder.get_concrete_function(\n", " input_mask=tf.TensorSpec(shape=(None, None), dtype=tf.int32),\n", " input_word_ids=tf.TensorSpec(shape=(None, None), dtype=tf.int32),\n", " input_type_ids=tf.TensorSpec(shape=(None, None), dtype=tf.int32)\n", " ), \n", " options=None\n", ")" ] }, { "cell_type": "markdown", "metadata": {}, 
"source": [ "Let's look inside the exported model directory and its `assets` subdirectory to see what we are dealing with:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "total 2076\n", "drwxr-xr-x 2 root root 4096 Jul 12 10:31 assets\n", "-rw-r--r-- 1 root root 2115591 Jul 12 10:31 saved_model.pb\n", "drwxr-xr-x 2 root root 4096 Jul 12 10:31 variables\n" ] } ], "source": [ "!ls -l {EXPORTED_MODEL}" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "total 228\n", "-rw-r--r-- 1 root root 231508 Jul 12 10:31 vocab.txt\n" ] } ], "source": [ "!ls -l {EXPORTED_MODEL}/assets" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "- As you can see, everything Spark NLP needs is already here, including `vocab.txt` in the `assets` directory\n", "- We are all set! We can move on to Spark NLP" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Import and Save BERT in Spark NLP\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "- Let's install and set up Spark NLP in Google Colab\n", "- This part is pretty easy via our simple script" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "--2021-07-12 10:34:15-- http://setup.johnsnowlabs.com/colab.sh\n", "Resolving setup.johnsnowlabs.com (setup.johnsnowlabs.com)... 51.158.130.125\n", "Connecting to setup.johnsnowlabs.com (setup.johnsnowlabs.com)|51.158.130.125|:80... connected.\n", "HTTP request sent, awaiting response... 302 Moved Temporarily\n", "Location: https://raw.githubusercontent.com/JohnSnowLabs/spark-nlp/master/scripts/colab_setup.sh [following]\n", "--2021-07-12 10:34:15-- https://raw.githubusercontent.com/JohnSnowLabs/spark-nlp/master/scripts/colab_setup.sh\n", "Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 
185.199.108.133, 185.199.109.133, 185.199.110.133, ...\n", "Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|185.199.108.133|:443... connected.\n", "HTTP request sent, awaiting response... 200 OK\n", "Length: 1608 (1.6K) [text/plain]\n", "Saving to: ‘STDOUT’\n", "\n", "setup Colab for PySpark 3.0.3 and Spark NLP 3.1.2\n", "- 100%[===================>] 1.57K --.-KB/s in 0s \n", "\n", "2021-07-12 10:34:15 (40.0 MB/s) - written to stdout [1608/1608]\n", "\n", "Get:1 http://security.ubuntu.com/ubuntu bionic-security InRelease [88.7 kB]\n", "Ign:2 https://developer.download.nvidia.com/compute/cuda/repos/ubuntu1804/x86_64 InRelease\n", "Ign:3 https://developer.download.nvidia.com/compute/machine-learning/repos/ubuntu1804/x86_64 InRelease\n", "Get:4 http://ppa.launchpad.net/c2d4u.team/c2d4u4.0+/ubuntu bionic InRelease [15.9 kB]\n", "Get:5 https://developer.download.nvidia.com/compute/cuda/repos/ubuntu1804/x86_64 Release [697 B]\n", "Hit:6 https://developer.download.nvidia.com/compute/machine-learning/repos/ubuntu1804/x86_64 Release\n", "Hit:7 http://archive.ubuntu.com/ubuntu bionic InRelease\n", "Get:8 https://developer.download.nvidia.com/compute/cuda/repos/ubuntu1804/x86_64 Release.gpg [836 B]\n", "Get:9 https://cloud.r-project.org/bin/linux/ubuntu bionic-cran40/ InRelease [3,626 B]\n", "Get:10 http://archive.ubuntu.com/ubuntu bionic-updates InRelease [88.7 kB]\n", "Hit:11 http://ppa.launchpad.net/cran/libgit2/ubuntu bionic InRelease\n", "Get:12 http://security.ubuntu.com/ubuntu bionic-security/main amd64 Packages [2,221 kB]\n", "Get:13 http://archive.ubuntu.com/ubuntu bionic-backports InRelease [74.6 kB]\n", "Get:14 http://ppa.launchpad.net/deadsnakes/ppa/ubuntu bionic InRelease [15.9 kB]\n", "Get:15 http://security.ubuntu.com/ubuntu bionic-security/universe amd64 Packages [1,418 kB]\n", "Hit:17 http://ppa.launchpad.net/graphics-drivers/ppa/ubuntu bionic InRelease\n", "Ign:18 https://developer.download.nvidia.com/compute/cuda/repos/ubuntu1804/x86_64 
Packages\n", "Get:18 https://developer.download.nvidia.com/compute/cuda/repos/ubuntu1804/x86_64 Packages [637 kB]\n", "Get:19 https://cloud.r-project.org/bin/linux/ubuntu bionic-cran40/ Packages [62.2 kB]\n", "Get:20 http://ppa.launchpad.net/c2d4u.team/c2d4u4.0+/ubuntu bionic/main Sources [1,780 kB]\n", "Get:21 http://archive.ubuntu.com/ubuntu bionic-updates/universe amd64 Packages [2,188 kB]\n", "Get:22 http://ppa.launchpad.net/c2d4u.team/c2d4u4.0+/ubuntu bionic/main amd64 Packages [910 kB]\n", "Get:23 http://archive.ubuntu.com/ubuntu bionic-updates/main amd64 Packages [2,658 kB]\n", "Get:24 http://ppa.launchpad.net/deadsnakes/ppa/ubuntu bionic/main amd64 Packages [40.8 kB]\n", "Fetched 12.2 MB in 4s (3,186 kB/s)\n", "Reading package lists... Done\n", "\u001b[K |████████████████████████████████| 209.1MB 70kB/s \n", "\u001b[K |████████████████████████████████| 51kB 6.0MB/s \n", "\u001b[K |████████████████████████████████| 204kB 38.2MB/s \n", "\u001b[?25h Building wheel for pyspark (setup.py) ... \u001b[?25l\u001b[?25hdone\n" ] } ], "source": [ "! wget http://setup.johnsnowlabs.com/colab.sh -O - | bash" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Let's start Spark with Spark NLP included via our simple `start()` function" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "import sparknlp\n", "# let's start Spark with Spark NLP\n", "spark = sparknlp.start()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "- Let's use the `loadSavedModel` function in `BertEmbeddings`, which allows us to load a TensorFlow model in SavedModel format\n", "- Most params can be set later when you load this model in `BertEmbeddings` at runtime, so don't worry about what you set them to now\n", "- `loadSavedModel` accepts two params: the first is the path to the TF SavedModel, and the second is the SparkSession, that is, the `spark` variable we previously started via `sparknlp.start()`\n", "- `setStorageRef` is very important. 
When you are training a task like NER or any Text Classification, we use this reference to bind the trained model to these specific embeddings, so you won't load different embeddings by mistake and see terrible results\n", "- It's up to you what you put in `setStorageRef`, but it cannot be changed later on. We usually use the name of the model to be clear, but you can get creative if you want!\n", "- The `dimension` param is purely cosmetic and won't change anything. It's mostly there so you can check the dimension of your model later via `.getDimension`, so set it accordingly (128 for this model, as the `H-128` in its name indicates).\n", "- NOTE: `loadSavedModel` accepts local paths in addition to distributed file systems such as `HDFS`, `S3`, `DBFS`, etc. This feature was introduced in the Spark NLP 4.2.2 release. Keep in mind that the best and recommended way to move/share/reuse Spark NLP models is to use `write.save` so you can use `.load()` from any file system natively.\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "from sparknlp.annotator import *\n", "\n", "bert = BertEmbeddings.loadSavedModel(\n", " EXPORTED_MODEL,\n", " spark\n", " )\\\n", " .setInputCols([\"sentence\",'token'])\\\n", " .setOutputCol(\"bert\")\\\n", " .setCaseSensitive(False)\\\n", " .setDimension(128)\\\n", " .setStorageRef(EXPORTED_MODEL) " ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "- Let's save it to disk so it is easier to move around and can be used later via the `.load` function" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "bert.write().overwrite().save(\"./{}_spark_nlp\".format(EXPORTED_MODEL))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Let's clean up stuff we don't need anymore" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "!rm -rf {EXPORTED_MODEL}" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Awesome!\n", "\n", "This is your 
BERT model from TF Hub, loaded and saved by Spark NLP" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "total 16256\n", "-rw-r--r-- 1 root root 16635088 Jul 12 10:42 bert_tensorflow\n", "drwxr-xr-x 4 root root 4096 Jul 12 10:42 fields\n", "drwxr-xr-x 2 root root 4096 Jul 12 10:42 metadata\n" ] } ], "source": [ "! ls -l {EXPORTED_MODEL}_spark_nlp" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Now let's see how we can use it on other machines, clusters, or any place you wish to use your new and shiny BERT model" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "bert_loaded = BertEmbeddings.load(\"./{}_spark_nlp\".format(EXPORTED_MODEL))\\\n", " .setInputCols([\"sentence\",'token'])\\\n", " .setOutputCol(\"bert\")\\\n", " .setCaseSensitive(False)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [ { "data": { "application/vnd.google.colaboratory.intrinsic+json": { "type": "string" }, "text/plain": [ "'bert_en_uncased_L-2_H-128_A-2'" ] }, "execution_count": null, "metadata": { "tags": [] }, "output_type": "execute_result" } ], "source": [ "bert_loaded.getStorageRef()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "That's it! You can now go wild and import BERT models from TF Hub into Spark NLP\n" ] } ], "metadata": { "colab": { "authorship_tag": "ABX9TyNMrDVCZXsvZYgfF2ZWHz6D", "collapsed_sections": [], "name": "TF Hub in Spark NLP - BERT.ipynb", "provenance": [] }, "kernelspec": { "display_name": "Python 3 (ipykernel)", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3" } }, "nbformat": 4, "nbformat_minor": 1 }