[SPARKNLP-1113] Adding txt reader notebook example

danilojsl · danilojsl · commit 30502cc3676a · 2025-03-06T18:20:32.000-05:00
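
The recorded output of the notebook added here shows the `txt` column as a list of typed elements (`Title`, `NarrativeText`), each carrying a `paragraph` index in its metadata. As a rough illustration of that shape only, the sketch below approximates the grouping in pure Python. It is not Spark NLP's implementation: the function name `split_txt_elements`, the "single all-uppercase line is a title" heuristic, and the blank-line layout of the sample text are assumptions reconstructed from the output shown in the notebook.

```python
import re


def split_txt_elements(text):
    """Approximate the shape of the reader's `txt` column.

    Returns a list of (element_type, content, metadata) tuples.
    Hypothetical helper for illustration; the heuristics are assumptions.
    """
    elements = []
    paragraph = -1
    # Assumption: blocks are separated by one or more blank lines.
    for block in re.split(r"\n\s*\n", text.strip()):
        block = block.strip()
        if not block:
            continue
        # Assumption: a single all-uppercase line is treated as a title.
        is_title = "\n" not in block and block.isupper()
        # A title opens a new paragraph group; so does a leading non-title block.
        if is_title or paragraph < 0:
            paragraph += 1
        element_type = "Title" if is_title else "NarrativeText"
        elements.append((element_type, block, {"paragraph": str(paragraph)}))
    return elements


# Sample text reconstructed from the notebook output (the layout is assumed).
sample = (
    "BIG DATA ANALYTICS\n\n"
    "Apache Spark is a fast and general-purpose cluster computing system.\n"
    "It provides high-level APIs in Java, Scala, Python, and R.\n\n"
    "MACHINE LEARNING\n\n"
    "Spark's MLlib provides scalable machine learning algorithms.\n"
    "It includes tools for classification, regression, clustering, and more.\n"
)

elements = split_txt_elements(sample)
for element in elements:
    print(element)
```

Each title and the narrative text that follows it share one paragraph index, which matches the pairing visible in the notebook's `txt_df.select("txt")` output.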
diff --git a/examples/python/reader/SparkNLP_TXT_Reader_Demo.ipynb b/examples/python/reader/SparkNLP_TXT_Reader_Demo.ipynb
@@ -0,0 +1,241 @@
+{
+ "cells": [
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "![JohnSnowLabs](https://sparknlp.org/assets/images/logo.png)\n",
+    "\n",
+    "[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp/blob/master/examples/python/reader/SparkNLP_TXT_Reader_Demo.ipynb)"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {
+    "application/vnd.databricks.v1+cell": {
+     "cellMetadata": {
+      "byteLimit": 2048000,
+      "rowLimit": 10000
+     },
+     "inputWidgets": {},
+     "nuid": "c0efed73-75e9-41f1-9a2e-a2d0953b3a76",
+     "showTitle": false,
+     "tableResultSettingsMap": {},
+     "title": ""
+    },
+    "id": "tzcU5p2gdak9"
+   },
+   "source": [
+    "# Introducing the TXT Reader in Spark NLP\n",
+    "This notebook showcases the newly added `sparknlp.read().txt()` method in Spark NLP, which parses the content of TXT files from both local files and real-time URLs into a Spark DataFrame."
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {
+    "application/vnd.databricks.v1+cell": {
+     "cellMetadata": {
+      "byteLimit": 2048000,
+      "rowLimit": 10000
+     },
+     "inputWidgets": {},
+     "nuid": "356de93e-af38-4156-823b-6371d7fd825c",
+     "showTitle": false,
+     "tableResultSettingsMap": {},
+     "title": ""
+    },
+    "id": "RFOFhaEedalB"
+   },
+   "source": [
+    "## Setup and Initialization\n",
+    "Let's keep in mind a few things before we start 😊\n",
+    "\n",
+    "Support for reading TXT files was introduced in Spark NLP 5.6.0. Please make sure you have upgraded to the latest Spark NLP release.\n",
+    "\n",
+    "- Let's install and set up Spark NLP in Google Colab\n",
+    "- This part is pretty easy via our simple script"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "! wget -q http://setup.johnsnowlabs.com/colab.sh -O - | bash"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "For the local files example, we will download a TXT file from the Spark NLP GitHub repository:"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 0,
+   "metadata": {
+    "application/vnd.databricks.v1+cell": {
+     "cellMetadata": {
+      "byteLimit": 2048000,
+      "rowLimit": 10000
+     },
+     "inputWidgets": {},
+     "nuid": "bb622e88-2ef9-49c4-8cfb-e49209ad206a",
+     "showTitle": false,
+     "tableResultSettingsMap": {},
+     "title": ""
+    },
+    "colab": {
+     "base_uri": "https://localhost:8080/"
+    },
+    "id": "ya8qZe00dalC",
+    "outputId": "268ccacb-ba1c-4753-f251-014fb0003f38"
+   },
+   "outputs": [],
+   "source": [
+    "!mkdir txt-files\n",
+    "!wget https://raw.githubusercontent.com/JohnSnowLabs/spark-nlp/master/src/test/resources/reader/txt/simple-text.txt -P txt-files"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {
+    "application/vnd.databricks.v1+cell": {
+     "cellMetadata": {
+      "byteLimit": 2048000,
+      "rowLimit": 10000
+     },
+     "inputWidgets": {},
+     "nuid": "13d72e9f-04b4-4547-bc4e-35b3878a93c2",
+     "showTitle": false,
+     "tableResultSettingsMap": {},
+     "title": ""
+    },
+    "id": "EoFI66NAdalE"
+   },
+   "source": [
+    "## Parsing Text from Local Files\n",
+    "Use the `txt()` method to parse text file content from local directories."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 0,
+   "metadata": {
+    "application/vnd.databricks.v1+cell": {
+     "cellMetadata": {
+      "byteLimit": 2048000,
+      "rowLimit": 10000
+     },
+     "inputWidgets": {},
+     "nuid": "df54ed9b-682b-4b99-891a-84c23bc5cbd0",
+     "showTitle": false,
+     "tableResultSettingsMap": {},
+     "title": ""
+    },
+    "colab": {
+     "base_uri": "https://localhost:8080/"
+    },
+    "id": "bAkMjJ1vdalE",
+    "outputId": "a0a2e727-fcc3-474b-eaaa-20bf15f19773"
+   },
+   "outputs": [
+    {
+     "name": "stdout",
+     "output_type": "stream",
+     "text": [
+      "Warning::Spark Session already created, some configs may not take.\n",
+      "+--------------------+--------------------+--------------------+\n",
+      "|                path|             content|                 txt|\n",
+      "+--------------------+--------------------+--------------------+\n",
+      "|dbfs:/danilo/data...|BIG DATA ANALYTIC...|[{Title, BIG DATA...|\n",
+      "+--------------------+--------------------+--------------------+\n",
+      "\n"
+     ]
+    }
+   ],
+   "source": [
+    "import sparknlp\n",
+    "txt_df = sparknlp.read().txt(\"./txt-files\")\n",
+    "\n",
+    "txt_df.show()"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 0,
+   "metadata": {
+    "application/vnd.databricks.v1+cell": {
+     "cellMetadata": {
+      "byteLimit": 2048000,
+      "rowLimit": 10000
+     },
+     "inputWidgets": {},
+     "nuid": "9f5c787d-2eab-4546-8001-e34f00124670",
+     "showTitle": false,
+     "tableResultSettingsMap": {},
+     "title": ""
+    },
+    "colab": {
+     "base_uri": "https://localhost:8080/"
+    },
+    "id": "4iky1gvEz7Pt",
+    "outputId": "a986947b-f874-46bc-88c8-093dc42c83cb"
+   },
+   "outputs": [
+    {
+     "name": "stdout",
+     "output_type": "stream",
+     "text": [
+      "+-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+\n",
+      "|txt                                                                                                                                                                                                                                                                                                                                                                                                                                        |\n",
+      "+-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+\n",
+      "|[{Title, BIG DATA ANALYTICS, {paragraph -> 0}}, {NarrativeText, Apache Spark is a fast and general-purpose cluster computing system.\\nIt provides high-level APIs in Java, Scala, Python, and R., {paragraph -> 0}}, {Title, MACHINE LEARNING, {paragraph -> 1}}, {NarrativeText, Spark's MLlib provides scalable machine learning algorithms.\\nIt includes tools for classification, regression, clustering, and more., {paragraph -> 1}}]|\n",
+      "+-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+\n",
+      "\n"
+     ]
+    }
+   ],
+   "source": [
+    "txt_df.select(\"txt\").show(truncate=False)"
+   ]
+  }
+ ],
+ "metadata": {
+  "application/vnd.databricks.v1+notebook": {
+   "computePreferences": null,
+   "dashboards": [],
+   "environmentMetadata": null,
+   "language": "python",
+   "notebookMetadata": {
+    "pythonIndentUnit": 4
+   },
+   "notebookName": "SparkNLP_TXT_Reader_Demo",
+   "widgets": {}
+  },
+  "colab": {
+   "provenance": []
+  },
+  "kernelspec": {
+   "display_name": "Python 3 (ipykernel)",
+   "language": "python",
+   "name": "python3"
+  },
+  "language_info": {
+   "codemirror_mode": {
+    "name": "ipython",
+    "version": 3
+   },
+   "file_extension": ".py",
+   "mimetype": "text/x-python",
+   "name": "python",
+   "nbconvert_exporter": "python",
+   "pygments_lexer": "ipython3",
+   "version": "3.10.12"
+  }
+ },
+ "nbformat": 4,
+ "nbformat_minor": 1
+}
diff --git a/python/sparknlp/reader/sparknlp_reader.py b/python/sparknlp/reader/sparknlp_reader.py
@@ -254,7 +254,6 @@ def ppt(self, docPath):
         jdf = self._java_obj.ppt(docPath)
         dataframe = self.getDataFrame(self.spark, jdf)
         return dataframe
-        return self.getDataFrame(self.spark, jdf)
 
     def txt(self, docPath):
         """Reads TXT files and returns a Spark DataFrame.