Commit 30502cc

[SPARKNLP-1113] Adding txt reader notebook example

1 parent 8710011 commit 30502cc

File tree

2 files changed: +241 −1

examples/python/reader/SparkNLP_TXT_Reader_Demo.ipynb
python/sparknlp/reader/sparknlp_reader.py
examples/python/reader/SparkNLP_TXT_Reader_Demo.ipynb
@@ -0,0 +1,241 @@
{
  "cells": [
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "![JohnSnowLabs](https://sparknlp.org/assets/images/logo.png)\n",
        "\n",
        "[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp/blob/master/examples/python/reader/SparkNLP_TXT_Reader_Demo.ipynb)"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "application/vnd.databricks.v1+cell": {
          "cellMetadata": {
            "byteLimit": 2048000,
            "rowLimit": 10000
          },
          "inputWidgets": {},
          "nuid": "c0efed73-75e9-41f1-9a2e-a2d0953b3a76",
          "showTitle": false,
          "tableResultSettingsMap": {},
          "title": ""
        },
        "id": "tzcU5p2gdak9"
      },
      "source": [
        "# Introducing TXT reader in SparkNLP\n",
        "This notebook showcases the newly added `sparknlp.read().txt()` method in Spark NLP that parses txt file content from both local files and real-time URLs into a Spark DataFrame."
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "application/vnd.databricks.v1+cell": {
          "cellMetadata": {
            "byteLimit": 2048000,
            "rowLimit": 10000
          },
          "inputWidgets": {},
          "nuid": "356de93e-af38-4156-823b-6371d7fd825c",
          "showTitle": false,
          "tableResultSettingsMap": {},
          "title": ""
        },
        "id": "RFOFhaEedalB"
      },
      "source": [
        "## Setup and Initialization\n",
        "Let's keep in mind a few things before we start 😊\n",
        "\n",
"Support for reading html files was introduced in Spark NLP 5.6.0. Please make sure you have upgraded to the latest Spark NLP release.\n",
        "\n",
        "- Let's install and set up Spark NLP in Google Colab\n",
        "- This part is pretty easy via our simple script"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {},
      "outputs": [],
      "source": [
        "! wget -q http://setup.johnsnowlabs.com/colab.sh -O - | bash"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
"For local files example we will download a TXT file from Spark NLP Github repo:"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": 0,
      "metadata": {
        "application/vnd.databricks.v1+cell": {
          "cellMetadata": {
            "byteLimit": 2048000,
            "rowLimit": 10000
          },
          "inputWidgets": {},
          "nuid": "bb622e88-2ef9-49c4-8cfb-e49209ad206a",
          "showTitle": false,
          "tableResultSettingsMap": {},
          "title": ""
        },
        "colab": {
          "base_uri": "https://localhost:8080/"
        },
        "id": "ya8qZe00dalC",
        "outputId": "268ccacb-ba1c-4753-f251-014fb0003f38"
      },
      "outputs": [],
      "source": [
        "!mkdir txt-files\n",
        "!wget https://raw.githubusercontent.com/JohnSnowLabs/spark-nlp/master/src/test/resources/reader/txt/simple-text.txt -P txt-files"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "application/vnd.databricks.v1+cell": {
          "cellMetadata": {
            "byteLimit": 2048000,
            "rowLimit": 10000
          },
          "inputWidgets": {},
          "nuid": "13d72e9f-04b4-4547-bc4e-35b3878a93c2",
          "showTitle": false,
          "tableResultSettingsMap": {},
          "title": ""
        },
        "id": "EoFI66NAdalE"
      },
      "source": [
        "## Parsing text from Local Files\n",
        "Use the `txt()` method to parse text file content from local directories."
      ]
    },
    {
      "cell_type": "code",
      "execution_count": 0,
      "metadata": {
        "application/vnd.databricks.v1+cell": {
          "cellMetadata": {
            "byteLimit": 2048000,
            "rowLimit": 10000
          },
          "inputWidgets": {},
          "nuid": "df54ed9b-682b-4b99-891a-84c23bc5cbd0",
          "showTitle": false,
          "tableResultSettingsMap": {},
          "title": ""
        },
        "colab": {
          "base_uri": "https://localhost:8080/"
        },
        "id": "bAkMjJ1vdalE",
        "outputId": "a0a2e727-fcc3-474b-eaaa-20bf15f19773"
      },
      "outputs": [
        {
          "name": "stdout",
          "output_type": "stream",
          "text": [
            "Warning::Spark Session already created, some configs may not take.\n",
            "+--------------------+--------------------+--------------------+\n",
            "|                path|             content|                 txt|\n",
            "+--------------------+--------------------+--------------------+\n",
            "|dbfs:/danilo/data...|BIG DATA ANALYTIC...|[{Title, BIG DATA...|\n",
            "+--------------------+--------------------+--------------------+\n",
            "\n"
          ]
        }
      ],
      "source": [
        "import sparknlp\n",
        "txt_df = sparknlp.read().txt(\"dbfs:/danilo/datasets/txt\")\n",
        "\n",
        "txt_df.show()"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": 0,
      "metadata": {
        "application/vnd.databricks.v1+cell": {
          "cellMetadata": {
            "byteLimit": 2048000,
            "rowLimit": 10000
          },
          "inputWidgets": {},
          "nuid": "9f5c787d-2eab-4546-8001-e34f00124670",
          "showTitle": false,
          "tableResultSettingsMap": {},
          "title": ""
        },
        "colab": {
          "base_uri": "https://localhost:8080/"
        },
        "id": "4iky1gvEz7Pt",
        "outputId": "a986947b-f874-46bc-88c8-093dc42c83cb"
      },
      "outputs": [
        {
          "name": "stdout",
          "output_type": "stream",
          "text": [
            "+-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+\n",
            "|txt                                                                                                                                                                                                                                                                                                                                                                                                                                        |\n",
            "+-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+\n",
            "|[{Title, BIG DATA ANALYTICS, {paragraph -> 0}}, {NarrativeText, Apache Spark is a fast and general-purpose cluster computing system.\\nIt provides high-level APIs in Java, Scala, Python, and R., {paragraph -> 0}}, {Title, MACHINE LEARNING, {paragraph -> 1}}, {NarrativeText, Spark's MLlib provides scalable machine learning algorithms.\\nIt includes tools for classification, regression, clustering, and more., {paragraph -> 1}}]|\n",
            "+-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+\n",
            "\n"
          ]
        }
      ],
      "source": [
        "txt_df.select(\"txt\").show(truncate=False)"
      ]
    }
  ],
  "metadata": {
    "application/vnd.databricks.v1+notebook": {
      "computePreferences": null,
      "dashboards": [],
      "environmentMetadata": null,
      "language": "python",
      "notebookMetadata": {
        "pythonIndentUnit": 4
      },
      "notebookName": "SparkNLP_TXT_Reader_Demo",
      "widgets": {}
    },
    "colab": {
      "provenance": []
    },
    "kernelspec": {
      "display_name": "Python 3 (ipykernel)",
      "language": "python",
      "name": "python3"
    },
    "language_info": {
      "codemirror_mode": {
        "name": "ipython",
        "version": 3
      },
      "file_extension": ".py",
      "mimetype": "text/x-python",
      "name": "python",
      "nbconvert_exporter": "python",
      "pygments_lexer": "ipython3",
      "version": "3.10.12"
    }
  },
  "nbformat": 4,
  "nbformat_minor": 1
}
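One practical note on the notebook above: the printed output was captured on Databricks, which is why the code cell reads from a dbfs:/ path, while the earlier wget cell downloads the sample file into a local txt-files directory for Colab. Below is a minimal sketch of the Colab flow; it assumes the txt column is an array of element structs exposing elementType, content, and metadata fields, as the printed output suggests (those field names are an assumption here, not something this diff confirms):

import sparknlp
from pyspark.sql.functions import col, explode

# Start (or attach to) a Spark session with Spark NLP on the classpath
spark = sparknlp.start()

# Parse every TXT file in the directory created by the wget cell
txt_df = sparknlp.read().txt("txt-files")

# Flatten the element array: one row per Title/NarrativeText element
elements_df = txt_df.select(explode(col("txt")).alias("element"))

# Field names assumed from the sample output; adjust if your release
# exposes a different element schema
elements_df.select("element.elementType", "element.content").show(truncate=80)

Grouping on the paragraph entry in element.metadata would recover the title/body pairing visible in the sample output.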

python/sparknlp/reader/sparknlp_reader.py

-1
@@ -254,7 +254,6 @@ def ppt(self, docPath):
         jdf = self._java_obj.ppt(docPath)
         dataframe = self.getDataFrame(self.spark, jdf)
         return dataframe
-        return self.getDataFrame(self.spark, jdf)
 
     def txt(self, docPath):
         """Reads TXT files and returns a Spark DataFrame.
