SPARKNLP-1109 Adding Extractor to Sparknlp #14519
Open
Description
This PR introduces two new SparkNLP-powered annotators: Extractor and Cleaner.
Extractor Annotator:
Enables seamless extraction of key information (e.g., dates, emails, IP addresses) from various data sources such as .eml files. This simplifies data parsing workflows by isolating relevant details automatically.
Cleaner Annotator:
Removes unnecessary or undesirable content from datasets, such as bullets, dashes, and non-ASCII characters, enhancing data consistency and readability.
These annotators are designed for scalable, high-performance data processing, leveraging Apache Spark to handle large datasets efficiently.
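To make the two behaviours described above concrete, here is a minimal plain-Python sketch of the kind of extraction and cleaning involved. It is not the API added by this PR: the function names `extract_entities` and `clean_text` and all regular expressions are hypothetical, chosen only for illustration, and the patterns are deliberately simple (for example, the IPv4 pattern does not check that each octet is at most 255).

```python
import re

# Hypothetical patterns for illustration only; the PR's annotators may use different rules.
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")
IPV4_RE = re.compile(r"\b(?:\d{1,3}\.){3}\d{1,3}\b")
BULLET_RE = re.compile(r"^\s*[\u2022\u25CF\u25AA*\-]+\s*")  # leading bullet markers
DASH_RE = re.compile(r"[-\u2013\u2014]+")  # hyphens, en dashes, em dashes


def extract_entities(text: str) -> dict:
    """Return every email address and IPv4-like string found in the text."""
    return {
        "email": EMAIL_RE.findall(text),
        "ipv4": IPV4_RE.findall(text),
    }


def clean_text(text: str) -> str:
    """Strip leading bullets, replace dashes with spaces, and drop non-ASCII characters."""
    cleaned = []
    for line in text.splitlines():
        line = BULLET_RE.sub("", line)
        # Note: this also replaces hyphens inside words, which a real cleaner may not want.
        line = DASH_RE.sub(" ", line)
        line = line.encode("ascii", "ignore").decode("ascii")
        line = re.sub(r"\s+", " ", line).strip()
        cleaned.append(line)
    return "\n".join(cleaned)


if __name__ == "__main__":
    print(extract_entities("Contact jane.doe@example.com from 192.168.0.1"))
    print(clean_text("\u2022 Item one \u2013 done \u2713"))
```

In a Spark pipeline, logic of this kind would run per row inside an annotator rather than as standalone functions; the sketch only shows the text-level transformation.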
Motivation and Context
In many real-world NLP tasks, raw data often contains noise, such as irrelevant symbols, inconsistent formatting, or extraneous details. Similarly, extracting structured information from unstructured data sources can be complex and inefficient without dedicated tools. These new annotators aim to solve these challenges by providing built-in, scalable capabilities for data preprocessing, reducing manual effort and boilerplate code while improving pipeline performance and accuracy.
How Has This Been Tested?
Screenshots (if appropriate):
Types of changes
Checklist: