PySpark UDFs, Spark-NLP, and scraping unstructured text data on Spark clusters — a complete ETL pipeline for Big Data architecture
--
This is a beginner-to-pro guide to working with PySpark clusters. The complete Jupyter notebook can be found here: https://github.com/yogenderPalChandra/spark/blob/main/DefForAllCode.ipynb
Apache Spark is an in-memory distributed computing platform that can run on top of Hadoop (via YARN) or with its own cluster manager. Spark is used to build data ingestion pipelines on various cloud platforms such as AWS Glue, AWS EMR, and Databricks, and to perform ETL jobs on those data lakes.
PySpark is the Python API for Spark. It allows you to interact with Resilient Distributed Datasets (RDDs) in Apache Spark from Python.
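As a quick illustration (a minimal sketch, assuming a local Spark installation; the app name is arbitrary), this is what interacting with an RDD from Python looks like:

```python
from pyspark.sql import SparkSession

# Start (or reuse) a local Spark session; the SparkContext is available as spark.sparkContext
spark = SparkSession.builder.master("local[*]").appName("rdd-demo").getOrCreate()
sc = spark.sparkContext

# Create an RDD from a Python list and apply a transformation plus an action
numbers = sc.parallelize([1, 2, 3, 4, 5])
squares = numbers.map(lambda x: x * x).collect()
print(squares)  # [1, 4, 9, 16, 25]
```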
Components
Spark applications run as independent sets of processes on a cluster, coordinated by the SparkContext object in your main program (called the driver program). To run a job on a cluster, the SparkContext connects to a cluster manager, which allocates resources across applications. Worker nodes are responsible for job execution: Spark acquires executors on the worker nodes, which are processes that run computations and store data for your application. Finally, the SparkContext sends tasks to the executors to run.
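To make these roles concrete, here is a hedged sketch of a driver program asking a cluster manager for executors; the master URL, executor count, and memory settings are placeholders of my own, not values from this article:

```python
from pyspark.sql import SparkSession

# Driver program: the SparkContext behind this session connects to the cluster
# manager named in `master` and requests executor processes on the worker nodes.
spark = (
    SparkSession.builder
    .master("spark://master-host:7077")       # cluster manager URL (placeholder)
    .appName("etl-pipeline")
    .config("spark.executor.instances", "4")  # executors to acquire (placeholder)
    .config("spark.executor.memory", "2g")    # memory per executor (placeholder)
    .getOrCreate()
)

# Tasks produced by this job are shipped by the SparkContext to those executors.
rdd = spark.sparkContext.parallelize(range(100), numSlices=8)
print(rdd.sum())  # 4950
```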
There are 10 HTML files in the directory. Please scrape (parse, extract, and clean) this data into a human-readable format (preferably CSV) using Spark RDDs; a minimal sketch follows the requirements below.
Assignment: The analytical team requires the following data to be extracted from HTML files:
- Price: the value of class="norm-price ng-binding"
- Location: the value of class="location-text ng-binding"
- All parameter labels and values in class="params1" and class="params2"
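Before the three methods below, here is one minimal way this could be attacked with plain RDDs, extracting just the price and location for brevity; the input path, output path, and the use of BeautifulSoup for parsing are my own assumptions, not part of the assignment:

```python
from bs4 import BeautifulSoup  # assumption: installed on the driver and workers
from pyspark.sql import Row, SparkSession

spark = SparkSession.builder.appName("html-scraper").getOrCreate()

def parse_listing(path_and_html):
    """Pull the price and location out of one HTML document."""
    path, html = path_and_html
    soup = BeautifulSoup(html, "html.parser")
    price = soup.find(class_="norm-price ng-binding")
    location = soup.find(class_="location-text ng-binding")
    return Row(
        file=path,
        price=price.get_text(strip=True) if price else None,
        location=location.get_text(strip=True) if location else None,
    )

# wholeTextFiles yields (path, content) pairs, one per HTML file.
rows = spark.sparkContext.wholeTextFiles("data/*.html").map(parse_listing)

# Write the extracted fields to CSV for the analytics team.
spark.createDataFrame(rows).write.mode("overwrite").csv("output/listings", header=True)
```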
I will demonstrate three methods to deal with this kind of problem:
Method 1: Spark-NLP by John Snow Labs
Why Spark-NLP for this solution?
Since we are dealing with text data (on Spark), it's no harm to assume the end consumer of the scraped…