PySpark UDFs, Spark-NLP, and scraping unstructured text data on Spark clusters — a complete ETL pipeline for big-data architecture

Yogender Pal
4 min read · Apr 9, 2022

This is a beginner-to-pro guide to working with PySpark clusters. The complete Jupyter notebook can be found here:

Image by Apache Spark

Apache Spark is an in-memory distributed computing platform that integrates closely with the Hadoop ecosystem. Spark is used to build data ingestion pipelines on cloud platforms such as AWS Glue, AWS EMR, and Databricks, and to perform ETL jobs on the data lakes behind them.

PySpark is the Python API for Spark. It allows you to interact with Resilient Distributed Datasets (RDDs) in Apache Spark from Python.


Spark applications run as independent sets of processes on a cluster, coordinated by the SparkContext object in your main program (called the driver program). To run the job on clusters, SparkContext connects to cluster manager which allocates resources across applications. There are worker nodes that are responsible for the job execution. Spark acquires executors on worker nodes in the cluster, which are processes that run computations and store data for your jobs. Finally, SparkContext sends tasks to the executors to run.

Source: Apache Spark

There are 10 HTML files in the directory. Please scrape (parse, extract, clean) this data into a human-readable format (preferably CSV) using Spark RDDs.

Assignment: The analytical team requires the following data to be extracted from HTML files:

  • Price — the value of class="norm-price ng-binding"
  • Location — the value of class="location-text ng-binding"
  • All parameter labels and values in class="params1" and class="params2"

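Before reaching for any method, it helps to see the extraction logic itself. A minimal sketch using only the standard-library `html.parser` (the `ClassTextExtractor` name and the sample HTML snippet are invented for illustration; the class names come from the assignment above) — a function like this would be applied inside an `rdd.map()` over the HTML files:

```python
from html.parser import HTMLParser

class ClassTextExtractor(HTMLParser):
    """Collects the text content of the first tag whose class attribute
    exactly matches the target, e.g. "norm-price ng-binding".
    Note: a simple sketch — void tags like <br> are not depth-tracked."""

    def __init__(self, target_class):
        super().__init__()
        self.target_class = target_class
        self.depth = 0      # > 0 while inside the matched element
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if self.depth:
            self.depth += 1                      # nested tag inside the match
        elif dict(attrs).get("class") == self.target_class:
            self.depth = 1                       # entered the matched element

    def handle_endtag(self, tag):
        if self.depth:
            self.depth -= 1

    def handle_data(self, data):
        if self.depth:
            self.chunks.append(data)

def extract_by_class(html, target_class):
    parser = ClassTextExtractor(target_class)
    parser.feed(html)
    return "".join(parser.chunks).strip()

# Hypothetical sample; real pages would come from sc.wholeTextFiles(...)
sample = '<div><span class="norm-price ng-binding">$450,000</span></div>'
print(extract_by_class(sample, "norm-price ng-binding"))  # $450,000
```

In a production pipeline you would more likely use BeautifulSoup inside the map function, but the stdlib version keeps the sketch dependency-free.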
I will demonstrate three methods to deal with this kind of problem:

Method 1: Spark-NLP by John Snow Labs

Why Spark-NLP for this solution?

Since we are dealing with text data (on Spark), it's reasonable to assume the end consumer of the scraped…

Yogender Pal

I talk about programming, AWS, and two hamsters in my room