Getting started with PySpark (RDD, Spark SQL) — a 10-minute tutorial

with “Google Colaboratory” and “Word Count Example”

SoniaComp
3 min read · Jan 4, 2022

You can check the complete code at the link below.


1. Spark Environment Setup
- Install Java, Spark, and Findspark
- Set Environment Variables
- Start a SparkSession
2. Loading data into Spark
- Create your own RDD
- Import data from outside
3. Word Count Example: count the occurrences of each word
- Use the Spark SQL API (DataFrame)
- Use the RDD API

1. Spark Environment Setup

Install Java, Spark, Findspark

  • Java: Spark runs on the JVM, so Java must be installed.
  • Spark: the engine we will use to process data 😄
  • findspark: a library that makes it easy for Python to find the Spark installation
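
A minimal sketch of the Colab install cells. The Spark/Hadoop versions, the download URL, and the use of OpenJDK 8 are assumptions; adjust them to whatever release you want to use.

```python
# Run in a Colab cell: install Java, download Spark, and install findspark.
# Versions, paths, and the download URL below are assumptions; adjust as needed.
!apt-get install -y -qq openjdk-8-jdk-headless > /dev/null
!wget -q https://archive.apache.org/dist/spark/spark-3.2.0/spark-3.2.0-bin-hadoop3.2.tgz
!tar -xf spark-3.2.0-bin-hadoop3.2.tgz
!pip install -q findspark
```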

Set Environment Variables

Set the locations where Spark and Java are installed.
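
A sketch assuming the install locations from the previous step (Colab's default OpenJDK 8 path and the Spark directory extracted under `/content`); if your paths differ, point the variables there instead.

```python
import os
import findspark

# Paths are assumptions matching the Colab install sketch above.
os.environ["JAVA_HOME"] = "/usr/lib/jvm/java-8-openjdk-amd64"
os.environ["SPARK_HOME"] = "/content/spark-3.2.0-bin-hadoop3.2"

findspark.init()  # makes the pyspark package importable from Python
```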

Start a SparkSession

The entry point into all functionality in Spark is the SparkSession class.
(cf. SparkContext: the entry point for applications prior to Spark 2.0)
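
A minimal sketch of creating the session; the app name is arbitrary and `local[*]` simply runs Spark locally on all available cores of the Colab VM.

```python
from pyspark.sql import SparkSession

# Create (or reuse) a SparkSession running in local mode.
spark = (
    SparkSession.builder
    .master("local[*]")
    .appName("wordcount-tutorial")  # app name is an arbitrary placeholder
    .getOrCreate()
)
sc = spark.sparkContext  # the underlying SparkContext, used later for the RDD API
```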

2. Loading Data into Spark

Create your own RDD

For an explanation of `createDataFrame`, you can refer to the official documentation: link

(figure: the createDataFrame API, from the Spark documentation)
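
A small sketch of both paths: `parallelize` for an RDD and `createDataFrame` for a DataFrame. The sample sentences and the column name are placeholders, and `sc`/`spark` are the context and session created above.

```python
# Build an RDD from a local Python collection (sample sentences are placeholders).
lines = [
    "spark makes big data simple",
    "spark runs on the jvm",
]
rdd = sc.parallelize(lines)

# Build a DataFrame with an explicit column name via createDataFrame.
df = spark.createDataFrame([(line,) for line in lines], ["sentence"])
df.show(truncate=False)
```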

Import Data From outside

You can import various types of data; see the official documentation: link. A reader sketch follows the list below.

  • parquet files: Apache Parquet is a flat columnar storage format designed to be more efficient and performant than row-based formats such as CSV or TSV. Parquet uses the record shredding and assembly algorithm, which is superior to simple flattening of nested namespaces. (from the Databricks documentation)
  • orc files
  • json files
  • csv files
  • text files
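
Each format has a dedicated reader on `spark.read`; the file paths below are placeholders, so point them at your own data.

```python
# One reader call per format; paths are placeholders.
parquet_df = spark.read.parquet("data/example.parquet")
orc_df = spark.read.orc("data/example.orc")
json_df = spark.read.json("data/example.json")
csv_df = spark.read.csv("data/example.csv", header=True, inferSchema=True)
text_df = spark.read.text("data/example.txt")  # one row per line, in a "value" column
```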

Checking Loaded Data
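
A quick look at what was loaded, assuming the `text_df` DataFrame from the reader sketch above.

```python
# Inspect the loaded data: schema, a few rows, and the total row count.
text_df.printSchema()
text_df.show(5, truncate=False)
print(text_df.count())
```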

3. Word Count Example: count the occurrences of each word

Use the Spark SQL API (DataFrame)

Spark uses lazy evaluation: nothing is computed until an action requires the actual calculation to run and return a result.
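
A small illustration of the difference, assuming the `df` DataFrame with a `sentence` column from section 2: the transformations only build up the plan, and only the final action triggers computation.

```python
from pyspark.sql import functions as F

# Transformations: split each sentence into words and count per word.
# Nothing is computed yet; Spark only records the plan.
words = df.select(F.explode(F.split(F.col("sentence"), " ")).alias("word"))
counts = words.groupBy("word").count()

# The action triggers the actual computation and returns results.
counts.show()
```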

Using the explain method, you can see the DataFrame's data processing plan (its DAG). Spark builds an optimized logical plan and a physical plan from it.
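
For example, calling explain on the `counts` DataFrame built above prints the plans Spark produced; passing `True` shows the logical plans as well as the physical plan.

```python
# Print the logical and physical plans for the word-count DataFrame.
counts.explain(True)
```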

Spark can run distributed ANSI SQL queries against DataFrames by registering them as temporary views with createOrReplaceTempView. A temp view exists only while the SparkSession is alive.
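
The same word count expressed as SQL, after registering the `words` DataFrame from above as a temporary view (the view name is a placeholder).

```python
# Register a temporary view and run the word count as an ANSI SQL query.
words.createOrReplaceTempView("words")
spark.sql("""
    SELECT word, COUNT(*) AS cnt
    FROM words
    GROUP BY word
    ORDER BY cnt DESC
""").show()
```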

Use the RDD API

The RDD API is lower-level. With a DataFrame, the count can be obtained with count immediately after groupBy, but with an RDD you have to implement the counting yourself using map (and reduceByKey).
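
A sketch of the same count with the RDD API, assuming the `rdd` of sentences created in section 2: split each line into words, emit `(word, 1)` pairs, then sum per key.

```python
# RDD word count: split lines, emit (word, 1) pairs, then reduce by key.
word_counts = (
    rdd.flatMap(lambda line: line.split(" "))
       .map(lambda word: (word, 1))
       .reduceByKey(lambda a, b: a + b)
)
print(word_counts.collect())
```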

DataFrames are fast because their built-in API is optimized by Spark, so it is recommended to use Spark SQL with DataFrames rather than the low-level RDD API.


SoniaComp

Data Engineer interested in Data Infrastructure Powering Fintech Innovation (https://www.linkedin.com/in/sonia-comp/)