Getting Started with PySpark (RDD, Spark SQL): A 10-Minute Tutorial

You can find the complete code at the link below.

Outline:

1. Spark Environment Setup
- Install Java, Spark, and Findspark
- Set Environment Variables
- Start a SparkSession
2. Loading Data into Spark
- Create your own RDD
- Import data from outside
3. Word Count Example: Count the occurrences of each word
- Use the Spark SQL API (DataFrame)
- Use the RDD API

1. Spark Environment Setup

Install Java, Spark, and Findspark

  • Java: Since Spark runs on the JVM, Java must be installed.
  • Spark: We will use Spark to process the data 😄
  • findspark: a library that makes it easy for Python to find Spark
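For a notebook environment such as Google Colab, the installation step might look like the sketch below. The package names, Spark version, and download URL are illustrative assumptions, not from the original post; adjust them to your platform.

```python
# A minimal install sketch for a Debian-based notebook environment
# (versions and URLs are illustrative -- check the Spark download page):
#
#   !apt-get install -y openjdk-8-jdk-headless
#   !wget -q https://archive.apache.org/dist/spark/spark-3.3.0/spark-3.3.0-bin-hadoop3.tgz
#   !tar xf spark-3.3.0-bin-hadoop3.tgz
#   !pip install -q findspark
```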

Set Environment Variables

Set the locations where Spark and Java are installed.
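A minimal sketch of this step, assuming a Linux layout; the two paths are illustrative placeholders, not the values from the original post:

```python
import os

# Point Spark and Java at their install locations.
# Both paths below are illustrative assumptions -- adjust to your machine.
os.environ.setdefault("JAVA_HOME", "/usr/lib/jvm/java-8-openjdk-amd64")
os.environ.setdefault("SPARK_HOME", "/opt/spark")

# findspark reads SPARK_HOME and adds PySpark to sys.path:
# import findspark
# findspark.init()
```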

Start a SparkSession

The entry point into all functionality in Spark is the SparkSession class.
(cf. SparkContext: the entry point for applications prior to Spark 2.0)

2. Loading Data into Spark

Create your own RDD

For an explanation of `createDataFrame`, you can refer to the official documentation: link

(from the Spark documentation)

Import Data from Outside

You can import various types of data; see the official documentation: link

  • parquet files: Apache Parquet is a flat columnar storage format designed to be more efficient and performant than row-based formats such as CSV or TSV. Parquet uses the record shredding and assembly algorithm, which is superior to simple flattening of nested namespaces. (from the Databricks documentation)
  • orc files
  • json files
  • csv files
  • text files

Checking Loaded Data

3. Word Count Example: Count the occurrences of each word

Use the Spark SQL API (DataFrame)

Spark uses lazy evaluation: transformations are not computed until an action requires an actual result to be calculated and returned.

Using the `explain` method, you can see the DataFrame's processing plan as a DAG. From this DAG, Spark's engine builds an optimized logical plan and a physical plan.

Spark can run distributed ANSI SQL queries against a DataFrame registered through createOrReplaceTempView. The temp view exists only while the SparkSession is valid.

Use the RDD API

The RDD API is lower level: a DataFrame can be counted with count immediately after groupBy, but with an RDD you must implement the count yourself with map and a reduce step.

DataFrames are fast because they use optimized built-in APIs. It is therefore recommended to use Spark SQL with DataFrames rather than the low-level RDD API.

--

SoniaComp — Data Engineer interested in constructing Data-Driven Architecture with Cloud Service