Getting Started with PySpark (RDD, Spark SQL) — A 10-Minute Tutorial

You can find the complete code at the link below.

Code overview:

1. Spark Environment Setup
- Install Java, Spark, and Findspark
- Set Environment Variables
- Start a SparkSession
2. Loading data into Spark
- Create your own RDD
- Import data from outside
3. Word Count Example: count the occurrences of each word
- Use the Spark SQL API (DataFrame)
- Use the RDD API

1. Spark Environment Setup

Install Java, Spark, and findspark

  • Java: Since Spark runs on the JVM, Java must be installed.
  • Spark: the engine we will use to process data 😄
  • findspark: a library that makes it easy for Python to find the Spark installation

Set Environment Variables

Set the locations where Spark and Java are installed.
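As a minimal sketch, the environment variables can be set from Python itself before importing pyspark. The paths below are placeholders; substitute your own install locations:

```python
import os

# Placeholder paths -- point these at your own Java and Spark installations.
os.environ["JAVA_HOME"] = "/usr/lib/jvm/java-11-openjdk-amd64"
os.environ["SPARK_HOME"] = "/opt/spark"

# findspark adds SPARK_HOME's Python libraries to sys.path so that
# `import pyspark` works from a plain Python interpreter.
import findspark
findspark.init()
```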

Start a SparkSession

The entry point to all functionality in Spark is the SparkSession class.
(cf. SparkContext: the entry point for applications before Spark 2.0)

2. Loading Data into Spark

Create your own RDD

For an explanation of `createDataFrame`, you can refer to the official documentation: link

(from the Spark documentation)

Import Data from Outside

You can import various types of data; see the official documentation: link.

  • parquet files: Apache Parquet is a flat columnar storage format designed to be more efficient and performant than row-based formats such as CSV or TSV. Parquet uses the record shredding and assembly algorithm, which is superior to the simple flattening of nested namespaces. (from the Databricks documentation)
  • orc files
  • json files
  • csv files
  • text files

Checking Loaded Data

3. Word Count Example: count the occurrences of each word

Use the Spark SQL API (DataFrame)

Spark uses lazy evaluation: transformations are not executed until an action requires an actual result to be computed and returned.

Using the `explain` method, you can see the DataFrame's data processing plan (its DAG). From this DAG, Spark builds an optimized logical plan and then a physical plan.

Through createOrReplaceTempView, Spark can run distributed ANSI SQL queries against DataFrames. A temp view exists only while the SparkSession is alive.

Use the RDD API

The RDD API is lower-level: a DataFrame can be counted with count immediately after groupBy, but with an RDD you must build the count yourself using transformations such as map and reduceByKey.

DataFrames are fast because they run through Spark's optimized built-in APIs, so it is recommended to use Spark SQL with DataFrames rather than the low-level RDD API.


Data Engineer interested in constructing Data-Driven Architecture with Cloud Service
