Getting Started with PySpark (RDD, Spark SQL) — A 10-Minute Tutorial
You can find the complete code at the link below.
Code description:
1. Spark Environment Setup
- Install Java, Spark, and findspark
- Set Environment Variables
- Start a SparkSession
2. Loading Data into Spark
- Create your own RDD
- Import Data from External Sources
3. Word Count Example: Count the Occurrences of Each Word
- Use the Spark SQL API (DataFrame)
- Use the RDD API
1. Spark Environment Setup
Install Java, Spark, and findspark
- Java: Since Spark runs on the JVM, Java must be installed.
- Spark: the engine we will use to process data 😄
- findspark: a library that makes it easy for Python to locate the Spark installation
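A minimal installation sketch, assuming pip is available in your environment; Java itself is not pip-installable and must be installed separately (e.g. OpenJDK 8+ via your OS package manager):

```python
# Assumption: pip is available; in a notebook you would usually run
#   !pip install pyspark findspark
# Java (e.g. OpenJDK 8+) must be installed separately via your OS package manager.
import subprocess
import sys

# Install PySpark and findspark into the current Python environment.
subprocess.check_call([sys.executable, "-m", "pip", "install", "pyspark", "findspark"])
```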
Set Environment Variables
Set the locations where Java and Spark are installed, i.e. the JAVA_HOME and SPARK_HOME environment variables.
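A sketch of setting them from Python, with hypothetical install paths that you should replace with the ones on your machine (if PySpark was installed via pip, SPARK_HOME is often not needed at all):

```python
import os

# Hypothetical install locations -- replace with the paths on your machine.
os.environ["JAVA_HOME"] = "/usr/lib/jvm/java-8-openjdk-amd64"
os.environ["SPARK_HOME"] = "/opt/spark"  # often unnecessary with a pip-installed PySpark
```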
Start a SparkSession
The entry point to all functionality in Spark is the SparkSession class.
(cf. SparkContext was the entry point for applications prior to Spark 2.0.)
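A minimal sketch of creating one; the app name and the local master setting are arbitrary choices for this tutorial:

```python
import findspark

# Let Python locate the Spark installation (uses SPARK_HOME if it is set).
findspark.init()

from pyspark.sql import SparkSession

# Build (or reuse) a SparkSession running locally with all available cores.
spark = (
    SparkSession.builder
    .master("local[*]")
    .appName("pyspark-10min-tutorial")
    .getOrCreate()
)

# The underlying SparkContext is still available through the session.
sc = spark.sparkContext
```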
2. Loading Data into Spark
Create your own RDD
For an explanation of `createDataFrame`, you can refer to the official documentation: link
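A small creation sketch, using made-up sample data, that builds an RDD with parallelize and a DataFrame with createDataFrame:

```python
# Made-up sample data for illustration.
pairs = [("spark", 1), ("hello", 2), ("world", 3)]

# Create an RDD by distributing a local Python collection.
rdd = spark.sparkContext.parallelize(pairs)

# Create a DataFrame from the same data, with explicit column names.
df = spark.createDataFrame(pairs, ["word", "count"])

print(rdd.take(2))   # [('spark', 1), ('hello', 2)]
df.show()
```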
Import Data from External Sources
You can import various types of data; see the official documentation: link. A short reading sketch follows the list below.
- parquet files: Apache Parquet is a columnar storage format designed to be more efficient and performant than row-based formats such as CSV or TSV. Parquet uses the record shredding and assembly algorithm, which is superior to simple flattening of nested namespaces. (from the Databricks documentation)
- orc files
- json files
- csv files
- text files
- …
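A reading sketch with hypothetical file paths; replace them with your own data:

```python
# Hypothetical paths -- replace with your own files.
parquet_df = spark.read.parquet("data/events.parquet")
json_df    = spark.read.json("data/events.json")
text_df    = spark.read.text("data/words.txt")   # a single column named "value"
csv_df     = spark.read.csv("data/events.csv",
                            header=True,          # first line is the header
                            inferSchema=True)     # guess column types
```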
Checking Loaded Data
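For instance, a quick inspection sketch (csv_df here is just the DataFrame from the reading sketch above; any loaded DataFrame works the same way):

```python
# Peek at the first rows, the inferred schema, and the row count.
csv_df.show(5)           # first 5 rows
csv_df.printSchema()     # column names and types
print(csv_df.count())    # number of rows (this triggers an action)
```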
3. Word Count Example: Count the Occurrences of Each Word
Use the Spark SQL API (DataFrame)
Spark uses lazy evaluation: nothing is computed until an action requires an actual result to be calculated and returned.
Using the “explain” method, you can see the DataFrame's data-processing plan as a DAG. Spark turns this DAG into an optimized logical plan and a physical plan.
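A word-count sketch with the DataFrame API, assuming a text DataFrame like the text_df above (a single "value" column of lines); explain only prints the plans, while show is the action that actually triggers the computation:

```python
from pyspark.sql import functions as F

# Split each line into words and count the occurrences of each word.
word_counts = (
    text_df
    .select(F.explode(F.split(F.col("value"), " ")).alias("word"))
    .groupBy("word")
    .count()
)

# Nothing has been processed yet (lazy evaluation) -- explain just prints
# the logical and physical plans.
word_counts.explain(True)

# show() is an action, so the computation runs here.
word_counts.show()
```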
Through createOrReplaceTempView, Spark can run distributed ANSI SQL queries against DataFrames. The temp view exists only while the SparkSession is valid.
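The same count expressed as SQL against a temp view (the view name "lines" is an arbitrary choice):

```python
# Register the text DataFrame as a temporary view named "lines".
text_df.createOrReplaceTempView("lines")

# Run a distributed SQL query against the view; the view lives only as long
# as this SparkSession.
sql_counts = spark.sql("""
    SELECT word, COUNT(*) AS cnt
    FROM (
        SELECT explode(split(value, ' ')) AS word
        FROM lines
    ) AS words
    GROUP BY word
""")
sql_counts.show()
```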
Use the RDD API
The RDD API is more low-level. A DataFrame can be counted with count immediately after groupBy, but with an RDD you have to implement the counting yourself with map (and reduceByKey), as sketched below.
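A sketch of the same word count with the RDD API, again assuming the hypothetical "data/words.txt" file from the reading sketch:

```python
# Read the file as an RDD of lines (path is hypothetical).
lines = spark.sparkContext.textFile("data/words.txt")

# Split lines into words, map each word to (word, 1), then sum the 1s per word.
rdd_counts = (
    lines
    .flatMap(lambda line: line.split(" "))
    .map(lambda word: (word, 1))
    .reduceByKey(lambda a, b: a + b)
)

# collect() is the action that triggers the computation.
print(rdd_counts.collect())
```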
DataFrames are fast because they go through Spark's optimized built-in API. It is therefore recommended to use Spark SQL with DataFrames rather than the low-level RDD API.