AWS Certified Data Analytics (DAS-C01) — Certification Summary

(3) Data Processing (AWS EMR and AWS ETL)

SoniaComp · 8 min read · Jun 25, 2022

Amazon’s data services can be divided into five categories: data ingestion, storage, processing, analysis and visualization, and security.

This article is part of a series, with each installment covering one of the five topics above.

1. Data Ingestion
2. Data Storage
3. Data Processing
- Glue
- EMR
- Lambda
- AWS Data Pipeline
- SageMaker
- Architecture
4. Analysis and Visualization
5. Security

Data Processing (ETL)

AWS Glue

AWS Glue is a fully managed ETL service with three main components:
1) The AWS Glue Data Catalog (central metadata repository)
2) The Glue ETL engine (which generates Python or Scala code)
3) A flexible scheduler
Plus connectors and job bookmarks (for incremental processing).

A scheduled Glue crawler periodically updates the AWS Glue Data Catalog. The crawler infers the schema, format, and data types of the data in the data source and stores them as tables in the AWS Glue Data Catalog.

AWS Glue crawler

To reduce overhead, configure the AWS Glue crawler to catalog your on-premises data using a JDBC connection. The crawler can also catalog data on S3: when an input JSON file lands in S3, the s3:ObjectCreated:* event on the bucket triggers a Lambda function that starts the Glue crawler (a minimal Lambda sketch follows below).
You can also manually create Data Catalog tables and specify those catalog tables as the crawler source, letting the crawler keep them updated.
Use the excludeStorageClasses property on the AWS Glue Data Catalog table to exclude files in the S3 Glacier storage classes.
Crawlers can run in different Regions.
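
A minimal sketch of that trigger, assuming a hypothetical crawler name; the handler is subscribed to the bucket's s3:ObjectCreated:* notification and simply starts the crawler:

```python
# Minimal sketch (crawler name hypothetical): this handler is wired to the
# bucket's s3:ObjectCreated:* notification and starts the Glue crawler.
import boto3

glue = boto3.client("glue")

CRAWLER_NAME = "json-landing-crawler"  # hypothetical crawler name


def lambda_handler(event, context):
    # One S3 event can carry several records; log which objects arrived.
    keys = [r["s3"]["object"]["key"] for r in event.get("Records", [])]
    print(f"New objects landed: {keys}")

    try:
        glue.start_crawler(Name=CRAWLER_NAME)
    except glue.exceptions.CrawlerRunningException:
        # The crawler is already processing an earlier batch; skip this trigger.
        print("Crawler already running, skipping.")

    return {"started": True, "objects": keys}
```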

AWS Glue Catalog

Security: create a fine-grained AWS Glue Data Catalog access policy that allows users access to the production tables (a policy sketch follows below).
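
A sketch of what such a fine-grained policy could look like, created with boto3; the account ID, Region, and production_db names are hypothetical, while the Glue actions and ARN formats are the standard resource-level ones:

```python
# Sketch: resource-level IAM policy granting read-only Data Catalog access to a
# production database only (account ID, Region, and names are hypothetical).
import json

import boto3

iam = boto3.client("iam")

policy = {
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Action": [
                "glue:GetDatabase",
                "glue:GetTable",
                "glue:GetTables",
                "glue:GetPartitions",
            ],
            "Resource": [
                "arn:aws:glue:us-east-1:123456789012:catalog",
                "arn:aws:glue:us-east-1:123456789012:database/production_db",
                "arn:aws:glue:us-east-1:123456789012:table/production_db/*",
            ],
        }
    ],
}

iam.create_policy(
    PolicyName="GlueCatalogProductionReadOnly",  # hypothetical policy name
    PolicyDocument=json.dumps(policy),
)
```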

AWS Glue ETL

Query enhancements
- Compress data files using the .lzo format and query the compressed data directly.
- Driver memory exceeded threshold: modify your AWS Glue ETL code to use the ‘groupFiles’: ‘inPartition’ option so the driver tracks grouped input rather than every individual small file (a grouping sketch follows below).
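
A minimal sketch of that grouping option when reading small files from S3 in a Glue job; the bucket path and group size are illustrative:

```python
# Sketch: read many small S3 files as grouped input so the Spark driver does not
# have to track every file individually (path and group size are illustrative).
from awsglue.context import GlueContext
from pyspark.context import SparkContext

glue_context = GlueContext(SparkContext.getOrCreate())

dyf = glue_context.create_dynamic_frame.from_options(
    connection_type="s3",
    connection_options={
        "paths": ["s3://example-bucket/raw/"],
        "groupFiles": "inPartition",  # group files within each S3 partition
        "groupSize": "134217728",     # target roughly 128 MB per group
    },
    format="json",
)
```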

Use AWS Glue’s FindMatches ML transform to find and link records that refer to the same entity: the most cost-effective and scalable way to deduplicate text-based records.

Glue Spark jobs can write Data Catalog metadata on the fly; running additional crawler jobs would accrue more cost.

Small files: set up AWS Glue Python jobs to merge the small data files in Amazon S3 into larger files and transform them to Apache Parquet format, then migrate the downstream PySpark jobs from Amazon EMR to AWS Glue (a compaction sketch follows below).
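
A minimal compaction sketch for that step, assuming hypothetical S3 paths: read the small JSON files, coalesce them into a handful of partitions, and write Parquet for the downstream jobs.

```python
# Sketch: merge small JSON files into a few larger Parquet files (paths hypothetical).
from awsglue.context import GlueContext
from pyspark.context import SparkContext

glue_context = GlueContext(SparkContext.getOrCreate())
spark = glue_context.spark_session

df = spark.read.json("s3://example-bucket/raw/small-files/")

# Coalesce to a small number of output files sized for the downstream Spark jobs.
(df.coalesce(8)
   .write.mode("overwrite")
   .parquet("s3://example-bucket/curated/parquet/"))
```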

EMR (Amazon Elastic MapReduce — for migrating on-premises Apache Hadoop clusters)

https://docs.aws.amazon.com/ko_kr/emr/latest/EMR-on-EKS-DevelopmentGuide/emr-eks-overview.html
  1. Spin up an EMR cluster with multiple master nodes for high availability.
  2. Use a minimum number of On-Demand EC2 instances as core nodes and auto scale with Spot Instances for better cost savings. (Amazon EC2 Spot Instances let you use spare EC2 capacity at discounts of up to 90% compared to On-Demand pricing. A fleet of Spot and On-Demand Instances is fault tolerant, opting for On-Demand if Spot capacity is not available in a particular Region at the time.) A cluster sketch using instance fleets follows this list.
  3. Use the EMR automatic scaling feature to scale up during high-usage hours and scale down during low-usage hours for cost effectiveness.
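
A hedged sketch of such a cluster using instance fleets with boto3; the release label, instance types, subnet, and IAM roles are hypothetical placeholders:

```python
# Sketch: EMR cluster with On-Demand core capacity plus Spot task capacity
# (release label, instance types, subnet, and roles are hypothetical).
import boto3

emr = boto3.client("emr")

emr.run_job_flow(
    Name="analytics-cluster",
    ReleaseLabel="emr-6.10.0",
    Applications=[{"Name": "Spark"}],
    Instances={
        "Ec2SubnetId": "subnet-0123456789abcdef0",
        "KeepJobFlowAliveWhenNoSteps": True,
        "InstanceFleets": [
            # For the multi-master HA option, this target would be 3 instead of 1.
            {"Name": "master", "InstanceFleetType": "MASTER",
             "TargetOnDemandCapacity": 1,
             "InstanceTypeConfigs": [{"InstanceType": "r5.xlarge"}]},
            {"Name": "core", "InstanceFleetType": "CORE",
             "TargetOnDemandCapacity": 2,  # minimum guaranteed On-Demand capacity
             "InstanceTypeConfigs": [{"InstanceType": "r5.xlarge"}]},
            {"Name": "task", "InstanceFleetType": "TASK",
             "TargetSpotCapacity": 4,      # cheaper, interruptible Spot capacity
             "InstanceTypeConfigs": [{"InstanceType": "r5.xlarge"},
                                     {"InstanceType": "r5a.xlarge"}]},
        ],
    },
    JobFlowRole="EMR_EC2_DefaultRole",
    ServiceRole="EMR_DefaultRole",
)
```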

You must create the subnet in a supported Availability Zone (not an unsupported one) and then issue the create-cluster command.

  • Spark MLlib libraries on an EMR big data cluster are the best fit for the use case because of the very large volume of data.
  • Master/core nodes: R5 instances are memory optimized, a good fit for memory-intensive Spark jobs.

Third-Party Solutions

  • Flume helps with streaming data ingestion.
  • Sqoop is used for batch data ingestion.
  • HBase is a NoSQL database on Hadoop.
  • Hive is a SQL data warehouse on Hadoop.
  • Oozie is used to manage and coordinate Hadoop jobs.

How to increase availability
- Store the data on the EMR File System (EMRFS) instead of HDFS and enable EMRFS consistent view.
- Create a primary EMR HBase cluster with multiple master nodes.
- Create a secondary EMR HBase read-replica cluster in a separate Availability Zone. Point both clusters to the same HBase root directory in the same Amazon S3 bucket.
- Use the AWS Glue Data Catalog as a metastore for Apache Hive.
- Configure a read replica of the EMR cluster in another Availability Zone with shared storage.

Connection
- Place the required installation scripts in Amazon S3 and run them using custom bootstrap actions.
- Launch an Amazon EC2 instance with Amazon Linux and install the required third-party libraries on the instance. Create an AMI (Amazon Machine Image) and use that AMI to create the EMR cluster.

Security
- Kerberos: Kerberos can be set up to provide strong authentication through secret-key cryptography, so that passwords and other credentials aren’t sent over the network in an unencrypted format.
- Put EMR clusters in a private subnet with no public IP space or Internet Gateway attachment, and create a VPC gateway endpoint for Amazon S3 in the subnet to enable access from EMR. This gives maximum security in the private subnet: when you connect through a VPC endpoint, traffic goes through your Amazon Virtual Private Cloud instead of the public internet.
- For Amazon EMR versions earlier than 5.24.0, encrypted EBS root device volumes are supported only when using a custom AMI. For Amazon EMR version 5.24.0 and later, you can use the security configuration option to encrypt EBS root devices and storage volumes, specifying AWS KMS as the key provider (a configuration sketch follows this list).
- Alternatively, create a custom AMI with an encrypted root device volume and configure Amazon EMR to use it via the CustomAmiId attribute of the CloudFormation template.
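
A sketch of the EMR 5.24.0+ security-configuration approach with boto3; the configuration name and KMS key ARN are hypothetical:

```python
# Sketch: EMR security configuration that encrypts EBS root device and storage
# volumes with AWS KMS (EMR 5.24.0+; the KMS key ARN and names are hypothetical).
import json

import boto3

emr = boto3.client("emr")

security_configuration = {
    "EncryptionConfiguration": {
        "EnableInTransitEncryption": False,
        "EnableAtRestEncryption": True,
        "AtRestEncryptionConfiguration": {
            "LocalDiskEncryptionConfiguration": {
                "EncryptionKeyProviderType": "AwsKms",
                "AwsKmsKey": "arn:aws:kms:us-east-1:123456789012:key/1234abcd-12ab-34cd-56ef-1234567890ab",
                "EnableEbsEncryption": True,  # covers EBS root device and storage volumes
            }
        },
    }
}

emr.create_security_configuration(
    Name="ebs-encryption-config",
    SecurityConfiguration=json.dumps(security_configuration),
)
# Reference it at cluster creation with SecurityConfiguration="ebs-encryption-config".
```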

emr-s3-503-slow-down
This error occurs when you exceed the Amazon Simple Storage Service (Amazon S3) request rate. To resolve it:
1) Add more prefixes to the S3 bucket.
2) Reduce the number of Amazon S3 requests.
3) Increase the EMR File System (EMRFS) retry limit (a configuration sketch follows this list).
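
A sketch of option 3 as an EMR configuration classification; the property values are illustrative, not recommendations:

```python
# Sketch: raise the EMRFS retry limit via the emrfs-site classification when the
# cluster is created (property values are illustrative, not recommendations).
emrfs_retry_configuration = [
    {
        "Classification": "emrfs-site",
        "Properties": {
            "fs.s3.maxRetries": "20",        # retry budget for 503 Slow Down responses
            "fs.s3.sleepTimeSeconds": "10",  # back-off between retries
        },
    }
]
# Pass this list as the Configurations parameter of run_job_flow (see the cluster
# sketch earlier) or paste it into the console's software settings when launching.
```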

Why use EMR? Why not just install the required software on EC2 instances?

  1. Upgradability: EMR uses release labels such as 5.32 or 6.x, each bundling software versions according to the compatibility matrix between the various tools (this can be overridden as well).
  2. Steps: EMR steps let you trigger Linux commands (for us, Spark jobs in particular) on an EMR cluster remotely, without SSH access or the cluster’s IP address. The only requirement is that the step be submitted from a machine that has access to the EMR API, or by using access and secret keys (a step-submission sketch follows this list).
  3. Transient clusters: a transient cluster is launched to run its steps and terminates automatically after the last step finishes. I can’t stress enough how useful this is; it significantly reduces the cost of operations and encourages good development practices.
  4. Monitoring: AWS CloudWatch is an excellent tool for monitoring EMR applications; it requires no additional software setup and works right off the bat. (Not to be confused with Amazon CloudSearch, a fully managed service that makes it easy to install, manage, and scale a search solution for your website or application.)
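
A minimal sketch of submitting a Spark job as an EMR step with boto3, with no SSH involved; the cluster ID, script path, and names are hypothetical:

```python
# Sketch: submit a Spark job to a cluster as an EMR step, with no SSH access
# (cluster ID, script path, and names are hypothetical).
import boto3

emr = boto3.client("emr")

emr.add_job_flow_steps(
    JobFlowId="j-0123456789ABC",
    Steps=[
        {
            "Name": "nightly-spark-job",
            "ActionOnFailure": "CONTINUE",
            "HadoopJarStep": {
                "Jar": "command-runner.jar",  # runs arbitrary commands on the master node
                "Args": [
                    "spark-submit",
                    "--deploy-mode", "cluster",
                    "s3://example-bucket/jobs/etl_job.py",
                ],
            },
        }
    ],
)
```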

Transient EMR Cluster Solution

  • Transient EMR clusters automatically terminate after finishing their job.
  • Bootstrap actions enable installing third-party software outside of the standard EMR distribution, including Python packages, logging tools, databases, etc.
  • A static (long-running) EMR cluster incurs cost even when no data is flowing.
  • A Redshift-based solution incurs the additional cost of the Redshift cluster.

AWS Lambda

  • Watch the number of concurrent Lambda executions; there is a default limit of 1,000 concurrent executions per account.
  • Schedule serverless cron jobs using Amazon CloudWatch Events rules and Lambda.
  • Real-time data transformation: Amazon Kinesis Data Firehose can invoke AWS Lambda functions to transform records in flight (a handler sketch follows this list).
  • AWS Lambda is not the best fit for handling large data volumes, and it is not a good option for running PySpark.
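
A minimal sketch of a Firehose transformation handler: Firehose passes base64-encoded records and expects each one back with a recordId, a result, and re-encoded data (the fields kept in the transformation are hypothetical).

```python
# Sketch: Lambda handler for Kinesis Data Firehose data transformation. Firehose
# passes base64-encoded records and expects each one back with a status
# (the fields kept in the transformation are hypothetical).
import base64
import json


def lambda_handler(event, context):
    output = []
    for record in event["records"]:
        payload = json.loads(base64.b64decode(record["data"]))

        # Example transformation: keep a subset of fields.
        transformed = {"id": payload.get("id"), "amount": payload.get("amount")}

        output.append({
            "recordId": record["recordId"],
            "result": "Ok",  # or "Dropped" / "ProcessingFailed"
            "data": base64.b64encode(
                (json.dumps(transformed) + "\n").encode("utf-8")
            ).decode("utf-8"),
        })
    return {"records": output}
```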

AWS Data Pipeline

Use AWS Data Pipeline to move data between EMR, RDS, and Redshift; it is designed for exactly this purpose. Apache Airflow is a task scheduler that helps manage ETL tasks; with Airflow you can automate workflows.

SageMaker

Automatically build and deploy machine learning models using structured data. You can train models at scale, host the trained models in the cloud, and use them to make predictions on new data. One-click notebook instances run JupyterLab pre-installed with the latest data science and machine learning frameworks.

Architecture

Architecture V1

Source: https://aws.amazon.com/ko/blogs/architecture/field-notes-how-to-build-an-aws-glue-workflow-using-the-aws-cloud-development-kit/
  • Amazon Kinesis Data Firehose is used to deliver all data to an Amazon S3 bucket.
  • Periodically, an AWS Glue job issues a COPY command to send data to Amazon Redshift. If needed, increase the number of AWS Glue job retries, reduce the timeout value, and increase task concurrency.
  1. Create metadata in the AWS Glue Data Catalog on top of the CSV data sets using an AWS Glue crawler.
  2. Use an AWS Glue Spark ETL job to join the on-premises extracted data with the CSV data using the Glue Data Catalog metadata, then transform and load the result into Redshift. Use the AWS Glue dynamic frame file-grouping option while ingesting raw input files. (A job sketch follows this list.)
  3. Use AWS Glue Spark ETL jobs to extract data from on-premises sources using JDBC connectivity.
  • Among the options, only AWS Glue streaming ETL jobs are scalable, serverless, and offer Scala or Python for transformation operations. You can process streaming data with Glue, transform it any way you’d like, and then send that data to Redshift for reporting.
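
A sketch of step 2 as a Glue Spark ETL job; the database, table names, join key, connection name, and paths are hypothetical:

```python
# Sketch of step 2: join the on-premises extract with the CSV data via the Data
# Catalog and load the result into Redshift (database, table names, join key,
# connection name, and paths are hypothetical).
from awsglue.context import GlueContext
from awsglue.transforms import Join
from pyspark.context import SparkContext

glue_context = GlueContext(SparkContext.getOrCreate())

orders = glue_context.create_dynamic_frame.from_catalog(
    database="analytics_db", table_name="onprem_orders")
customers = glue_context.create_dynamic_frame.from_catalog(
    database="analytics_db", table_name="csv_customers")

joined = Join.apply(orders, customers, "customer_id", "customer_id")

glue_context.write_dynamic_frame.from_jdbc_conf(
    frame=joined,
    catalog_connection="redshift-connection",  # Glue connection to the Redshift cluster
    connection_options={"dbtable": "reporting.orders", "database": "dev"},
    redshift_tmp_dir="s3://example-bucket/redshift-temp/",
)
```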

Architecture V2

https://programmaticponderings.com/2020/01/14/getting-started-with-data-analysis-on-aws-using-aws-glue-amazon-athena-and-quicksight-part-2/
  • AWS Glue can communicate with your on-premises data store over a VPN or DX connection.
  • To reduce overhead, configure the AWS Glue crawler to catalog your on-premises data using a JDBC connection. The crawler also catalogs data on S3; as in the crawler section above, when an input JSON file lands in S3, the s3:ObjectCreated:* event triggers a Lambda that starts the Glue crawler.
  • ETL jobs are scheduled every 15 minutes. To have each job register new partitions in the Data Catalog itself, without running a separate crawler, update your AWS Glue ETL code to pass the enableUpdateCatalog and partitionKeys arguments (a sketch follows this list).
  • Enrich your data using AWS Glue jobs and store the results in Amazon S3 in Apache Parquet format. Query the data using Amazon Athena.
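
A sketch of that in-job catalog update using getSink with enableUpdateCatalog and partitionKeys; the database, table, and paths are hypothetical:

```python
# Sketch: let the Glue ETL job add new partitions to the Data Catalog itself via
# enableUpdateCatalog/partitionKeys (database, table, and paths are hypothetical).
from awsglue.context import GlueContext
from pyspark.context import SparkContext

glue_context = GlueContext(SparkContext.getOrCreate())

enriched = glue_context.create_dynamic_frame.from_catalog(
    database="analytics_db", table_name="raw_events")  # stand-in for the enrichment step

sink = glue_context.getSink(
    connection_type="s3",
    path="s3://example-bucket/curated/events/",
    enableUpdateCatalog=True,
    updateBehavior="UPDATE_IN_DATABASE",
    partitionKeys=["year", "month", "day"],
)
sink.setFormat("glueparquet")
sink.setCatalogInfo(catalogDatabase="analytics_db", catalogTableName="events")
sink.writeFrame(enriched)
```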

Architecture V3

Source: https://aws.amazon.com/ko/blogs/architecture/how-parametric-built-audit-surveillance-using-aws-data-lake-architecture/
  • Process the semi-structured data using Spark and store it in the Hive warehouse, keeping the underlying storage location on S3 via EMRFS.
  • This allows users to access the same data via Athena, as well as for reporting via the Athena ODBC driver.
