AWS Certified Data Analytics(DAS-C01) — Certification Summary
Amazon’s data services can be divided into five categories: data ingestion, storage, processing, analysis and visualization, and security.
This article is part of a series, each dealing with each of the five topics above.
1. Data Ingestion
2. Data Storage
3. Data Processing4. Analysis and Visualization
- Lake Formation
Analysis: Data Warehouse
- Open Search(Elastic Search)
- Lake Formation
- QuickInsight5. Security
Lake Formation helps you build, secure, and manage your data lake. First, identify existing data stores in S3 or relational databases and NoSQL databases and move the data into a data lake. It then crawls, catalogs, and prepares the data for analysis. Next, provide secure, self-service access to data through the analytics service of your choice. Other AWS services and third-party applications may also access data through the marked services. Lake Formation manages all the tasks shown in the orange boxes and integrates with the datastores and services shown in the blue boxes.
AWS Lake Formation Permissions — Permissions Synchronization System: You can grant access to data using AWS Glue methods or by using AWS Lake Formation Authorization. AWS Glue methods use AWS Identity and Access Management (IAM) policies to achieve fine-grained access control. Lake Formation uses a simpler GRANT/REVOKE permission model similar to the GRANT/REVOKE commands in relational database systems. Column-based authorization possible. Authentication is managed via IAM.
Analysis: Data Warehouse
Redshift requires framework management and data preparation while Athena bypasses that and gets straight to querying data from Amazon S3. Amazon Redshift excels when it comes to large, organized, and traditionally relational datasets- it does well with performing aggregations, complex joins, and inner queries.
Athena is a serverless service and does not need any infrastructure to create, manage, or scale data sets. It works directly on top of Amazon S3 data sets. It creates external tables and therefore does not manipulate S3 data sources, working as a read-only service from an S3 perspective. Athena uses Presto and ANSI SQL to query on the data sets. It also uses HiveQL for DDL statements.
Comparing Athena to Redshift is not simple. Athena has an edge in terms of portability and cost, whereas Redshift stands tall in terms of performance and scale.
On the other hand, Redshift is a petabyte-scale data warehouse used together with business intelligence tools for modern analytical solutions. Unlike Athena, Redshift requires a cluster for which we need to upload the data extracts and build tables before we can query. Redshift is based on PostgreSQL 8.0.2.
- Redshift is a data warehouse with a query-editor to run queries(complex and non-complex) and analyze the data.
- Multiple complex joins, a small subset of the columns in a table queries, a high-volume of aggregation, a low volume of single-row inserts
- Managing the Workload: WLM: Users have the flexibility to manage priorities within their workloads, so short, fast-running queries don’t get stuck in a queue behind long-running queries. The WLM Concurrency Scaling feature automatically adds additional capacity for both read and write queries, and you are charged only while the query is actively running.
- COPY: Optimize data load times. Loading all the data from one large file is much slower as Amazon Redshift has to do the serialized load. Split the load data file so that the file size after compression is between 1 MB and 1 GB. For optimal parallelism, the ideal size is between 1 MB and 125 MB after compression. The number of files must be a multiple of the number of slices in the cluster. Split the large file and then use the COPY command to load the data into Amazon Redshift. The loading process uses a temporary staging table.
If you use many COPYs Redshift will have to load data in sequential manner. Using metadata is also helpful. => Single Copy is enough.
- Distribution Style
1) Auto distribution: With AUTO distribution, Amazon Redshift assigns the optimal distribution style based on the size of the table data.
2) EVEN distribution: The leader node distributes rows across fragments in a round-robin fashion, regardless of specific column values. EVEN distributions are suitable when tables do not participate in joins.
3) Key distribution: Rows are distributed according to the values in one column. The leader node assigns matching values to the same node fragment. Distributing table pairs based on join key ensures that matching values in common columns are physically stored together because leader nodes co-locate rows into slices based on values in join columns. This physically stores matching values in a common column together.
4) ALL distribution: A copy of the entire table is distributed across all nodes.
If you don’t have IAM permissions, you’ll get an error right away. Timeouts are usually caused by missing security group permissions (allow IP and port access), connection timeouts are always caused by network connections. Resize the cluster using elastic resize on dense storage nodes.
- monitoring rules: three conditions or predicates, and one action
- Audit logging feature: Enable audit logging for Amazon Redshift using the AWS Management Console or AWS CLI.
- Red shift snapshot: taking a snapshot for recovery before a major change
- AWS offers high-density storage instances (ex. D2) that will provide additional options for processing multi-terabyte data sets.
- Unload data older than 3 years to Amazon S3 and delete it from Amazon Redshift. Data older than three years can be moved to Amazon S3 using the UNLOAD command.
- Data Integrity: merge and change the structure of few tables by changing the DDL. Even though it is not enforced, both primary and foreign keys should be defined to maintain data integrity. Primary and foreign keys prevent redundant data and maintain the integrity of your database by making sure that only correct values get entered.
- When Query Submitted
1) The leader node parses the request and develops the query into the execution plan.
2) Every query has a complex query planning and execution workflow.
- Use the GRANT SQL command to grant that user read-only access to a subset of columns.
- Redshift: Corporate governance policies require that encryption keys be managed through an on-premises hardware security module (HSM). AWS CloudHSM is a modern service. Therefore, Redshift supports existing AWS CloudHSM Classic or on-premises HSMs.
- Optimize Query: Compound sort keys speed up joins, GROUP BY and ORDER BY operations, and window functions that use PARTITION BY and ORDER BY. When you create a table, you can define one or many columns as having sort keys which pre-sort the data. Sort keys enable the query optimizer to read fewer rows of data thereby speeding up execution.
- Analyze petabytes of data using ANSI SQL.
- You can use Redshift Spectrum to query data in S3 without having to copy data from S3 to Redshift.
- If the query frequency is high, it is better to query directly to Redsfhit, and if the query frequency is low, it is better to use redshift spectrum.
- Amazon Redshift Spectrum delivers fast results by running queries on thousands of parallel nodes, regardless of the complexity of the query or the amount of data.
- We keep data for the last 90 days in Amazon Redshift. Move data older than 90 days to Amazon S3 and store it in Apache Parquet format partitioned by date. It then uses Amazon Redshift Spectrum for further analysis.
- Redshift provides better performance for querying and analyzing up-to-date 2-month data, combined with Spectrum for infrequent queries on 24-month data.
- Serverless (S3, Athena, Glue) — Low-overhead services
- Amazon Athena is used by a business to do ad-hoc searches on data stored in Amazon S3.
- S3 => Partition your data, Optimize columnar data store generation
- It uses Amazon Athena and JDBC drivers to run a large number of data manipulation language (DML) queries. The query timeout can be increased.
- Workgroups to isolate queries for teams, applications, or different workloads. For example, you may create separate workgroups for two different teams in your organization
- Cost Optimization
- Set the control limits for each query to a prescribed threshold for the workgroup.
- Athena column-level permission
- individual IAM users
- Lake formation: Specify column-level permissions on individual IAM users. Lake Formation has fine-grained access control and the ability to set permissions on speciifc tables or columns. These permissions may be granted to IAM users, external AWS accounts, or SAML.
- Athena query result location supports CSE-KMS, SSE-KMS or SSE-S3 only not customer provided encryption keys
Open Search ( AWS elastic search )
- Kibana, large log analysis, memory error troubleshooting, performance
- Near real time => operational
- scan all text
- Archive indices that are older than 3 months by using Index State Management (ISM) to create a policy to migrate the indices to Amazon OpenSearch Service (Amazon Elasticsearch Service) UltraWarm storage.
Operational Analytics: Use Amazon Kinesis Data Firehose to push data(compressed and batched clickstream records) to an Amazon OpenSearch Service (Amazon Elasticsearch Service) cluster. Visualize your data using the OpenSearch dashboard (Kibana). ELK can automatically refresh data every 5 seconds. Refresh your content performance dashboard in near real time. No custom code development is required. + Kinesis, Logstash, and Elasticsearch’s API
- Quicksight is a Business intelligence tool which gives you rich visualization. Using Machine learning capabilities in Quicksight, frauds are detected by performing the anomaly detection in the provided data. Existing resources can be leveraged for this task.
- Create the Quicksight dataset by querying the Athena. Use Amazon Quick sight to build visualization and perform anomaly detection using ML insights.
- Quicksight connects to any data source, including Salesforce, MySQL, and Amazon S3. Use Amazon QuickSight to connect to data sources and create mobile dashboards. Aurora(OLTP), Athena, and Snowflake are supported data sources in Quicksight. Connect to on-premise data sources by adding them and a Quicksight specific network interface to your VPC.
- RLS(Row-level security): When you share it with readers, they can only see the data restricted by the permission dataset rules. You can add one column to the query or file for each field that you want to grant or restrict access to.
I use QuickSight Enterprise edition. Configure 50 author users and 1,000 reader users. Configure Athena data sources and import data into SPICE. Automatically refreshes every 24 hours.
Only AWS managed keys are supported. In Amazon QuickSight Enterprise edition, data at rest in SPICE is encrypted using block-level encryption with AWS managed keys. Customer-supplied keys imported into AWS KMS cannot be used.
SPICE Data Engine(Enterprise Edition)
Amazon QuickSight’s SPICE data engine supports data sets up to 250 million rows and 500 GB. SPICE is Amazon QuickSight’s ultra-fast, parallel, in-memory computational engine. SPICE is a query acceleration layer used by QuickSight customers to analyze their data. It provides consistently fast query performance and automatically scales to scale to ensure high concurrency. SPICE saves you time because you don’t have to fetch data every time you change your analysis or update your visuals.
- Auto Graph
- bar charts: single measure, multi-measure, clustered, stacked, stacked 100 percent
- box plots: minimum, maximum, median, first quartile, third quartile
- combo charts
- donut charts
- funnel charts
- gauge charts
- heat maps: Use heat maps to show a measure for the intersection of two dimensions, with color-coding to easily differentiate where values fall in the range. Heat maps can also be used to show the count of values for the intersection of the two dimensions.
- histograms: display the distribution of continuous numerical values in your data.
- line charts
- Geospatial chart: Easily identify revenue patterns across cities and locations
- pie charts
- pivot tables: Use pivot tables to show measure values for the intersection of two dimensions.
- forecasting widget
- Use QuickSight folders to organize your dashboards, analytics, and datasets. Use this folder to assign group permissions.
- Create a new security group for Amazon Redshift in a different region with an inbound rule that authorizes access from the appropriate IP address range to your Amazon QuickSight servers.
- Isolated namespaces: Individuals who simply need to examine dashboards may be read-only users. Individuals having the ability to develop and share new dashboards with other users are considered power users.
Measure, Dimension, Correlation, Time series, Continuous variable
Category outliers, Change points in a time series, Correlation, Low Variance, Majority, Outliers, Overall trends in time series, Seasonality in time series, Steady share, Time series outliers
- Create an external table using Amazon Redshift Spectrum and join it with Amazon Redshift.
- Convert data from JSON to Apache Parquet using AWS Glue jobs.
- Use the AWS Glue crawler to discover schemas and build the AWS Glue data catalog.
- Use Amazon Athena to create a table with a subset of columns.
- Visualize your data using Amazon QuickSight, then use Amazon QuickSight machine learning-based anomaly detection.
Use AWS Glue ML transformations to create predictions, and then use Amazon QuickSight to visualize the data. (Ex. Autonomous machine learning (ML) uses RCF (Random Cut Forest) technology to represent complex real-world situations)