AWS Certified Data Analytics (DAS-C01) — Certification Summary
Amazon’s data services can be divided into five categories: data ingestion, storage, processing, analysis and visualization, and security.
This article is part of a series, with each installment covering one of the five topics above.
1. Data Ingestion
2. Data Storage
- Data Format
- EBS, S3, DynamoDB, RDS
- Architecture
- Other services
3. Data Processing
4. Analysis and Visualization
5. Security
Data Storage
Data Format
- Best Practice: Use Snappy compression to store data in Apache Parquet format. (Parquet with Snappy)
- Snappy compression reduces data size to aid I/O, and it achieves roughly the same level of compression for Avro as for Parquet. Avro stores data in row format and does not compress the data by default.
- Column-based data storage. ORC and Parquet are both column-based storage formats and both are supported by Athena.
- Parquet is partitionable and compressed by default: as a columnar storage format, even without an additional compression codec such as Snappy applied, its encoding compresses data by roughly 2x to 5x on average, and it is optimized for fast retrieval of data.
- Partitioning improves performance when the number of partitions is limited and the partition key has low cardinality. Partitioning on a high-cardinality column creates many partitions with multiple small files under each, which can degrade performance (see the sketch after this list).
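As an illustration only, here is a minimal sketch of the "Parquet with Snappy" best practice, assuming pandas and pyarrow are installed; the column names and output path are hypothetical.

```python
import pandas as pd

# Hypothetical event data; "region" is a low-cardinality column suited for partitioning.
df = pd.DataFrame({
    "event_id": range(6),
    "region": ["us-east-1", "us-east-1", "eu-west-1", "eu-west-1", "ap-south-1", "ap-south-1"],
    "payload": ["a", "b", "c", "d", "e", "f"],
})

# Write columnar Parquet with Snappy compression, partitioned by the low-cardinality key.
# (Uses the pyarrow engine; swap the local path for an s3:// URI if s3fs is available.)
df.to_parquet(
    "events_parquet/",
    engine="pyarrow",
    compression="snappy",
    partition_cols=["region"],
)
```

Partitioning by a high-cardinality column such as event_id would instead create one small file per value, which is the anti-pattern described above.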
EBS
- EBS uses block storage and S3 uses object storage.
- Unbalanced shard allocation across nodes or too many shards in a cluster can cause high JVMMemoryPressure (an Amazon OpenSearch Service/Elasticsearch metric).
S3
- Split (partition) the data by adding key prefixes to S3 object keys.
- Lifecycle
- S3 Standard-IA (Amazon S3 Standard-Infrequent Access): for data that is accessed infrequently but requires rapid retrieval when needed
- S3 Glacier
- Amazon S3 Glacier Deep Archive: the most cost-effective archival solution when data only needs to be restored within 15 hours of a request (see the lifecycle sketch at the end of this S3 section)
- Before you store your data in Amazon S3, organize it by standardizing the column format. The amount of data fluctuates with the total load at any given time; a single data record can be between 100 KB and 10 MB in size.
- E-Tag: Compare the S3 ETags of both files to check whether their contents are the same; comparing the ETags of both files is the most efficient way to ensure data integrity. However, note: the ETag for an object created using the multipart upload API will contain one or more non-hexadecimal characters and/or will consist of fewer than 32 or more than 32 hexadecimal digits, i.e., it is not a plain MD5 digest.
- S3 inventory: Amazon S3 Inventory is one of the tools Amazon S3 provides to help manage your storage. You can use it to audit and report on the replication and encryption status of your objects for business, compliance, and regulatory needs.
- Replication
- To automatically replicate when new objects are written to your bucket, use live replication such as Same-Region Replication (SRR) or Cross-Region Replication (CRR).
- Cross-Region Replication (CRR) is used to copy objects across Amazon S3 buckets in different AWS Regions. This would help reduce latency for team members working from other parts of the world.
- V2 replication configuration ensures that delete markers are not replicated: a subsequent GET request for the deleted object returns the object only from the destination bucket. When you enable S3 Replication, the V2 configuration is selected by default.
- Amazon S3 Glacier retrieval: Bulk Retrieval and Standard Retrieval are both valid ways to retrieve compressed files; the retrieval tiers differ in speed and cost. TAR and ZIP are common archive formats used with AWS.
- Glacier read: S3 Select
- Multiple file reads: Athena
- Deleting a dataset is irreversible and impacts dashboards, analyses, and dependent objects. If you do decide to delete a dataset, make sure no existing analysis or dashboard is using it.
- Security
- Enable default encryption (AES-256) on the Amazon S3 buckets where logs are stored.
- Grant access to the KMS key used to encrypt data in the S3 bucket; KMS key access is needed to read and decrypt the data from the bucket.
- AWS Lake Formation provides a granular level of security: register the Amazon S3 locations and then enforce permissions through Lake Formation.
- Use AWS Config to monitor S3 ACLs and bucket policies. Configure a CloudWatch Events rule to invoke a Lambda function when AWS Config detects a policy violation, and then trigger SNS to notify the team of the compliance violation.
- Configure Amazon Macie to scan S3 data and validate it against data compliance rules, using CloudWatch Events to trigger SNS notifications.
- Object Lock: safeguards objects from being deleted (for example, for a 10-year retention period).
- Block Public Access: S3 Block Public Access provides controls across an entire AWS account or at the individual bucket level to ensure that objects never have public permissions. By default, S3 buckets do not allow public access, but by enabling Block Public Access you can guarantee they are never exposed publicly.
- S3 endpoint policy: confirm that the S3 VPC endpoint policy allows access to the required S3 location, i.e., that it includes the correct permissions to access the S3 bucket.
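To make the Lifecycle bullets above concrete, here is a minimal boto3 sketch that transitions objects to S3 Standard-IA and then to Glacier Deep Archive; the bucket name, prefix, and day thresholds are hypothetical, not values prescribed by the exam.

```python
import boto3

s3 = boto3.client("s3")

# Hypothetical bucket and prefix; the transition thresholds are illustrative only.
s3.put_bucket_lifecycle_configuration(
    Bucket="example-analytics-logs",
    LifecycleConfiguration={
        "Rules": [
            {
                "ID": "archive-logs",
                "Filter": {"Prefix": "logs/"},
                "Status": "Enabled",
                "Transitions": [
                    # Infrequently accessed but still needs fast retrieval.
                    {"Days": 30, "StorageClass": "STANDARD_IA"},
                    # Cheapest archival tier; restores are measured in hours, not milliseconds.
                    {"Days": 365, "StorageClass": "DEEP_ARCHIVE"},
                ],
            }
        ]
    },
)
```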
S3 Select
- It is cost-effective to load data into Amazon S3 and query it with Amazon S3 Select.
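A minimal sketch of querying an object in place with S3 Select via boto3; the bucket, key, and column names are hypothetical, and the object is assumed to be a GZIP-compressed CSV with a header row.

```python
import boto3

s3 = boto3.client("s3")

# Hypothetical bucket/key; S3 Select returns only the rows and columns the SQL asks for.
response = s3.select_object_content(
    Bucket="example-analytics-logs",
    Key="logs/2023/01/events.csv.gz",
    ExpressionType="SQL",
    Expression="SELECT s.event_id, s.status FROM s3object s WHERE s.status = 'ERROR'",
    InputSerialization={"CSV": {"FileHeaderInfo": "USE"}, "CompressionType": "GZIP"},
    OutputSerialization={"JSON": {}},
)

# The result arrives as an event stream; collect the Records payloads.
for event in response["Payload"]:
    if "Records" in event:
        print(event["Records"]["Payload"].decode("utf-8"), end="")
```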
DynamoDB
Run large-scale analytic and operational workloads using a fully managed, scalable NoSQL database service.
- Partition Key
- Sort Key
- Main (base) table: data is partitioned by the table's partition key.
- Global Secondary Index (GSI): an index with a partition key and a sort key that can be different from those on the base table. It can be created even after the main table exists, so a GSI can be added without any interruption to the existing table.
- Consistency: the only way to achieve strong consistency is to query the main table rather than the replicas; GSI queries are always eventually consistent (see the query sketch after this list).
- Local Secondary Index (LSI): shares the partition key with the base table but uses an alternate sort key (e.g., the way to query all of the threads within a forum partition).
- Global Secondary Index: reduces the load on the main table and provides the ability to query on different metrics. Create a global secondary index on each of these dimensions and periodically query the index for values greater than the threshold for each metric.
- Adaptive Capacity: To accommodate uneven data access patterns, DynamoDB adaptive capacity lets your application continue reading and writing to hot partitions without request failures (as long as you don’t exceed your overall table-level throughput, of course). Adaptive capacity works by automatically increasing throughput capacity for partitions that receive more traffic.
- Integration with AWS Lambda: with DynamoDB Streams triggers, you can build applications that react to data modifications in DynamoDB tables (e.g., tracking modifications to the table).
- EMRFS consistent view uses a DynamoDB table to track objects in Amazon S3 that have been synced with or created by EMRFS, so increasing the RCUs of the shared DynamoDB table will help when that table is throttled.
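A minimal boto3 sketch of the consistency point above, assuming a hypothetical Threads table with a ForumName partition key and a hypothetical GSI named LastPostIndex: strongly consistent reads are requested on the base table, while GSI queries are always eventually consistent.

```python
import boto3
from boto3.dynamodb.conditions import Key

dynamodb = boto3.resource("dynamodb")
table = dynamodb.Table("Threads")  # hypothetical table name

# Strongly consistent read: only possible against the base table (or an LSI).
strong = table.query(
    KeyConditionExpression=Key("ForumName").eq("aws-analytics"),
    ConsistentRead=True,
)

# GSI query: offloads the base table, but is always eventually consistent;
# passing ConsistentRead=True here would raise a ValidationException.
eventual = table.query(
    IndexName="LastPostIndex",  # hypothetical GSI
    KeyConditionExpression=Key("ForumName").eq("aws-analytics"),
)

print(strong["Count"], eventual["Count"])
```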
Amazon RDS
- Used as a Hive metastore: when the Hive metastore needs to be centralized and accessed from multiple clusters, it can be externalized into RDS. The external metastore also supports Spark, Hive, and Presto. If you need the metastore to persist beyond the cluster lifecycle, you must create an external metastore that exists outside the cluster.
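A minimal sketch of pointing an EMR cluster at an external RDS-backed Hive metastore via the hive-site classification; the endpoint, database name, and credentials below are hypothetical placeholders, and in practice the password would come from Secrets Manager rather than plain text.

```python
# Hive-site overrides that point EMR's Hive metastore at an external RDS MySQL instance.
# Endpoint, database, and credentials are hypothetical placeholders.
hive_site = {
    "Classification": "hive-site",
    "Properties": {
        "javax.jdo.option.ConnectionURL": "jdbc:mysql://hive-metastore.example.us-east-1.rds.amazonaws.com:3306/hive?createDatabaseIfNotExist=true",
        "javax.jdo.option.ConnectionDriverName": "org.mariadb.jdbc.Driver",
        "javax.jdo.option.ConnectionUserName": "hive_admin",
        "javax.jdo.option.ConnectionPassword": "use-secrets-manager-in-practice",
    },
}

# Passed (together with the rest of the cluster definition) as:
#   boto3.client("emr").run_job_flow(..., Configurations=[hive_site], ...)
```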
Architecture V1
- Real-Time Analytics on DynamoDB
- Use DynamoDB TTL to delete items that are older than a year.
- Use DynamoDB Streams to capture items deleted by TTL.
- Use a Lambda function to read these items from the stream and load them into S3 using Kinesis Data Firehose.
- Use an S3 Lifecycle configuration to transition this data to Amazon S3 Glacier for long-term backup.
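A minimal sketch of the Lambda piece of this architecture, assuming a hypothetical Firehose delivery stream named archived-items and a stream view that includes old images: TTL deletions appear in the stream as REMOVE events whose userIdentity principal is dynamodb.amazonaws.com, and only those records are forwarded.

```python
import json
import boto3

firehose = boto3.client("firehose")
DELIVERY_STREAM = "archived-items"  # hypothetical Firehose delivery stream


def handler(event, context):
    """Forward items expired by DynamoDB TTL to Kinesis Data Firehose (which lands them in S3)."""
    records = []
    for record in event.get("Records", []):
        # TTL deletions are REMOVE events performed by the DynamoDB service principal.
        is_ttl_delete = (
            record.get("eventName") == "REMOVE"
            and record.get("userIdentity", {}).get("principalId") == "dynamodb.amazonaws.com"
        )
        if is_ttl_delete:
            # Requires the stream view type to include old images (e.g., NEW_AND_OLD_IMAGES).
            old_image = record["dynamodb"].get("OldImage", {})
            records.append({"Data": (json.dumps(old_image) + "\n").encode("utf-8")})

    if records:
        firehose.put_record_batch(DeliveryStreamName=DELIVERY_STREAM, Records=records)
    return {"forwarded": len(records)}
```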
Architecture V2
- Amazon RDS: OLTP
- Amazon S3: storing large volumes of data
- Amazon Redshift: used for data search and queries
- Amazon Redshift Spectrum: merging existing data in S3 with up-to-date data in Redshift
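A minimal sketch of the Spectrum "merge" pattern via the Redshift Data API; the cluster identifier, database, user, and schema/table names are hypothetical, and the query simply unions an external (S3) history table with a local table holding recent data.

```python
import boto3

redshift_data = boto3.client("redshift-data")

# Hypothetical Spectrum pattern: UNION the external (S3) history with the hot data in Redshift.
sql = """
SELECT order_id, order_ts, amount
FROM spectrum_schema.orders_history   -- external table over S3 (historical data)
UNION ALL
SELECT order_id, order_ts, amount
FROM public.orders_recent             -- local Redshift table (up-to-date data)
"""

# Cluster identifier, database, and DB user are hypothetical placeholders.
response = redshift_data.execute_statement(
    ClusterIdentifier="analytics-cluster",
    Database="analytics",
    DbUser="admin",
    Sql=sql,
)
print(response["Id"])  # statement ID; fetch rows later with get_statement_result
```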
Other services for Data Storage
- Amazon DocumentDB (with MongoDB compatibility): a scalable, durable, fully managed database service for running mission-critical MongoDB workloads.