AWS Certified Data Analytics (DAS-C01) — Certification Summary

(1) Data Ingestion (AWS Kinesis and other related services)

SoniaComp
6 min read · Jun 20, 2022

Amazon’s data services can be divided into five categories: data ingestion, storage, processing, analysis and visualization, and security.

This article is part of a series, with one installment for each of the five topics above.

1. Data Ingestion
- Kinesis Tuning
- Amazon Managed Streaming for Apache Kafka Security
- Kinesis Security
- Architecture V1–V4
- Other Services for Data Ingestion
- Network Services for Data Ingestion
- Orchestration for Data Ingestion
2. Data Storage
3. Data Processing
4. Analysis and Visualization
5. Security

Data Ingestion

Main services: Kinesis Data Streams, Kinesis Data Firehose, Kinesis Data Analytics, and the KPL (Kinesis Producer Library)

Kinesis Tuning

  • RecordMaxBufferedTime (a KPL setting): the higher this value, the more records the KPL can pack into each request, improving throughput at the cost of added latency. Applications that cannot tolerate this additional delay may need to use the AWS SDK directly.
  • Performance optimization
    1) Increase the number of shards in a Kinesis data stream to raise throughput with minimal operational overhead.
    2) Choose partition keys in a way that results in an even distribution of records across shards (a producer sketch follows this list).
  • Kinesis Data Firehose
    - Maximum number of delivery streams that can be encrypted with a customer managed CMK: 500.
    - Maximum record size in Kinesis Data Firehose, before the record is Base64-encoded: 1,000 KiB.
    - Kinesis Data Streams suits custom processing; Kinesis Data Firehose can compress data without any custom Lambda code.
    - Kinesis Data Firehose can also deliver data from the same data stream directly into S3, which is cost-effective.
  • Kinesis Data Analytics
    Kinesis Data Analytics consumes streaming data only from Kinesis Data Streams and Kinesis Data Firehose; S3 can be attached only as a reference data source.
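
As a minimal sketch of the partition-key advice above (the stream name and events are placeholders), a boto3 producer can use a high-cardinality partition key to avoid hot shards:

```python
import json
import uuid

import boto3

kinesis = boto3.client("kinesis", region_name="us-east-1")

def put_events(events):
    """Batch-write events with a per-event UUID as the partition key,
    which spreads records evenly across shards."""
    records = [
        {
            "Data": json.dumps(event).encode("utf-8"),
            "PartitionKey": str(uuid.uuid4()),  # high cardinality -> no hot shards
        }
        for event in events
    ]
    response = kinesis.put_records(StreamName="example-stream", Records=records)
    # PutRecords is not all-or-nothing: failed records must be retried
    if response["FailedRecordCount"]:
        failed = [r for r in response["Records"] if "ErrorCode" in r]
        print(f"{len(failed)} records failed and should be retried")

put_events([{"sensor_id": i, "reading": 20.5} for i in range(10)])
```

If per-entity ordering matters, partition on the entity's ID (for example, a user ID) instead: all records with the same key land on the same shard, which preserves order at the cost of a less uniform distribution.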

Amazon Managed Streaming for Apache Kafka Security

  • Kafka clusters can be configured to accept only TLS-encrypted traffic, so data is encrypted in transit.
  • Organizations often want to prevent an application from writing to topics other than the ones it is supposed to write to. Use Kafka ACLs: configure read and write permissions per topic, and use the distinguished name of the client's TLS certificate as the principal in the ACL (a sketch follows).
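
A sketch of that ACL setup using the third-party kafka-python library; the broker address, certificate paths, topic, and certificate DN are all placeholders:

```python
from kafka.admin import (ACL, ACLOperation, ACLPermissionType,
                         KafkaAdminClient, ResourcePattern, ResourceType)

# Connect to the MSK cluster over TLS with a client certificate.
admin = KafkaAdminClient(
    bootstrap_servers="b-1.example.kafka.us-east-1.amazonaws.com:9094",
    security_protocol="SSL",
    ssl_cafile="ca.pem",
    ssl_certfile="client-cert.pem",
    ssl_keyfile="client-key.pem",
)

# Allow only the client whose TLS certificate DN is CN=orders-producer
# to write to the "orders" topic.
acl = ACL(
    principal="User:CN=orders-producer",
    host="*",
    operation=ACLOperation.WRITE,
    permission_type=ACLPermissionType.ALLOW,
    resource_pattern=ResourcePattern(ResourceType.TOPIC, "orders"),
)
admin.create_acls([acl])
```

With allow.everyone.if.no.acl.found=false set in the cluster configuration, clients without a matching allow ACL are denied, so each application can touch only its own topics.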

Kinesis Security

  • KMS (AWS managed keys): rotation is handled by AWS.
  • CMK (customer managed keys): you control rotation yourself.
    Suppose the Amazon Kinesis Data Firehose service cannot decrypt records because of a KMSNotFoundException, a KMSInvalidStateException, a KMSDisabledException, or a KMSAccessDeniedException. In that case, the service waits up to 24 hours (the retention period) for you to resolve the problem. If the problem persists beyond the retention period, the service skips the records that could not be decrypted and discards the data.
  • Cognito: authenticating with Cognito and calling the Kinesis API directly is a reliable and straightforward approach (a sketch follows). Cognito lets you sign up and sign in users to web and mobile apps and scales to millions of users.
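
A minimal sketch of that flow with boto3, assuming an unauthenticated (guest) identity; the identity pool ID and stream name are placeholders:

```python
import boto3

# Exchange a Cognito identity for temporary AWS credentials.
cognito = boto3.client("cognito-identity", region_name="us-east-1")
identity_id = cognito.get_id(
    IdentityPoolId="us-east-1:00000000-0000-0000-0000-000000000000"
)["IdentityId"]
creds = cognito.get_credentials_for_identity(IdentityId=identity_id)["Credentials"]

# Call the Kinesis API directly with the scoped-down Cognito credentials.
kinesis = boto3.client(
    "kinesis",
    region_name="us-east-1",
    aws_access_key_id=creds["AccessKeyId"],
    aws_secret_access_key=creds["SecretKey"],
    aws_session_token=creds["SessionToken"],
)
kinesis.put_record(
    StreamName="example-stream",
    Data=b'{"event": "click"}',
    PartitionKey="user-1",
)
```

For signed-in users, pass the user pool token in the Logins map of get_id so the returned credentials assume the authenticated IAM role instead of the guest role.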

Architecture V1

Source: https://aws.amazon.com/ko/solutions/implementations/aws-streaming-data-solution-for-amazon-kinesis/
  • Ingest data into Amazon Kinesis Data Streams using an Amazon API Gateway API as a Kinesis proxy.
  • Amazon Kinesis can collect and process hundreds of gigabytes of data per second from hundreds of thousands of sources.
  • Use Amazon Kinesis to ingest per-user data and process it sequentially for each user: it scales to the required load and allows multiple applications to read the same records and process them in order.
  • Kinesis preserves record ordering and lets multiple Kinesis applications read and replay records in the same order (a consumer sketch follows this list). Kinesis Data Streams retains data for 24 hours by default, configurable up to 365 days.
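
A minimal consumer sketch of that ordered read/replay behavior (single-shard stream; the name is a placeholder). A second application running the same code replays the same records in the same order:

```python
import boto3

kinesis = boto3.client("kinesis", region_name="us-east-1")

# Read one shard from the oldest retained record; each shard is ordered.
stream = "example-stream"
shard_id = kinesis.describe_stream(StreamName=stream)["StreamDescription"]["Shards"][0]["ShardId"]
iterator = kinesis.get_shard_iterator(
    StreamName=stream, ShardId=shard_id, ShardIteratorType="TRIM_HORIZON"
)["ShardIterator"]

while iterator:
    batch = kinesis.get_records(ShardIterator=iterator, Limit=100)
    for record in batch["Records"]:
        print(record["SequenceNumber"], record["Data"])
    iterator = batch.get("NextShardIterator")
    if not batch["Records"]:
        break  # caught up with the tip of the shard
```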

Architecture V2

Source: https://aws.amazon.com/ko/solutions/implementations/aws-streaming-data-solution-for-amazon-kinesis/
  • Update your producer code to call the Kinesis Data Streams API (PutRecord/PutRecords) through the AWS SDK for Java, or use the KPL.
  • Run Amazon Kinesis Data Analytics on the stream data.
    Add anomaly-detection SQL using Kinesis Data Analytics, and enable data transformation to flatten the nested JSON records (a sketch follows this list).
  • Run analytic calculations using Amazon Athena.
  • Configure Apache Flink applications as consumer applications to process and analyze data.
  • Data type: nested JSON (the schema may change over time)
  • Use S3 lifecycle rules to transition objects to S3 Glacier after 1 year.
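
A sketch of the anomaly-detection step, using the SQL-based (v1) Kinesis Data Analytics API via boto3. The application name, ARNs, roles, and schema are placeholders; RANDOM_CUT_FOREST is the built-in function that assigns each record an anomaly score:

```python
import boto3

kda = boto3.client("kinesisanalytics", region_name="us-east-1")

# SQL that scores every incoming record with RANDOM_CUT_FOREST.
application_code = """
CREATE OR REPLACE STREAM "DEST_SQL_STREAM" (
    "sensor_id" INTEGER, "reading" DOUBLE, "ANOMALY_SCORE" DOUBLE);
CREATE OR REPLACE PUMP "STREAM_PUMP" AS
    INSERT INTO "DEST_SQL_STREAM"
    SELECT STREAM "sensor_id", "reading", "ANOMALY_SCORE"
    FROM TABLE(RANDOM_CUT_FOREST(
        CURSOR(SELECT STREAM * FROM "SOURCE_SQL_STREAM_001")));
"""

kda.create_application(
    ApplicationName="sensor-anomaly-detector",
    ApplicationCode=application_code,
    Inputs=[{
        "NamePrefix": "SOURCE_SQL_STREAM",
        "KinesisStreamsInput": {
            "ResourceARN": "arn:aws:kinesis:us-east-1:111122223333:stream/example-stream",
            "RoleARN": "arn:aws:iam::111122223333:role/kda-read-role",
        },
        # The input schema is also where nested JSON gets flattened:
        # each Mapping is a JSONPath into the source record.
        "InputSchema": {
            "RecordFormat": {
                "RecordFormatType": "JSON",
                "MappingParameters": {"JSONMappingParameters": {"RecordRowPath": "$"}},
            },
            "RecordColumns": [
                {"Name": "sensor_id", "SqlType": "INTEGER", "Mapping": "$.sensor_id"},
                {"Name": "reading", "SqlType": "DOUBLE", "Mapping": "$.reading"},
            ],
        },
    }],
)
```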

Architecture V3

Source: https://aws.amazon.com/ko/solutions/implementations/aws-streaming-data-solution-for-amazon-kinesis/
  • Data Deduplication: Include a unique ID for each record to ensure deduplication during processing.
  • Data Scalability: Replace Kafka with Amazon Kinesis Data Streams, consume the data using Amazon Kinesis Data Firehose, and store it in Amazon S3.
  • Use Amazon Kinesis Data Firehose to output processed data to Amazon S3 (a delivery-stream sketch follows this list).
  • Use Amazon Kinesis Data Firehose to push data to an Amazon OpenSearch Service (formerly Amazon Elasticsearch Service) cluster.
  • Visualize your data using OpenSearch Dashboards (Kibana).
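
A sketch of the stream-to-S3 leg with boto3 (the ARNs, roles, and bucket are placeholders). Note that compression is a delivery-stream setting, so no Lambda code is needed:

```python
import boto3

firehose = boto3.client("firehose", region_name="us-east-1")

# Deliver records from an existing Kinesis data stream to S3, gzipped.
firehose.create_delivery_stream(
    DeliveryStreamName="stream-to-s3",
    DeliveryStreamType="KinesisStreamAsSource",
    KinesisStreamSourceConfiguration={
        "KinesisStreamARN": "arn:aws:kinesis:us-east-1:111122223333:stream/example-stream",
        "RoleARN": "arn:aws:iam::111122223333:role/firehose-read-role",
    },
    ExtendedS3DestinationConfiguration={
        "RoleARN": "arn:aws:iam::111122223333:role/firehose-write-role",
        "BucketARN": "arn:aws:s3:::example-bucket",
        "CompressionFormat": "GZIP",  # compression without custom Lambda code
        "BufferingHints": {"SizeInMBs": 64, "IntervalInSeconds": 300},
    },
)
```

The same delivery stream could instead target the OpenSearch cluster by swapping the S3 destination block for an OpenSearch destination configuration.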

Architecture V4

Source: https://aws.amazon.com/ko/solutions/implementations/aws-streaming-data-solution-for-amazon-kinesis/
  • Use Amazon Kinesis Data Streams to receive data from sensors, Amazon Kinesis Data Analytics to read the stream and aggregate the data, and an AWS Lambda function to store the results in Amazon DynamoDB (a sketch of the Lambda function follows this list).
  • Using Kinesis, real-time data can be captured and sent to Kinesis Data Analytics to run SQL queries. The data can also be stored in Redshift for later research and analysis.
  • Firehose can ingest real-time streaming data, stage it in an S3 bucket, and load it from there into Redshift. Redshift is an effective database solution for running SQL queries.
  • Using AWS Lambda does not affect the performance of other HTTP endpoints.
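
A sketch of the Lambda function at the end of that pipeline, assuming the analytics application emits JSON aggregates with hypothetical sensor_id / window_end / avg_reading fields; the DynamoDB table name is also a placeholder:

```python
import base64
import json

import boto3

dynamodb = boto3.resource("dynamodb")
table = dynamodb.Table("SensorAggregates")

def handler(event, context):
    """Lambda destination for a Kinesis Data Analytics application.
    Each record arrives base64-encoded; store it in DynamoDB and
    report per-record success back to the service."""
    results = []
    for record in event["records"]:
        row = json.loads(base64.b64decode(record["data"]))
        table.put_item(Item={
            "sensor_id": row["sensor_id"],
            "window_end": row["window_end"],
            "avg_reading": str(row["avg_reading"]),  # store numbers as strings to avoid float issues
        })
        results.append({"recordId": record["recordId"], "result": "Ok"})
    # Kinesis Data Analytics expects a result ("Ok" or "DeliveryFailed") per recordId.
    return {"records": results}
```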

Other Services for Data Ingestion

  • Amazon MQ: Amazon MQ is a managed message broker service that makes it easy to migrate to a message broker in the cloud. Currently, Amazon MQ supports the Apache ActiveMQ and RabbitMQ engine types.
  • SNS (Simple Notification Service): Great for new applications that can benefit from near-unlimited scalability and simple APIs. A managed service that delivers messages from publishers to subscribers (also known as producers and consumers).
  • SQS (Simple Queue Service): Amazon SQS supports an unlimited number of queues and an unlimited number of messages per queue for each user. Amazon SQS automatically deletes messages that exceed the retention period (4 days by default).
    (Deleting a queue deletes it even if it still contains messages.)
    (SQS operations include DeleteMessageBatch, SendMessageBatch, and CreateQueue — there is no DeleteMessageQueue API.)
    (By default, instances of the AWS SDK for Java AmazonSQSClient class can maintain at most 50 connections to Amazon SQS.)
    (To improve the security and performance of Amazon SQS, use Signature Version 4, which provides improved SHA256-based security and performance over previous versions.)
  • SQS [Standard Queue]: Provides maximum throughput, best-effort ordering, and at-least-once delivery. Supports a nearly unlimited number of API calls per second.
  • SQS [FIFO Queue]: Guarantees that messages are processed exactly once, in the exact order they are sent (a sketch follows this list). Use it between applications when operations and events are critical or duplicates can't be tolerated.
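
A minimal FIFO sketch with boto3 (the queue name is a placeholder): MessageGroupId preserves ordering within a group, and MessageDeduplicationId suppresses duplicates within the five-minute deduplication window:

```python
import json

import boto3

sqs = boto3.client("sqs", region_name="us-east-1")
queue_url = sqs.get_queue_url(QueueName="orders.fifo")["QueueUrl"]

# Messages within the same MessageGroupId are delivered in order;
# resending the same MessageDeduplicationId within 5 minutes is a no-op.
sqs.send_message_batch(
    QueueUrl=queue_url,
    Entries=[
        {
            "Id": str(i),
            "MessageBody": json.dumps({"order_id": i, "status": "created"}),
            "MessageGroupId": f"customer-{i % 2}",
            "MessageDeduplicationId": f"order-{i}",
        }
        for i in range(4)
    ],
)
```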

Network Services for Data Ingestion

  • AWS API Gateway: An AWS service for creating, publishing, maintaining, monitoring, and securing REST and WebSocket APIs at any scale. API developers can create APIs that access data stored in the AWS Cloud or in other web services.
  • AWS Direct Connect
  • Amazon VPC
  • AWS Shield: To help protect your application and database from layer 3 and layer 4 volumetric attacks, configure AWS Shield with Amazon CloudFront, which adds security and a CDN. A DDoS attack scenario is best secured with AWS Shield; CloudFront integrates with AWS Shield and AWS WAF (Web Application Firewall) to protect against such attacks.
  • AWS Database Migration Service (AWS DMS): You can use this when migrating data to S3.
  • AWS DataSync: AWS DataSync is a secure online data transfer service that simplifies, automates, and accelerates the copying of terabytes of data to and from AWS storage services.
  • AWS Snowball: A service that provides secure, robust devices to bring AWS compute and storage capabilities to the edge and to send and receive data to and from AWS.
  • AWS Transfer for SFTP (Secure File Transfer Protocol)

Orchestration for Data Ingestion

  • AWS Step Functions: automated process orchestration. Reduces manual intervention and simplifies the architecture (a sketch follows).
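
A minimal sketch of an ingestion workflow as a state machine (the Lambda ARNs and role are placeholders):

```python
import json

import boto3

sfn = boto3.client("stepfunctions", region_name="us-east-1")

# A two-step ingestion workflow: validate the batch, then load it.
definition = {
    "StartAt": "ValidateBatch",
    "States": {
        "ValidateBatch": {
            "Type": "Task",
            "Resource": "arn:aws:lambda:us-east-1:111122223333:function:validate-batch",
            "Next": "LoadBatch",
        },
        "LoadBatch": {
            "Type": "Task",
            "Resource": "arn:aws:lambda:us-east-1:111122223333:function:load-batch",
            "End": True,
        },
    },
}

sfn.create_state_machine(
    name="ingestion-pipeline",
    definition=json.dumps(definition),
    roleArn="arn:aws:iam::111122223333:role/sfn-ingestion-role",
)
```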


SoniaComp

Data Engineer interested in Data Infrastructure Powering Fintech Innovation (https://www.linkedin.com/in/sonia-comp/)