Ebook
- AWS Certified Data Analytics Study Guide DAS-C01 Exam
- Big_Data_on_AWS_3.8_(EN)_Student_Guide.pdf
- Big_Data_on_AWS_3.8_(EN)_Lab_Guide.pdf
Cheatsheet
- Redshift Spectrum: A Redshift feature that allows users to run queries against data stored on Amazon S3.
- Geospatial chart: A QuickSight visual type that is best suited for displaying different data values across a geographical map.
- Amazon Kinesis: Data streams that can collect and process large streams of data records in real time.
- Amazon Macie: A fully managed data security and data privacy service that uses machine learning and pattern matching to discover and protect your sensitive data in AWS.
- Amazon Athena: An interactive query service that makes it easy to analyze data in Amazon S3 using standard SQL.
- True or False: A single COPY command is faster than multiple COPY commands when loading one Redshift table from multiple files. (Answer: True)
- AWS Glue: A serverless data integration service that makes it easy to discover, prepare, and combine data for analytics, machine learning, and application development.
- Per-query limit: A form of cost control in Amazon Athena that limits the amount of data scanned per query.
- Amazon S3: The AWS service used as a reference data source for a Kinesis Data Analytics application.
- Amazon S3 Glacier: A secure, durable, and extremely low-cost Amazon S3 storage class for data archiving and long-term backup.
Types of Data Collection
- Real Time
- Near Real Time
- Batch
Real Time Data Collection Services
- Kinesis Data Streams
- Simple Queue Service
- Internet of Things
Near Real Time Collection Services
- Kinesis Data Firehose
- Database Migration Service
Batch Collection Services
- Snowball
- Data Pipeline
Kinesis Services
- Kinesis Streams
- Kinesis Analytics
- Kinesis Firehose
Kinesis Streams
- Streams are divided into ordered shards
- Data retention is 24 hours by default, 7 days maximum (can be raised via the API; see the sketch after this list)
- Ability to reprocess data within the data retention period
- Once data is inserted, it cannot be deleted
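A minimal boto3 sketch (stream name and target retention are placeholder assumptions) that raises retention from the 24-hour default so consumers can reprocess older records:

```python
import boto3

kinesis = boto3.client("kinesis")

# Extend retention beyond the 24-hour default so data can be reprocessed
# within the retention window (stream name is hypothetical).
kinesis.increase_stream_retention_period(
    StreamName="example-stream",
    RetentionPeriodHours=168,  # 7 days
)

# Confirm the new retention setting.
summary = kinesis.describe_stream_summary(StreamName="example-stream")
print(summary["StreamDescriptionSummary"]["RetentionPeriodHours"])
```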
Kinesis Streams Records
- Data Blob
- Partition Key (record key)
- Sequence Number
Data Blob: The data being sent, serialized as bytes. Up to 1MB. Can represent anything
Partition Key
- Sent alongside a record; groups records into shards. Same key = same shard
- Use a highly distributed key to avoid the “hot partition” problem
Sequence Number: Unique identifier for each record put in a shard. Added by Kinesis after ingestion (see the sketch below)
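A minimal boto3 sketch (stream name, region, and payload are placeholder assumptions) showing how the three record fields fit together: the producer supplies the data blob and partition key, and Kinesis returns the sequence number after ingestion.

```python
import boto3

kinesis = boto3.client("kinesis", region_name="us-east-1")  # region is an assumption

# Send one record: the data blob is bytes, the partition key decides the shard,
# and the response carries the sequence number Kinesis assigned.
response = kinesis.put_record(
    StreamName="example-stream",               # hypothetical stream name
    Data=b'{"sensor_id": "42", "temp": 21.5}',  # data blob (up to 1MB)
    PartitionKey="sensor-42",                   # same key -> same shard
)
print(response["ShardId"], response["SequenceNumber"])
```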
Kinesis Streams Limits (Producer): 1MB/s or 1000 records/s write throughput per shard
Kinesis Producer Components
- Kinesis SDK
- Kinesis Producer Library (KPL)
- Kinesis Agent
- 3rd Party Libraries
Kinesis Streams Limits (Consumer Classic): 2MB/s or 5 GetRecords API calls/s per shard, shared across all consumers
Kinesis Streams Limits (Consumer Enhanced): 2MB/s per shard, per enhanced consumer. No GetRecords polling needed (records are pushed via SubscribeToShard; see the sketch below)
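Enhanced fan-out requires registering a consumer against the stream. A minimal boto3 sketch (the stream ARN and consumer name are placeholders); once registered, the consumer receives records pushed over HTTP/2 via SubscribeToShard instead of sharing GetRecords polling.

```python
import boto3

kinesis = boto3.client("kinesis")

# Register an enhanced fan-out consumer: each registered consumer gets its
# own 2MB/s per shard, pushed to it rather than polled.
consumer = kinesis.register_stream_consumer(
    StreamARN="arn:aws:kinesis:us-east-1:111122223333:stream/example-stream",
    ConsumerName="example-enhanced-consumer",
)["Consumer"]

print(consumer["ConsumerARN"], consumer["ConsumerStatus"])
```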
Kinesis Producer SDK (PutRecord(s))
- The APIs used are PutRecord (one record) and PutRecords (many records)
- PutRecords uses batching and increases throughput (fewer HTTP requests); see the sketch below
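A minimal PutRecords sketch with boto3 (stream name and event payloads are placeholder assumptions). Note that PutRecords is not atomic, so FailedRecordCount should be checked.

```python
import json
import boto3

kinesis = boto3.client("kinesis")

# Batch several records into one PutRecords call: fewer HTTP requests and
# higher throughput than calling PutRecord once per record.
events = [{"device": f"d-{i}", "value": i} for i in range(10)]
response = kinesis.put_records(
    StreamName="example-stream",  # hypothetical stream name
    Records=[
        {"Data": json.dumps(e).encode("utf-8"), "PartitionKey": e["device"]}
        for e in events
    ],
)

# PutRecords is not all-or-nothing: inspect failures and retry them if needed.
print("failed:", response["FailedRecordCount"])
```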
Kinesis Producer SDK (Use Cases)
- Low throughput
- Higher latency
- Simple API
- AWS Lambda
Kinesis Data Streams (Managed AWS Sources)
- CloudWatch Logs
- AWS IoT
- Kinesis Data Analytics
ProvisionedThroughputExceeded Exceptions
Problem: thrown when writes exceed a shard's limit (MB/s or records/s for any single shard), often because of a hot shard (a poorly distributed partition key sends too much data to one shard).
Solutions:
- Retries with exponential backoff (see the sketch below)
- Increase the number of shards (scaling)
- Ensure the partition key is well distributed
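A minimal sketch of the retry-with-backoff solution using boto3 (stream name, attempt count, and backoff timings are placeholder assumptions).

```python
import random
import time

import boto3

kinesis = boto3.client("kinesis")


def put_with_backoff(data: bytes, partition_key: str, max_attempts: int = 5):
    """Retry PutRecord with exponential backoff and jitter when the shard
    throughput limit is exceeded (ProvisionedThroughputExceededException)."""
    for attempt in range(max_attempts):
        try:
            return kinesis.put_record(
                StreamName="example-stream",  # hypothetical stream name
                Data=data,
                PartitionKey=partition_key,
            )
        except kinesis.exceptions.ProvisionedThroughputExceededException:
            # Back off: 0.1s, 0.2s, 0.4s, ... plus random jitter.
            time.sleep((2 ** attempt) * 0.1 + random.random() * 0.1)
    raise RuntimeError("record not ingested after retries")
```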
Kinesis Producer Library (KPL)
- Easy-to-use, highly configurable C++/Java library
- Used for building high performance, long-running producers
- Automated and configurable retry mechanism
- Synchronous or Asynchronous API
- Submits metrics to CloudWatch for monitoring
- Batching increases throughput and decreases cost
- Compression must be implemented by the user
- KPL records must be decoded with the KCL or a special helper library
Kinesis Agent
- Monitors log files and sends them to Kinesis Data Streams
- Java-based agent, built on top of the KPL
- Installed on Linux-based server environments
- Can watch multiple directories and write to multiple streams
- Routing based on directory / log file
- Can pre-process data before sending it to streams
- The agent handles file rotation, checkpointing, and retry upon failure
- Emits metrics to CloudWatch for monitoring
Kinesis Consumers Classic Components
- Kinesis SDK
- Kinesis Client Library
- Kinesis Connector Library
- 3rd party libraries
- Kinesis Firehose
- AWS Lambda
Kinesis Consumer Classic SDK (GetRecords)
- Records are polled by consumers from a shard
- Each shard has 2MB total aggregate throughput
- GetRecords returns up to 10MB of data per call (then throttled for 5 seconds) or up to 10,000 records per call
- Maximum of 5 GetRecords API calls per shard per second = 200ms latency
- If 5 consumer applications consume from the same shard, each consumer can poll once per second and receives less than 400KB/s (see the polling sketch below)
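A minimal classic-consumer polling sketch with boto3 (stream name and polling interval are placeholder assumptions), reading one shard with GetShardIterator/GetRecords while staying under the per-shard call limit.

```python
import time

import boto3

kinesis = boto3.client("kinesis")
stream = "example-stream"  # hypothetical stream name

# Classic consumer: poll one shard and share its 2MB/s with other consumers.
shard_id = kinesis.list_shards(StreamName=stream)["Shards"][0]["ShardId"]
iterator = kinesis.get_shard_iterator(
    StreamName=stream,
    ShardId=shard_id,
    ShardIteratorType="TRIM_HORIZON",  # start from the oldest available record
)["ShardIterator"]

while iterator:
    batch = kinesis.get_records(ShardIterator=iterator, Limit=1000)
    for record in batch["Records"]:
        print(record["SequenceNumber"], record["Data"])
    iterator = batch["NextShardIterator"]
    time.sleep(1)  # stay within the 5 GetRecords calls/s per-shard limit
```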
Kinesis Client Library
To be continued (still being written)
References:
- https://quizlet.com/search?query=aws-data-analytics-specialty&type=sets&useOriginal=