Studying AWS Data Analytics Specialty DAS-C01
Ebook
Ebook AWS Certified Data Analytics Study Guide DAS-C01 Exam
Big_Data_on_AWS_3.8_(EN)_Student_Guide.pdf
Big_Data_on_AWS_3.8_(EN)_Lab_Guide.pdf
Cheatsheet
Redshift Spectrum: A Redshift feature that allows users to run queries against data stored on Amazon S3.
Geospatial chart: A QuickSight visual type that is best suited for displaying different data values across a geographical map.
Amazon Kinesis Data Streams: A service that can collect and process large streams of data records in real time.
Amazon Macie: A fully managed data security and data privacy service that uses machine learning and pattern matching to discover and protect your sensitive data in AWS.
Amazon Athena: An interactive query service that makes it easy to analyze data in Amazon S3 using standard SQL.
True or False: A single COPY command is faster than multiple COPY commands when loading one Redshift table from multiple files. Answer: True.
AWS Glue: A serverless data integration service that makes it easy to discover, prepare, and combine data for analytics, machine learning, and application development.
Per-query limit: A form of cost control in Amazon Athena that limits the amount of data scanned per query.
Amazon S3: The AWS service that is used as a reference data source for a Kinesis Data Analytics application.
Amazon S3 Glacier: A secure, durable, and extremely low-cost Amazon S3 storage class for data archiving and long-term backup.
Types of Data Collection
- Real Time
- Near Real Time
- Batch
Real Time Data Collection Services
- Kinesis Data Streams
- Simple Queue Service
- AWS IoT (Internet of Things)
Near Real Time Collection Services
- Kinesis Data Firehose
- Database Migration Service
Batch Collection Services
- Snowball
- Data Pipeline
Kinesis Services
- Kinesis Streams
- Kinesis Analytics
- Kinesis Firehose
Kinesis Streams
- Streams are divided into ordered shards
- Data retention is 24 hours by default and can be extended (originally up to 7 days; up to 365 days since late 2020)
- Ability to reprocess data within the data retention period
- Once data is inserted, it cannot be deleted
Kinesis Streams Records
- Data Blob
- Record Key
- Sequence Number
Data Blob: The data being sent, serialized as bytes. Up to 1 MB. Can represent anything.
Record Key
- Sent alongside a record, helps to group records in Shards. Same key = Same shard
- Use a highly distributed key to avoid the “hot partition” problem
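To make the "same key = same shard" rule and the hot-partition problem concrete, here is a minimal sketch. Kinesis maps a partition key to a 128-bit integer with MD5 and routes the record to the shard whose hash key range contains that value. The helper names (`shard_for_key`, `shard_load`) and the sample keys are made up for illustration, and the sketch assumes the hash space is split evenly across shards, which may not hold after resharding.

```python
import hashlib

HASH_SPACE = 2 ** 128  # MD5 produces a 128-bit value


def shard_for_key(partition_key: str, num_shards: int) -> int:
    """Return the index of the shard whose hash range contains the key.

    Assumption: the hash space is divided evenly between the shards.
    """
    digest = hashlib.md5(partition_key.encode("utf-8")).digest()
    hash_value = int.from_bytes(digest, "big")
    range_size = HASH_SPACE // num_shards
    # Guard the last range when num_shards does not divide the space evenly.
    return min(hash_value // range_size, num_shards - 1)


def shard_load(keys, num_shards):
    """Count how many records land on each shard for a list of keys."""
    counts = [0] * num_shards
    for key in keys:
        counts[shard_for_key(key, num_shards)] += 1
    return counts


# Hot partition: every record shares one key, so one shard receives everything.
hot = shard_load(["tenant-1"] * 1000, 4)
# Distributed key: many distinct keys spread the records over all shards.
spread = shard_load([f"device-{i}" for i in range(1000)], 4)
print("hot:", hot)
print("spread:", spread)
```

With the single key, three of the four counters stay at zero, which is exactly the hot shard situation the notes warn about.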
Sequence Number: Unique identifier for each record put in a shard. Added by Kinesis after ingestion.
Kinesis Streams Limits (Producer): 1 MB/s or 1,000 messages/s written per shard.
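The write limits above translate directly into a rough shard count: whichever limit (bytes or record count) is hit first decides. The following is a back-of-the-envelope sketch only; the function name and the sample workloads are hypothetical, and it ignores read-side limits and traffic spikes.

```python
import math

WRITE_MB_PER_SHARD = 1.0        # MB/s per shard, from the limit above
WRITE_RECORDS_PER_SHARD = 1000  # records/s per shard, from the limit above


def shards_needed_for_writes(mb_per_second: float, records_per_second: float) -> int:
    """Estimate the minimum shard count that satisfies both write limits."""
    by_bytes = math.ceil(mb_per_second / WRITE_MB_PER_SHARD)
    by_count = math.ceil(records_per_second / WRITE_RECORDS_PER_SHARD)
    return max(1, by_bytes, by_count)


# 3.5 MB/s spread over 2,000 records/s: bandwidth is the bottleneck.
print(shards_needed_for_writes(3.5, 2000))   # 4
# 0.5 MB/s spread over 6,000 small records/s: record count is the bottleneck.
print(shards_needed_for_writes(0.5, 6000))   # 6
```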
Kinesis Producer Components
- Kinesis SDK
- Kinesis Producer Library (KPL)
- Kinesis Agent
- 3rd Party Libraries
Kinesis Streams Limits (Consumer Classic): 2 MB/s or 5 API calls/s per shard, shared across all consumers.
Kinesis Streams Limits (Consumer Enhanced): 2 MB/s per shard, per enhanced consumer. No API calls needed.
Kinesis Producer SDK (PutRecord(s))
- The APIs used are PutRecord (one record) and PutRecords (many records).
- PutRecords uses batching and increases throughput (fewer HTTP requests)
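Batching for PutRecords can be sketched as a small chunking step. A single PutRecords request accepts up to 500 records and up to 5 MB in total (partition keys included), so the sketch below groups entries to stay within both limits. The helper names and the sample entries are illustrative assumptions; the dictionaries use the `Data` / `PartitionKey` shape that the API expects, but no AWS call is made here.

```python
MAX_RECORDS_PER_CALL = 500            # PutRecords limit per request
MAX_BYTES_PER_CALL = 5 * 1024 * 1024  # 5 MB per request, partition keys included


def entry_size(entry: dict) -> int:
    """Size of one entry: data blob plus the UTF-8 encoded partition key."""
    return len(entry["Data"]) + len(entry["PartitionKey"].encode("utf-8"))


def batch_entries(entries):
    """Yield lists of entries that each fit in a single PutRecords request."""
    batch, batch_bytes = [], 0
    for entry in entries:
        size = entry_size(entry)
        full = len(batch) == MAX_RECORDS_PER_CALL
        too_big = batch_bytes + size > MAX_BYTES_PER_CALL
        if batch and (full or too_big):
            yield batch
            batch, batch_bytes = [], 0
        batch.append(entry)
        batch_bytes += size
    if batch:
        yield batch


# Hypothetical sample data: 1,200 small events spread over 50 partition keys.
entries = [
    {"Data": f"event-{i}".encode("utf-8"), "PartitionKey": f"user-{i % 50}"}
    for i in range(1200)
]
batches = list(batch_entries(entries))
print([len(b) for b in batches])  # [500, 500, 200]
```

With boto3, each batch could then be passed as `Records=batch` to the Kinesis client's `put_records` call together with `StreamName`. Because PutRecords can fail for only some records in a request, the `FailedRecordCount` field of the response is worth checking before moving on.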
Kinesis Producer SDK (Use Cases)
- Low throughput
- Higher latency
- Simple API
- AWS Lambda
Kinesis Data Streams (Managed AWS Sources)
- CloudWatch Logs
- AWS IoT
- Kinesis Data Analytics
ProvisionedThroughputExceeded Exceptions
Problem: Happens when sending more data than a shard allows (exceeding the MB/s or records/s limit for any shard). Make sure you don't have a hot shard (for example, a poorly chosen partition key that sends too much data to one partition).
Solution:
- Retries with backoff
- Increase shards (scaling)
- Ensure partition key is distributed
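The first remedy, retries with backoff, can be sketched as follows. This is a generic exponential backoff with jitter, not code from any AWS library: `ThroughputExceeded`, `send_with_backoff`, and `flaky_send` are hypothetical names, and the `sleep` parameter is injectable only so the demo runs instantly.

```python
import random
import time


class ThroughputExceeded(Exception):
    """Hypothetical stand-in for the throttling error raised by a client."""


def send_with_backoff(send, record, max_attempts=5, base_delay=0.1,
                      max_delay=5.0, sleep=time.sleep):
    """Call send(record), retrying throttled calls with exponential backoff."""
    for attempt in range(max_attempts):
        try:
            return send(record)
        except ThroughputExceeded:
            if attempt == max_attempts - 1:
                raise  # give up after the last attempt
            # The delay ceiling doubles each attempt; jitter spreads retries out.
            ceiling = min(max_delay, base_delay * 2 ** attempt)
            sleep(random.uniform(0, ceiling))


# Demo: a fake sender that is throttled twice and then succeeds.
calls = {"n": 0}


def flaky_send(record):
    calls["n"] += 1
    if calls["n"] <= 2:
        raise ThroughputExceeded()
    return "ok"


delays = []  # record the delays instead of actually sleeping
result = send_with_backoff(flaky_send, b"payload", sleep=delays.append)
print(result, calls["n"], len(delays))  # ok 3 2
```

Backoff only smooths out short bursts. If a shard is throttled constantly, adding shards or fixing the partition key is the real fix.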
Kinesis Producer Library (KPL)
- Easy-to-use, highly configurable C++/Java library
- Used for building high performance, long-running producers
- Automated and configurable retry mechanism
- Synchronous or Asynchronous API
- Submits metrics to CloudWatch for monitoring
- Batching - increase throughput, decrease cost
- Compression must be implemented by the user
- KPL records must be decoded with the KCL or a special helper library
Kinesis Agent
- Monitors log files and sends them to Kinesis Data Streams
- Java-based agent, built on top of KPL
- Installs on Linux-based server environments
- Reads from multiple directories and writes to multiple streams
- Routing based on directory / log file
- Pre-processes data before sending to streams
- The agent handles file rotation, checkpointing, and retries upon failures
- Emits metrics to CloudWatch for monitoring
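The routing and pre-processing points above are driven by the agent's JSON configuration file (by default `/etc/aws-kinesis/agent.json`). As a rough sketch, the snippet below builds such a configuration in Python and prints it as JSON. The file patterns and stream names are made-up examples, and the option values should be checked against the agent documentation before use.

```python
import json

# Hypothetical configuration: two flows, each routing a file pattern to a stream.
agent_config = {
    "cloudwatch.emitMetrics": True,  # emit agent metrics to CloudWatch
    "flows": [
        {
            # Apache access logs, converted to JSON before being sent.
            "filePattern": "/var/log/app/access.log*",
            "kinesisStream": "access-log-stream",
            "dataProcessingOptions": [
                {"optionName": "LOGTOJSON", "logFormat": "COMMONAPACHELOG"}
            ],
        },
        {
            # Files from a second directory go to a different stream.
            "filePattern": "/var/log/orders/*.csv",
            "kinesisStream": "orders-stream",
            "partitionKeyOption": "RANDOM",
        },
    ],
}

config_text = json.dumps(agent_config, indent=2)
print(config_text)
```

Each entry in `flows` pairs one file pattern with one destination stream, which is how a single agent can read from several directories and write to several streams.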
Kinesis Consumers Classic Components
- Kinesis SDK
- Kinesis Client Library
- Kinesis Connector Library
- 3rd party libraries
- Kinesis Firehose
- AWS Lambda
Kinesis Consumer Classic SDK (GetRecords)
- Records are polled by consumers from a shard
- Each shard has 2 MB/s total aggregate read throughput
- GetRecords returns up to 10 MB of data per call (then throttles for 5 seconds) or up to 10,000 records per call
- Maximum of 5 GetRecords API calls per shard per second, which means about 200 ms latency
- If 5 consumer applications read from the same shard, each consumer can poll about once per second and receives less than 400 KB/s
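The arithmetic behind the last two bullets can be written out in a few lines. This sketch assumes the classic (shared) limits of 2 MB/s and 5 GetRecords calls per second per shard are split evenly between consumers, and uses 1 MB = 1,000 KB to match the 400 KB/s figure in the notes; the function name is hypothetical.

```python
SHARD_READ_KB_PER_SECOND = 2000   # 2 MB/s per shard, shared by classic consumers
GET_RECORDS_CALLS_PER_SECOND = 5  # GetRecords calls per shard per second


def classic_consumer_share(num_consumers: int):
    """Return (KB/s per consumer, milliseconds between polls per consumer)."""
    throughput_kb = SHARD_READ_KB_PER_SECOND / num_consumers
    polls_per_second = GET_RECORDS_CALLS_PER_SECOND / num_consumers
    poll_interval_ms = 1000 / polls_per_second
    return throughput_kb, poll_interval_ms


for n in (1, 2, 5):
    kb, ms = classic_consumer_share(n)
    print(f"{n} consumer(s): up to {kb:.0f} KB/s each, one poll every {ms:.0f} ms")
```

With enhanced fan-out the division does not apply, because each enhanced consumer gets its own 2 MB/s per shard.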
Kinesis Client Library
To be continued (still being written)
References:
- https://quizlet.com/search?query=aws-data-analytics-specialty&type=sets&useOriginal=
