Ebook
- AWS Certified Data Analytics Study Guide DAS-C01 Exam
- Big_Data_on_AWS_3.8_(EN)_Student_Guide.pdf
- Big_Data_on_AWS_3.8_(EN)_Lab_Guide.pdf
Cheatsheet
- Redshift Spectrum: A Redshift feature that allows users to run queries against data stored on Amazon S3.
- Geospatial chart: A QuickSight visual type that is best suited for displaying different data values across a geographical map.
- Amazon Kinesis: Data streams that can collect and process large streams of data records in real time.
- Amazon Macie: A fully managed data security and data privacy service that uses machine learning and pattern matching to discover and protect your sensitive data in AWS.
- Amazon Athena: An interactive query service that makes it easy to analyze data in Amazon S3 using standard SQL.
- True or False: A single COPY command is faster than multiple COPY commands when loading one Redshift table from multiple files. (Answer: True)
- AWS Glue: A serverless data integration service that makes it easy to discover, prepare, and combine data for analytics, machine learning, and application development.
- Per-query limit: A form of cost control in Amazon Athena that limits the amount of data scanned per query.
- Amazon S3: The AWS service used as a reference data source for a Kinesis Data Analytics application.
- Amazon S3 Glacier: A secure, durable, and extremely low-cost Amazon S3 storage class for data archiving and long-term backup.
Types of Data Collection
- Real Time
- Near Real Time
- Batch
Real Time Data Collection Services
- Kinesis Data Streams
- Simple Queue Service
- Internet of Things
Near Real Time Collection Services
- Kinesis Data Firehose
- Database Migration Service
Batch Collection Services
- Snowball
- Data Pipeline
Kinesis Services
- Kinesis Streams
- Kinesis Analytics
- Kinesis Firehose
Kinesis Streams
- Streams are divided into ordered shards
- Data retention is 24 hours by default, 7 days maximum (can be raised via the API; see the sketch after this list)
- Ability to reprocess data within the data retention period
- Once data is inserted, it cannot be deleted
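A minimal boto3 sketch (stream name and target retention are placeholder assumptions) that raises retention from the 24-hour default so consumers can reprocess older records:

```python
import boto3

kinesis = boto3.client("kinesis")

# Extend retention beyond the 24-hour default so data can be reprocessed
# within the retention window (stream name is hypothetical).
kinesis.increase_stream_retention_period(
    StreamName="example-stream",
    RetentionPeriodHours=168,  # 7 days
)

# Confirm the new retention setting.
summary = kinesis.describe_stream_summary(StreamName="example-stream")
print(summary["StreamDescriptionSummary"]["RetentionPeriodHours"])
```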
Kinesis Streams Records
- Data Blob
- Partition Key (record key)
- Sequence Number
Data Blob: The data being sent, serialized as bytes. Up to 1MB. Can represent anything
Partition Key
- Sent alongside a record; groups records into shards. Same key = same shard
- Use a highly distributed key to avoid the “hot partition” problem
Sequence Number: Unique identifier for each record put in a shard. Added by Kinesis after ingestion (see the sketch below)
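A minimal boto3 sketch (stream name, region, and payload are placeholder assumptions) showing how the three record fields fit together: the producer supplies the data blob and partition key, and Kinesis returns the sequence number after ingestion.

```python
import boto3

kinesis = boto3.client("kinesis", region_name="us-east-1")  # region is an assumption

# Send one record: the data blob is bytes, the partition key decides the shard,
# and the response carries the sequence number Kinesis assigned.
response = kinesis.put_record(
    StreamName="example-stream",               # hypothetical stream name
    Data=b'{"sensor_id": "42", "temp": 21.5}',  # data blob (up to 1MB)
    PartitionKey="sensor-42",                   # same key -> same shard
)
print(response["ShardId"], response["SequenceNumber"])
```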
Kinesis Streams Limits (Producer): 1MB/s or 1000 records/s write throughput per shard
Kinesis Producer Components
- Kinesis SDK
- Kinesis Producer Library (KPL)
- Kinesis Agent
- 3rd Party Libraries
Kinesis Streams Limits (Consumer Classic): 2MB/s or 5 GetRecords API calls/s per shard, shared across all consumers
Kinesis Streams Limits (Consumer Enhanced): 2MB/s per shard, per enhanced consumer. No GetRecords polling needed (records are pushed via SubscribeToShard; see the sketch below)
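Enhanced fan-out requires registering a consumer against the stream. A minimal boto3 sketch (the stream ARN and consumer name are placeholders); once registered, the consumer receives records pushed over HTTP/2 via SubscribeToShard instead of sharing GetRecords polling.

```python
import boto3

kinesis = boto3.client("kinesis")

# Register an enhanced fan-out consumer: each registered consumer gets its
# own 2MB/s per shard, pushed to it rather than polled.
consumer = kinesis.register_stream_consumer(
    StreamARN="arn:aws:kinesis:us-east-1:111122223333:stream/example-stream",
    ConsumerName="example-enhanced-consumer",
)["Consumer"]

print(consumer["ConsumerARN"], consumer["ConsumerStatus"])
```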
Kinesis Producer SDK (PutRecord(s))
- The APIs used are PutRecord (one record) and PutRecords (many records)
- PutRecords uses batching and increases throughput (fewer HTTP requests); see the sketch below
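A minimal PutRecords sketch with boto3 (stream name and event payloads are placeholder assumptions). Note that PutRecords is not atomic, so FailedRecordCount should be checked.

```python
import json
import boto3

kinesis = boto3.client("kinesis")

# Batch several records into one PutRecords call: fewer HTTP requests and
# higher throughput than calling PutRecord once per record.
events = [{"device": f"d-{i}", "value": i} for i in range(10)]
response = kinesis.put_records(
    StreamName="example-stream",  # hypothetical stream name
    Records=[
        {"Data": json.dumps(e).encode("utf-8"), "PartitionKey": e["device"]}
        for e in events
    ],
)

# PutRecords is not all-or-nothing: inspect failures and retry them if needed.
print("failed:", response["FailedRecordCount"])
```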
Kinesis Producer SDK (Use Cases)
- Low throughput
- Higher latency
- Simple API
- AWS Lambda
Kinesis Data Streams (Managed AWS Sources)
- CloudWatch Logs
- AWS IoT
- Kinesis Data Analytics
ProvisionedThroughputExceeded Exceptions
Problem: thrown when writes exceed a shard's limit (MB/s or records/s for any single shard), often because of a hot shard (a poorly distributed partition key sends too much data to one shard).
Solutions:
- Retries with exponential backoff (see the sketch below)
- Increase the number of shards (scaling)
- Ensure the partition key is well distributed
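A minimal sketch of the retry-with-backoff solution using boto3 (stream name, attempt count, and backoff timings are placeholder assumptions).

```python
import random
import time

import boto3

kinesis = boto3.client("kinesis")


def put_with_backoff(data: bytes, partition_key: str, max_attempts: int = 5):
    """Retry PutRecord with exponential backoff and jitter when the shard
    throughput limit is exceeded (ProvisionedThroughputExceededException)."""
    for attempt in range(max_attempts):
        try:
            return kinesis.put_record(
                StreamName="example-stream",  # hypothetical stream name
                Data=data,
                PartitionKey=partition_key,
            )
        except kinesis.exceptions.ProvisionedThroughputExceededException:
            # Back off: 0.1s, 0.2s, 0.4s, ... plus random jitter.
            time.sleep((2 ** attempt) * 0.1 + random.random() * 0.1)
    raise RuntimeError("record not ingested after retries")
```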
Kinesis Producer Library (KPL)
- Easy-to-use, highly configurable C++/Java library
- Used for building high performance, long-running producers
- Automated and configurable retry mechanism
- Synchronous or Asynchronous API
- Submits metrics to CloudWatch for monitoring
- Batching increases throughput and decreases cost
- Compression must be implemented by the user
- KPL records must be decoded with the KCL or a special helper library
Kinesis Agent
- Monitors log files and sends them to Kinesis Data Streams
- Java-based agent, built on top of the KPL
- Installed on Linux-based server environments
- Can watch multiple directories and write to multiple streams
- Routing based on directory / log file
- Can pre-process data before sending it to streams
- The agent handles file rotation, checkpointing, and retry upon failure
- Emits metrics to CloudWatch for monitoring
Kinesis Consumers Classic Components
- Kinesis SDK
- Kinesis Client Library
- Kinesis Connector Library
- 3rd party libraries
- Kinesis Firehose
- AWS Lambda
Kinesis Consumer Classic SDK (GetRecords)
- Records are polled by consumers from a shard
- Each shard has 2MB total aggregate throughput
- GetRecords returns up to 10MB of data per call (then throttled for 5 seconds) or up to 10,000 records per call
- Maximum of 5 GetRecords API calls per shard per second = 200ms latency
- If 5 consumer applications consume from the same shard, each consumer can poll once per second and receives less than 400KB/s (see the polling sketch below)
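A minimal classic-consumer polling sketch with boto3 (stream name and polling interval are placeholder assumptions), reading one shard with GetShardIterator/GetRecords while staying under the per-shard call limit.

```python
import time

import boto3

kinesis = boto3.client("kinesis")
stream = "example-stream"  # hypothetical stream name

# Classic consumer: poll one shard and share its 2MB/s with other consumers.
shard_id = kinesis.list_shards(StreamName=stream)["Shards"][0]["ShardId"]
iterator = kinesis.get_shard_iterator(
    StreamName=stream,
    ShardId=shard_id,
    ShardIteratorType="TRIM_HORIZON",  # start from the oldest available record
)["ShardIterator"]

while iterator:
    batch = kinesis.get_records(ShardIterator=iterator, Limit=1000)
    for record in batch["Records"]:
        print(record["SequenceNumber"], record["Data"])
    iterator = batch["NextShardIterator"]
    time.sleep(1)  # stay within the 5 GetRecords calls/s per-shard limit
```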
Kinesis Client Library
To be continued (still being written)
References:
- https://quizlet.com/search?query=aws-data-analytics-specialty&type=sets&useOriginal=