The difference between the current time and when the last record of the GetRecords call was written to the stream
Used to track the progress of the consumer (its read position)
If we have IteratorAgeMilliseconds = 0, then the records that are being read are completely caught up with the producer
IteratorAgeMilliseconds > 0 means we are not processing the records fast enough (the consumer is falling behind the producer)
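A minimal sketch of how this age is derived (a hypothetical helper for illustration, not an AWS API; in practice IteratorAgeMilliseconds is reported by CloudWatch):

```python
from datetime import datetime, timezone, timedelta

def iterator_age_ms(last_record_written: datetime, now: datetime) -> float:
    """Age = current time minus the write time of the last record returned
    by GetRecords. 0 means the consumer is fully caught up."""
    return max(0.0, (now - last_record_written).total_seconds() * 1000)

now = datetime(2024, 1, 1, 12, 0, 0, tzinfo=timezone.utc)
caught_up = iterator_age_ms(now, now)                        # 0.0 -> caught up
lagging = iterator_age_ms(now - timedelta(seconds=30), now)  # 30000.0 -> 30 s behind
```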
Kinesis Client Library - KCL
It is a Java library
It uses DynamoDB to checkpoint offsets and to track other workers and share the work amongst them
Great for reading in a distributed manner
We cannot have more KCL application instances than shards (example: in case of 6 shards, we can have up to 6 application consumers)
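The "no more consumers than shards" rule follows from each shard being leased to exactly one worker. A toy round-robin assignment to build intuition (this is NOT the real KCL lease algorithm, which coordinates workers through DynamoDB):

```python
def assign_shards(shards, workers):
    """Round-robin each shard to exactly one worker; with more workers
    than shards, the extra workers receive no shards and sit idle."""
    assignment = {w: [] for w in workers}
    for i, shard in enumerate(shards):
        assignment[workers[i % len(workers)]].append(shard)
    return assignment

shards = [f"shardId-{i:03d}" for i in range(6)]
# 8 workers for 6 shards: 2 workers end up with nothing to do
assignment = assign_shards(shards, [f"worker-{i}" for i in range(8)])
idle = [w for w, s in assignment.items() if not s]
```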
Amazon Kinesis Firehose
Fully managed service, no administration required
Offers Near Real Time performance:
Buffer interval: from 0 seconds (no buffering) up to 900 seconds
Buffer size: minimum 1 MB, can be configured higher
We can load data into AWS destinations (S3, Redshift - COPY through S3, OpenSearch), into 3rd party destinations (such as Splunk, Datadog, New Relic, MongoDB, etc.) or into a custom destination using an HTTP endpoint
Offers automatic scaling
Offers data transformation ability through AWS Lambda (ex: CSV => JSON)
Supports compression when target is S3 (GZIP, ZIP and SNAPPY)
We pay for the amount of data going through Firehose
Data is written in batches to the destinations
We can send failed or all data into a backup S3 bucket
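The CSV => JSON transformation mentioned above is implemented as a Lambda function that receives base64-encoded records and must return each record with a result status (the recordId/data/result fields follow the Firehose transformation contract; the 2-column CSV layout here is an assumption for illustration):

```python
import base64
import json

def lambda_handler(event, context):
    """Firehose data-transformation Lambda: decode each CSV record,
    re-emit it as JSON, and mark it 'Ok' so Firehose delivers it."""
    output = []
    for record in event["records"]:
        csv_line = base64.b64decode(record["data"]).decode("utf-8").strip()
        user_id, amount = csv_line.split(",")  # assumed 2-column CSV
        payload = json.dumps({"user_id": user_id, "amount": float(amount)}) + "\n"
        output.append({
            "recordId": record["recordId"],
            "result": "Ok",  # alternatives: 'Dropped', 'ProcessingFailed'
            "data": base64.b64encode(payload.encode("utf-8")).decode("utf-8"),
        })
    return {"records": output}

event = {"records": [{"recordId": "1", "data": base64.b64encode(b"alice,9.99").decode()}]}
result = lambda_handler(event, None)
```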
Kinesis Data Streams vs Firehose
Kinesis Data Streams:
Requires writing custom code (producer/consumer)
Offers real time (~200 ms latency for classic) performance
We must manage scaling (shard splitting/merging)
Data storage for 1 to 365 days, data can be replayed, streams can have multiple consumers
Kinesis Firehose:
Fully managed, we can send data to S3, Redshift, OpenSearch / 3rd party / custom HTTP endpoints
Serverless data transformation with Lambda
Offers near real time performance
Automatically scales
Offers no data storage, does not support replay capability
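Firehose's near-real-time behaviour comes from its buffering rules: a batch is flushed when either the size or the interval threshold is hit, whichever comes first. A sketch of that condition (illustrative only; the default values here are assumptions, both thresholds are configurable):

```python
def should_flush(buffered_bytes: int, seconds_since_last_flush: float,
                 size_limit_mb: int = 1, interval_s: int = 60) -> bool:
    """Flush when the buffer reaches size_limit_mb OR interval_s elapses,
    whichever happens first (interval configurable from 0 to 900 s)."""
    size_hit = buffered_bytes >= size_limit_mb * 1024 * 1024
    time_hit = seconds_since_last_flush >= interval_s
    return size_hit or time_hit

small_and_recent = should_flush(10_000, 5)     # neither threshold hit
big_batch = should_flush(2 * 1024 * 1024, 5)   # size threshold hit
stale_batch = should_flush(10_000, 120)        # interval threshold hit
```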
Kinesis Data Analytics for SQL Applications
We can perform real-time data analytics on Kinesis Streams using SQL
We can add reference data from Amazon S3 to enrich streaming data
Kinesis Data Analytics is a managed service, has auto scaling capabilities by default
We can create streams out of real-time queries
Payment: we pay for actual consumption rate
Data can be sent into 2 destinations:
Kinesis Data Streams
Kinesis Data Firehose
Use cases:
Time-series analytics
Real-time dashboards
Real-time metrics
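Real-time metrics like those above are typically computed over tumbling time windows. A toy Python sketch of a per-window event count (illustrative; in Kinesis Data Analytics this would be a windowed SQL query over the stream):

```python
from collections import Counter

def tumbling_window_counts(events, window_s=10):
    """Count events per tumbling window: each (timestamp, value) event
    falls into exactly one non-overlapping window of window_s seconds."""
    counts = Counter()
    for ts, _value in events:
        window_start = (ts // window_s) * window_s
        counts[window_start] += 1
    return dict(counts)

events = [(1, "a"), (3, "b"), (12, "c"), (19, "d"), (25, "e")]
counts = tumbling_window_counts(events)  # windows starting at t=0, 10, 20
```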
Kinesis Data Analytics for Apache Flink
Note: This service has been renamed to Amazon Managed Service for Apache Flink!
We can use Flink (Java, Scala or SQL) to process and analyze streaming data
We can read data from 2 data sources:
Kinesis Data Streams
Amazon MSK
With this service we run any Apache Flink application on a managed cluster on AWS:
We get automatic provisioning of resources, parallel computation and automatic scaling
We get application backups: implemented as checkpoints and snapshots
Flink is a lot more powerful than a simple SQL query language
Flink does not support reading data from Firehose (use Kinesis Analytics for SQL instead)
Machine Learning on Kinesis Data Analytics
Kinesis Data Analytics provides 2 machine learning algorithms:
RANDOM_CUT_FOREST:
It is a SQL function used for anomaly detection on numeric columns in a stream
Uses the recent history to compute and train the model
HOTSPOTS:
Locates and returns information about relatively dense regions in our data
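To build intuition for what an anomaly score conveys, here is a toy z-score detector over recent history (this is NOT the actual RANDOM_CUT_FOREST algorithm, only an illustration of scoring a new value against recent history):

```python
from statistics import mean, stdev

def anomaly_score(history, value):
    """Score a new value against recent history: the further it sits
    from the recent mean (in standard deviations), the higher the score."""
    mu, sigma = mean(history), stdev(history)
    return abs(value - mu) / sigma if sigma else 0.0

history = [10, 11, 9, 10, 12, 10, 11]  # recent, mostly stable values
normal = anomaly_score(history, 10)    # close to the mean -> low score
spike = anomaly_score(history, 100)    # far from the mean -> high score
```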