The difference between the current time and when the last record of the GetRecords call was written to the stream
Used to track the progress of the consumer (its read position)
If we have IteratorAgeMilliseconds = 0, then the records that are being read are completely caught up with the producer
IteratorAgeMilliseconds > 0 means we are not processing the records fast enough (the consumer is falling behind the producer)
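A minimal sketch of how this age is derived (a hypothetical helper for illustration, not an AWS API; in practice IteratorAgeMilliseconds is reported by CloudWatch):

```python
from datetime import datetime, timezone, timedelta

def iterator_age_ms(last_record_written: datetime, now: datetime) -> float:
    """Age = current time minus the write time of the last record returned
    by GetRecords. 0 means the consumer is fully caught up."""
    return max(0.0, (now - last_record_written).total_seconds() * 1000)

now = datetime(2024, 1, 1, 12, 0, 0, tzinfo=timezone.utc)
caught_up = iterator_age_ms(now, now)                        # 0.0 -> caught up
lagging = iterator_age_ms(now - timedelta(seconds=30), now)  # 30000.0 -> 30 s behind
```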
Kinesis Client Library - KCL
It is a Java library
It uses DynamoDB to checkpoint offsets and to track other workers and share the work amongst them
Great for reading in a distributed manner
We cannot have more KCL application instances than shards (example: in case of 6 shards, we can have up to 6 application consumers)
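The "no more consumers than shards" rule follows from each shard being leased to exactly one worker. A toy round-robin assignment to build intuition (this is NOT the real KCL lease algorithm, which coordinates workers through DynamoDB):

```python
def assign_shards(shards, workers):
    """Round-robin each shard to exactly one worker; with more workers
    than shards, the extra workers receive no shards and sit idle."""
    assignment = {w: [] for w in workers}
    for i, shard in enumerate(shards):
        assignment[workers[i % len(workers)]].append(shard)
    return assignment

shards = [f"shardId-{i:03d}" for i in range(6)]
# 8 workers for 6 shards: 2 workers end up with nothing to do
assignment = assign_shards(shards, [f"worker-{i}" for i in range(8)])
idle = [w for w, s in assignment.items() if not s]
```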
Amazon Kinesis Firehose
Fully managed service, no administration required
Offers Near Real Time performance:
Buffer interval: from 0 seconds (no buffering) up to 900 seconds
Buffer size: minimum 1 MB, can be configured higher
We can load data into AWS destinations (S3, Redshift - COPY through S3, OpenSearch), into 3rd party destinations (such as Splunk, Datadog, New Relic, MongoDB, etc.) or into a custom destination using an HTTP endpoint
Offers automatic scaling
Offers data transformation ability through AWS Lambda (ex: CSV => JSON)
Supports compression when target is S3 (GZIP, ZIP and SNAPPY)
We pay for the amount of data going through Firehose
Data is written in batches to the destinations
We can send failed or all data into a backup S3 bucket
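The CSV => JSON transformation mentioned above is implemented as a Lambda function that receives base64-encoded records and must return each record with a result status (the recordId/data/result fields follow the Firehose transformation contract; the 2-column CSV layout here is an assumption for illustration):

```python
import base64
import json

def lambda_handler(event, context):
    """Firehose data-transformation Lambda: decode each CSV record,
    re-emit it as JSON, and mark it 'Ok' so Firehose delivers it."""
    output = []
    for record in event["records"]:
        csv_line = base64.b64decode(record["data"]).decode("utf-8").strip()
        user_id, amount = csv_line.split(",")  # assumed 2-column CSV
        payload = json.dumps({"user_id": user_id, "amount": float(amount)}) + "\n"
        output.append({
            "recordId": record["recordId"],
            "result": "Ok",  # alternatives: 'Dropped', 'ProcessingFailed'
            "data": base64.b64encode(payload.encode("utf-8")).decode("utf-8"),
        })
    return {"records": output}

event = {"records": [{"recordId": "1", "data": base64.b64encode(b"alice,9.99").decode()}]}
result = lambda_handler(event, None)
```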
Kinesis Data Streams vs Firehose
Kinesis Data Streams:
Requires writing custom code (producer/consumer)
Offers real time (~200 ms latency for classic) performance
We must manage scaling (shard splitting/merging)
Data storage for 1 to 365 days, data can be replayed, streams can have multiple consumers
Kinesis Firehose:
Fully managed, we can send data to S3, Redshift, OpenSearch / 3rd party / custom HTTP endpoints
Serverless data transformation with Lambda
Offers near real time performance
Automatically scales
Offers no data storage, does not support replay capability
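Firehose's near-real-time behaviour comes from its buffering rules: a batch is flushed when either the size or the interval threshold is hit, whichever comes first. A sketch of that condition (illustrative only; the default values here are assumptions, both thresholds are configurable):

```python
def should_flush(buffered_bytes: int, seconds_since_last_flush: float,
                 size_limit_mb: int = 1, interval_s: int = 60) -> bool:
    """Flush when the buffer reaches size_limit_mb OR interval_s elapses,
    whichever happens first (interval configurable from 0 to 900 s)."""
    size_hit = buffered_bytes >= size_limit_mb * 1024 * 1024
    time_hit = seconds_since_last_flush >= interval_s
    return size_hit or time_hit

small_and_recent = should_flush(10_000, 5)     # neither threshold hit
big_batch = should_flush(2 * 1024 * 1024, 5)   # size threshold hit
stale_batch = should_flush(10_000, 120)        # interval threshold hit
```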
Kinesis Data Analytics for SQL Applications
We can perform real-time data analytics on Kinesis Streams using SQL
We can add reference data from Amazon S3 to enrich streaming data
Kinesis Data Analytics is a managed service, has auto scaling capabilities by default
We can create streams out of real-time queries
Payment: we pay for actual consumption rate
Data can be sent into 2 destinations:
Kinesis Data Streams
Kinesis Data Firehose
Use cases:
Time-series analytics
Real-time dashboards
Real-time metrics
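Real-time metrics like those above are typically computed over tumbling time windows. A toy Python sketch of a per-window event count (illustrative; in Kinesis Data Analytics this would be a windowed SQL query over the stream):

```python
from collections import Counter

def tumbling_window_counts(events, window_s=10):
    """Count events per tumbling window: each (timestamp, value) event
    falls into exactly one non-overlapping window of window_s seconds."""
    counts = Counter()
    for ts, _value in events:
        window_start = (ts // window_s) * window_s
        counts[window_start] += 1
    return dict(counts)

events = [(1, "a"), (3, "b"), (12, "c"), (19, "d"), (25, "e")]
counts = tumbling_window_counts(events)  # windows starting at t=0, 10, 20
```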
Kinesis Data Analytics for Apache Flink
Note: This service has been renamed to Amazon Managed Service for Apache Flink!
We can use Flink (Java, Scala or SQL) to process and analyze streaming data
We can read data from 2 data sources:
Kinesis Data Streams
Amazon MSK
With this service we run any Apache Flink application on a managed cluster on AWS:
We get automatic provisioning of resources, parallel computation and automatic scaling
We get application backups: implemented as checkpoints and snapshots
Flink is a lot more powerful than a simple SQL query language
Flink does not support reading data from Firehose (use Kinesis Analytics for SQL instead)
Machine Learning on Kinesis Data Analytics
Kinesis Data Analytics provides 2 machine learning algorithms:
RANDOM_CUT_FOREST:
It is a SQL function used for anomaly detection on numeric columns in a stream
Uses the recent history to compute and train the model
HOTSPOTS:
Locates and returns information about relatively dense regions in our data
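To build intuition for what an anomaly score conveys, here is a toy z-score detector over recent history (this is NOT the actual RANDOM_CUT_FOREST algorithm, only an illustration of scoring a new value against recent history):

```python
from statistics import mean, stdev

def anomaly_score(history, value):
    """Score a new value against recent history: the further it sits
    from the recent mean (in standard deviations), the higher the score."""
    mu, sigma = mean(history), stdev(history)
    return abs(value - mu) / sigma if sigma else 0.0

history = [10, 11, 9, 10, 12, 10, 11]  # recent, mostly stable values
normal = anomaly_score(history, 10)    # close to the mean -> low score
spike = anomaly_score(history, 100)    # far from the mean -> high score
```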