There is no concept of directories within buckets, although the UI can trick us into thinking otherwise
The keys can be very long names which contain slashes (“/”)
The object value is the content of the body
Max object size is 5TB
If uploading more than 5 GB, we must use multi-part upload
Each object in S3 can have metadata, tags and (optional) version ID
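A minimal CLI sketch of an upload with metadata and tags (bucket, key and values here are just placeholders; the slash in the key illustrates the "directory-like" naming):
aws s3api put-object --bucket my-bucket --key folder1/report.txt --body report.txt \
    --metadata project=notes --tagging "env=dev"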
S3 Versioning
Versioning should be enabled at the bucket level
Versioning means: objects uploaded with same key will increment the version
It is best practice to version files in S3, because:
It will protect against unintended deletes (ability to restore version)
Provides the ability to rollback a file
Notes:
Any file that is not versioned prior to enabling versioning will have the version "null"
Suspending versioning does not delete the previous versions of existing files
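Versioning can be enabled from the console or the CLI; a sketch (bucket name is a placeholder), where suspending later would use Status=Suspended:
aws s3api put-bucket-versioning --bucket my-bucket --versioning-configuration Status=Enabled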
S3 Encryption for Objects
There are 4 methods of encrypting objects in S3:
SSE-S3: encrypts S3 objects using keys handled and managed by AWS
SSE-KMS: leverages AWS KMS to manage encryption keys
SSE-C: keys are managed by the users
Client Side Encryption
SSE-S3:
Encryption using keys handled and managed by AWS
Objects are encrypted server side
Uses AES-256 encryption
In order to upload and set the SSE-S3 encryption, the upload request must have a header "x-amz-server-side-encryption": "AES256"
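A CLI sketch that sets this header for us (bucket and key are placeholders):
aws s3api put-object --bucket my-bucket --key report.txt --body report.txt --server-side-encryption AES256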
SSE-KMS
Encryption using keys handled and managed by KMS
KMS advantages: user control + audit trail
Objects are encrypted on the server side
In order to upload and set the SSE-KMS encryption, the upload request must have a header "x-amz-server-side-encryption": "aws:kms"
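A CLI sketch for SSE-KMS (the key ID is a placeholder; if --ssekms-key-id is omitted, the default aws/s3 key is used):
aws s3api put-object --bucket my-bucket --key report.txt --body report.txt \
    --server-side-encryption aws:kms --ssekms-key-id <kms-key-id>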
SSE-C
Server-side encryption using the keys provided by the customers
S3 does not store the encryption key the customers provide
To upload the data, HTTPS must be used
Encryption key must be provided in HTTP headers, for every HTTP request
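A sketch with the high-level CLI, assuming a locally generated 256-bit key stored in sse-c.key (the key file name is just an example):
openssl rand 32 > sse-c.key
aws s3 cp report.txt s3://my-bucket/report.txt --sse-c AES256 --sse-c-key fileb://sse-c.key
The same key must be supplied again with --sse-c/--sse-c-key when downloading the object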
Client Side Encryption
Files are encrypted by the customers before uploading them to S3
Client-side libraries can be used to accomplish this, for example the Amazon S3 Encryption Client
Customers are also responsible for decrypting the data themselves when retrieving it
Encryption keys are fully managed by the customers
S3 Security
User based security:
IAM policies - which API calls should be allowed for a specific user from IAM console
Resource based:
Bucket policies - bucket-wide rules set from the S3 console, allow cross-account access
Object Access Control List
Bucket Access Control List
An IAM principal can access an S3 object if:
the user IAM permissions allow it or the resource policy allows it
there is no explicit DENY
S3 Bucket Policies
JSON based documents, which can contain:
Resources: buckets and objects
Actions: set of APIs to Allow or Deny
Effects: Allow or Deny
Principal: the account or user to apply the policy to
Use an S3 bucket policy to:
Grant public access to the bucket
Force objects to be encrypted at upload
Grant access to another account (cross-account)
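A minimal public-read policy sketch and the command to attach it (bucket name is a placeholder):
cat > policy.json << 'EOF'
{
  "Version": "2012-10-17",
  "Statement": [{
    "Sid": "PublicRead",
    "Effect": "Allow",
    "Principal": "*",
    "Action": "s3:GetObject",
    "Resource": "arn:aws:s3:::my-bucket/*"
  }]
}
EOF
aws s3api put-bucket-policy --bucket my-bucket --policy file://policy.json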
Bucket Settings for Block Public Access
Block public access to buckets and objects granted through
new access control lists (ACLs)
any access control lists (ACLs)
new public bucket or access point policies
Block public and cross-account access to buckets and objects through any public or access point policies
These settings were created to prevent company data leaks
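A CLI sketch that turns on all four settings for one bucket (name is a placeholder):
aws s3api put-public-access-block --bucket my-bucket \
    --public-access-block-configuration BlockPublicAcls=true,IgnorePublicAcls=true,BlockPublicPolicy=true,RestrictPublicBuckets=true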
ACL (Access Control List): define read and write permissions at the object level
S3 Security - Other
Networking: supports VPC Endpoints
Logging and audit:
S3 Access Logs can be stored in other S3 buckets
API calls can be logged in AWS CloudTrail
User security:
MFA Delete
Pre-Signed URLs: URLs that are valid only for a limited time
S3 Websites
S3 can host static websites and have them accessible on the web
The website URL will be something like this:
s3-website dash (-) Region: http://bucket-name.s3-website-Region.amazonaws.com
s3-website dot (.) Region: http://bucket-name.s3-website.Region.amazonaws.com
If we forget to allow public reads in bucket policies, we will get a 403 - Forbidden when accessing the website
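A CLI sketch of enabling website hosting (bucket and document names are placeholders; public reads must still be allowed by the bucket policy):
aws s3 website s3://my-bucket/ --index-document index.html --error-document error.html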
CORS
An origin is a scheme (protocol), host (domain) and a port
CORS: Cross-Origin Resource Sharing
Web browser based mechanism to allow requests to other origins while visiting the main origin
The request won't be fulfilled unless the other origin allows for the request, using CORS headers (ex: Access-Control-Allow-Origin, Access-Control-Allow-Methods)
If a client does a cross-origin request on our S3 bucket, we need to enable the correct CORS headers
We can allow for a specific origin or for * (all origins)
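A sketch of a CORS configuration allowing GETs from a single origin (origin and bucket name are placeholders):
cat > cors.json << 'EOF'
{
  "CORSRules": [{
    "AllowedOrigins": ["https://www.example.com"],
    "AllowedMethods": ["GET"],
    "AllowedHeaders": ["*"],
    "MaxAgeSeconds": 3000
  }]
}
EOF
aws s3api put-bucket-cors --bucket my-bucket --cors-configuration file://cors.json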
S3 Consistency Model
S3 provides read after write consistency for PUTS of new objects
As soon as a new object is written, we can retrieve it (PUT 200 => GET 200)
This is true, except if we did a GET before to see if the object existed (GET 404 => PUT 200 => GET 404)
S3 provides eventual consistency for DELETES and PUTS of existing objects
If we read an object after updating it, we might get the older version (PUT 200 => PUT 200 => GET 200 (which might return the older version))
If we delete an object, we might still be able to read it for a short amount of time (DELETE 200 => GET 200)
There is no way to request strong consistency for S3!
S3 Advanced Versioning
S3 Versioning creates a new version each time a file is changed
That includes when we encrypt a file
It is a nice way to get protected against unintended file encryption (example: ransomware)
Deleting a file in an S3 bucket just adds a delete marker in the version history
To delete a bucket, we need to remove all the files and their versions from it
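A CLI sketch for inspecting and permanently deleting a specific version (bucket, key and version ID are placeholders):
aws s3api list-object-versions --bucket my-bucket --prefix report.txt
aws s3api delete-object --bucket my-bucket --key report.txt --version-id <version-id>
# without --version-id the delete only adds a delete marker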
S3 MFA-Delete
MFA forces users to generate a code on a device (usually a mobile phone) before doing important operations on S3
To use MFA-Delete, we have to enable versioning on the S3 bucket
If MFA-Delete is active, we would require MFA generated codes in order to:
Permanently delete an object version
Suspend versioning on the bucket
We don’t need MFA codes for:
Enabling versioning
Listing deleted versions
Only bucket owner (root account) can enable/disable MFA-Delete!
MFA-Delete can currently be enabled using the CLI only!
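A sketch of the CLI call, run with root credentials (the MFA device ARN and code are placeholders):
aws s3api put-bucket-versioning --bucket my-bucket \
    --versioning-configuration Status=Enabled,MFADelete=Enabled \
    --mfa "arn:aws:iam::123456789012:mfa/root-account-mfa-device 123456"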
S3 Default Encryption and Bucket Policies
The old way of enabling default encryption was to use a bucket policy and refuse any HTTP command without proper headers
The new way is to use the default encryption option in S3
Bucket policies are evaluated before the default encryption flag
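A sketch of both approaches (bucket name is a placeholder): a deny statement for uploads without the SSE header, and the default-encryption option:
cat > force-sse.json << 'EOF'
{
  "Version": "2012-10-17",
  "Statement": [{
    "Sid": "DenyUnencryptedUploads",
    "Effect": "Deny",
    "Principal": "*",
    "Action": "s3:PutObject",
    "Resource": "arn:aws:s3:::my-bucket/*",
    "Condition": {"StringNotEquals": {"s3:x-amz-server-side-encryption": "AES256"}}
  }]
}
EOF
aws s3api put-bucket-policy --bucket my-bucket --policy file://force-sse.json
aws s3api put-bucket-encryption --bucket my-bucket \
    --server-side-encryption-configuration '{"Rules": [{"ApplyServerSideEncryptionByDefault": {"SSEAlgorithm": "AES256"}}]}'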
S3 Access Logs
For audit purposes, we may want to log all access to S3 buckets
Any request made to S3, from any account, authorized or denied, will be logged into another S3 bucket
This data can be analyzed using data analysis tools or Amazon Athena
WARNING: We should not set the logging bucket to be the monitored bucket. This will create a logging loop and the bucket will grow in size exponentially!
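A sketch of enabling access logging towards a separate logging bucket (names are placeholders; the target bucket must grant log delivery permissions):
aws s3api put-bucket-logging --bucket my-bucket \
    --bucket-logging-status '{"LoggingEnabled": {"TargetBucket": "my-logs-bucket", "TargetPrefix": "access-logs/"}}'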
S3 Replication
Replication can be cross-region (CRR) or same-region (SRR)
It happens asynchronously
In order to enable S3 replication, we must have versioning enabled in the source and destination buckets
Buckets can be in different accounts and must have proper IAM permissions
CRR - use cases: compliance, lower latency access, replication across accounts
SRR - use cases: log aggregation, live replication between production and test accounts
After activating replication, only new objects are replicated
For delete operations:
If we delete without a version ID, S3 adds a delete marker, which is not replicated
If we delete an object with a version ID, it deletes the object from the source bucket; the operation is not replicated
Delete marker replication is an option which can be enabled
There is no “chaining” of replication: if bucket 1 replicates into bucket 2, which replicates into bucket 3, then objects in bucket 1 are not replicated to bucket 3
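A minimal replication sketch, assuming versioning is already enabled on both buckets and a replication role exists (role ARN and bucket names are placeholders):
cat > replication.json << 'EOF'
{
  "Role": "arn:aws:iam::123456789012:role/s3-replication-role",
  "Rules": [{
    "Status": "Enabled",
    "Prefix": "",
    "Destination": {"Bucket": "arn:aws:s3:::destination-bucket"}
  }]
}
EOF
aws s3api put-bucket-replication --bucket source-bucket --replication-configuration file://replication.json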
S3 Pre-Signed URLs
We can generate pre-signed URLs using SDK or CLI
A pre-signed URL is valid for 3600 seconds by default; the timeout can be changed with the --expires-in [SECONDS] argument
Users given a pre-signed URL inherit the permissions of the person who generated the URL for GET/PUT
Use cases for pre-signed URLs:
Allow only logged-in users to download a premium video from a bucket
Allow an ever-changing list of users to download files by generating URLs dynamically
Allow temporarily a user to upload a file to a precise location in a bucket
Command to generate a presigned url:
aws s3 presign s3://bucket-name --region us-east-2 --expires-in 300
CloudFront
It is a content delivery network (CDN)
Improves read performance by caching content at the edge locations
Can provide DDoS protection, integration with AWS Shield and AWS WAF
Can expose external HTTPS endpoints and can talk to internal HTTPS back-ends
CloudFront - Origins
S3 buckets:
For distributing files and caching them at the edge
Enhanced security with CloudFront Origin Access Identity (OAI)
CloudFront can be used as an ingress to upload files
Custom Origin (HTTP)
Application Load Balancer
EC2 instance
S3 website (must first enable the bucket as a static S3 website)
Any HTTP backend
CloudFront Geo Restriction
We can restrict who can access our distribution by using:
Whitelist: allow our users to access our content only if they are in one of the countries on a list of approved countries
Blacklist: prevent our users from accessing our content if they are in one of the countries on a blacklist of banned countries
CloudFront vs S3 CRR
CloudFront:
Uses global edge network
Files are cached for a TTL
Great for static content which must be available everywhere
S3 Cross Region Replication:
Must be set up for each region we want replication to happen
Files are updated in near real-time
Great for dynamic content that needs to be available at low-latency in a few regions
CloudFront Access Logs
CloudFront access logs: contain every request made to CloudFront
Access logs are stored in S3
CloudFront Reports
It is possible to generate reports on:
Cache statistics report
Popular object reports
Top referrers report
Usage report
Viewers report
These reports are based on the data from access logs
CloudFront Troubleshooting
CloudFront caches HTTP 4xx and 5xx status codes returned by S3 or the origin server
4xx error code indicates the user does not have access to the underlying resources (403) or the object requested does not exist (404)
5xx error codes indicate gateway issues
S3 Inventory
Amazon S3 inventory helps manage our storage
Used to audit and report on the replication and encryption status of our objects
Use cases:
Business
Compliance
Regulatory needs
We can query all the data using Amazon Athena, Redshift, Presto, Hive, Spark, etc.
We can set up multiple inventories
Inventory setup: we have to set up a target bucket with a policy; inventory data will be saved there
S3 Storage Classes
Standard - General Purpose
Standard - Infrequent Access (IA)
One Zone - Infrequent Access
Intelligent Tiering
Glacier
Glacier Deep Archive
Reduced Redundancy Storage (deprecated)
S3 Standard - General Purpose
Provides high durability of objects across multiple AZs
99.99% availability over a year
Can sustain 2 concurrent facility failures
Suitable for frequently accessed data; use cases: big data, mobile and gaming applications, content distribution
S3 Standard - Infrequent Access (IA)
Provides high durability of objects across multiple AZs
99.9% availability
Lower cost compared to S3 Standard General Purpose
Can sustain 2 concurrent facility failures
Suitable for data that is less frequently accessed, but requires rapid access when needed. Use cases: data store for disaster recovery, backups
S3 One Zone - Infrequent Access (IA)
Similar to IA, but data is stored in a single AZ
99.5% availability
Provides low latency and high throughput
Supports SSL for data in transit and encryption at rest
Lower cost compared to IA (by about 20%)
Use cases: store secondary backup copies or data that can be recreated
S3 Intelligent Tiering
Same low latency and high throughput performance of S3 Standard
Requires a small monthly fee for monitoring and auto-tiering
Automatically moves objects between two access tiers based on changing access patterns
Resilient against events that impact an entire AZ
Amazon Glacier
Low cost object storage meant for archiving, backup
Data is retained for longer terms (tens of years)
It is an alternative to on-premise magnetic tape storage
Cost per storage per month is $0.004/GB + retrieval cost
Each item in Glacier is called an Archive. An Archive can be up to 40TB in size
Archives are stored in Vaults
Amazon Glacier data retrieval options:
Expedited (1 to 5 minutes)
Standard (3 to 5 hours)
Bulk (5 to 12 hours)
Minimum storage duration for Glacier is 90 days
Amazon Glacier Deep Archive data retrieval options:
Standard (up to 12 hours)
Bulk (up to 48 hours)
Minimum storage duration for Deep Archive is 180 days
S3 Moving Between Storage Classes
We can transition objects between storage classes
For infrequently accessed objects, we can move them to Standard IA
For archiving objects, we should move them to Glacier or Deep Archive
Moving objects can be automated using lifecycle configurations
S3 Lifecycle Rules
Transition actions: define when objects are transitioned to another storage class. Examples:
Move objects to Standard IA 60 days after creation
Move objects to Glacier for archiving after 6 months
Expiration actions: automatically expire (delete) objects after some time. Examples:
Access logs can be set to be deleted after 1 year
Can be used to delete old versions of files if versioning is enabled
Clean up incomplete multi-part upload parts
Rules can be created for a certain prefix
Rules can be created for certain tags as well
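A sketch of a lifecycle configuration combining a prefix filter, transitions and expiration (bucket name, prefix and day counts are just illustrative):
cat > lifecycle.json << 'EOF'
{
  "Rules": [{
    "ID": "archive-then-expire-logs",
    "Status": "Enabled",
    "Filter": {"Prefix": "logs/"},
    "Transitions": [
      {"Days": 60, "StorageClass": "STANDARD_IA"},
      {"Days": 180, "StorageClass": "GLACIER"}
    ],
    "Expiration": {"Days": 365}
  }]
}
EOF
aws s3api put-bucket-lifecycle-configuration --bucket my-bucket --lifecycle-configuration file://lifecycle.json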
S3 Baseline Performance
S3 automatically scales to high request rates, with a latency of 100-200 ms
Our application can achieve at least 3500 PUT/COPY/POST/DELETE and 5500 GET/HEAD requests per second per prefix in a bucket
There are no limits to the number of prefixes in a bucket
S3 - KMS Limitation
If we use SSE-KMS, we may be impacted by the KMS limits
When we upload a file to a bucket with KMS encryption enabled, S3 calls the GenerateDataKey KMS API
When we download an encrypted file, it will call the Decrypt KMS API
These calls count towards the KMS quota per second (which can be 5500, 10000, 30000 req/s based on the region)
Quota increases for KMS cannot be requested
S3 Performance Optimization
S3 Uploads
Multi-Part upload:
Recommended for files > 100MB, required for files > 5GB
Can help parallelize uploads speeding up transfers
S3 Transfer Acceleration:
Increases the transfer speed by transferring the file to an AWS edge location, which will forward the data to the S3 bucket in the target region
Compatible with multi-part upload
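A sketch of enabling acceleration and pointing the CLI at the accelerate endpoint (bucket and file names are placeholders):
aws s3api put-bucket-accelerate-configuration --bucket my-bucket --accelerate-configuration Status=Enabled
aws configure set default.s3.use_accelerate_endpoint true
aws s3 cp big-archive.zip s3://my-bucket/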
S3 Byte-Range Fetches
Parallelize GETs by requesting specific byte ranges
Better resilience in case of failures
Can be used to speed up downloads
Can be used to retrieve only partial data (for example, the head of a file)
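A sketch of fetching only the first kilobyte of an object (names are placeholders):
aws s3api get-object --bucket my-bucket --key big-archive.zip --range bytes=0-1023 first-kb.bin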
S3 Select and Glacier Select
Retrieve less data using SQL by performing server side filtering
Can filter by rows and columns (simple SQL statements)
Using selects means less network traffic and less CPU cost on the client side
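A hedged CLI sketch of S3 Select over a CSV object (bucket, key and column names are assumptions):
aws s3api select-object-content --bucket my-bucket --key data.csv \
    --expression "SELECT s.name FROM S3Object s WHERE s.country = 'US'" --expression-type SQL \
    --input-serialization '{"CSV": {"FileHeaderInfo": "USE"}}' \
    --output-serialization '{"CSV": {}}' results.csv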
S3 Event Notifications
React to changes or events happening in a bucket. Examples of events:
S3:ObjectCreated
S3:ObjectRemoved
S3:ObjectRestore
S3:Replication
Use cases: generate thumbnails for images uploaded to S3
Event consumers can be:
SNS
SQS
Lambda Functions
S3 event notifications typically deliver events in seconds, but sometimes can take over a minute
If two writes are made to a single non-versioned object at the same time, it is possible that only a single event notification will be sent. To solve this issue, we have to enable versioning on the bucket
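A sketch of wiring ObjectCreated events to a Lambda function (the function ARN is a placeholder, and the function must already permit S3 to invoke it):
cat > notification.json << 'EOF'
{
  "LambdaFunctionConfigurations": [{
    "LambdaFunctionArn": "arn:aws:lambda:us-east-2:123456789012:function:generate-thumbnail",
    "Events": ["s3:ObjectCreated:*"]
  }]
}
EOF
aws s3api put-bucket-notification-configuration --bucket my-bucket --notification-configuration file://notification.json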
S3 Analytics - Storage Class Analysis
We can set up analytics to help determine when to transition objects from standard to standard IA
Does not work for one zone IA or Glacier
Report is updated on daily basis
Takes about 24h to 48h to first start
Helps put together lifecycle rules
Glacier Overview
Low cost storage for archiving and backups
Data is retained for longer terms
Alternative to on-premise tape storage
Cost per storage per month ($0.004/GB + retrieval cost)
Each item in Glacier is called an Archive, which can be up to 40TB
Archives are stored in Vaults
Glacier Operations
Provides 3 types of operations:
Upload: single operation or by parts (multi-part upload) for larger archives
Download: first initiate a retrieval job for the particular archive, then Glacier prepares the data for download. Users have a limited time to download the data from the staging server
Delete: single operation, achieved by using the Glacier REST API and specifying the archive ID
Restore links have an expiry date
Glacier provides 3 retrieval options:
Expedited (1 to 5 minutes retrieval time): $0.03 per GB and $0.01 per request
Standard (3 to 5 hours retrieval time): $0.01 per GB and $0.05 per 1000 requests
Bulk (5 to 12 hours retrieval time): $0.0025 per GB and $0.025 per 1000 requests
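A sketch of starting a Standard retrieval job with the CLI (vault name and archive ID are placeholders; "-" means the current account):
aws glacier initiate-job --account-id - --vault-name my-vault \
    --job-parameters '{"Type": "archive-retrieval", "ArchiveId": "<archive-id>", "Tier": "Standard"}'
Once the job completes, aws glacier get-job-output downloads the staged data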
Glacier Vault Policies and Vault Locks
Vault is a collection of archives
Each vault has:
ONE vault access policy
One vault lock policy
Vault policies are written in JSON
Vault access policies are similar to bucket policies
Vault lock policies are used for regulatory and compliance requirements:
The policy is immutable, it can never be changed
Example: forbid deleting an archive if it is less than 1 year old
Example: WORM policy (write once read many times)
Snowball
Physical data transport solution for moving TBs or PBs of data in and out of AWS
Alternative to move data over the network
It is secure, tamper resistant, and uses KMS 256-bit encryption
Tracking is done using SNS and text messages. It has an E-ink shipping label
Pay per data transfer job
Use cases: large data cloud migration, DC decommission, disaster recovery
If it takes more than a week to transfer the data over the network, it is probably a good idea to use Snowball
Snowball Transferring Process
Request a Snowball device from the AWS console for delivery
Install the Snowball client on our server
Copy the files to the device using the Snowball client
Ship back the device
Data will be loaded into an S3 bucket
Snowball will be wiped completely
Snowball Edge
Snowball Edge adds computational capability to the device