There is no concept of directories within buckets, although the UI can trick us into thinking otherwise
The keys can be very long names which contain slashes (“/”)
The object value is the content of the body
Max object size is 5TB
If uploading more than 5 GB, we must use multi-part upload
Each object in S3 can have metadata, tags and (optional) version ID
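A minimal CLI sketch of an upload with metadata and tags (bucket, key and values here are just placeholders; the slash in the key illustrates the "directory-like" naming):
aws s3api put-object --bucket my-bucket --key folder1/report.txt --body report.txt \
    --metadata project=notes --tagging "env=dev"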
S3 Versioning
Versioning should be enabled at the bucket level
Versioning means: objects uploaded with same key will increment the version
It is best practice to version files in S3, because:
It will protect against unintended deletes (ability to restore version)
Provides the ability to rollback a file
Notes:
Any file that is not versioned prior to enabling versioning will have the version "null"
Suspending versioning does not delete the previous versions of existing files
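Versioning can be enabled from the console or the CLI; a sketch (bucket name is a placeholder), where suspending later would use Status=Suspended:
aws s3api put-bucket-versioning --bucket my-bucket --versioning-configuration Status=Enabled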
S3 Encryption for Objects
There are 4 methods of encrypting objects in S3:
SSE-S3: encrypts S3 objects using keys handled and managed by AWS
SSE-KMS: leverages AWS KMS to manage encryption keys
SSE-C: keys are managed by the users
Client Side Encryption
SSE-S3:
Encryption using keys handled and managed by AWS
Objects are encrypted server side
Uses AES-256 encryption
In order to upload and set the SSE-S3 encryption, the upload request must have a header "x-amz-server-side-encryption": "AES256"
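A CLI sketch that sets this header for us (bucket and key are placeholders):
aws s3api put-object --bucket my-bucket --key report.txt --body report.txt --server-side-encryption AES256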
SSE-KMS
Encryption using keys handled and managed by KMS
KMS advantages: user control + audit trail
Objects are encrypted on the server side
In order to upload and set the SSE-KMS encryption, the upload request must have a header "x-amz-server-side-encryption": "aws:kms"
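A CLI sketch for SSE-KMS (the key ID is a placeholder; if --ssekms-key-id is omitted, the default aws/s3 key is used):
aws s3api put-object --bucket my-bucket --key report.txt --body report.txt \
    --server-side-encryption aws:kms --ssekms-key-id <kms-key-id>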
SSE-C
Server-side encryption using the keys provided by the customers
S3 does not store the encryption key the customers provide
To upload the data, HTTPS must be used
Encryption key must be provided in HTTP headers, for every HTTP request
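A sketch with the high-level CLI, assuming a locally generated 256-bit key stored in sse-c.key (the key file name is just an example):
openssl rand 32 > sse-c.key
aws s3 cp report.txt s3://my-bucket/report.txt --sse-c AES256 --sse-c-key fileb://sse-c.key
The same key must be supplied again with --sse-c/--sse-c-key when downloading the object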
Client Side Encryption
Files are encrypted by the customers before uploading them to S3
Client-side libraries can be used to accomplish this, for example the Amazon S3 Encryption Client
Customers are also responsible for decrypting the data themselves when retrieving it
Encryption keys are fully managed by the customers
S3 Security
User based security:
IAM policies - which API calls should be allowed for a specific user from IAM console
Resource based:
Bucket policies - bucket-wide rules set from the S3 console, allow cross-account access
Object Access Control List
Bucket Access Control List
An IAM principal can access an S3 object if:
the user IAM permissions allow it or the resource policy allows it
there is no explicit DENY
S3 Bucket Policies
JSON based documents, which can contain:
Resources: buckets and objects
Actions: set of APIs to Allow or Deny
Effects: Allow or Deny
Principal: the account or user to apply the policy to
Use an S3 bucket policy to:
Grant public access to the bucket
Force objects to be encrypted at upload
Grant access to another account (cross-account)
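A minimal public-read policy sketch and the command to attach it (bucket name is a placeholder):
cat > policy.json << 'EOF'
{
  "Version": "2012-10-17",
  "Statement": [{
    "Sid": "PublicRead",
    "Effect": "Allow",
    "Principal": "*",
    "Action": "s3:GetObject",
    "Resource": "arn:aws:s3:::my-bucket/*"
  }]
}
EOF
aws s3api put-bucket-policy --bucket my-bucket --policy file://policy.json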
Bucket Settings for Block Public Access
Block public access to buckets and objects granted through
new access control lists (ACLs)
any access control lists (ACLs)
new public bucket or access point policies
Block public and cross-account access to buckets and objects through any public or access point policies
These settings were created to prevent company data leaks
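A CLI sketch that turns on all four settings for one bucket (name is a placeholder):
aws s3api put-public-access-block --bucket my-bucket \
    --public-access-block-configuration BlockPublicAcls=true,IgnorePublicAcls=true,BlockPublicPolicy=true,RestrictPublicBuckets=true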
ACL (Access Control List): define read and write permissions at the object level
S3 Security - Other
Networking: supports VPC Endpoints
Logging and audit:
S3 Access Logs can be stored in other S3 buckets
API calls can be logged in AWS CloudTrail
User security:
MFA Delete
Pre-Signed URLs: URLs that are valid only for a limited time
S3 Websites
S3 can host static websites and have them accessible on the web
The website URL will be something like this:
s3-website dash (-) Region: http://bucket-name.s3-website-Region.amazonaws.com
s3-website dot (.) Region: http://bucket-name.s3-website.Region.amazonaws.com
If we forget to allow public reads in bucket policies, we will get a 403 - Forbidden when accessing the website
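A CLI sketch of enabling website hosting (bucket and document names are placeholders; public reads must still be allowed by the bucket policy):
aws s3 website s3://my-bucket/ --index-document index.html --error-document error.html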
CORS
An origin is a scheme (protocol), host (domain) and a port
CORS: Cross-Origin Resource Sharing
Web browser based mechanism to allow requests to other origins while visiting the main origin
The request won't be fulfilled unless the other origin allows for the request, using CORS headers (ex: Access-Control-Allow-Origin, Access-Control-Allow-Methods)
If a client does a cross-origin request on our S3 bucket, we need to enable the correct CORS headers
We can allow for a specific origin or for * (all origins)
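A sketch of a CORS configuration allowing GETs from a single origin (origin and bucket name are placeholders):
cat > cors.json << 'EOF'
{
  "CORSRules": [{
    "AllowedOrigins": ["https://www.example.com"],
    "AllowedMethods": ["GET"],
    "AllowedHeaders": ["*"],
    "MaxAgeSeconds": 3000
  }]
}
EOF
aws s3api put-bucket-cors --bucket my-bucket --cors-configuration file://cors.json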
S3 Consistency Model
S3 provides read after write consistency for PUTS of new objects
As soon as a new object is written, we can retrieve it (PUT 200 => GET 200)
This is true, except if we did a GET before to see if the object existed (GET 404 => PUT 200 => GET 404)
S3 provides eventual consistency for DELETES and PUTS of existing objects
If we read an object after updating it, we might get the older version (PUT 200 => PUT 200 => GET 200 (which might return the older version))
If we delete an object, we might still be able to read it for a short amount of time (DELETE 200 => GET 200)
There is no way to request strong consistency for S3!
S3 Advanced Versioning
S3 Versioning creates a new version each time a file is changed
That includes when we encrypt a file
It is a nice way to get protected against unintended file encryption (example: ransomware)
Deleting a file in an S3 bucket just adds a delete marker in the version history
To delete a bucket, we need to remove all the files and their versions from it
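A CLI sketch for inspecting and permanently deleting a specific version (bucket, key and version ID are placeholders):
aws s3api list-object-versions --bucket my-bucket --prefix report.txt
aws s3api delete-object --bucket my-bucket --key report.txt --version-id <version-id>
# without --version-id the delete only adds a delete marker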
S3 MFA-Delete
MFA forces users to generate a code on a device (usually a mobile phone) before doing important operations on S3
To use MFA-Delete, we have to enable versioning on the S3 bucket
If MFA-Delete is active, we would require MFA generated codes in order to:
Permanently delete an object version
Suspend versioning on the bucket
We don’t need MFA codes for:
Enabling versioning
Listing deleted versions
Only bucket owner (root account) can enable/disable MFA-Delete!
MFA-Delete can currently be enabled using the CLI only!
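A sketch of the CLI call, run with root credentials (the MFA device ARN and code are placeholders):
aws s3api put-bucket-versioning --bucket my-bucket \
    --versioning-configuration Status=Enabled,MFADelete=Enabled \
    --mfa "arn:aws:iam::123456789012:mfa/root-account-mfa-device 123456"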
S3 Default Encryption and Bucket Policies
The old way of enabling default encryption was to use a bucket policy and refuse any HTTP command without proper headers
The new way is to use the default encryption option in S3
Bucket policies are evaluated before the default encryption flag
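A sketch of both approaches (bucket name is a placeholder): a deny statement for uploads without the SSE header, and the default-encryption option:
cat > force-sse.json << 'EOF'
{
  "Version": "2012-10-17",
  "Statement": [{
    "Sid": "DenyUnencryptedUploads",
    "Effect": "Deny",
    "Principal": "*",
    "Action": "s3:PutObject",
    "Resource": "arn:aws:s3:::my-bucket/*",
    "Condition": {"StringNotEquals": {"s3:x-amz-server-side-encryption": "AES256"}}
  }]
}
EOF
aws s3api put-bucket-policy --bucket my-bucket --policy file://force-sse.json
aws s3api put-bucket-encryption --bucket my-bucket \
    --server-side-encryption-configuration '{"Rules": [{"ApplyServerSideEncryptionByDefault": {"SSEAlgorithm": "AES256"}}]}'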
S3 Access Logs
For audit purposes, we may want to log all access to S3 buckets
Any request made to S3, from any account, authorized or denied, will be logged into another S3 bucket
This data can be analyzed using data analysis tools or Amazon Athena
WARNING: We should not set the logging bucket to be the monitored bucket. This will create a logging loop and the bucket will grow in size exponentially!
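A sketch of enabling access logging towards a separate logging bucket (names are placeholders; the target bucket must grant log delivery permissions):
aws s3api put-bucket-logging --bucket my-bucket \
    --bucket-logging-status '{"LoggingEnabled": {"TargetBucket": "my-logs-bucket", "TargetPrefix": "access-logs/"}}'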
S3 Replication
Replication can be cross-region (CRR) or same-region (SRR)
It happens asynchronously
In order to enable S3 replication, we must have versioning enabled in the source and destination buckets
Buckets can be in different accounts and must have proper IAM permissions
CRR - use cases: compliance, lower latency access, replication across accounts
SRR - use cases: log aggregation, live replication between production and test accounts
After activating replication, only new objects are replicated
For delete operations:
If we delete without a version ID, S3 adds a delete marker, which is not replicated
If we delete an object with a version ID, it deletes the object from the source bucket; the operation is not replicated
Delete marker replication is an option which can be enabled
There is no “chaining” of replication: if bucket 1 replicates into bucket 2, which replicates into bucket 3, then objects in bucket 1 are not replicated to bucket 3
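A minimal replication sketch, assuming versioning is already enabled on both buckets and a replication role exists (role ARN and bucket names are placeholders):
cat > replication.json << 'EOF'
{
  "Role": "arn:aws:iam::123456789012:role/s3-replication-role",
  "Rules": [{
    "Status": "Enabled",
    "Prefix": "",
    "Destination": {"Bucket": "arn:aws:s3:::destination-bucket"}
  }]
}
EOF
aws s3api put-bucket-replication --bucket source-bucket --replication-configuration file://replication.json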
S3 Pre-Signed URLs
We can generate pre-signed URLs using SDK or CLI
A pre-signed URL is valid for 3600 seconds by default; the timeout can be changed with the --expires-in [SECONDS] argument
Users given a pre-signed URL inherit the permissions of the person who generated the URL for GET/PUT
Use cases for pre-signed URLs:
Allow only logged-in users to download a premium video from a bucket
Allow an ever-changing list of users to download files by generating URLs dynamically
Allow temporarily a user to upload a file to a precise location in a bucket
Command to generate a presigned url:
aws s3 presign s3://bucket-name --region us-east-2 --expires-in 300
CloudFront
It is a content delivery network (CDN)
Improves read performance by caching content at the edge locations
Can provide DDoS protection, integration with AWS Shield and AWS WAF
Can expose external HTTPS endpoints and can talk to internal HTTPS back-ends
CloudFront - Origins
S3 buckets:
For distributing files and caching them at the edge
Enhanced security with CloudFront Origin Access Identity (OAI)
CloudFront can be used as an ingress to upload files
Custom Origin (HTTP)
Application Load Balancer
EC2 instance
S3 website (must first enable the bucket as a static S3 website)
Any HTTP backend
CloudFront Geo Restriction
We can restrict who can access our distribution by using:
Whitelist: allow our users to access our content only if they are in one of the countries on a list of approved countries
Blacklist: prevent our users from accessing our content if they are in one of the countries on a blacklist of banned countries
CloudFront vs S3 CRR
CloudFront:
Uses global edge network
Files are cached for a TTL
Great for static content which must be available everywhere
S3 Cross Region Replication:
Must be set up for each region we want replication to happen
Files are updated in near real-time
Great for dynamic content that needs to be available at low-latency in a few regions
CloudFront Access Logs
CloudFront access logs: contain every request made to CloudFront
Access logs are stored in S3
CloudFront Reports
It is possible to generate reports on:
Cache statistics report
Popular object reports
Top referrers report
Usage report
Viewers report
These reports are based on the data from access logs
CloudFront Troubleshooting
CloudFront caches HTTP 4xx and 5xx status codes returned by S3 or the origin server
4xx error code indicates the user does not have access to the underlying resources (403) or the object requested does not exist (404)
5xx error codes indicate gateway issues
S3 Inventory
Amazon S3 inventory helps manage our storage
Used to audit and report on the replication and encryption status of our objects
Use cases:
Business
Compliance
Regulatory needs
We can query all the data using Amazon Athena, Redshift, Presto, Hive, Spark, etc.
We can set up multiple inventories
Inventory setup: we have to set up a target bucket with a policy; inventory data will be saved there
S3 Storage Classes
Standard - General Purpose
Standard - Infrequent Access (IA)
One Zone - Infrequent Access
Intelligent Tiering
Glacier
Glacier Deep Archive
Reduced Redundancy Storage (deprecated)
S3 Standard - General Purpose
Provides high durability of objects across multiple AZs
99.99% availability over a year
Can sustain 2 concurrent facility failures
Suitable for frequently accessed data; use cases: big data, mobile and gaming applications, content distribution
S3 Standard - Infrequent Access (IA)
Provides high durability of objects across multiple AZs
99.9% availability
Lower cost compared to S3 Standard General Purpose
Can sustain 2 concurrent facility failures
Suitable for data that is less frequently accessed, but requires rapid access when needed. Use cases: data store for disaster recovery, backups
S3 One Zone - Infrequent Access (IA)
Similar to IA, but data is stored in a single AZ
99.5% availability
Provides low latency and high throughput
Supports SSL for data in transit and encryption at rest
Lower cost compared to IA (by about 20%)
Use cases: store secondary backup copies or data that can be recreated
S3 Intelligent Tiering
Same low latency and high throughput performance of S3 Standard
Requires a small monthly fee for monitoring and auto-tiering
Automatically moves objects between two access tiers based on changing access patterns
Resilient against events that impact an entire AZ
Amazon Glacier
Low cost object storage meant for archiving, backup
Data is retained for longer terms (tens of years)
It is an alternative to on-premise magnetic tape storage
Cost per storage per month is $0.004/GB + retrieval cost
Each item in Glacier is called an Archive. An Archive can be up to 40TB in size
Archives are stored in Vaults
Amazon Glacier data retrieval options:
Expedited (1 to 5 minutes)
Standard (3 to 5 hours)
Bulk (5 to 12 hours)
Minimum storage duration for Glacier is 90 days
Amazon Glacier Deep Archive data retrieval options:
Standard (up to 12 hours)
Bulk (up to 48 hours)
Minimum storage duration for Deep Archive is 180 days
S3 Moving Between Storage Classes
We can transition objects between storage classes
For infrequently accessed objects, we can move them to Standard IA
For archiving objects, we should move them to Glacier or Deep Archive
Moving objects can be automated using lifecycle configurations
S3 Lifecycle Rules
Transition actions: define when objects are transitioned to another storage class. Examples:
Move objects to Standard IA 60 days after creation
Move objects to Glacier for archiving after 6 months
Expiration actions: automatically expire (delete) objects after some time. Examples:
Access logs can be set to be deleted after 1 year
Can be used to delete old versions of files if versioning is enabled
Clean up incomplete multi-part upload parts
Rules can be created for a certain prefix
Rules can be created for certain tags as well
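A sketch of a lifecycle configuration combining a prefix filter, transitions and expiration (bucket name, prefix and day counts are just illustrative):
cat > lifecycle.json << 'EOF'
{
  "Rules": [{
    "ID": "archive-then-expire-logs",
    "Status": "Enabled",
    "Filter": {"Prefix": "logs/"},
    "Transitions": [
      {"Days": 60, "StorageClass": "STANDARD_IA"},
      {"Days": 180, "StorageClass": "GLACIER"}
    ],
    "Expiration": {"Days": 365}
  }]
}
EOF
aws s3api put-bucket-lifecycle-configuration --bucket my-bucket --lifecycle-configuration file://lifecycle.json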
S3 Baseline Performance
S3 automatically scales to high request rates, with a latency of 100-200 ms
Our application can achieve at least 3500 PUT/COPY/POST/DELETE and 5500 GET/HEAD requests per second per prefix in a bucket
There are no limits to the number of prefixes in a bucket
S3 - KMS Limitation
If we use SSE-KMS, we may be impacted by the KMS limits
When we upload a file to a bucket with KMS encryption enabled, S3 calls the GenerateDataKey KMS API
When we download an encrypted file, it will call the Decrypt KMS API
These calls count towards the KMS quota per second (which can be 5500, 10000, 30000 req/s based on the region)
Quota increases for KMS cannot be requested
S3 Performance Optimization
S3 Uploads
Multi-Part upload:
Recommended for files > 100MB, required for files > 5GB
Can help parallelize uploads speeding up transfers
S3 Transfer Acceleration:
Increases the transfer speed by transferring the file to an AWS edge location, which will forward the data to the S3 bucket in the target region
Compatible with multi-part upload
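A sketch of enabling acceleration and pointing the CLI at the accelerate endpoint (bucket and file names are placeholders):
aws s3api put-bucket-accelerate-configuration --bucket my-bucket --accelerate-configuration Status=Enabled
aws configure set default.s3.use_accelerate_endpoint true
aws s3 cp big-archive.zip s3://my-bucket/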
S3 Byte-Range Fetches
Parallelize GETs by requesting specific byte ranges
Better resilience in case of failures
Can be used to speed up downloads
Can be used to retrieve only partial data (for example, the head of a file)
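A sketch of fetching only the first kilobyte of an object (names are placeholders):
aws s3api get-object --bucket my-bucket --key big-archive.zip --range bytes=0-1023 first-kb.bin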
S3 Select and Glacier Select
Retrieve less data using SQL by performing server side filtering
Can filter by rows and columns (simple SQL statements)
Using selects means less network traffic and less CPU cost on the client side
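A hedged CLI sketch of S3 Select over a CSV object (bucket, key and column names are assumptions):
aws s3api select-object-content --bucket my-bucket --key data.csv \
    --expression "SELECT s.name FROM S3Object s WHERE s.country = 'US'" --expression-type SQL \
    --input-serialization '{"CSV": {"FileHeaderInfo": "USE"}}' \
    --output-serialization '{"CSV": {}}' results.csv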
S3 Event Notifications
React to changes or events happening in a bucket. Examples of events:
S3:ObjectCreated
S3:ObjectRemoved
S3:ObjectRestore
S3:Replication
Use cases: generate thumbnails for images uploaded to S3
Event consumers can be:
SNS
SQS
Lambda Functions
S3 event notifications typically deliver events in seconds, but sometimes can take over a minute
If two writes are made to a single non-versioned object at the same time, it is possible that only a single event notification will be sent. To solve this issue, we have to enable versioning on the bucket
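A sketch of wiring ObjectCreated events to a Lambda function (the function ARN is a placeholder, and the function must already permit S3 to invoke it):
cat > notification.json << 'EOF'
{
  "LambdaFunctionConfigurations": [{
    "LambdaFunctionArn": "arn:aws:lambda:us-east-2:123456789012:function:generate-thumbnail",
    "Events": ["s3:ObjectCreated:*"]
  }]
}
EOF
aws s3api put-bucket-notification-configuration --bucket my-bucket --notification-configuration file://notification.json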
S3 Analytics - Storage Class Analysis
We can set up analytics to help determine when to transition objects from standard to standard IA
Does not work for one zone IA or Glacier
Report is updated on daily basis
Takes about 24h to 48h to first start
Helps put together lifecycle rules
Glacier Overview
Low cost storage for archiving and backups
Data is retained for longer terms
Alternative to on-premise tape storage
Cost per storage per month ($0.004/GB + retrieval cost)
Each item in Glacier is called an Archive, which can be up to 40TB
Archives are stored in Vaults
Glacier Operations
Provides 3 types of operations:
Upload: single operation or by parts (multi-part upload) for larger archives
Download: first initiate a retrieval job for the particular archive, then Glacier prepares the data for download. Users have a limited time to download the data from the staging server
Delete: single operation, achieved by using the Glacier REST API and specifying the archive ID
Restore links have an expiry date
Glacier provides 3 retrieval options:
Expedited (1 to 5 minutes retrieval time): $0.03 per GB and $0.01 per request
Standard (3 to 5 hours retrieval time): $0.01 per GB and $0.05 per 1000 requests
Bulk (5 to 12 hours retrieval time): $0.0025 per GB and $0.025 per 1000 requests
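A sketch of starting a Standard retrieval job with the CLI (vault name and archive ID are placeholders; "-" means the current account):
aws glacier initiate-job --account-id - --vault-name my-vault \
    --job-parameters '{"Type": "archive-retrieval", "ArchiveId": "<archive-id>", "Tier": "Standard"}'
Once the job completes, aws glacier get-job-output downloads the staged data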
Glacier Vault Policies and Vault Locks
Vault is a collection of archives
Each vault has:
ONE vault access policy
One vault lock policy
Vault policies are written in JSON
Vault access policies are similar to bucket policies
Vault lock policies are used for regulatory and compliance requirements:
The policy is immutable, it can never be changed
Example: forbid deleting an archive if it is less than 1 year old
Example: WORM policy (write once read many times)
Snowball
Physical data transport solution for moving TBs or PBs of data in and out of AWS
Alternative to move data over the network
It is secure, tamper resistant, and uses KMS 256-bit encryption
Tracking is done using SNS and text messages. It has an E-ink shipping label
Pay per data transfer job
Use cases: large data cloud migration, DC decommission, disaster recovery
If it takes more than a week to transfer the data over the network, it is probably a good idea to use Snowball
Snowball Transferring Process
Request a Snowball device from the AWS console for delivery
Install the Snowball client on our server
Copy the files to the device using the Snowball client
Ship back the device
Data will be loaded into an S3 bucket
Snowball will be wiped completely
Snowball Edge
Snowball Edge adds computational capability to the device