EC2 High Availability and Scalability - Load Balancers and Auto Scaling Groups
Scalability and High Availability
Scalability: an application or system can handle greater loads by adapting
There are two kinds of scalability:
Vertical scalability: we need to increase the size of the instance. Used for non-distributed systems such as databases
Horizontal scalability (elasticity): we need to increase the number of instances/systems for the application. Implies having distributed systems, such as modern web applications
High Availability
High availability means running an application/system in at least 2 data centers (usually 2 AZs)
The goal of high availability is to survive the loss of a data center
High availability can be passive (RDS Multi AZ deployment) or active (horizontal scaling)
Load Balancing Introduction
Load balancers are servers that forward internet traffic to multiple servers (EC2 instances)
Reasons to use a load balancer:
Spread the load across multiple downstream instances
Expose a single point of access (DNS) to the application
Seamlessly handle failures of downstream instances
Do regular health checks to the instances, redirect the traffic in case of failure to healthy instances
Provide SSL termination (HTTPS) for our website
Enforce stickiness with cookies
Provide high availability across availability zones
Separate public traffic from private traffic
Elastic Load Balancer (ELB)
An ELB is a managed load balancer, meaning:
AWS guarantees that it will be working
AWS takes care of upgrades, maintenance and high availability
AWS provides only a few configuration knobs
It costs less to set up our own, non-managed load balancer, but using a managed one is a lot less effort
Types of Load Balancers
AWS offers 3 types of load balancers:
Classic Load Balancer (v1 - old generation)
Application Load Balancer (v2 - new generation)
Network Load Balancer (v2 - new generation)
Overall it is recommended to use the newer v2 generation load balancers as they provide more features
We can set up external or internal load balancers
Health Checks
Health Checks are crucial for load balancers
They enable the load balancer to know whether the instances it forwards traffic to are available to reply to requests
The health check is done on a port and a route (/health is common)
If the response is not 200 (OK), then the instance is considered to be unhealthy
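As an illustration of the health check setup above, here is a minimal AWS CLI sketch; the target group name, VPC ID and thresholds are placeholder values, not prescribed settings:
  # Create an ALB target group that health-checks /health and expects HTTP 200
  aws elbv2 create-target-group \
      --name app-tg \
      --protocol HTTP --port 80 \
      --vpc-id vpc-1234567890abcdef0 \
      --health-check-protocol HTTP \
      --health-check-path /health \
      --healthy-threshold-count 2 \
      --unhealthy-threshold-count 2 \
      --matcher HttpCode=200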
Application Load Balancers (v2)
Application Load Balancers (Layer 7) support:
Load balancing to multiple HTTP applications across machines (target groups)
Load balancing to multiple applications on the same machine (ex: containers)
Load balancing based on the path in the URL (path-based routing)
Load balancing based on the hostname (host-based routing)
Basically, they are perfect for microservices and container-based applications
ALB has a port mapping feature to redirect to a dynamic port
Stickiness can be enabled at the target group level
Same request goes to the same instance
Stickiness is directly generated by the ALB (not the application)
ALB supports HTTP/HTTPS and WebSockets protocols
The application servers don’t see the IP of the client directly
The IP of the client that initiated the request is encoded in the X-Forwarded-For header
We can also get the port (X-Forwarded-Port) and the protocol (X-Forwarded-Proto)
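A minimal sketch of the hostname- and path-based routing described above, using the AWS CLI (the listener/target group ARNs, the domain and the priorities are placeholders):
  # Route requests for api.example.com to a dedicated target group
  aws elbv2 create-rule \
      --listener-arn <listener-arn> \
      --priority 10 \
      --conditions Field=host-header,Values=api.example.com \
      --actions Type=forward,TargetGroupArn=<api-target-group-arn>
  # Route /users/* to another target group (path-based routing)
  aws elbv2 create-rule \
      --listener-arn <listener-arn> \
      --priority 20 \
      --conditions Field=path-pattern,Values='/users/*' \
      --actions Type=forward,TargetGroupArn=<users-target-group-arn>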
Network Load Balancers (v2)
Network Load Balancers (Layer 4) allow us to:
Forward TCP and UDP traffic to the instances
Handle millions of requests per second
Support for static IP and elastic IP
NLBs have less latency (~100ms) compared to ALB (~400ms)
NLBs are mostly used for extreme performance and should not be the default load balancer of choice
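As a sketch of the static/Elastic IP support mentioned above, an NLB can be created with one Elastic IP per subnet (the subnet and allocation IDs below are placeholders):
  # Network Load Balancer with an Elastic IP attached in each subnet
  aws elbv2 create-load-balancer \
      --name my-nlb \
      --type network \
      --subnet-mappings SubnetId=subnet-aaa111,AllocationId=eipalloc-bbb222 \
                        SubnetId=subnet-ccc333,AllocationId=eipalloc-ddd444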
Good to Know
Classic Load Balancers (CLB) are deprecated, we should use ALB or NLB instead
CLB and ALB support SSL certificates and provide SSL termination
All load balancers have health check capability
ALB can route based on hostname and path
All load balancers have a static host name; we should not resolve the URL to get the IP
LBs can scale but not instantly - contact AWS for warm-up
In case of NLB, we can directly see the requester IP address
4xx errors are client induced errors
5xx errors are application induced errors
LB error 503 means the LB is at capacity or there are no registered targets
If the LB cannot connect to an application, we should check the security groups
Load Balancer Stickiness
It is possible to implement stickiness meaning the same client is always redirected to the same instance behind a load balancer
Stickiness works for both CLB and ALB
Stickiness is achieved by setting a cookie on the request. This cookie has an expiration date
Use case for stickiness: make sure the same user does not lose their session data
Enabling stickiness may bring imbalance to the load over the backend EC2 instances
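A minimal sketch of enabling stickiness on an ALB target group with the AWS CLI (the target group ARN and the cookie duration are placeholder values):
  # Enable load-balancer-generated cookie stickiness for 1 day
  aws elbv2 modify-target-group-attributes \
      --target-group-arn <target-group-arn> \
      --attributes Key=stickiness.enabled,Value=true \
                   Key=stickiness.type,Value=lb_cookie \
                   Key=stickiness.lb_cookie.duration_seconds,Value=86400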
Load Balancer for SysOps
Application Load Balancer (ALB):
Layer 7 (HTTP, HTTPS, WebSocket)
URL based routing (hostname or path)
Does not support static IP, but has a fixed DNS
Provides SSL termination
Network Load Balancer (NLB):
Layer 4 (TCP)
No pre-warming needed
It provides a static IP per subnet
Does not provide SSL termination (SSL must be enabled by the application itself)
Fixed IP for an ALB: we have to chain an NLB in front of an ALB to have a static, fixed IP address for the ALB
Load Balancer Pre-Warming
ELB scales gradually to the actual traffic
ELB may fail in case of sudden spike of traffic (10x traffic)
If we expect high traffic, we have to open a support ticket with AWS to pre-warm the ELB. We have to answer the following questions:
Duration of traffic
Expected requests per second
Size of a request (in KB)
Load Balancer Error Codes
Successful request: 200
Unsuccessful at client side: 4xx
400: Bad Request
401: Unauthorized
403: Forbidden
460: Client closed connection
463: the X-Forwarded-For header has over 30 IP addresses (similar to a malformed request)
Unsuccessful at server side: 5xx
500: Internal server error, would mean some error happened on the ELB itself
502: Bad Gateway
503: Service Unavailable (server overloaded)
504: Gateway Timeout
561: Unauthorized Request
SSL for Older Browser
Common question: how do we support legacy browsers that only support an old TLS version (TLS 1.0)?
Answer: change the SSL policy to allow a weaker cipher (example: DES-CBC3-SHA for TLS 1.0)
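A sketch of applying such a policy on an ALB HTTPS listener; ELBSecurityPolicy-TLS-1-0-2015-04 is a predefined policy that still includes TLS 1.0 ciphers such as DES-CBC3-SHA (the listener ARN is a placeholder):
  # Switch the HTTPS listener to a policy that accepts TLS 1.0 clients
  aws elbv2 modify-listener \
      --listener-arn <listener-arn> \
      --ssl-policy ELBSecurityPolicy-TLS-1-0-2015-04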
LB Common Troubleshooting
We have to check the security groups
We have to check the health checks (maybe some application is down)
Sticky sessions may bring imbalance on the load balancing side
For Multi-AZ, we have to make sure cross zone load balancing is enabled
We should use internal load balancer for private applications that don’t need a public access
We should enable deletion protection for production load balancers
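As a sketch of the last point, deletion protection is a load balancer attribute (the ARN is a placeholder):
  # Protect a production load balancer from accidental deletion
  aws elbv2 modify-load-balancer-attributes \
      --load-balancer-arn <load-balancer-arn> \
      --attributes Key=deletion_protection.enabled,Value=true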
Load Balancing Monitoring
All LB metrics are directly pushed to CloudWatch metrics
Metrics:
BackendConnectionErrors
HealthyHostCount/UnhealthyHostCount
HTTPCode_Backend_2XX: successful requests
HTTPCode_Backend_3XX: redirects
HTTPCode_Backend_4XX: client errors
HTTPCode_Backend_5XX: server errors
Latency
RequestCount
SurgeQueueLength: the total number of requests or connections that are pending routing to a healthy instance. Can be used for scale out. Max value is 1024
SpilloverCount: the total number of requests that were rejected because the surge queue is full
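A sketch of pulling one of these metrics from CloudWatch for a Classic Load Balancer (the load balancer name and time window are placeholders); the Backend*/SurgeQueueLength metrics live in the AWS/ELB namespace:
  # Check the maximum surge queue length over the last hour, minute by minute
  aws cloudwatch get-metric-statistics \
      --namespace AWS/ELB \
      --metric-name SurgeQueueLength \
      --dimensions Name=LoadBalancerName,Value=my-clb \
      --statistics Maximum \
      --period 60 \
      --start-time 2024-01-01T00:00:00Z \
      --end-time 2024-01-01T01:00:00Z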
Load Balancers Access Logs
Access logs are disabled by default, we can enable them
Access logs from load balancers can be stored in S3 and can contain:
Time
Client IP address
Latencies
Request path
Server response
Trace ID
We pay only for the S3 storage in case of LB logs
LB logs are useful for compliance reasons
Helpful for keeping access data even after ELB and EC2 instances are terminated
Access Logs are automatically encrypted
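A minimal sketch of enabling ALB access logs to S3 with the AWS CLI (the ARN, bucket and prefix are placeholders; the bucket also needs a policy that lets the ELB service write to it):
  # Turn on access logging and point it at an S3 bucket
  aws elbv2 modify-load-balancer-attributes \
      --load-balancer-arn <load-balancer-arn> \
      --attributes Key=access_logs.s3.enabled,Value=true \
                   Key=access_logs.s3.bucket,Value=my-lb-logs-bucket \
                   Key=access_logs.s3.prefix,Value=prod-alb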
Application Load Balancer Request Tracing
Request tracing: each HTTP request has an added custom header X-Amzn-Trace-Id
This is useful in logs/distributed tracing platform to track a single request
Application Load Balancer is not (yet) integrated with AWS X-Ray
LB Troubleshooting Using Metrics
HTTP 400 - BAD REQUEST: the client sent a malformed request
HTTP 503 - Service Unavailable: ensure we have a healthy instance in every AZ that our LB is configured to respond in. Look for HealthyHostCount in CloudWatch
HTTP 504 - Gateway Timeout: check if keep-alive settings on the EC2 instances are enabled and make sure that the keep-alive timeout is greater than the idle timeout settings of the LB
Auto Scaling Groups
The goal of an Auto Scaling Group (ASG) is to:
Scale out (add EC2 instances) to match an increasing load
Scale in (remove EC2 instances) to match a decreasing load
Ensure we have a minimum and a maximum number of machines running
Automatically register new instances to LB
ASG attributes:
A launch configuration:
AMI + Instance Type
EC2 User Data
EBS Volumes
Security Groups
SSH Key Pair
Min size/max size/initial capacity
Network + Subnet information
Load balancer information
Scaling Policies
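A minimal sketch of creating an ASG with the attributes listed above, using the AWS CLI (the AMI ID, security group, key pair, subnets and target group ARN are placeholders):
  # Launch configuration: AMI + instance type + security group + key pair
  aws autoscaling create-launch-configuration \
      --launch-configuration-name my-lc \
      --image-id ami-0abcdef1234567890 \
      --instance-type t2.micro \
      --security-groups sg-0123456789abcdef0 \
      --key-name my-key-pair
  # ASG: min/max/desired capacity, subnets, and the target group to register instances with
  aws autoscaling create-auto-scaling-group \
      --auto-scaling-group-name my-asg \
      --launch-configuration-name my-lc \
      --min-size 2 --max-size 6 --desired-capacity 2 \
      --vpc-zone-identifier "subnet-aaa111,subnet-bbb222" \
      --target-group-arns <target-group-arn>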
ASG alarms:
It is possible to scale an ASG based on CloudWatch alarms
An alarm monitors a metric (such as Average CPU)
Metrics are computed for the overall ASG instances (overall average)
Based on the alarm we can create scale-out and scale-in policies
New ASG scaling rules:
It is now possible to define “better” ASG scaling rules that are directly managed by EC2:
Target average CPU usage
Number of requests on the ELB per instance
Average network in/out
These rules are easier to set up and make more sense than the previous rules
Auto scaling based on custom metric:
We can auto scale instances based on a custom metric (ex: number of connected users)
This works as follows:
Send a custom metric from the application on EC2 to CloudWatch using the PutMetricData API
Create a CloudWatch alarm to react to low/high values
Use the CloudWatch alarm as the scaling policy for ASG
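A sketch of the custom-metric flow above with the AWS CLI (the namespace, metric name, threshold and policy ARN are illustrative placeholders):
  # 1. The application publishes the custom metric to CloudWatch
  aws cloudwatch put-metric-data \
      --namespace MyApp \
      --metric-name ConnectedUsers \
      --value 42
  # 2. Alarm on high values, wired to a scale-out policy of the ASG
  aws cloudwatch put-metric-alarm \
      --alarm-name myapp-high-connected-users \
      --namespace MyApp \
      --metric-name ConnectedUsers \
      --statistic Average --period 60 --evaluation-periods 2 \
      --threshold 1000 --comparison-operator GreaterThanThreshold \
      --alarm-actions <scale-out-policy-arn>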
ASG - Types of Scaling
Scheduled scaling:
Scaling based on a schedule allows us to scale the application ahead of known load changes
Dynamic scaling:
ASG enables us to follow the demand curve for our application closely, reducing the need to manually provision instances
ASG can automatically adjust the number of EC2 instances as needed to maintain a target
Predictive scaling:
ASG uses machine learning to schedule the right number of EC2 instances in anticipation of traffic changes
ASG Scaling Policies
Target Tracking Scaling
The simplest policy type and the easiest to set up
Example: we want the average ASG CPU to stay around 40%
Simple/Step Scaling
Example:
When a CloudWatch alarm is triggered (example average CPU > 70%), then add 2 units
When a CloudWatch alarm is triggered (example average CPU < 30%), then remove 1 unit
Scheduled Actions
Can be used if we can anticipate scaling based on known usage patterns
Example: increase the min capacity to 10 at 5 PM on Fridays
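A sketch of a target tracking policy and a scheduled action matching the two examples above (the ASG, policy and action names are placeholders):
  # Target tracking: keep average ASG CPU around 40%
  aws autoscaling put-scaling-policy \
      --auto-scaling-group-name my-asg \
      --policy-name keep-cpu-at-40 \
      --policy-type TargetTrackingScaling \
      --target-tracking-configuration '{"PredefinedMetricSpecification":{"PredefinedMetricType":"ASGAverageCPUUtilization"},"TargetValue":40.0}'
  # Scheduled action: raise the minimum capacity to 10 at 5 PM every Friday (UTC by default)
  aws autoscaling put-scheduled-update-group-action \
      --auto-scaling-group-name my-asg \
      --scheduled-action-name friday-5pm-scale-up \
      --recurrence "0 17 * * 5" \
      --min-size 10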
Scaling Cool-downs
The cool-down period helps to ensure that our ASG doesn’t launch or terminate additional instances before the previous scaling activity takes effect
In addition to the default cool-down for the ASG, we can create cool-downs that apply to a specific simple scaling policy
A scaling-specific cool-down overrides the default cool-down period
A common use case for scaling-specific cool-downs is when a scale-in policy terminates instances based on a criterion or metric. Because this policy terminates instances, the ASG needs less time to determine whether to terminate additional instances
If the default cool-down period of 300 seconds is too long, we can reduce costs by applying a scaling-specific cool-down of 180 seconds for example
If our application is scaling up and down multiple times each hour, we can modify the ASG cool-down timers and the CloudWatch alarm period that triggers the scaling
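A sketch of the cool-down tuning described above (the ASG/policy names mirror the examples; the 300/180-second values are illustrative, not prescribed):
  # Default ASG cool-down (applies when a policy has no cool-down of its own)
  aws autoscaling update-auto-scaling-group \
      --auto-scaling-group-name my-asg \
      --default-cooldown 300
  # Simple scaling policy with a shorter, scaling-specific cool-down
  aws autoscaling put-scaling-policy \
      --auto-scaling-group-name my-asg \
      --policy-name scale-in-on-low-cpu \
      --adjustment-type ChangeInCapacity \
      --scaling-adjustment -1 \
      --cooldown 180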
ASG Scaling Termination Policies
Determine which AZ has the most instances
Determine which instance to terminate so as to align the remaining instances with the allocation strategy for On-Demand or Spot instances
Determine whether any of the instances uses the oldest launch template
Determine whether any of the instances uses the oldest launch configuration
If there are multiple unprotected instances to terminate, determine which is closest to the next billing hour
ASG Summary
Scaling policies can be based on CPU, network, or even a custom metric
We can scale instances based on a schedule
ASGs use launch configurations; we update an ASG by providing a new launch configuration
IAM roles attached to an ASG will get assigned to EC2 instances
ASG is a free service, we pay for the underlying resources created
Having instances under an ASG means that if they get terminated for whatever reason, the ASG will automatically launch new instances to replace them
ASG can terminate instances marked as unhealthy by an LB (and replace them)
Scaling Processes in ASG
Launch: ASG adds a new EC2 to the group, increasing the capacity
Terminate: ASG removes an EC2 instance from the group, decreasing its capacity
HealthCheck: ASG checks the health of an instance
ReplaceUnhealthy: ASG terminates unhealthy instances and recreates them
AZRebalance: ASG balances the number of instances across AZs
AlarmNotification: ASG accepts notifications from CloudWatch
ScheduledAction: ASG performs a scheduled action
AddToLoadBalancer: ASG adds instances to the load balancer or target group
We can suspend these processes!
AZRebalance:
Launch new instances then terminate old instances
If we suspend the Launch process:
AZRebalance won't launch instances
AZRebalance won't terminate instances
If we suspend the Terminate process:
The ASG can temporarily grow up to 10% above its maximum size (this is allowed during rebalances)
The ASG could remain at the increased capacity, as it cannot terminate instances
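A sketch of suspending and resuming ASG processes with the AWS CLI (the group name is a placeholder):
  # Stop the ASG from launching instances and from rebalancing across AZs
  aws autoscaling suspend-processes \
      --auto-scaling-group-name my-asg \
      --scaling-processes Launch AZRebalance
  # Re-enable them later
  aws autoscaling resume-processes \
      --auto-scaling-group-name my-asg \
      --scaling-processes Launch AZRebalance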
ASG for SysOps
To make sure we have high availability, we need at least 2 instances running across 2 AZs in the ASG (we must configure a multi-AZ ASG)
Health checks available:
EC2 Status Checks
ELB Health Checks
ASG will launch a new instance after terminating an unhealthy one
ASG will not reboot unhealthy hosts for us
CLI commands:
set-instance-health
terminate-instance-in-auto-scaling-group
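A sketch of using these two commands (the instance ID is a placeholder):
  # Mark an instance as unhealthy so the ASG replaces it
  aws autoscaling set-instance-health \
      --instance-id i-0abcd1234ef567890 \
      --health-status Unhealthy
  # Terminate an instance and shrink the desired capacity at the same time
  aws autoscaling terminate-instance-in-auto-scaling-group \
      --instance-id i-0abcd1234ef567890 \
      --should-decrement-desired-capacity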
Troubleshooting ASG issues
<number of instances> instance(s) are already running. Launching EC2 instance failed:
The ASG has reached the limit set by its DesiredCapacity parameter. We should update the ASG with a new value for the desired capacity
Launching EC2 instances is failing:
The security group does not exist. SG might have been deleted
The key pair does not exist. The key pair might have been deleted
If the ASG fails to launch an instance for over 24 hours, it will automatically suspend the process (administrative suspension)
CloudWatch Metrics for ASG
The following metrics are available for ASG:
GroupMinSize
GroupMaxSize
GroupDesiredCapacity
GroupInServiceInstances
GroupPendingInstances
GroupStandbyInstances
GroupTerminatingInstances
GroupTotalInstances
We should enable metric collection to see these metrics
Metrics for the ASG are collected every 1 minute
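A sketch of turning on group metric collection (the ASG name is a placeholder; "1Minute" is the only supported granularity):
  # Enable all group-level metrics at 1-minute granularity
  aws autoscaling enable-metrics-collection \
      --auto-scaling-group-name my-asg \
      --granularity "1Minute"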