Cloud Cost Optimization Strategies for AWS, Azure, and GCP
Cloud computing offers incredible flexibility and scalability, but without proper cost management, cloud bills can quickly spiral out of control. In this comprehensive guide, we will explore practical cost optimization strategies across the three major cloud providers: Amazon Web Services (AWS), Microsoft Azure, and Google Cloud Platform (GCP). We will cover cost monitoring, resource rightsizing, reserved instances, spot instances, auto-scaling, storage optimization, network costs, tagging strategies, and FinOps practices.
Understanding Cloud Cost Fundamentals
Cloud costs typically fall into several categories:
- Compute: Virtual machines, containers, serverless functions
- Storage: Object storage, block storage, databases
- Network: Data transfer, load balancers, CDN
- Managed Services: Databases, analytics, AI/ML services
- Support: Technical support plans
The pay-as-you-go model provides flexibility but requires active management to avoid waste.
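Putting rough numbers on those categories early helps set expectations. A minimal back-of-the-envelope sketch in Python, where every unit price is an illustrative placeholder rather than a current list price:
# Rough monthly estimate built from the cost categories above.
# All unit prices are illustrative placeholders, not current list prices.
HOURS_PER_MONTH = 730

estimate = {
    "compute": 10 * 0.10 * HOURS_PER_MONTH,   # 10 VMs at an assumed $0.10/hour
    "storage": 5_000 * 0.023,                 # 5 TB object storage at $0.023/GB-month
    "network": 2_000 * 0.09,                  # 2 TB egress at $0.09/GB
    "managed_services": 1_500.00,             # managed database, flat estimate
    "support": 100.00,                        # support plan
}

for category, cost in estimate.items():
    print(f"{category:>18}: ${cost:,.2f}")
print(f"{'total':>18}: ${sum(estimate.values()):,.2f}")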
Cost Monitoring and Visibility
The first step in cost optimization is understanding where money is being spent.
AWS Cost Monitoring
# Install AWS CLI
pip install awscli
# Configure credentials
aws configure
# Get cost and usage data
aws ce get-cost-and-usage \
--time-period Start=2024-01-01,End=2024-01-31 \
--granularity MONTHLY \
--metrics "UnblendedCost" \
--group-by Type=DIMENSION,Key=SERVICE
# Create cost budget
aws budgets create-budget \
--account-id 123456789012 \
--budget file://budget.json \
--notifications-with-subscribers file://notifications.json
Budget configuration file:
{
"BudgetName": "Monthly-Budget-2024",
"BudgetLimit": {
"Amount": "10000",
"Unit": "USD"
},
"TimeUnit": "MONTHLY",
"BudgetType": "COST",
"CostFilters": {
"TagKeyValue": [
"user:Environment$Production"
]
}
}
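The create-budget command above also references a notifications.json file that is not shown. A minimal sketch of the same budget created with boto3, including an alert at 80 percent of the limit; the account ID and email address are placeholders:
import boto3

budgets = boto3.client("budgets")

budgets.create_budget(
    AccountId="123456789012",  # placeholder account ID
    Budget={
        "BudgetName": "Monthly-Budget-2024",
        "BudgetLimit": {"Amount": "10000", "Unit": "USD"},
        "TimeUnit": "MONTHLY",
        "BudgetType": "COST",
    },
    NotificationsWithSubscribers=[
        {
            "Notification": {
                "NotificationType": "ACTUAL",
                "ComparisonOperator": "GREATER_THAN",
                "Threshold": 80,               # percent of the budget limit
                "ThresholdType": "PERCENTAGE",
            },
            "Subscribers": [
                {"SubscriptionType": "EMAIL", "Address": "team@example.com"}
            ],
        }
    ],
)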
Azure Cost Monitoring
# Install Azure CLI
curl -sL https://aka.ms/InstallAzureCLIDeb | sudo bash
# Login
az login
# Show cost analysis
az consumption usage list \
--start-date 2024-01-01 \
--end-date 2024-01-31 \
--query "[].{Name:instanceName,Cost:pretaxCost}" \
--output table
# Create budget
az consumption budget create \
--budget-name monthly-budget \
--amount 10000 \
--category Cost \
--time-grain Monthly \
--start-date 2024-01-01 \
--end-date 2024-12-31
Azure Cost Management query using Python:
from azure.identity import DefaultAzureCredential
from azure.mgmt.costmanagement import CostManagementClient
from azure.mgmt.costmanagement.models import QueryDefinition, QueryDataset, QueryTimePeriod
subscription_id = "<your-subscription-id>"
credential = DefaultAzureCredential()
client = CostManagementClient(credential)
# Define query
query = QueryDefinition(
type="Usage",
timeframe="MonthToDate",
dataset=QueryDataset(
granularity="Daily",
aggregation={
"totalCost": {
"name": "PreTaxCost",
"function": "Sum"
}
},
grouping=[
{
"type": "Dimension",
"name": "ResourceGroup"
}
]
)
)
# Execute query
scope = f"/subscriptions/{subscription_id}"
result = client.query.usage(scope, query)
for row in result.rows:
print(f"Date: {row[0]}, Resource Group: {row[1]}, Cost: ${row[2]:.2f}")
GCP Cost Monitoring
# Install gcloud CLI
curl https://sdk.cloud.google.com | bash
# Initialize
gcloud init
# List billing accounts (billing export to BigQuery is a one-time setup in the console)
gcloud billing accounts list
# Query costs using bq
bq query --use_legacy_sql=false '
SELECT
service.description as service,
SUM(cost) as total_cost
FROM `project-id.billing_dataset.gcp_billing_export_v1_XXXXX`
WHERE DATE(_PARTITIONTIME) BETWEEN "2024-01-01" AND "2024-01-31"
GROUP BY service
ORDER BY total_cost DESC
'
GCP Budget alert using Terraform:
resource "google_billing_budget" "monthly_budget" {
billing_account = var.billing_account
display_name = "Monthly Budget"
budget_filter {
projects = ["projects/${var.project_number}"]
}
amount {
specified_amount {
currency_code = "USD"
units = "10000"
}
}
threshold_rules {
threshold_percent = 0.5
}
threshold_rules {
threshold_percent = 0.9
}
threshold_rules {
threshold_percent = 1.0
}
all_updates_rule {
pubsub_topic = google_pubsub_topic.budget_alerts.id
}
}
resource "google_pubsub_topic" "budget_alerts" {
name = "budget-alerts"
}
Resource Rightsizing
Rightsizing ensures you are using the appropriate instance types and sizes for your workloads.
AWS Rightsizing
import boto3
from datetime import datetime, timedelta
cloudwatch = boto3.client('cloudwatch')
ec2 = boto3.client('ec2')
def analyze_ec2_utilization(instance_id, days=30):
"""Analyze EC2 instance CPU and memory utilization"""
end_time = datetime.utcnow()
start_time = end_time - timedelta(days=days)
# Get CPU utilization
cpu_metrics = cloudwatch.get_metric_statistics(
Namespace='AWS/EC2',
MetricName='CPUUtilization',
Dimensions=[{'Name': 'InstanceId', 'Value': instance_id}],
StartTime=start_time,
EndTime=end_time,
Period=3600,
Statistics=['Average', 'Maximum']
)
    # Calculate averages (guard against instances with no datapoints yet)
    datapoints = cpu_metrics['Datapoints']
    if not datapoints:
        print(f"Instance {instance_id}: no CPU data available")
        return None, None
    avg_cpu = sum(m['Average'] for m in datapoints) / len(datapoints)
    max_cpu = max(m['Maximum'] for m in datapoints)
print(f"Instance {instance_id}:")
print(f" Average CPU: {avg_cpu:.2f}%")
print(f" Maximum CPU: {max_cpu:.2f}%")
# Recommendations
if avg_cpu < 20 and max_cpu < 40:
print(" Recommendation: Consider downsizing or using Burstable instances (t3/t4g)")
elif avg_cpu > 70:
print(" Recommendation: Consider upgrading instance type")
return avg_cpu, max_cpu
# Get all running instances
instances = ec2.describe_instances(
Filters=[{'Name': 'instance-state-name', 'Values': ['running']}]
)
for reservation in instances['Reservations']:
for instance in reservation['Instances']:
analyze_ec2_utilization(instance['InstanceId'])
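CloudWatch alone does not report memory utilization unless the CloudWatch agent is installed, so the script above only sees part of the picture. AWS Compute Optimizer can complement it; a minimal sketch, assuming the account has already opted in to Compute Optimizer:
import boto3

# Assumes AWS Compute Optimizer has been enabled for the account.
optimizer = boto3.client("compute-optimizer")

response = optimizer.get_ec2_instance_recommendations()
for rec in response["instanceRecommendations"]:
    print(f"Instance: {rec['instanceArn']}")
    print(f"  Finding: {rec['finding']}")  # e.g. OVER_PROVISIONED, UNDER_PROVISIONED, OPTIMIZED
    for option in rec.get("recommendationOptions", [])[:1]:
        print(f"  Suggested type: {option['instanceType']}")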
Azure Rightsizing with Azure Advisor
# Get Azure Advisor recommendations
az advisor recommendation list \
--category Cost \
--query "[].{Category:category,Impact:impact,Resource:impactedValue,Recommendation:shortDescription.solution}" \
--output table
Python script to automate rightsizing:
from azure.identity import DefaultAzureCredential
from azure.mgmt.advisor import AdvisorManagementClient
from azure.mgmt.compute import ComputeManagementClient
subscription_id = "<your-subscription-id>"
credential = DefaultAzureCredential()
advisor_client = AdvisorManagementClient(credential, subscription_id)
compute_client = ComputeManagementClient(credential, subscription_id)
# Get cost recommendations
recommendations = advisor_client.recommendations.list(
filter="Category eq 'Cost'"
)
for rec in recommendations:
if rec.impacted_field == "Microsoft.Compute/virtualMachines":
print(f"VM: {rec.impacted_value}")
print(f"Recommendation: {rec.short_description.solution}")
print(f"Potential Savings: ${rec.extended_properties.get('savingsAmount', 'N/A')}")
print(f"Current SKU: {rec.extended_properties.get('currentSku')}")
print(f"Recommended SKU: {rec.extended_properties.get('targetSku')}")
print("---")
GCP Rightsizing Recommendations
from google.cloud import recommender_v1
def get_rightsizing_recommendations(project_id):
"""Get VM rightsizing recommendations from GCP"""
client = recommender_v1.RecommenderClient()
    # The machine type recommender is zonal, so query one zone at a time
    parent = f"projects/{project_id}/locations/us-central1-a/recommenders/google.compute.instance.MachineTypeRecommender"
recommendations = client.list_recommendations(parent=parent)
for recommendation in recommendations:
print(f"Recommendation: {recommendation.name}")
print(f"Description: {recommendation.description}")
print(f"Priority: {recommendation.priority}")
# Parse recommendation content
for operation in recommendation.content.operation_groups:
for op in operation.operations:
print(f"Action: {op.action}")
print(f"Resource: {op.resource}")
print(f"Current machine type: {op.value_matcher}")
# Cost impact
if recommendation.primary_impact:
impact = recommendation.primary_impact
if impact.category == recommender_v1.Impact.Category.COST:
print(f"Estimated monthly savings: ${abs(impact.cost_projection.cost.units)}")
print("---")
# Usage
get_rightsizing_recommendations("your-project-id")
Reserved Instances and Savings Plans
Reserved instances, savings plans, and committed use discounts offer significant savings over on-demand pricing in exchange for a one- or three-year usage commitment.
AWS Reserved Instances and Savings Plans
import boto3
ce = boto3.client('ce')
def get_ri_recommendations():
"""Get Reserved Instance recommendations"""
    response = ce.get_reservation_purchase_recommendation(
        Service='Amazon Elastic Compute Cloud - Compute',
        ServiceSpecification={'EC2Specification': {'OfferingClass': 'STANDARD'}},
PaymentOption='PARTIAL_UPFRONT',
TermInYears='ONE_YEAR',
LookbackPeriodInDays='SIXTY_DAYS'
)
for recommendation in response['Recommendations']:
details = recommendation['RecommendationDetail']
print(f"Instance Type: {details['InstanceDetails']['EC2InstanceDetails']['InstanceType']}")
print(f"Recommended Instances: {details['RecommendedNumberOfInstancesToPurchase']}")
print(f"Estimated Monthly Savings: ${details['EstimatedMonthlySavingsAmount']}")
print(f"Upfront Cost: ${details['UpfrontCost']}")
print("---")
def get_savings_plans_recommendations():
"""Get Savings Plans recommendations"""
response = ce.get_savings_plans_purchase_recommendation(
SavingsPlansType='COMPUTE_SP',
TermInYears='ONE_YEAR',
PaymentOption='PARTIAL_UPFRONT',
LookbackPeriodInDays='SIXTY_DAYS'
)
for rec in response['SavingsPlansPurchaseRecommendation']['SavingsPlansPurchaseRecommendationDetails']:
print(f"Hourly Commitment: ${rec['HourlyCommitmentToPurchase']}")
print(f"Estimated Monthly Savings: ${rec['EstimatedMonthlySavingsAmount']}")
print(f"Estimated ROI: {rec['EstimatedROI']}%")
print("---")
The Terraform AWS provider does not expose a resource for purchasing Reserved Instances, so purchases are typically made through the console or the CLI:
# Find a matching one-year Reserved Instance offering
aws ec2 describe-reserved-instances-offerings \
    --instance-type t3.large \
    --offering-class standard \
    --offering-type "Partial Upfront" \
    --product-description "Linux/UNIX" \
    --max-duration 31536000
# Purchase the selected offering
aws ec2 purchase-reserved-instances-offering \
    --reserved-instances-offering-id <offering-id> \
    --instance-count 10
Azure Reserved VM Instances
# List available reservations
az reservations catalog show \
--subscription-id $SUBSCRIPTION_ID \
--reserved-resource-type VirtualMachines \
--location eastus
# Purchase reservation
az reservations reservation-order purchase \
--reservation-order-id /providers/Microsoft.Capacity/reservationOrders/XXXXX \
--sku Standard_D2s_v3 \
--location eastus \
--quantity 10 \
--term P1Y \
--billing-plan Monthly
GCP Committed Use Discounts
# Create committed use discount
gcloud compute commitments create my-commitment \
--region us-central1 \
--resources vcpu=100,memory=400GB \
--plan 12-month
# List active commitments
gcloud compute commitments list
Terraform configuration:
resource "google_compute_commitment" "commitment" {
name = "production-commitment"
region = "us-central1"
plan = "TWELVE_MONTH"
type = "GENERAL_PURPOSE"
resources {
type = "VCPU"
amount = "100"
}
resources {
type = "MEMORY"
amount = "400"
}
}
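Whichever provider you commit with, verify that the commitment is actually being consumed, because unused reservations are pure waste. A minimal AWS sketch using Cost Explorer; the date range is illustrative:
import boto3

ce = boto3.client("ce")

# Check how much of the purchased Reserved Instance capacity was actually used.
response = ce.get_reservation_utilization(
    TimePeriod={"Start": "2024-01-01", "End": "2024-01-31"},
    Granularity="MONTHLY",
)
total = response["Total"]
print(f"RI utilization: {total['UtilizationPercentage']}%")
print(f"Unused hours: {total['UnusedHours']}")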
Spot Instances and Preemptible VMs
Spot capacity can cost up to 90 percent less than on-demand, but the provider can reclaim it at short notice, so reserve it for fault-tolerant, flexible workloads such as batch jobs, CI builds, and stateless services.
AWS Spot Instances
# AWS Spot Fleet request
apiVersion: v1
kind: ConfigMap
metadata:
name: spot-config
data:
spot-request.json: |
{
"IamFleetRole": "arn:aws:iam::123456789012:role/aws-ec2-spot-fleet-role",
"AllocationStrategy": "lowestPrice",
"TargetCapacity": 10,
"SpotPrice": "0.05",
"ValidFrom": "2024-01-01T00:00:00Z",
"ValidUntil": "2024-12-31T23:59:59Z",
"LaunchSpecifications": [
{
"ImageId": "ami-12345678",
"InstanceType": "t3.medium",
"KeyName": "my-key",
"SubnetId": "subnet-12345",
"SpotPrice": "0.05"
},
{
"ImageId": "ami-12345678",
"InstanceType": "t3.large",
"KeyName": "my-key",
"SubnetId": "subnet-12345",
"SpotPrice": "0.08"
}
]
}
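Whatever provisioning mechanism you use, workloads on Spot must tolerate interruption; AWS posts a two-minute warning to the instance metadata service. A minimal polling sketch to run on the instance, assuming IMDSv1 is enabled (IMDSv2 requires a session token first), with the drain step left as a placeholder for your own shutdown logic:
import time
import urllib.error
import urllib.request

# EC2 instance metadata endpoint; returns 404 until an interruption is scheduled.
# Assumes IMDSv1 is enabled; IMDSv2 requires fetching a session token first.
SPOT_ACTION_URL = "http://169.254.169.254/latest/meta-data/spot/instance-action"

def drain_workload():
    """Placeholder: deregister from the load balancer, checkpoint work, etc."""
    print("Interruption notice received, draining workload...")

while True:
    try:
        with urllib.request.urlopen(SPOT_ACTION_URL, timeout=2) as resp:
            if resp.status == 200:
                drain_workload()
                break
    except urllib.error.HTTPError:
        pass  # 404: no interruption scheduled yet
    except urllib.error.URLError:
        pass  # metadata service unreachable (e.g. not running on EC2)
    time.sleep(5)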
Using Spot Instances with Kubernetes Karpenter:
apiVersion: karpenter.sh/v1alpha5
kind: Provisioner
metadata:
name: spot-provisioner
spec:
requirements:
- key: karpenter.sh/capacity-type
operator: In
values: ["spot"]
- key: node.kubernetes.io/instance-type
operator: In
values: ["t3.medium", "t3.large", "t3.xlarge"]
limits:
resources:
cpu: 1000
memory: 1000Gi
provider:
instanceProfile: KarpenterNodeInstanceProfile
subnetSelector:
karpenter.sh/discovery: my-cluster
securityGroupSelector:
karpenter.sh/discovery: my-cluster
tags:
Name: karpenter-spot-node
ttlSecondsAfterEmpty: 30
ttlSecondsUntilExpired: 604800
Azure Spot VMs
# Create Spot VM
az vm create \
--resource-group myResourceGroup \
--name mySpotVM \
--image UbuntuLTS \
--priority Spot \
--max-price 0.05 \
--eviction-policy Deallocate \
--size Standard_D2s_v3
Azure Spot with VMSS:
resource "azurerm_linux_virtual_machine_scale_set" "spot_vmss" {
name = "spot-vmss"
resource_group_name = azurerm_resource_group.main.name
location = azurerm_resource_group.main.location
sku = "Standard_D2s_v3"
instances = 5
priority = "Spot"
eviction_policy = "Deallocate"
max_bid_price = 0.05
admin_username = "azureuser"
admin_ssh_key {
username = "azureuser"
public_key = file("~/.ssh/id_rsa.pub")
}
source_image_reference {
publisher = "Canonical"
offer = "UbuntuServer"
sku = "18.04-LTS"
version = "latest"
}
os_disk {
caching = "ReadWrite"
storage_account_type = "Standard_LRS"
}
network_interface {
name = "spot-nic"
primary = true
ip_configuration {
name = "internal"
primary = true
subnet_id = azurerm_subnet.main.id
}
}
}
GCP Preemptible VMs
# Create preemptible VM
gcloud compute instances create preemptible-vm \
--zone us-central1-a \
--machine-type n1-standard-4 \
--preemptible \
--maintenance-policy TERMINATE
# Create instance template with preemptible
gcloud compute instance-templates create preemptible-template \
--machine-type n1-standard-4 \
--preemptible \
--boot-disk-size 100GB \
--image-family ubuntu-2004-lts \
--image-project ubuntu-os-cloud
GKE with Spot VMs:
apiVersion: container.cnrm.cloud.google.com/v1beta1
kind: ContainerNodePool
metadata:
name: spot-pool
spec:
clusterRef:
name: my-cluster
location: us-central1
initialNodeCount: 3
autoscaling:
minNodeCount: 1
maxNodeCount: 10
nodeConfig:
machineType: n1-standard-4
preemptible: true
diskSizeGb: 100
oauthScopes:
- "https://www.googleapis.com/auth/cloud-platform"
labels:
workload-type: batch
taints:
- key: preemptible
value: "true"
effect: NoSchedule
Auto-Scaling
Auto-scaling adjusts capacity to match demand, so you pay for extra instances only while they are actually needed.
AWS Auto Scaling
# AWS Auto Scaling Group with CloudFormation
AWSTemplateFormatVersion: '2010-09-09'
Resources:
LaunchTemplate:
Type: AWS::EC2::LaunchTemplate
Properties:
LaunchTemplateName: WebServerTemplate
LaunchTemplateData:
ImageId: ami-12345678
InstanceType: t3.medium
SecurityGroupIds:
- sg-12345678
UserData:
Fn::Base64: !Sub |
#!/bin/bash
yum update -y
yum install -y httpd
systemctl start httpd
AutoScalingGroup:
Type: AWS::AutoScaling::AutoScalingGroup
Properties:
LaunchTemplate:
LaunchTemplateId: !Ref LaunchTemplate
Version: !GetAtt LaunchTemplate.LatestVersionNumber
MinSize: 2
MaxSize: 10
DesiredCapacity: 2
TargetGroupARNs:
- !Ref TargetGroup
VPCZoneIdentifier:
- subnet-12345
- subnet-67890
ScaleUpPolicy:
Type: AWS::AutoScaling::ScalingPolicy
Properties:
AdjustmentType: ChangeInCapacity
AutoScalingGroupName: !Ref AutoScalingGroup
Cooldown: 60
ScalingAdjustment: 2
CPUAlarmHigh:
Type: AWS::CloudWatch::Alarm
Properties:
AlarmDescription: Scale up when CPU exceeds 70%
MetricName: CPUUtilization
Namespace: AWS/EC2
Statistic: Average
Period: 300
EvaluationPeriods: 2
Threshold: 70
AlarmActions:
- !Ref ScaleUpPolicy
Dimensions:
- Name: AutoScalingGroupName
Value: !Ref AutoScalingGroup
ComparisonOperator: GreaterThanThreshold
Azure Auto-scaling
resource "azurerm_monitor_autoscale_setting" "vmss_autoscale" {
name = "vmss-autoscale"
resource_group_name = azurerm_resource_group.main.name
location = azurerm_resource_group.main.location
target_resource_id = azurerm_linux_virtual_machine_scale_set.main.id
profile {
name = "default"
capacity {
default = 2
minimum = 2
maximum = 10
}
rule {
metric_trigger {
metric_name = "Percentage CPU"
metric_resource_id = azurerm_linux_virtual_machine_scale_set.main.id
time_grain = "PT1M"
statistic = "Average"
time_window = "PT5M"
time_aggregation = "Average"
operator = "GreaterThan"
threshold = 70
}
scale_action {
direction = "Increase"
type = "ChangeCount"
value = "2"
cooldown = "PT5M"
}
}
rule {
metric_trigger {
metric_name = "Percentage CPU"
metric_resource_id = azurerm_linux_virtual_machine_scale_set.main.id
time_grain = "PT1M"
statistic = "Average"
time_window = "PT5M"
time_aggregation = "Average"
operator = "LessThan"
threshold = 30
}
scale_action {
direction = "Decrease"
type = "ChangeCount"
value = "1"
cooldown = "PT5M"
}
}
}
}
GCP Auto-scaling
# Create managed instance group with autoscaling
gcloud compute instance-groups managed create web-group \
--base-instance-name web \
--template web-template \
--size 2 \
--zone us-central1-a
gcloud compute instance-groups managed set-autoscaling web-group \
--max-num-replicas 10 \
--min-num-replicas 2 \
--target-cpu-utilization 0.70 \
--cool-down-period 60 \
--zone us-central1-a
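Demand-based scaling can be combined with time-based scaling, since non-production environments rarely need capacity outside working hours. A minimal AWS sketch using scheduled Auto Scaling actions; the group name and schedule are assumptions, and cron expressions are evaluated in UTC:
import boto3

autoscaling = boto3.client("autoscaling")

# Scale an assumed non-production Auto Scaling group to zero at night
# and back up in the morning on weekdays.
autoscaling.put_scheduled_update_group_action(
    AutoScalingGroupName="dev-asg",            # assumed ASG name
    ScheduledActionName="scale-down-evening",
    Recurrence="0 20 * * 1-5",
    MinSize=0,
    MaxSize=0,
    DesiredCapacity=0,
)
autoscaling.put_scheduled_update_group_action(
    AutoScalingGroupName="dev-asg",
    ScheduledActionName="scale-up-morning",
    Recurrence="0 7 * * 1-5",
    MinSize=2,
    MaxSize=10,
    DesiredCapacity=2,
)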
Storage Optimization
Optimize storage costs by choosing the right storage class and implementing lifecycle policies.
AWS S3 Storage Classes and Lifecycle
import boto3
s3 = boto3.client('s3')
lifecycle_policy = {
'Rules': [
{
'Id': 'TransitionToIA',
'Filter': {'Prefix': 'logs/'},
'Status': 'Enabled',
'Transitions': [
{
'Days': 30,
'StorageClass': 'STANDARD_IA'
},
{
'Days': 90,
'StorageClass': 'GLACIER'
},
{
'Days': 365,
'StorageClass': 'DEEP_ARCHIVE'
}
],
'Expiration': {
'Days': 730
}
},
{
'Id': 'DeleteOldVersions',
'Filter': {},
'Status': 'Enabled',
'NoncurrentVersionTransitions': [
{
'NoncurrentDays': 30,
'StorageClass': 'STANDARD_IA'
}
],
'NoncurrentVersionExpiration': {
'NoncurrentDays': 90
}
}
]
}
s3.put_bucket_lifecycle_configuration(
Bucket='my-bucket',
LifecycleConfiguration=lifecycle_policy
)
Azure Blob Storage Tiers and Lifecycle
resource "azurerm_storage_account" "main" {
name = "mystorageaccount"
resource_group_name = azurerm_resource_group.main.name
location = azurerm_resource_group.main.location
account_tier = "Standard"
account_replication_type = "LRS"
access_tier = "Hot"
}
resource "azurerm_storage_management_policy" "lifecycle" {
storage_account_id = azurerm_storage_account.main.id
rule {
name = "rule1"
enabled = true
filters {
prefix_match = ["logs/"]
blob_types = ["blockBlob"]
}
actions {
base_blob {
tier_to_cool_after_days_since_modification_greater_than = 30
tier_to_archive_after_days_since_modification_greater_than = 90
delete_after_days_since_modification_greater_than = 730
}
snapshot {
delete_after_days_since_creation_greater_than = 90
}
}
}
}
GCP Cloud Storage Classes
from google.cloud import storage

def set_bucket_lifecycle(bucket_name):
    """Set lifecycle rules for a GCS bucket."""
    storage_client = storage.Client()
    # Load current metadata so existing lifecycle rules are not silently dropped
    bucket = storage_client.get_bucket(bucket_name)

    # Transition objects to progressively colder storage classes as they age
    bucket.add_lifecycle_set_storage_class_rule("NEARLINE", age=30)
    bucket.add_lifecycle_set_storage_class_rule("COLDLINE", age=90)
    bucket.add_lifecycle_set_storage_class_rule("ARCHIVE", age=365)

    # Delete objects after two years
    bucket.add_lifecycle_delete_rule(age=730)

    bucket.patch()
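Lifecycle policies cover object storage, but orphaned block storage is just as common: volumes left behind after instances are terminated keep accruing charges. A minimal AWS sketch that lists unattached EBS volumes for review; it only reports, it does not delete:
import boto3

ec2 = boto3.client("ec2")

# "available" means the volume is not attached to any instance.
volumes = ec2.describe_volumes(
    Filters=[{"Name": "status", "Values": ["available"]}]
)
for volume in volumes["Volumes"]:
    print(f"Unattached volume {volume['VolumeId']}: "
          f"{volume['Size']} GiB, created {volume['CreateTime']:%Y-%m-%d}")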
Network Cost Optimization
Network costs can be significant, especially egress and cross-region or cross-zone data transfer.
Strategies to Reduce Network Costs
- Use CDN for static content delivery
- Route traffic between VPCs over peering or private endpoints instead of the public internet
- Minimize cross-region data transfer
- Use private endpoints for cloud services
- Compress data before transfer (see the sketch after this list)
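The last item deserves a concrete illustration: compressible payloads such as logs or JSON often shrink by 80 to 90 percent, which directly reduces egress and storage charges. A minimal sketch that gzips a file before uploading it to S3; the file, bucket, and key names are placeholders:
import gzip
import shutil
import boto3

def upload_compressed(local_path, bucket, key):
    """Gzip a file locally, then upload the compressed copy to S3."""
    compressed_path = f"{local_path}.gz"
    with open(local_path, "rb") as src, gzip.open(compressed_path, "wb") as dst:
        shutil.copyfileobj(src, dst)

    s3 = boto3.client("s3")
    s3.upload_file(
        compressed_path,
        bucket,
        key,
        ExtraArgs={"ContentEncoding": "gzip"},  # so clients know to decompress
    )

# Placeholder names for illustration
upload_compressed("app.log", "my-log-bucket", "logs/app.log.gz")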
AWS Network Cost Optimization
# Use VPC Endpoints to avoid NAT Gateway costs
resource "aws_vpc_endpoint" "s3" {
vpc_id = aws_vpc.main.id
service_name = "com.amazonaws.us-east-1.s3"
route_table_ids = [aws_route_table.private.id]
tags = {
Name = "s3-endpoint"
}
}
resource "aws_vpc_endpoint" "dynamodb" {
vpc_id = aws_vpc.main.id
service_name = "com.amazonaws.us-east-1.dynamodb"
route_table_ids = [aws_route_table.private.id]
}
# Use CloudFront for content delivery
resource "aws_cloudfront_distribution" "cdn" {
origin {
domain_name = aws_s3_bucket.static.bucket_regional_domain_name
origin_id = "S3-static"
s3_origin_config {
origin_access_identity = aws_cloudfront_origin_access_identity.oai.cloudfront_access_identity_path
}
}
enabled = true
default_root_object = "index.html"
default_cache_behavior {
allowed_methods = ["GET", "HEAD"]
cached_methods = ["GET", "HEAD"]
target_origin_id = "S3-static"
forwarded_values {
query_string = false
cookies {
forward = "none"
}
}
viewer_protocol_policy = "redirect-to-https"
min_ttl = 0
default_ttl = 3600
max_ttl = 86400
}
price_class = "PriceClass_100" # Use only US, Canada, Europe
restrictions {
geo_restriction {
restriction_type = "none"
}
}
viewer_certificate {
cloudfront_default_certificate = true
}
}
Tagging Strategies
Proper tagging enables cost allocation and tracking.
Comprehensive Tagging Strategy
# AWS resource tagging
import boto3
def apply_tags(resource_arn, tags):
"""Apply standardized tags to AWS resources"""
client = boto3.client('resourcegroupstaggingapi')
client.tag_resources(
ResourceARNList=[resource_arn],
Tags=tags
)
# Standard tag schema
standard_tags = {
'Environment': 'production',
'Project': 'web-application',
'Owner': 'platform-team',
'CostCenter': 'engineering',
'Application': 'api-backend',
'ManagedBy': 'terraform',
'Backup': 'daily',
'Compliance': 'pci-dss'
}
Terraform module for consistent tagging:
# modules/tags/variables.tf
variable "environment" {
type = string
}
variable "project" {
type = string
}
variable "additional_tags" {
type = map(string)
default = {}
}
# modules/tags/outputs.tf
output "tags" {
value = merge(
{
Environment = var.environment
Project = var.project
ManagedBy = "Terraform"
CreatedDate = timestamp()
},
var.additional_tags
)
}
# Usage in main.tf
module "common_tags" {
source = "./modules/tags"
environment = "production"
project = "web-app"
additional_tags = {
Owner = "platform-team"
CostCenter = "engineering"
}
}
resource "aws_instance" "web" {
ami = "ami-12345678"
instance_type = "t3.medium"
tags = module.common_tags.tags
}
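Tag schemas only pay off if they are enforced, so it helps to audit for resources that are missing mandatory tags. A minimal AWS sketch using the Resource Groups Tagging API; the required tag key is an assumption based on the schema above:
import boto3

REQUIRED_TAG = "CostCenter"  # assumed mandatory tag key

tagging = boto3.client("resourcegroupstaggingapi")
paginator = tagging.get_paginator("get_resources")

untagged = []
for page in paginator.paginate():
    for resource in page["ResourceTagMappingList"]:
        tag_keys = {tag["Key"] for tag in resource.get("Tags", [])}
        if REQUIRED_TAG not in tag_keys:
            untagged.append(resource["ResourceARN"])

print(f"{len(untagged)} resources missing the {REQUIRED_TAG} tag")
for arn in untagged[:20]:
    print(f"  {arn}")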
FinOps Practices
FinOps is a cultural practice that brings financial accountability to cloud spending.
Key FinOps Principles
- Teams need to collaborate
- Everyone takes ownership of cloud usage
- A centralized team drives FinOps
- Reports should be accessible and timely (see the showback sketch after this list)
- Decisions are driven by business value
- Take advantage of variable cost model
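One way to make reports accessible and timely is an automated showback that groups spend by the CostCenter tag introduced earlier. A minimal AWS sketch, assuming the tag has been activated as a cost allocation tag in the billing console and using an illustrative date range:
import boto3

ce = boto3.client("ce")

# Group last month's spend by the CostCenter cost allocation tag.
response = ce.get_cost_and_usage(
    TimePeriod={"Start": "2024-01-01", "End": "2024-01-31"},
    Granularity="MONTHLY",
    Metrics=["UnblendedCost"],
    GroupBy=[{"Type": "TAG", "Key": "CostCenter"}],
)
for group in response["ResultsByTime"][0]["Groups"]:
    cost_center = group["Keys"][0]            # e.g. "CostCenter$engineering"
    amount = float(group["Metrics"]["UnblendedCost"]["Amount"])
    print(f"{cost_center}: ${amount:,.2f}")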
Implementing FinOps with Infrastructure as Code
# Cost-aware infrastructure deployment
import boto3
import json
class CostAwareDeployer:
def __init__(self):
self.ce = boto3.client('ce')
self.ec2 = boto3.client('ec2')
def get_current_month_cost(self):
"""Get current month's cost"""
response = self.ce.get_cost_and_usage(
TimePeriod={
'Start': '2024-01-01',
'End': '2024-01-31'
},
Granularity='MONTHLY',
Metrics=['UnblendedCost']
)
return float(response['ResultsByTime'][0]['Total']['UnblendedCost']['Amount'])
def check_budget(self, proposed_cost):
"""Check if proposed deployment fits within budget"""
current_cost = self.get_current_month_cost()
budget_limit = 10000 # $10,000
if current_cost + proposed_cost > budget_limit:
return False, f"Deployment would exceed budget: ${current_cost + proposed_cost} > ${budget_limit}"
return True, "Within budget"
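    def get_instance_pricing(self, instance_type):
        """Return an assumed hourly on-demand rate for the instance type.

        These rates are illustrative placeholders, not current AWS prices;
        in practice this would call the AWS Pricing API or an internal rate card.
        """
        assumed_hourly_rates = {
            "t3.medium": 0.0416,
            "t3.large": 0.0832,
            "m5.large": 0.096,
        }
        return assumed_hourly_rates.get(instance_type, 0.10)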
def deploy_with_cost_check(self, instance_type, count):
"""Deploy instances only if within budget"""
# Calculate estimated cost
pricing = self.get_instance_pricing(instance_type)
monthly_cost = pricing * count * 730 # hours per month
can_deploy, message = self.check_budget(monthly_cost)
if can_deploy:
print(f"Deploying {count} {instance_type} instances")
print(f"Estimated monthly cost: ${monthly_cost:.2f}")
# Actual deployment code here
else:
print(f"Deployment blocked: {message}")
# Send notification to team
Cost Optimization Automation
# Automated cost optimization script
import boto3
from datetime import datetime, timedelta
class CostOptimizer:
def __init__(self):
self.ec2 = boto3.client('ec2')
self.rds = boto3.client('rds')
def stop_unused_instances(self):
"""Stop EC2 instances with low CPU utilization"""
cloudwatch = boto3.client('cloudwatch')
instances = self.ec2.describe_instances(
Filters=[{'Name': 'instance-state-name', 'Values': ['running']}]
)
for reservation in instances['Reservations']:
for instance in reservation['Instances']:
instance_id = instance['InstanceId']
# Check CPU utilization
metrics = cloudwatch.get_metric_statistics(
Namespace='AWS/EC2',
MetricName='CPUUtilization',
Dimensions=[{'Name': 'InstanceId', 'Value': instance_id}],
StartTime=datetime.utcnow() - timedelta(days=7),
EndTime=datetime.utcnow(),
Period=86400,
Statistics=['Average']
)
if metrics['Datapoints']:
avg_cpu = sum(m['Average'] for m in metrics['Datapoints']) / len(metrics['Datapoints'])
if avg_cpu < 5:
print(f"Stopping unused instance {instance_id} (avg CPU: {avg_cpu:.2f}%)")
self.ec2.stop_instances(InstanceIds=[instance_id])
def delete_old_snapshots(self, days=90):
"""Delete snapshots older than specified days"""
snapshots = self.ec2.describe_snapshots(OwnerIds=['self'])
cutoff_date = datetime.utcnow() - timedelta(days=days)
for snapshot in snapshots['Snapshots']:
snapshot_date = snapshot['StartTime'].replace(tzinfo=None)
if snapshot_date < cutoff_date:
print(f"Deleting old snapshot {snapshot['SnapshotId']}")
self.ec2.delete_snapshot(SnapshotId=snapshot['SnapshotId'])
def schedule_non_prod_instances(self):
"""Schedule non-production instances to stop at night"""
# Implementation for scheduling
pass
# Run optimization
optimizer = CostOptimizer()
optimizer.stop_unused_instances()
optimizer.delete_old_snapshots()
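The schedule_non_prod_instances stub above is left to the reader; one common approach is to stop instances tagged as non-production outside business hours, triggered by a cron job or an EventBridge schedule. A minimal sketch, where the tag key and values are assumptions:
import boto3

def stop_non_prod_instances():
    """Stop running instances tagged as non-production (assumed tag values)."""
    ec2 = boto3.client("ec2")
    instances = ec2.describe_instances(
        Filters=[
            {"Name": "tag:Environment", "Values": ["dev", "staging"]},
            {"Name": "instance-state-name", "Values": ["running"]},
        ]
    )
    instance_ids = [
        instance["InstanceId"]
        for reservation in instances["Reservations"]
        for instance in reservation["Instances"]
    ]
    if instance_ids:
        print(f"Stopping {len(instance_ids)} non-production instances")
        ec2.stop_instances(InstanceIds=instance_ids)

# Invoke this on a schedule, e.g. from a nightly cron job or an EventBridge rule.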
Conclusion
Cloud cost optimization is an ongoing process that requires continuous monitoring, analysis, and adjustment. By combining cost monitoring, resource rightsizing, commitment-based pricing, spot capacity, auto-scaling, storage and network optimization, consistent tagging, and FinOps practices, organizations can significantly reduce cloud spending while maintaining or improving performance.
Key takeaways:
- Implement comprehensive cost monitoring and alerting
- Regularly review and rightsize resources
- Use commitment-based pricing for predictable workloads
- Leverage spot instances for fault-tolerant workloads
- Implement auto-scaling to match demand
- Optimize storage with lifecycle policies
- Reduce network costs with CDN and VPC endpoints
- Use consistent tagging for cost allocation
- Adopt FinOps culture across the organization
- Automate cost optimization where possible
References
- AWS Cost Management: https://aws.amazon.com/aws-cost-management/
- AWS Well-Architected Framework - Cost Optimization Pillar: https://docs.aws.amazon.com/wellarchitected/latest/cost-optimization-pillar/welcome.html
- Azure Cost Management: https://azure.microsoft.com/en-us/products/cost-management/
- Azure Architecture Center - Cost Optimization: https://learn.microsoft.com/en-us/azure/architecture/framework/cost/
- GCP Cost Management: https://cloud.google.com/cost-management
- GCP Best Practices for Cost Optimization: https://cloud.google.com/architecture/best-practices-for-optimizing-your-cloud-costs
- FinOps Foundation: https://www.finops.org/
- Cloud FinOps Book: https://www.oreilly.com/library/view/cloud-finops/9781492054610/
- AWS Spot Instance Best Practices: https://aws.amazon.com/ec2/spot/getting-started/
- Kubernetes Resource Management: https://kubernetes.io/docs/concepts/configuration/manage-resources-containers/