Spot Optimizer ≠ Spot Fleet: decide smart, then launch
Picking EC2 Spot instances shouldn’t feel like rolling dice. Most teams either guess and get bitten by interruptions, or overbuild complicated logic to dodge them.
Here’s the clean split that keeps you sane:
- Spot Optimizer is a Python library/CLI that figures out the right instance types and counts for your workload.
- EC2 Fleet / Spot Fleet are AWS services that launch and maintain that capacity.
Short version: Spot Optimizer decides what to run; EC2/Spot Fleet decide how to run it.
📦 Project Links
🔗 GitHub Repository — Source code, issues, and contributions
📋 PyPI Package — Install with pip install spot-optimizer
The Problem with Manual Spot Instance Selection
Manual spot instance selection typically involves:
- Guessing which instance types have low interruption rates
- Trial-and-error testing across regions and families
- Building custom logic to match workload requirements
- Constantly updating interruption rate data
This leads to over-provisioning, frequent interruptions, or both.
TL;DR
- Use Spot Optimizer to compute a shortlist of instance types that meet your cores/RAM and reliability needs (filters for region, SSD-only, ARM/x86, EMR, families).
- Feed that list to EC2 Fleet/Spot Fleet with a capacity‑optimized (or price‑capacity‑optimized) strategy.
- Result: fewer surprises, fewer interruptions, and IaC you don’t have to babysit.
What Spot Optimizer Actually Does
Spot Optimizer abstracts away the complexity of spot instance selection.
Inputs
- Required: cores, memory (GB)
- Optional: region, ssd_only, arm_instances, instance_family filters (e.g., ["m6i", "r6i"]), emr_version, and mode (latency, fault_tolerance, balanced)
Guarantees
- Always meets or exceeds requested cores/RAM by rounding up to available shapes
Outputs
- Minimal set of instance type(s) and counts plus reliability hints (spot score, interruption-rate band)
Interfaces
- Python API and a mirrored CLI
Under the hood
- Fast lookups backed by DuckDB.
- Hourly refresh of interruption/reliability signals so picks don’t go stale.
- Solid test suite (≈98% coverage) covering the decision engine and CLI, so “wrong picks” get caught early.
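To make the rounding-up guarantee concrete, here is a minimal sketch of the arithmetic. It is not the library's actual selection logic: the function name nodes_needed is hypothetical, and the m6i.4xlarge shape (16 vCPUs, 64 GiB) comes from AWS's published instance specs.

```python
import math

def nodes_needed(cores: int, memory_gb: int,
                 vcpus_per_node: int, memory_gb_per_node: int) -> int:
    """Smallest node count whose combined vCPUs and memory both meet the request."""
    by_cpu = math.ceil(cores / vcpus_per_node)
    by_memory = math.ceil(memory_gb / memory_gb_per_node)
    # The tighter of the two constraints decides the count.
    return max(by_cpu, by_memory, 1)

# m6i.4xlarge: 16 vCPUs, 64 GiB of memory.
print(nodes_needed(64, 256, 16, 64))  # 4
print(nodes_needed(70, 256, 16, 64))  # 5 (80 vCPUs, which exceeds the 70 requested)
```

The second call shows why the result can exceed the request: 70 cores do not divide evenly into 16-vCPU nodes, so the count rounds up.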
Quick Start
Install
pip install spot-optimizer
API
from spot_optimizer import optimize
plan = optimize(
    cores=64,
    memory=256,
    region="us-east-1",
    ssd_only=True,
    arm_instances=False,
    instance_family=["m6i", "r6i"],
    mode="fault_tolerance",
)
print(plan)  # example: {"instances": {"type": "m6i.4xlarge", "count": 4}, "mode": "fault_tolerance", ...}
CLI
spot-optimizer \
--cores 64 \
--memory 256 \
--region us-east-1 \
--ssd-only \
--no-arm \
--instance-family m6i r6i \
--mode fault_tolerance
Spot Optimizer vs EC2/Spot Fleet
Aspect | Spot Optimizer | EC2 Fleet / Spot Fleet |
---|---|---|
Purpose | Decide instance types and counts | Launch and maintain capacity |
When to use | When you’re deciding what to run | When you’re launching and keeping it running |
Output | A shortlist of types + counts | Actual EC2 instances |
Focus | Fit, stability, cost | Capacity fulfillment, diversification |
Why Pair Them
- Smarter diversification: Start from low-interruption candidates, then let Fleet diversify across pools/AZs
- Fewer interruptions: Use capacity-optimized or price-capacity-optimized in the Fleet so AWS places you where spare capacity is deeper
- Less thrash: Stop hand-picking random types and hoping they survive
Integration with EC2 Fleet
Take Spot Optimizer’s output and pass it to LaunchTemplateConfigs[].Overrides.
import boto3
from spot_optimizer import optimize

def launch_optimized_fleet(cores: int, memory: int, region: str):
    # Get recommendations from Spot Optimizer
    plan = optimize(cores=cores, memory=memory, region=region, mode="fault_tolerance")
    itype = plan["instances"]["type"]
    count = plan["instances"]["count"]

    # Diversify across AZs; add sibling instance types if you want more pools
    overrides = [
        {"InstanceType": itype, "AvailabilityZone": f"{region}{az}"}
        for az in ("a", "b", "c")
    ]

    ec2 = boto3.client("ec2", region_name=region)
    resp = ec2.create_fleet(
        LaunchTemplateConfigs=[
            {
                "LaunchTemplateSpecification": {
                    "LaunchTemplateName": "my-workload-template",
                    "Version": "$Latest",
                },
                "Overrides": overrides,
            }
        ],
        TargetCapacitySpecification={
            "TotalTargetCapacity": count,
            "DefaultTargetCapacityType": "spot",
        },
        Type="maintain",  # keep capacity steady
        SpotOptions={
            "AllocationStrategy": "price-capacity-optimized",  # recommended
            "InstanceInterruptionBehavior": "terminate",
        },
    )
    return resp["FleetId"]
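The comment about sibling instance types can be sketched as follows. This is an illustration rather than something Spot Optimizer emits: build_overrides is a hypothetical helper, and the three types are an assumed shortlist of same-sized shapes (16 vCPUs, 64 GiB each), so the instance count from the plan still applies. In practice the AZ names would come from describe_availability_zones rather than being hard-coded.

```python
from itertools import product

def build_overrides(instance_types, availability_zones):
    """One override per (instance type, AZ) pair; each pair is a separate Spot pool."""
    return [
        {"InstanceType": itype, "AvailabilityZone": az}
        for itype, az in product(instance_types, availability_zones)
    ]

overrides = build_overrides(
    ["m6i.4xlarge", "m6a.4xlarge", "m5.4xlarge"],  # assumed shortlist, same size
    ["us-east-1a", "us-east-1b", "us-east-1c"],    # assumed AZ names
)
print(len(overrides))  # 9
```

Three types across three AZs gives the allocation strategy nine pools to choose from instead of three.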
Best practices for EC2 Fleet integration
- Region‑scoped: create one fleet per region.
- Request type: prefer maintain for steady capacity; use instant when you need immediate placement (EC2 Fleet only).
- Allocation strategy: prefer price‑capacity‑optimized (or capacity‑optimized) and be flexible across many instance types.
- Diversify across AZs: include multiple AZs in your overrides; let the allocation strategy do its job.
- Capacity Rebalancing: enable it and handle interruption/rebalance signals gracefully.
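For the Capacity Rebalancing item, a minimal sketch of the SpotOptions you might pass to create_fleet. The helper name spot_options is hypothetical; the MaintenanceStrategies keys follow the EC2 CreateFleet request shape, and Capacity Rebalancing only applies to fleets of type maintain.

```python
def spot_options(rebalance: bool = True) -> dict:
    """Build the SpotOptions argument for ec2.create_fleet (Type='maintain')."""
    options = {
        "AllocationStrategy": "price-capacity-optimized",
        "InstanceInterruptionBehavior": "terminate",
    }
    if rebalance:
        # Launch a replacement when AWS signals elevated interruption risk.
        options["MaintenanceStrategies"] = {
            "CapacityRebalance": {"ReplacementStrategy": "launch"}
        }
    return options

print(spot_options()["MaintenanceStrategies"])
```

With the launch strategy, the fleet starts a replacement but leaves the at-risk instance running, so your own tooling still has to drain and terminate it.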
Integration with Spot Fleet
If you’re using the older Spot Fleet service, populate Launch Specifications/Overrides with the recommended instance types:
{
  "SpotFleetRequestConfig": {
    "IamFleetRole": "arn:aws:iam::<account>:role/aws-ec2-spot-fleet-role",
    "TargetCapacity": 8,
    "AllocationStrategy": "priceCapacityOptimized",
    "LaunchSpecifications": [
      {
        "ImageId": "ami-12345678",
        "InstanceType": "m6i.2xlarge",
        "SubnetId": "subnet-12345678"
      }
    ]
  }
}
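The JSON above lists a single instance type. If you want Spot Fleet to have several pools to pick from, one option is to generate the launch specifications from a shortlist. The sketch below assumes a hypothetical helper name, a made-up second subnet ID, and an illustrative pair of instance types; the AMI ID and role ARN are the placeholders from the example above.

```python
def spot_fleet_config(instance_types, subnet_ids, target_capacity,
                      image_id, fleet_role_arn):
    """Build a SpotFleetRequestConfig with one launch spec per (type, subnet) pair."""
    launch_specs = [
        {"ImageId": image_id, "InstanceType": itype, "SubnetId": subnet}
        for itype in instance_types
        for subnet in subnet_ids
    ]
    return {
        "IamFleetRole": fleet_role_arn,
        "TargetCapacity": target_capacity,
        "AllocationStrategy": "priceCapacityOptimized",
        "Type": "maintain",
        "LaunchSpecifications": launch_specs,
    }

config = spot_fleet_config(
    ["m6i.2xlarge", "m6a.2xlarge"],            # assumed shortlist
    ["subnet-12345678", "subnet-87654321"],    # placeholder subnet IDs
    target_capacity=8,
    image_id="ami-12345678",
    fleet_role_arn="arn:aws:iam::<account>:role/aws-ec2-spot-fleet-role",
)
print(len(config["LaunchSpecifications"]))  # 4
```

The resulting dict can be passed as boto3.client("ec2").request_spot_fleet(SpotFleetRequestConfig=config) once the placeholders are replaced with real values.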
When Spot Optimizer Makes a Difference
High‑impact scenarios
- ML training jobs: 50+ hour workloads where interruptions are costly
- Spark clusters: multi‑node jobs requiring stable capacity
- Game server fleets: when you can drain sessions and scale out
- CI/CD build farms: interruptions slow teams down
Lower‑impact scenarios (still good Spot candidates)
- Single‑instance development environments
- Stateless web services with strong auto‑scaling (already resilient; a shortlist still helps)
- Batch jobs that checkpoint frequently
Choosing the right optimization mode
- latency — fewer, larger nodes. Good for chatty services and tight p95/p99.
- fault_tolerance — more, smaller nodes. Great for batch/Spark and failure isolation.
- balanced — sensible default when you’re not sure.
Tip: whichever mode you pick, pair it with capacity‑optimized or price‑capacity‑optimized in Fleet to land in deeper capacity pools.
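The trade-off between the modes comes down to how much capacity one interruption takes away. A small worked example, assuming two hypothetical layouts for the same 64-core request (these node counts are illustrative, not output from the library):

```python
def capacity_lost_per_interruption(node_count: int) -> float:
    """Fraction of total capacity lost when a single node is reclaimed."""
    return 1 / node_count

layouts = [
    ("latency (2 x 32 vCPU)", 2),
    ("fault_tolerance (8 x 8 vCPU)", 8),
]
for label, nodes in layouts:
    lost = capacity_lost_per_interruption(nodes)
    print(f"{label}: {lost:.1%} lost per interruption")
# latency (2 x 32 vCPU): 50.0% lost per interruption
# fault_tolerance (8 x 8 vCPU): 12.5% lost per interruption
```

Fewer, larger nodes cut cross-node chatter, but each reclaim hurts more; more, smaller nodes spread that risk.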
Advanced Filtering Options
Filter by EMR version, choose ARM (Graviton) or x86, enforce SSD-only storage, and constrain by families like m6i/r6i. The goal: your shortlist is runnable and optimized, not just cheap.
Example with all filters
plan = optimize(
    cores=128,
    memory=512,
    region="us-west-2",
    ssd_only=True,  # Only SSD-backed instances
    arm_instances=True,  # Include ARM/Graviton instances
    instance_family=["m6i", "m6a", "r6i"],  # Specific families
    emr_version="6.4.0",  # EMR compatibility
    mode="balanced",
)
Performance & reliability
- Fast decisions: the selector is designed to be quick enough for CI or deploy-time.
- Well‑tested: comprehensive tests around the core decision logic and CLI.
- Fresh data: interruption and reliability signals are refreshed on a schedule so recommendations don’t drift.
- Make targets: make install, make test, make test-all, make coverage, make clean.
What this really means is fewer late‑night pages because a random pool went sideways.
Production Checklist
Before Deployment
- Generate candidates with Spot Optimizer (apply EMR/arch/storage/family filters and choose a mode)
- Test recommendations in a staging environment
Fleet Configuration
- Set AllocationStrategy to capacity-optimized or price-capacity-optimized
- Provide multiple instance types as overrides; let AWS diversify
- Use maintain for steady capacity
- Enable Capacity Rebalancing and handle the 2-minute interruption notice
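Handling the 2-minute notice usually means polling the instance metadata service from inside the instance. Here is a minimal sketch using IMDSv2 and only the standard library. The function names are hypothetical, the network calls only work on an EC2 instance, and what you do once a notice arrives (drain, checkpoint, deregister) depends on your workload.

```python
import json
import urllib.error
import urllib.request

IMDS = "http://169.254.169.254/latest"

def imds_token(ttl_seconds: int = 60) -> str:
    """Fetch a short-lived IMDSv2 session token."""
    req = urllib.request.Request(
        f"{IMDS}/api/token",
        method="PUT",
        headers={"X-aws-ec2-metadata-token-ttl-seconds": str(ttl_seconds)},
    )
    with urllib.request.urlopen(req, timeout=2) as resp:
        return resp.read().decode()

def parse_instance_action(body: str) -> tuple[str, str]:
    """Extract (action, time) from the spot/instance-action document."""
    doc = json.loads(body)
    return doc["action"], doc["time"]

def pending_interruption():
    """Return (action, time) if a notice has been issued, else None."""
    req = urllib.request.Request(
        f"{IMDS}/meta-data/spot/instance-action",
        headers={"X-aws-ec2-metadata-token": imds_token()},
    )
    try:
        with urllib.request.urlopen(req, timeout=2) as resp:
            return parse_instance_action(resp.read().decode())
    except urllib.error.HTTPError as err:
        if err.code == 404:  # no interruption scheduled
            return None
        raise
```

A worker would typically call pending_interruption() every few seconds and start draining as soon as it returns a value.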
Ongoing Maintenance
- Refresh recommendations regularly so new reliability data is used at deploy time
- Monitor interruption rates and adjust optimization mode if needed
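One way to enforce regular refreshes is to record when a plan was generated and recompute it once it is too old. This is a sketch under assumptions: is_stale is a hypothetical helper, and the 24-hour default is just one reasonable reading of "regularly".

```python
from datetime import datetime, timedelta, timezone

def is_stale(generated_at: datetime, now: datetime,
             max_age: timedelta = timedelta(hours=24)) -> bool:
    """True when a cached recommendation is older than max_age."""
    return now - generated_at > max_age

now = datetime(2024, 1, 2, 12, 0, tzinfo=timezone.utc)
print(is_stale(datetime(2024, 1, 1, 9, 0, tzinfo=timezone.utc), now))  # True (27 hours old)
print(is_stale(datetime(2024, 1, 2, 6, 0, tzinfo=timezone.utc), now))  # False (6 hours old)
```

In a deploy script, a True result would be the cue to call optimize(...) again before creating or updating the fleet.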
FAQ
Is Spot Optimizer an alternative to Spot Fleet?
No. It’s complementary: Spot Optimizer chooses the right instances; EC2/Spot Fleet launches and manages them.
Can it run cross-region?
Yes—the library analyzes any region. Fleets themselves are region-bound, so you’ll need separate fleets per region.
Why not just pick the lowest-price instances?
Cheap pools can be fragile. Capacity-oriented strategies prefer deeper capacity pools to reduce interruptions, which often saves more money long-term.
How often should I refresh recommendations?
For production workloads, refresh daily or before major deployments. Interruption rates change as AWS capacity shifts.
Can I use this with Kubernetes/EKS?
Yes! Use Spot Optimizer to choose node instance types, then configure your node groups or Karpenter with those recommendations.
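For Karpenter, a shortlist maps onto the requirements list of a NodePool. A minimal sketch, assuming a hypothetical helper name and an illustrative shortlist; the two label keys are Karpenter's well-known labels for capacity type and instance type.

```python
import json

def karpenter_requirements(instance_types):
    """Build NodePool requirements that restrict nodes to Spot and the given types."""
    return [
        {"key": "karpenter.sh/capacity-type", "operator": "In", "values": ["spot"]},
        {
            "key": "node.kubernetes.io/instance-type",
            "operator": "In",
            "values": sorted(set(instance_types)),  # de-duplicate, stable order
        },
    ]

requirements = karpenter_requirements(["m6i.4xlarge", "m6a.4xlarge", "m6i.4xlarge"])
print(json.dumps(requirements, indent=2))
```

Because JSON is valid YAML, the printed list can be pasted under spec.template.spec.requirements in a NodePool manifest.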
What about savings vs on-demand?
Spot instances typically save 50-90% vs on-demand. Spot Optimizer helps you get those savings without the interruption headaches.
Lastly, thank you for reading this post. For more awesome posts, you can explore my other articles here, and follow me on Github — amarlearning.