One policy model for cost, uptime, and execution control
Autopilot is the Phase 3 control plane: it coordinates migration triggers, spot recovery behavior, and schedule guardrails automatically.
How It Works
Define Policy
Set budget cap, migration threshold, spot preference, and reliability rules for each workload class.
Launch + Route
Autopilot selects matching capacity and launches according to your policy constraints.
Enforce Continuously
Policy remains active after launch: optimize spend, recover preemptions, and respect schedule/cost limits.
Policy Inputs & Outcomes
Threshold-based (default 15%)
Migration trigger
Detection ~30s + automated restore
Spot recovery
Cron launch + max time/cost
Schedule controls
Example outcome: cost savings
A team spending $2,000/month on GPU training can often recover 20–35% by applying thresholded migration and spot-first policy where safe.
Example outcome: reliability
With checkpoint + recovery policy enabled, spot preemptions become recoverable events instead of full incident escalations.
Use Cases
ML Training
Long-running training with migration + spot recovery to reduce spend while protecting progress.
Inference Serving
Reliability-first policy for serving paths, with optional cost optimization in background.
Batch Processing
Schedule-bound pipelines with strict run and cost ceilings for predictable monthly spend.