
skypilot
Run, manage, and scale AI workloads on any AI infrastructure. Use one system to access & manage all AI compute (Kubernetes, Slurm, 20+ clouds, on-prem).
The Lens
SkyPilot runs your AI training jobs on whichever infrastructure has GPUs available. Write the job once and it figures out where to land it: your Kubernetes cluster, your Slurm cluster, AWS, GCP, Azure, RunPod, Lambda, or any of the 20+ supported clouds. It handles spot instance failover, queue management, and auto-cleanup of idle resources. It's Apache 2.0 licensed and installs with pip.
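To make "figures out where to land it" concrete, here is a toy placement rule: walk a preference-ordered list of locations and pick the first one with enough free GPUs of the requested type. This is a hypothetical sketch, not SkyPilot's optimizer (which also weighs price, regions, and quotas); the `Location` type, `place` function, and location labels are invented for illustration.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Location:
    name: str                 # hypothetical label, e.g. "k8s:team-cluster"
    free_gpus: dict[str, int] # accelerator name -> GPUs currently free


def place(gpu: str, count: int, candidates: list[Location]) -> Optional[str]:
    """Return the first location, in preference order, that can fit the job."""
    for loc in candidates:
        if loc.free_gpus.get(gpu, 0) >= count:
            return loc.name
    return None  # nowhere has capacity right now; a scheduler would queue or retry


candidates = [
    Location("k8s:team-cluster", {"A100": 2}),
    Location("aws:us-east-1", {"A100": 0, "H100": 8}),
    Location("gcp:us-central1", {"A100": 8}),
]
print(place("A100", 8, candidates))  # gcp:us-central1
```

Listing your own cluster first expresses a "use what we already pay for, then burst to cloud" preference; the job only spills over when the earlier entries can't fit it.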
Self-hosting is straightforward on the user side: pip install, configure cloud credentials, write a YAML spec. The infra-team side is heavier: for shared clusters with multi-tenancy, gang scheduling, and team resource quotas, you run SkyPilot's API server yourself and tune it. Most solo users skip that and point SkyPilot directly at their existing cloud accounts.
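For a sense of what the "YAML spec" step looks like, the sketch below renders a minimal task file as text. The helper `render_task_yaml` is hypothetical; it emits only a few fields (`name`, `resources.accelerators`, `resources.use_spot`, `setup`, `run`) that appear in SkyPilot's task YAML, so check the official schema before relying on it.

```python
def render_task_yaml(name: str, accelerator: str, count: int,
                     setup: str, run: str, use_spot: bool = False) -> str:
    """Render a minimal SkyPilot-style task spec as YAML text.

    Hypothetical helper: only a handful of task fields are emitted.
    """
    lines = [
        f"name: {name}",
        "resources:",
        f"  accelerators: {accelerator}:{count}",  # GPU type and count, e.g. A100:1
    ]
    if use_spot:
        lines.append("  use_spot: true")  # ask for spot capacity
    # Multi-line shell snippets go in YAML block scalars, indented two spaces.
    lines.append("setup: |")
    lines.extend(f"  {cmd}" for cmd in setup.strip().splitlines())
    lines.append("run: |")
    lines.extend(f"  {cmd}" for cmd in run.strip().splitlines())
    return "\n".join(lines) + "\n"


spec = render_task_yaml(
    name="train",
    accelerator="A100",
    count=1,
    setup="pip install -r requirements.txt",
    run="python train.py",
    use_spot=True,
)
print(spec)
```

Saved as `task.yaml`, a file like this is what you pass to `sky launch task.yaml`; `sky check` reports which clouds your configured credentials enable.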
Solo ML engineers chasing GPU availability across clouds: this is the move. Small teams sharing a Kubernetes cluster or split across GCP and AWS: same. Large ML platforms at companies like Shopify already use it. Enterprise features (SSO, RBAC, advanced governance) push you toward the paid SkyPilot offering, but the open-source core has no usage caps.
The mental model has a learning curve, and SkyPilot is built around YAML and the CLI. If you need a click-to-launch UI for non-engineers, you'll have to build it on top.
Free vs Self-Hosted vs Paid
Free tier: The OSS core. Apache 2.0, full feature set for individuals and small teams.
Self-hosted (free): Install with pip and connect any cloud credentials. You get multi-cloud orchestration, spot failover, queueing, auto-stop, and gang scheduling. Run the API server yourself for team sharing.
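Spot failover only pays off if a preempted job can resume rather than restart. The sketch below shows the general pattern, assuming the job checkpoints after every step: on each (simulated) preemption it relaunches on the next location and continues from the last checkpoint. It is a conceptual illustration, not SkyPilot's recovery code; `Preempted`, `run_with_recovery`, and the location names are hypothetical.

```python
class Preempted(Exception):
    """Simulated spot reclamation (stands in for a real preemption signal)."""


def run_with_recovery(run_step, total_steps, locations, max_restarts=10):
    """Run a step-checkpointed job, relaunching elsewhere after each preemption.

    `run_step(location, step)` does one unit of work and may raise Preempted.
    Returns the locations the job ran on, in order.
    """
    checkpoint = 0  # first step not yet completed; a real job would persist this
    restarts = 0
    history = []
    while checkpoint < total_steps:
        location = locations[restarts % len(locations)]  # rotate through locations
        history.append(location)
        try:
            for step in range(checkpoint, total_steps):
                run_step(location, step)
                checkpoint = step + 1  # completed work survives a later preemption
        except Preempted:
            restarts += 1
            if restarts > max_restarts:
                raise RuntimeError("gave up after too many preemptions")
    return history


# Usage: the first spot location is reclaimed once, at step 3.
executed = []
pending_preemptions = {("aws-spot", 3)}


def step_fn(location, step):
    if (location, step) in pending_preemptions:
        pending_preemptions.discard((location, step))
        raise Preempted()
    executed.append((location, step))


print(run_with_recovery(step_fn, 5, ["aws-spot", "gcp-spot"]))
# ['aws-spot', 'gcp-spot']
```

Steps 0 to 2 run on the first location, and steps 3 and 4 run on the second; nothing already finished is repeated because progress is tracked per step.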
Paid (SkyPilot Enterprise): The company offers a commercial tier for enterprise features (SSO, RBAC, advanced governance, dedicated support). Contact sales for pricing. The OSS version remains fully functional without it.
The cost you don't see: the cloud bills for whatever SkyPilot launches. This is BYOC (bring your own cloud), so you pay AWS, GCP, Azure, etc. directly for GPU time. SkyPilot's job is to minimize that bill by hopping between clouds and using spot capacity.
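The arithmetic behind "hop between clouds and use spot" is simple: estimated cost = price per GPU-hour × GPUs × hours, inflated for any work redone after preemptions. The sketch below picks the cheapest option under that model. The prices and the 20% rerun overhead are made-up numbers for illustration, not real quotes, and SkyPilot's own optimizer considers more than this.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Offering:
    name: str
    price_per_gpu_hour: float    # USD; illustrative numbers only
    rerun_overhead: float = 0.0  # extra fraction of work redone after preemptions


def estimated_cost(o: Offering, gpus: int, hours: float) -> float:
    """Price per GPU-hour x GPUs x hours, inflated by expected rework."""
    return o.price_per_gpu_hour * gpus * hours * (1 + o.rerun_overhead)


def cheapest(offerings: list[Offering], gpus: int, hours: float) -> Offering:
    return min(offerings, key=lambda o: estimated_cost(o, gpus, hours))


offerings = [
    Offering("cloud-a on-demand", 4.00),      # 4.00 * 8 * 10 = 320
    Offering("cloud-b on-demand", 3.50),      # 3.50 * 8 * 10 = 280
    Offering("cloud-a spot", 1.50, 0.20),     # 1.50 * 8 * 10 * 1.2 = 144
]
best = cheapest(offerings, gpus=8, hours=10)
print(best.name, round(estimated_cost(best, 8, 10), 2))  # cloud-a spot 144.0
```

Even with a fifth of the work repeated, spot wins here by a wide margin, which is why checkpointing plus automatic failover matters so much to the bill.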
Open core. The OSS version covers most use cases for free.
License: Apache License 2.0
Use freely. Patent grant included.
Commercial use: ✓ Yes
About
Owner: skypilot-org (Organization)
Stars: 9,983
Explore Further
More tools in the directory
dolphinscheduler
Apache DolphinScheduler is the modern data orchestration platform. Agile to create high performance workflow with low-code
14.3k ★
TensorRT-LLM
TensorRT LLM provides users with an easy-to-use Python API to define Large Language Models (LLMs) and supports state-of-the-art optimizations to perform inference efficiently on NVIDIA GPUs. TensorRT LLM also contains components to create Python and C++ runtimes that orchestrate the inference execution in a performant way.
13.6k ★
awesome-opensource-ai
Curated list of the best truly open-source AI projects, models, tools, and infrastructure.
3.6k ★