Serverless ML Compute: The Architecture Bet


ML engineering teams have a compute problem that is structurally different from general application compute. The workload is bursty, heterogeneous, and expensive. A training run might saturate a cluster of GPU instances for six hours and then go idle. An inference endpoint might need zero resources for twenty minutes and then need to serve a thousand requests in ten seconds. A data preprocessing job might run once a week on a very large dataset and take hours to complete.

These patterns fit neither the VM model (where you pay for reserved capacity whether or not you're using it) nor the original serverless model (where executions are expected to be short, with duration metered in milliseconds, and GPU access is not part of the design). For most of the history of cloud ML, teams handled this by over-provisioning: maintaining more compute capacity than they needed and accepting the cost as the price of avoiding cold-start latency.

The compute options as we mapped them

Before investing in Modal Labs, we did a systematic mapping of the compute layer options available to ML engineering teams at the time. The options sorted into a few categories.

Hyperscaler ML platforms (AWS SageMaker, Google Vertex AI, Azure ML): These provide managed infrastructure for training and inference, but they're expensive, complex to configure, and carry significant vendor lock-in risk. The configuration surface area for a non-trivial training job is large. Teams that outgrow the happy path often find themselves working around the managed platform rather than using it. The benefit is integration with the broader cloud ecosystem; the cost is abstraction overhead and pricing.

Raw cloud VMs / reserved GPU instances: The flexible option. You rent GPU capacity, configure your own runtime, manage your own container images, build your own job scheduling. This works and many serious ML teams do it. The operational overhead is significant — maintaining a cluster, handling preemption, managing environment reproducibility across training runs, building the tooling to observe utilization — but the control is real.

Kubernetes + GPU operators: For teams with existing Kubernetes competency, this is a reasonable path. GPU operators (NVIDIA device plugin, etc.) handle the hardware allocation layer; workload scheduling is handled by the Kubernetes scheduler. The upside is that it's a general-purpose platform that handles ML workloads among others. The downside is that Kubernetes was not designed with GPU workloads as a primary use case, and the operational complexity of maintaining a cluster is non-trivial.

Emerging serverless ML platforms: These rest on a design bet that ML compute can be abstracted to the level of serverless, with on-demand GPU access, fast cold starts, and billing by the compute-second rather than by reserved capacity. This is the category that Modal Labs is building in.

Why the serverless model is the right architecture bet for ML

The core argument is about the economics and ergonomics of the workload.

ML workloads are predominantly bursty. A team running daily model evaluations, weekly training runs, and on-demand inference doesn't need steady-state GPU capacity. They need burst capacity on demand. The economics of reserved compute are poor for this pattern — you're paying for capacity you're not using most of the time.
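The arithmetic behind this argument can be sketched in a few lines. The prices and the workload below are hypothetical placeholders chosen for illustration, not quotes from any provider; the point is the shape of the comparison. Reserved capacity is billed for every hour of the month, on-demand capacity only for busy seconds, and the break-even utilization is the ratio of the two hourly rates.

```python
HOURS_PER_MONTH = 730.0  # assumed average month length


def reserved_monthly_cost(hourly_rate: float) -> float:
    """Reserved capacity is billed for every hour, used or not."""
    return hourly_rate * HOURS_PER_MONTH


def on_demand_monthly_cost(hourly_rate: float, busy_seconds: float) -> float:
    """Per-second billing: pay only for the seconds the GPU is busy."""
    return hourly_rate * busy_seconds / 3600.0


def break_even_utilization(reserved_hourly: float, on_demand_hourly: float) -> float:
    """Fraction of the month a GPU must be busy before reserved is cheaper."""
    return reserved_hourly / on_demand_hourly


# Hypothetical prices: on-demand carries a 2x premium over reserved.
RESERVED_HOURLY = 2.00
ON_DEMAND_HOURLY = 4.00

# Hypothetical bursty workload for one month:
# four 6-hour training runs plus thirty 30-minute evaluations.
busy_seconds = (4 * 6 * 3600) + (30 * 30 * 60)

reserved = reserved_monthly_cost(RESERVED_HOURLY)
on_demand = on_demand_monthly_cost(ON_DEMAND_HOURLY, busy_seconds)
utilization = busy_seconds / (HOURS_PER_MONTH * 3600.0)
break_even = break_even_utilization(RESERVED_HOURLY, ON_DEMAND_HOURLY)

if __name__ == "__main__":
    print(f"utilization:  {utilization:.1%}")
    print(f"break-even:   {break_even:.0%}")
    print(f"reserved:     ${reserved:,.2f}")
    print(f"on-demand:    ${on_demand:,.2f}")
```

Under these assumed numbers the GPU is busy for roughly 5% of the month, far below the 50% break-even, so per-second billing wins even at twice the hourly rate. A team with steady utilization above the break-even would see the opposite result.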

The environment reproducibility problem is also a strong argument for the serverless model. Container-native execution — where the environment is defined by an image and is therefore reproducible across runs — eliminates the class of "works on my machine" problems that plague GPU cluster management. If every run is a fresh container built from a specified image, the environment is consistent by construction. This is not a small benefit. Environment debugging is a significant source of lost time for ML engineering teams.
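One way to make "consistent by construction" concrete is to treat the environment as a pinned specification and derive an identifier from its contents. The sketch below is an illustration of that idea only, not a description of how any particular platform builds or addresses images; the function name, the base image tag, and the package versions are assumptions chosen for the example.

```python
import hashlib
import json


def environment_fingerprint(base_image: str, packages: dict) -> str:
    """Return a stable SHA-256 digest of a pinned environment spec.

    The spec is serialized in a canonical form (sorted keys, fixed
    separators), so the same base image and the same pinned versions
    always produce the same digest, regardless of declaration order.
    """
    spec = {"base_image": base_image, "packages": packages}
    canonical = json.dumps(spec, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


# Illustrative spec; the names and versions are placeholders.
run_a = environment_fingerprint(
    "python:3.11-slim", {"torch": "2.3.0", "numpy": "1.26.4"}
)
# Same spec declared in a different order: same environment, same digest.
run_b = environment_fingerprint(
    "python:3.11-slim", {"numpy": "1.26.4", "torch": "2.3.0"}
)
# One version bump: a different environment, and a different digest.
run_c = environment_fingerprint(
    "python:3.11-slim", {"torch": "2.3.1", "numpy": "1.26.4"}
)
```

If two runs carry the same fingerprint, an environment difference can be ruled out as the cause of a discrepancy between them, which is the debugging time the paragraph above refers to. Note that this only holds when every dependency is pinned; an unpinned package can still drift between builds.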

The cold-start latency objection is real but tractable. Early serverless compute products had cold-start times that made them unsuitable for interactive workloads. Reducing cold-start latency for GPU containers takes significant engineering investment, but it is achievable. The question is whether the latency can be brought low enough to make the model viable for the majority of ML workloads, not just batch jobs where latency tolerance is high.

The bet

The investment thesis is that the technical barriers to serverless GPU compute are engineering problems, not physics problems. Cold-start latency is reducible through pre-warming, snapshot-based container launch, and intelligent prediction of workload patterns. The economics of the model are structurally better for the customer than reserved capacity for bursty workloads. And the developer experience of writing Python that runs on GPU without infrastructure configuration is a dramatically lower barrier to adoption than any of the alternatives.
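The trade-off behind pre-warming can be illustrated with a toy simulation. The sketch below models only the simplest variant, a keep-warm window in which a single container stays up for a fixed number of seconds after each request. The function name, the arrival times, and the one-second service time are assumptions for illustration; the model ignores the cold-start duration itself, concurrency, and snapshot-based launch.

```python
def simulate_keep_warm(arrivals, keep_warm_s, service_s=1.0):
    """Simulate one container with a keep-warm window.

    arrivals:    request arrival times in seconds
    keep_warm_s: seconds the container stays up after finishing a request
    service_s:   seconds spent serving each request (assumed constant)

    Returns (cold_starts, billed_seconds), where billed_seconds is the
    total time the container was up, busy or idle.
    """
    cold_starts = 0
    billed = 0.0
    up_since = None    # start of the current up interval
    warm_until = None  # time at which the container would shut down

    for t in sorted(arrivals):
        if warm_until is None or t > warm_until:
            # Container is down: close out the previous interval, start cold.
            if up_since is not None:
                billed += warm_until - up_since
            cold_starts += 1
            up_since = t
            warm_until = t + service_s + keep_warm_s
        else:
            # Warm hit: extend the shutdown deadline.
            warm_until = max(warm_until, t + service_s + keep_warm_s)

    if up_since is not None:
        billed += warm_until - up_since
    return cold_starts, billed


# Hypothetical bursty arrival pattern (seconds).
arrivals = [0, 5, 100, 103, 1000]

no_window = simulate_keep_warm(arrivals, keep_warm_s=0)
ten_second_window = simulate_keep_warm(arrivals, keep_warm_s=10)
```

With no keep-warm window, every one of the five requests is a cold start and the container is billed for 5 seconds. With a 10-second window, cold starts drop to three and billed time rises to 41 seconds. The window size is a direct dial between latency and cost, which is why predicting workload patterns matters: a good prediction buys the lower cold-start count without paying for idle time that never gets used.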

The bet we made is that ML infrastructure is at the beginning of a transition that application infrastructure went through a decade earlier: from owned or reserved compute, to cloud-native, to serverless. The trajectory is the same; the timeline is compressed because the lessons from the previous transition are available to learn from.

Akave Capital is an investor in Modal Labs. This note reflects our investment thesis at the time of initial investment.