#136: vLLM, LMD, and the Quest to Build the Linux of AI Inference

De Nederlandse Kubernetes Podcast

De Nederlandse Kubernetes Podcast: made by and for people with a heart for IT. In this series, Ronald Kers and Jan Stomphorst talk about Kubernetes, with the aim of making Kubernetes accessible to everyone.

June 09, 2026 • Ronald Kers and Jan Stomphorst • Season 4 • Episode 11

Duration: 32:21

In this episode, hosts Ronald and Jan are joined at KubeCon by two guests from Red Hat: Brian Stevens, AI CTO and one of the original architects behind the creation of Kubernetes and the CNCF, and Rob Shaw, co-lead of the vLLM project and maintainer of LMD.

Brian shares the remarkable backstory of how Kubernetes came to be open source, including how Red Hat negotiated a single committer seat before agreeing to be a launch partner, and how he later pushed Google to contribute Kubernetes to the newly formed CNCF rather than keeping it under Google's own control, as with TensorFlow.

Rob explains what an inference runtime actually is: the critical piece of software that takes an abstract AI model and runs it as efficiently as possible on a GPU or other accelerator — handling everything from CUDA-level kernel optimization to memory management and concurrent request scheduling. vLLM serves as a "Rosetta Stone" between the ever-growing zoo of models (Llama, DeepSeek, Mistral, Qwen, Nvidia Nemotron) and accelerators (Nvidia, AMD, Intel, Google TPUs).
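As a rough illustration of the concurrent request scheduling mentioned above, here is a minimal Python sketch of a toy scheduler that refills free batch slots before every decode step rather than waiting for the whole batch to finish. It is not how vLLM is implemented; `Request`, `run_continuous_batching`, and `max_batch` are hypothetical names chosen for this example, and each request is assumed to need at least one token.

```python
from collections import deque
from dataclasses import dataclass


@dataclass
class Request:
    name: str
    tokens_left: int  # tokens this request still needs to generate


def run_continuous_batching(requests, max_batch):
    """Toy scheduler: returns the decode step at which each request finishes."""
    waiting = deque(requests)
    running = []
    finished_at = {}
    step = 0
    while waiting or running:
        # Fill free batch slots from the queue before every step,
        # instead of waiting for the whole batch to drain.
        while waiting and len(running) < max_batch:
            running.append(waiting.popleft())
        step += 1
        # One decode step: every running request emits one token.
        for req in running:
            req.tokens_left -= 1
            if req.tokens_left <= 0:
                finished_at[req.name] = step
        # Finished requests leave the batch and free their slots.
        running = [r for r in running if r.tokens_left > 0]
    return finished_at


schedule = run_continuous_batching(
    [Request("a", 2), Request("b", 5), Request("c", 1), Request("d", 2)],
    max_batch=2,
)
print(schedule)
```

In this toy run, short requests that arrive later can finish before a long request that started earlier, because they take over slots as soon as those slots are free. A real runtime also has to manage accelerator memory and kernels, which this sketch leaves out.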

The conversation covers model compression and quantization: how techniques like 4-bit precision can deliver 2x hardware efficiency gains while preserving 99%+ model accuracy. Brian and Rob also address the "big model vs. many small models" debate, recommending that teams always start with the largest capable model to validate a use case before optimizing down.
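To make the idea of 4-bit precision concrete, the following Python sketch applies plain symmetric round-to-nearest quantization to one small group of made-up weights. It is an assumption-laden illustration, not the compression method discussed in the episode: the helper names `quantize_4bit` and `dequantize`, the group size of 64, and the random weights are all hypothetical, and the accuracy of a real quantized model has to be measured on that model.

```python
import random


def quantize_4bit(weights):
    """Symmetric round-to-nearest: map floats to integer codes in [-7, 7]."""
    # One shared scale per group; fall back to 1.0 if every weight is zero.
    scale = max(abs(w) for w in weights) / 7 or 1.0
    codes = [max(-7, min(7, round(w / scale))) for w in weights]
    return codes, scale


def dequantize(codes, scale):
    """Recover approximate float weights from the integer codes."""
    return [c * scale for c in codes]


random.seed(0)
weights = [random.gauss(0.0, 0.02) for _ in range(64)]  # made-up weight group
codes, scale = quantize_4bit(weights)
restored = dequantize(codes, scale)

# Rounding to the nearest code bounds the error by half a quantization step.
max_error = max(abs(w - r) for w, r in zip(weights, restored))
bits_fp16 = 16 * len(weights)
bits_int4 = 4 * len(codes) + 16  # 4-bit codes plus one 16-bit scale per group
print(f"max abs error: {max_error:.5f}, half step: {scale / 2:.5f}")
print(f"storage: {bits_fp16} bits -> {bits_int4} bits")
```

The sketch only shows the storage side: each weight shrinks from 16 bits to 4 bits plus a small per-group overhead for the scale. How much of that translates into end-to-end hardware efficiency depends on the runtime and the accelerator.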

Looking ahead, both guests see inference as potentially the single largest workload ever run on Kubernetes, and position LMD (now contributed to the CNCF) as the distributed inference layer that will make this possible across heterogeneous accelerator environments, preventing enterprises from ending up with 42 incompatible AI stacks.
The episode closes with a discussion on AI slop, human-in-the-loop thinking, and the future of Kubernetes as the universal platform for running AI agents at scale.

Powered by @acc-ict

Send us a message.

ACC ICT, specialist in IT continuity
Business-critical applications and data securely available, independent of third parties, anytime and anywhere

Support the show

Like and subscribe! It helps out a lot.

You can also find us on:
De Nederlandse Kubernetes Podcast - YouTube
Nederlandse Kubernetes Podcast (@k8spodcast.nl) | TikTok
De Nederlandse Kubernetes Podcast

Where can you meet us:
Events

This Podcast is powered by:
ACC ICT - IT Continuity for Business-Critical Applications | ACC ICT

Jan Stomphorst

Host

Ronald Kers

Host