Open source · Apache

Stop burning
GPU dollars.
Start slicing.

CastSlice is a zero-touch Kubernetes Mutating Webhook that automatically converts full GPU requests into shared GPU slices — no application changes required.

kubectl apply -f deployment.yaml
# Before: 1 Pod consumes the entire GPU
resources:
  limits:
    nvidia.com/gpu: 1

# Add one annotation to opt-in
annotations:
  castops.io/optimize: "true" # ← magic

# After: CastSlice rewrites on-the-fly
resources:
  limits:
    nvidia.com/gpu-shared: 1

10×
More Pods per GPU
1
Annotation to enable
0
App changes required
3
Cloud providers supported

Zero-touch GPU sharing in 3 steps

CastSlice registers with the Kubernetes API server as an admission webhook and intercepts Pod creation requests transparently.

1
Deploy CastSlice

Install with a single kubectl command. cert-manager injects TLS automatically.

kubectl apply -f install.yaml
2
Annotate your Pod

Add one annotation to any Pod or Deployment. No code changes. No restarts.

castops.io/optimize: "true"
3
Slicing happens automatically

CastSlice intercepts the CREATE request and rewrites nvidia.com/gpu to a shared resource — before the Pod is scheduled.
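Under the hood this is a standard Kubernetes MutatingWebhookConfiguration scoped to Pod CREATE operations. A minimal sketch, with the webhook name, service name, namespace, and handler path as assumptions rather than the shipped manifest:

```yaml
apiVersion: admissionregistration.k8s.io/v1
kind: MutatingWebhookConfiguration
metadata:
  name: castslice-webhook            # assumed name
webhooks:
- name: castslice.castops.io         # assumed name
  admissionReviewVersions: ["v1"]
  sideEffects: None
  rules:
  - apiGroups: [""]
    apiVersions: ["v1"]
    operations: ["CREATE"]           # only intercept Pod creation
    resources: ["pods"]
  clientConfig:
    service:
      name: castslice                # assumed service name
      namespace: castslice-system    # assumed namespace
      path: /mutate                  # assumed handler path
```

Note that the `castops.io/optimize: "true"` check has to happen inside the webhook handler itself, since admission `objectSelector` fields match labels, not annotations.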

What you get

Everything you need to start sharing GPUs across your AI workloads today.

✏️
Zero-touch Mutation

The webhook intercepts Pod CREATE requests and rewrites resource specs on-the-fly. Your application never knows it was changed.

🔒
Opt-in by Annotation

Only Pods with castops.io/optimize: "true" are mutated. Everything else passes through unchanged.

☁️
Cloud Agnostic

Works on any CNCF-conformant Kubernetes cluster — EKS, GKE, AKS, or on-prem bare metal.

🔐
cert-manager TLS

Leverages cert-manager for automatic TLS certificate injection and rotation. No manual cert management.
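cert-manager's CA injector populates the webhook's `caBundle` from a Certificate resource via a single annotation. A sketch of what that looks like, where the Certificate name and namespace are assumptions:

```yaml
apiVersion: admissionregistration.k8s.io/v1
kind: MutatingWebhookConfiguration
metadata:
  name: castslice-webhook
  annotations:
    # cert-manager's cainjector fills in clientConfig.caBundle
    # from this Certificate; name and namespace are assumptions.
    cert-manager.io/inject-ca-from: castslice-system/castslice-serving-cert
```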

🏥
Health Probes

Exposes /healthz and /readyz endpoints. The Pod only becomes Ready once the webhook server is fully up.
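Wired into the webhook Deployment, those endpoints back standard liveness and readiness probes. A sketch, assuming the webhook serves on HTTPS port 8443 (both the port and the scheme are assumptions):

```yaml
containers:
- name: castslice
  ports:
  - containerPort: 8443     # assumed webhook port
  livenessProbe:
    httpGet:
      path: /healthz
      port: 8443
      scheme: HTTPS
  readinessProbe:
    httpGet:
      path: /readyz
      port: 8443
      scheme: HTTPS
```

The kubelet does not verify the certificate on HTTPS probes, so self-signed serving certs work here.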

🛡️
Failure-safe Policy

The webhook ships with failurePolicy: Ignore, so if CastSlice is ever down, Pods are still admitted and scheduled normally.
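In the webhook registration, that policy looks like this (the surrounding names and the timeout value are assumptions; failurePolicy is the field the source describes):

```yaml
webhooks:
- name: castslice.castops.io   # assumed name
  # If the webhook is unreachable, the API server admits
  # the Pod unmodified instead of rejecting it.
  failurePolicy: Ignore
  timeoutSeconds: 5            # assumed; keeps admission latency bounded
```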

One line. That's all it takes.

Add castops.io/optimize: "true" to any Pod or Deployment template. CastSlice detects it at admission time and rewrites nvidia.com/gpu limits into nvidia.com/gpu-shared — enabling NVIDIA MPS or Time-Slicing to pack multiple Pods onto a single physical card.

No changes to your container image, entrypoint, or business logic. The application is completely unaware.

Read the docs →
deployment.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: ollama-inference
spec:
  template:
    metadata:
      annotations:
        castops.io/optimize: "true" # ← opt-in
    spec:
      containers:
      - name: ollama
        image: ollama/ollama
        resources:
          limits:
            # CastSlice rewrites this ↓
            nvidia.com/gpu: 1
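For the Deployment above, the rewrite can be pictured as the JSONPatch a mutating webhook returns in its admission response, shown here in YAML form. The exact patch CastSlice emits may differ; this is an illustrative sketch:

```yaml
# "~1" escapes "/" in JSON Pointer paths (nvidia.com/gpu)
- op: remove
  path: /spec/containers/0/resources/limits/nvidia.com~1gpu
- op: add
  path: /spec/containers/0/resources/limits/nvidia.com~1gpu-shared
  value: "1"
```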

What's coming

CastSlice is actively developed. Here's what's planned.

v0.1.0
Basic Mutating Webhook

Static slicing: rewrites nvidia.com/gpu to nvidia.com/gpu-shared on annotated Pods.

✓ Shipped
v0.2.0
Smart Slicing

Dynamic ratios based on workload type — training, inference, batch, dev — or an explicit slice-ratio override.

✓ Shipped
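With v0.2.0's overrides, an annotated Pod might look like the sketch below. The slice-ratio annotation key and value format are assumptions based on the feature description, not a documented API:

```yaml
metadata:
  annotations:
    castops.io/optimize: "true"
    # Hypothetical explicit override requesting 1/4 of a GPU;
    # the exact key and value format are assumptions.
    castops.io/slice-ratio: "4"
```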
v0.3.0
FinOps Dashboard

Live GPU utilization metrics and a "dollars saved" counter.

Planned
v0.4.0
Policy Engine

Namespace-level and label-based slicing rules without annotation changes.

Planned

Ready to stop wasting GPU budget?

Deploy CastSlice in under five minutes with a single kubectl command.

Get Started → Star on GitHub