Open source · Apache

Stop burning
GPU dollars.
Start slicing.

CastSlice is a zero-touch Kubernetes Mutating Webhook that automatically converts full GPU requests into shared GPU slices — no application changes required.

kubectl apply -f deployment.yaml
# Before: 1 Pod consumes the entire GPU
resources:
  limits:
    nvidia.com/gpu: 1

# Add one annotation to opt-in
annotations:
  castops.io/optimize: "true" # ← magic

# After: CastSlice rewrites on-the-fly
resources:
  limits:
    nvidia.com/gpu-shared: 1

10×
More Pods per GPU
1
Annotation to enable
0
App changes required
3
Cloud providers supported

Zero-touch GPU sharing in 3 steps

CastSlice registers with the Kubernetes API server as an admission webhook and intercepts Pod creation requests transparently.

1
Deploy CastSlice

Install with a single kubectl command. cert-manager injects TLS automatically.

kubectl apply -f install.yaml
2
Annotate your Pod

Add one annotation to any Pod or Deployment. No code changes. No restarts.

castops.io/optimize: "true"
3
Slicing happens automatically

CastSlice intercepts the CREATE request and rewrites nvidia.com/gpu to a shared resource — before the Pod is scheduled.
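Under the hood this is a standard Kubernetes MutatingWebhookConfiguration scoped to Pod CREATE operations. A minimal sketch, with the webhook name, service name, namespace, and handler path as assumptions rather than the shipped manifest:

```yaml
apiVersion: admissionregistration.k8s.io/v1
kind: MutatingWebhookConfiguration
metadata:
  name: castslice-webhook            # assumed name
webhooks:
- name: castslice.castops.io         # assumed name
  admissionReviewVersions: ["v1"]
  sideEffects: None
  rules:
  - apiGroups: [""]
    apiVersions: ["v1"]
    operations: ["CREATE"]           # only intercept Pod creation
    resources: ["pods"]
  clientConfig:
    service:
      name: castslice                # assumed service name
      namespace: castslice-system    # assumed namespace
      path: /mutate                  # assumed handler path
```

Note that the `castops.io/optimize: "true"` check has to happen inside the webhook handler itself, since admission `objectSelector` fields match labels, not annotations.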

What you get

Everything you need to start sharing GPUs across your AI workloads today.

✏️
Zero-touch Mutation

The webhook intercepts Pod CREATE requests and rewrites resource specs on-the-fly. Your application never knows it was changed.

🔒
Opt-in by Annotation

Only Pods with castops.io/optimize: "true" are mutated. Everything else passes through unchanged.

☁️
Cloud Agnostic

Works on any CNCF-conformant Kubernetes cluster — EKS, GKE, AKS, or on-prem bare metal.

🔐
cert-manager TLS

Leverages cert-manager for automatic TLS certificate injection and rotation. No manual cert management.
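cert-manager's CA injector populates the webhook's `caBundle` from a Certificate resource via a single annotation. A sketch of what that looks like, where the Certificate name and namespace are assumptions:

```yaml
apiVersion: admissionregistration.k8s.io/v1
kind: MutatingWebhookConfiguration
metadata:
  name: castslice-webhook
  annotations:
    # cert-manager's cainjector fills in clientConfig.caBundle
    # from this Certificate; name and namespace are assumptions.
    cert-manager.io/inject-ca-from: castslice-system/castslice-serving-cert
```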

🏥
Health Probes

Exposes /healthz and /readyz endpoints. The Pod only becomes Ready once the webhook server is fully up.
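Wired into the webhook Deployment, those endpoints back standard liveness and readiness probes. A sketch, assuming the webhook serves on HTTPS port 8443 (both the port and the scheme are assumptions):

```yaml
containers:
- name: castslice
  ports:
  - containerPort: 8443     # assumed webhook port
  livenessProbe:
    httpGet:
      path: /healthz
      port: 8443
      scheme: HTTPS
  readinessProbe:
    httpGet:
      path: /readyz
      port: 8443
      scheme: HTTPS
```

The kubelet does not verify the certificate on HTTPS probes, so self-signed serving certs work here.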

🛡️
Failure-safe Policy

The webhook ships with failurePolicy: Ignore, so if CastSlice is ever down, Pods are still admitted and scheduled normally.
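In the webhook registration, that policy looks like this (the surrounding names and the timeout value are assumptions; failurePolicy is the field the source describes):

```yaml
webhooks:
- name: castslice.castops.io   # assumed name
  # If the webhook is unreachable, the API server admits
  # the Pod unmodified instead of rejecting it.
  failurePolicy: Ignore
  timeoutSeconds: 5            # assumed; keeps admission latency bounded
```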

One line. That's all it takes.

Add castops.io/optimize: "true" to any Pod or Deployment template. CastSlice detects it at admission time and rewrites nvidia.com/gpu limits into nvidia.com/gpu-shared — enabling NVIDIA MPS or Time-Slicing to pack multiple Pods onto a single physical card.

No changes to your container image, entrypoint, or business logic. The application is completely unaware.

Read the docs →
deployment.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: ollama-inference
spec:
  template:
    metadata:
      annotations:
        castops.io/optimize: "true" # ← opt-in
    spec:
      containers:
      - name: ollama
        image: ollama/ollama
        resources:
          limits:
            # CastSlice rewrites this ↓
            nvidia.com/gpu: 1
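For the Deployment above, the rewrite can be pictured as the JSONPatch a mutating webhook returns in its admission response, shown here in YAML form. The exact patch CastSlice emits may differ; this is an illustrative sketch:

```yaml
# "~1" escapes "/" in JSON Pointer paths (nvidia.com/gpu)
- op: remove
  path: /spec/containers/0/resources/limits/nvidia.com~1gpu
- op: add
  path: /spec/containers/0/resources/limits/nvidia.com~1gpu-shared
  value: "1"
```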

What's coming

CastSlice is actively developed. Here's what's planned.

v0.1.0
Basic Mutating Webhook

Static slicing: rewrites nvidia.com/gpu to nvidia.com/gpu-shared on annotated Pods.

✓ Shipped
v0.2.0
Smart Slicing

Dynamic ratios based on workload type — training, inference, batch, dev — or an explicit slice-ratio override.

✓ Shipped
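With v0.2.0's overrides, an annotated Pod might look like the sketch below. The slice-ratio annotation key and value format are assumptions based on the feature description, not a documented API:

```yaml
metadata:
  annotations:
    castops.io/optimize: "true"
    # Hypothetical explicit override requesting 1/4 of a GPU;
    # the exact key and value format are assumptions.
    castops.io/slice-ratio: "4"
```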
v0.3.0
FinOps Dashboard

Live GPU utilization metrics and a "dollars saved" counter.

Planned
v0.4.0
Policy Engine

Namespace-level and label-based slicing rules without annotation changes.

Planned

Ready to stop wasting GPU budget?

Deploy CastSlice in under five minutes with a single kubectl command.

Get Started → Star on GitHub