How It Works
A deep-dive into CastSlice's webhook architecture, Kubernetes Admission Control integration, and the JSON Patch strategy used to rewrite GPU resource requests.
Architecture
CastSlice is a Kubernetes Mutating Admission Webhook. It runs as a standard Deployment inside your cluster and registers itself with the API server's admission controller pipeline.
When any Pod is created in the cluster, the API server forwards the request to CastSlice before persisting it to etcd. CastSlice inspects the Pod's annotations and, if opted in, rewrites the GPU resource request to a GPU-shared resource. The mutated object is returned and the API server continues scheduling normally.
Admission Flow
Admission webhooks are called synchronously and in-band. CastSlice must respond within the API server's webhook timeout (default 10s). The mutated Pod is what gets persisted; the original request is discarded.
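For orientation, here is a sketch of the AdmissionReview v1 response shape a mutating webhook like CastSlice returns. The field names come from the stable admission.k8s.io/v1 API; the values shown are placeholders:

```json
{
  "apiVersion": "admission.k8s.io/v1",
  "kind": "AdmissionReview",
  "response": {
    "uid": "<uid copied from the incoming request>",
    "allowed": true,
    "patchType": "JSONPatch",
    "patch": "<base64-encoded RFC 6902 patch array>"
  }
}
```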
Annotation Reference
CastSlice uses annotations on the Pod (or Deployment Pod template) to determine whether mutation should occur and how many GPU slices to assign.
| Annotation | Value | Required | Description |
|---|---|---|---|
| `castops.io/optimize` | `"true"` | Required | Opts the Pod into GPU slice mutation. Any value other than `"true"` is treated as absent and the Pod is passed through unchanged. |
| `castops.io/workload-type` | `"training"` / `"inference"` / `"batch"` / `"dev"` | Optional | Selects a preset GPU slice count based on the workload category: training → 4, inference → 2, batch → 2, dev → 1 (default when omitted). |
| `castops.io/slice-ratio` | Positive integer string, e.g. `"8"` | Optional | Overrides the workload-type preset with an explicit slice count. Takes priority over `castops.io/workload-type`. |
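For example, a Pod opting in with an explicit override might look like this (the Pod and image names are illustrative):

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: trainer
  annotations:
    castops.io/optimize: "true"
    castops.io/slice-ratio: "8"     # overrides any workload-type preset
spec:
  containers:
    - name: main
      image: my-training-image:latest   # placeholder image
      resources:
        limits:
          nvidia.com/gpu: 1   # rewritten by CastSlice to nvidia.com/gpu-shared: 8
```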
Resolution order
When determining the final slice count, CastSlice evaluates annotations in this order:
1. `castops.io/slice-ratio`: explicit override, highest priority
2. `castops.io/workload-type`: preset lookup
3. Default of 1: backward-compatible with v0.1.0
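The resolution order above can be sketched as follows. The preset values are the ones from the annotation table; the helper function itself is hypothetical, not CastSlice's actual code:

```python
# Preset slice counts per workload category (from the annotation reference).
WORKLOAD_PRESETS = {"training": 4, "inference": 2, "batch": 2, "dev": 1}

def resolve_slice_count(annotations: dict) -> int:
    """Resolve the final GPU slice count from Pod annotations."""
    # 1. Explicit override wins.
    ratio = annotations.get("castops.io/slice-ratio")
    if ratio is not None and ratio.isdigit() and int(ratio) > 0:
        return int(ratio)
    # 2. Workload-type preset lookup.
    workload = annotations.get("castops.io/workload-type")
    if workload in WORKLOAD_PRESETS:
        return WORKLOAD_PRESETS[workload]
    # 3. Backward-compatible default.
    return 1
```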
Placement matters
The annotation must be on the Pod's own metadata. For Deployments, this means `spec.template.metadata.annotations`, not the Deployment's top-level `metadata.annotations`.
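A sketch of correct placement on a Deployment (names and image are placeholders):

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: inference-api
  # Annotations here are NOT seen by CastSlice: the webhook only
  # receives the Pod objects that the Deployment creates.
spec:
  replicas: 2
  selector:
    matchLabels: {app: inference-api}
  template:
    metadata:
      labels: {app: inference-api}
      annotations:
        castops.io/optimize: "true"            # correct: on the Pod template
        castops.io/workload-type: "inference"
    spec:
      containers:
        - name: api
          image: my-inference-image:latest     # placeholder image
          resources:
            limits:
              nvidia.com/gpu: 1
```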
JSON Patch Strategy
When mutation is triggered, CastSlice returns an AdmissionResponse containing a JSON Patch (RFC 6902) array. The API server applies the patch atomically before persisting the object.
CastSlice iterates over all container lists: `initContainers`, `containers`, and `ephemeralContainers`. For each entry with the key `nvidia.com/gpu`:

- The existing `nvidia.com/gpu` key is removed via `op: remove`
- A new `nvidia.com/gpu-shared` entry is added with the resolved slice count via `op: add`
JSON Pointer encoding: Forward slashes (`/`) in key names are encoded as `~1` in JSON Patch paths per RFC 6901. That's why `nvidia.com/gpu` becomes `nvidia.com~1gpu` in the patch path.
The patch is base64-encoded and sent back in the `AdmissionResponse.Patch` field alongside `patchType: JSONPatch`.
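A minimal sketch of the patch construction for a single container's `limits`, assuming the escaping and encoding rules described above (the helper is illustrative, not CastSlice's actual code):

```python
import base64
import json

def escape_pointer(token: str) -> str:
    # RFC 6901: encode "~" as "~0" first, then "/" as "~1".
    return token.replace("~", "~0").replace("/", "~1")

def build_patch(container_index: int, slice_count: int) -> bytes:
    base = f"/spec/containers/{container_index}/resources/limits"
    patch = [
        # Remove the original GPU request...
        {"op": "remove", "path": f"{base}/{escape_pointer('nvidia.com/gpu')}"},
        # ...and add the shared resource with the resolved slice count.
        {"op": "add",
         "path": f"{base}/{escape_pointer('nvidia.com/gpu-shared')}",
         "value": str(slice_count)},
    ]
    # The API server expects the patch as base64-encoded JSON bytes.
    return base64.b64encode(json.dumps(patch).encode())
```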
Webhook Configuration Reference
CastSlice registers as a MutatingWebhookConfiguration. Key settings:
| Field | Value | Reason |
|---|---|---|
| `rules[].resources` | `["pods"]` | Only intercept Pods; all other resource types are ignored. |
| `rules[].operations` | `["CREATE"]` | Mutation happens at Pod creation. Existing running Pods are never modified. |
| `failurePolicy` | `Ignore` | If CastSlice is unreachable, admit the Pod anyway. This prevents CastSlice downtime from blocking your workloads. |
| `admissionReviewVersions` | `["v1"]` | Uses the stable v1 AdmissionReview API. |
| `sideEffects` | `None` | The webhook has no side effects; required for dry-run compatibility. |
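Put together, the registration would look roughly like this. The resource names, namespace, and webhook path are placeholders; the field values match the table above:

```yaml
apiVersion: admissionregistration.k8s.io/v1
kind: MutatingWebhookConfiguration
metadata:
  name: castslice
webhooks:
  - name: castslice.castops.io      # placeholder webhook name
    admissionReviewVersions: ["v1"]
    sideEffects: None
    failurePolicy: Ignore
    rules:
      - apiGroups: [""]
        apiVersions: ["v1"]
        operations: ["CREATE"]
        resources: ["pods"]
    clientConfig:
      service:
        name: castslice             # placeholder Service name
        namespace: castslice-system # placeholder namespace
        path: /mutate               # placeholder webhook path
      # caBundle is injected by cert-manager (next section)
```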
TLS via cert-manager
The MutatingWebhookConfiguration requires the API server to trust the webhook's TLS certificate. CastSlice uses a cert-manager Certificate resource to obtain a CA-signed cert, and sets the `cert-manager.io/inject-ca-from` annotation on the webhook configuration. cert-manager automatically injects the CA bundle into `webhooks[].clientConfig.caBundle`, keeping it up to date on renewal.
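A sketch of that wiring, with placeholder resource names and namespace:

```yaml
apiVersion: cert-manager.io/v1
kind: Certificate
metadata:
  name: castslice-cert
  namespace: castslice-system
spec:
  secretName: castslice-tls
  dnsNames:
    - castslice.castslice-system.svc   # the webhook Service DNS name
  issuerRef:
    name: castslice-issuer             # e.g. a selfSigned or CA Issuer
    kind: Issuer
---
apiVersion: admissionregistration.k8s.io/v1
kind: MutatingWebhookConfiguration
metadata:
  name: castslice
  annotations:
    # cert-manager's CA injector keeps clientConfig.caBundle in sync
    # with the Certificate at <namespace>/<certificate-name>.
    cert-manager.io/inject-ca-from: castslice-system/castslice-cert
```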
GPU Sharing: nvidia.com/gpu-shared
NVIDIA provides two main mechanisms to share a physical GPU across multiple Pods:
Time-Slicing
NVIDIA GPU Operator (v1.10+) supports time-slicing via a ConfigMap that defines GPU replicas. Each replica appears as a separate `nvidia.com/gpu` resource. CastSlice complements this by writing to the `nvidia.com/gpu-shared` resource key, which you can map to the time-sliced pool via the device plugin configuration.
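For reference, device-plugin time-slicing is configured with a config like the following. The exact schema and any resource-renaming options depend on your device plugin and GPU Operator versions, so check their documentation before relying on this:

```yaml
version: v1
sharing:
  timeSlicing:
    resources:
      - name: nvidia.com/gpu
        replicas: 4   # each physical GPU advertises 4 schedulable replicas
```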
Multi-Process Service (MPS)
NVIDIA MPS allows multiple CUDA processes to share a single GPU context concurrently, with better isolation than time-slicing. The `nvidia.com/gpu-shared` resource key is used in clusters where the device plugin is configured to expose an MPS pool.
CastSlice only handles the Kubernetes resource naming; the actual GPU sharing is provided by the NVIDIA device plugin and your cluster configuration. Ensure `nvidia.com/gpu-shared` is a recognized resource on your nodes before deploying.
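One way to verify, assuming kubectl access to the cluster (replace `<node-name>` with a real node):

```shell
# List each node's allocatable resources and look for the shared key
kubectl get nodes -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.status.allocatable}{"\n"}{end}'

# Or inspect a single node
kubectl describe node <node-name> | grep -i 'nvidia.com/gpu-shared'
```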