How It Works
A deep-dive into CastSlice's webhook architecture, Kubernetes Admission Control integration, and the JSON Patch strategy used to rewrite GPU resource requests.
Architecture
CastSlice is a Kubernetes Mutating Admission Webhook. It runs as a standard Deployment inside your cluster and registers itself with the API server's admission controller pipeline.
When any Pod is created in the cluster, the API server forwards the request to CastSlice before persisting it to etcd. CastSlice inspects the Pod's annotations and, if opted in, rewrites the GPU resource request to a GPU-shared resource. The mutated object is returned and the API server continues scheduling normally.
Admission Flow
Admission webhooks are called synchronously and in-band. CastSlice must respond within the API server's webhook timeout (default 10s). The mutated Pod is what gets persisted; the original request is discarded.
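For orientation, here is a sketch of the AdmissionReview v1 response shape a mutating webhook like CastSlice returns. The field names come from the stable admission.k8s.io/v1 API; the values shown are placeholders:

```json
{
  "apiVersion": "admission.k8s.io/v1",
  "kind": "AdmissionReview",
  "response": {
    "uid": "<uid copied from the incoming request>",
    "allowed": true,
    "patchType": "JSONPatch",
    "patch": "<base64-encoded RFC 6902 patch array>"
  }
}
```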
Annotation Reference
CastSlice uses annotations on the Pod (or Deployment Pod template) to determine whether mutation should occur and how many GPU slices to assign.
| Annotation | Value | Required | Description |
|---|---|---|---|
| `castops.io/optimize` | `"true"` | Required | Opts the Pod into GPU slice mutation. Any value other than `"true"` is treated as absent and the Pod is passed through unchanged. |
| `castops.io/workload-type` | `"training"` / `"inference"` / `"batch"` / `"dev"` | Optional | Selects a preset GPU slice count based on the workload category: training → 4, inference → 2, batch → 2, dev → 1 (default when omitted). |
| `castops.io/slice-ratio` | Positive integer string, e.g. `"8"` | Optional | Overrides the workload-type preset with an explicit slice count. Takes priority over `castops.io/workload-type`. |
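For example, a Pod opting in with an explicit override might look like this (the Pod and image names are illustrative):

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: trainer
  annotations:
    castops.io/optimize: "true"
    castops.io/slice-ratio: "8"     # overrides any workload-type preset
spec:
  containers:
    - name: main
      image: my-training-image:latest   # placeholder image
      resources:
        limits:
          nvidia.com/gpu: 1   # rewritten by CastSlice to nvidia.com/gpu-shared: 8
```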
Resolution order
When determining the final slice count, CastSlice evaluates annotations in this order:
1. `castops.io/slice-ratio`: explicit override, highest priority
2. `castops.io/workload-type`: preset lookup
3. Default of 1: backward-compatible with v0.1.0
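The resolution order above can be sketched as follows. The preset values are the ones from the annotation table; the helper function itself is hypothetical, not CastSlice's actual code:

```python
# Preset slice counts per workload category (from the annotation reference).
WORKLOAD_PRESETS = {"training": 4, "inference": 2, "batch": 2, "dev": 1}

def resolve_slice_count(annotations: dict) -> int:
    """Resolve the final GPU slice count from Pod annotations."""
    # 1. Explicit override wins.
    ratio = annotations.get("castops.io/slice-ratio")
    if ratio is not None and ratio.isdigit() and int(ratio) > 0:
        return int(ratio)
    # 2. Workload-type preset lookup.
    workload = annotations.get("castops.io/workload-type")
    if workload in WORKLOAD_PRESETS:
        return WORKLOAD_PRESETS[workload]
    # 3. Backward-compatible default.
    return 1
```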
Placement matters
The annotation must be on the Pod's own metadata. For Deployments, this means `spec.template.metadata.annotations`, not the Deployment's top-level `metadata.annotations`.
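A sketch of correct placement on a Deployment (names and image are placeholders):

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: inference-api
  # Annotations here are NOT seen by CastSlice: the webhook only
  # receives the Pod objects that the Deployment creates.
spec:
  replicas: 2
  selector:
    matchLabels: {app: inference-api}
  template:
    metadata:
      labels: {app: inference-api}
      annotations:
        castops.io/optimize: "true"            # correct: on the Pod template
        castops.io/workload-type: "inference"
    spec:
      containers:
        - name: api
          image: my-inference-image:latest     # placeholder image
          resources:
            limits:
              nvidia.com/gpu: 1
```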
JSON Patch Strategy
When mutation is triggered, CastSlice returns an AdmissionResponse containing a JSON Patch (RFC 6902) array. The API server applies the patch atomically before persisting the object.
CastSlice iterates over all container lists: `initContainers`, `containers`, and `ephemeralContainers`. For each entry with the key `nvidia.com/gpu`:

- The existing `nvidia.com/gpu` key is removed via `op: remove`
- A new `nvidia.com/gpu-shared` entry is added with the resolved slice count via `op: add`
JSON Pointer encoding: Forward slashes (`/`) in key names are encoded as `~1` in JSON Patch paths per RFC 6901. That's why `nvidia.com/gpu` becomes `nvidia.com~1gpu` in the patch path.
The patch is base64-encoded and sent back in the `AdmissionResponse.Patch` field alongside `patchType: JSONPatch`.
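A minimal sketch of the patch construction for a single container's `limits`, assuming the escaping and encoding rules described above (the helper is illustrative, not CastSlice's actual code):

```python
import base64
import json

def escape_pointer(token: str) -> str:
    # RFC 6901: encode "~" as "~0" first, then "/" as "~1".
    return token.replace("~", "~0").replace("/", "~1")

def build_patch(container_index: int, slice_count: int) -> bytes:
    base = f"/spec/containers/{container_index}/resources/limits"
    patch = [
        # Remove the original GPU request...
        {"op": "remove", "path": f"{base}/{escape_pointer('nvidia.com/gpu')}"},
        # ...and add the shared resource with the resolved slice count.
        {"op": "add",
         "path": f"{base}/{escape_pointer('nvidia.com/gpu-shared')}",
         "value": str(slice_count)},
    ]
    # The API server expects the patch as base64-encoded JSON bytes.
    return base64.b64encode(json.dumps(patch).encode())
```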
Webhook Configuration Reference
CastSlice registers as a MutatingWebhookConfiguration. Key settings:
| Field | Value | Reason |
|---|---|---|
| `rules[].resources` | `["pods"]` | Only intercept Pods; all other resource types are ignored. |
| `rules[].operations` | `["CREATE"]` | Mutation happens at Pod creation. Existing running Pods are never modified. |
| `failurePolicy` | `Ignore` | If CastSlice is unreachable, admit the Pod anyway. This prevents CastSlice downtime from blocking your workloads. |
| `admissionReviewVersions` | `["v1"]` | Uses the stable v1 AdmissionReview API. |
| `sideEffects` | `None` | The webhook has no side effects; required for dry-run compatibility. |
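Put together, the registration would look roughly like this. The resource names, namespace, and webhook path are placeholders; the field values match the table above:

```yaml
apiVersion: admissionregistration.k8s.io/v1
kind: MutatingWebhookConfiguration
metadata:
  name: castslice
webhooks:
  - name: castslice.castops.io      # placeholder webhook name
    admissionReviewVersions: ["v1"]
    sideEffects: None
    failurePolicy: Ignore
    rules:
      - apiGroups: [""]
        apiVersions: ["v1"]
        operations: ["CREATE"]
        resources: ["pods"]
    clientConfig:
      service:
        name: castslice             # placeholder Service name
        namespace: castslice-system # placeholder namespace
        path: /mutate               # placeholder webhook path
      # caBundle is injected by cert-manager (next section)
```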
TLS via cert-manager
The MutatingWebhookConfiguration requires the API server to trust the webhook's TLS certificate. CastSlice uses a cert-manager Certificate resource to obtain a CA-signed cert, and sets the `cert-manager.io/inject-ca-from` annotation on the webhook configuration. cert-manager automatically injects the CA bundle into `webhooks[].clientConfig.caBundle`, keeping it up to date on renewal.
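A sketch of that wiring, with placeholder resource names and namespace:

```yaml
apiVersion: cert-manager.io/v1
kind: Certificate
metadata:
  name: castslice-cert
  namespace: castslice-system
spec:
  secretName: castslice-tls
  dnsNames:
    - castslice.castslice-system.svc   # the webhook Service DNS name
  issuerRef:
    name: castslice-issuer             # e.g. a selfSigned or CA Issuer
    kind: Issuer
---
apiVersion: admissionregistration.k8s.io/v1
kind: MutatingWebhookConfiguration
metadata:
  name: castslice
  annotations:
    # cert-manager's CA injector keeps clientConfig.caBundle in sync
    # with the Certificate at <namespace>/<certificate-name>.
    cert-manager.io/inject-ca-from: castslice-system/castslice-cert
```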
GPU Sharing: nvidia.com/gpu-shared
NVIDIA provides two main mechanisms to share a physical GPU across multiple Pods:
Time-Slicing
NVIDIA GPU Operator (v1.10+) supports time-slicing via a ConfigMap that defines GPU replicas. Each replica appears as a separate `nvidia.com/gpu` resource. CastSlice complements this by writing to the `nvidia.com/gpu-shared` resource key, which you can map to the time-sliced pool via the device plugin configuration.
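For reference, device-plugin time-slicing is configured with a config like the following. The exact schema and any resource-renaming options depend on your device plugin and GPU Operator versions, so check their documentation before relying on this:

```yaml
version: v1
sharing:
  timeSlicing:
    resources:
      - name: nvidia.com/gpu
        replicas: 4   # each physical GPU advertises 4 schedulable replicas
```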
Multi-Process Service (MPS)
NVIDIA MPS allows multiple CUDA processes to share a single GPU context concurrently, with better isolation than time-slicing. The `nvidia.com/gpu-shared` resource key is used in clusters where the device plugin is configured to expose an MPS pool.
CastSlice only handles the Kubernetes resource naming; the actual GPU sharing is provided by the NVIDIA device plugin and your cluster configuration. Ensure `nvidia.com/gpu-shared` is a recognized resource on your nodes before deploying.
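One way to verify, assuming kubectl access to the cluster (replace `<node-name>` with a real node):

```shell
# List each node's allocatable resources and look for the shared key
kubectl get nodes -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.status.allocatable}{"\n"}{end}'

# Or inspect a single node
kubectl describe node <node-name> | grep -i 'nvidia.com/gpu-shared'
```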