Description
Describe the bug
When installing the GPU Operator via ArgoCD, the ClusterPolicy creation fails with the following error:
ClusterPolicy.spec.operator missing required field "defaultRuntime"
However, the same chart installs successfully when using helm install directly.
Environment
- GPU Operator Version: [e.g., v25.3.4]
- Kubernetes Version: [e.g., v1.33.5]
- Installation Method: ArgoCD (Helm chart)
- Container Runtime: containerd
Root Cause Analysis
1. Missing Template Rendering
The Helm chart template (templates/clusterpolicy.yaml) does not render the defaultRuntime field from values:
# Current template
spec:
  operator:
    {{- if .Values.operator.runtimeClass }}
    runtimeClass: {{ .Values.operator.runtimeClass }}
    {{- end }}
    {{- if .Values.operator.defaultGPUMode }}
    defaultGPUMode: {{ .Values.operator.defaultGPUMode }}
    {{- end }}
    # ❌ No defaultRuntime rendering!
Verification:
helm template gpu-operator nvidia/gpu-operator --version v24.9.0 | grep -A 20 "kind: ClusterPolicy"
# Result: No defaultRuntime field in the rendered manifest
2. CRD Schema
The CRD defines defaultRuntime with a default value:
defaultRuntime:
  type: string
  default: docker
  enum:
    - docker
    - crio
    - containerd
And it appears to be required (either explicitly or implicitly through schema validation).
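Whether the field is explicitly required can be checked against the installed CRD rather than inferred. The following is a minimal sketch, assuming the CRD is registered as clusterpolicies.nvidia.com and that kubectl can reach the cluster; both function names are hypothetical helpers, not part of the chart.

```shell
#!/usr/bin/env bash
# Print the `required` list declared on spec.operator in the ClusterPolicy CRD.
# Assumption: the CRD name is clusterpolicies.nvidia.com (override via $1).
operator_required_fields() {
  local crd="${1:-clusterpolicies.nvidia.com}"
  kubectl get crd "$crd" \
    -o jsonpath='{.spec.versions[*].schema.openAPIV3Schema.properties.spec.properties.operator.required}'
}

# Exit status 0 only if defaultRuntime appears in that list.
default_runtime_is_required() {
  operator_required_fields "$@" | grep -qw 'defaultRuntime'
}

# Usage (needs cluster access):
#   default_runtime_is_required && echo "explicitly required" || echo "not in required list"
```

If the list does not contain defaultRuntime, the "missing required field" error would have to come from somewhere other than an explicit `required` entry on `operator`.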
3. Why Helm Install Works But ArgoCD Fails
Helm Direct Install (Client-Side Apply):
- Helm renders the manifest without defaultRuntime
- Helm submits the manifest using client-side apply
- API Server performs defaulting before/during required validation
- Default value docker is applied automatically
- ✅ Success
ArgoCD Install (Server-Side Apply):
- ArgoCD renders manifest without defaultRuntime
- ArgoCD applies using server-side apply (default behavior)
- Server-side apply performs stricter validation
- Required field check happens before defaulting can occur
- ❌ Fails with "missing required field"
This is a known Kubernetes behavior where server-side apply is more strict about required fields than client-side apply.
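One way to test this explanation on a given cluster is to dry-run the same rendered manifest through both apply paths and compare the results. The sketch below is a diagnostic under stated assumptions: the function name and the manifest path are hypothetical, the ClusterPolicy CRD must already be installed, and it relies only on `kubectl apply --dry-run=server` with and without `--server-side`.

```shell
#!/usr/bin/env bash
# Dry-run one manifest through client-side and server-side apply and report
# which path the API server accepts. Nothing is persisted to the cluster.
compare_apply_modes() {
  local manifest="$1"

  if kubectl apply --dry-run=server -f "$manifest" >/dev/null 2>&1; then
    echo "client-side apply: accepted"
  else
    echo "client-side apply: rejected"
  fi

  if kubectl apply --server-side --dry-run=server -f "$manifest" >/dev/null 2>&1; then
    echo "server-side apply: accepted"
  else
    echo "server-side apply: rejected"
  fi
}

# Usage (rendered.yaml is a hypothetical file holding the `helm template` output):
#   compare_apply_modes rendered.yaml
```

If both paths report the same result, the difference between Helm and ArgoCD would need a different explanation than apply mode alone.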
Steps to Reproduce
- Install ArgoCD in a cluster
- Create an ArgoCD Application:
apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: gpu-operator
  namespace: argocd
spec:
  project: default
  source:
    chart: gpu-operator
    repoURL: https://helm.ngc.nvidia.com/nvidia
    targetRevision: v24.9.0
    helm:
      values: |
        driver:
          enabled: true
  destination:
    server: https://kubernetes.default.svc
    namespace: gpu-operator
  syncPolicy:
    automated:
      prune: true
      selfHeal: true
    syncOptions:
      - CreateNamespace=true
- Sync the application
- Observe the error:
ClusterPolicy.spec.operator missing required field "defaultRuntime"
Current Workaround
Users must explicitly set the value in ArgoCD Application:
helm:
  values: |
    operator:
      defaultRuntime: containerd
However, this doesn't actually work because the template doesn't render it!
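Whether the override reaches the rendered ClusterPolicy can be checked offline before syncing. The helper below is a sketch (the function name is hypothetical) that reads rendered manifests on stdin and reports, via its exit status, whether the ClusterPolicy document contains a defaultRuntime key.

```shell
#!/usr/bin/env bash
# Exit 0 if the ClusterPolicy document in a multi-document YAML stream on
# stdin contains an indented `defaultRuntime:` key, exit 1 otherwise.
has_default_runtime() {
  awk '
    /^---/                            { in_cp = 0 }   # start of a new YAML document
    /^kind:[ \t]*ClusterPolicy/       { in_cp = 1 }   # now inside the ClusterPolicy
    in_cp && /^[ \t]+defaultRuntime:/ { found = 1 }
    END { exit(found ? 0 : 1) }
  '
}

# Usage (assumes the `nvidia` Helm repo alias used in the verification step):
#   helm template gpu-operator nvidia/gpu-operator --version v24.9.0 \
#     --set operator.defaultRuntime=containerd | has_default_runtime \
#     && echo "defaultRuntime rendered" || echo "defaultRuntime NOT rendered"
```

The check is line-based rather than a full YAML parse, so it only distinguishes documents by their top-level `kind:` line.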
Alternative workaround - disable server-side apply:
syncPolicy:
  syncOptions:
    - ServerSideApply=false
Proposed Solution
Fix 1: Add defaultRuntime to Template (Recommended)
Update templates/clusterpolicy.yaml:
spec:
  operator:
    {{- if .Values.operator.defaultRuntime }}
    defaultRuntime: {{ .Values.operator.defaultRuntime }}
    {{- end }}
    {{- if .Values.operator.runtimeClass }}
    runtimeClass: {{ .Values.operator.runtimeClass }}
    {{- end }}
And ensure values.yaml has a default:
operator:
  defaultRuntime: docker  # or detect from cluster
Fix 2: Remove Required Constraint from CRD
If defaulting should handle this, consider making the field optional in the CRD and relying on the default value.
Fix 3: Add Mutating Webhook
Implement a mutating admission webhook to inject the default value before validation occurs.
Expected Behavior
GPU Operator should install successfully via ArgoCD without requiring users to:
- Explicitly set defaultRuntime in values (when template doesn't render it)
- Disable server-side apply
- Use workarounds
Additional Context
This issue affects all GitOps tools that use server-side apply by default (ArgoCD, Flux, etc.).
The combination of:
- CRD with required + default fields
- Helm template not rendering the field
- Server-side apply's strict validation
creates an incompatibility that only manifests in GitOps scenarios.
Related Issues
- Similar issues have been reported in the Kubernetes community regarding server-side apply strictness with required+default fields
- Replacing key in StringData secret result in extra key in live object kubernetes/kubernetes#108008
- Server-side apply: migration from client-side apply leaves stuck fields in the object kubernetes/kubernetes#99003
Suggested Priority
High - This breaks GPU Operator installation for all ArgoCD/GitOps users, which is a common deployment pattern in production environments.