
Pod Label not visible in DCGM Exporter Metrics #2009

@olumideajiboye-pixel

Important Note: NVIDIA AI Enterprise customers can get support from NVIDIA Enterprise support. Please open a case here.

Describe the bug
The GPU Operator is the recommended way to install DCGM Exporter; however, it does not appear to support enabling pod labels on the exported metrics.

To Reproduce
Deploy the GPU Operator and enable DCGM Exporter with the following extra environment variables:

  • DCGM_EXPORTER_KUBERNETES_ENABLE_POD_LABELS
  • DCGM_EXPORTER_KUBERNETES_ENABLE_POD_UID
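
For reference, the reproduction step above could look roughly like the following. This is a sketch under assumptions, not a confirmed configuration: it assumes the GPU Operator Helm chart accepts a `dcgmExporter.env` list, that the string `"true"` is an accepted value for both variables, and that overriding the list does not drop defaults you rely on. The file name `dcgm-values.yaml`, the release name, and the `gpu-operator` namespace are placeholders.

```shell
#!/bin/sh
# Write a Helm values override that sets the two environment variables
# named in this issue. The dcgmExporter.env key and the "true" values are
# assumptions; check them against the chart version you deploy.
VALUES_FILE="${VALUES_FILE:-dcgm-values.yaml}"

cat > "$VALUES_FILE" <<'EOF'
dcgmExporter:
  enabled: true
  env:
    - name: DCGM_EXPORTER_KUBERNETES_ENABLE_POD_LABELS
      value: "true"
    - name: DCGM_EXPORTER_KUBERNETES_ENABLE_POD_UID
      value: "true"
EOF

echo "wrote $VALUES_FILE"

# To apply it (requires cluster access; release name and namespace are
# placeholders):
#   helm upgrade --install gpu-operator nvidia/gpu-operator \
#     -n gpu-operator -f dcgm-values.yaml
```

With only these variables set, the exporter pods still lack the permissions listed under "Expected behavior", which is the gap this issue describes.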

Expected behavior
The deployment should:

  • Create a ClusterRole and a ClusterRoleBinding for the ServiceAccount used by the DCGM Exporter pods
  • Automount the ServiceAccount token so that the pods can read the Kubernetes API
  • Mount the kubelet path as a volume in the DCGM Exporter pods
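
As an illustration of the first item, a manual workaround might render the RBAC objects like this. It is a minimal sketch, not what the operator does today: the object name `dcgm-exporter-read-pods`, the helper `render_rbac`, the verb list, and the namespace and ServiceAccount names passed in the usage line are all assumptions to verify against your cluster.

```shell
#!/bin/sh
# Print a ClusterRole and ClusterRoleBinding that grant read access to pods.
# $1 = namespace of the exporter, $2 = its ServiceAccount name.
# All object names here are hypothetical.
render_rbac() {
  ns="$1"
  sa="$2"
  cat <<EOF
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  name: dcgm-exporter-read-pods
rules:
  - apiGroups: [""]
    resources: ["pods"]
    verbs: ["get", "list", "watch"]
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
  name: dcgm-exporter-read-pods
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: ClusterRole
  name: dcgm-exporter-read-pods
subjects:
  - kind: ServiceAccount
    name: ${sa}
    namespace: ${ns}
EOF
}

# Usage (requires cluster access; both arguments are assumptions):
#   render_rbac gpu-operator nvidia-dcgm-exporter | kubectl apply -f -
```

This covers only the RBAC objects. The token automount and the kubelet volume are set on the DaemonSet, which the operator manages, so manual changes there may be reverted when the operator reconciles.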

Environment (please provide the following information):

  • GPU Operator Version: v23.6.1
  • OS: Rocky-Linux-8.10
  • Kernel Version: [e.g. 6.8.0-generic]
  • Container Runtime Version: containerd v1.7.1
  • Kubernetes Distro and Version: K8s v1.24.12

Information to attach (optional if deemed irrelevant)

  • kubernetes pods status: kubectl get pods -n OPERATOR_NAMESPACE
  • kubernetes daemonset status: kubectl get ds -n OPERATOR_NAMESPACE
  • If a pod/ds is in an error state or pending state: kubectl describe pod -n OPERATOR_NAMESPACE POD_NAME
  • If a pod/ds is in an error state or pending state: kubectl logs -n OPERATOR_NAMESPACE POD_NAME --all-containers
  • Output from running nvidia-smi from the driver container: kubectl exec DRIVER_POD_NAME -n OPERATOR_NAMESPACE -c nvidia-driver-ctr -- nvidia-smi
  • containerd logs: journalctl -u containerd > containerd.log

Collecting full debug bundle (optional):

curl -o must-gather.sh -L https://raw.githubusercontent.com/NVIDIA/gpu-operator/main/hack/must-gather.sh
chmod +x must-gather.sh
./must-gather.sh

NOTE: please refer to the must-gather script for debug data collected.

This bundle can be submitted to us via email: operator_feedback@nvidia.com
