Operating system
Physical machine, Ubuntu 18.04, 4 CPU cores / 8 GB RAM
Kubernetes version
1.20.4
Container runtime
Client: Docker Engine - Community
 Version:           20.10.9
 API version:       1.41
 Go version:        go1.16.8
 Git commit:        c2ea9bc
 Built:             Mon Oct 4 16:08:29 2021
 OS/Arch:           linux/amd64
 Context:           default
 Experimental:      true
Server: Docker Engine - Community
 Engine:
  Version:          20.10.9
  API version:      1.41 (minimum version 1.12)
  Go version:       go1.16.8
  Git commit:       79ea9d3
  Built:            Mon Oct 4 16:06:34 2021
  OS/Arch:          linux/amd64
  Experimental:     true
 containerd:
  Version:          1.4.11
  GitCommit:        5b46e404f6b9f661a205e28d59c982d3634148f8
 nvidia:
  Version:          1.0.2
  GitCommit:        v1.0.2-0-g52b36a2
 docker-init:
  Version:          0.19.0
  GitCommit:        de40ad0
KubeSphere version
KubeSphere 3.2.0, installed with kk
What is the problem
After installing GPU monitoring, no GPU metrics show up in the custom monitoring templates.
speedbot@c172:/opt/gpu_test/train_params$ kubectl get pods -n gpu-operator-resources
NAME                                                              READY   STATUS      RESTARTS   AGE
gpu-feature-discovery-5xdhd                                       1/1     Running     2          2d3h
gpu-feature-discovery-p2xw5                                       1/1     Running     0          2d3h
gpu-operator-1639972176-node-feature-discovery-master-dbf7ck98n   1/1     Running     0          2d3h
gpu-operator-1639972176-node-feature-discovery-worker-8bdl8       1/1     Running     0          2d3h
gpu-operator-1639972176-node-feature-discovery-worker-c6s69       1/1     Running     0          2d3h
gpu-operator-1639972176-node-feature-discovery-worker-nq58w       1/1     Running     2          2d3h
gpu-operator-868b78d4d8-v5zj2                                     1/1     Running     0          2d3h
nvidia-container-toolkit-daemonset-nl8w6                          1/1     Running     1          2d3h
nvidia-container-toolkit-daemonset-wplq9                          1/1     Running     0          2d3h
nvidia-cuda-validator-7kgxh                                       0/1     Completed   0          46h
nvidia-cuda-validator-pmgxn                                       0/1     Completed   0          2d3h
nvidia-dcgm-5vmjg                                                 1/1     Running     2          2d3h
nvidia-dcgm-exporter-4hrrw                                        1/1     Running     3          2d3h
nvidia-dcgm-exporter-v24tv                                        1/1     Running     2          2d3h
nvidia-dcgm-qsrwd                                                 1/1     Running     0          2d3h
nvidia-device-plugin-daemonset-8nmvp                              1/1     Running     2          2d3h
nvidia-device-plugin-daemonset-hl95b                              1/1     Running     0          2d3h
nvidia-device-plugin-validator-bbjsh                              0/1     Completed   0          2d3h
nvidia-device-plugin-validator-t2g99                              0/1     Completed   0          46h
nvidia-operator-validator-mdnqr                                   1/1     Running     0          2d3h
nvidia-operator-validator-rnxp6                                   1/1     Running     1          2d3h
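
Every pod in the listing above is Running or Completed, so one way to narrow this down might be to check whether a dcgm-exporter pod actually serves GPU metrics before looking at the KubeSphere side. The following is only a rough sketch, not a confirmed fix. It assumes the exporter listens on its default port 9400 and that `curl` is available on the machine running `kubectl`. The helper names `count_dcgm_samples` and `check_exporter_pod` are hypothetical, introduced only for this example.

```shell
#!/usr/bin/env bash
# Sketch: check whether a dcgm-exporter pod serves any GPU metrics.
# Assumptions (not stated in the post): the exporter uses its default
# port 9400, and curl is installed where kubectl is run.

# Read Prometheus text-format metrics on stdin and print how many sample
# lines start with "DCGM_" (comment lines such as "# HELP" are ignored).
count_dcgm_samples() {
  # grep -c exits non-zero when the count is 0, so mask the exit status.
  grep -c '^DCGM_' || true
}

# Port-forward one exporter pod, fetch /metrics, and print the sample count.
# Usage: check_exporter_pod <namespace> <pod> [port]
check_exporter_pod() {
  local ns="$1" pod="$2" port="${3:-9400}"
  kubectl -n "$ns" port-forward "pod/$pod" "$port:$port" >/dev/null 2>&1 &
  local pf_pid=$!
  sleep 2  # give the port-forward a moment to come up
  curl -s "http://127.0.0.1:$port/metrics" | count_dcgm_samples
  kill "$pf_pid" 2>/dev/null
}
```

With the pod names from the listing, this could be called as `check_exporter_pod gpu-operator-resources nvidia-dcgm-exporter-4hrrw`. A result of 0 would point at the exporter itself. A non-zero result would suggest the exporter is fine and the gap is somewhere between it and the Prometheus instance that KubeSphere queries, for example a scrape configuration that does not cover the `gpu-operator-resources` namespace.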