I have recently been learning k8s and GPUs, and the deployment keeps failing at this step:
FAILED - RETRYING: kubesphere-installer is ok (30 retries left).
1. Running sudo kubectl get pod --all-namespaces showed that the following pod is stuck:
nvidia-driver-installer-dfjr4 0/1 Init:0/1
2. systemadmin@master01:$ sudo kubectl logs nvidia-driver-installer-dfjr4 -n kube-system
Error from server (BadRequest): container "pause" in pod "nvidia-driver-installer-dfjr4" is waiting to start: PodInitializing
3. systemadmin@master01:$ sudo kubectl describe nvidia-driver-installer-dfjr4 -n kube-system
error: the server doesn't have a resource type "nvidia-driver-installer-dfjr4"
systemadmin@master01:$ sudo kubectl describe pods/nvidia-driver-installer-dfjr4 -n kube-system
Name: nvidia-driver-installer-dfjr4
Namespace: kube-system
Priority: 2000001000
Priority Class Name: system-node-critical
Node: node02/10.20.XXX.XXX
Start Time: Wed, 10 Jun 2020 11:18:23 +0800
Labels: controller-revision-hash=5b4c4b9dd8
name=nvidia-driver-installer
pod-template-generation=1
Annotations: <none>
Status: Pending
IP: 10.20.XXX.XXX
IPs:
IP: 10.20.XXX.XXX
Controlled By: DaemonSet/nvidia-driver-installer
Init Containers:
nvidia-driver-installer:
Container ID: docker://7f780b55a852d37681b8fd86db0583c839b296c55a9b33689d16407190a61723
Image: kubesphere/ubuntu-nvidia-driver-installer:lates
Image ID: docker-pullable://kubesphere/ubuntu-nvidia-driver-installer@sha256:7df76a0f0a17294e86f691c81de6bbb7c04a1b4b3d4ea4e7e2cccdc42e1f6d63
Port: <none>
Host Port: <none>
State: Running
Started: Wed, 10 Jun 2020 11:26:51 +0800
Last State: Terminated
Reason: Error
Exit Code: 1
Started: Wed, 10 Jun 2020 11:22:36 +0800
Finished: Wed, 10 Jun 2020 11:26:36 +0800
Ready: False
Restart Count: 2
Requests:
cpu: 150m
Environment:
NVIDIA_INSTALL_DIR_HOST: /home/kubernetes/bin/nvidia
NVIDIA_INSTALL_DIR_CONTAINER: /usr/local/nvidia
ROOT_MOUNT_DIR: /root
NVIDIA_DRIVER_VERSION: 387.26
NVIDIA_DRIVER_DOWNLOAD_URL: https://dl-test.sh1a.qingstor.com/NVIDIA-Linux-x86_64-387.26.run
Mounts:
/dev from dev (rw)
/root from root-mount (rw)
/usr/local/nvidia from nvidia-install-dir-host (rw)
/var/run/secrets/kubernetes.io/serviceaccount from default-token-5cdk4 (ro)
Containers:
pause:
Container ID:
Image: mirrorgooglecontainers/pause-amd64:3.1
Image ID:
Port: <none>
Host Port: <none>
State: Waiting
Reason: PodInitializing
Ready: False
Restart Count: 0
Environment: <none>
Mounts:
/var/run/secrets/kubernetes.io/serviceaccount from default-token-5cdk4 (ro)
Conditions:
Type Status
Initialized False
Ready False
ContainersReady False
PodScheduled True
Volumes:
dev:
Type: HostPath (bare host directory volume)
Path: /dev
HostPathType:
nvidia-install-dir-host:
Type: HostPath (bare host directory volume)
Path: /home/kubernetes/bin/nvidia
HostPathType:
root-mount:
Type: HostPath (bare host directory volume)
Path: /
HostPathType:
default-token-5cdk4:
Type: Secret (a volume populated by a Secret)
SecretName: default-token-5cdk4
Optional: false
QoS Class: Burstable
Node-Selectors: <none>
Tolerations: node.kubernetes.io/disk-pressure:NoSchedule
node.kubernetes.io/memory-pressure:NoSchedule
node.kubernetes.io/network-unavailable:NoSchedule
node.kubernetes.io/not-ready:NoExecute
node.kubernetes.io/pid-pressure:NoSchedule
node.kubernetes.io/unreachable:NoExecute
node.kubernetes.io/unschedulable:NoSchedule
nvidia.com/gpu:NoSchedule
Events:
Type     Reason     Age                  From               Message
Normal   Scheduled  <unknown>            default-scheduler  Successfully assigned kube-system/nvidia-driver-installer-dfjr4 to node02
Warning  BackOff    91s                  kubelet, node02    Back-off restarting failed container
Normal   Pulled     78s (x3 over 9m44s)  kubelet, node02    Container image "kubesphere/ubuntu-nvidia-driver-installer:lates" already present on machine
Normal   Created    77s (x3 over 9m44s)  kubelet, node02    Created container nvidia-driver-installer
Normal   Started    77s (x3 over 9m44s)  kubelet, node02    Started container nvidia-driver-installer
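
The describe output above shows that the container that keeps failing is the init container nvidia-driver-installer (Last State: Terminated, Exit Code: 1, Restart Count: 2). The kubectl logs call in step 2 did not name a container, so it went to the regular container pause, which is still waiting. A minimal sketch of how the init container's logs could be fetched instead, assuming a standard kubectl: the helper function names init_logs and init_logs_previous are hypothetical and only for illustration, while the pod, namespace, and container names are taken from the output above.

```shell
# Hypothetical helpers; the function names are illustrative and are not
# part of kubectl or KubeSphere.

# init_logs POD NAMESPACE CONTAINER
# Print the logs of one named container in a pod. The -c flag selects the
# container, so an init container can be read while the pod is still
# in PodInitializing.
init_logs() {
  kubectl logs "$1" -n "$2" -c "$3"
}

# init_logs_previous POD NAMESPACE CONTAINER
# Same, but for the previous (already terminated) run of that container,
# which is the run that reported the non-zero exit code.
init_logs_previous() {
  kubectl logs "$1" -n "$2" -c "$3" --previous
}

# Usage (prefix kubectl with sudo inside the functions if your setup
# needs it, as in the commands above):
#   init_logs nvidia-driver-installer-dfjr4 kube-system nvidia-driver-installer
#   init_logs_previous nvidia-driver-installer-dfjr4 kube-system nvidia-driver-installer
```

Whatever the installer printed before exiting with code 1 should appear there. This only shows where to look; it does not by itself say what the cause is.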