
Installer freezes node on updating_container_ld_cache #80

Closed
heroic opened this issue Jul 17, 2018 · 7 comments


heroic commented Jul 17, 2018

Same as #71. Running on COS (Container-Optimized OS). Cluster version: 1.10.5-gke.0. Here's the log dump: https://gist.github.com/heroic/5bdc756732a8ec5d5081227d1cbb2048
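For anyone who wants to pull the installer logs on their own cluster, something like the following should work; the label and container name here are assumptions based on the stock nvidia-driver-installer DaemonSet manifest, so adjust them to match your deployment:

# find the installer pod running on the affected node
kubectl get pods -n kube-system -l k8s-app=nvidia-driver-installer -o wide
# then tail its installer container's logs
kubectl logs -n kube-system <installer-pod-name> -c nvidia-driver-installer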

heroic (Author) commented Jul 17, 2018

@mindprince Any clues?

rohitagarwal003 (Contributor) commented Jul 17, 2018 via email

heroic (Author) commented Jul 17, 2018

@mindprince The container I'm using on this node doesn't have a /usr/local/nvidia directory. Shouldn't that be exposed to all containers by nvidia-gpu-device-plugin once the installer finishes?
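As far as I understand, the device plugin mounts the driver libraries only into containers that actually request a GPU, not into every container on the node. A minimal sketch of a pod that would get /usr/local/nvidia mounted, assuming the standard nvidia.com/gpu extended resource (names and image are placeholders):

apiVersion: v1
kind: Pod
metadata:
  name: gpu-example                  # hypothetical name
spec:
  containers:
  - name: cuda-container
    image: example.gcr.io/cuda-app   # placeholder image
    resources:
      limits:
        nvidia.com/gpu: 1            # requesting the extended resource is what triggers the mount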

rohitagarwal003 (Contributor) commented Jul 17, 2018 via email

heroic (Author) commented Jul 17, 2018

Yep. I am looking in the GPU pod itself. Here's the pod's YAML:

apiVersion: v1
kind: Pod
metadata:
  annotations:
    prometheus.io.scrape: "false"
  creationTimestamp: 2018-07-17T21:29:07Z
  generateName: ultron-776bb98fbb-
  labels:
    faas_function: ultron
    pod-template-hash: "3326654966"
    uid: "861317054"
  name: ultron-776bb98fbb-7zxhv
  namespace: openfaas-fn
  ownerReferences:
  - apiVersion: extensions/v1beta1
    blockOwnerDeletion: true
    controller: true
    kind: ReplicaSet
    name: ultron-776bb98fbb
    uid: 0688f010-8a06-11e8-a53c-42010a80009d
  resourceVersion: "50748"
  selfLink: /api/v1/namespaces/openfaas-fn/pods/ultron-776bb98fbb-7zxhv
  uid: 6fc5e972-8a08-11e8-a53c-42010a80009d
spec:
  affinity:
    podAntiAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
      - labelSelector:
          matchExpressions:
          - key: faas_function
            operator: In
            values:
            - ultron
        topologyKey: kubernetes.io/hostname
  containers:
  - env:
    - name: read_timeout
      value: 300s
    - name: write_timeout
      value: 300s
    - name: ack_wait
      value: 300s
    - name: exec_timeout
      value: 300s
    image: asia.gcr.io/galaxycard-490d9/ultron
    imagePullPolicy: Always
    livenessProbe:
      exec:
        command:
        - cat
        - /tmp/.lock
      failureThreshold: 3
      initialDelaySeconds: 3
      periodSeconds: 10
      successThreshold: 1
      timeoutSeconds: 1
    name: ultron
    ports:
    - containerPort: 8080
      protocol: TCP
    readinessProbe:
      exec:
        command:
        - cat
        - /tmp/.lock
      failureThreshold: 3
      initialDelaySeconds: 3
      periodSeconds: 10
      successThreshold: 1
      timeoutSeconds: 1
    resources: {}
    terminationMessagePath: /dev/termination-log
    terminationMessagePolicy: File
    volumeMounts:
    - mountPath: /var/run/secrets/kubernetes.io/serviceaccount
      name: default-token-fwk55
      readOnly: true
  dnsPolicy: ClusterFirst
  nodeName: gke-faas-pool-4-cpu-8-ram-26f8627d-g1ms
  restartPolicy: Always
  schedulerName: default-scheduler
  securityContext: {}
  serviceAccount: default
  serviceAccountName: default
  terminationGracePeriodSeconds: 30
  tolerations:
  - effect: NoExecute
    key: node.kubernetes.io/not-ready
    operator: Exists
    tolerationSeconds: 300
  - effect: NoExecute
    key: node.kubernetes.io/unreachable
    operator: Exists
    tolerationSeconds: 300
  volumes:
  - name: default-token-fwk55
    secret:
      defaultMode: 420
      secretName: default-token-fwk55

and here's what's contained in /usr/local:

root@ultron-776bb98fbb-7zxhv:/home/app# ls /usr/local
bin  etc  games  include  lib  man  sbin  share  src
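Worth noting: the spec above sets resources: {} and never requests nvidia.com/gpu, so the device plugin has no reason to mount /usr/local/nvidia into this container. A sketch of the change that would normally trigger the mount (not confirmed as the exact fix in this thread):

    resources:
      limits:
        nvidia.com/gpu: 1

With a limit like that in place, /usr/local/nvidia should appear inside the container once the installer and device plugin are healthy on the node.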

heroic (Author) commented Jul 17, 2018

@mindprince Found the issue! Closing this! Thanks for bearing with me!

heroic closed this as completed Jul 17, 2018
rohitagarwal003 (Contributor) commented Jul 17, 2018 via email
