Running Pod takes a long time for internal service to be accessible

I have implemented a gRPC service, built it into a container image, and deployed it as a DaemonSet with Kubernetes (specifically AWS EKS).

The Pod starts and reaches the Running state quickly, but it takes a long time before the actual service is accessible, typically around 300s.

In fact, when I run kubectl logs to print the Pod's logs, the output stays empty for quite a while.

I log something right at the start of the service. In fact, my code looks like:

package main

import "log"

func init() {
    log.Println("init")
}

func main() {
    // ...
}

So I am quite sure that when there are no logs, the service has not started yet.

I understand there can be a gap between the Pod being Running and the actual process inside it running. However, 300s is far too long for me.

Also, this happens randomly; sometimes the service is ready almost immediately. By the way, my runtime image is based on chromedp headless-shell, not sure whether that is relevant.

Can anyone give some advice on how to debug and locate the problem? Thanks a lot!


Update

I have not set any readiness probes.

Running kubectl get -o yaml on my DaemonSet gives:

apiVersion: apps/v1
kind: DaemonSet
metadata:
  annotations:
    deprecated.daemonset.template.generation: "1"
  creationTimestamp: "2021-10-13T06:30:16Z"
  generation: 1
  labels:
    app: worker
    uuid: worker
  name: worker
  namespace: collection-14f45957-e268-4719-88c3-50b533b0ae66
  resourceVersion: "47265945"
  uid: 88e4671f-9e33-43ef-9c49-b491dcb578e4
spec:
  revisionHistoryLimit: 10
  selector:
    matchLabels:
      app: worker
      uuid: worker
  template:
    metadata:
      annotations:
        prometheus.io/path: /metrics
        prometheus.io/port: "2112"
        prometheus.io/scrape: "true"
      creationTimestamp: null
      labels:
        app: worker
        uuid: worker
    spec:
      containers:
      - env:
        - name: GRPC_PORT
          value: "22345"
        - name: DEBUG
          value: "false"
        - name: TARGET
          value: localhost:12345
        - name: TRACKER
          value: 10.100.255.31:12345
        - name: MONITOR
          value: 10.100.125.35:12345
        - name: COLLECTABLE_METHODS
          value: shopping.ShoppingService.GetShop
        - name: POD_IP
          valueFrom:
            fieldRef:
              apiVersion: v1
              fieldPath: status.podIP
        - name: DISTRIBUTABLE_METHODS
          value: collection.CollectionService.EnumerateShops
        - name: PERFORM_TASK_INTERVAL
          value: 0.000000s
        image: xxx
        imagePullPolicy: Always
        name: worker
        ports:
        - containerPort: 22345
          protocol: TCP
        resources:
          requests:
            cpu: 1800m
            memory: 1Gi
        terminationMessagePath: /dev/termination-log
        terminationMessagePolicy: File
      - env:
        - name: CAPTCHA_PARALLEL
          value: "32"
        - name: HTTP_PROXY
          value: http://10.100.215.25:8080
        - name: HTTPS_PROXY
          value: http://10.100.215.25:8080
        - name: API
          value: 10.100.111.11:12345
        - name: NO_PROXY
          value: 10.100.111.11:12345
        - name: POD_IP
        image: xxx
        imagePullPolicy: Always
        name: source
        ports:
        - containerPort: 12345
          protocol: TCP
        resources: {}
        terminationMessagePath: /dev/termination-log
        terminationMessagePolicy: File
        volumeMounts:
        - mountPath: /etc/ssl/certs/api.crt
          name: ca
          readOnly: true
          subPath: tls.crt
      dnsPolicy: ClusterFirst
      nodeSelector:
        api/nodegroup-app: worker
      restartPolicy: Always
      schedulerName: default-scheduler
      securityContext: {}
      terminationGracePeriodSeconds: 30
      volumes:
      - name: ca
        secret:
          defaultMode: 420
          secretName: ca
  updateStrategy:
    rollingUpdate:
      maxSurge: 0
      maxUnavailable: 1
    type: RollingUpdate
status:
  currentNumberScheduled: 2
  desiredNumberScheduled: 2
  numberAvailable: 2
  numberMisscheduled: 0
  numberReady: 2
  observedGeneration: 1
  updatedNumberScheduled: 2

Also, there are two containers in the Pod. Only one of them is particularly slow to start; the other has always been fine.

When you use HTTP_PROXY in your solution, be aware of how its routing may differ from your underlying cluster networking - this often leads to unexpected timeouts.

I have posted a community wiki answer to summarize the topic:

As gohm'c mentioned in the comments:

Do connections made by container "source" always have to go thru HTTP_PROXY, even if it is connecting services in the cluster - do you think possible long time been taken because of proxy? Can try kubectl exec -it <pod> -c <source> -- sh and curl/wget external services.

That is a good observation. Note that some connections could be made directly, and pushing that extra traffic through the proxy may introduce delays; for example, the proxy can become a bottleneck. You can read more about using an HTTP proxy to access the Kubernetes API in the documentation.
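If in-cluster traffic does turn out to go through the proxy, one option is to widen NO_PROXY on the "source" container so cluster-internal destinations bypass it. A minimal sketch of that env section (the 10.100.0.0/16 CIDR and the .svc suffixes are assumptions standing in for the cluster's real service/pod ranges, and CIDR support in NO_PROXY depends on the client library):

      - name: source
        env:
        # existing proxy settings from the manifest above
        - name: HTTP_PROXY
          value: http://10.100.215.25:8080
        - name: HTTPS_PROXY
          value: http://10.100.215.25:8080
        # widened NO_PROXY so cluster-internal traffic skips the proxy;
        # 10.100.0.0/16 is a placeholder for the real service/pod CIDR
        - name: NO_PROXY
          value: localhost,127.0.0.1,10.100.111.11:12345,10.100.0.0/16,.svc,.svc.cluster.local

Listing exact hosts, as the existing NO_PROXY entry already does, is the safer choice when the client's NO_PROXY parsing is unknown.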

Additionally, you can create readiness probes so that Kubernetes knows when a container is ready to start accepting traffic.

A Pod is considered ready when all of its containers are ready. One use of this signal is to control which Pods are used as backends for Services. When a Pod is not ready, it is removed from Service load balancers.
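For the gRPC "worker" container in the manifest above, a plain TCP check against the gRPC port is enough to mark the Pod ready only once the server is actually listening. A sketch under that assumption (the timing values are guesses to tune):

      containers:
      - name: worker
        # ...
        readinessProbe:
          tcpSocket:
            port: 22345        # the worker's GRPC_PORT
          initialDelaySeconds: 5
          periodSeconds: 10
          failureThreshold: 3

Newer Kubernetes versions also offer a native grpc probe type, but a TCP check works without requiring the gRPC health-checking service to be implemented.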

The kubelet uses startup probes to know when a container application has started. If such a probe is configured, it disables liveness and readiness checks until it succeeds, making sure those probes don't interfere with the application startup. This can be used to adopt liveness checks on slow starting containers, avoiding them getting killed by the kubelet before they are up and running.
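Since this container sometimes needs several minutes before it starts listening, a startup probe can hold the other probes back during that window. A sketch allowing up to 300s, matching the delay observed in the question:

        startupProbe:
          tcpSocket:
            port: 22345
          periodSeconds: 10
          failureThreshold: 30   # up to 30 x 10s = 300s before the container is restarted

Once the startup probe succeeds, the readiness (and any configured liveness) probes take over as usual.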