RabbitMQ pod is crashing unexpectedly

I have a pod running RabbitMQ. Here is the deployment manifest:

apiVersion: v1
kind: Service
metadata:
  name: service-rabbitmq
spec:
  selector:
    app: service-rabbitmq
  ports:
    - port: 5672
      targetPort: 5672
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: deployment-rabbitmq
spec:
  selector:
    matchLabels:
      app: deployment-rabbitmq
  template:
    metadata:
      labels:
        app: deployment-rabbitmq
    spec:
      containers:
        - name: rabbitmq
          image: rabbitmq:latest
          volumeMounts:
            - name: rabbitmq-data-volume
              mountPath: /var/lib/rabbitmq
          resources:
            requests:
              cpu: 250m
              memory: 128Mi
            limits:
              cpu: 750m
              memory: 256Mi
      volumes:
        - name: rabbitmq-data-volume
          persistentVolumeClaim:
            claimName: rabbitmq-pvc

When I deploy this to my local cluster, the pod runs for a while and then crashes, so it is basically stuck in a crash loop. Here are the logs I get from the pod:

$ kubectl logs deployment-rabbitmq-649b8479dc-kt9s4
2021-10-14 06:46:36.182390+00:00 [info] <0.222.0> Feature flags: list of feature flags found:
2021-10-14 06:46:36.221717+00:00 [info] <0.222.0> Feature flags:   [ ] implicit_default_bindings
2021-10-14 06:46:36.221768+00:00 [info] <0.222.0> Feature flags:   [ ] maintenance_mode_status
2021-10-14 06:46:36.221792+00:00 [info] <0.222.0> Feature flags:   [ ] quorum_queue
2021-10-14 06:46:36.221813+00:00 [info] <0.222.0> Feature flags:   [ ] stream_queue
2021-10-14 06:46:36.221916+00:00 [info] <0.222.0> Feature flags:   [ ] user_limits
2021-10-14 06:46:36.221933+00:00 [info] <0.222.0> Feature flags:   [ ] virtual_host_metadata
2021-10-14 06:46:36.221953+00:00 [info] <0.222.0> Feature flags: feature flag states written to disk: yes
2021-10-14 06:46:37.018537+00:00 [noti] <0.44.0> Application syslog exited with reason: stopped
2021-10-14 06:46:37.018646+00:00 [noti] <0.222.0> Logging: switching to configured handler(s); following messages may not be visible in this log output
2021-10-14 06:46:37.045601+00:00 [noti] <0.222.0> Logging: configured log handlers are now ACTIVE
2021-10-14 06:46:37.635024+00:00 [info] <0.222.0> ra: starting system quorum_queues
2021-10-14 06:46:37.635139+00:00 [info] <0.222.0> starting Ra system: quorum_queues in directory: /var/lib/rabbitmq/mnesia/rabbit@deployment-rabbitmq-649b8479dc-kt9s4/quorum/rabbit@deployment-rabbitmq-649b8479dc-kt9s4
2021-10-14 06:46:37.849041+00:00 [info] <0.259.0> ra: meta data store initialised for system quorum_queues. 0 record(s) recovered
2021-10-14 06:46:37.877504+00:00 [noti] <0.264.0> WAL: ra_log_wal init, open tbls: ra_log_open_mem_tables, closed tbls: ra_log_closed_mem_tables
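
Since the pod keeps restarting, the logs of the previous container instance can also be requested with the --previous flag (same pod name; output omitted here):

$ kubectl logs --previous deployment-rabbitmq-649b8479dc-kt9s4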

These logs are not very helpful, and I cannot find any error message in them. The only potentially useful line is Application syslog exited with reason: stopped, but as far as I can tell that is not enough to go on. The event log does not help either:

$ kubectl describe pods deployment-rabbitmq-649b8479dc-kt9s4
Name:         deployment-rabbitmq-649b8479dc-kt9s4
Namespace:    default
Priority:     0
Node:         docker-desktop/192.168.65.4
Start Time:   Thu, 14 Oct 2021 12:45:03 +0600
Labels:       app=deployment-rabbitmq
              pod-template-hash=649b8479dc
              skaffold.dev/run-id=7af5e1bb-e0c8-4021-a8a0-0c8bf43630b6
Annotations:  <none>
Status:       Running
IP:           10.1.5.138
IPs:
  IP:           10.1.5.138
Controlled By:  ReplicaSet/deployment-rabbitmq-649b8479dc
Containers:
  rabbitmq:
    Container ID:   docker://de309f94163c071afb38fb8743d106923b6bda27325287e82bc274e362f1f3be
    Image:          rabbitmq:latest
    Image ID:       docker-pullable://rabbitmq@sha256:d8efe7b818e66a13fdc6fdb84cf527984fb7d73f52466833a20e9ec298ed4df4
    Port:           <none>
    Host Port:      <none>
    State:          Waiting
      Reason:       CrashLoopBackOff
    Last State:     Terminated
      Reason:       OOMKilled
      Exit Code:    0
      Started:      Thu, 14 Oct 2021 13:56:29 +0600
      Finished:     Thu, 14 Oct 2021 13:56:39 +0600
    Ready:          False
    Restart Count:  18
    Limits:
      cpu:     750m
      memory:  256Mi
    Requests:
      cpu:        250m
      memory:     128Mi
    Environment:  <none>
    Mounts:
      /var/lib/rabbitmq from rabbitmq-data-volume (rw)
      /var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-9shdv (ro)
Conditions:
  Type              Status
  Initialized       True
  Ready             False
  ContainersReady   False
  PodScheduled      True
Volumes:
  rabbitmq-data-volume:
    Type:       PersistentVolumeClaim (a reference to a PersistentVolumeClaim in the same namespace)
    ClaimName:  rabbitmq-pvc
    ReadOnly:   false
  kube-api-access-9shdv:
    Type:                    Projected (a volume that contains injected data from multiple sources)
    TokenExpirationSeconds:  3607
    ConfigMapName:           kube-root-ca.crt
    ConfigMapOptional:       <nil>
    DownwardAPI:             true
QoS Class:                   Burstable
Node-Selectors:              <none>
Tolerations:                 node.kubernetes.io/not-ready:NoExecute op=Exists for 300s
                             node.kubernetes.io/unreachable:NoExecute op=Exists for 300s
Events:
  Type     Reason   Age                    From     Message
  ----     ------   ----                   ----     -------
  Normal   Pulled   23m (x6 over 50m)      kubelet  (combined from similar events): Successfully pulled image "rabbitmq:latest" in 4.267310231s
  Normal   Pulling  18m (x16 over 73m)     kubelet  Pulling image "rabbitmq:latest"
  Warning  BackOff  3m45s (x307 over 73m)  kubelet  Back-off restarting failed container

What could be the cause of this crash loop?

NOTE: rabbitmq-pvc is successfully bound. No issue there.
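
The binding status can be verified with (output omitted):

$ kubectl get pvc rabbitmq-pvc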

UPDATE:

It was suggested that RabbitMQ should be deployed as a StatefulSet, so I adjusted the manifest like this:

apiVersion: v1
kind: Service
metadata:
  name: service-rabbitmq
spec:
  selector:
    app: service-rabbitmq
  ports:
    - name: rabbitmq-amqp
      port: 5672
    - name: rabbitmq-http
      port: 15672
---
apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: statefulset-rabbitmq
spec:
  selector:
    matchLabels:
      app: statefulset-rabbitmq
  serviceName: service-rabbitmq
  template:
    metadata:
      labels:
        app: statefulset-rabbitmq
    spec:
      containers:
        - name: rabbitmq
          image: rabbitmq:latest
          volumeMounts:
            - name: rabbitmq-data-volume
              mountPath: /var/lib/rabbitmq/mnesia
          resources:
            requests:
              cpu: 250m
              memory: 128Mi
            limits:
              cpu: 750m
              memory: 256Mi
      volumes:
        - name: rabbitmq-data-volume
          persistentVolumeClaim:
            claimName: rabbitmq-pvc

The pod is still in a crash loop, but the logs are slightly different.

$ kubectl logs statefulset-rabbitmq-0
2021-10-14 09:38:26.138224+00:00 [info] <0.222.0> Feature flags: list of feature flags found:
2021-10-14 09:38:26.158953+00:00 [info] <0.222.0> Feature flags:   [x] implicit_default_bindings
2021-10-14 09:38:26.159015+00:00 [info] <0.222.0> Feature flags:   [x] maintenance_mode_status
2021-10-14 09:38:26.159037+00:00 [info] <0.222.0> Feature flags:   [x] quorum_queue
2021-10-14 09:38:26.159078+00:00 [info] <0.222.0> Feature flags:   [x] stream_queue
2021-10-14 09:38:26.159183+00:00 [info] <0.222.0> Feature flags:   [x] user_limits
2021-10-14 09:38:26.159236+00:00 [info] <0.222.0> Feature flags:   [x] virtual_host_metadata
2021-10-14 09:38:26.159270+00:00 [info] <0.222.0> Feature flags: feature flag states written to disk: yes
2021-10-14 09:38:26.830814+00:00 [noti] <0.44.0> Application syslog exited with reason: stopped
2021-10-14 09:38:26.830925+00:00 [noti] <0.222.0> Logging: switching to configured handler(s); following messages may not be visible in this log output
2021-10-14 09:38:26.852048+00:00 [noti] <0.222.0> Logging: configured log handlers are now ACTIVE
2021-10-14 09:38:33.754355+00:00 [info] <0.222.0> ra: starting system quorum_queues
2021-10-14 09:38:33.754526+00:00 [info] <0.222.0> starting Ra system: quorum_queues in directory: /var/lib/rabbitmq/mnesia/rabbit@statefulset-rabbitmq-0/quorum/rabbit@statefulset-rabbitmq-0
2021-10-14 09:38:33.760365+00:00 [info] <0.290.0> ra: meta data store initialised for system quorum_queues. 0 record(s) recovered
2021-10-14 09:38:33.761023+00:00 [noti] <0.302.0> WAL: ra_log_wal init, open tbls: ra_log_open_mem_tables, closed tbls: ra_log_closed_mem_tables

The feature flags are now all marked as enabled. Nothing else has changed noticeably, so I still need help.

Answer:

Heads up: the pod is being OOMKilled (see Last State and its Reason in the describe output). You need to assign more resources (memory) to the pod.
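
A minimal sketch of that adjustment, assuming roughly 512Mi is enough for this RabbitMQ instance (the value here is an assumption and should be sized for your workload), is to raise the memory limit in the container's resources block:

          resources:
            requests:
              cpu: 250m
              memory: 256Mi
            limits:
              cpu: 750m
              memory: 512Mi  # assumed value; size it from observed usage

If metrics-server is installed, kubectl top pod can help pick a realistic limit by showing the container's actual memory usage over time.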