Why is there downtime while rolling-updating a Deployment, or even while scaling down a ReplicaSet?
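According to the official Kubernetes documentation:
Rolling updates allow Deployments' update to take place with zero downtime by incrementally updating Pods instances with new ones
I tried to perform a zero-downtime update using the Rolling Update strategy, which is the recommended way to update an application in a kube cluster.
Official reference:
https://kubernetes.io/docs/tutorials/kubernetes-basics/update/update-intro/
But when I actually did it, I got a little confused about that definition: the application still experienced downtime. Here is my cluster info at the beginning: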
liguuudeiMac:~ liguuu$ kubectl get all
NAME READY STATUS RESTARTS AGE
pod/ubuntu-b7d6cb9c6-6bkxz 1/1 Running 0 3h16m
pod/webapp-deployment-6dcf7b88c7-4kpgc 1/1 Running 0 3m52s
pod/webapp-deployment-6dcf7b88c7-4vsch 1/1 Running 0 3m52s
pod/webapp-deployment-6dcf7b88c7-7xzsk 1/1 Running 0 3m52s
pod/webapp-deployment-6dcf7b88c7-jj8vx 1/1 Running 0 3m52s
pod/webapp-deployment-6dcf7b88c7-qz2xq 1/1 Running 0 3m52s
pod/webapp-deployment-6dcf7b88c7-s7rtt 1/1 Running 0 3m52s
pod/webapp-deployment-6dcf7b88c7-s88tb 1/1 Running 0 3m52s
pod/webapp-deployment-6dcf7b88c7-snmw5 1/1 Running 0 3m52s
pod/webapp-deployment-6dcf7b88c7-v287f 1/1 Running 0 3m52s
pod/webapp-deployment-6dcf7b88c7-vd4kb 1/1 Running 0 3m52s
NAME TYPE CLUSTER-IP EXTERNAL-IP PORT(S) AGE
service/kubernetes ClusterIP 10.96.0.1 <none> 443/TCP 3h16m
service/tc-webapp-service NodePort 10.104.32.134 <none> 1234:31234/TCP 3m52s
NAME READY UP-TO-DATE AVAILABLE AGE
deployment.apps/ubuntu 1/1 1 1 3h16m
deployment.apps/webapp-deployment 10/10 10 10 3m52s
NAME DESIRED CURRENT READY AGE
replicaset.apps/ubuntu-b7d6cb9c6 1 1 1 3h16m
replicaset.apps/webapp-deployment-6dcf7b88c7 10 10 10 3m52s
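deployment.apps/webapp-deployment is a Tomcat-based webapp application, and the Service tc-webapp-service maps to the Pods that contain the Tomcat containers (the full deployment config files are at the end of the post). deployment.apps/ubuntu is just a standalone app in the cluster that sends HTTP requests to tc-webapp-service in an endless loop, so that I can trace the status of the so-called rolling update of webapp-deployment. The command run in the ubuntu container looks roughly like this (an infinite loop of curl, one request every 0.01 seconds):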
for ((;;)); do curl -sS -D - http://tc-webapp-service:1234 -o /dev/null | grep HTTP; date +"%Y-%m-%d %H:%M:%S"; echo ; sleep 0.01 ; done;
Output of the ubuntu app (everything is fine):
...
HTTP/1.1 200
2019-08-30 07:27:15
...
HTTP/1.1 200
2019-08-30 07:27:16
...
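Then I changed the tag of the tomcat image from 8-jdk8 to 8-jdk11. Note that the rolling update strategy of deployment.apps/webapp-deployment is configured with maxSurge 0 and maxUnavailable 9. (The result is the same if these two attributes are left at their defaults.)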
...
spec:
  containers:
  - name: tc-part
    image: tomcat:8-jdk8 -> tomcat:8-jdk11
...
Then, the output of the ubuntu app:
HTTP/1.1 200
2019-08-30 07:47:43
curl: (56) Recv failure: Connection reset by peer
2019-08-30 07:47:43
HTTP/1.1 200
2019-08-30 07:47:44
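As shown above, some HTTP requests fail, which is without doubt an interruption of the application during the rolling update in the kube cluster.
However, I can also reproduce the same interruption when scaling down, with the command below (from 10 to 2 replicas):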
kubectl scale deployment.apps/webapp-deployment --replicas=2
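After the tests above, I am wondering whether this is really what the so-called zero downtime means. The way I mock the HTTP requests is a bit contrived, but this kind of load is perfectly normal for applications that are designed to handle thousands or even millions of requests per second.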
Environment:
liguuudeiMac:cacheee liguuu$ minikube version
minikube version: v1.3.1
commit: ca60a424ce69a4d79f502650199ca2b52f29e631
liguuudeiMac:cacheee liguuu$ kubectl version
Client Version: version.Info{Major:"1", Minor:"14", GitVersion:"v1.14.3", GitCommit:"5e53fd6bc17c0dec8434817e69b04a25d8ae0ff0", GitTreeState:"clean", BuildDate:"2019-06-06T01:44:30Z", GoVersion:"go1.12.5", Compiler:"gc", Platform:"darwin/amd64"}
Server Version: version.Info{Major:"1", Minor:"15", GitVersion:"v1.15.2", GitCommit:"f6278300bebbb750328ac16ee6dd3aa7d3549568", GitTreeState:"clean", BuildDate:"2019-08-05T09:15:22Z", GoVersion:"go1.12.5", Compiler:"gc", Platform:"linux/amd64"}
Deployment and Service configuration:
# Service
apiVersion: v1
kind: Service
metadata:
  name: tc-webapp-service
spec:
  type: NodePort
  selector:
    appName: tc-webapp
  ports:
  - name: tc-svc
    protocol: TCP
    port: 1234
    targetPort: 8080
    nodePort: 31234
---
# Deployment
apiVersion: apps/v1
kind: Deployment
metadata:
  name: webapp-deployment
spec:
  replicas: 10
  selector:
    matchLabels:
      appName: tc-webapp
  strategy:
    type: RollingUpdate
    rollingUpdate:
      maxSurge: 0
      maxUnavailable: 9
  # Pod Templates
  template:
    metadata:
      labels:
        appName: tc-webapp
    spec:
      containers:
      - name: tc-part
        image: tomcat:8-jdk8
        ports:
        - containerPort: 8080
        livenessProbe:
          tcpSocket:
            port: 8080
          initialDelaySeconds: 10
          periodSeconds: 10
        readinessProbe:
          httpGet:
            scheme: HTTP
            port: 8080
            path: /
          initialDelaySeconds: 5
          periodSeconds: 1
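To deploy an application that really updates with zero downtime, the application has to meet some requirements. To name a few of them:
- The application should handle graceful shutdown
- The application should implement readiness and liveness probes correctly
For example, once it receives a shutdown signal, it should stop answering 200 to new readiness probes, but it should keep answering 200 to liveness probes until all old requests have been processed.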
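On the Kubernetes side, one commonly used way to give the application room for such a graceful shutdown is a preStop hook combined with a termination grace period. The fragment below is only a sketch against the Pod template from the question, not a tested fix: the 15 second sleep and the 60 second grace period are assumed values, and it assumes the tomcat image provides sh and sleep. The idea is that SIGTERM is only sent after the preStop hook returns, so the container keeps serving while the Pod is being removed from the Service endpoints.

```yaml
# Sketch only: fragment of the Deployment's Pod template, not a full manifest.
spec:
  template:
    spec:
      # Assumed value: must cover the preStop delay plus Tomcat's own shutdown.
      terminationGracePeriodSeconds: 60
      containers:
      - name: tc-part
        image: tomcat:8-jdk11
        lifecycle:
          preStop:
            exec:
              # Assumed 15s delay: keep serving while the Pod is removed
              # from the Service endpoints; SIGTERM follows once this returns.
              command: ["sh", "-c", "sleep 15"]
```

Whether requests that are already in flight complete still depends on how the application itself reacts to SIGTERM, so this complements the requirements above rather than replacing them.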