在 Google App Engine 上连续 integration/deployment/delivery,风险太大?

Continuous integration/deployment/delivery on Google App Engine, too risky?

我们最近在 Google App Engine 上设置了连续 integration/deployment/delivery 的 nodejs webapp。 CI 服务器 (GitLabCI) 运行s 依赖项安装、构建、测试和部署到 integration/prod 取决于分支 (develop/master)。

在今天,我们遇到的唯一错误是在依赖项步骤中,因此我们不太关心它。但是昨天 (21/10/16),发生了大规模的 DNS 中断,管道在部署步骤中途失败,破坏了产品。只需重新运行管道即可完成工作,但问题随时可能重现。

我的问题是:

目前我们只有两个版本 "dev" 和 "prod" 之后更新 提交,但在随机时间我可以观察到奇怪的行为。

任何 response/suggestions/feedback 都非常欢迎!

关于我正在谈论的网络问题的堆栈跟踪示例:

DEBUG: Error sending result: 'MetadataServerException(HTTPError(),)'. Reason: 'PicklingError("Can't pickle <type 'cStringIO.StringO'>: attribute lookup cStringIO.StringO failed",)'
Traceback (most recent call last):
  File "/google-cloud-sdk/lib/googlecloudsdk/calliope/cli.py", line 733, in Execute
    resources = args.calliope_command.Run(cli=self, args=args)
  File "/google-cloud-sdk/lib/googlecloudsdk/calliope/backend.py", line 1630, in Run
    resources = command_instance.Run(args)
  File "/google-cloud-sdk/lib/surface/app/deploy.py", line 53, in Run
    return deploy_util.RunDeploy(self, args)
  File "/google-cloud-sdk/lib/googlecloudsdk/command_lib/app/deploy_util.py", line 387, in RunDeploy
    all_services)
  File "/google-cloud-sdk/lib/googlecloudsdk/command_lib/app/deploy_util.py", line 247, in Deploy
    manifest = _UploadFiles(service, code_bucket_ref)
  File "/google-cloud-sdk/lib/googlecloudsdk/command_lib/app/deploy_util.py", line 115, in _UploadFiles
    service, code_bucket_ref)
  File "/google-cloud-sdk/lib/googlecloudsdk/api_lib/app/deploy_app_command_util.py", line 277, in CopyFilesToCodeBucketNoGsUtil
    _UploadFiles(files_to_upload, bucket_ref)
  File "/google-cloud-sdk/lib/googlecloudsdk/api_lib/app/deploy_app_command_util.py", line 219, in _UploadFiles
    results = pool.map(_UploadFile, tasks)
  File "/usr/lib/python2.7/multiprocessing/pool.py", line 251, in map
    return self.map_async(func, iterable, chunksize).get()
  File "/usr/lib/python2.7/multiprocessing/pool.py", line 558, in get
    raise self._value
MaybeEncodingError: Error sending result: 'MetadataServerException(HTTPError(),)'. Reason: 'PicklingError("Can't pickle <type 'cStringIO.StringO'>: attribute lookup cStringIO.StringO failed",)'
DEBUG: Exception captured in Error
Traceback (most recent call last):
  File "/google-cloud-sdk/lib/googlecloudsdk/core/metrics.py", line 411, in Wrapper
    return func(*args, **kwds)
TypeError: Error() takes exactly 3 arguments (1 given)
ERROR: gcloud crashed (MaybeEncodingError): Error sending result: 'MetadataServerException(HTTPError(),)'. Reason: 'PicklingError("Can't pickle <type 'cStringIO.StringO'>: attribute lookup cStringIO.StringO failed",)'
Traceback (most recent call last):
  File "/google-cloud-sdk/lib/gcloud.py", line 65, in <module>
    main()
  File "/google-cloud-sdk/lib/gcloud.py", line 61, in main
    sys.exit(googlecloudsdk.gcloud_main.main())
  File "/google-cloud-sdk/lib/googlecloudsdk/gcloud_main.py", line 145, in main
    crash_handling.HandleGcloudCrash(err)
  File "/google-cloud-sdk/lib/googlecloudsdk/command_lib/crash_handling.py", line 107, in HandleGcloudCrash
    _ReportError(err)
  File "/google-cloud-sdk/lib/googlecloudsdk/command_lib/crash_handling.py", line 86, in _ReportError
    util.ErrorReporting().ReportEvent(error_message=stacktrace,
  File "/google-cloud-sdk/lib/googlecloudsdk/api_lib/error_reporting/util.py", line 28, in __init__
    self._API_NAME, self._API_VERSION)
  File "/google-cloud-sdk/lib/googlecloudsdk/core/apis.py", line 254, in GetClientInstance
    http_client = http.Http()
  File "/google-cloud-sdk/lib/googlecloudsdk/core/credentials/http.py", line 60, in Http
    creds = store.Load()
  File "/google-cloud-sdk/lib/googlecloudsdk/core/credentials/store.py", line 282, in Load
    if account in c_gce.Metadata().Accounts():
  File "/google-cloud-sdk/lib/googlecloudsdk/core/credentials/gce.py", line 122, in Accounts
    gce_read.GOOGLE_GCE_METADATA_ACCOUNTS_URI + '/')
  File "/google-cloud-sdk/lib/googlecloudsdk/core/util/retry.py", line 160, in TryFunc
    return func(*args, **kwargs), None
  File "/google-cloud-sdk/lib/googlecloudsdk/core/credentials/gce.py", line 45, in _ReadNoProxyWithCleanFailures
    raise MetadataServerException(e)
googlecloudsdk.core.credentials.gce.MetadataServerException: HTTP Error 503: Service Unavailable
DEBUG: Uploading [/builds/apps/webapp/lib/jinja2/defaults.pyc] to [151c77b4e5bdd2c38b6a2bf914fffa3a6ffa71a6]
INFO: Uploading [/builds/apps/webapp/lib/jinja2/defaults.pyc] to [151c77b4e5bdd2c38b6a2bf914fffa3a6ffa71a6]
INFO: Refreshing access_token

Good/bad?主观 - 因此对于 SO 来说是题外话。假设问题是如何使持续部署可靠:)

好吧,问题在于您正在使用 app versions 作为您的 CI 环境,这意味着您无法避免由于特定版本坏的。您只能希望通过重新部署版本(当中断结束时)尽快恢复 - 这可以自动化。

您不应该让您的生产站点 运行 直接脱离被 CI production 管道覆盖的版本,否则您可能会因部署不当而导致站点中断。相反,您可以为 CI production 管道的每次执行使用 new/unique 版本,只有在成功完成之后,您才能使用下面描述的流程(也可以如果使用不同的 apps 而不是 app versions 作为 CI 环境,则在 CI 管道内使用)

来自Deploying your program

By default the deploy command automatically generates a new version ID each time that you use it and will route any traffic to the new version.

To override this behavior, you can specify the version ID with the version flag:

gcloud app deploy --version myID

You can also specify not to send all traffic to the new version immediatey with the --no-promote flag:

gcloud app deploy --no-promote

因此请确保您永远不会部署一个版本并在同一步骤中将该版本设为默认流量目的地(如果从客户端驱动,则可能不是原子的)。特别是对于生产应用程序。相反:

通过这种方式,唯一的关键操作是流量切换,它(希望)是一个原子操作,要么成功,要么在 GAE 端完全回滚(如果不成功,则为 GAE 错误)。如果此步骤失败,应用程序仍应继续使用旧版本。

当然,这是假设网络问题只存在于您和 GAE 之间,如果它们也影响 GAE 的内部操作 已关闭(但我信任的应该及时修复)。