Running a Flask app that uses tensorflow on an Ubuntu 16.04 instance on GCP: the model runs, but predictions differ from those on local host

Question

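I trained a BERT model on my local host using huggingface for tensorflow. Predictions run on my local host work fine.

I then implemented a solution so that I can call my model through Flask from a GCP VM instance (Ubuntu 16.04). The process seems to work, since I can successfully call my app on the VM.

However, the predictions I receive from the VM differ from the ones I receive on my local host (which are the expected output), even though I use the same code. I use a model for sequence classification, and when I try to get the probabilities of the two labels, on my local host I get array([0.67829543, 0.32170454], dtype=float32), whereas the VM returns array([1, 1], dtype=float32). This is the snippet I use to predict with the model, for reference: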

predict_input = tokenizer.encode(sentence,
                                  truncation=True,
                                  padding=True,
                                  return_tensors="tf"
                                  )
tf_output = model.predict(predict_input)[0]
    
tf_prediction = tf.nn.softmax(tf_output, axis=0).numpy()

On my local host I trained the model using tf with GPU support; the VM, of course, has only two vCPUs. When loading tensorflow on the VM, I get the following warnings:

2020-12-27 07:57:55.533847: W tensorflow/stream_executor/platform/default/dso_loader.cc:59] Could not load dynamic library 'libcudart.so.10.1'; dlerror: libcudart.so.10.1: cannot open shared object file: No such file or directory
2020-12-27 07:57:55.533896: I tensorflow/stream_executor/cuda/cudart_stub.cc:29] Ignore above cudart dlerror if you do not have a GPU set up on your machine.
2020-12-27 07:57:56.792914: W tensorflow/stream_executor/platform/default/dso_loader.cc:59] Could not load dynamic library 'libcuda.so.1'; dlerror: libcuda.so.1: cannot open shared object file: No such file or directory
2020-12-27 07:57:56.792966: W tensorflow/stream_executor/cuda/cuda_driver.cc:312] failed call to cuInit: UNKNOWN ERROR (303)
2020-12-27 07:57:56.793002: I tensorflow/stream_executor/cuda/cuda_diagnostics.cc:156] kernel driver does not appear to be running on this host (bertvm-1): /proc/driver/nvidia/version does not exist
2020-12-27 07:57:56.793316: I tensorflow/core/platform/cpu_feature_guard.cc:142] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN)to use the following CPU instructions in performance-critical operations:  AVX2 AVX512F FMA
To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags.
2020-12-27 07:57:56.801469: I tensorflow/core/platform/profile_utils/cpu_utils.cc:104] CPU Frequency: 2000129999 Hz
2020-12-27 07:57:56.801693: I tensorflow/compiler/xla/service/service.cc:168] XLA service 0x64b8fe0 initialized for platform Host (this does not guarantee that XLA will be used). Devices:
2020-12-27 07:57:56.801805: I tensorflow/compiler/xla/service/service.cc:176]   StreamExecutor device (0): Host, Default Version

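I'm not sure whether this is the source of the error, or whether it is because I trained the model using tf for GPU and run predictions on an instance running tf for CPU, but that doesn't make much sense to me. The warnings seem to relate only to CUDA 'issues', which I believe have to do with GPU support.

Any ideas or hints about what causes the different predictions? Thanks in advance for your help!

Edit: It seems that the model returns the same logits on both the VM and the local host. When I then apply tf.nn.softmax(tf_output, axis=0).numpy(), I get different results. tf_output is [1.9530067 1.2070574] on both instances, while the function above returns [0.67829543 0.32170454] on the local host and [[1. 1.]] on the VM (both formatted as strings), as mentioned above.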

Answer 1

So I found the issue.

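tf_output = model.predict(predict_input) doesn't behave the same way on my VM (which runs Python 3.5, by the way, while my local host runs Python 3.8). Somehow I had to index twice on the VM, whereas on my local host a single index was enough. So tf_output = model.predict(predict_input)[0] on the local host becomes tf_output = model.predict(predict_input)[0][0] on the VM.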

Likewise, calling .numpy() directly on tf.nn.softmax(tf_output, axis=0) works on the local host, whereas on the VM it appears to be ignored. Replacing tf_prediction = tf.nn.softmax(tf_output, axis=0).numpy() with the following two lines

tf_prediction = tf.nn.softmax(tf_output, axis=0)
tf_prediction = tf_prediction.numpy()

and combining that with the indexing change above solved my problem.
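One plausible reading of the [[1. 1.]] output, given that the VM prints a two-dimensional array while the local host prints a one-dimensional one: after a single [0], the VM still holds a batch of shape (1, 2), and a softmax over axis=0 then normalizes each column on its own. The sketch below reproduces that with the logits from the question. It uses a small NumPy stand-in (the helper name softmax is hypothetical, not tf.nn.softmax itself) so it runs without TensorFlow.

```python
import numpy as np


def softmax(x, axis):
    # Hypothetical NumPy stand-in for a softmax along one axis,
    # used only to keep this sketch free of a TensorFlow dependency.
    x = np.asarray(x, dtype=np.float32)
    shifted = x - x.max(axis=axis, keepdims=True)  # subtract max for numerical stability
    exp = np.exp(shifted)
    return exp / exp.sum(axis=axis, keepdims=True)


# Logits reported in the question (identical on both machines).
logits = np.array([1.9530067, 1.2070574], dtype=np.float32)

# Shape (2,): axis 0 is the label axis, so the two values are normalized together.
flat = softmax(logits, axis=0)

# Shape (1, 2): axis 0 is the batch axis, so each column holds a single value
# and is normalized against itself, which always gives 1.
batched = softmax(logits[np.newaxis, :], axis=0)

# Normalizing over the last axis gives the same probabilities for either shape.
last_axis = softmax(logits[np.newaxis, :], axis=-1)

print(flat)
print(batched)
print(last_axis)
```

If this reading is right, the extra [0] on the VM works because it removes the batch dimension before the softmax is applied.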

For clarity, this is the final snippet that runs on the VM:

tf_output = model.predict(predict_input)[0][0]
tf_prediction = tf.nn.softmax(tf_output, axis=0)
tf_prediction = tf_prediction.numpy()
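A possible way to avoid machine-specific indexing is to normalize over the last axis, so that logits of shape (2,) and (1, 2) are handled alike. The following is only a sketch under that assumption, again written with NumPy so it runs on its own. The function name class_probabilities is hypothetical and not part of the original code.

```python
import numpy as np


def class_probabilities(logits):
    """Return softmax probabilities with shape (batch, num_labels).

    Hypothetical helper: accepts logits of shape (num_labels,) or
    (batch, num_labels) and always normalizes over the label axis.
    """
    arr = np.atleast_2d(np.asarray(logits, dtype=np.float32))
    shifted = arr - arr.max(axis=-1, keepdims=True)  # numerical stability
    exp = np.exp(shifted)
    return exp / exp.sum(axis=-1, keepdims=True)


# The same logits in the two shapes seen on the local host and on the VM.
local_style = np.array([1.9530067, 1.2070574], dtype=np.float32)
vm_style = np.array([[1.9530067, 1.2070574]], dtype=np.float32)

print(class_probabilities(local_style)[0])
print(class_probabilities(vm_style)[0])
```

With TensorFlow itself, the analogous change would be to pass axis=-1 to tf.nn.softmax instead of axis=0. Whether that alone is enough on the VM was not tested in the original post.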

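The VM runs Ubuntu 16.04, TF 2.3.1 and Python 3.5 without a GPU, while my local host runs Windows 10, TF 2.4.0 and Python 3.8 with a GPU (in case that matters).

It seems a bit illogical to me, but I guess it is at least a workaround. Hopefully my monologue helps others with the same issue ^^ Cheers.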
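Since the two machines differ in OS, Python and TF versions, it can help to print the same environment summary on both before comparing outputs. The sketch below is one way to do that. The function name describe_environment is hypothetical, and including transformers in the list is an assumption, since the post does not state its version. The code avoids f-strings so that it also runs on Python 3.5.

```python
import platform


def describe_environment(packages=("tensorflow", "transformers")):
    """Collect Python, OS and package versions into a plain dict.

    Hypothetical helper for comparing two machines. Any package that
    cannot be imported is reported instead of raising.
    """
    info = {
        "python": platform.python_version(),
        "os": platform.platform(),
    }
    for name in packages:
        try:
            module = __import__(name)
            info[name] = getattr(module, "__version__", "unknown")
        except Exception as exc:  # a missing or broken install should not stop the report
            info[name] = "unavailable ({})".format(type(exc).__name__)
    return info


for key, value in sorted(describe_environment().items()):
    print("{}: {}".format(key, value))
```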

google-cloud-platform

tensorflow

huggingface-transformers