How to force python versions to sync in a datalab instance spun from a GCP dataproc cluster?
I created a Dataproc cluster in GCP using image 1.2. I want to run Spark from a Datalab notebook. This works fine if I keep the Datalab notebook running Python 2.7 as its kernel, but if I want to use Python 3 I run into a minor version mismatch. I demonstrate the mismatch with the Datalab script below:
### Configuration
import sys, os

sys.path.insert(0, '/opt/panera/lib')

# Point both the driver and the workers at the same interpreter.
os.environ['PYSPARK_PYTHON'] = '/opt/conda/bin/python'
os.environ['PYSPARK_DRIVER_PYTHON'] = '/opt/conda/bin/python'

import google.datalab.storage as storage
from io import BytesIO
from pyspark.sql import SparkSession

spark = SparkSession.builder \
    .enableHiveSupport() \
    .config("hive.exec.dynamic.partition", "true") \
    .config("hive.exec.dynamic.partition.mode", "nonstrict") \
    .config("mapreduce.fileoutputcommitter.marksuccessfuljobs", "false") \
    .getOrCreate()
sc = spark.sparkContext
### import libraries
from pyspark.mllib.tree import DecisionTree, DecisionTreeModel
from pyspark.mllib.util import MLUtils
from pyspark.mllib.regression import LabeledPoint
### trivial example
data = [
    LabeledPoint(0.0, [0.0]),
    LabeledPoint(1.0, [1.0]),
    LabeledPoint(1.0, [2.0]),
    LabeledPoint(1.0, [3.0])
]
toyModel = DecisionTree.trainClassifier(sc.parallelize(data), 2, {})
print(toyModel)
Error:
Py4JJavaError: An error occurred while calling z:org.apache.spark.api.python.PythonRDD.runJob.
: org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 0.0 failed 4 times, most recent failure: Lost task 0.3 in stage 0.0 (TID 3, pan-bdaas-prod-jrl6-w-3.c.big-data-prod.internal, executor 6): org.apache.spark.api.python.PythonException: Traceback (most recent call last):
File "/usr/lib/spark/python/lib/pyspark.zip/pyspark/worker.py", line 124, in main
("%d.%d" % sys.version_info[:2], version))
Exception: Python in worker has different version 3.6 than that in driver 3.5, PySpark cannot run with different minor versions.Please check environment variables PYSPARK_PYTHON and PYSPARK_DRIVER_PYTHON are correctly set.
Other initialization scripts:
gs://dataproc-initialization-actions/cloud-sql-proxy/cloud-sql-proxy.sh
gs://dataproc-initialization-actions/datalab/datalab.sh
...and a script that loads some of our necessary libraries and utilities
The Python 3 kernel in Datalab uses Python 3.5, not Python 3.6.
You could try setting up a 3.6 environment in Datalab and then installing a new kernelspec for it (roughly sketched below), but it is probably easier to just configure the Dataproc cluster to use Python 3.5 instead.
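If you do want to try the 3.6 route, a minimal sketch of that approach, run from a Datalab cell, might look like the following. The environment name py36 and the /opt/conda path inside the Datalab container are assumptions for illustration, not something this answer guarantees:

# Hedged sketch: create a Python 3.6 conda environment inside the Datalab
# container and register it as a Jupyter kernelspec. The conda location and
# the env name "py36" are assumptions and may differ on your image.
import subprocess

# Create an isolated 3.6 environment that has ipykernel available.
subprocess.check_call(
    ["conda", "create", "-y", "-n", "py36", "python=3.6", "ipykernel"])

# Register the new environment as a selectable notebook kernel.
subprocess.check_call(
    ["/opt/conda/envs/py36/bin/python", "-m", "ipykernel", "install",
     "--user", "--name", "py36", "--display-name", "Python 3.6"])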
Instructions for setting up the cluster to use 3.5 are here.
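Whichever route you take, a quick sanity check (not part of the original answer) is to compare the Python version the driver runs with what the executors actually use, for example (assuming a live SparkContext sc, as in the question):

# Sanity-check sketch: PySpark only needs the major.minor versions to match,
# so print the driver's version and the distinct versions seen on the workers.
import sys

driver_version = sys.version_info[:2]
worker_versions = (
    sc.parallelize(range(4), 4)
      .map(lambda _: tuple(__import__("sys").version_info[:2]))
      .distinct()
      .collect()
)
print("driver :", driver_version)   # e.g. (3, 5)
print("workers:", worker_versions)  # should be one matching entry, e.g. [(3, 5)]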