PySpark 3.2.1 - basic actions crashing on very small RDDs
I'm testing simple PySpark functions locally in JupyterLab, using Python 3.9.10 in a conda-forge environment on Windows 10.
Setup:
from pyspark import SparkConf, SparkContext
conf = SparkConf().setMaster("local[4]").setAppName("PySpark_Test")
sc = SparkContext(conf=conf) # Default SparkContext
# Verify SparkContext
print(sc)
# Print the Spark version of SparkContext
print("The version of Spark Context in the PySpark shell is:", sc.version)
# Print the Python version of SparkContext
print("The Python version of Spark Context in the PySpark shell is:", sc.pythonVer)
# Print Master of SparkContext: URL of the cluster or “local” string to run in local mode
print("The master of Spark Context in the PySpark shell is:", sc.master)
# Print name of SparkContext
print("The appName of Spark Context is:", sc.appName)
Output:
<SparkContext master=local[4] appName=PySpark_Test>
The version of Spark Context in the PySpark shell is: 3.2.1
The Python version of Spark Context in the PySpark shell is: 3.9
The master of Spark Context in the PySpark shell is: local[4]
The appName of Spark Context is: PySpark_Test
# Create a Python list of numbers from 1 to 10
numb = [*range(1, 11)]
# Load the list into PySpark
numbRDD = sc.parallelize(numb)
numbRDD.collect()
Output:
[1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
type(numbRDD)
Output:
pyspark.rdd.RDD
cubedRDD = numbRDD.map(lambda x: x**3)
type(cubedRDD)
Output:
pyspark.rdd.PipelinedRDD
Everything above works fine.
However, almost any action I run on this very small cubedRDD crashes.
These actions all crash:
- cubedRDD.collect()
- cubedRDD.first()
- cubedRDD.take(2)
- cubedRDD.count()
Error excerpt:
Py4JJavaError: An error occurred while calling
z:org.apache.spark.api.python.PythonRDD.collectAndServe. :
org.apache.spark.SparkException: Job aborted due to stage failure:
Task 3 in stage 8.0 failed 1 times, most recent failure: Lost task 3.0
in stage 8.0 (TID 30) (10.0.0.236 executor driver):
org.apache.spark.SparkException: Python worker failed to connect back.
...
Caused by: java.net.SocketTimeoutException: Accept timed out
...
Caused by: org.apache.spark.SparkException: Python worker failed to connect back.
...
What is going wrong here?
I solved this by setting the PYSPARK_PYTHON SYSTEM environment variable (a user-level variable also works) to point to the python.exe in my conda-forge (Miniconda) environment:
PYSPARK_PYTHON = C:\Users\Me\Miniconda3\envs\spark39\python.exe
I finally found this page in the PySpark documentation, which is what told me about that environment variable. None of the other PySpark tutorials or installation guides I came across mention it.
I also turned off the App Execution Aliases ("App Installer") for python.exe and python3.exe; my only Python install is through Miniconda.
Everything runs fine with my jdk1.8.0_321 installation.
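If you'd rather not touch system settings, a minimal sketch of the same fix from inside the notebook (assuming the JupyterLab kernel is already the conda env's python.exe) is to export the variable before the SparkContext is created:
import os
import sys

# Make the Spark Python workers use the same interpreter as this notebook.
# This must run before the SparkContext is created.
os.environ["PYSPARK_PYTHON"] = sys.executable

from pyspark import SparkConf, SparkContext
conf = SparkConf().setMaster("local[4]").setAppName("PySpark_Test")
sc = SparkContext(conf=conf)

# The actions that crashed before should now succeed.
print(sc.parallelize(range(1, 11)).map(lambda x: x**3).collect())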
Side note: to make Spark DataFrame write operations work (unrelated to this question), set the HADOOP_HOME SYSTEM variable to C:\winutils and add %HADOOP_HOME%\bin to the SYSTEM PATH. That bin directory should contain the Hadoop native libraries for your version of Hadoop. For PySpark 3.2.1 from conda-forge, I used hadoop.dll and winutils.exe from here.
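For reference, a rough sketch of the kind of write that needs winutils on Windows (the paths below are examples only; adjust them to wherever hadoop.dll and winutils.exe live on your machine):
import os

# Example locations -- set before the JVM is launched by getOrCreate().
os.environ["HADOOP_HOME"] = r"C:\winutils"
os.environ["PATH"] = os.environ["HADOOP_HOME"] + r"\bin;" + os.environ["PATH"]

from pyspark.sql import SparkSession
spark = SparkSession.builder.master("local[4]").appName("Write_Test").getOrCreate()

df = spark.createDataFrame([(1, 1), (2, 8), (3, 27)], ["n", "cube"])
# Without hadoop.dll/winutils.exe on the PATH, this write typically fails on Windows.
df.write.mode("overwrite").csv(r"C:\tmp\cubes_csv")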