PySpark 3.2.1 - basic actions crashing on very small RDDs

I am testing simple PySpark functions locally in JupyterLab on Windows 10, using Python 3.9.10 in a conda-forge environment.

Setup:

from pyspark import SparkConf, SparkContext

conf = SparkConf().setMaster("local[4]").setAppName("PySpark_Test") 
sc = SparkContext(conf=conf) # Create the SparkContext with this configuration

# Verify SparkContext
print(sc)

# Print the Spark version of SparkContext
print("The version of Spark Context in the PySpark shell is:", sc.version)

# Print the Python version of SparkContext
print("The Python version of Spark Context in the PySpark shell is:", sc.pythonVer)

# Print Master of SparkContext: URL of the cluster or “local” string to run in local mode 
print("The master of Spark Context in the PySpark shell is:", sc.master)

# Print name of SparkContext
print("The appName of Spark Context is:", sc.appName)

Output:

<SparkContext master=local[4] appName=PySpark_Test>
The version of Spark Context in the PySpark shell is: 3.2.1
The Python version of Spark Context in the PySpark shell is: 3.9
The master of Spark Context in the PySpark shell is: local[4]
The appName of Spark Context is: PySpark_Test

# Create a Python list of numbers from 1 to 10
numb = [*range(1, 11)]

# Load the list into PySpark  
numbRDD = sc.parallelize(numb)

numbRDD.collect()

Output:

[1, 2, 3, 4, 5, 6, 7, 8, 9, 10]

type(numbRDD)

Output:

pyspark.rdd.RDD

cubedRDD = numbRDD.map(lambda x: x**3)
type(cubedRDD)

Output:

pyspark.rdd.PipelinedRDD

Everything above works fine.

When I run almost any action on this very small cubedRDD, it crashes.

These actions crash:
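The collect() call is the one behind the traceback below (it goes through PythonRDD.collectAndServe); the other calls here are illustrative examples of actions that hit the same Python-worker failure, not an exact list:

cubedRDD.collect()                   # raises the Py4JJavaError shown below
cubedRDD.count()                     # also runs through a Python worker
cubedRDD.take(3)
cubedRDD.reduce(lambda a, b: a + b)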

Error excerpt:

Py4JJavaError: An error occurred while calling z:org.apache.spark.api.python.PythonRDD.collectAndServe. : org.apache.spark.SparkException: Job aborted due to stage failure: Task 3 in stage 8.0 failed 1 times, most recent failure: Lost task 3.0 in stage 8.0 (TID 30) (10.0.0.236 executor driver): org.apache.spark.SparkException: Python worker failed to connect back. ...

Caused by: java.net.SocketTimeoutException: Accept timed out ...

Caused by: org.apache.spark.SparkException: Python worker failed to connect back. ...

What is going wrong here?

I solved this by setting the PYSPARK_PYTHON SYSTEM environment variable (a user-level variable also works) to point to the python.exe in my conda-forge environment (Miniconda):

PYSPARK_PYTHON = C:\Users\Me\Miniconda3\envs\spark39\python.exe
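For reference, a minimal sketch of applying the same fix from inside the notebook instead of through the Windows environment settings. The path is the one from my environment and must be adjusted to yours, and the variable has to be set before the SparkContext is created so the launched JVM and its Python workers inherit it:

import os

# Point PySpark's Python workers at the conda environment's interpreter.
# This path is specific to my Miniconda env ("spark39"); adjust it to yours.
os.environ["PYSPARK_PYTHON"] = r"C:\Users\Me\Miniconda3\envs\spark39\python.exe"

from pyspark import SparkConf, SparkContext

conf = SparkConf().setMaster("local[4]").setAppName("PySpark_Test")
sc = SparkContext(conf=conf)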

I finally found this page in the PySpark documentation, which is what told me about that environment variable. None of the other PySpark tutorials or installation guides I came across mention it.

I also turned off the "App Installer" app execution aliases for python.exe and python3.exe. I only have Python installed through Miniconda.

Everything runs fine with a jdk1.8.0_321 installation.

旁注:要使 Spark Dataframe write 操作正常工作(与此线程无关),请设置 HADOOP_HOME SYSTEM变量到 C:\winutils 和 SYSTEM PATH 以包含 %HADOOP_HOME%\bin。该 bin 目录应包含 Hadoop native libraries for your version of HADOOP. For Pyspark 3.2.1 from conda-forge, I used hadoop.dll and winutils.exe from here.