PySpark 3.2.1 - basic actions crashing on very small RDDs

I am testing simple PySpark functions locally in JupyterLab on Windows 10, using Python 3.9.10 in a conda-forge environment.

Setup:

from pyspark import SparkConf, SparkContext

conf = SparkConf().setMaster("local[4]").setAppName("PySpark_Test") 
sc = SparkContext(conf=conf) # Create the SparkContext with this configuration

# Verify SparkContext
print(sc)

# Print the Spark version of SparkContext
print("The version of Spark Context in the PySpark shell is:", sc.version)

# Print the Python version of SparkContext
print("The Python version of Spark Context in the PySpark shell is:", sc.pythonVer)

# Print Master of SparkContext: URL of the cluster or “local” string to run in local mode 
print("The master of Spark Context in the PySpark shell is:", sc.master)

# Print name of SparkContext
print("The appName of Spark Context is:", sc.appName)

Output:

<SparkContext master=local[4] appName=PySpark_Test>
The version of Spark Context in the PySpark shell is: 3.2.1
The Python version of Spark Context in the PySpark shell is: 3.9
The master of Spark Context in the PySpark shell is: local[4]
The appName of Spark Context is: PySpark_Test

# Create a Python list of numbers from 1 to 10
numb = [*range(1, 11)]

# Load the list into PySpark  
numbRDD = sc.parallelize(numb)

numbRDD.collect()

Output:

[1, 2, 3, 4, 5, 6, 7, 8, 9, 10]

type(numbRDD)

Output:

pyspark.rdd.RDD

cubedRDD = numbRDD.map(lambda x: x**3)
type(cubedRDD)

Output:

pyspark.rdd.PipelinedRDD

Everything above works fine.

When I run almost any action on this very small cubedRDD, it crashes.

These actions crash:
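The collect() call is the one behind the traceback below (it goes through PythonRDD.collectAndServe); the other calls here are illustrative examples of actions that hit the same Python-worker failure, not an exact list:

cubedRDD.collect()                   # raises the Py4JJavaError shown below
cubedRDD.count()                     # also runs through a Python worker
cubedRDD.take(3)
cubedRDD.reduce(lambda a, b: a + b)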

Error excerpt:

Py4JJavaError: An error occurred while calling z:org.apache.spark.api.python.PythonRDD.collectAndServe. : org.apache.spark.SparkException: Job aborted due to stage failure: Task 3 in stage 8.0 failed 1 times, most recent failure: Lost task 3.0 in stage 8.0 (TID 30) (10.0.0.236 executor driver): org.apache.spark.SparkException: Python worker failed to connect back. ...

Caused by: java.net.SocketTimeoutException: Accept timed out ...

Caused by: org.apache.spark.SparkException: Python worker failed to connect back. ...

What is going wrong here?

I solved this by setting the PYSPARK_PYTHON SYSTEM environment variable (a user-level variable also works) to point to the python.exe in my conda-forge environment (Miniconda):

PYSPARK_PYTHON = C:\Users\Me\Miniconda3\envs\spark39\python.exe
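For reference, a minimal sketch of applying the same fix from inside the notebook instead of through the Windows environment settings. The path is the one from my environment and must be adjusted to yours, and the variable has to be set before the SparkContext is created so the launched JVM and its Python workers inherit it:

import os

# Point PySpark's Python workers at the conda environment's interpreter.
# This path is specific to my Miniconda env ("spark39"); adjust it to yours.
os.environ["PYSPARK_PYTHON"] = r"C:\Users\Me\Miniconda3\envs\spark39\python.exe"

from pyspark import SparkConf, SparkContext

conf = SparkConf().setMaster("local[4]").setAppName("PySpark_Test")
sc = SparkContext(conf=conf)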

I finally found this page in the PySpark documentation, which is what told me about that environment variable. None of the other PySpark tutorials or installation guides I came across mention it.

I also turned off the "App Installer" app execution aliases for python.exe and python3.exe. I only have Python installed through Miniconda.

Everything runs fine with a jdk1.8.0_321 installation.

旁注:要使 Spark Dataframe write 操作正常工作(与此线程无关),请设置 HADOOP_HOME SYSTEM变量到 C:\winutils 和 SYSTEM PATH 以包含 %HADOOP_HOME%\bin。该 bin 目录应包含 Hadoop native libraries for your version of HADOOP. For Pyspark 3.2.1 from conda-forge, I used hadoop.dll and winutils.exe from here.