如何在 macOS Mojave 上使用 Pandas UDF? (由于 [__NSPlaceholderDictionary 初始化] 可能正在进行中而失败...)
How to use Pandas UDFs on macOS Mojave? (that fails due to [__NSPlaceholderDictionary initialize] may have been in progress...)
我正在尝试在 macOS 10.14.3 (macOS Mojave) 上的 Apache Spark 2.4.0 中使用 Pandas UDFs (a.k.a. Vectorized UDFs)。
我使用 pip
(以及后来的 pip3
)安装了 pandas
和 pyarrow
。
每当我执行 Spark SQL 官方文档中的示例代码时,我都会得到以下异常。
import pandas as pd
from pyspark.sql.functions import col, pandas_udf
from pyspark.sql.types import LongType
def multiply_func(a, b):
return a * b
multiply = pandas_udf(multiply_func, returnType=LongType())
x = pd.Series([1, 2, 3])
print(multiply_func(x, x))
df = spark.createDataFrame(pd.DataFrame(x, columns=["x"]))
# Execute function as a Spark vectorized UDF
df.select(multiply(col("x"), col("x"))).show()
异常情况如下:
objc[97883]: +[__NSPlaceholderDictionary initialize] may have been in progress in another thread when fork() was called.
objc[97883]: +[__NSPlaceholderDictionary initialize] may have been in progress in another thread when fork() was called. We cannot safely call it or ignore it in the fork() child process. Crashing instead. Set a breakpoint on objc_initializeAfterForkError to debug.
19/03/27 15:01:20 ERROR Executor: Exception in task 0.0 in stage 0.0 (TID 0)
org.apache.spark.SparkException: Python worker exited unexpectedly (crashed)
at org.apache.spark.api.python.BasePythonRunner$ReaderIterator$$anonfun.applyOrElse(PythonRunner.scala:486)
at org.apache.spark.api.python.BasePythonRunner$ReaderIterator$$anonfun.applyOrElse(PythonRunner.scala:475)
at scala.runtime.AbstractPartialFunction.apply(AbstractPartialFunction.scala:34)
at org.apache.spark.sql.execution.python.ArrowPythonRunner$$anon.read(ArrowPythonRunner.scala:178)
at org.apache.spark.sql.execution.python.ArrowPythonRunner$$anon.read(ArrowPythonRunner.scala:122)
at org.apache.spark.api.python.BasePythonRunner$ReaderIterator.hasNext(PythonRunner.scala:406)
at org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:37)
at org.apache.spark.sql.execution.python.ArrowEvalPythonExec$$anon.<init>(ArrowEvalPythonExec.scala:98)
at org.apache.spark.sql.execution.python.ArrowEvalPythonExec.evaluate(ArrowEvalPythonExec.scala:96)
at org.apache.spark.sql.execution.python.EvalPythonExec.$anonfun$doExecute(EvalPythonExec.scala:128)
...
Caused by: java.io.EOFException
at java.io.DataInputStream.readInt(DataInputStream.java:392)
at org.apache.spark.sql.execution.python.ArrowPythonRunner$$anon.read(ArrowPythonRunner.scala:159)
... 28 more
我在 Doesn't work on macOS High Sierra #69 中找到了一个解决方案,并认为我会 post 在 Whosebug 上找到它。
您应该确保 Xcode 的命令行工具已经安装。如果没有,执行以下命令:
xcode-select --install
结果非常重要的是导出 OBJC_DISABLE_INITIALIZE_FORK_SAFETY
env var:
export OBJC_DISABLE_INITIALIZE_FORK_SAFETY=YES
上面两个代码运行良好:
>>> # Execute function as a Spark vectorized UDF
... df.select(multiply(col("x"), col("x"))).show()
[Stage 0:> (0 + 1) / 1]/usr/local/lib/python3.7/site-packages/pyarrow/__init__.py:159: UserWarning: pyarrow.open_stream is deprecated, please use pyarrow.ipc.open_stream
warnings.warn("pyarrow.open_stream is deprecated, please use "
/usr/local/lib/python3.7/site-packages/pyarrow/__init__.py:159: UserWarning: pyarrow.open_stream is deprecated, please use pyarrow.ipc.open_stream
warnings.warn("pyarrow.open_stream is deprecated, please use "
/usr/local/lib/python3.7/site-packages/pyarrow/__init__.py:159: UserWarning: pyarrow.open_stream is deprecated, please use pyarrow.ipc.open_stream
warnings.warn("pyarrow.open_stream is deprecated, please use "
/usr/local/lib/python3.7/site-packages/pyarrow/__init__.py:159: UserWarning: pyarrow.open_stream is deprecated, please use pyarrow.ipc.open_stream
warnings.warn("pyarrow.open_stream is deprecated, please use "
+-------------------+
|multiply_func(x, x)|
+-------------------+
| 1|
| 4|
| 9|
+-------------------+
我正在尝试在 macOS 10.14.3 (macOS Mojave) 上的 Apache Spark 2.4.0 中使用 Pandas UDFs (a.k.a. Vectorized UDFs)。
我使用 pip
(以及后来的 pip3
)安装了 pandas
和 pyarrow
。
每当我执行 Spark SQL 官方文档中的示例代码时,我都会得到以下异常。
import pandas as pd
from pyspark.sql.functions import col, pandas_udf
from pyspark.sql.types import LongType
def multiply_func(a, b):
return a * b
multiply = pandas_udf(multiply_func, returnType=LongType())
x = pd.Series([1, 2, 3])
print(multiply_func(x, x))
df = spark.createDataFrame(pd.DataFrame(x, columns=["x"]))
# Execute function as a Spark vectorized UDF
df.select(multiply(col("x"), col("x"))).show()
异常情况如下:
objc[97883]: +[__NSPlaceholderDictionary initialize] may have been in progress in another thread when fork() was called.
objc[97883]: +[__NSPlaceholderDictionary initialize] may have been in progress in another thread when fork() was called. We cannot safely call it or ignore it in the fork() child process. Crashing instead. Set a breakpoint on objc_initializeAfterForkError to debug.
19/03/27 15:01:20 ERROR Executor: Exception in task 0.0 in stage 0.0 (TID 0)
org.apache.spark.SparkException: Python worker exited unexpectedly (crashed)
at org.apache.spark.api.python.BasePythonRunner$ReaderIterator$$anonfun.applyOrElse(PythonRunner.scala:486)
at org.apache.spark.api.python.BasePythonRunner$ReaderIterator$$anonfun.applyOrElse(PythonRunner.scala:475)
at scala.runtime.AbstractPartialFunction.apply(AbstractPartialFunction.scala:34)
at org.apache.spark.sql.execution.python.ArrowPythonRunner$$anon.read(ArrowPythonRunner.scala:178)
at org.apache.spark.sql.execution.python.ArrowPythonRunner$$anon.read(ArrowPythonRunner.scala:122)
at org.apache.spark.api.python.BasePythonRunner$ReaderIterator.hasNext(PythonRunner.scala:406)
at org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:37)
at org.apache.spark.sql.execution.python.ArrowEvalPythonExec$$anon.<init>(ArrowEvalPythonExec.scala:98)
at org.apache.spark.sql.execution.python.ArrowEvalPythonExec.evaluate(ArrowEvalPythonExec.scala:96)
at org.apache.spark.sql.execution.python.EvalPythonExec.$anonfun$doExecute(EvalPythonExec.scala:128)
...
Caused by: java.io.EOFException
at java.io.DataInputStream.readInt(DataInputStream.java:392)
at org.apache.spark.sql.execution.python.ArrowPythonRunner$$anon.read(ArrowPythonRunner.scala:159)
... 28 more
我在 Doesn't work on macOS High Sierra #69 中找到了一个解决方案,并认为我会 post 在 Whosebug 上找到它。
您应该确保 Xcode 的命令行工具已经安装。如果没有,执行以下命令:
xcode-select --install
结果非常重要的是导出 OBJC_DISABLE_INITIALIZE_FORK_SAFETY
env var:
export OBJC_DISABLE_INITIALIZE_FORK_SAFETY=YES
上面两个代码运行良好:
>>> # Execute function as a Spark vectorized UDF
... df.select(multiply(col("x"), col("x"))).show()
[Stage 0:> (0 + 1) / 1]/usr/local/lib/python3.7/site-packages/pyarrow/__init__.py:159: UserWarning: pyarrow.open_stream is deprecated, please use pyarrow.ipc.open_stream
warnings.warn("pyarrow.open_stream is deprecated, please use "
/usr/local/lib/python3.7/site-packages/pyarrow/__init__.py:159: UserWarning: pyarrow.open_stream is deprecated, please use pyarrow.ipc.open_stream
warnings.warn("pyarrow.open_stream is deprecated, please use "
/usr/local/lib/python3.7/site-packages/pyarrow/__init__.py:159: UserWarning: pyarrow.open_stream is deprecated, please use pyarrow.ipc.open_stream
warnings.warn("pyarrow.open_stream is deprecated, please use "
/usr/local/lib/python3.7/site-packages/pyarrow/__init__.py:159: UserWarning: pyarrow.open_stream is deprecated, please use pyarrow.ipc.open_stream
warnings.warn("pyarrow.open_stream is deprecated, please use "
+-------------------+
|multiply_func(x, x)|
+-------------------+
| 1|
| 4|
| 9|
+-------------------+