Zeppelin/Spark: org.apache.spark.SparkException: Cannot run program "/usr/bin/": error=13, no permission
I am trying to run a basic regression with Zeppelin 0.7.2 and Spark 2.1.1 on Debian 9. Both zeppelin and spark are "installed" under /usr/local/, i.e. /usr/local/zeppelin/ and /usr/local/spark. Zeppelin also knows the correct SPARK_HOME. First I load the data:
%spark.pyspark
from sqlalchemy import create_engine #sql query
import pandas as pd #sql query
from pyspark import SparkContext #Spark DataFrame
from pyspark.sql import SQLContext #Spark DataFrame
# database connection and sql query
pdf = pd.read_sql("select col1, col2, col3 from table", create_engine('mysql+mysqldb://user:pass@host:3306/db').connect())
print(pdf.size) # size of pandas dataFrame
# convert pandas dataFrame into spark dataFrame
sdf = SQLContext(SparkContext.getOrCreate()).createDataFrame(pdf)
sdf.printSchema()# what does the spark dataFrame look like?
Great, it works, and I get the output with 46977 rows and three columns:
46977
root
|-- col1: double (nullable = true)
|-- col2: double (nullable = true)
|-- col3: date (nullable = true)
OK, now I want to do the regression:
%spark.pyspark
# do a linear regression with sparks ml libs
# https://community.intersystems.com/post/machine-learning-spark-and-cach%C3%A9
from pyspark.ml.regression import LinearRegression
from pyspark.ml.feature import VectorAssembler
# choose several inputCols and transform the "Features" column(s) into the correct vector format
vectorAssembler = VectorAssembler(inputCols=["col1"], outputCol="features")
data=vectorAssembler.transform(sdf)
print(data)
# Split the data into 70% training and 30% test sets.
trainingData,testData = data.randomSplit([0.7, 0.3], 0.0)
print(trainingData)
# Configure the model.
lr = LinearRegression().setFeaturesCol("features").setLabelCol("col2").setMaxIter(10)
## Train the model using the training data.
lrm = lr.fit(trainingData)
## Run the test data through the model and display its predictions for col2.
#predictions = lrm.transform(testData)
#predictions.show()
But when executing lr.fit(trainingData), I get an error in the console (and in zeppelin's log files). The error seems to occur while Spark is starting up: Cannot run program "/usr/bin/": error=13, Keine Berechtigung. I wonder what is supposed to be started from /usr/bin/, since I only use paths under /usr/local/.
Traceback (most recent call last):
File "/tmp/zeppelin_pyspark-4001144784380663394.py", line 367, in <module>
raise Exception(traceback.format_exc())
Exception: Traceback (most recent call last):
File "/tmp/zeppelin_pyspark-4001144784380663394.py", line 360, in <module>
exec(code, _zcUserQueryNameSpace)
File "<stdin>", line 9, in <module>
File "/usr/local/spark/python/pyspark/ml/base.py", line 64, in fit
return self._fit(dataset)
File "/usr/local/spark/python/pyspark/ml/wrapper.py", line 236, in _fit
java_model = self._fit_java(dataset)
File "/usr/local/spark/python/pyspark/ml/wrapper.py", line 233, in _fit_java
return self._java_obj.fit(dataset._jdf)
File "/usr/local/spark/python/lib/py4j-0.10.4-src.zip/py4j/java_gateway.py", line 1133, in __call__
answer, self.gateway_client, self.target_id, self.name)
File "/usr/local/spark/python/pyspark/sql/utils.py", line 63, in deco
return f(*a, **kw)
File "/usr/local/spark/python/lib/py4j-0.10.4-src.zip/py4j/protocol.py", line 319, in get_return_value
format(target_id, ".", name), value)
Py4JJavaError: An error occurred while calling o70.fit.
: org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 0.0 failed 1 times, most recent failure: Lost task 0.0 in stage 0.0 (TID 0, localhost, executor driver): **java.io.IOException: Cannot run program "/usr/bin/": error=13, Keine Berechtigung**
at java.lang.ProcessBuilder.start(ProcessBuilder.java:1048)
at org.apache.spark.api.python.PythonWorkerFactory.startDaemon(PythonWorkerFactory.scala:163)
at org.apache.spark.api.python.PythonWorkerFactory.createThroughDaemon(PythonWorkerFactory.scala:89)
at org.apache.spark.api.python.PythonWorkerFactory.create(PythonWorkerFactory.scala:65)
at org.apache.spark.SparkEnv.createPythonWorker(SparkEnv.scala:116)
at org.apache.spark.api.python.PythonRunner.compute(PythonRDD.scala:128)
at org.apache.spark.api.python.PythonRDD.compute(PythonRDD.scala:63)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:287)
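A quick way to see which interpreter Spark actually tries to launch is to print the relevant environment variables from a pyspark paragraph. This is only a minimal diagnostic sketch; the variables may simply be unset, in which case Spark falls back to the default python on the PATH:
%spark.pyspark
import os
# Spark passes these values to ProcessBuilder when it starts Python workers.
# A value like "/usr/bin/" (a directory, not an executable) produces exactly the
# java.io.IOException: error=13 shown in the trace above.
print(os.environ.get("PYSPARK_PYTHON"))
print(os.environ.get("PYSPARK_DRIVER_PYTHON"))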
This was a configuration error in Zeppelin's conf/zeppelin-env.sh. There I had uncommented the following line, which caused the error; after commenting the line out again, it works:
#export PYSPARK_PYTHON=/usr/bin/ # path to the python command. must be the same path on the driver(Zeppelin) and all workers.
So the problem was that the path for PYSPARK_PYTHON was not set correctly, and it now falls back to the default python binary. I found the solution by searching for the string /usr/bin/ with grep -R "/usr/bin/" in the Zeppelin base directory and then checking the files it turned up.
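For reference, a working conf/zeppelin-env.sh either leaves the variable commented out (as above) or points it at a concrete interpreter rather than a bare directory. A minimal sketch, assuming python3 is installed at /usr/bin/python3; adjust the path to your system:
# conf/zeppelin-env.sh
# Point PYSPARK_PYTHON at an executable, not a directory; the same path must exist
# on the driver (Zeppelin) and on all workers.
export PYSPARK_PYTHON=/usr/bin/python3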