win7 pyspark sql utils IllegalArgumentException
I am trying to run pyspark in PyCharm. I have hooked everything up and set the environment variables. I can read files with sc.textFile, but when I try to read a csv file through pyspark.sql, something goes wrong.
Here is the code:
import os
import sys
from pyspark import SparkContext
from pyspark import SparkConf
from pyspark.sql import SQLContext
from pyspark.sql import SparkSession
# Path for spark source folder
os.environ['SPARK_HOME']="E:/spark-2.0.0-bin-hadoop2.7/spark-2.0.0-bin-hadoop2.7"
# Append pyspark to Python Path
sys.path.append("E:/spark-2.0.0-bin-hadoop2.7/spark-2.0.0-bin-hadoop2.7/python")
sys.path.append("E:/spark-2.0.0-bin-hadoop2.7/spark-2.0.0-bin-hadoop2.7/python/lib/py4j-0.10.1.zip")
conf = SparkConf().setAppName('Simple App')
sc = SparkContext("local", "Simple App")
spark = SparkSession.builder.config(conf=SparkConf()).getOrCreate()
accounts_rdd = spark.read.csv('test.csv')
print accounts_rdd.show()
Here is the error:
Traceback (most recent call last):
File "C:/Users/bjlinmanna/PycharmProjects/untitled1/spark.py", line 25, in <module>
accounts_rdd = spark.read.csv('pmec_close_position_order.csv')
File "E:\spark-2.0.0-bin-hadoop2.7\spark-2.0.0-bin-hadoop2.7\python\pyspark\sql\readwriter.py", line 363, in csv
return self._df(self._jreader.csv(self._spark._sc._jvm.PythonUtils.toSeq(path)))
File "E:\spark-2.0.0-bin-hadoop2.7\spark-2.0.0-bin-hadoop2.7\python\lib\py4j-0.10.1-src.zip\py4j\java_gateway.py", line 933, in __call__
File "E:\spark-2.0.0-bin-hadoop2.7\spark-2.0.0-bin-hadoop2.7\python\pyspark\sql\utils.py", line 79, in deco
raise IllegalArgumentException(s.split(': ', 1)[1], stackTrace)
pyspark.sql.utils.IllegalArgumentException: u'java.net.URISyntaxException: Relative path in absolute URI: file:C:/the/path/to/myfile/spark-warehouse'
Thanks to @Hyunsoo Park, I solved my problem as follows:
spark = SparkSession.builder\
.master('local[*]')\
.appName('My App')\
.config('spark.sql.warehouse.dir', 'file:///C:/the/path/to/myfile')\
.getOrCreate()
accounts_rdd = spark.read\
.format('csv')\
.option('header', 'true')\
.load('file.csv')
When setting the config, pay attention to the slashes in the file URI: it needs the 'file:///' form, with three slashes before the drive letter. I don't know why, but 'file:C:/the/path/to/myfile' did not work when I set it that way.
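Incidentally, if you are on Python 3, pathlib can build the correctly slashed URI for you. A minimal sketch, reusing the placeholder path from above:

from pathlib import PureWindowsPath
from pyspark.sql import SparkSession

# as_uri() yields a well-formed file URI such as 'file:///C:/the/path/to/myfile'
warehouse_uri = PureWindowsPath('C:/the/path/to/myfile').as_uri()

spark = SparkSession.builder \
    .config('spark.sql.warehouse.dir', warehouse_uri) \
    .getOrCreate()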
This link may be helpful:
http://quabr.com/38669206/spark-2-0-relative-path-in-absolute-uri-spark-warehouse
In short, there is a configuration option, spark.sql.warehouse.dir, for setting the warehouse folder. If you set the warehouse folder manually, the error message disappears.
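The same option can also be set on a SparkConf before the session is built, instead of in the builder chain. A sketch, with a hypothetical warehouse location:

from pyspark import SparkConf
from pyspark.sql import SparkSession

# 'C:/tmp/spark-warehouse' is only an example location
conf = SparkConf().set('spark.sql.warehouse.dir', 'file:///C:/tmp/spark-warehouse')
spark = SparkSession.builder.config(conf=conf).getOrCreate()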
I ran into the same problem today. I have no problem on Ubuntu 16.04, but when I run the same code on Windows 10, Spark shows the same error message as yours. It may be that Spark cannot find, or cannot correctly create, the warehouse folder on Windows.
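If the cause really is that Spark cannot create the folder itself, one possible workaround (untested, and the path is hypothetical) is to create the directory up front and then point Spark at it:

import os
from pyspark.sql import SparkSession

warehouse_dir = 'C:/tmp/spark-warehouse'  # hypothetical location
if not os.path.exists(warehouse_dir):
    os.makedirs(warehouse_dir)  # create the warehouse folder ourselves

spark = SparkSession.builder \
    .config('spark.sql.warehouse.dir', 'file:///' + warehouse_dir) \
    .getOrCreate()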