Integrate PySpark with Jupyter Notebook
I am following this site to install Jupyter Notebook and PySpark and to integrate the two.
When I needed to create a "Jupyter profile", I read that "Jupyter profiles" no longer exist, so I continued with the following lines instead:
$ mkdir -p ~/.ipython/kernels/pyspark
$ touch ~/.ipython/kernels/pyspark/kernel.json
I opened kernel.json and wrote the following:
{
"display_name": "pySpark",
"language": "python",
"argv": [
"/usr/bin/python",
"-m",
"IPython.kernel",
"-f",
"{connection_file}"
],
"env": {
"SPARK_HOME": "/usr/local/Cellar/spark-2.0.0-bin-hadoop2.7",
"PYTHONPATH": "/usr/local/Cellar/spark-2.0.0-bin-hadoop2.7/python:/usr/local/Cellar/spark-2.0.0-bin-hadoop2.7/python/lib/py4j-0.10.1-src.zip",
"PYTHONSTARTUP": "/usr/local/Cellar/spark-2.0.0-bin-hadoop2.7/python/pyspark/shell.py",
"PYSPARK_SUBMIT_ARGS": "pyspark-shell"
}
}
The Spark path is correct.
However, when I run jupyter console --kernel pyspark, I get this output:
MacBook:~ Agus$ jupyter console --kernel pyspark
/usr/bin/python: No module named IPython
Traceback (most recent call last):
File "/usr/local/bin/jupyter-console", line 11, in <module>
sys.exit(main())
File "/usr/local/lib/python2.7/site-packages/jupyter_core/application.py", line 267, in launch_instance
return super(JupyterApp, cls).launch_instance(argv=argv, **kwargs)
File "/usr/local/lib/python2.7/site-packages/traitlets/config/application.py", line 595, in launch_instance
app.initialize(argv)
File "<decorator-gen-113>", line 2, in initialize
File "/usr/local/lib/python2.7/site-packages/traitlets/config/application.py", line 74, in catch_config_error
return method(app, *args, **kwargs)
File "/usr/local/lib/python2.7/site-packages/jupyter_console/app.py", line 137, in initialize
self.init_shell()
File "/usr/local/lib/python2.7/site-packages/jupyter_console/app.py", line 110, in init_shell
client=self.kernel_client,
File "/usr/local/lib/python2.7/site-packages/traitlets/config/configurable.py", line 412, in instance
inst = cls(*args, **kwargs)
File "/usr/local/lib/python2.7/site-packages/jupyter_console/ptshell.py", line 251, in __init__
self.init_kernel_info()
File "/usr/local/lib/python2.7/site-packages/jupyter_console/ptshell.py", line 305, in init_kernel_info
raise RuntimeError("Kernel didn't respond to kernel_info_request")
RuntimeError: Kernel didn't respond to kernel_info_request
The easiest way is to use findspark. First create an environment variable:
export SPARK_HOME="{full path to Spark}"
Then install findspark:
pip install findspark
Then launch jupyter notebook, and the following should work:
import findspark
findspark.init()
import pyspark
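For example, after findspark.init() has put Spark on the path, a quick sanity check could look like this (a minimal sketch; the app name is arbitrary):

import findspark
findspark.init()  # locates Spark via SPARK_HOME

import pyspark
sc = pyspark.SparkContext(appName="findspark-check")  # hypothetical app name
print(sc.parallelize(range(100)).sum())  # should print 4950
sc.stop()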
There are several ways to integrate pyspark with jupyter notebook.
1. Install Apache Toree:
pip install jupyter
pip install toree
jupyter toree install --spark_home=path/to/your/spark_directory --interpreters=PySpark
You can check the installation with
jupyter kernelspec list
You will get an entry for the toree pyspark kernel:
apache_toree_pyspark /home/pauli/.local/share/jupyter/kernels/apache_toree_pyspark
After that, you can install other interpreters such as SparkR, Scala, and SQL if you need them:
jupyter toree install --interpreters=Scala,SparkR,SQL
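Once the kernel is installed, pick "Apache Toree - PySpark" when creating a notebook; Toree should hand you a ready-made SparkContext. A minimal check, assuming it is bound to the usual name sc:

# in a notebook running the Toree PySpark kernel, `sc` should already exist
print(sc.version)
print(sc.parallelize([1, 2, 3]).count())  # 3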
2. Add these lines to your bashrc:
export SPARK_HOME=/path/to/spark-2.2.0
export PATH="$PATH:$SPARK_HOME/bin"
export PYSPARK_DRIVER_PYTHON=jupyter
export PYSPARK_DRIVER_PYTHON_OPTS="notebook"
Type pyspark in the terminal, and it will open a jupyter notebook with a sparkcontext already initialized.
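A quick way to confirm this in the notebook that opens (a minimal sketch; sc is the SparkContext that pyspark's shell pre-creates):

rdd = sc.parallelize([1, 2, 3, 4])
print(rdd.map(lambda x: x * 2).collect())  # [2, 4, 6, 8]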
3. Install pyspark as a plain python package only:
pip install pyspark
Now you can import pyspark like any other python package.
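For example, starting from a fresh pip install (a minimal sketch; the app name and sample data are made up):

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("demo").getOrCreate()  # hypothetical app name
df = spark.createDataFrame([(1, "a"), (2, "b")], ["id", "letter"])
df.show()
spark.stop()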