PYTHONPATH error [WinError 123] with PySpark when using lazy imports like NLTK or PATTERN: duplicated disk label 'C://C://..spark-core_2.11-2.3.2.jar'
The problem is a Windows path issue with lazily imported libraries like nltk: nltk and pattern import their own dependencies at the moment they are used, and at that point the modules importlib_metadata.py and pathlib.py try to read a PYTHONPATH entry whose value is malformed (a duplicated drive label, D:/D:/), and the code blows up.
First, we have a simple function like this:
import nltk

def print_stopwords():
    print(nltk.corpus.stopwords)
In local mode you can run this and you get all the stopwords. OK.
If you want to use this function in a map in Spark, for a PySpark workflow, the code above does not work. Why? I honestly don't know...
I think the reason it fails is that the Spark JAVA libraries use and modify PYTHONPATH while executing a map function like this one:
import nltk
from pyspark.sql import SQLContext, SparkSession

spark = (SparkSession
         .builder
         .master("local[*]")
         .appName("Nueva")
         .getOrCreate())
sc = spark.sparkContext
sqlContext = SQLContext(sc)

def print_stopwords(x):
    print("\n", x)
    print(nltk.corpus.stopwords.words('english'))
    return x

prueba = sc.parallelize([0, 1, 2, 3])
r = prueba.map(print_stopwords)
r.take(1)
I get this error:
File "C:\ProgramData\Anaconda3\lib\site-packages\nltk\__init__.py", line 143, in <module>
from nltk.chunk import *
File "C:\ProgramData\Anaconda3\lib\site-packages\nltk\chunk\__init__.py", line 157, in <module>
from nltk.chunk.api import ChunkParserI
File "C:\ProgramData\Anaconda3\lib\site-packages\nltk\chunk\api.py", line 13, in <module>
from nltk.parse import ParserI
File "C:\ProgramData\Anaconda3\lib\site-packages\nltk\parse\__init__.py", line 100, in <module>
from nltk.parse.transitionparser import TransitionParser
File "C:\ProgramData\Anaconda3\lib\site-packages\nltk\parse\transitionparser.py", line 22, in <module>
from sklearn.datasets import load_svmlight_file
File "C:\ProgramData\Anaconda3\lib\site-packages\sklearn\datasets\__init__.py", line 22, in <module>
from .twenty_newsgroups import fetch_20newsgroups
File "C:\ProgramData\Anaconda3\lib\site-packages\sklearn\datasets\twenty_newsgroups.py", line 44, in <module>
from ..feature_extraction.text import CountVectorizer
File "C:\ProgramData\Anaconda3\lib\site-packages\sklearn\feature_extraction\__init__.py", line 10, in <module>
from . import text
File "C:\ProgramData\Anaconda3\lib\site-packages\sklearn\feature_extraction\text.py", line 28, in <module>
from ..preprocessing import normalize
File "C:\ProgramData\Anaconda3\lib\site-packages\sklearn\preprocessing\__init__.py", line 6, in <module>
from ._function_transformer import FunctionTransformer
File "C:\ProgramData\Anaconda3\lib\site-packages\sklearn\preprocessing\_function_transformer.py", line 5, in <module>
from ..utils.testing import assert_allclose_dense_sparse
File "C:\ProgramData\Anaconda3\lib\site-packages\sklearn\utils\testing.py", line 718, in <module>
import pytest
File "C:\ProgramData\Anaconda3\lib\site-packages\pytest.py", line 6, in <module>
from _pytest.assertion import register_assert_rewrite
File "C:\ProgramData\Anaconda3\lib\site-packages\_pytest\assertion\__init__.py", line 7, in <module>
from _pytest.assertion import rewrite
File "C:\ProgramData\Anaconda3\lib\site-packages\_pytest\assertion\rewrite.py", line 26, in <module>
from _pytest.assertion import util
File "C:\ProgramData\Anaconda3\lib\site-packages\_pytest\assertion\util.py", line 8, in <module>
import _pytest._code
File "C:\ProgramData\Anaconda3\lib\site-packages\_pytest\_code\__init__.py", line 2, in <module>
from .code import Code # noqa
File "C:\ProgramData\Anaconda3\lib\site-packages\_pytest\_code\code.py", line 23, in <module>
import pluggy
File "C:\ProgramData\Anaconda3\lib\site-packages\pluggy\__init__.py", line 16, in <module>
from .manager import PluginManager, PluginValidationError
File "C:\ProgramData\Anaconda3\lib\site-packages\pluggy\manager.py", line 11, in <module>
import importlib_metadata
File "C:\ProgramData\Anaconda3\lib\site-packages\importlib_metadata\__init__.py", line 549, in <module>
__version__ = version(__name__)
File "C:\ProgramData\Anaconda3\lib\site-packages\importlib_metadata\__init__.py", line 511, in version
return distribution(distribution_name).version
File "C:\ProgramData\Anaconda3\lib\site-packages\importlib_metadata\__init__.py", line 482, in distribution
return Distribution.from_name(distribution_name)
File "C:\ProgramData\Anaconda3\lib\site-packages\importlib_metadata\__init__.py", line 183, in from_name
dist = next(dists, None)
File "C:\ProgramData\Anaconda3\lib\site-packages\importlib_metadata\__init__.py", line 425, in <genexpr>
for path in map(cls._switch_path, paths)
File "C:\ProgramData\Anaconda3\lib\site-packages\importlib_metadata\__init__.py", line 449, in _search_path
if not root.is_dir():
File "C:\ProgramData\Anaconda3\lib\pathlib.py", line 1351, in is_dir
return S_ISDIR(self.stat().st_mode)
File "C:\ProgramData\Anaconda3\lib\pathlib.py", line 1161, in stat
return self._accessor.stat(self)
OSError: [WinError 123] The file name, directory name or volume label syntax is not correct: 'C:\C:\Enviroments\spark-2.3.2-bin-hadoop2.7\jars\spark-core_2.11-2.3.2.jar'
I printed the environment variables from inside pathlib.py and importlib_metadata.py and got the value of PYTHONPATH, like this:
'PYTHONPATH': 'C:\Enviroments\spark-2.3.2-bin-hadoop2.7\python\lib\pyspark.zip;C:\Enviroments\spark-2.3.2-bin-hadoop2.7\python\lib\py4j-0.10.7-src.zip;/C:/Enviroments/spark-2.3.2-bin-hadoop2.7/jars/spark-core_2.11-2.3.2.jar'
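The third entry is the malformed one. A quick diagnostic sketch to spot it (os.path.exists swallows the WinError and simply reports False):

import os

# print every PYTHONPATH entry and whether it actually exists on disk
for entry in os.environ.get('PYTHONPATH', '').split(os.pathsep):
    print(os.path.exists(entry), entry)
# the '/C:/.../spark-core_2.11-2.3.2.jar' entry comes out False:
# it is rooted but has no drive, so Windows prepends one -> C:\C:\...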
I tried editing the path inside the function, outside the function, every way I could... but at some point Spark serializes the function and edits PYTHONPATH... not in the Python files but in the Java ones, and I can't debug that code, because Spark runs inside a container and, for a lot of complicated reasons with my IDE (IntelliJ IDEA), I can't attach to its IP and port.
What breaks is this slash -> /C:/Enviroments/spark-2.3.2-bin-hadoop2.7/jars/spark-core_2.11-2.3.2.jar. Python interprets that leading slash on Windows as an absolute path and prepends the disk label, /C: => C:/C:/. Then, at execution time, it produces the error, because that path obviously does not exist.
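You can see the parsing half of this outside Spark. A minimal sketch with PureWindowsPath (so it behaves the same on any OS; the jar path is the one from the traceback):

from pathlib import PureWindowsPath

p = PureWindowsPath('/C:/Enviroments/spark-2.3.2-bin-hadoop2.7/jars/'
                    'spark-core_2.11-2.3.2.jar')
print(repr(p.drive))  # '' - the leading slash roots the path without a drive
print(repr(p.root))   # '\\'
# With no drive letter, Windows resolves the rooted path against the
# current drive, so stat() effectively sees 'C:\C:\Enviroments\...'
# and fails with [WinError 123].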
Please help me! Thanks in advance :)
I ran into the same problem using pytest. I don't have a proper solution for the malformed paths on Windows, but you can apply a quick fix:
import os, sys

# drop sys.path entries that don't exist on disk (like the C:\C:\... jar)
for path in list(sys.path):
    if not os.path.exists(path):
        sys.path.remove(path)
At least you will get rid of the error.
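In the PySpark case the cleanup has to run on the worker before the lazy import fires. Here is a sketch of one way to wire it into the mapped function; it assumes (untested) that the bad PYTHONPATH entry ends up in sys.path on the executor, which is where importlib_metadata searches:

import os
import sys

def sanitize_sys_path():
    # drop any sys.path entry that doesn't exist on disk,
    # e.g. the '/C:/...spark-core_2.11-2.3.2.jar' one
    for path in list(sys.path):
        if not os.path.exists(path):
            sys.path.remove(path)

def print_stopwords(x):
    sanitize_sys_path()   # clean up before anything is lazily imported
    import nltk           # now the lazy import chain sees a sane sys.path
    print(nltk.corpus.stopwords.words('english'))
    return x

# prueba = sc.parallelize([0, 1, 2, 3])
# prueba.map(print_stopwords).take(1)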
It works if you use a new conda environment on another drive letter, possibly because the importlib package is installed and used by the base conda environment but is not executed in the new conda environment. No idea why.
A Python conda environment does not work properly together with the base conda env; it needs to be another conda environment on another drive (for example D:, if Conda is installed on the C: drive).
To do this, you can change the PYSPARK_PYTHON environment variables:
os.environ["PYSPARK_PYTHON"]="D:\conda_envs\new_environment\python.exe"
os.environ["SPARK_DRIVER_PYTHON"]="D:\conda_envs\new_environment\python.exe"
Make sure the SPARK_HOME directory is on the same drive letter as the Python or anaconda environment.
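One detail worth adding: PYSPARK_PYTHON and PYSPARK_DRIVER_PYTHON are read when the session (and its JVM) starts, so they have to be set before getOrCreate() runs. A sketch of the ordering, reusing the hypothetical D: paths from above:

import os

# point both driver and workers at the D: interpreter first
os.environ["PYSPARK_PYTHON"] = r"D:\conda_envs\new_environment\python.exe"
os.environ["PYSPARK_DRIVER_PYTHON"] = r"D:\conda_envs\new_environment\python.exe"

# only now create the session, so executors launch with that interpreter
from pyspark.sql import SparkSession
spark = (SparkSession.builder
         .master("local[*]")
         .appName("Nueva")
         .getOrCreate())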
¯\_(ツ)_/¯ whatever
Finally solved it T.T