PYTHONPATH error [WinError 123] with PySpark when using lazily-imported libraries like NLTK or Pattern: duplicated disk label 'C://C://..spark-core_2.11-2.3.2.jar'

The problem is with Windows paths and lazily imported libraries like NLTK: nltk and pattern import their own submodules only at the moment they are used, and at that point the modules importlib_metadata.py and pathlib.py try to read a value from PYTHONPATH that is malformed (a doubled drive label, C:/C:/), and the code blows up.
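To make "lazily imported" concrete, a small illustration (my sketch, assuming a standard NLTK install with the stopwords corpus downloaded): nltk exposes its corpora through LazyCorpusLoader placeholders, so part of the import work is deferred until first use.

import nltk

# At this point the corpus object is only a lazy placeholder...
print(type(nltk.corpus.stopwords).__name__)  # LazyCorpusLoader

# ...the real loading (and any imports it drags in) happens on first access.
print(nltk.corpus.stopwords.words('english')[:5])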

First, we have a simple function like this:

import nltk

def print_stopwords():
    # Prints the English stopword list; the corpus loads on first access.
    print(nltk.corpus.stopwords.words('english'))

In local mode you can run this and you get all the stopwords. OK.

If you want to use this function inside a Spark map, as part of a PySpark workflow, the code above does not work. Why? I really don't know...

I think the reason it doesn't work is that the Spark Java libraries use and modify PYTHONPATH when executing a map function like the following:

import nltk
from pyspark.sql import SQLContext, SparkSession

spark = (SparkSession
         .builder
         .master("local[*]")
         .appName("Nueva")
         .getOrCreate())

sc = spark.sparkContext
sqlContext = SQLContext(sc)

def print_stopwords(x):
    # Runs on the executors: each worker process re-imports nltk when it
    # deserializes this function, under whatever PYTHONPATH Spark has set.
    print("\n", x)
    print(nltk.corpus.stopwords.words('english'))
    return x

prueba = sc.parallelize([0,1,2,3])
r = prueba.map(print_stopwords)
r.take(1)

I get this error:

  File "C:\ProgramData\Anaconda3\lib\site-packages\nltk\__init__.py", line 143, in <module>
    from nltk.chunk import *
  File "C:\ProgramData\Anaconda3\lib\site-packages\nltk\chunk\__init__.py", line 157, in <module>
    from nltk.chunk.api import ChunkParserI
  File "C:\ProgramData\Anaconda3\lib\site-packages\nltk\chunk\api.py", line 13, in <module>
    from nltk.parse import ParserI
  File "C:\ProgramData\Anaconda3\lib\site-packages\nltk\parse\__init__.py", line 100, in <module>
    from nltk.parse.transitionparser import TransitionParser
  File "C:\ProgramData\Anaconda3\lib\site-packages\nltk\parse\transitionparser.py", line 22, in <module>
    from sklearn.datasets import load_svmlight_file
  File "C:\ProgramData\Anaconda3\lib\site-packages\sklearn\datasets\__init__.py", line 22, in <module>
    from .twenty_newsgroups import fetch_20newsgroups
  File "C:\ProgramData\Anaconda3\lib\site-packages\sklearn\datasets\twenty_newsgroups.py", line 44, in <module>
    from ..feature_extraction.text import CountVectorizer
  File "C:\ProgramData\Anaconda3\lib\site-packages\sklearn\feature_extraction\__init__.py", line 10, in <module>
    from . import text
  File "C:\ProgramData\Anaconda3\lib\site-packages\sklearn\feature_extraction\text.py", line 28, in <module>
    from ..preprocessing import normalize
  File "C:\ProgramData\Anaconda3\lib\site-packages\sklearn\preprocessing\__init__.py", line 6, in <module>
    from ._function_transformer import FunctionTransformer
  File "C:\ProgramData\Anaconda3\lib\site-packages\sklearn\preprocessing\_function_transformer.py", line 5, in <module>
    from ..utils.testing import assert_allclose_dense_sparse
  File "C:\ProgramData\Anaconda3\lib\site-packages\sklearn\utils\testing.py", line 718, in <module>
    import pytest
  File "C:\ProgramData\Anaconda3\lib\site-packages\pytest.py", line 6, in <module>
    from _pytest.assertion import register_assert_rewrite
  File "C:\ProgramData\Anaconda3\lib\site-packages\_pytest\assertion\__init__.py", line 7, in <module>
    from _pytest.assertion import rewrite
  File "C:\ProgramData\Anaconda3\lib\site-packages\_pytest\assertion\rewrite.py", line 26, in <module>
    from _pytest.assertion import util
  File "C:\ProgramData\Anaconda3\lib\site-packages\_pytest\assertion\util.py", line 8, in <module>
    import _pytest._code
  File "C:\ProgramData\Anaconda3\lib\site-packages\_pytest\_code\__init__.py", line 2, in <module>
    from .code import Code  # noqa
  File "C:\ProgramData\Anaconda3\lib\site-packages\_pytest\_code\code.py", line 23, in <module>
    import pluggy
  File "C:\ProgramData\Anaconda3\lib\site-packages\pluggy\__init__.py", line 16, in <module>
    from .manager import PluginManager, PluginValidationError
  File "C:\ProgramData\Anaconda3\lib\site-packages\pluggy\manager.py", line 11, in <module>
    import importlib_metadata
  File "C:\ProgramData\Anaconda3\lib\site-packages\importlib_metadata\__init__.py", line 549, in <module>
    __version__ = version(__name__)
  File "C:\ProgramData\Anaconda3\lib\site-packages\importlib_metadata\__init__.py", line 511, in version
    return distribution(distribution_name).version
  File "C:\ProgramData\Anaconda3\lib\site-packages\importlib_metadata\__init__.py", line 482, in distribution
    return Distribution.from_name(distribution_name)
  File "C:\ProgramData\Anaconda3\lib\site-packages\importlib_metadata\__init__.py", line 183, in from_name
    dist = next(dists, None)
  File "C:\ProgramData\Anaconda3\lib\site-packages\importlib_metadata\__init__.py", line 425, in <genexpr>
    for path in map(cls._switch_path, paths)
  File "C:\ProgramData\Anaconda3\lib\site-packages\importlib_metadata\__init__.py", line 449, in _search_path
    if not root.is_dir():
  File "C:\ProgramData\Anaconda3\lib\pathlib.py", line 1351, in is_dir
    return S_ISDIR(self.stat().st_mode)
  File "C:\ProgramData\Anaconda3\lib\pathlib.py", line 1161, in stat
    return self._accessor.stat(self)
OSError: [WinError 123] The file name, directory name or volume label syntax is not correct: 'C:\C:\Enviroments\spark-2.3.2-bin-hadoop2.7\jars\spark-core_2.11-2.3.2.jar'

I printed the environment variables from inside pathlib.py and importlib_metadata.py, and the value of PYTHONPATH looks like this:

'PYTHONPATH': 'C:\Enviroments\spark-2.3.2-bin-hadoop2.7\python\lib\pyspark.zip;C:\Enviroments\spark-2.3.2-bin-hadoop2.7\python\lib\py4j-0.10.7-src.zip;/C:/Enviroments/spark-2.3.2-bin-hadoop2.7/jars/spark-core_2.11-2.3.2.jar'
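Reading the bottom of the traceback against that value, what seems to happen (my sketch of the behaviour, not importlib_metadata's exact code) is that importing importlib_metadata probes every sys.path entry while resolving its own version, so the one malformed entry is enough to crash everything:

import sys
from pathlib import Path

# Rough equivalent of the probing visible in the traceback:
# each sys.path entry ends up being stat()ed via is_dir().
for entry in sys.path:
    Path(entry).is_dir()  # OSError [WinError 123] on 'C:\C:\...'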

I have tried editing the path inside the function, outside it, every way I can think of... but at some point Spark serializes the function and edits PYTHONPATH... not in the Python files but in the Java ones, and I can't debug that code, because Spark runs inside a container and, for many complicated reasons to do with my IDE (IntelliJ IDEA), I can't attach to its IP and port.

The reason it doesn't work is this leading slash -> /C:/Enviroments/spark-2.3.2-bin-hadoop2.7/jars/spark-core_2.11-2.3.2.jar. On Windows, Python interprets the leading slash as a rooted path and prepends the drive label to it, /C: => C:/C:/. Then at execution time it produces the error, since that path obviously does not exist.
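The doubling can be reproduced with pathlib alone (PureWindowsPath applies Windows path semantics on any OS, so this is just an illustration):

from pathlib import PureWindowsPath

# A rooted path with no drive ('/C:/...') keeps the drive of the base
# path it is joined onto, which doubles the drive label:
p = PureWindowsPath("C:/") / "/C:/Enviroments/spark-2.3.2-bin-hadoop2.7/jars/spark-core_2.11-2.3.2.jar"
print(p)  # C:\C:\Enviroments\spark-2.3.2-bin-hadoop2.7\jars\spark-core_2.11-2.3.2.jar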

Please help me! Thanks in advance :)

I ran into the same problem when using pytest, and I don't have a proper solution for the malformed paths on Windows, but you can apply a quick fix:

import os
import sys

# Remove every sys.path entry that doesn't exist on disk,
# including the malformed 'C:\C:\...' one.
for path in list(sys.path):
    if not os.path.exists(path):
        sys.path.remove(path)

That way you'll at least get rid of the error.
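For the PySpark case, a variant of the same idea (my assumption, not part of the original answer: run it on the driver before the SparkContext is created, so spawned workers inherit the cleaned value) is to scrub PYTHONPATH itself:

import os

# Drop non-existent entries (like the doubled-drive jar path) from
# PYTHONPATH so worker processes inherit a clean value.
entries = os.environ.get("PYTHONPATH", "").split(os.pathsep)
os.environ["PYTHONPATH"] = os.pathsep.join(
    p for p in entries if p and os.path.exists(p))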

Using a new conda environment on another drive letter works, perhaps because the importlib_metadata package is installed and used in the base conda environment but never gets exercised in the new one. I don't know why.

A Python conda environment won't work properly alongside the base conda env; it needs to be a separate conda environment on another drive (e.g. on D: if conda is installed on the C: drive).

To do that, you can change the PYSPARK_PYTHON environment variable:

os.environ["PYSPARK_PYTHON"]="D:\conda_envs\new_environment\python.exe"
os.environ["SPARK_DRIVER_PYTHON"]="D:\conda_envs\new_environment\python.exe"

Make sure the SPARK_HOME directory is on the same drive letter as the Python/Anaconda environment.
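One caveat worth adding (general PySpark behaviour, not something stated in the answer above): set these variables before creating the SparkSession, because the worker interpreter is chosen when the context launches:

import os
from pyspark.sql import SparkSession

# Environment first, session second: executors read PYSPARK_PYTHON at launch.
os.environ["PYSPARK_PYTHON"] = r"D:\conda_envs\new_environment\python.exe"
spark = SparkSession.builder.master("local[*]").appName("Nueva").getOrCreate()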

¯\_(ツ)_/¯ whatever

Finally solved it T.T