Can't import installed python modules in spark cluster offered by Azure Databricks
I have just started running Python notebooks on the Spark cluster offered by Azure Databricks. As required, we installed several external packages such as spacy and kafka, both through shell commands and through the 'Create library' UI in the Databricks workspace.
python -m spacy download en_core_web_sm
However, every time we run 'import ', the cluster throws a 'Module not found' error.
OSError: Can't find Model 'en_core_web_sm'
On top of that, we can't seem to find out exactly where these modules were installed. Even after adding the module path to 'sys.path', the problem persists.
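For reference, this is roughly what we tried when checking where the packages resolved from and patching sys.path (the appended path below is illustrative, not a confirmed install location):
import sys

# Check which interpreter and module search paths the notebook is actually using
print(sys.executable)
print(sys.path)

# Illustrative only: append the directory we assumed the packages were installed into
sys.path.append("/databricks/python3/lib/python3.8/site-packages")

import spacy  # this is the import that still raises 'Module not found' on our cluster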
Please let us know how to resolve this as soon as possible.
You can follow the steps below to install and load the spaCy package on Azure Databricks.
Step 1: Install spaCy with pip and download the spaCy model.
%sh
/databricks/python3/bin/pip install spacy
/databricks/python3/bin/python3 -m spacy download en_core_web_sm
Notebook output:
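If you want to confirm that the packages actually landed in the interpreter the notebook uses, a quick sanity check under the same path assumptions as Step 1 (adjust the paths if your runtime differs) is:
%sh
/databricks/python3/bin/pip show spacy
/databricks/python3/bin/python3 -c "import spacy; print(spacy.__file__)"
pip show prints the installed version and its Location, which is the directory the notebook's import needs to see.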
Step 2: Run an example that uses spaCy.
import spacy
# Load English tokenizer, tagger, parser, NER and word vectors
nlp = spacy.load("en_core_web_sm")
# Process whole documents
text = ("When Sebastian Thrun started working on self-driving cars at "
"Google in 2007, few people outside of the company took him "
"seriously. “I can tell you very senior CEOs of major American "
"car companies would shake my hand and turn away because I wasn’t "
"worth talking to,” said Thrun, in an interview with Recode earlier "
"this week.")
doc = nlp(text)
# Analyze syntax
print("Noun phrases:", [chunk.text for chunk in doc.noun_chunks])
print("Verbs:", [token.lemma_ for token in doc if token.pos_ == "VERB"])
# Find named entities, phrases and concepts
for entity in doc.ents:
    print(entity.text, entity.label_)
Notebook output:
Hope this helps. Let us know if you have any questions.
Please click "Mark as Answer" and up-vote the post that helped you; it may benefit other community members.
Install the spacy "en_core_web_sm" model with
%sh python -m spacy download en_core_web_sm
Import the model with
import en_core_web_sm
nlp = en_core_web_sm.load()
doc = nlp("My name is Raghu Ram. I live in Kolkata.")
for ent in doc.ents:
    print(ent.text, ent.label_)
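If the model package may not be present yet on the driver, a hedged variant is to fall back to spaCy's programmatic download helper (spacy.cli.download) before loading, using the same model name as above:
import spacy

try:
    nlp = spacy.load("en_core_web_sm")
except OSError:
    # Model not installed in this interpreter: download it, then load again.
    # On some Databricks runtimes you may need to detach and reattach the
    # notebook before the freshly installed package becomes importable.
    spacy.cli.download("en_core_web_sm")
    nlp = spacy.load("en_core_web_sm")

doc = nlp("My name is Raghu Ram. I live in Kolkata.")
print([(ent.text, ent.label_) for ent in doc.ents])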
When creating the cluster, use a Databricks ML Runtime distribution: https://docs.databricks.com/runtime/mlruntime.html
Then you can install spacy from the Install Library UI (just go to cluster/libraries and install as usual), or via %sh, %pip, or %conda, for example as in the sketch below.
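As a minimal sketch of one of those options (%pip gives a notebook-scoped install on Databricks):
%pip install spacy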
Then load the English corpus:
%python
import spacy
spacy.cli.download("en_core_web_lg")
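After the download finishes, a minimal follow-up sketch to load and use the corpus (same model name as above; the sample sentence is just for illustration):
%python
import spacy

nlp = spacy.load("en_core_web_lg")
doc = nlp("Sebastian Thrun started working on self-driving cars at Google in 2007.")
print([(ent.text, ent.label_) for ent in doc.ents])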