Building a corpus from Wikipedia: ModuleNotFoundError: No module named 'gensim'

I copied a simple Python script from Building a Wikipedia Text Corpus for Natural Language Processing. It uses gensim to build a corpus by stripping all Wikipedia markup from the articles. This is the code:

"""
Creates a corpus from Wikipedia dump file.
Inspired by:
https://github.com/panyang/Wikipedia_Word2vec/blob/master/v1/process_wiki.py
"""

import sys
from gensim.corpora import WikiCorpus

def make_corpus(in_f, out_f):
    """Convert Wikipedia xml dump file to text corpus"""
    wiki = WikiCorpus(in_f)

    # Open the output as UTF-8 text so each article can be written directly.
    with open(out_f, 'w', encoding='utf-8') as output:
        i = 0
        for text in wiki.get_texts():
            # Each item from get_texts() is one article as a list of tokens.
            output.write(' '.join(text) + '\n')
            i = i + 1
            if i % 10000 == 0:
                print('Processed ' + str(i) + ' articles')
    print('Processing complete!')


if __name__ == '__main__':

    if len(sys.argv) != 3:
        print('Usage: python make_wiki_corpus.py <wikipedia_dump_file> <processed_text_file>')
        sys.exit(1)
    in_f = sys.argv[1]
    out_f = sys.argv[2]
    make_corpus(in_f, out_f)

However, when I run it I get this error:

ModuleNotFoundError: No module named 'gensim'

even though I have already installed the gensim package:

python3 -m pip install gensim

Edit: if I try

pip install -U gensim

I get the error

 ImportError: cannot import name 'SourceDistribution' from
 'pip._internal.distributions.source' (C:\Users\Standard\Anaconda3\lib\site-packages\pip\_internal\distributions\source\__init__.py)

The gensim module is not installed on your system. Install it with:

pip install -U gensim

or download it from https://pypi.python.org/pypi/gensim.

gensim depends on scipy and numpy. You must install them before installing gensim.
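Since `pip` and `python3` can point at different installations (the traceback in the question mentions an Anaconda path), it may help to ask the interpreter that runs the script what it can actually import. The following is only a minimal sketch; `find_missing` is a hypothetical helper name, not part of gensim or pip.

```python
import importlib.util
import sys


def find_missing(modules):
    """Return the names in `modules` that the running interpreter cannot find."""
    # find_spec() returns None when a top-level module is not importable.
    return [name for name in modules if importlib.util.find_spec(name) is None]


if __name__ == "__main__":
    # This is the interpreter that needs to have gensim installed.
    print("Interpreter:", sys.executable)
    missing = find_missing(["numpy", "scipy", "gensim"])
    if missing:
        print("Not importable here:", ", ".join(missing))
        # Installing through this same interpreter avoids a pip/python mismatch.
        print("Try:", sys.executable, "-m pip install", " ".join(missing))
    else:
        print("numpy, scipy and gensim are all importable.")
```

Run it with the same command you use for the corpus script. If gensim shows up as missing, the package was probably installed into a different Python environment.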

There is a bug in pip 20.0.0. Upgrade to 20.0.1 with:

python get-pip.py

or downgrade to 19.3.1:

python get-pip.py pip==19.3.1
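
To see whether your installation is the affected one before reinstalling pip, something like the sketch below could be used. It assumes Python 3.8 or later (for `importlib.metadata`) and assumes the broken release reports its version as "20.0" or "20.0.0"; `pip_is_affected` is a hypothetical helper, not a pip API.

```python
from importlib.metadata import PackageNotFoundError, version


def pip_is_affected(pip_version):
    """Return True if the version string names the release reported as broken."""
    # Assumption: the affected release identifies itself as "20.0" or "20.0.0".
    return pip_version.strip() in {"20.0", "20.0.0"}


if __name__ == "__main__":
    try:
        # Reads the installed package metadata without importing pip itself.
        installed = version("pip")
    except PackageNotFoundError:
        print("pip is not installed for this interpreter.")
    else:
        state = "affected" if pip_is_affected(installed) else "not affected"
        print("pip", installed, "is", state)
```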