Any way to import Python's nltk.download('punkt') into Google Cloud Functions?

Question

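Is there any way to import Python's nltk.download('punkt') into Google Cloud Functions? I've found that manually adding the statement to my code in main.py significantly slows down my function, because punkt has to be downloaded on every run. Is there a way to avoid this by making punkt available some other way?

Edit #1: I changed my code and project structure to match Barak's suggestion, but I keep getting the same error: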

Error: function terminated. Recommended action: inspect logs for termination reason. Details:

**********************************************************************
  Resource punkt not found.
  Please use the NLTK Downloader to obtain the resource:

  >>> import nltk
  >>> nltk.download('punkt')

  For more information see: https://www.nltk.org/data.html

  Attempted to load tokenizers/punkt/PY3/english.pickle

  Searched in:
    - '/tmp/nltk_data'
    - '/env/nltk_data'
    - '/env/share/nltk_data'
    - '/env/lib/nltk_data'
    - '/usr/share/nltk_data'
    - '/usr/local/share/nltk_data'
    - '/usr/lib/nltk_data'
    - '/usr/local/lib/nltk_data'
    - ''
**********************************************************************

Answer 1

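See the instructions on uploading files with your Cloud Function. Specifically, since you can upload files, you can configure nltk to use only those files:

Following the official NLTK documentation, you can "Set your NLTK_DATA environment variable to point to your top level nltk_data folder."

Putting these together, you get:

Download the data with python -m nltk.downloader punkt
Upload the NLTK directory (the documentation above explains how to find its path on your computer) as an nltk_data directory created at the root of your function's environment

Configure your code to look for that folder: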

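import os
# Directory that contains this source file
root = os.path.dirname(os.path.abspath(__file__))
nltk_dir = os.path.join(root, 'nltk_data')  # Your folder name here
# Set this before importing nltk; nltk reads NLTK_DATA at import time
os.environ['NLTK_DATA'] = nltk_dir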
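
One possible pitfall with this approach: as far as I know, NLTK builds its search path from NLTK_DATA when the package is first imported, so the variable needs to be set before `import nltk` runs. Here is a minimal sketch of that ordering, assuming the data sits in a folder next to the source file. The helper name `set_nltk_data_env` is made up for illustration.

```python
import os


def set_nltk_data_env(base_file, folder="nltk_data"):
    """Point NLTK_DATA at `folder` next to `base_file` and return that path."""
    root = os.path.dirname(os.path.abspath(base_file))
    nltk_dir = os.path.join(root, folder)
    os.environ["NLTK_DATA"] = nltk_dir
    return nltk_dir


# In main.py, call the helper first and import nltk only afterwards:
#     set_nltk_data_env(__file__)
#     import nltk
```

If nltk has already been imported somewhere else in the process, changing the variable afterwards will probably have no effect.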

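Edit: It seems that exporting the path through the environment variable did not have the desired effect, so let's set the path explicitly in code.

Download the data on your computer: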

import os
import nltk

download_dir = os.path.abspath('my_nltk_dir')
os.makedirs(download_dir, exist_ok=True)  # no error if the folder already exists
nltk.download('punkt', download_dir=download_dir)

Add the my_nltk_dir directory to the same folder as your Python script. The layout will be:
```
PROJECT_ROOT/
|-- my_code.py
|-- my_nltk_dir/
    |-- ...
```
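
Before deploying, it may be worth checking that the folder really contains the file the code will try to load. This is a rough sketch, assuming the English Punkt model lives at tokenizers/punkt/english.pickle inside the data directory (the path used in this answer). The name `missing_resources` is hypothetical.

```python
import os

# Relative location of the English Punkt model inside an NLTK data directory
# (an assumption based on the path used in this answer)
PUNKT_RELPATH = os.path.join("tokenizers", "punkt", "english.pickle")


def missing_resources(data_dir, relpaths=(PUNKT_RELPATH,)):
    """Return the relative paths that are not present under data_dir."""
    return [
        rel for rel in relpaths
        if not os.path.isfile(os.path.join(data_dir, rel))
    ]
```

Calling `missing_resources('my_nltk_dir')` from the project root should return an empty list once the download has been placed correctly.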

Reference the data in your code:

import os
import nltk.data

root = os.path.dirname(os.path.abspath(__file__))
download_dir = os.path.join(root, 'my_nltk_dir')
# Load the tokenizer directly from the bundled directory
nltk.data.load(
    os.path.join(download_dir, 'tokenizers/punkt/english.pickle')
)
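
Since the original concern was per-invocation cost, one option is to load the tokenizer only once per function instance and reuse it on later calls. The sketch below assumes an HTTP-triggered function whose request body is plain text. `punkt_pickle_path`, `get_tokenizer` and `handler` are illustrative names, not part of NLTK or Cloud Functions.

```python
import functools
import os


def punkt_pickle_path(base_file, folder="my_nltk_dir"):
    """Absolute path of the bundled English Punkt pickle next to base_file."""
    root = os.path.dirname(os.path.abspath(base_file))
    return os.path.join(root, folder, "tokenizers/punkt/english.pickle")


@functools.lru_cache(maxsize=None)
def get_tokenizer(pickle_path):
    """Load the tokenizer on first use; later calls reuse the cached object."""
    import nltk.data  # imported lazily, on the first call

    return nltk.data.load(pickle_path)


def handler(request):
    """Hypothetical HTTP entry point; assumes the request body is plain text."""
    tokenizer = get_tokenizer(punkt_pickle_path(__file__))
    sentences = tokenizer.tokenize(request.get_data(as_text=True))
    return "\n".join(sentences)
```

The cache lives in module state, so it only helps while the same instance stays warm. A cold start still pays the load cost once.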

Answer 2

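Add nltk to your requirements.txt.

Install nltk on your local machine if you haven't already:

pip install nltk

Then download the nltk_data files. For tokenizing, I needed the Punkt tokenizer module:

python -m nltk.downloader punkt

Copy them (on Windows they are under Roaming/) to your root folder (i.e. alongside your function):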

cp -r C:\Users\<USER>\AppData\Roaming\nltk_data\* YOUR\ROOT\FOLDER\nltk_data\
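
If `cp` is not available in your shell, the same copy can be done from Python. This is only a sketch: `copy_nltk_data` is a made-up helper, and both paths are whatever applies on your machine.

```python
import os
import shutil


def copy_nltk_data(src_dir, project_root, folder="nltk_data"):
    """Copy the contents of src_dir into project_root/folder and return the target."""
    target = os.path.join(project_root, folder)
    # dirs_exist_ok (Python 3.8+) lets the copy merge into an existing folder
    shutil.copytree(src_dir, target, dirs_exist_ok=True)
    return target
```

For example, `copy_nltk_data(downloaded_dir, '.')` run from the function's root folder would create ./nltk_data, where `downloaded_dir` is the folder the downloader wrote to.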

At the start of your main Python function, or right before you use nltk, add the following code. Basically, it gets the path where nltk_data is and tells nltk to look in this folder:

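  import os
  import nltk
  # Folder that contains this file; nltk_data sits next to it
  root = os.path.dirname(os.path.abspath(__file__))
  download_dir = os.path.join(root, 'nltk_data')
  os.chdir(download_dir)
  nltk.data.path.append(download_dir)  # make nltk search the bundled folder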
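
A slightly more defensive variant is sketched below. It fails early if the folder was not deployed, avoids adding the same entry twice on repeated calls, and leaves the working directory alone, on the assumption that extending `nltk.data.path` is enough on its own. The helper name `add_bundled_data_dir` is hypothetical.

```python
import os


def add_bundled_data_dir(search_path, base_file, folder="nltk_data"):
    """Append the folder next to base_file to search_path (once) and return it."""
    root = os.path.dirname(os.path.abspath(base_file))
    data_dir = os.path.join(root, folder)
    if not os.path.isdir(data_dir):
        raise FileNotFoundError(f"Bundled NLTK data folder not found: {data_dir}")
    if data_dir not in search_path:
        search_path.append(data_dir)
    return data_dir
```

In the function this would be called as `add_bundled_data_dir(nltk.data.path, __file__)` after importing nltk.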

Finally, after committing/pushing (if you are using Cloud Source Repos), (re)deploy your function!

python

nltk

google-cloud-platform

google-cloud-functions