Pipeline for text cleaning / processing in Python

Question

I'm new to the Python environment (Jupyter Notebook) and I'm trying to process a relatively large amount of text data. I want to apply the following steps, in this order:

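strip whitespace, lowercase, stemming, remove punctuation but keep intra-word dashes or hyphens, remove stopwords, remove symbols, strip whitespace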

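I was hoping to get a single function that does the whole job instead of running each step separately. Is there a single library and/or function that can help? If not, what is the simplest way to define one function that performs all of the steps in a single run?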

Answer 1

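As mentioned in the comments, this can be done with a combination of several Python libraries. A single function that performs all of the steps could look like this: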

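# One-time setup, if not done yet: nltk.download('punkt'); nltk.download('stopwords')
# (newer NLTK releases may ask for 'punkt_tab' instead of 'punkt')
import nltk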
import re
import string
from nltk.tokenize import word_tokenize, sent_tokenize
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer # or LancasterStemmer, RegexpStemmer, SnowballStemmer

default_stemmer = PorterStemmer()
default_stopwords = set(stopwords.words('english')) # or any other collection of your choice; a set keeps lookups fast

def clean_text(text):

    def tokenize_text(text):
        return [w for s in sent_tokenize(text) for w in word_tokenize(s)]

    def remove_special_characters(text, characters=string.punctuation.replace('-', '')):
        tokens = tokenize_text(text)
        pattern = re.compile('[{}]'.format(re.escape(characters)))
        return ' '.join(filter(None, [pattern.sub('', t) for t in tokens]))

    def stem_text(text, stemmer=default_stemmer):
        tokens = tokenize_text(text)
        return ' '.join([stemmer.stem(t) for t in tokens])

    def remove_stopwords(text, stop_words=default_stopwords):
        tokens = [w for w in tokenize_text(text) if w not in stop_words]
        return ' '.join(tokens)

    text = text.strip() # strip leading and trailing whitespace
    text = text.lower() # lowercase
    text = stem_text(text) # stemming
    text = remove_special_characters(text) # remove punctuation and symbols, keeping '-'
    text = remove_stopwords(text) # remove stopwords
    # no final strip is needed: each helper re-joins its tokens with single spaces

    return text

Usage (tested with Python 2.7, but it should also work in Python 3):

text = '  Test text !@$%$(%)^   just words and word-word'
clean_text(text)

Result:

u'test text word word-word'
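
Note that the function above keeps every '-' character, not only the ones inside a word. If only intra-word hyphens should survive, one possible approach is a second regular expression that drops hyphens with no word character on one side. The following is a minimal sketch using only the standard library; `strip_punctuation_keep_hyphens` is a hypothetical helper name, not part of NLTK, and treating a double hyphen as "not intra-word" is an assumption.

```python
import re
import string

# Hypothetical helper (not part of NLTK): removes punctuation but keeps
# hyphens that sit directly between two word characters, e.g. "word-word".
_PUNCT_EXCEPT_HYPHEN = re.compile(
    '[{}]'.format(re.escape(string.punctuation.replace('-', ''))))
# A run of hyphens that is not preceded or not followed by a word character.
_STRAY_HYPHEN = re.compile(r'(?<!\w)-+|-+(?!\w)')


def strip_punctuation_keep_hyphens(text):
    text = _PUNCT_EXCEPT_HYPHEN.sub(' ', text)  # drop all punctuation except '-'
    text = _STRAY_HYPHEN.sub(' ', text)         # drop hyphens not inside a word
    return ' '.join(text.split())               # collapse runs of whitespace


print(strip_punctuation_keep_hyphens(' - well-known -- state-of-the-art! -end end- '))
# prints: well-known state-of-the-art end end
```

Such a helper could stand in for remove_special_characters inside clean_text if this stricter behaviour is wanted.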

Answer 2

Alternatively, you can use my pipeline creator class for text data, which I finished recently. Find it on GitHub here. demo_pipe.py covers pretty much everything you want to do.
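
The linked class is not reproduced here. As a rough illustration of the general idea only (an assumption, not the API of that repository), a pipeline can be built by composing small text-to-text functions and applying them in order. `make_pipeline` and the step functions below are hypothetical names, and the stopword set is a tiny stand-in for a real list such as NLTK's.

```python
import re
from functools import reduce

# Tiny illustrative stopword set; a real pipeline would use a fuller list.
STOPWORDS = {'a', 'an', 'and', 'just', 'the'}


def make_pipeline(*steps):
    """Return a function that applies the given str -> str steps in order."""
    def run(text):
        return reduce(lambda acc, step: step(acc), steps, text)
    return run


def lowercase(text):
    return text.lower()


def remove_symbols(text):
    # Keep word characters, whitespace and hyphens; replace the rest with spaces.
    return re.sub(r'[^\w\s-]', ' ', text)


def remove_stopwords(text):
    # Splitting and re-joining also collapses repeated whitespace.
    return ' '.join(w for w in text.split() if w not in STOPWORDS)


clean = make_pipeline(str.strip, lowercase, remove_symbols, remove_stopwords)
print(clean('  Test text !@$%$(%)^   just words and word-word'))
# prints: test text words word-word
```

Because each step is an ordinary function, steps can be reordered, dropped, or swapped (for example, adding a stemming step) without touching the others.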
