x.findall function returns a value but won't write to pandas data frame

Question

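I created a function that searches an nltk.text.Text object, and when I run the function it appears to return a value.

Update: The problem seems to be that, in the function below, the 'donation' variable is never actually assigned a value. The text.findall call produces output, but for some reason it does not update the variable.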

def find_donation_orgs(x):
    text = nltk.Text(nltk.word_tokenize(x))
    donation =  text.findall(r"<\.> <.*>{,15}? <donat.*|contrib.*|Donat.*|Contrib.*> <.*>*? <to> (<.*>+?) <\.|\,|\;> ")
    return donation

For the input below, the output looks like this, but I think it comes from text.findall itself rather than from the actual "return donation".

a = "This is a sentence. I also donate to Mr. T's Tea Party. I contribute to the Boys and Girls club. "

find_donation_orgs(a)

Output:

Mr. T 's Tea Party
the Boys and Girls club

However, when I try to apply the function so that its output is written to a new column in a pandas data frame, it returns None. See below:

df['donation_orgs'] = df.apply(lambda row: find_donation_orgs(row['Obit']), axis = 1)

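where df['Obit'] is a string of text, similar to my variable a above.

Update: So the output of text.findall does not seem to update the value of the variable it is assigned to. I need to figure out how to actually assign that output to a variable so that I can return it to the data frame. See below: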

text = df.text.iloc[1]

textfindall = text.findall(r"<\.> <.*>{,15}? <donat.*|contrib.*|Donat.*|Contrib.*> <.*>*? <to> (<.*>+?) <\.|\,|\;> ")

print('text is ' + str(type(text)))
print('textfindall is ' + str(type(textfindall)))
print(textfindall)

Output:

visit brother Alfred Fuller; the research of Dr. Giuseppe Giaccone at
Georgetown University
text is <class 'nltk.text.Text'>
textfindall is <class 'NoneType'>
None

Answer 1

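Try debugging your code by checking what your function actually receives and what it returns. You can use a debugger (available in most IDEs), or use the function's return value to determine whether the problem lies in your function or in the pandas call.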

def find_donation_orgs(x):
    return x

Make sure your input is what you expect.

def find_donation_orgs(x):
    return nltk.Text(nltk.word_tokenize(x))

See how it is tokenized.

def find_donation_orgs(x):
    text = nltk.Text(nltk.word_tokenize(x))
    all_occurrences = text.findall(r"<\.> <.*>{,15}? <donat.*|contrib.*|Donat.*|Contrib.*> <.*>*? <to> (<.*>+?) <\.|\,|\;> ")
    if all_occurrences is None:
        return "no occurrences"
    else:
        return all_occurrences

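Check whether your regular expression is at fault. If it is, go back to the tokenizer output to try to fix the regex.

Update

Looking at the source code of the nltk.Text object, it seems the findall method doesn't actually return anything; it prints the results instead: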

def findall(self, regexp):
    if "_token_searcher" not in self.__dict__:
        self._token_searcher = TokenSearcher(self)

    hits = self._token_searcher.findall(regexp)
    hits = [' '.join(h) for h in hits]
    print(tokenwrap(hits, "; "))

This is because the Text object is only intended to be used via the interactive console:

A wrapper around a sequence of simple (string) tokens, which is intended to support initial exploration of texts (via the interactive console). [...] If you wish to write a program which makes use of these analyses, then you should bypass the Text class, and use the appropriate analysis function or class directly instead.
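
The symptom in the question (text on screen, None in the variable) can be reproduced without NLTK. The sketch below uses a hypothetical stand-in function, print_only_findall, which is not part of NLTK and only mimics the "print instead of return" pattern described above:

```python
import contextlib
import io


def print_only_findall(tokens, word):
    """Hypothetical stand-in: prints its hits and has no return statement."""
    hits = [t for t in tokens if t == word]
    print("; ".join(hits))  # written to stdout, never handed back to the caller


# Capture stdout so we can compare what was printed with what was returned.
buffer = io.StringIO()
with contextlib.redirect_stdout(buffer):
    result = print_only_findall(["a", "b", "a"], "a")

printed = buffer.getvalue()
# `printed` holds the text that would have appeared in the console;
# `result` is None because the function never returns anything.
```

Assigning the result of such a call to a variable, or to a data frame column through apply, therefore stores None even though output appears on screen.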

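Your function should look like this:

import nltk
from nltk.util import tokenwrap

def find_donation_orgs(x):
    # TokenSearcher.findall returns the hits instead of printing them
    searcher = nltk.TokenSearcher(nltk.word_tokenize(x))
    hits = searcher.findall(r"<\.> <.*>{,15}? <donat.*|contrib.*|Donat.*|Contrib.*> <.*>*? <to> (<.*>+?) <\.|\,|\;> ")

    # Each hit is a list of tokens; join them, then wrap them as Text.findall does
    hits = [' '.join(h) for h in hits]
    donation = tokenwrap(hits, "; ")
    return donation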

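This replicates the original behavior, except that the value is actually returned. Of course, once you have the hits list, you may want to format the output differently.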
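
As one possible way to format the hits differently, here is a minimal sketch that skips tokenwrap and uses plain string joins. The sample_hits data is hypothetical, modeled on the output shown in the question, and assumes each hit is a list of tokens; format_hits is an illustrative helper name, not an NLTK function:

```python
def format_hits(hits, sep="; "):
    """Join each hit (a list of tokens) into a string, then join all hits with sep."""
    return sep.join(" ".join(tokens) for tokens in hits)


# Hypothetical hits, shaped as a list of token lists.
sample_hits = [
    ["Mr.", "T", "'s", "Tea", "Party"],
    ["the", "Boys", "and", "Girls", "club"],
]

# Option 1: keep one string per organization (convenient for a list-valued column).
as_strings = [" ".join(tokens) for tokens in sample_hits]

# Option 2: a single delimited string per row.
as_text = format_hits(sample_hits)
```

Keeping a list per row makes it easier to count or expand the organizations later, while a single delimited string is simpler to read or export.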
