How to remove list special characters ("()", "'", ",") from the output of a bi-/tri-gram in Python

Question

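I have written some code that computes bigram/trigram frequencies from a text input using NLTK. The problem I am facing is that, because the output comes back as a Python list of tuples, it contains the list-specific characters ("()", "'", ","). I plan to export this to a csv file, so I would like to remove these special characters at the code level. How can I make this edit?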

Input code:

import nltk
from nltk import word_tokenize, pos_tag
from nltk.collocations import *
from itertools import *
from nltk.util import ngrams
from nltk.corpus import stopwords

corpus = '''The pure amnesia of her face,
newborn. I looked so far into her that, for a while, looked so far into her that, for a while  looked so far into her that, for a while looked so far into her that, for a while the visual 
held no memory. Little by little, I returned to myself, waking to nurse the visual held no  memory. Little by little, I returned to myself, waking to nurse
'''
s_corpus = corpus.lower()

stop_words = set(stopwords.words('english'))

tokens = nltk.word_tokenize(s_corpus)
tokens = [word for word in tokens if word not in stop_words]

c_tokens = [''.join(e for e in string if e.isalnum()) for string in tokens]
c_tokens = [x for x in c_tokens if x]

bgs_2 = nltk.bigrams(c_tokens)
bgs_3 = nltk.trigrams(c_tokens)

fdist = nltk.FreqDist(bgs_3)

tmp = list()
for k,v in fdist.items():
    tmp.append((v,k))
tmp = sorted (tmp, reverse=True)

for kk,vv in tmp[:]:
    print (vv,kk)

Current output:

('looked', 'far', 'looked') 3
('far', 'looked', 'far') 3
('visual', 'held', 'memory') 2
('returned', 'waking', 'nurse') 2

Expected output:

looked far looked, 3
far looked far, 3
visual held memory, 2
returned waking nurse, 2

Thanks in advance for your help.

Answer 1

So, to "fix" your output, use this to print your data:

# tmp holds (count, ngram) tuples, so kk is the count and vv is the n-gram tuple
for kk, vv in tmp:
    print("%s, %d" % (" ".join(vv), kk))

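But if you are going to write this out as csv, you should collect your output in a different format.

Currently you are creating a list of tuples, each containing a number and a tuple. Try collecting your data as a list of lists, with one value per element. That way you can write it directly to a csv file.

See here: Create a .csv file with values from a Python list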
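
As a minimal sketch of that suggestion, the snippet below flattens entries shaped like the question's tmp list into plain rows and hands them to the standard csv module. The sample data, the header names and the in-memory buffer are assumptions made for illustration; with a real file you would pass something like open('trigrams.csv', 'w', newline='') to csv.writer instead.

```python
import csv
import io

# Hypothetical sample data in the same (count, ngram) shape as `tmp` in the question.
tmp = [
    (3, ('looked', 'far', 'looked')),
    (3, ('far', 'looked', 'far')),
    (2, ('visual', 'held', 'memory')),
]

# Flatten each entry into a plain list: [n-gram joined into one string, count].
rows = [[" ".join(ngram), count] for count, ngram in tmp]

# An in-memory buffer keeps the example self-contained; swap in a real file object as needed.
buffer = io.StringIO()
writer = csv.writer(buffer)
writer.writerow(["ngram", "count"])  # header row (the column names are an assumption)
writer.writerows(rows)

print(buffer.getvalue())
```

Because each row is already a flat list of values, csv.writer takes care of the separators and any quoting, so no parentheses or quote characters from the tuples end up in the file.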

Answer 2

A better question is: what are those ("()", "'", ",") in the ngrams output?

>>> from nltk import ngrams
>>> from nltk import word_tokenize

# Split a sentence into a list of "words"
>>> word_tokenize("This is a foo bar sentence")
['This', 'is', 'a', 'foo', 'bar', 'sentence']
>>> type(word_tokenize("This is a foo bar sentence"))
<class 'list'>

# Extract bigrams.
>>> list(ngrams(word_tokenize("This is a foo bar sentence"), 2))
[('This', 'is'), ('is', 'a'), ('a', 'foo'), ('foo', 'bar'), ('bar', 'sentence')]

# Okay, so the output is a list, no surprise.
>>> type(list(ngrams(word_tokenize("This is a foo bar sentence"), 2)))
<class 'list'>

But what is the type of ('This', 'is')?

>>> list(ngrams(word_tokenize("This is a foo bar sentence"), 2))[0]
('This', 'is')
>>> first_thing_in_output = list(ngrams(word_tokenize("This is a foo bar sentence"), 2))[0]
>>> type(first_thing_in_output)
<class 'tuple'>

Ah, it's a tuple, see https://realpython.com/python-lists-tuples/

What happens when you print a tuple?

>>> print(first_thing_in_output)
('This', 'is')

What if you convert it with str()?

>>> print(str(first_thing_in_output))
('This', 'is')

But I want the output This is rather than ('This', 'is'), so I'll use the str.join() function, see https://www.geeksforgeeks.org/join-function-python/:

>>> print(' '.join((first_thing_in_output)))
This is
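
Applied to the format asked for in the question, the same str.join() idea could look roughly like the sketch below. The helper name format_ngram is made up for this example and is not part of NLTK.

```python
def format_ngram(ngram, count):
    """Join the tokens of an n-gram tuple with spaces and append the count."""
    return "{}, {}".format(" ".join(ngram), count)


print(format_ngram(('This', 'is'), 1))                # This is, 1
print(format_ngram(('looked', 'far', 'looked'), 3))   # looked far looked, 3
```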

Now is a good time to really go through a tutorial on basic Python types to understand what is happening. Additionally, it'll be good to understand how "container" types work too, e.g. https://github.com/usaarhat/pywarmups/blob/master/session2.md

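Reading through the original post, there are quite a few issues with the code.

I guess the goal of the code is to:

Tokenize the text and remove stopwords
Extract ngrams (without stopwords)
Print out their string forms and counts

The tricky part is that stopwords.words('english') does not contain punctuation, so you end up with strange ngrams that contain punctuation: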

from nltk import word_tokenize
from nltk.util import ngrams
from nltk.corpus import stopwords

text = '''The pure amnesia of her face,
newborn. I looked so far into her that, for a while, looked so far into her that, for a while  looked so far into her that, for a while looked so far into her that, for a while the visual 
held no memory. Little by little, I returned to myself, waking to nurse the visual held no  memory. Little by little, I returned to myself, waking to nurse
'''

stoplist = set(stopwords.words('english'))

tokens = [token for token in word_tokenize(text) if token not in stoplist]

list(ngrams(tokens, 2))

[out]:

[('The', 'pure'),
 ('pure', 'amnesia'),
 ('amnesia', 'face'),
 ('face', ','),
 (',', 'newborn'),
 ('newborn', '.'),
 ('.', 'I'),
 ('I', 'looked'),
 ('looked', 'far'),
 ('far', ','),
 (',', ','), ...]

Perhaps you'd like to extend the stoplist with punctuation, e.g.

from string import punctuation
from nltk import word_tokenize
from nltk.util import ngrams
from nltk.corpus import stopwords

text = '''The pure amnesia of her face,
newborn. I looked so far into her that, for a while, looked so far into her that, for a while  looked so far into her that, for a while looked so far into her that, for a while the visual 
held no memory. Little by little, I returned to myself, waking to nurse the visual held no  memory. Little by little, I returned to myself, waking to nurse
'''

stoplist = set(stopwords.words('english') + list(punctuation))

tokens = [token for token in word_tokenize(text) if token not in stoplist]

list(ngrams(tokens, 2))

[out]:

[('The', 'pure'),
 ('pure', 'amnesia'),
 ('amnesia', 'face'),
 ('face', 'newborn'),
 ('newborn', 'I'),
 ('I', 'looked'),
 ('looked', 'far'),
 ('far', 'looked'),
 ('looked', 'far'), ...]

Then you realize that a token like I should be a stopword, yet it still appears in your list of ngrams. That's because the list from stopwords.words('english') is lowercased, e.g.

>>> stopwords.words('english')

[out]:

['i',
 'me',
 'my',
 'myself',
 'we',
 'our',
 'ours',
 'ourselves',
 'you',
 "you're", ...]

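So when you check whether a token is in the stoplist, you should also lowercase the token. (Avoid lowercasing the sentence before word_tokenize, because word_tokenize may take cues from capitalization.) Thus: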

from string import punctuation
from nltk import word_tokenize
from nltk.util import ngrams
from nltk.corpus import stopwords

text = '''The pure amnesia of her face,
newborn. I looked so far into her that, for a while, looked so far into her that, for a while  looked so far into her that, for a while looked so far into her that, for a while the visual 
held no memory. Little by little, I returned to myself, waking to nurse the visual held no  memory. Little by little, I returned to myself, waking to nurse
'''

stoplist = set(stopwords.words('english') + list(punctuation))

tokens = [token for token in word_tokenize(text) if token.lower() not in stoplist]

list(ngrams(tokens, 2))

[out]:

[('pure', 'amnesia'),
 ('amnesia', 'face'),
 ('face', 'newborn'),
 ('newborn', 'looked'),
 ('looked', 'far'),
 ('far', 'looked'),
 ('looked', 'far'),
 ('far', 'looked'),
 ('looked', 'far'),
 ('far', 'looked'), ...]
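
To isolate the effect of the .lower() call without needing the NLTK data, here is a small sketch with a hand-written stoplist and a pre-tokenized sentence. Both are stand-ins chosen for this example and are not NLTK's actual stopword list or tokenizer output.

```python
from string import punctuation

# A tiny hand-written stoplist standing in for stopwords.words('english').
stoplist = set(['i', 'so', 'into', 'her', 'that'] + list(punctuation))

# Pretend these tokens came out of a tokenizer.
tokens = ['I', 'looked', 'so', 'far', 'into', 'her', 'that', ',']

# Case-sensitive check: 'I' is not equal to 'i', so it slips through.
case_sensitive = [t for t in tokens if t not in stoplist]

# Lowercasing only for the membership test keeps the original casing of kept tokens.
case_insensitive = [t for t in tokens if t.lower() not in stoplist]

print(case_sensitive)    # ['I', 'looked', 'far']
print(case_insensitive)  # ['looked', 'far']
```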

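Now the ngrams look like they are achieving the objectives:

Tokenize the text and remove stopwords
Extract ngrams (without stopwords)

Then for the last part, where you want to print the ngrams to a file in sorted order, you can actually use FreqDist.most_common(), which lists them in descending order of count, e.g.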

from string import punctuation
from nltk import word_tokenize
from nltk.util import ngrams
from nltk.corpus import stopwords
from nltk import FreqDist

text = '''The pure amnesia of her face,
newborn. I looked so far into her that, for a while, looked so far into her that, for a while  looked so far into her that, for a while looked so far into her that, for a while the visual 
held no memory. Little by little, I returned to myself, waking to nurse the visual held no  memory. Little by little, I returned to myself, waking to nurse
'''

stoplist = set(stopwords.words('english') + list(punctuation))

tokens = [token for token in word_tokenize(text) if token.lower() not in stoplist]

FreqDist(ngrams(tokens, 2)).most_common()

[out]:

[(('looked', 'far'), 4),
 (('far', 'looked'), 3),
 (('visual', 'held'), 2),
 (('held', 'memory'), 2),
 (('memory', 'Little'), 2),
 (('Little', 'little'), 2),
 (('little', 'returned'), 2),
 (('returned', 'waking'), 2),
 (('waking', 'nurse'), 2),
 (('pure', 'amnesia'), 1),
 (('amnesia', 'face'), 1),
 (('face', 'newborn'), 1),
 (('newborn', 'looked'), 1),
 (('far', 'visual'), 1),
 (('nurse', 'visual'), 1)]
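
As far as I know, FreqDist in NLTK 3.x is a subclass of collections.Counter, which is where most_common() comes from. The sketch below shows the same counting idea with only the standard library; the short token list is made up for the example, and zip(tokens, tokens[1:]) is used here in place of nltk.util.ngrams(tokens, 2).

```python
from collections import Counter

# Made-up token list, already tokenized and filtered.
tokens = ['looked', 'far', 'looked', 'far', 'visual', 'held', 'memory']

# Pair every token with its successor to get the bigrams, then count them.
bigram_counts = Counter(zip(tokens, tokens[1:]))

# most_common(n) returns (item, count) pairs, highest count first.
for bigram, count in bigram_counts.most_common(1):
    print(" ".join(bigram), count)  # looked far 2
```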

Finally, to print it out to a file, you should really use a context manager, see http://eigenhombre.com/introduction-to-context-managers-in-python.html

with open('bigrams-list.tsv', 'w') as fout:
    for bg, count in FreqDist(ngrams(tokens, 2)).most_common():
        print('\t'.join([' '.join(bg), str(count)]), end='\n', file=fout)

[bigrams-list.tsv]:

looked far  4
far looked  3
visual held 2
held memory 2
memory Little   2
Little little   2
little returned 2
returned waking 2
waking nurse    2
pure amnesia    1
amnesia face    1
face newborn    1
newborn looked  1
far visual  1
nurse visual    1

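Food for thought

Now that you see this strange bigram Little little, does it make sense?

It's a by-product of removing by from

Little by little

So now, depending on what the ultimate task is for the ngrams you've extracted, you might not really want to remove stopwords from the list.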
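
One possible alternative, which is my own illustration rather than something from the original post, is to build the ngrams from the unfiltered tokens first and only then drop any ngram that contains a stopword. The sketch below compares the two orderings on a made-up token list with a tiny hand-picked stoplist (not NLTK's).

```python
# Small hand-picked stoplist for illustration only.
stoplist = {'by', 'the', 'no'}

tokens = ['Little', 'by', 'little', 'the', 'visual', 'held', 'no', 'memory']

# Option A (as in the walkthrough): drop stopwords first, then build bigrams.
kept = [t for t in tokens if t.lower() not in stoplist]
bigrams_a = list(zip(kept, kept[1:]))

# Option B: build bigrams from the original tokens, then drop any bigram
# that contains a stopword, so only truly adjacent word pairs survive.
bigrams_b = [
    bg for bg in zip(tokens, tokens[1:])
    if not any(t.lower() in stoplist for t in bg)
]

print(bigrams_a)
# [('Little', 'little'), ('little', 'visual'), ('visual', 'held'), ('held', 'memory')]
print(bigrams_b)
# [('visual', 'held')]
```

Option B yields far fewer ngrams, but none of them glue together words that were not neighbours in the text. Which behaviour is preferable depends on the downstream task.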
