Format an entire text with pattern.en?
For machine-learning purposes, I need to analyze some texts. A data scientist I know suggested I use pattern.en for my project.
I will give my program a keyword (example: pizza), and it has to extract some "trends" from several texts I feed it. (Example: I give it some texts about peanut butter on pizza, and the program recognizes that peanut butter is a growing trend.)
So first, I have to "clean" the text. I know pattern.en can identify words as nouns, verbs, adverbs, etc., and I want to remove all determiners, articles and other "meaningless" words before the analysis, but I don't know how to do it. I tried parse():
So I can get:
s = "Hello, how is it going ? I am tired actually, did not sleep enough... That is bad for work, definitely"
parsedS = parse(s)
print(parsedS)
Output:
Hello/UH/hello ,/,/, how/WRB/how is/VBZ/be it/PRP/it going/VBG/go ?/./?
I/PRP/i am/VBP/be tired/VBN/tire actually/RB/actually ,/,/, did/VBD/do not/RB/not sleep/VB/sleep enough/RB/enough .../:/...
That/DT/that is/VBZ/be bad/JJ/bad for/IN/for work/NN/work ,/,/, definitely/RB/definitely
So I want to remove the words tagged "UH", ",", "PRP", etc., but I don't know how to do that without messing up the sentences. (For analysis purposes, I will skip sentences that do not contain the word "pizza", as in the example.)
I'm not sure I explained this clearly; feel free to ask if anything is unclear.
EDIT - UPDATE: Following canyon289's answer, I want to go sentence by sentence instead of over the whole text. I tried:
for sentence in Text(s):
    sentence = sentence.split(" ")
    print("SENTENCE :")
    for word in sentence:
        if not any(tag in word for tag in dont_want):
            print(word)
But I get the following error:
AttributeError: 'Sentence' object has no attribute 'split'
How can I fix this?
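For what it's worth, a minimal workaround sketch that does not touch pattern's object API at all: a `Sentence` object is not a string, so it has no `split()` method, but the tagged string returned by `parse()` separates sentences with newlines and tokens with spaces, so you can iterate sentence by sentence over the string itself. The helper name and the sample string below are illustrative, not part of pattern.en:

```python
def filter_sentences(tagged, dont_want):
    """For each sentence in a parse()-style tagged string (sentences
    separated by newlines, tokens by spaces), keep only the tokens
    that contain none of the unwanted tags. This uses the same
    substring test as the original loop."""
    kept = []
    for sentence in tagged.split("\n"):
        kept.append([word for word in sentence.split(" ")
                     if not any(tag in word for tag in dont_want)])
    return kept

# Illustrative input, in the same word/POS/lemma format as the parse() output above
tagged = ("Hello/UH/hello ,/,/, how/WRB/how is/VBZ/be it/PRP/it going/VBG/go ?/./?\n"
          "I/PRP/i am/VBP/be tired/VBN/tire actually/RB/actually")

for sentence in filter_sentences(tagged, ["UH", "PRP"]):
    print("SENTENCE :")
    for word in sentence:
        print(word)
```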
This should work for you:
s = "Hello, how is it going ? I am tired actually, did not sleep enough... That is bad for work, definitely"
s = parse(s)
#Create a list of all the tags you don't want
dont_want = ["UH", "PRP"]
#s is already parsed, so just split it into word tokens
sentence = s.split(" ")
#Go through all the words and look for any occurrence of the tags you
#don't want, using a list comprehension with a nested any()
[word for word in sentence if not any(tag in word for tag in dont_want)]
[u',/,/O/O', u'how/WRB/O/O', u'is/VBZ/B-VP/O', u'going/VBG/B-VP/O',
u'am/VBP/B-VP/O', u'tired/VBN/I-VP/O', u'actually/RB/B-ADVP/O',
u',/,/O/O', u'did/VBD/B-VP/O', u'not/RB/I-VP/O', u'sleep/VB/I-VP/O',
u'enough/RB/B-ADVP/O', u'.../:/O/O\nThat/DT/O/O', u'is/VBZ/B-VP/O',
u'bad/JJ/B-ADJP/O', u'for/IN/B-PP/B-PNP', u'work/NN/B-NP/I-PNP',
u',/,/O/O', u'definitely/RB/B-ADVP/O']
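One caveat about the substring test above: `tag in word` matches anywhere in the token, so an unwanted tag such as `","` would also drop a number token like `1,000/CD/...`, because the comma appears in the word field. A stricter variant (a sketch; the helper name is made up) compares the POS field of each token exactly:

```python
def filter_by_pos(tagged, dont_want):
    """Keep only the tokens whose POS field (the second slash-separated
    part of a parse()-style token such as 'work/NN/work') is not in
    dont_want. Exact comparison, unlike the substring test
    `tag in word`, which can also match inside the word or lemma fields."""
    kept = []
    for token in tagged.split():  # split() with no argument also handles sentence breaks ("\n")
        parts = token.split("/")
        if len(parts) > 1 and parts[1] not in dont_want:
            kept.append(token)
    return kept

print(filter_by_pos("Hello/UH/hello ,/,/, how/WRB/how it/PRP/it", ["UH", "PRP", ","]))
# → ['how/WRB/how']
```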