Python, NLP: How to find all trigrams in text files with an adjective as the middle term
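I think the question is self-explanatory, but here is what I mean in detail.
I want to use the nltk library to extract all trigrams from a text file in which an adjective is the middle word.
Sample text - A red ball is with the good boy.
Sample output -
('A','red','ball'), ('the','good','boy')
and so on.
This code should do it: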
import nltk
from nltk.tokenize import word_tokenize

# One-time downloads: tokenizer model and POS tagger model
nltk.download('punkt')
nltk.download('averaged_perceptron_tagger')

text = word_tokenize("He is a very handsome man. Her children are funny. She has a lovely voice")
text_tags = nltk.pos_tag(text)  # list of (word, tag) pairs

results = list()
for i, (txt, tag) in enumerate(text_tags):
    # JJ, JJR, JJS are the Penn Treebank tags for adjectives
    if tag in ["JJ", "JJR", "JJS"]:
        # Skip adjectives at either end, which have no neighbour on one side
        if (i > 0) and (i < len(text_tags)-1):
            results.append((text_tags[i-1][0], txt, text_tags[i+1][0]))

# output: [('very', 'handsome', 'man'), ('are', 'funny', '.'), ('a', 'lovely', 'voice')]
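Note that the result includes ('are', 'funny', '.'), because punctuation tokens count as neighbours. If that is not wanted, one possible variation is sketched below. It is only a sketch under some assumptions: the function name adjective_trigrams and the skip_punctuation flag are made up for illustration and are not part of nltk, and the sample list is hand-tagged to stand in for what nltk.pos_tag returns, so the snippet runs without any downloads. For a real text file, the tagged list would come from reading the file, then word_tokenize and nltk.pos_tag as in the code above.

```python
ADJECTIVE_TAGS = {"JJ", "JJR", "JJS"}  # Penn Treebank adjective tags


def adjective_trigrams(tagged, skip_punctuation=False):
    """Return (previous, adjective, next) word triples from (word, tag) pairs.

    Hypothetical helper, not an nltk function.
    """
    results = []
    # Zipping three offset views yields every window of three consecutive tokens
    for (prev_word, _), (word, tag), (next_word, _) in zip(tagged, tagged[1:], tagged[2:]):
        if tag not in ADJECTIVE_TAGS:
            continue
        # Optionally drop windows whose neighbours are not purely alphanumeric
        if skip_punctuation and not (prev_word.isalnum() and next_word.isalnum()):
            continue
        results.append((prev_word, word, next_word))
    return results


# Hand-tagged sample standing in for nltk.pos_tag output (tags are illustrative)
sample = [
    ("Her", "PRP$"), ("children", "NNS"), ("are", "VBP"), ("funny", "JJ"), (".", "."),
    ("She", "PRP"), ("has", "VBZ"), ("a", "DT"), ("lovely", "JJ"), ("voice", "NN"),
]

print(adjective_trigrams(sample))
# [('are', 'funny', '.'), ('a', 'lovely', 'voice')]
print(adjective_trigrams(sample, skip_punctuation=True))
# [('a', 'lovely', 'voice')]
```

The isalnum check is a deliberately crude filter: it also drops neighbours such as hyphenated words or contractions like "n't", so it may need adjusting depending on the text.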