Get elements between two or more indexes dynamically in Python without hardcoding number of index variables
I am trying to POS-tag input text and extract all the words between two or more 'IN' tags. The idea is: if there is one 'IN' tag, extraction runs from that tag's index to the end of the sentence. If there are two or more 'IN' tags, extraction should run from one tag's index to the next 'IN' tag, splitting the phrase into groups. I have written code that does this.
The code is:
import nltk

def extractor(text):
    text = nltk.word_tokenize(text)
    pos_tagged = nltk.pos_tag(text)
    # print(pos_tagged)
    # Get the tuple index of each preposition
    indices = [i for i, tupl in enumerate(pos_tagged) if tupl[1] == 'IN']
    # print(indices)
    if len(indices) == 1:
        idx = indices[0]
        phrase = pos_tagged[idx:]
        words = [i[0] for i in phrase]
        comb_words = ' '.join(words)
        return comb_words
    else:
        idx1 = indices[0]
        idx2 = indices[1]
        phrase1 = pos_tagged[idx1:idx2]
        words1 = [i[0] for i in phrase1]
        comb_words1 = ' '.join(words1)
        phrase2 = pos_tagged[idx2:]
        words2 = [i[0] for i in phrase2]
        comb_words2 = ' '.join(words2)
        return comb_words1, comb_words2

extractor("hunger increases in the morning during workout")
And the output is as expected.
My only concern is that when there are 2 'IN' tags in my text, I have to hardcode that scenario specifically:
idx1 = indices[0]
idx2 = indices[1]
So, by this logic, if there were 10 'IN' tags I would need to create 10 index variables. Is there a better way to solve this, so that the indexing adapts dynamically to however many tags are present in the input?
I would use a generator.
import nltk

def extractor(text, tag='IN', max_level=None):
    text = nltk.word_tokenize(text)
    pos_tagged = nltk.pos_tag(text)
    indices = [i for i, tupl in enumerate(pos_tagged) if tupl[1] == tag]
    # Remove the first index if it is 0 -- we don't want an empty phrase
    if not indices[0]:
        indices.pop(0)
    # Maybe we don't care about tags past the 2nd, or 5th, or 10th;
    # slicing with None just yields the whole list
    indices = indices[:max_level] + [len(pos_tagged)]
    # The end of the previous phrase
    prev_index = indices[0]
    for index in indices[1:]:
        words = pos_tagged[prev_index:index]
        prev_index = index
        yield ' '.join(word for (word, tag) in words)

list(extractor("hunger increases in the morning during workout"))
# ['in the morning', 'during workout']
max_level limits the maximum number of tags you care about -- for example, if you want everything after the 5th tag treated as one tag-independent chunk, you call extractor(text, max_level=5).
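To see the slicing logic in isolation without the NLTK tokenizer and tagger models, here is a minimal sketch operating on a pre-tagged list (the name `phrases` and the hand-written tag list are assumptions standing in for `nltk.pos_tag` output):

```python
# Same generator idea as above, but taking (word, tag) tuples directly.
def phrases(pos_tagged, tag='IN', max_level=None):
    indices = [i for i, (_, t) in enumerate(pos_tagged) if t == tag]
    # Drop a leading 0 index -- it would produce an empty phrase
    if indices and indices[0] == 0:
        indices.pop(0)
    # Cap at max_level tags, then add the sentence end as the final boundary
    indices = indices[:max_level] + [len(pos_tagged)]
    prev_index = indices[0]
    for index in indices[1:]:
        yield ' '.join(w for w, _ in pos_tagged[prev_index:index])
        prev_index = index

tagged = [('hunger', 'NN'), ('increases', 'VBZ'), ('in', 'IN'),
          ('the', 'DT'), ('morning', 'NN'), ('during', 'IN'),
          ('workout', 'NN')]
print(list(phrases(tagged)))               # ['in the morning', 'during workout']
print(list(phrases(tagged, max_level=1)))  # ['in the morning during workout']
```

With max_level=1 only the first 'IN' index is kept as a boundary, so everything from it to the end of the sentence comes back as a single phrase.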
Edit: if you also need the part before the first tag occurrence, initialize prev_index to 0 instead of indices[0].
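A sketch of that edit, again on pre-tagged input (the helper name `phrases_with_prefix` is hypothetical); starting prev_index at 0 also yields the text before the first 'IN' tag:

```python
def phrases_with_prefix(pos_tagged, tag='IN'):
    indices = [i for i, (_, t) in enumerate(pos_tagged) if t == tag]
    indices.append(len(pos_tagged))  # sentence end is the last boundary
    prev_index = 0  # start from the beginning of the sentence
    for index in indices:
        if index > prev_index:  # skip empty slices (e.g. a tag at index 0)
            yield ' '.join(w for w, _ in pos_tagged[prev_index:index])
        prev_index = index

tagged = [('hunger', 'NN'), ('increases', 'VBZ'), ('in', 'IN'),
          ('the', 'DT'), ('morning', 'NN'), ('during', 'IN'),
          ('workout', 'NN')]
print(list(phrases_with_prefix(tagged)))
# ['hunger increases', 'in the morning', 'during workout']
```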