Get elements between two or more indexes dynamically in Python without hardcoding the number of index variables

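I am trying to POS-tag an input text and extract all the words between two or more 'IN' tags. The idea is: if there is one 'IN' tag, the extraction runs from that tag's index to the end of the sentence. If there are two or more 'IN' tags, the extraction should run from one tag's index to the next 'IN' tag, splitting the sentence into groups of phrases. I have written code that does this. The code is: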

import nltk

def extractor(text):
    text = nltk.word_tokenize(text)
    pos_tagged = nltk.pos_tag(text)
    # Get the index of every (word, tag) tuple tagged as a preposition ('IN')
    indices = [i for i, tupl in enumerate(pos_tagged) if tupl[1] == 'IN']
    if len(indices) == 1:
        idx = indices[0]
        phrase = pos_tagged[idx:]
        words = [i[0] for i in phrase]
        comb_words = ' '.join(i for i in words)
        return comb_words 
        
    else:
        idx1 = indices[0]
        idx2 = indices[1]
        phrase1 = pos_tagged[idx1:idx2]
        words1 = [i[0] for i in phrase1]
        comb_words1 = ' '.join(i for i in words1)

        phrase2 = pos_tagged[idx2:]
        words2 = [i[0] for i in phrase2]
        comb_words2 = ' '.join(i for i in words2)
                        
        return comb_words1, comb_words2
        

extractor("hunger increases in the morning during workout")

The output is as expected. My only concern is that when the text contains 2 'IN' tags, I have to hardcode that scenario specifically: idx1 = indices[0] and idx2 = indices[1].

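So if there were 10 'IN' tags, I would need to create 10 index variables this way. Is there a better way to solve this, so that the indexes are handled dynamically based on the number of tags present in the input?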

I would use a generator.

import nltk

def extractor(text, tag='IN', max_level=None):
    tokens = nltk.word_tokenize(text)
    pos_tagged = nltk.pos_tag(tokens)

    # positions of every token carrying the requested tag
    indices = [i for i, (_, t) in enumerate(pos_tagged) if t == tag]

    # no matching tag: nothing to yield (and indices[0] below would fail)
    if not indices:
        return

    # maybe we don't care about tags past 2nd, or 5th, or 10th;
    # slicing with None just keeps the whole list.
    # The sentence length is appended so the last phrase runs to the end.
    indices = indices[:max_level] + [len(pos_tagged)]

    # start of the phrase currently being built
    prev_index = indices[0]

    for index in indices[1:]:
        words = pos_tagged[prev_index:index]
        prev_index = index

        yield ' '.join(word for (word, _) in words)

list(extractor("hunger increases in the morning during workout"))
# ['in the morning', 'during workout']

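max_level limits the number of tags you care about. For example, if you want everything after the 5th tag kept in one phrase regardless of any further tags, call extractor(text, max_level=5).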
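To see the effect of max_level without installing NLTK, here is a minimal sketch of the same slicing idea. The helper name split_on_tag and the hand-written tag list are assumptions for illustration only; they stand in for the output of nltk.pos_tag and are not part of the answer's code.

```python
def split_on_tag(pos_tagged, tag='IN', max_level=None):
    """Group words into phrases that each start at a token tagged `tag`."""
    # keep at most max_level phrase starts (None keeps them all)
    starts = [i for i, (_, t) in enumerate(pos_tagged) if t == tag][:max_level]
    if not starts:
        return []
    # each phrase ends where the next one starts; the last ends with the sentence
    ends = starts[1:] + [len(pos_tagged)]
    return [' '.join(w for w, _ in pos_tagged[s:e]) for s, e in zip(starts, ends)]


# hand-written tags (an assumption, standing in for nltk.pos_tag output)
tagged = [('hunger', 'NN'), ('increases', 'VBZ'), ('in', 'IN'), ('the', 'DT'),
          ('morning', 'NN'), ('during', 'IN'), ('workout', 'NN')]

print(split_on_tag(tagged))               # ['in the morning', 'during workout']
print(split_on_tag(tagged, max_level=1))  # ['in the morning during workout']
```

With max_level=1 only the first 'IN' is treated as a boundary, so the second preposition stays inside the single phrase.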

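Edit: if you end up needing the part before the first tag occurrence as well, initialize prev_index to 0 instead of indices[0]. In that case a tag at index 0 has to be dropped from indices first, otherwise the first phrase would be empty.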
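The variant that also keeps the words before the first tag can be sketched as follows. This is only an illustration under stated assumptions: split_with_head is a hypothetical helper, and the tagged lists are written by hand in place of real nltk.pos_tag output.

```python
def split_with_head(pos_tagged, tag='IN'):
    """Split at every `tag` token and keep the leading words as the first phrase."""
    if not pos_tagged:
        return []
    # a tag at position 0 is not a cut point: cutting there would give an empty phrase
    cuts = [i for i, (_, t) in enumerate(pos_tagged) if t == tag and i > 0]
    bounds = [0] + cuts + [len(pos_tagged)]
    # pair each boundary with the next one to get (start, end) slices
    return [' '.join(w for w, _ in pos_tagged[s:e])
            for s, e in zip(bounds, bounds[1:])]


# hand-written tags (an assumption, standing in for nltk.pos_tag output)
tagged = [('hunger', 'NN'), ('increases', 'VBZ'), ('in', 'IN'), ('the', 'DT'),
          ('morning', 'NN'), ('during', 'IN'), ('workout', 'NN')]
leading_in = tagged[2:]  # same sentence, but starting at 'in'

print(split_with_head(tagged))
# ['hunger increases', 'in the morning', 'during workout']
print(split_with_head(leading_in))
# ['in the morning', 'during workout']
```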