Python: connect composed keywords in texts

So, I have a list of lowercase keywords. Let's say:

keywords = ['machine learning', 'data science', 'artificial intelligence']

and a list of lowercase texts. Let's say:

texts = [
  'the new machine learning model built by google is revolutionary for the current state of artificial intelligence. it may change the way we are thinking', 
  'data science and artificial intelligence are two different fields, although they are interconnected. scientists from harvard are explaining it in a detailed presentation that could be found on our page.'
]

I need to transform the texts into:

[[['the', 'new',
   'machine_learning',
   'model',
   'built',
   'by',
   'google',
   'is',
   'revolutionary',
   'for',
   'the',
   'current',
   'state',
   'of',
   'artificial_intelligence'],
  ['it', 'may', 'change', 'the', 'way', 'we', 'are', 'thinking']],
 [['data_science',
   'and',
   'artificial_intelligence',
   'are',
   'two',
   'different',
   'fields',
   'although',
   'they',
   'are',
   'interconnected'],
  ['scientists',
   'from',
   'harvard',
   'are',
   'explaining',
   'it',
   'in',
   'a',
   'detailed',
   'presentation',
   'that',
   'could',
   'be',
   'found',
   'on',
   'our',
   'page']]]

What I'm doing now is checking whether each keyword appears in a text and replacing it with the underscored version. But that is O(m*n), and with 700 long texts and 2M keywords it's really slow.

I'm trying to use Phraser, but I can't build one using only my keywords.

Can anyone suggest a more optimized approach?

This may not be the most pythonic way, but it gets it done in 3 steps:

keywords = ['machine learning', 'data science', 'artificial intelligence']

texts = [
    'the new machine learning model built by google is revolutionary for the current state of artificial intelligence. it may change the way we are thinking',
    'data science and artificial intelligence are two different fields, although they are interconnected. scientists from harvard are explaining it in a detailed presentation that could be found on our page.'
]

import re  # needed for the punctuation cleanup in step 3

#Add underscore
for idx in range(len(texts)):
  for keyword in keywords:
    if keyword in texts[idx]:
      texts[idx] = texts[idx].replace(keyword, keyword.replace(" ", "_"))

#Split text at each "." encountered
for idx, text in enumerate(texts):
  texts[idx] = list(filter(None, text.split(".")))

#Split each sentence into words, stripping leftover punctuation
for idx, text in enumerate(texts):
  for idx_s, sentence in enumerate(text):
    texts[idx][idx_s] = [re.sub(r"[,.!?]", "", word) for word in sentence.split()]

print(texts)

Output:

[
    [
        ['the', 'new', 'machine_learning', 'model', 'built', 'by', 'google', 'is', 'revolutionary', 'for', 'the', 'current', 'state', 'of', 'artificial_intelligence'], 
        ['it', 'may', 'change', 'the', 'way', 'we', 'are', 'thinking']
    ], 
    [
        ['data_science', 'and', 'artificial_intelligence', 'are', 'two', 'different', 'fields', 'although', 'they', 'are', 'interconnected'], 
        ['scientists', 'from', 'harvard', 'are', 'explaining', 'it', 'in', 'a', 'detailed', 'presentation', 'that', 'could', 'be', 'found', 'on', 'our', 'page']
    ]
]
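As a side note on this answer: the underscore step above still rescans each text once per keyword, which is exactly the m*n cost the question complains about. One possible mitigation, sketched here under the assumption that the keyword list is small enough to compile into a single pattern (with 2M keywords the alternation itself becomes unwieldy), is to build one regex alternation so each text is scanned only once:

import re

# one alternation over all keywords, longest first so longer phrases
# win when keywords overlap
pattern = re.compile('|'.join(
    re.escape(kw) for kw in sorted(keywords, key=len, reverse=True)))

texts = [pattern.sub(lambda m: m.group(0).replace(' ', '_'), text)
         for text in texts]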

gensim's Phrases/Phraser classes are designed to use their own internal, statistically derived records of which token pairs should be promoted to phrases, rather than user-supplied pairings. (You could perhaps coax a Phraser into doing what you want by synthesizing scores and thresholds, but that would be somewhat awkward and kludgy.)
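For contrast, this is a minimal sketch of the intended, statistics-driven usage; the parameter values are illustrative assumptions, and with a corpus this tiny the model may not promote any pairs at all:

from gensim.models.phrases import Phrases

# Phrases learns which pairs to join from co-occurrence statistics in
# the training corpus itself, not from a user-supplied list
tokenized_corpus = [text.split() for text in texts]
phrases_model = Phrases(tokenized_corpus, min_count=1, threshold=0.1)
print(phrases_model[tokenized_corpus[0]])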

You could, however, mimic their general approach: (1) operate on lists of tokens rather than raw strings; (2) learn and remember the token pairs that should be combined; and (3) perform the combination in a single pass. That should be far more efficient than anything based on repeated search-and-replace over strings, which it sounds like you've already tried and found wanting.

For example, let's first create a dict whose keys are tuples of the word pairs that should be combined, and whose values are 2-tuples holding the designated combined token and an empty tuple as the second item. (The reason for the empty tuple will become clear later.)

keywords = ['machine learning', 'data science', 'artificial intelligence']
texts = [
    'the new machine learning model built by google is revolutionary for the current state of artificial intelligence. it may change the way we are thinking', 
    'data science and artificial intelligence are two different fields, although they are interconnected. scientists from harvard are explaining it in a detailed presentation that could be found on our page.'
]

combinations_dict = {tuple(kwsplit):('_'.join(kwsplit), ()) 
                     for kwsplit in [kwstr.split() for kwstr in keywords]}
combinations_dict

After this step, combinations_dict is:

{('machine', 'learning'): ('machine_learning', ()),
 ('data', 'science'): ('data_science', ()),
 ('artificial', 'intelligence'): ('artificial_intelligence', ())}

Now we can use a Python generator function to create an iterable transformation of any token sequence. It consumes the original tokens one at a time, but before emitting anything it adds the incoming token to a buffered candidate pair. If that pair is one that should be combined, a single combined token is yielded; if not, only the first token is emitted, leaving the second to form a new candidate pair with the next incoming token.

For example:

def combining_generator(tokens, comb_dict):
    buff = ()  # start with empty buffer
    for in_tok in tokens:
        buff += (in_tok,)  # add latest to buffer
        if len(buff) < 2:  # grow buffer to 2 tokens if possible
            continue
        # lookup what to do for current pair... 
        # ...defaulting to emit-[0]-item, keep-[1]-item in new buff
        out_tok, buff = comb_dict.get(buff, (buff[0], (buff[1],)))
        yield out_tok 
    if buff:
        yield buff[0]  # last solo token if any

Here we see the reason for the earlier empty tuples (): an empty buffer is the desired state of buff after a successful combination. Encoding both the emitted token and the next buffer state in one value lets us use the dict.get(key, default) form, which supplies a specific fallback value when the key isn't found.
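To make that pattern concrete, here is what one step of the dict.get() call produces for a matching and a non-matching pair, using the combinations_dict built above:

# matching pair: emit the combined token, reset the buffer
out_tok, buff = combinations_dict.get(('machine', 'learning'), ('machine', ('learning',)))
# out_tok == 'machine_learning', buff == ()

# non-matching pair: emit the first token, carry the second forward
out_tok, buff = combinations_dict.get(('the', 'new'), ('the', ('new',)))
# out_tok == 'the', buff == ('new',)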

The designated combinations can now be applied with:

tokenized_texts = [text.split() for text in texts]
retokenized_texts = [list(combining_generator(tokens, combinations_dict)) for tokens in tokenized_texts]
retokenized_texts

...which reports retokenized_texts as:

[
  ['the', 'new', 'machine_learning', 'model', 'built', 'by', 'google', 'is', 'revolutionary', 'for', 'the', 'current', 'state', 'of', 'artificial', 'intelligence.', 'it', 'may', 'change', 'the', 'way', 'we', 'are', 'thinking'], 
  ['data_science', 'and', 'artificial_intelligence', 'are', 'two', 'different', 'fields,', 'although', 'they', 'are', 'interconnected.', 'scientists', 'from', 'harvard', 'are', 'explaining', 'it', 'in', 'a', 'detailed', 'presentation', 'that', 'could', 'be', 'found', 'on', 'our', 'page.']
]

Note that the pair ('artificial', 'intelligence.') is not combined here, because the very simple .split() tokenization leaves punctuation attached, preventing an exact match against the rule.
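You can check the mismatch directly against the combinations_dict built earlier:

('artificial', 'intelligence.') in combinations_dict  # False: trailing '.' breaks the match
('artificial', 'intelligence') in combinations_dict   # True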

A real project will want a more sophisticated tokenization that strips punctuation, keeps punctuation as standalone tokens, or does other preprocessing, so that 'artificial' arrives as a token with no '.' attached. For example, a simple tokenization that keeps only runs of word characters, discarding punctuation, would be:

import re
tokenized_texts = [re.findall(r'\w+', text) for text in texts]
tokenized_texts

Another, which keeps any stray non-word/non-space characters (punctuation) as standalone tokens, would be:

tokenized_texts = [re.findall(r'\w+|(?:[^\w\s])', text) for text in texts]
tokenized_texts

Either of these alternatives to the simple .split() ensures that your first text presents the necessary ('artificial', 'intelligence') pair for combination.
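Putting the pieces together, here is a minimal sketch that also restores the per-sentence nesting shown in the question; note it naively treats '.' as the sentence boundary, which fits the sample data but is an assumption, not a general sentence splitter:

import re

def combine_text(text):
    # split on '.', drop empty pieces, then tokenize each sentence and
    # run it through the combining generator
    sentences = (s for s in text.split('.') if s.strip())
    return [list(combining_generator(re.findall(r'\w+', s), combinations_dict))
            for s in sentences]

result = [combine_text(text) for text in texts]
# result matches the nested per-sentence structure requested in the question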