Substring replacements based on replace and no-replace rules

I have a string and a set of rules/mappings that say what to replace and what not to replace.

For example:

"This is an example sentence that needs to be processed into a new sentence."
"This is a second example sentence that shows how 'sentence' in 'sentencepiece' should not be replaced."

Replacement rules:

replace_dictionary = {'sentence': 'processed_sentence'}
no_replace_set = {'example sentence'}

Result:

"This is an example sentence that needs to be processed into a new processed_sentence."
"This is a second example sentence that shows how 'processed_sentence' in 'sentencepiece' should not be replaced."

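Additional conditions:

  1. Replace only when the case matches, i.e. matching is case-sensitive.
  2. Replace whole words only; punctuation is ignored when matching and preserved after the replacement.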
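
As a minimal sketch of what these two conditions mean in practice, the snippet below uses the `\b` word-boundary anchor from Python's `re` module, which is case-sensitive by default. The helper name `find_whole_words` and the sample text are made up for illustration and are not part of the question.

```python
import re

# Hypothetical helper, for illustration only: return the (start, end) span of
# every case-sensitive, whole-word occurrence of `word` in `text`.
def find_whole_words(text, word):
    pattern = re.compile(r"\b" + re.escape(word) + r"\b")
    return [m.span() for m in pattern.finditer(text)]

sample = "Sentence one. A 'sentence', then sentencepiece and a sentence."
spans = find_whole_words(sample, "sentence")

# "Sentence" (different case) and "sentencepiece" (embedded) are skipped;
# the quoted word and the one before the full stop are found.
print([sample[start:end] for start, end in spans])
```

Because `\b` treats the underscore as a word character, an already replaced `processed_sentence` would not match `sentence` again under this definition.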

What would be the cleanest way to solve this in Python 3.x?

Based on demongolem's answer

Update

Sorry, I missed the fact that only whole words should be replaced. I have updated my code and generalized it into a function.

import re

def replace_whole(sentence, replace_token, replace_with, dont_replace):
    # A "whole word" is the token with nothing but a quote, punctuation mark,
    # or space directly before and after it. The string boundaries count too.
    not_delim = r"""[^"'.,:; ]"""
    rx = rf"(?<!{not_delim}){re.escape(replace_token)}(?!{not_delim})"

    context_size = len(dont_replace)
    pieces = []
    last_end = 0
    for m in re.finditer(rx, sentence):
        # Widest window in which dont_replace could still overlap the match
        start = max(m.start() - context_size + 1, 0)
        context = sentence[start:m.end() + context_size - 1]
        if dont_replace in context:
            continue
        # Splice the replacement in at the match position only, so the
        # surrounding punctuation and the rest of the text stay untouched
        pieces.append(sentence[last_end:m.start()])
        pieces.append(replace_with)
        last_end = m.end()
    pieces.append(sentence[last_end:])
    return "".join(pieces)

finditer() with a regular expression locates every occurrence of the token together with its position (the position is needed to check whether the token is a whole word or embedded in a longer word). You may need to adjust rx to your own definition of a "whole word". For each occurrence, take a context window sized by the no_replace rule and check whether it contains your no_replace string. If it does not, the replacement is spliced in at exactly that position instead of calling replace() on the whole text. Each replacement is therefore tied to a single occurrence and cannot touch other parts of the text, such as the protected phrase or 'sentencepiece'.

With your examples, this gives:

replace_whole(sen2, "sentence", "processed_sentence", "example sentence")
>>>"This is a second example sentence that shows how 'processed_sentence' in 'sentencepiece' should not be replaced."

replace_whole(sen1, "sentence", "processed_sentence", "example sentence")
>>>'This is an example sentence that needs to be processed into a new processed_sentence.'

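After some research, I think this is the best and cleanest way to solve my problem. The solution works by calling match_fun whenever a match is found, and match_fun performs the replacement if and only if no "no-replace phrase" overlaps the current match. Let me know if you need more clarification or think something can be improved.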

import re

replace_dict = ...  # The code below assumes you already have this
no_replace_dict = ...  # Assumed to exist; maps a key of replace_dict to its no-replace phrases
text = ...  # The input text

def match_fun(match: re.Match) -> str:
    str_match: str = match.group()

    # No no-replace phrases are registered for this word: always replace
    if str_match not in no_replace_dict:
        return replace_dict[str_match]

    for no_replace in no_replace_dict[str_match]:
        no_replace_matches_iter = re.finditer(r'\b' + re.escape(no_replace) + r'\b', text)
        for no_replace_match in no_replace_matches_iter:
            # Two spans overlap when each one starts before the other ends
            if no_replace_match.start() < match.end() and match.start() < no_replace_match.end():
                return str_match  # leave the word unchanged

    return replace_dict[str_match]

for replace in replace_dict:
    pattern = re.compile(r'\b' + re.escape(replace) + r'\b')
    text = pattern.sub(match_fun, text)
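
For readers who want to try this on the sentences from the question, here is a self-contained sketch of the same idea wrapped in a function. The function name `replace_with_exceptions` and the shape of `no_replace_dict` (each word mapped to the set of phrases that protect it) are assumptions made for this example; the overlap test is the usual "each span starts before the other ends" check.

```python
import re

def replace_with_exceptions(text, replace_dict, no_replace_dict):
    """Replace whole words from replace_dict unless a no-replace phrase overlaps them."""
    for word, replacement in replace_dict.items():
        # Spans of every no-replace phrase registered for this word (assumed layout).
        protected = [
            m.span()
            for phrase in no_replace_dict.get(word, ())
            for m in re.finditer(r"\b" + re.escape(phrase) + r"\b", text)
        ]

        def match_fun(match, protected=protected, replacement=replacement):
            start, end = match.span()
            # Keep the word as-is when any protected span overlaps it.
            if any(p_start < end and start < p_end for p_start, p_end in protected):
                return match.group()
            return replacement

        text = re.sub(r"\b" + re.escape(word) + r"\b", match_fun, text)
    return text

replace_dict = {"sentence": "processed_sentence"}
no_replace_dict = {"sentence": {"example sentence"}}

sen1 = "This is an example sentence that needs to be processed into a new sentence."
sen2 = ("This is a second example sentence that shows how 'sentence' in "
        "'sentencepiece' should not be replaced.")

print(replace_with_exceptions(sen1, replace_dict, no_replace_dict))
print(replace_with_exceptions(sen2, replace_dict, no_replace_dict))
```

Collecting the protected spans once per word, before substituting, avoids rescanning the text for every match and keeps all positions relative to the same string.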