Replace single quotes with double with exclusion of some elements

Question

I want to replace all single quotes in a string with double quotes, except for occurrences such as "n't", "'ll", "'m", etc.

input="the Whosebug don\'t said, \'hey what\'"
output="the Whosebug don\'t said, \"hey what\""

Code 1: (@https://whosebug.com/users/918959/antti-haapala)

import re

def convert_regex(text):
    return re.sub(r"(?<!\w)'(?!\w)|(?<!\w)'(?=\w)|(?<=\w)'(?!\w)", '"', text)

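There are 3 cases: the ' is neither preceded nor followed by an alphanumeric character; or it is not preceded by one but is followed by one; or it is preceded by one but not followed by one.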
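Taken together, the three cases cover every apostrophe that does not have a word character on both sides, so the pattern can be shortened. A minimal sketch for Python 3 that checks the two forms agree on a few sample strings; the name convert_regex_short and the extra samples are made up for this illustration:

```python
import re

def convert_regex(text):
    # Original three-alternative pattern from Code 1.
    return re.sub(r"(?<!\w)'(?!\w)|(?<!\w)'(?=\w)|(?<=\w)'(?!\w)", '"', text)

def convert_regex_short(text):
    # Hypothetical shorter form: replace ' unless a word character sits on both sides.
    return re.sub(r"(?<!\w)'|'(?!\w)", '"', text)

samples = [
    "the Whosebug don't said, 'hey what'",
    "a ' b",
    "it's 'fine'",
]
for s in samples:
    # Both patterns should produce the same result on every sample.
    assert convert_regex(s) == convert_regex_short(s)
    print(convert_regex_short(s))
```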

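Problem: this does not work for words that end with an apostrophe, i.e. most possessive plurals, and it also does not work for informal abbreviations that start with an apostrophe.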
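Both failure modes are easy to reproduce. The sketch below (Python 3) runs Code 1 on two made-up sample strings, one with a possessive plural and one with a leading-apostrophe abbreviation:

```python
import re

def convert_regex(text):
    return re.sub(r"(?<!\w)'(?!\w)|(?<!\w)'(?=\w)|(?<=\w)'(?!\w)", '"', text)

# Possessive plural: the trailing apostrophe is wrongly turned into a quote.
plural = convert_regex("the persons' cars")
print(plural)    # the persons" cars

# Leading-apostrophe abbreviation: also wrongly converted.
leading = convert_regex("'twas late")
print(leading)   # "twas late
```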

Code 2: (@https://whosebug.com/users/953482/kevin)

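def convert_text_func(s):
    c = "_"  # placeholder character. Must NOT appear in the string.
    assert c not in s
    # Map each protected word to a copy whose apostrophe is hidden behind the placeholder.
    protected = {word: word.replace("'", c) for word in ["don't", "it'll", "I'm"]}
    for k, v in protected.items():
        s = s.replace(k, v)
    # Every remaining single quote is a real quotation mark.
    s = s.replace("'", '"')
    # Restore the protected words.
    for k, v in protected.items():
        s = s.replace(v, k)
    return s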

There are too many words to list by hand, and it is unclear how to specify cases such as persons', etc. Please help.

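Edit 1: I am using @anubhava's great answer, but I am facing the following problem: the approach sometimes fails on words translated from another language. Code: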

text=re.sub(r"(?<!s)'(?!(?:t|ll|e?m|s|d|ve|re|clock)\b)", '"', text)

Problem:

In the text 'Kumbh melas', melas is a Hindi-to-English translation, not a plural possessive noun.

Input="Similar to the 'Kumbh melas', celebrated by the banks of the holy rivers of India,"
Output=Similar to the "Kumbh melas', celebrated by the banks of the holy rivers of India,
Expected Output=Similar to the "Kumbh melas", celebrated by the banks of the holy rivers of India,

I am looking for a condition that could be added to fix this somehow. Human intervention is the last resort.
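The closing quote is skipped because it directly follows an s, which the (?<!s) lookbehind treats as a possessive plural. The sketch below (Python 3) reproduces this and shows the trade-off: dropping the lookbehind fixes this sentence but breaks genuine possessives. The names WITH_LOOKBEHIND and WITHOUT_LOOKBEHIND and the second sample string are made up for this illustration:

```python
import re

WITH_LOOKBEHIND = r"(?<!s)'(?!(?:t|ll|e?m|s|d|ve|re|clock)\b)"
WITHOUT_LOOKBEHIND = r"'(?!(?:t|ll|e?m|s|d|ve|re|clock)\b)"

text = "Similar to the 'Kumbh melas', celebrated by the banks of the holy rivers of India,"

# The apostrophe after "melas" follows an s, so it is left untouched.
kept = re.sub(WITH_LOOKBEHIND, '"', text)
print(kept)

# Without the lookbehind the closing quote is converted ...
fixed = re.sub(WITHOUT_LOOKBEHIND, '"', text)
print(fixed)

# ... but a real possessive plural is now converted as well.
broken = re.sub(WITHOUT_LOOKBEHIND, '"', "the persons' cars")
print(broken)
```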

Edit 2: A naive and lengthy fix:

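import string
import enchant  # third-party PyEnchant package

def replace_translations(text):
    d = enchant.Dict("en_US")
    words = tokenize_words(text)  # tokenize_words is the asker's own helper (not shown)
    punctuations = set(string.punctuation)
    for i, word in enumerate(words):
        # Stop before the last token so that words[i+1] is always valid.
        if i < len(words) - 1 and word not in punctuations and not d.check(word) and words[i+1] == "'":
            # A non-dictionary word followed by ' is treated as the end of a quotation.
            text = text.replace(word + "'", word + '"')
    return text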

Are there edge cases I am missing, or is there a better approach?

Answer 1

You can use:

import re

input = "I'm one of the persons' Whosebug don't th'em said, 'hey what' I'll handle it."
print(re.sub(r"(?<!s)'(?!(?:t|ll|e?m)\b)", '"', input))

Output:

I'm one of the persons' Whosebug don't th'em said, "hey what" I'll handle it.
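Since the list of contraction endings tends to grow (the question later extends it with s, d, ve, re and clock), it may be easier to build the pattern from a list. A minimal sketch for Python 3; CONTRACTION_SUFFIXES, build_pattern and to_double_quotes are hypothetical names, and the suffix list is an assumption to be adapted to your text:

```python
import re

# Assumed suffix list; extend it for the contractions your text contains.
CONTRACTION_SUFFIXES = ["t", "ll", "m", "em", "s", "d", "ve", "re"]

def build_pattern(suffixes):
    # Escape each suffix and join them into one alternation.
    alternation = "|".join(re.escape(s) for s in suffixes)
    # Same shape as the regex above: skip ' after s, and skip ' before a listed suffix.
    return re.compile(r"(?<!s)'(?!(?:" + alternation + r")\b)")

QUOTE = build_pattern(CONTRACTION_SUFFIXES)

def to_double_quotes(text):
    return QUOTE.sub('"', text)

print(to_double_quotes("I'm one of the persons' Whosebug don't th'em said, 'hey what' I'll handle it."))
```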


Answer 2

Here is another possible approach:

import re

text = "I'm one of the persons' Whosebug don't th'em said, 'hey what' I'll handle it."

print(re.sub(r"((?<!s)'(?!\w+)|(\s+'))", '"', text))

I tried to avoid the need for special cases. It gives:

I'm one of the persons' Whosebug don't th'em said,"hey what" I'll handle it.
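Note that the space before the opening quote disappears in the output above, because \s+ is part of the match and gets replaced too. A possible variant (a sketch for Python 3, not the answerer's code) moves the whitespace check into a lookbehind so that nothing but the apostrophe is consumed:

```python
import re

text = "I'm one of the persons' Whosebug don't th'em said, 'hey what' I'll handle it."

# (?<!\S)' matches an apostrophe at the start of the string or right after
# whitespace, without consuming that whitespace.
result = re.sub(r"(?<!s)'(?!\w)|(?<!\S)'", '"', text)
print(result)
```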

Answer 3

Here is a non-regex way:

text="the Whosebug don't said, 'hey what'"

out = []
for i, j in enumerate(text):
    if j == '\'':
        if text[i-1:i+2] == "n't" or text[i:i+3] == "'ll" or text[i:i+3] == "'m":
            out.append(j)
        else:
            out.append('"')
    else:
        out.append(j)

print(''.join(out))

which gives the output

the Whosebug don't said, "hey what"

Of course, you can refine the exclusion list so that you do not have to check every exclusion by hand...
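One way to avoid a growing chain of or-conditions is to keep the endings in a tuple and let str.startswith test them all at once. A minimal sketch for Python 3; CONTRACTION_SUFFIXES and replace_quotes are hypothetical names, the suffix list is an assumption, and the extra "preceded by a letter" check is an addition that is not in the loop above:

```python
# Assumed list of contraction endings; "n't" is handled separately because
# its apostrophe sits in the middle of the ending.
CONTRACTION_SUFFIXES = ("'ll", "'m", "'re", "'ve", "'d", "'s")

def replace_quotes(text):
    out = []
    for i, ch in enumerate(text):
        if ch != "'":
            out.append(ch)
            continue
        is_nt = i > 0 and text[i - 1:i + 2] == "n't"
        # Only treat the apostrophe as part of a contraction when it follows a letter.
        after_letter = i > 0 and text[i - 1].isalpha()
        is_suffix = after_letter and text.startswith(CONTRACTION_SUFFIXES, i)
        out.append("'" if (is_nt or is_suffix) else '"')
    return "".join(out)

print(replace_quotes("the Whosebug don't said, 'hey what'"))
```

This sketch does not check that the ending is followed by a word boundary, so it inherits the same blind spots as the loop above for unusual inputs.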

Answer 4

Try this: you can use the regex ((?<=\s)'([^']+)'(?=\s)) and replace each match with "\2".

import re
p = re.compile(r"((?<=\s)'([^']+)'(?=\s))")
test_str = u"I'm one of the persons' Whosebug don't th'em said, 'hey what' I'll handle it."
subst = r'"\2"'  # group 2 is the text between the quotes

result = re.sub(p, subst, test_str)
print(result)

Output

I'm one of the persons' Whosebug don't th'em said, "hey what" I'll handle it.


Answer 5

First attempt

You can also use this regex:

(?:(?<!\w)'((?:.|\n)+?'?)'(?!\w))


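This regex matches a whole sentence/word that is enclosed in quotes, from the opening quote to the closing one, and also captures the quoted content in group 1, so you can replace the matched part with "\1".

(?<!\w) - a negative lookbehind for a word character, to exclude words such as "you'll" while still allowing the regex to match a quote after characters such as \n, :, ;, . or -. Assuming that there is always whitespace before a quote is risky.
' - a single quote,
((?:.|\n)+?'?) - capturing group 1, wrapping a non-capturing group: one or more of any character or a newline (to match multi-line sentences) with lazy quantification (to avoid matching from the first to the last single quote), followed by an optional single quote, in case there are two in a row,
'(?!\w) - a single quote not followed by a word character, to exclude text such as "i'm" or "you're", where the quote sits inside a word.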
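The pieces above can be put together in Python as follows. This is a minimal sketch for Python 3 with made-up sample sentences; QUOTED and requote are hypothetical names:

```python
import re

# First-attempt pattern; group 1 holds the text between the quotes.
QUOTED = re.compile(r"(?:(?<!\w)'((?:.|\n)+?'?)'(?!\w))")

def requote(text):
    # \1 re-inserts the captured quotation between double quotes.
    return QUOTED.sub(r'"\1"', text)

print(requote("I'm sure, 'hey what' he said"))   # I'm sure, "hey what" he said
print(requote("'I don't know' she said"))        # "I don't know" she said
```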

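The s' case

However, it still has trouble with sentences in which an apostrophe follows a word ending in s, e.g. 'the classes' hours'. I do not think a regex can tell whether an s followed by ' should be treated as the end of a quotation or as an s with an apostrophe. But I came up with a limited way to handle this problem with a regex: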

(?:(?<!\w)'((?:.|\n)+?'?)(?:(?<!s)'(?!\w)|(?<=s)'(?!([^']|\w'\w)+'(?!\w))))


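For the s' case there is an additional alternative: (?<!s)'(?!\w)|(?<=s)'(?!([^']|\w'\w)+'(?!\w)) where:

(?<!s)'(?!\w) - if there is no s before the ', match as in the regex above (first attempt),
(?<=s)'(?!([^']|\w'\w)+'(?!\w)) - if there is an s before the ', end the match at this ' only if the following text contains no other ' followed by a non-word character, before the end of the text or before another ' (but only a ' preceded by a letter other than s, or the opening of the next quotation). The \w'\w part is there so that a ' between letters, as in i'm, is included in such a match.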

This regex should match incorrectly only when there are several s' cases in a row. Still, it is far from a perfect solution.
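A minimal Python 3 sketch of the extended regex, applied to the 'the classes' hours' example and to the 'Kumbh melas' sentence from the question. S_CASE and requote are hypothetical names, and the first sample sentence is made up around the example phrase:

```python
import re

# Extended pattern with the extra alternative for a closing quote after s.
S_CASE = re.compile(
    r"(?:(?<!\w)'((?:.|\n)+?'?)"
    r"(?:(?<!s)'(?!\w)|(?<=s)'(?!([^']|\w'\w)+'(?!\w))))"
)

def requote(text):
    # Group 1 is the quoted content; group 2 lives inside a lookahead and is not used.
    return S_CASE.sub(r'"\1"', text)

print(requote("He said 'the classes' hours' today"))
print(requote("Similar to the 'Kumbh melas', celebrated by the banks of the holy rivers of India,"))
```

In the first sentence the apostrophe after classes is kept because another closing quote follows later; in the second, no later closing quote exists, so the apostrophe after melas is taken as the end of the quotation.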

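The flaw of \w

Moreover, with \w there is always a chance that the ' appears after a symbol, or after a character outside [a-zA-Z_0-9] that is nevertheless a letter, such as some local-language characters, and then it will be treated as the start of a quotation. This can be avoided by replacing (?<!\w) and (?!\w) with (?<!\p{L}) and (?!\p{L}), or with something like (?<=^|[,.?!)\s])', i.e. a positive lookaround for the characters that may appear in a sentence before a quotation. However, that list could get quite long.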
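In Python specifically, the standard re module does not support \p{L} (the third-party regex module does), but in Python 3 \w is already Unicode-aware for str patterns unless re.ASCII is set. The sketch below uses the first-attempt regex and a made-up sample string to show the difference between the two modes:

```python
import re

FIRST_ATTEMPT = r"(?<!\w)'((?:.|\n)+?'?)'(?!\w)"
text = "Jos\u00e9's 'hat'"  # made-up sample containing a non-ASCII letter

# ASCII-only \w: the accented e is not a word character, so the apostrophe
# in the possessive is mistaken for an opening quote.
ascii_result = re.sub(FIRST_ATTEMPT, r'"\1"', text, flags=re.ASCII)
print(ascii_result)

# Default for str patterns in Python 3: \w is Unicode-aware, so only the
# real quotation is converted.
unicode_result = re.sub(FIRST_ATTEMPT, r'"\1"', text)
print(unicode_result)
```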


