Preprocessing to remove dashes, but not hyphens, from sentences
What I want to do
I want to remove dashes, but not hyphens, from sentences as part of NLP preprocessing.
Input
samples = [
'A former employee of the accused company, ———, offered a statement off the record.', #three dashes
'He is afraid of two things — spiders and senior prom.' #dash
'Fifty-six bottles of pop on the wall, fifty-six bottles of pop.' #hyphen
]
Expected output
#output
['A former employee of the accused company','offered a statement off the record.']
['He is afraid of two things', 'spiders and senior prom.']
['Fifty-six bottles of pop on the wall', 'fifty-six bottles of pop.']
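The sentences above come from the following two articles about hyphens and dashes.
Problem
- The first step, which is meant to remove the '-' symbol, fails, and I cannot work out why the second and third sentences are printed without single quotes ('') separating them.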
#output
['A former employee of the accused company, — — —, offered a statement off the record.',
'He is afraid of two things—spiders and senior prom.
Fifty-six bottles of pop on the wall, fifty-six bottles of pop.']
- I don't know how to write code that distinguishes a hyphen from a dash.
Current code
samples = [
'A former employee of the accused company, — — —, offered a statement off the record.', #dash
'He is afraid of two things—spiders and senior prom.' #dash
'Fifty-six bottles of pop on the wall, fifty-six bottles of pop.' #hyphen
]
ignore_symbol = ['-']
for i in range(len(samples)):
    text = samples[i]
    ret = []
    for word in text.split(' '):
        ignore = len(word) <= 0
        for iw in ignore_symbol:
            if word == iw:
                ignore = True
                break
        if not ignore:
            ret.append(word)
    text = ' '.join(ret)
    samples[i] = text
print(samples)
#output
['A former employee of the accused company, — — —, offered a statement off the record.',
'He is afraid of two things—spiders and senior prom.
Fifty-six bottles of pop on the wall, fifty-six bottles of pop.']
for i in range(len(samples)):
    list_temp = []
    text = samples[i]
    list_temp.extend([x.strip() for x in text.split(',') if not x.strip() == ''])
    samples[i] = list_temp
print(samples)
#output
[['A former employee of the accused company',
'— — —',
'offered a statement off the record.'],
['He is afraid of two things—spiders and senior prom.Fifty-six bottles of pop on the wall',
'fifty-six bottles of pop.']]
Development environment
Python 3.7.0
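Try splitting with a regular expression via re.split. Python's str.split() is too limited for this. You then need to pass the Unicode escape for the dash character (the em dash is U+2014, whereas U+002D is the ordinary hyphen-minus).
Something like:
re.split('[\u2014]', text)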
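A minimal sketch of the regex approach, assuming that commas and em dashes (U+2014) are the only separators you care about. The helper name split_on_dashes and the SEPARATORS pattern are illustrative, not part of the original answer.

```python
import re

# Split on commas and em dashes (U+2014). The ASCII hyphen-minus (U+002D)
# is deliberately left out of the character class, so "Fifty-six" survives.
SEPARATORS = re.compile('[,\u2014]')

def split_on_dashes(text):
    """Hypothetical helper: split on commas/em dashes and drop empty pieces."""
    return [part.strip() for part in SEPARATORS.split(text) if part.strip()]

samples = [
    'A former employee of the accused company, \u2014\u2014\u2014, offered a statement off the record.',
    'He is afraid of two things \u2014 spiders and senior prom.',
    'Fifty-six bottles of pop on the wall, fifty-six bottles of pop.',
]

result = [split_on_dashes(s) for s in samples]
print(result)
```

Because the hyphen is simply absent from the character class, no special handling is needed to keep hyphenated words intact.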
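If you are looking for a non-regex solution, the Unicode code point of the em dash is 8212 (U+2014), so you can replace each dash with ',', then split on ',', and keep only the non-blank pieces: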
>>> samples = [
'A former employee of the accused company, ———, offered a statement off the record.', #three dashes
'He is afraid of two things — spiders and senior prom.', #dash
'Fifty-six bottles of pop on the wall, fifty-six bottles of pop.' #hyphen
]
>>> output = [[
sentence.strip() for sentence in elem.replace(chr(8212), ',').split(',')
if sentence.strip()
] for elem in samples]
>>> output
[['A former employee of the accused company',
'offered a statement off the record.'],
['He is afraid of two things', 'spiders and senior prom.'],
['Fifty-six bottles of pop on the wall', 'fifty-six bottles of pop.']]
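First, sentences 2 and 3 are merged because there is no comma separating the two strings. In Python, writing tmp = 'a''b' is equivalent to tmp = 'ab', which is why samples contains only 2 strings (the 2nd and 3rd have been merged).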
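The merging can be reproduced in isolation. This is a small illustration with made-up list contents, showing how a missing comma silently fuses two adjacent string literals:

```python
# Adjacent string literals are concatenated into a single string.
tmp = 'a' 'b'

missing_comma = [
    'first',
    'second'  # no comma here, so this literal fuses with the next one
    'third',
]
with_comma = ['first', 'second', 'third']

print(tmp)
print(len(missing_comma), missing_comma)
print(len(with_comma), with_comma)
```

A trailing comment after the string, as in the question's samples list, does not act as a separator; only the comma does.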
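Regarding your question:
The function remove_dash_preserve_hyphen below removes every dash from its str_sentence argument and returns the cleaned string.
The function is then applied to each string in the samples list, producing the cleaned list samples_without_dash.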
samples = [
'A former employee of the accused company, ———, offered a statement off the record.', #three dashes
'He is afraid of two things — spiders and senior prom.',#**(COMMA HERE)** #dash
'Fifty-six bottles of pop on the wall, fifty-six bottles of pop.' #hyphen
]
def remove_dash_preserve_hyphen(str_sentence, dash_signatures=['\u2014']):  # '\u2014' is the em dash
    for dash_sig in dash_signatures:
        str_sentence = str_sentence.replace(dash_sig, '')
    return str_sentence
samples_without_dash = [remove_dash_preserve_hyphen(sentence) for sentence in samples]
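The exact dash in question is the em dash, Unicode U+2014.
Your samples may contain other kinds of dash that you also want removed. Keep track of them and pass a list of all the unwanted dash types in the dash_signatures parameter when calling remove_dash_preserve_hyphen.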
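One way to keep track of dash types is to inspect characters with the standard unicodedata module. The sketch below assumes a particular set of unwanted dashes (figure dash, en dash, em dash, horizontal bar). The names DASHES, describe and remove_dashes are illustrative, and unlike the function above this variant replaces each dash with a space and then collapses repeated whitespace.

```python
import unicodedata

# Assumed set of unwanted dashes; extend it to match your own corpus.
# figure dash, en dash, em dash, horizontal bar
DASHES = ('\u2012', '\u2013', '\u2014', '\u2015')

def describe(ch):
    """Return the code point and official Unicode name of a character."""
    return 'U+{:04X} {}'.format(ord(ch), unicodedata.name(ch))

def remove_dashes(sentence, dashes=DASHES):
    """Replace each listed dash with a space, then collapse extra whitespace."""
    for dash in dashes:
        sentence = sentence.replace(dash, ' ')
    return ' '.join(sentence.split())

print(describe('-'))        # the keyboard hyphen
print(describe('\u2014'))   # the em dash from the samples
print(remove_dashes('Fifty-six things \u2013 or so \u2014 remain.'))
```

Printing describe() for any suspicious character in your data shows whether it is the plain hyphen-minus or one of the dashes, which makes it easier to decide what belongs in the list.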