Add a single space and comma between words that are connected using regex
I have a nested list, list_3, that looks like this:
[['Company OverviewCompany: HowSector: SoftwareYear Founded: 2010One Sentence Pitch: Easily give and request low-quality feedback with your team to achieve more togetherUniversity Affiliation(s): Duke$ Raised: 0,000Investors: Friends & familyTraction to Date: 10% of monthly active users (MAU) are also active weekly'], [['Company OverviewCompany: GrubSector: SoftwareYear Founded: 2018One Sentence Pitch: Find food you likeUniversity Affiliation(s): Stanford$ Raised: 0,000Investors: Friends & familyTraction to Date: 40% of monthly active users (MAU) are also active weekly']]]
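I want to use a regex to insert a comma followed by a space between each pair of joined words (for example HowSector:, SoftwareYear, 2010One). So far I have tried writing re.sub code that selects the places with no whitespace and replaces them, but I have run into some problems:
for i, list in enumerate(list_3):
    list_3[i] = [re.sub('r\s\s+', ', ', word) for word in list]
    list_33.append(list_3[i])
print(list_33)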
Error:
return _compile(pattern, flags).sub(repl, string, count)
TypeError: expected string or bytes-like object
I would like the output to be:
[['Company Overview, Company: How, Sector: Software, Year Founded: 2010, One Sentence Pitch: Easily give and request low-quality feedback with your team to achieve more together University, Affiliation(s): Duke, $ Raised: 0,000, Investors: Friends & family, Traction to Date: 10% of monthly active users (MAU) are also active weekly'],[...]]
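Any ideas on how to do this with a regex?
The main problem is that your nested list does not have a fixed depth. Sometimes it has 2 levels and sometimes 3. That is why you get the error above: where the list has 3 levels, re.sub receives a list rather than a string as its third argument.
The second problem is that the regex you are using is not the right one. The simplest regex we can use here should (at least) be able to find a non-whitespace character followed by a capital letter.
In the example code below I use re.compile (since the same regex is applied over and over, we may as well precompile it and gain a little performance), and I just print the output. You will need to work out how to get the output in the format you want.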
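To make the cause of the TypeError concrete, here is a minimal sketch. The `sample` list is a shortened, hypothetical stand-in for list_3, and `describe` is a helper name invented for this illustration; it only reports which items re.sub accepts.

```python
import re

# Hypothetical, shortened stand-in for list_3: the second entry is nested one level deeper.
sample = [['HowSector: Software'], [['GrubSector: Software']]]

def describe(nested):
    """For each item of each inner list, record its type and whether re.sub accepts it."""
    results = []
    for inner in nested:
        for item in inner:
            try:
                re.sub(r'\s\s+', ', ', item)
                results.append((type(item).__name__, 'ok'))
            except TypeError:
                # re.sub only accepts str or bytes-like objects as the string argument.
                results.append((type(item).__name__, 'TypeError'))
    return results

print(describe(sample))
```

The item taken from the two-level entry is a string and is accepted, while the item taken from the three-level entry is still a list and triggers the error.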
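import re

regex = re.compile(r'(\S)([A-Z])')
# Put both captured characters back, with ', ' between them.
replacement = r'\1, \2'
# nested_list is the nested list from the question (list_3).
for inner_list in nested_list:
    for string_or_list in inner_list:
        if isinstance(string_or_list, str):
            print(regex.sub(replacement, string_or_list))
        else:
            for string in string_or_list:
                print(regex.sub(replacement, string))
Output: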
Company Overview, Company: How, Sector: Software, Year Founded: 2010, One Sentence Pitch: Easily give and request low-quality feedback with your team to achieve more together, University Affiliation(s): Duke$ Raised: 0,000, Investors: Friends & family, Traction to Date: 10% of monthly active users (, MA, U) are also active weekly
Company Overview, Company: Grub, Sector: Software, Year Founded: 2018, One Sentence Pitch: Find food you like, University Affiliation(s): Stanford$ Raised: 0,000, Investors: Friends & family, Traction to Date: 40% of monthly active users (, MA, U) are also active weekly
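If your list of lists is arbitrarily deep, you can traverse it recursively, process the strings (using this regex), and produce the same structure:
import re
from collections.abc import Iterable

def process(l):
    for el in l:
        if isinstance(el, Iterable) and not isinstance(el, (str, bytes)):
            # Rebuild the nested container with the same type.
            yield type(el)(process(el))
        else:
            # Splitting on a zero-width match requires Python 3.7+.
            yield ', '.join(re.split(r'(?<=[a-z])(?=[A-Z])', el))
Given your example LoL, the result is: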
>>> list(process(LoL))
[['Company Overview, Company: How, Sector: Software, Year Founded: 2010One Sentence Pitch: Easily give and request low-quality feedback with your team to achieve more together, University Affiliation(s): Duke$ Raised: 0,000Investors: Friends & family, Traction to Date: 10% of monthly active users (MAU) are also active weekly'], [['Company Overview, Company: Grub, Sector: Software, Year Founded: 2018One Sentence Pitch: Find food you like, University Affiliation(s): Stanford$ Raised: 0,000Investors: Friends & family, Traction to Date: 40% of monthly active users (MAU) are also active weekly']]]
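In the result above, boundaries that follow a digit (2010One, 0,000Investors) are not split, because the lookbehind only allows a lowercase letter. One possible variation, offered as a sketch rather than as part of the original answer, is to also allow a digit in the lookbehind. The names `BOUNDARY`, `add_separators`, and `sample` are invented for this illustration, and the function assumes the structure contains only lists and strings.

```python
import re

# Assumption: a lowercase letter OR a digit directly before a capital letter marks a boundary.
BOUNDARY = re.compile(r'(?<=[a-z\d])(?=[A-Z])')

def add_separators(obj):
    """Recursively rebuild nested lists, inserting ', ' at each boundary inside strings."""
    if isinstance(obj, str):
        return BOUNDARY.sub(', ', obj)
    return [add_separators(item) for item in obj]

# Hypothetical, shortened sample with mixed nesting depth.
sample = [['Year Founded: 2010One Sentence Pitch'],
          [['0,000Investors: Friends & familyTraction (MAU)']]]
print(add_separators(sample))
```

Because the lookbehind excludes uppercase letters and punctuation, "(MAU)" is left intact. This variation still does not separate "Duke$ Raised", since "$" is not a capital letter.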
I believe you can use the following Python code.
rgx = r'(?<=[a-z\d])([A-Z$][A-Za-z]*(?: +\S+?)*):'
rep = r', \1:'
re.sub(rgx, rep, s)
where s is the string.
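Python's regex engine performs the following operations when matching:
(?<=          : begin positive lookbehind
  [a-z\d]     : match a lowercase letter or digit
)             : end positive lookbehind
(             : begin capture group 1
  [A-Z$]      : match a capital letter or '$'
  [A-Za-z]*   : match 0+ letters
  (?: +\S+?)  : match 1+ spaces greedily, 1+ non-spaces
                non-greedily in a non-capture group
  *           : execute non-capture group 0+ times
)             : end capture group
:             : match ':'
Note that the positive lookbehind and the characters permitted for each token in the capture group may need to be adjusted to suit your requirements.
The replacement string (, \1:) produces the string ', ', followed by the contents of capture group 1, followed by a colon.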
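As a worked example of the breakdown above, the sketch below applies a label-based pattern to a shortened version of one string from the question. It assumes the pattern has no quantifier after the capture group (matching the breakdown) and that the replacement is r', \1:'. The names `LABEL`, `separate_labels`, and `s` are chosen for this illustration only.

```python
import re

# Assumption: a "label" starts with a capital letter or '$' right after a lowercase
# letter or digit, may contain further space-separated words, and ends with ':'.
LABEL = re.compile(r'(?<=[a-z\d])([A-Z$][A-Za-z]*(?: +\S+?)*):')

def separate_labels(text):
    """Insert ', ' before every label that is glued to the preceding word."""
    return LABEL.sub(r', \1:', text)

# Shortened sample taken from the second string in the question.
s = ('Company OverviewCompany: GrubSector: SoftwareYear Founded: 2018'
     'One Sentence Pitch: Find food you likeUniversity Affiliation(s): '
     'Stanford$ Raised: 0,000Investors: Friends & family')
print(separate_labels(s))
```

Because the pattern anchors on the trailing colon, it handles multi-word labels such as "One Sentence Pitch" and the "$ Raised" label, which a plain lowercase-to-uppercase boundary would miss.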