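How to remove numbers from string but keep specific groups of numbers?
I want to use a Python regex to remove the numbers from a string while keeping the numbers 754 and 1231, because they refer to tax section code 754 and section code 1231. For example, I have the following text data: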
test="""Dividends 9672
Dividends 9680
Interest Income
Ordinary Dividends
Royalties
Capital Gain Distributions
Income from Blackstone
Ordinary Income
Rental Income
Long Term Capital Gain
Short Term Capital Gain
1231 Gain
Section 754 Stock Basis Adjustment - 2015
M-1 Section 754 Stock Basis Adjustment - 2015
Section 754 Stock Basis Adjustment - 2018
M-1 Section 754 Stock basis adjustment - 2018
"""
I want the output to be:
Dividends
Dividends
Interest Income
Ordinary Dividends
Royalties
Capital Gain Distributions
Income from Blackstone
Ordinary Income
Rental Income
Long Term Capital Gain
Short Term Capital Gain
1231 Gain
Section 754 Stock Basis Adjustment
M- Section 754 Stock Basis Adjustment
Section 754 Stock Basis Adjustment
M- Section 754 Stock basis adjustment
My attempt was:
test=re.sub(r'[^(754)(1231)A-Za-z]','',test)
print(test)
But it does not treat 754 or 1231 as whole groups; it only removes the digits 6, 8, and 9.
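A short sketch of why the attempt behaves this way: inside square brackets, parentheses do not group anything, so the class is just a set of single characters. The `same_set` pattern and the sample `line` below are illustrative names, not part of the original post.

```python
import re

# The attempted pattern: parentheses inside [...] are literal characters.
attempt = r'[^(754)(1231)A-Za-z]'
# An equivalent way to write the same negated set of single characters.
same_set = r'[^()123457A-Za-z]'

line = "Section 754 Stock Basis Adjustment - 2015"

# Every character outside the set is removed, including spaces and hyphens,
# while the digits 1, 2, 3, 4, 5 and 7 survive wherever they occur.
print(re.sub(attempt, '', line))   # Section754StockBasisAdjustment215
print(re.sub(same_set, '', line))  # Section754StockBasisAdjustment215
```

The "215" left over from "2015" shows that digits are kept one at a time rather than as the groups 754 and 1231.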
You can use
re.sub(r'(754|1231)|[^A-Za-z\s]', r'\1', text)
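See the regex demo.
Here, (754|1231) matches and captures into Group 1 a 754 or 1231 digit sequence, and then |[^A-Za-z\s] matches any character other than an ASCII letter or Unicode whitespace. Each match is replaced with the Group 1 value, so whatever was captured is put back into the string and everything else is dropped.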
Note: if the numbers must be matched as exact numbers, use digit boundaries:
re.sub(r'(?<!\d)(754|1231)(?!\d)|[^A-Za-z\s]', r'\1', text)
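As a minimal sketch of the digit-boundary variant in use: the names `KEEP_OR_DROP` and `strip_numbers` are made up for illustration, and the code assumes Python 3.5 or later, where a group that did not take part in the match is replaced with an empty string by `\1`.

```python
import re

# Alternative 1 captures a protected code; alternative 2 matches any other
# character that is neither an ASCII letter nor whitespace.
KEEP_OR_DROP = re.compile(r'(?<!\d)(754|1231)(?!\d)|[^A-Za-z\s]')

def strip_numbers(text):
    # r'\1' puts the captured code back; for alternative 2 the group is
    # unmatched, so the replacement is empty (Python 3.5+).
    return KEEP_OR_DROP.sub(r'\1', text)

print(repr(strip_numbers("1231 Gain")))       # '1231 Gain'
print(repr(strip_numbers("Dividends 9672")))  # 'Dividends '
print(repr(strip_numbers("7541")))            # ''
```

Note that this pattern also removes hyphens and other punctuation, and it leaves the surrounding spaces in place, so "M-1 Section 754" becomes "M Section 754" and lines that ended in a number keep a trailing space.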
You could write it like this.
rgx = r' *-? *(?<!\d)(?!(?:754|1231)(?!\d))\d+'
re.sub(rgx, '', test)
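Note that this removes all unwanted spaces and hyphens as well as digits, and that, for example, '7541' would be matched and replaced with an empty string.
The regular expression can be broken down as follows. (I have replaced the initial space with a character class containing a space so that it is visible.)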
[ ]*-? * # match >= 0 spaces, optionally followed by a hyphen,
# followed by >= 0 spaces
(?<!\d) # negative lookbehind asserts that preceding character is
# not a digit
(?! # begin negative lookahead
(?:754|1231) # match '754' or '1231'
(?!\d) # negative lookahead asserts that next character is
# not a digit
) # end negative lookahead
\d+ # match >= 1 digits
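The breakdown above can be sketched as a runnable pattern using `re.VERBOSE`, which lets the comments live inside the regex. The names `RGX` and `drop_other_numbers` are illustrative, and both spaces are written as `[ ]` because bare whitespace is ignored in verbose mode.

```python
import re

RGX = re.compile(r'''
    [ ]*-?[ ]*                # zero or more spaces, optional hyphen, spaces
    (?<!\d)                   # preceding character is not a digit
    (?!(?:754|1231)(?!\d))    # do not start at a standalone 754 or 1231
    \d+                       # one or more digits to remove
''', re.VERBOSE)

def drop_other_numbers(text):
    # Remove each unwanted number together with the spaces and hyphen before it.
    return RGX.sub('', text)

print(drop_other_numbers("Dividends 9672"))   # Dividends
print(drop_other_numbers("1231 Gain"))        # 1231 Gain
print(drop_other_numbers("Section 754 Stock Basis Adjustment - 2015"))
# Section 754 Stock Basis Adjustment
```

Because the optional hyphen is part of the match, "M-1 Section 754" comes out as "M Section 754" rather than the "M- Section 754" shown in the question.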