Convert any integer in a textual sentence to a string in Python
I have a series that looks something like this:
'0589 BTC: 581 OUTFLOW BANK REF: CUST REF: 0004'
'CUR FR 44F8 Availability: 12,267.24 Debited'
...
I want to replace every integer/alphanumeric value with its corresponding word form.
For example:
0589 --> ZERO FIVE EIGHT NINE
44F8 --> FOUR FOUR F EIGHT
12,267.24 --> ONE TWO , TWO SIX SEVEN . TWO FOUR
So the first item would be converted to
'ZERO FIVE EIGHT NINE BTC: FIVE EIGHT ONE OUTFLOW BANK REF: CUST REF: ZERO ZERO ZERO FOUR'
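and so on.
What is the way to solve this?
I have looked at some Python packages such as num2words and inflect, but they both return a human-readable format,
i.e. 22 --> twenty-two, which does not meet my requirement.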
conversion_dict = {1:'One' , 2 : 'Two' , 3 : 'Three' , 4 : 'Four' , 5: 'Five' , 6:'Six' , 7 : 'Seven' , 8:'Eight' , 9:'Nine' , 0: 'Zero'}
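You can try this approach:
- Use a regular expression to get all the numbers in the string
[\d]+
- Convert each match to an integer
- Use the modulo (%) operator to extract the individual digits from the number
- Finally, use your dictionary to look up the word for each digit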
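The steps above can be sketched roughly as follows. This is only one possible reading of the approach, not the answerer's own code: the names `spell_run` and `spell_digits` are hypothetical, and the space padding plus the final whitespace clean-up are assumptions added so that `44F8` comes out as `Four Four F Eight`. Note that converting a match to an integer drops leading zeros, so the sketch loops once per character of the original match to get them back.

```python
import re

conversion_dict = {1: 'One', 2: 'Two', 3: 'Three', 4: 'Four', 5: 'Five',
                   6: 'Six', 7: 'Seven', 8: 'Eight', 9: 'Nine', 0: 'Zero'}

def spell_run(match):
    """Spell out one run of digits (hypothetical helper)."""
    run = match.group(0)
    number = int(run)
    words = []
    # One iteration per original character, so leading zeros lost by int() reappear.
    for _ in range(len(run)):
        words.append(conversion_dict[number % 10])  # last digit via modulo
        number //= 10
    # Digits were collected right to left, so reverse them; pad with spaces
    # so the words never touch neighbouring letters or punctuation.
    return ' ' + ' '.join(reversed(words)) + ' '

def spell_digits(text):
    spaced = re.sub(r'\d+', spell_run, text)
    # Collapse the extra spaces introduced by the padding.
    return re.sub(r' +', ' ', spaced).strip()

print(spell_digits('0589 BTC: 581 OUTFLOW BANK REF: CUST REF: 0004'))
print(spell_digits('CUR FR 44F8 Availability: 12,267.24 Debited'))
```

Working on the matched string directly would avoid the integer round trip altogether; the modulo version is shown only because it follows the steps as listed.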
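Alternative solution
- Use
re.findall(r'[\d]', my_string)
- This gives you every digit in the string
- Then, for each digit, use (note that str.replace returns a new string, and the dictionary above has integer keys, hence int(digit))
my_string = my_string.replace(digit, f' {conversion_dict[int(digit)]} ')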
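Put together, the alternative might look like the sketch below. The function name `spell_digits_by_replace` is hypothetical, and wrapping the matches in `set()` and collapsing repeated spaces at the end are assumptions added here: each distinct digit only needs one `replace` call, and the padding otherwise leaves double spaces behind.

```python
import re

conversion_dict = {1: 'One', 2: 'Two', 3: 'Three', 4: 'Four', 5: 'Five',
                   6: 'Six', 7: 'Seven', 8: 'Eight', 9: 'Nine', 0: 'Zero'}

def spell_digits_by_replace(my_string):
    # Each distinct digit is replaced everywhere in one go, so duplicates
    # returned by findall can be dropped with set().
    for digit in set(re.findall(r'[\d]', my_string)):
        # str.replace returns a new string; the dictionary keys are ints.
        my_string = my_string.replace(digit, f' {conversion_dict[int(digit)]} ')
    # Tidy up the padding added around every word.
    return re.sub(r' +', ' ', my_string).strip()

print(spell_digits_by_replace('CUR FR 44F8 Availability: 12,267.24 Debited'))
```

The replacement words contain no digits themselves, so the order in which the digits are processed does not matter.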
You can iterate over the string and replace each digit with its corresponding word, as I have done here:
conversion_dict = {'1': 'One', '2': 'Two', '3': 'Three', '4': 'Four', '5': 'Five', '6': 'Six', '7': 'Seven', '8': 'Eight', '9': 'Nine', '0': 'Zero'}

def parser(string: str):
    def inner():
        # Pad each digit's word with spaces; pass every other character through.
        for char in string:
            yield " " + conversion_dict[char] + " " if char in conversion_dict else char

    result = "".join(inner()).lstrip()
    # Drop any space that directly follows another space.
    return "".join(s if not (result[i] == result[i - 1] == " ") else "" for i, s in enumerate(result))

string_for_example = """
0589 BTC: 581 OUTFLOW BANK REF: CUST REF: 0004
CUR FR 44F8 Availability: 12,267.24 Debited
"""
print(parser(string_for_example))
The result is:
Zero Five Eight Nine BTC: Five Eight One OUTFLOW BANK REF: CUST REF: Zero Zero Zero Four
CUR FR Four Four F Eight Availability: One Two , Two Six Seven . Two Four Debited
import pandas as pd

# your pandas series
s = pd.Series(['0589 BTC: 581 OUTFLOW BANK REF: CUST REF: 0004',
'CUR FR 44F8 Availability: 12,267.24 Debited'], name='Text')
# your conversion dict with strings not ints
conversion_dict = {'1':'One ' , '2' : 'Two ' , '3' : 'Three ' , '4' : 'Four ' ,
'5': 'Five ' , '6':'Six ' , '7' : 'Seven ' , '8':'Eight ' ,
'9':'Nine ' , '0': 'Zero '}
# use replace with regex set to true and then replace duplicate spaces between words
s.replace(conversion_dict, regex=True).replace(' +', ' ', regex=True).str.rstrip()
['Zero Five Eight Nine BTC: Five Eight One OUTFLOW BANK REF: CUST REF: Zero Zero Zero Four'
'CUR FR Four Four FEight Availability: One Two ,Two Six Seven .Two Four Debited']
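The output above contains `FEight`, `,Two` and `.Two` because each replacement word only carries a trailing space. A possible variant, offered as a sketch rather than as part of the answer, pads every word on both sides through a callable passed to `Series.str.replace`. The names `digit_words` and `spelled` are made up for this example.

```python
import pandas as pd

# Hypothetical mapping without built-in spaces; padding is added in the callable.
digit_words = {'0': 'Zero', '1': 'One', '2': 'Two', '3': 'Three', '4': 'Four',
               '5': 'Five', '6': 'Six', '7': 'Seven', '8': 'Eight', '9': 'Nine'}

s = pd.Series(['0589 BTC: 581 OUTFLOW BANK REF: CUST REF: 0004',
               'CUR FR 44F8 Availability: 12,267.24 Debited'], name='Text')

spelled = (
    # Replace every ASCII digit with its word, padded by a space on each side.
    s.str.replace(r'[0-9]', lambda m: f' {digit_words[m.group(0)]} ', regex=True)
     # Collapse the runs of spaces created by the padding.
     .str.replace(r' +', ' ', regex=True)
     .str.strip()
)

print(spelled.tolist())
```

With this padding the second item should read `Four Four F Eight` and `One Two , Two Six Seven . Two Four`, matching the format asked for in the question.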
To add my two cents, here is a possible solution:
import re

def num2digit(text):
    mapper = {
        '0': 'ZERO ',
        '1': 'ONE ',
        '2': 'TWO ',
        '3': 'THREE ',
        '4': 'FOUR ',
        '5': 'FIVE ',
        '6': 'SIX ',
        '7': 'SEVEN ',
        '8': 'EIGHT ',
        '9': 'NINE ',
    }
    # dict.iteritems() exists only in Python 2; items() works in Python 3.
    for k, v in mapper.items():
        text = text.replace(k, v)
    # Collapse repeated spaces, then trim the ends.
    return re.sub(' +', ' ', text).strip()

Then you can call it like this:
>>> num2digit('0589 BTC: 581 OUTFLOW BANK REF: CUST REF: 0004')
'ZERO FIVE EIGHT NINE BTC: FIVE EIGHT ONE OUTFLOW BANK REF: CUST REF: ZERO ZERO ZERO FOUR'
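To explain: it replaces each digit with its mapped name, adding a space after the name so that the words are separated as required, then collapses any repeated spaces and, finally, strips any trailing whitespace.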
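The same replace, collapse and strip sequence can also be written as a single pass with `str.translate`. The sketch below is a variation, not the answer's code: `DIGIT_NAMES`, `TABLE` and `num2digit_padded` are hypothetical names, and padding each name on both sides (instead of only after it) is an assumption made so that a digit following a letter or punctuation mark, as in `44F8` or `12,267.24`, is still separated.

```python
import re

DIGIT_NAMES = ('ZERO', 'ONE', 'TWO', 'THREE', 'FOUR',
               'FIVE', 'SIX', 'SEVEN', 'EIGHT', 'NINE')

# Translation table: each digit character maps to its name with a space on both sides.
TABLE = str.maketrans({str(i): f' {name} ' for i, name in enumerate(DIGIT_NAMES)})

def num2digit_padded(text):
    # 1. Replace every digit in one pass.
    # 2. Collapse the repeated spaces the padding creates.
    # 3. Strip leading and trailing whitespace.
    return re.sub(' +', ' ', text.translate(TABLE)).strip()

print(num2digit_padded('CUR FR 44F8 Availability: 12,267.24 Debited'))
```

A translation table avoids scanning the string once per digit, which may matter on long inputs, though for short bank-statement lines either version should be fine.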