Extract US dollar amounts into separate columns
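I am trying to use a regular expression to extract US dollar amounts from a substring. Negative amounts carry a trailing "CR" to mark them as negative. The amounts are in a single-column CSV file whose column is titled "description". Here are some sample line items: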
description
Account Total: 26,458.16 7,476,744.04 7,484,287.03 7,542.99CR 18,915.17
Account Total: 27,218.61 7,719,293.26 7,740,051.63 20,758.37CR 6,460.24
Account Total: .00 7,634,750.07 39,055.35 7,595,694.72 7,595,694.72
Account Total: 64,249.00 .00 64,249.00 64,249.00CR .00
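The ideal result is a DataFrame with each amount in its own column, with these headers: 'Beg_bal', 'Total_cr', 'Total_db', 'Net_ch' and 'Ending_bal'.
I tried the following code, but the result is all NaN values: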
pat=r'^(?P<Beg_bal>$?(?:\d+,)*\d+\.\d+)\s+(?P<Total_cr>$?(?:\d+,)*\d+\.\d+)\s+(?P<Total_db>$?(?:\d+,)*\d+\.\d+)\s+(?P<Net_ch>$?(?:\d+,)*\d+\.\d+)\s+(?P<Ending_bal>$?(?:\d+,)*\d+\.\d+)'
df[['Beg_bal','Total_cr','Total_db','Net_ch','Ending_bal']]=df['description'].str.extract(pat)
Thanks in advance. As always, your help is much appreciated.
You can use str.split, drop the first 2 columns (they only contain "Account" and "Total:"), and rename the remaining columns as needed:
# Raw string so that \s is passed to the regex engine unchanged
df_ = df['description'].str.split(r'\s+', expand=True).iloc[:, 2:]
df_.columns = ['Beg_bal', 'Total_cr', 'Total_db', 'Net_ch', 'Ending_bal']
print(df_)
     Beg_bal      Total_cr      Total_db        Net_ch    Ending_bal
0  26,458.16  7,476,744.04  7,484,287.03    7,542.99CR     18,915.17
1  27,218.61  7,719,293.26  7,740,051.63   20,758.37CR      6,460.24
2        .00  7,634,750.07     39,055.35  7,595,694.72  7,595,694.72
3  64,249.00           .00     64,249.00   64,249.00CR           .00
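The split leaves every amount as a string, including the "CR" suffix. If numeric columns are wanted, a follow-up step along these lines could convert them. This is only a sketch: the helper name `to_signed` and the inline sample frame are made up for illustration, and it assumes "CR" is the only suffix that ever appears.

```python
import pandas as pd

# Sample rows copied from the question; in practice df comes from the CSV.
df = pd.DataFrame({"description": [
    "Account Total: 26,458.16 7,476,744.04 7,484,287.03 7,542.99CR 18,915.17",
    "Account Total: .00 7,634,750.07 39,055.35 7,595,694.72 7,595,694.72",
    "Account Total: 64,249.00 .00 64,249.00 64,249.00CR .00",
]})

cols = ["Beg_bal", "Total_cr", "Total_db", "Net_ch", "Ending_bal"]
parts = df["description"].str.split(r"\s+", expand=True).iloc[:, 2:]
parts.columns = cols


def to_signed(col: pd.Series) -> pd.Series:
    """Hypothetical helper: '7,542.99CR' -> -7542.99, '.00' -> 0.0."""
    is_credit = col.str.endswith("CR")
    magnitude = (
        col.str.replace("CR", "", regex=False)   # drop the credit marker
           .str.replace(",", "", regex=False)    # drop thousands separators
           .astype(float)
    )
    # Keep the value where it is not a credit, otherwise negate it.
    return magnitude.where(~is_credit, -magnitude)


amounts = parts.apply(to_signed)  # applied column by column
print(amounts)
```

Keeping the sign handling in one small function makes it easy to adjust if other markers (for example a "DR" suffix) turn up in the real data.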
You can do it like this:
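import pandas as pd

df = pd.read_csv('test.csv', sep='|')
# Split on runs of spaces, then drop the "Account" and "Total:" tokens (columns 0 and 1)
df = df['description'].str.split(r' +', expand=True).drop(columns=[0, 1])
df.columns = ['Beg_bal', 'Total_cr', 'Total_db', 'Net_ch', 'Ending_bal']
print(df)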
     Beg_bal      Total_cr      Total_db        Net_ch    Ending_bal
0  26,458.16  7,476,744.04  7,484,287.03    7,542.99CR     18,915.17
1  27,218.61  7,719,293.26  7,740,051.63   20,758.37CR      6,460.24
2        .00  7,634,750.07     39,055.35  7,595,694.72  7,595,694.72
3  64,249.00           .00     64,249.00   64,249.00CR           .00
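If you would rather keep the str.extract approach from the question, the pattern as posted has a few problems that would each explain the NaN rows: `^` anchors at the start of the string, but every row begins with "Account Total:"; `\d+\.\d+` requires at least one digit before the decimal point, so ".00" cannot match; nothing allows for the "CR" suffix; and a literal dollar sign has to be written `\$`. Below is a rough sketch of a pattern that addresses those points. The `amount` fragment and the sample frame are my own, and it assumes every row has exactly five amounts after the "Account Total:" prefix.

```python
import pandas as pd

# Sample rows copied from the question; in practice df comes from the CSV.
df = pd.DataFrame({"description": [
    "Account Total: 26,458.16 7,476,744.04 7,484,287.03 7,542.99CR 18,915.17",
    "Account Total: 27,218.61 7,719,293.26 7,740,051.63 20,758.37CR 6,460.24",
    "Account Total: .00 7,634,750.07 39,055.35 7,595,694.72 7,595,694.72",
    "Account Total: 64,249.00 .00 64,249.00 64,249.00CR .00",
]})

cols = ["Beg_bal", "Total_cr", "Total_db", "Net_ch", "Ending_bal"]

# One amount: optional "$", optional integer part with thousands separators,
# a mandatory decimal part, and an optional trailing "CR".
amount = r"\$?(?:\d{1,3}(?:,\d{3})*)?\.\d+(?:CR)?"

# Five named groups separated by whitespace, after the fixed prefix.
pat = (
    r"^Account Total:\s+"
    + r"\s+".join(f"(?P<{name}>{amount})" for name in cols)
    + r"\s*$"
)

extracted = df["description"].str.extract(pat)
print(extracted)
```

With named groups, str.extract names the resulting columns after the groups, so no separate rename is needed. Rows that do not fit the pattern still come back as NaN, which can be handy for spotting malformed lines.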