Python DataFrame income column cleanup
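This probably has a simple solution, but I'm struggling to get this function to work on my dataset.
I have a salary column that contains all kinds of values. Sample DataFrame below: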
ID  Income                            Desired output
1   26000                             26000
2   45K                               45000
3   -                                 NaN
4   0                                 NaN
5   N/A                               NaN
6   2000                              2000
7   30000 - 45000                     37500 ((30000 + 45000) / 2)
8   21000 per Annum                   21000
9   50000 per annum                   50000
10  21000 to 30000                    25500 ((21000 + 30000) / 2)
11                                    NaN
12  21000 To 50000                    35500 ((21000 + 50000) / 2)
13  43000/year                        43000
14                                    NaN
15  80000/Year                        80000
16  12.40 p/h                         12896 (12.40 x 20 x 52)
17  12.40 per hour                    12896 (12.40 x 20 x 52)
18  45000.0 (this is a float value)   45000
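@user34974 was very helpful in providing a workable solution (below). However, that solution raises an error for me because the DataFrame column also contains float values. Can anyone help adapt the function so that it handles float values in the column? The final output in the updated column should be float values.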
import numpy as np
import pandas as pd

# Note: the float 35000.0 in this list is part of the problem, since str.replace only accepts strings.
Normrep = ['N/A', 'per Annum', 'per annum', '/year', '/Year', 'p/h', 'per hour', 35000.0]

def clean_income(value):
    for i in Normrep:
        # Fails for float cells (no .replace method) and for the float entry in Normrep.
        value = value.replace(i, "")
    if len(value) == 0 or value.isspace() or value == '-':  # '-' cannot go into Normrep because it is also used in ranges
        return np.nan
    elif value == '0':
        return np.nan
    # By now nothing should follow a trailing K, so it can be expanded here
    if value.endswith('K'):
        value = value.replace('K', '000')
    # Ranges written with 'to', 'To' or '-'
    vals = value.split(' to ')
    if len(vals) != 2:
        vals = value.split(' To ')
    if len(vals) != 2:
        vals = value.split(' - ')
    if len(vals) == 2:
        return (float(vals[0]) + float(vals[1])) / 2
    try:
        a = float(value)
        return a
    except:
        return np.nan  # Either bad data or an input format that still needs handling

testData = ['26000', '45K', '-', '0', 'N/A', '2000', '30000 - 45000', '21000 per Annum', '', '21000 to 30000', '21000 To 50000', '43000/year', 35000.0]
df = pd.DataFrame(testData)
print(df)
df[0] = df[0].apply(lambda x: clean_income(x))
print(df)
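For anyone trying to reproduce the error, here is a minimal sketch that isolates the two places where a float trips up the function above. The names `list_error` and `cell_error` are only for illustration; the exact message text may differ between Python versions, so it is printed rather than assumed.

```python
import pandas as pd

# Failure 1: a non-string entry in the replacement list.
# str.replace only accepts strings as its arguments.
try:
    '26000'.replace(35000.0, "")
    list_error = None
except TypeError as exc:
    list_error = str(exc)

# Failure 2: a float cell reaching code that expects a str.
# Floats have no .replace method at all.
s = pd.Series(['26000', '45K', 35000.0])
try:
    s.apply(lambda v: v.replace('per annum', ''))
    cell_error = None
except AttributeError as exc:
    cell_error = str(exc)

print(list_error)
print(cell_error)
```

Either failure is enough to stop `apply`, so both the list entry and the float cells need dealing with.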
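To reiterate: if these are the only combinations that can appear in the data, then the code below handles them.
Any variation, however small, will need an edit to accommodate it. Let me explain what I did. For all the strings you want to replace with "", I created a list, Normrep, so if there are more strings to remove you can just add elements to it. 'K', 'p/h' and 'per hour' need special handling, because a conversion has to be done for them. So if those strings can vary in your data, that is where you will need to handle it.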
import pandas as pd
import numpy as np

# Substrings that carry no information and are simply removed.
Normrep = ['N/A', 'per Annum', 'per annum', '/year', '/Year']

def clean_income(value):
    # Numeric cells have no string methods, so return them directly (0 counts as missing).
    if isinstance(value, (int, float)):
        return float(value) if value != 0 else np.nan
    isHourConversionNeeded = False
    for i in Normrep:
        value = value.replace(i, "")
    if len(value) == 0 or value.isspace() or value == '-':  # '-' cannot go into Normrep because it is also used in ranges
        return np.nan
    elif value == '0':
        return np.nan
    # By now nothing should follow a trailing K, so it can be expanded here
    if value.endswith('K'):
        value = value.replace('K', '000')
    elif value.endswith('p/h') or value.endswith('per hour'):
        isHourConversionNeeded = True
        value = value.replace('p/h', "")
        value = value.replace('per hour', "")
    # Ranges written with 'to', 'To' or '-'
    vals = value.split(' to ')
    if len(vals) != 2:
        vals = value.split(' To ')
    if len(vals) != 2:
        vals = value.split(' - ')
    if len(vals) == 2:
        return (float(vals[0]) + float(vals[1])) / 2
    try:
        a = float(value)
        if isHourConversionNeeded:
            a = a * 20 * 52  # hourly rate x 20 hours/week x 52 weeks
        return a
    except ValueError:
        return np.nan  # Either bad data or an input format that still needs handling

testData = ['26000', '45K', '-', '0', 'N/A', '2000', '30000 - 45000', '21000 per Annum', '', '21000 to 30000',
            '21000 To 50000', '43000/year', 35000.0, '12.40 p/h', '12.40 per hour']
df = pd.DataFrame(testData)
print(df)
df[0] = df[0].apply(lambda x: clean_income(x))
print(df)
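Since every new variant means another edit to the function, one option is to keep the variants in lookup tables and lower-case the text first, so 'To'/'to' and 'Annum'/'annum' need only one entry each. The following is just a sketch under that assumption: the names `ANNUAL_SUFFIXES`, `HOURLY_SUFFIXES`, `MISSING` and `to_number` are made up for illustration, and the 20 hours x 52 weeks conversion is taken from the question's desired output.

```python
import re
import numpy as np
import pandas as pd

# Hypothetical lookup tables; extend them as new variants appear.
ANNUAL_SUFFIXES = ('per annum', '/year')
HOURLY_SUFFIXES = ('p/h', 'per hour')
MISSING = {'', '-', 'n/a', '0'}
HOURS_PER_WEEK, WEEKS_PER_YEAR = 20, 52  # from the question: 12.40 x 20 x 52

def to_number(token):
    """Convert '45k' or '21000' to a float; raises ValueError otherwise."""
    token = token.strip()
    if token.endswith('k'):
        return float(token[:-1]) * 1000
    return float(token)

def clean_income(value):
    # Numeric cells: pass through as float, treating 0 and NaN as missing.
    if isinstance(value, (int, float, np.number)):
        return np.nan if pd.isna(value) or value == 0 else float(value)
    text = str(value).strip().lower()
    if text in MISSING:
        return np.nan
    hourly = text.endswith(HOURLY_SUFFIXES)
    for suffix in ANNUAL_SUFFIXES + HOURLY_SUFFIXES:
        text = text.replace(suffix, '')
    # Ranges such as '21000 to 30000' or '30000 - 45000' are averaged.
    parts = re.split(r'\s+(?:to|-)\s+', text.strip())
    try:
        amount = sum(to_number(p) for p in parts) / len(parts)
    except ValueError:
        return np.nan  # unrecognised format
    if amount == 0:
        return np.nan
    return amount * HOURS_PER_WEEK * WEEKS_PER_YEAR if hourly else amount

df = pd.DataFrame({'Income': ['26000', '45K', '-', '30000 - 45000',
                              '21000 To 50000', '12.40 p/h', 35000.0]})
df['Income1'] = df['Income'].apply(clean_income)
print(df)
```

Ranges written without surrounding spaces (for example '30000-45000') are not split by this sketch and come back as NaN, so the pattern would need widening if your data contains them.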
Here is how I would do this without all the loops.
import pandas as pd
import numpy as np

c = ['ID', 'Income']
d = [
    [1, 26000],
    [2, '45K'],
    [3, '-'],
    [4, 0],
    [5, 'N/A'],
    [6, 2000],
    [7, '30000 - 45000'],
    [8, '21000 per Annum'],
    [9, '50000 per annum'],
    [10, '21000 to 30000'],
    [11, ''],
    [12, '21000 To 50000'],
    [13, '43000/year'],
    [14, ''],
    [15, '80000/Year'],
    [16, '12.40 p/h'],
    [17, '12.40 per hour'],
    [18, 45000.00]]

df = pd.DataFrame(d, columns=c)
# Work on a lower-cased string copy so 'To'/'to' and 'Annum'/'annum' match alike.
df['Income1'] = df['Income'].astype(str).str.lower()
# Whole-cell values that mean "no income" become NaN.
df['Income1'] = df['Income1'].replace({'n/a': np.nan, '': np.nan, '-': np.nan, '0': np.nan})
# Rewrite the remaining text as arithmetic expressions.
df['Income1'] = df['Income1'].replace({'k$': '000', 'to': '+', '-': '+', ' per annum': '', 'p/h': 'per hour', '/year': ''}, regex=True)
# Note: this uses 12 hours/week, whereas the question's desired output assumes 20.
df['Income1'] = df['Income1'].replace(' per hour', ' * 12 * 52', regex=True)
# Ranges now look like 'a + b'; wrap them so they evaluate to the mean.
is_range = df['Income1'].str.contains(r'\+', na=False)
df.loc[is_range, 'Income1'] = '(' + df['Income1'].astype(str) + ') / 2'
# eval runs each string as Python code, so only use this on data you trust.
df['Income1'] = df['Income1'].apply(lambda x: eval(x) if pd.notnull(x) else x)
# Show whole numbers while keeping NaN for missing values.
df['Income1'] = (df['Income1'].fillna(0)
                 .astype(int)
                 .astype(object)
                 .where(df['Income1'].notnull()))
print(df)
The output will be:
    ID  Income           Income1
0   1   26000            26000
1   2   45K              45000
2   3   -                NaN
3   4   0                NaN
4   5   N/A              NaN
5   6   2000             2000
6   7   30000 - 45000    37500
7   8   21000 per Annum  21000
8   9   50000 per annum  50000
9   10  21000 to 30000   25500
10  11                   NaN
11  12  21000 To 50000   35500
12  13  43000/year       43000
13  14                   NaN
14  15  80000/Year       80000
15  16  12.40 p/h        7737
16  17  12.40 per hour   7737
17  18  45000            45000
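A caveat on the approach above: `eval` executes whatever string it is given, which is risky if the column can contain untrusted text. Below is a rough sketch of a loop-free alternative that avoids `eval`: it extracts the numbers with a regex and averages them per row. This is not the answer's method, only one possible variation; the pattern and the variable names (`tokens`, `values`, `hourly`) are assumptions for illustration, and the hourly conversion uses the question's 20 x 52.

```python
import numpy as np
import pandas as pd

income = pd.Series([26000, '45K', '-', 0, 'N/A', '30000 - 45000',
                    '21000 per Annum', '21000 To 50000', '43000/year',
                    '', '12.40 p/h', 45000.0])

text = income.astype(str).str.lower().str.strip()
hourly = text.str.contains(r'p/h|per hour', regex=True)

# Pull out every number in each cell, with an optional trailing 'k'.
# Rows without any digits ('-', 'n/a', '') produce no matches at all.
tokens = text.str.extractall(r'(?P<num>\d+(?:\.\d+)?)(?P<k>k)?')
values = tokens['num'].astype(float) * np.where(tokens['k'].notna(), 1000, 1)

# Average the numbers found on each row: ranges have two, other rows one.
result = values.groupby(level=0).mean().reindex(income.index)

# Hourly rates are annualised; zero is treated as missing.
result = result.mask(hourly, result * 20 * 52)
result = result.replace(0, np.nan)

print(pd.DataFrame({'Income': income, 'Income1': result}))
```

Because a hyphen is never parsed as a minus sign here, negative amounts are not supported; for salary data that seemed an acceptable trade-off.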