Python dataframe income column cleanup

Question

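This probably has a simple solution, but I am struggling to get this function to work on my dataset.

I have a salary column containing many different kinds of values. Sample dataframe below: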

ID   Income                              Desired output
1    26000                               26000
2    45K                                 45000
3    -                                   NaN
4    0                                   NaN
5    N/A                                 NaN
6    2000                                2000
7    30000 - 45000                       37500 ((30000 + 45000) / 2)
8    21000 per Annum                     21000
9    50000 per annum                     50000
10   21000 to 30000                      25500 ((21000 + 30000) / 2)
11                                       NaN
12   21000 To 50000                      35500 ((21000 + 50000) / 2)
13   43000/year                          43000
14                                       NaN
15   80000/Year                          80000
16   12.40 p/h                           12896 (12.40 x 20 x 52)
17   12.40 per hour                      12896 (12.40 x 20 x 52)
18   45000.0 (this is a float value)     45000
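The two conversions in the table can be checked with a few lines of plain Python. This is only a sketch of the arithmetic; the variable names are illustrative, and the 20 hours per week and 52 weeks per year figures are taken from the table itself.

```python
# Midpoint of a salary range: the sum must be parenthesized before dividing.
range_midpoint = (30000 + 45000) / 2

# Hourly rate to annual salary, using 20 hours/week and 52 weeks/year
# (the figures given in the desired-output column).
hourly_to_annual = 12.40 * 20 * 52

print(range_midpoint)
print(round(hourly_to_annual))
```

Without the parentheses, 30000 + 45000 / 2 would evaluate to 52500 because division binds tighter than addition.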

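@user34974 was very helpful in providing a workable solution (below). However, it raises an error for me because the dataframe column also contains float values. Can anyone help make the function handle float values in the column? The output in the updated column should be float values.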
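For context, the failure most likely comes from calling a string method on a float. The snippet below is a minimal reproduction under that assumption, not the exact traceback from my data; `passthrough_numeric` is a hypothetical helper name used only for illustration.

```python
# str methods such as .replace() do not exist on float objects.
value = 45000.0
try:
    value.replace("per annum", "")
    error_message = None
except AttributeError as exc:
    error_message = str(exc)

print(error_message)


def passthrough_numeric(value):
    """Hypothetical guard: return numeric cells as float, None for anything else."""
    if isinstance(value, (int, float)):
        return float(value)
    return None
```

A guard of this kind at the top of the cleaning function would let numeric cells skip the string handling entirely.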

Normrep = ['N/A','per Annum','per annum','/year','/Year','p/h','per hour',35000.0]

def clean_income(value):
    for i in Normrep:
        value = value.replace(i,"")



    if len(value) == 0 or value.isspace() or value == '-':  # '-' cannot be added to Normrep because it is also used as the range separator
        return np.nan
    elif value == '0':
        return np.nan

    # now there should not be any extra letters with K hence can be done below step
    if value.endswith('K'):
        value = value.replace('K','000')
    
    # for to and -
    vals = value.split(' to ')
    if len(vals) != 2:
        vals = value.split(' To ')
        if len(vals) != 2:
            vals = value.split(' - ')

    if len(vals) == 2:
        return (float(vals[0]) + float(vals[1]))/2

    try:
        a = float(value)
        return a
    except:
        return np.nan    # Either the data is malformed or this input format is not handled yet.


testData = ['26000','45K','-','0','N/A','2000','30000 - 45000','21000 per Annum','','21000 to 30000','21000 To 50000','43000/year', 35000.0]


df = pd.DataFrame(testData)
print(df)

df[0] = df[0].apply(lambda x: clean_income(x))

print(df)

Answer 1

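To reiterate: the code below covers only the combinations that appear in your sample data.

Even a small variation will require editing the code to accommodate it. Here is what I did: every string that should simply be replaced with "" goes into the array Normrep, so to strip more strings you only need to add elements. 'K', 'p/h' and 'per hour' need special handling because each requires a conversion, so if those strings can vary in your data you will need to handle that there.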

import pandas as pd
import numpy as np

Normrep = ['N/A', 'per Annum', 'per annum', '/year', '/Year']


def clean_income(value):
    # Numeric cells (int or float) need no string parsing; 0 counts as missing.
    if isinstance(value, (int, float)):
        return float(value) if value != 0 else np.nan

    is_hour_conversion_needed = False

    # Strip the markers that carry no numeric information.
    for i in Normrep:
        value = value.replace(i, "")
    value = value.strip()

    # '-' cannot be added to Normrep because it is also used as the range separator.
    if len(value) == 0 or value == '-' or value == '0':
        return np.nan

    # After the cleanup above, a trailing K can only mean thousands.
    if value.endswith('K'):
        value = value.replace('K', '000')
    elif value.endswith('p/h') or value.endswith('per hour'):
        is_hour_conversion_needed = True
        value = value.replace('p/h', "").replace('per hour', "")

    # Ranges written with ' to ', ' To ' or ' - '.
    vals = value.split(' to ')
    if len(vals) != 2:
        vals = value.split(' To ')
        if len(vals) != 2:
            vals = value.split(' - ')

    try:
        if len(vals) == 2:
            return (float(vals[0]) + float(vals[1])) / 2
        a = float(value)
        if is_hour_conversion_needed:
            a = a * 20 * 52  # 20 hours/week x 52 weeks, as in the question
        return a
    except ValueError:
        return np.nan  # Either the data is malformed or this input format is not handled yet.


testData = ['26000', '45K', '-', '0', 'N/A', '2000', '30000 - 45000', '21000 per Annum', '', '21000 to 30000',
            '21000 To 50000', '43000/year', 35000.0,'12.40 p/h','12.40 per hour']
df = pd.DataFrame(testData)
print(df)

df[0] = df[0].apply(lambda x: clean_income(x))

print(df)
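As noted before the code, Normrep has to grow with every new spelling ('per Annum', 'per annum', and so on). One way to reduce that maintenance is to lower-case the text and pull the numbers out with a regular expression. The following is only a sketch of that idea, not the code above: `clean_income_regex`, `NUMBER` and `HOURS_PER_YEAR` are hypothetical names, and it assumes that a cell never contains more than two numbers and that two numbers joined by 'to' or ' - ' always form a range.

```python
import math
import re

HOURS_PER_YEAR = 20 * 52  # assumption from the question: 20 hours/week, 52 weeks

# A number with an optional trailing 'k' (thousands), e.g. '45k' or '12.40'.
NUMBER = re.compile(r"(\d+(?:\.\d+)?)(k\b)?")


def clean_income_regex(value):
    """Illustrative regex-based variant; returns a float or NaN."""
    # Numeric cells pass straight through; 0 counts as missing.
    if isinstance(value, (int, float)):
        return float(value) if value != 0 else math.nan

    text = str(value).strip().lower()
    numbers = [float(n) * (1000 if k else 1) for n, k in NUMBER.findall(text)]
    is_range = re.search(r"\bto\b|\s-\s", text) is not None

    if len(numbers) == 2 and is_range:
        result = sum(numbers) / 2          # midpoint of the range
    elif len(numbers) == 1:
        result = numbers[0]
    else:
        return math.nan                    # nothing usable in this cell

    if re.search(r"p/h|per hour", text):
        result *= HOURS_PER_YEAR           # hourly rate to annual salary
    return result if result != 0 else math.nan
```

It can be applied the same way, for example `df[0].apply(clean_income_regex)`. Because the text is lower-cased first, 'To' and 'to' or '/Year' and '/year' need no separate entries.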

Answer 2

Here is how I would do this without all the loops.

c = ['ID','Income']
d = [
[1, 26000],  
[2, '45K'],
[3, '-'],
[4, 0],  
[5, 'N/A'],     
[6, 2000],         
[7, '30000 - 45000'],
[8, '21000 per Annum'],
[9, '50000 per annum'],
[10, '21000 to 30000'],
[11, ''],
[12, '21000 To 50000'],
[13, '43000/year'],
[14, ''],
[15, '80000/Year'],
[16, '12.40 p/h'],
[17, '12.40 per hour'],
[18, 45000.00]]

import pandas as pd
import numpy as np
df = pd.DataFrame(d,columns=c)

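# Work on a lower-cased string copy so 'To'/'to' and 'Annum'/'annum' are treated alike.
df['Income1'] = df['Income'].astype(str).str.strip().str.lower()

# Placeholders and zeros mean "missing"; mask() turns them into NaN.
df['Income1'] = df['Income1'].mask(df['Income1'].isin(['n/a', '', '-', '0', '0.0']))

# Rewrite each remaining cell as an arithmetic expression.
df['Income1'] = df['Income1'].replace({r'k$': '000', r'\bto\b': '+', '-': '+', ' per annum': '', 'p/h': 'per hour', '/year': ''}, regex=True)

# Hourly rates: 20 hours per week x 52 weeks, as specified in the question.
df['Income1'] = df['Income1'].replace(' per hour', ' * 20 * 52', regex=True)

# A '+' marks a range; wrap it so the expression evaluates to the midpoint.
is_range = df['Income1'].str.contains(r'\+', na=False)
df.loc[is_range, 'Income1'] = '(' + df['Income1'] + ') / 2'

# eval() is acceptable here only because the strings come from trusted, already-normalized data.
df['Income1'] = df['Income1'].apply(lambda x: eval(x) if pd.notnull(x) else x)

# Show whole numbers while keeping NaN for missing values.
df['Income1'] = (df['Income1'].fillna(0)
                 .astype(int)
                 .astype(object)
                 .where(df['Income1'].notnull()))

print(df)

The output will be:

    ID           Income Income1
0    1            26000   26000
1    2              45K   45000
2    3                -     NaN
3    4                0     NaN
4    5              N/A     NaN
5    6             2000    2000
6    7    30000 - 45000   37500
7    8  21000 per Annum   21000
8    9  50000 per annum   50000
9   10   21000 to 30000   25500
10  11                      NaN
11  12   21000 To 50000   35500
12  13       43000/year   43000
13  14                      NaN
14  15       80000/Year   80000
15  16        12.40 p/h   12896
16  17   12.40 per hour   12896
17  18            45000   45000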
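If calling eval on cell contents is a concern, the same idea can be sketched with pandas string methods alone. This is an illustrative alternative rather than the approach above: it assumes that any cell containing two numbers is a range to be averaged, and it reuses the question's 20 hours x 52 weeks conversion. The names `income`, `found` and `cleaned` are hypothetical.

```python
import numpy as np
import pandas as pd

income = pd.Series([26000, '45K', '-', 0, 'N/A', '30000 - 45000', '21000 per Annum',
                    '21000 To 50000', '43000/year', '12.40 p/h', 45000.0, ''])

text = income.astype(str).str.lower()

# Extract every number in each cell, with an optional trailing 'k' for thousands.
found = text.str.extractall(r'(?P<num>\d+(?:\.\d+)?)(?P<k>k\b)?')
values = found['num'].astype(float) * np.where(found['k'].notna(), 1000, 1)

# Average the numbers per original row; rows with no number become NaN on reindex.
cleaned = values.groupby(level=0).mean().reindex(income.index)

# Convert hourly rates to annual salaries (20 hours/week x 52 weeks).
hourly = text.str.contains(r'p/h|per hour')
cleaned = cleaned.where(~hourly, cleaned * 20 * 52)

# A salary of 0 is treated as missing.
cleaned = cleaned.replace(0, np.nan)

print(cleaned)
```

The result is a float column, which matches the float output the question asks for.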

python

pandas

data-cleaning