Handling Nulls in Pandas: use filtered values in one column to fill nan in two other columns

Question

This is a clarification/restatement of a question I posted earlier. I'd like to know whether my solution is the simplest or most efficient option.

Q: A separate column containing some of the missing values

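I have a dataframe with three columns: df.location, lat-lon coordinates stored as comma-separated strings; df.target, a target variable holding integers between 1 and 5, currently formatted as floats; and df.null, a column that is mostly nan but also contains a mix of lat-lon coordinates and floats between 1 and 5.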

Here is an example df:

import pandas as pd
from numpy import nan

df = pd.DataFrame(
      {'target': {0: nan, 1: nan, 2: nan, 3: nan, 4: nan, 5: 4.0, 6: 5.0, 7: 4.0, 8: 4.0, 9: 4.0},
       'location': {0: nan, 1: nan, 2: nan, 3: nan, 4: nan, 5: '41.69230795,-72.26691314', 6: '41.70631764,-70.2868794', 7: '41.70687995,-70.28684036', 8: '41.70598417,-70.28671793', 9: '41.69220757,-70.26687248'},
       'null': {0: '41.70477575,-70.28844073', 1: '2', 2: '41.70637091,-70.28704334', 3: '4', 4: '3', 5: nan, 6: nan, 7: nan, 8: nan, 9: nan}
      }
)

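For every row where df.null holds a non-missing value, the values in df.target and df.location are both missing. (I have no idea how this came about, but I checked the raw JSON I read into the Pandas dataframe, and sure enough this null key pops up frequently when location and target are missing.) Here's a screenshot of a Seaborn heatmap from my Jupyter notebook to illustrate:

[screenshot of the Seaborn heatmap]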

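Is it safe to assume some or all of the missing values in df.location and df.target are in df.null? If so, how do I move these values into the appropriate column based on whether they are lat-lon strings or target floats?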
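One way to probe that assumption before moving anything is to count the rows that contradict it. The sketch below is only an illustration: the small frame is a shortened stand-in for the real data, and the names has_null_value, both_missing, conflicts and unrecoverable are made up for this example.

```python
import pandas as pd
from numpy import nan

# Shortened stand-in for the question's frame (illustrative values only)
df = pd.DataFrame({
    'target': [nan, nan, 4.0, 5.0],
    'location': [nan, nan, '41.69,-72.26', '41.70,-70.28'],
    'null': ['41.70,-70.28', '2', nan, nan],
})

has_null_value = df['null'].notna()
both_missing = df['target'].isna() & df['location'].isna()

# Rows where `null` holds a value although target or location is already set
conflicts = df[has_null_value & ~both_missing]

# Rows where all three columns are missing, so `null` cannot help
unrecoverable = df[~has_null_value & both_missing]

print(len(conflicts), len(unrecoverable))
```

If conflicts is empty, filling from df.null cannot overwrite anything. If unrecoverable is non-empty, some values are missing for good.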

A: Handling with fillna() and str.contains()

Here's my best answer so far; let me know what you think. Basically I just used fillna(value=df.null) to fill all the missing values in df.location and df.target:

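# Assign the result back instead of relying on inplace=True,
# which may not update df when called through attribute access
df['target'] = df['target'].fillna(value=df['null'])

df['location'] = df['location'].fillna(value=df['null'])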

Then I used regex to boolean filter df.target and df.location and set every value that doesn't belong in the column to np.nan:

import numpy as np

# Converting columns to type str so string methods work
df = df.astype(str)

# Using regex to change values that don't belong in column to NaN
regex = r','
df.loc[df['target'].str.contains(regex), 'target'] = np.nan

regex = r'^\d\.?0?$'
df.loc[df['location'].str.contains(regex), 'location'] = np.nan

# Returning `df.target` to float datatype (str is the correct
# datatype for `df.location`); astype returns a copy, so assign it back
df['target'] = df['target'].astype(float)
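To see which strings the second pattern actually catches, it can be tried on a few sample values with the standard re module (str.contains with a regex behaves like re.search). This is just a quick check; the samples list and the names target_like and matches are made up for illustration.

```python
import re

# One digit, optionally followed by "." and optionally by "0"
target_like = re.compile(r'^\d\.?0?$')

samples = ['2', '4.0', '41.70477575,-70.28844073', '12', 'nan']

# Map each sample to whether the pattern matches it
matches = {s: bool(target_like.search(s)) for s in samples}

for value, hit in matches.items():
    print(repr(value), hit)
```

Note that \d accepts any digit from 0 to 9, so a stray '0' or '9' would also be treated as a target. If the target really is limited to 1 through 5, a stricter pattern such as r'^[1-5](\.0)?$' might be a safer choice, though that is my assumption about the data.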

Is there a better way?

Edit: changed the fillna() cell code so that it works.

Answer 1

Is it safe to assume some or all of the missing values in df.location and df.target are in df.null?

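It depends on the initial data. If there is too much of it to check by hand, there is no way to know. You can inspect the dataframe after the conversion, but you still can't be certain.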

With this usage of fillna(value=) (thanks for that, I wasn't familiar with it), I found a quicker way to write it:

df = pd.DataFrame(
      {'target': {0: nan, 1: nan, 2: nan, 3: nan, 4: nan, 5: 4.0, 6: 5.0, 7: 4.0, 8: 4.0, 9: 4.0},
       'location': {0: nan, 1: nan, 2: nan, 3: nan, 4: nan, 5: '41.69230795,-72.26691314', 6: '41.70631764,-70.2868794', 7: '41.70687995,-70.28684036', 8: '41.70598417,-70.28671793', 9: '41.69220757,-70.26687248'},
       'null': {0: '41.70477575,-70.28844073', 1: '2', 2: '41.70637091,-70.28704334', 3: '4', 4: '3', 5: nan, 6: nan, 7: nan, 8: nan, 9: nan}
      }
).assign(
    target=lambda x: x.target.fillna(value=pd.to_numeric(x.null, errors='coerce')),
    location=lambda x: x.location.fillna(
        value=x.loc[pd.to_numeric(x.null, errors='coerce').isnull(), 'null']
    )
).drop('null', axis='columns')

The code above gives the following dataframe:

                   location  target
0  41.70477575,-70.28844073     NaN
1                       NaN     2.0
2  41.70637091,-70.28704334     NaN
3                       NaN     4.0
4                       NaN     3.0
5  41.69230795,-72.26691314     4.0
6   41.70631764,-70.2868794     5.0
7  41.70687995,-70.28684036     4.0
8  41.70598417,-70.28671793     4.0
9  41.69220757,-70.26687248     4.0
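The split relies on pd.to_numeric with errors='coerce': anything that cannot be parsed as a number becomes NaN, so the coordinate strings drop out of the numeric view and can be selected with isnull(). A minimal sketch on a standalone Series (the names null_col, as_number and not_numeric are only for illustration):

```python
import pandas as pd
from numpy import nan

null_col = pd.Series(['41.70477575,-70.28844073', '2', nan, '4'])

# Strings that are not numbers are coerced to NaN instead of raising
as_number = pd.to_numeric(null_col, errors='coerce')
print(as_number.tolist())

# Entries that did not parse: the coordinate string and the original NaN
not_numeric = null_col[as_number.isnull()]
print(not_numeric.index.tolist())
```

Because fillna(value=some_series) aligns on the index, passing not_numeric only fills the rows whose labels it contains and leaves the others as they were.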

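You can check that no value from null ended up in the wrong column by looking at:

values greater than 5 in target (if there are any, your assumption is wrong; if there are none, you still can't be sure :-))
the number of commas in the location column.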
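Both checks can be sketched roughly as follows. The clean frame is a hypothetical cut-down version of the result, and bad_target, comma_count and bad_location are names invented for this example.

```python
import pandas as pd
from numpy import nan

# Hypothetical cleaned frame, shaped like the result of the conversion
clean = pd.DataFrame({
    'location': ['41.70477575,-70.28844073', nan, '41.69230795,-72.26691314'],
    'target': [nan, 2.0, 4.0],
})

# Targets outside 1..5 (comparisons with NaN are False, so NaN rows are skipped)
bad_target = clean[(clean['target'] < 1) | (clean['target'] > 5)]

# Locations that do not contain exactly one comma
comma_count = clean['location'].str.count(',')
bad_location = clean[clean['location'].notna() & (comma_count != 1)]

print(len(bad_target), len(bad_location))
```

Empty results here only mean that nothing looks obviously misplaced. They do not prove the assumption holds.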

I'm keeping the old version below; it gives the same result.

Previous version

Here is the conversion without regex:

import pandas as pd
from numpy import nan

df = pd.DataFrame(
      {'target': {0: nan, 1: nan, 2: nan, 3: nan, 4: nan, 5: 4.0, 6: 5.0, 7: 4.0, 8: 4.0, 9: 4.0},
       'location': {0: nan, 1: nan, 2: nan, 3: nan, 4: nan, 5: '41.69230795,-72.26691314', 6: '41.70631764,-70.2868794', 7: '41.70687995,-70.28684036', 8: '41.70598417,-70.28671793', 9: '41.69220757,-70.26687248'},
       'null': {0: '41.70477575,-70.28844073', 1: '2', 2: '41.70637091,-70.28704334', 3: '4', 4: '3', 5: nan, 6: nan, 7: nan, 8: nan, 9: nan}
      }
).assign(
    # use the conversion to numeric of the null column in order to find values
    # going to target and to location
    new_target=lambda x: pd.to_numeric(x['null'], errors='coerce'),
    new_location=lambda x: x.loc[pd.to_numeric(x['null'], errors='coerce').isnull(), 'null'],
).assign(
    target_without_nan=lambda x: x.new_target.fillna(0.0),
    new_location=lambda x: x.new_location.fillna(''),
    target=lambda x: (x.target_without_nan + x.target.fillna(0.0)).loc[~(x.target.isnull() & x.new_target.isnull())],
    location=lambda x: x.location.fillna('').str.cat(x.new_location.astype(str)).replace('', nan)
).loc[:, ['location', 'target']]

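I use a sum-and-concatenate trick to replace the nan values of the initial columns: with nan swapped for 0.0 (or for an empty string), adding or concatenating the old and new columns keeps whichever side has a value. In the final target assignment I use .loc to drop the rows where both sides were nan, so those values stay nan instead of becoming 0.0.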
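Stripped down to plain Series, the trick looks roughly like this. It is a sketch with made-up values and names (old, new, old_s, new_s), and it only works because at most one side has a value in any given row.

```python
import pandas as pd
from numpy import nan

old = pd.Series([nan, 4.0, nan])
new = pd.Series([2.0, nan, nan])

# With NaN replaced by 0.0, the sum keeps whichever side has a value
summed = old.fillna(0.0) + new.fillna(0.0)

# Drop rows where both sides were NaN; when this is assigned back to a
# column, pandas realigns on the index and those rows become NaN again
kept = summed.loc[~(old.isnull() & new.isnull())]

old_s = pd.Series([nan, 'a,b', nan])
new_s = pd.Series(['c,d', nan, nan])

# Same idea for strings: concatenate with '' as the neutral element,
# then turn the leftover empty strings back into NaN
joined = old_s.fillna('').str.cat(new_s.fillna('')).replace('', nan)

print(kept.tolist(), joined.tolist())
```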
