Testing whether a Pandas dataframe cell contains a null value
I have a Pandas dataframe with two columns, each of which contains either lists of items or NaN values. An illustrative example can be generated with:
import numpy as np
import pandas as pd
df = pd.DataFrame({'colA':['ab','abc','de','def','ghi','jkl','mno','pqr','stw','stu'],
'colB':['abcd','bcde','defg','defh','ghijk','j','mnp','pq','stuw','sut'] })
df['colA'] = df['colA'].apply(lambda x: list(x))
df['colB'] = df['colB'].apply(lambda x: list(x))
df.at[3,'colB'] = np.nan
df.at[8,'colB'] = np.nan
... which looks like:
        colA             colB
0     [a, b]     [a, b, c, d]
1  [a, b, c]     [b, c, d, e]
2     [d, e]     [d, e, f, g]
3  [d, e, f]              NaN
4  [g, h, i]  [g, h, i, j, k]
5  [j, k, l]              [j]
6  [m, n, o]        [m, n, p]
7  [p, q, r]           [p, q]
8  [s, t, w]              NaN
9  [s, t, u]        [s, u, t]
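I want to perform various tasks on the pairs of lists (e.g. using NLTK's jaccard_distance() function), but only when colB does not contain NaN.
If there are no NaN values, the following command works fine: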
import nltk
df['jd'] = df.apply(lambda x: nltk.jaccard_distance(set(x['colA']),set(x['colB'])),axis = 1)
However, if colB contains a NaN, it produces the following error:
TypeError: ("'float' object is not iterable", 'occurred at index 3')
I tried using an if...else clause to run the function only on the rows where colB does not contain NaN:
df['jd'] = df.apply(lambda x: nltk.jaccard_distance(set(x['colA']),set(x['colB'])) if pd.notnull(x['colB']) else np.nan,axis = 1)
... but this produces the error:
ValueError: ('The truth value of an array with more than one element is ambiguous. Use a.any() or a.all()', 'occurred at index 0')
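I have also tried the .any() and .all() constructs suggested in the error message, to no avail.
It seems that passing a list to pd.notnull() causes confusion, because pd.notnull() wants to test each element of the list, whereas what I want is to test whether the entire content of the dataframe cell is NaN.
My question is: how can I determine whether a cell in a Pandas dataframe contains a NaN value, so that the lambda function is applied only to cells that do not contain NaN?
You can filter the rows to only those with non-missing values: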
f = lambda x: nltk.jaccard_distance(set(x['colA']),set(x['colB']))
m = df['colB'].notna()
df.loc[m, 'jd'] = df[m].apply(f,axis = 1)
print (df)
        colA             colB        jd
0     [a, b]     [a, b, c, d]  0.500000
1  [a, b, c]     [b, c, d, e]  0.600000
2     [d, e]     [d, e, f, g]  0.500000
3  [d, e, f]              NaN       NaN
4  [g, h, i]  [g, h, i, j, k]  0.400000
5  [j, k, l]              [j]  0.666667
6  [m, n, o]        [m, n, p]  0.500000
7  [p, q, r]           [p, q]  0.333333
8  [s, t, w]              NaN       NaN
9  [s, t, u]        [s, u, t]  0.000000
The reason the check fails on the lists is that missing values are tested element-wise:
df['jd'] = df.apply(lambda x: pd.notna(x['colB']), axis = 1)
print (df)
        colA             colB                              jd
0     [a, b]     [a, b, c, d]        [True, True, True, True]
1  [a, b, c]     [b, c, d, e]        [True, True, True, True]
2     [d, e]     [d, e, f, g]        [True, True, True, True]
3  [d, e, f]              NaN                           False
4  [g, h, i]  [g, h, i, j, k]  [True, True, True, True, True]
5  [j, k, l]              [j]                          [True]
6  [m, n, o]        [m, n, p]              [True, True, True]
7  [p, q, r]           [p, q]                    [True, True]
8  [s, t, w]              NaN                           False
9  [s, t, u]        [s, u, t]              [True, True, True]
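To make the element-wise behaviour concrete, here is a minimal sketch. The variable names cell_list and cell_nan are illustrative and not from the question. It suggests why the if...else version raises: pd.notna() on a list returns an array, and an array with more than one element has no single truth value.

```python
import numpy as np
import pandas as pd

cell_list = ['a', 'b', 'c']   # stands in for a cell that holds a list
cell_nan = np.nan             # stands in for a cell that holds a missing value

list_result = pd.notna(cell_list)   # array of booleans, one per list element
nan_result = pd.notna(cell_nan)     # a single boolean for the scalar NaN

print(list_result)
print(nan_result)

# An if/else on the array calls bool() on it, which raises for more than one element.
try:
    bool(list_result)
    raised = False
except ValueError:
    raised = True
print(raised)
```

So the condition in the lambda only behaves as intended when it is evaluated on a scalar, which is why filtering on df['colB'].notna() (one boolean per row) avoids the problem.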
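It occurred to me while writing the question that I could test whether the cell's content is a list, rather than whether it is NaN. D'oh! I used the following: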
df['jd'] = df.apply(lambda x: nltk.jaccard_distance(set(x['colA']),set(x['colB'])) if isinstance(x['colB'],list) else np.nan,axis = 1)
This works as required and produces the output:
        colA             colB        jd
0     [a, b]     [a, b, c, d]  0.500000
1  [a, b, c]     [b, c, d, e]  0.600000
2     [d, e]     [d, e, f, g]  0.500000
3  [d, e, f]              NaN       NaN
4  [g, h, i]  [g, h, i, j, k]  0.400000
5  [j, k, l]              [j]  0.666667
6  [m, n, o]        [m, n, p]  0.500000
7  [p, q, r]           [p, q]  0.333333
8  [s, t, w]              NaN       NaN
9  [s, t, u]        [s, u, t]  0.000000
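But jezrael's answer (pre-filtering the NaNs) is probably the most logical approach.
Nonetheless, I would still like to know whether there is a way to explicitly test for NaN.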
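One possible way to test a cell explicitly for NaN, offered as a sketch rather than a definitive answer, is to confirm first that the cell is a scalar and only then call pd.isna() on it. The helper names is_missing_cell and jaccard_distance below are hypothetical. The latter is a plain-Python stand-in used so the example runs without NLTK, and the three-row dataframe is a shortened version of the one in the question.

```python
import numpy as np
import pandas as pd

def is_missing_cell(value):
    """Return True only when the cell itself is a scalar missing value."""
    # is_scalar() is False for lists, so pd.isna() is never called on a list.
    return pd.api.types.is_scalar(value) and bool(pd.isna(value))

def jaccard_distance(a, b):
    """Plain-Python stand-in (assumption) for nltk.jaccard_distance."""
    a, b = set(a), set(b)
    union = a | b
    return (len(union) - len(a & b)) / len(union)

# Shortened version of the example dataframe: rows 0, 3 and 9 of the original.
df = pd.DataFrame({'colA': [list('ab'), list('def'), list('stu')],
                   'colB': [list('abcd'), np.nan, list('sut')]})

df['jd'] = df.apply(
    lambda row: np.nan if is_missing_cell(row['colB'])
    else jaccard_distance(row['colA'], row['colB']),
    axis=1)
print(df)
```

Compared with the isinstance(x['colB'], list) check, this tests for the missing value itself, so cells holding other iterable types (tuples or sets, say) would still be passed to the distance function.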