Filling missing values with a word from a list on a condition

Question

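I am trying to preprocess data, in particular to handle missing values. I have a list of words and two columns of text. If a word from the list appears in at least one of the two text columns, I fill the missing value with that word.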

import pandas as pd
a=['coffee', 'milk', 'sugar']
test=pd.DataFrame({'col':['missing', 'missing', 'missing'],
                   'text1': ['i drink tea', 'i drink coffee', 'i drink whiskey'],
                   'text2': ['i drink juice', 'i drink nothing', 'i drink milk']
                   })

So the dataframe looks like this. The "col" column contains "missing" as the result of applying fillna("missing"):

Out[19]: 
       col            text1            text2
0  missing      i drink tea    i drink juice
1  missing   i drink coffee  i drink nothing
2  missing  i drink whiskey     i drink milk

I came up with this code, which uses a loop:

for word in a:
    test.loc[(test["col"]=='missing') & ((test["text1"].str.count(word)>0) 
    | (test['text2'].str.count(word)>0)), "col"]=word

With 100,000 rows and 2,000 elements in the list "a", this takes about 870 seconds. Is there a solution that would make it faster on a huge dataframe? Thanks in advance.

Answer 1

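Some suggestions:

Why use .str.count instead of .str.contains? You only need to know whether the word occurs, not how many times.
Why the fillna('missing')? pd.isnull(test["col"]) will be faster than test["col"]=='missing'.
You can also add a test to see whether all the missing fields have already been filled, and stop early if so.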
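To make the suggestions above concrete, here is a minimal sketch on the question's sample data. It assumes "col" is left as real missing values (None) rather than the string "missing"; the names by_count, by_contains, still_missing and all_filled are made up for illustration.

```python
import pandas as pd

# Sample data from the question, but with real missing values in "col"
# instead of the placeholder string "missing" (an assumption of this sketch).
test = pd.DataFrame({
    "col": [None, None, None],
    "text1": ["i drink tea", "i drink coffee", "i drink whiskey"],
    "text2": ["i drink juice", "i drink nothing", "i drink milk"],
})

word = "coffee"

# Both expressions give the same boolean mask; .str.contains states the intent directly.
by_count = test["text1"].str.count(word) > 0
by_contains = test["text1"].str.contains(word)
same_mask = bool((by_count == by_contains).all())

# Missing rows are detected with pd.isnull instead of comparing to a string.
still_missing = pd.isnull(test["col"])

# Early-exit check: True once no missing value is left.
all_filled = not still_missing.any()
```

Whether pd.isnull is noticeably faster than the string comparison depends on the data, so it is worth timing both on the real dataframe.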

So this boils down to something like this:

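import pandas as pd

def fill_missing(original_df, column_name, replacements, inplace=True):
    # Work on the caller's frame, or on a copy when inplace=False.
    df = original_df if inplace else original_df.copy()
    for word in replacements:
        # Recompute the mask of still-missing rows on every pass.
        empty = pd.isnull(df[column_name])
        if not empty.any():
            # Nothing left to fill, so stop early.
            return df
        # Search only the rows that are still missing.
        contained = (df.loc[empty, "text1"].str.contains(word)
                     | df.loc[empty, "text2"].str.contains(word))
        df.loc[contained[contained].index, column_name] = word
    return df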
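As a usage sketch, the function above can be run on the question's sample data like this. The block repeats the function so that it runs on its own, and it assumes "col" holds None instead of the string "missing"; the variable name result is made up for illustration.

```python
import pandas as pd

def fill_missing(original_df, column_name, replacements, inplace=True):
    # Same logic as the function in the answer, repeated so this block is self-contained.
    df = original_df if inplace else original_df.copy()
    for word in replacements:
        empty = pd.isnull(df[column_name])
        if not empty.any():
            return df
        contained = (df.loc[empty, "text1"].str.contains(word)
                     | df.loc[empty, "text2"].str.contains(word))
        df.loc[contained[contained].index, column_name] = word
    return df

a = ["coffee", "milk", "sugar"]

# "col" starts out as real missing values (an assumption of this sketch).
test = pd.DataFrame({
    "col": [None, None, None],
    "text1": ["i drink tea", "i drink coffee", "i drink whiskey"],
    "text2": ["i drink juice", "i drink nothing", "i drink milk"],
})

# inplace=False leaves the original frame untouched and returns a filled copy.
result = fill_missing(test, "col", a, inplace=False)
print(result)
```

Row 0 stays missing because neither of its text columns contains a word from the list. As in the question's loop, an earlier word in the list takes priority over a later one when both match the same row.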

python

missing-data

pandas