R program dealing with missing values (similar to the apply function in Python)

Question

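I'm new to R and am currently trying to handle missing values. I have a dataset with several columns, and the 'Purchase' column contains missing values.

I want to impute each missing Purchase value with the mean Purchase of the row's 'Master_Category' group.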

(Python code)

# Flag the rows whose Purchase value is missing
miss_Purch_rows = dataset['Purchase'].isnull()

# Compute the mean Purchase within each group of the newly created Master_Product_Category column
categ_mean = dataset.groupby(['Master_Product_Category'])['Purchase'].mean()

# Impute mean Purchase value based on Master_Product_Category column
dataset.loc[miss_Purch_rows,'Purchase'] = dataset.loc[miss_Purch_rows,'Master_Product_Category'].apply(lambda x: categ_mean.loc[x])

I'm looking for similar code in R that imputes missing values with a mean computed per group of another column.

A sample of the dataset is as follows:

   User_ID Product_ID    Gender Age  Occupation   Marital_Status Master_Category Purchase
1  1000001  P00000142      F 0-17         10              0             345    13650
2  1000001  P00004842      F 0-17         10              0            3412    13645
3  1000001  P00025442      F 0-17         10              0             129    15416
4  1000001  P00051442      F 0-17         10              0            8170     9938
5  1000001  P00051842      F 0-17         10              0             480     2849
6  1000001  P00057542      F 0-17         10              0             345       NA
7  1000001  P00058142      F 0-17         10              0            3412    11051
8  1000001  P00058242      F 0-17         10              0            3412       NA
9  1000001  P00059442      F 0-17         10              0            6816    16622
10 1000001  P00064042      F 0-17         10              0            3412     8190

I tried:

with(dataset, sapply(X = Purchase, INDEX = Master_Category, FUN = mean, na.rm = TRUE))

But it doesn't seem to work.

Answer 1

This kind of per-group operation can usually be done easily with the tidyverse collection of packages.

First, we read in your sample data:

txt <- 'User_ID Product_ID    Gender Age  Occupation   Marital_Status Master_Category Purchase
1000001  P00000142      F 0-17         10              0             345    13650
1000001  P00004842      F 0-17         10              0            3412    13645
1000001  P00025442      F 0-17         10              0             129    15416
1000001  P00051442      F 0-17         10              0            8170     9938
1000001  P00051842      F 0-17         10              0             480     2849
1000001  P00057542      F 0-17         10              0             345       NA
1000001  P00058142      F 0-17         10              0            3412    11051
1000001  P00058242      F 0-17         10              0            3412       NA
1000001  P00059442      F 0-17         10              0            6816    16622
1000001  P00064042      F 0-17         10              0            3412     8190'

# Read Gender as text; by default read.table parses a column of "F" as logical FALSE
df <- read.table(text = txt, header = TRUE,
                 colClasses = c(Gender = "character"))

Then we group by "Master_Category" and use ifelse inside mutate to replace any NA value with its group mean:

library(tidyverse)

df.new <- df %>% 
  group_by(Master_Category) %>% 
  mutate(Purchase = ifelse(is.na(Purchase), mean(Purchase, na.rm = TRUE), Purchase))

The result (on R 4.0 and later, Product_ID and Age print as <chr> rather than <fct>):

   User_ID Product_ID Gender Age   Occupation Marital_Status Master_Category Purchase
     <int> <fct>      <chr>  <fct>      <int>          <int>           <int>    <dbl>
 1 1000001 P00000142  F      0-17          10              0             345    13650
 2 1000001 P00004842  F      0-17          10              0            3412    13645
 3 1000001 P00025442  F      0-17          10              0             129    15416
 4 1000001 P00051442  F      0-17          10              0            8170     9938
 5 1000001 P00051842  F      0-17          10              0             480     2849
 6 1000001 P00057542  F      0-17          10              0             345    13650
 7 1000001 P00058142  F      0-17          10              0            3412    11051
 8 1000001 P00058242  F      0-17          10              0            3412    10962
 9 1000001 P00059442  F      0-17          10              0            6816    16622
10 1000001 P00064042  F      0-17          10              0            3412     8190
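
For comparison, the same imputation can be sketched in base R, without tidyverse. The first part mirrors the pandas code from the question using tapply (INDEX is an argument of tapply, not of sapply, which is likely why the attempt in the question did not group anything). The second part does the same in one step with ave. The data frame below is a cut-down copy of the sample holding only the two relevant columns, and the names miss, imputed_lookup and imputed_ave are illustrative, not part of the original code.

```r
# Cut-down copy of the sample data: only the grouping column and Purchase
df <- data.frame(
  Master_Category = c(345, 3412, 129, 8170, 480, 345, 3412, 3412, 6816, 3412),
  Purchase        = c(13650, 13645, 15416, 9938, 2849, NA, 11051, NA, 16622, 8190)
)

# 1) Step-by-step translation of the pandas code
# Group means, named by category (like categ_mean in the Python version)
categ_mean <- tapply(df$Purchase, df$Master_Category, mean, na.rm = TRUE)

# Rows with a missing Purchase
miss <- is.na(df$Purchase)

# Look up each missing row's group mean by category name
imputed_lookup <- df$Purchase
imputed_lookup[miss] <- categ_mean[as.character(df$Master_Category[miss])]

# 2) The same thing in one step: ave() applies FUN within each group
imputed_ave <- ave(
  df$Purchase, df$Master_Category,
  FUN = function(x) ifelse(is.na(x), mean(x, na.rm = TRUE), x)
)

# Both approaches agree; rows 6 and 8 become 13650 and 10962
stopifnot(isTRUE(all.equal(imputed_lookup, imputed_ave)))

df$Purchase <- imputed_ave
```

One thing to watch with any of these variants: if every Purchase value in a category is NA, that category has no mean, and its rows come out as NaN instead of being filled.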
