R program dealing with missing values (Similar to apply function in Python)
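I am new to R and am currently trying to handle missing values.
Basically, I have a dataset with several columns, and the 'Purchase' column has missing values.
I want to impute each missing Purchase value with the mean Purchase of that row's 'Master_Category' group.
(Python code)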
# Flag the rows where Purchase is missing
miss_Purch_rows = dataset['Purchase'].isnull()
# Mean Purchase within each Master_Product_Category group
categ_mean = dataset.groupby(['Master_Product_Category'])['Purchase'].mean()
# Impute mean Purchase value based on Master_Product_Category column
dataset.loc[miss_Purch_rows,'Purchase'] = dataset.loc[miss_Purch_rows,'Master_Product_Category'].apply(lambda x: categ_mean.loc[x])
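For reference, here is a minimal, self-contained sketch of the same imputation in pandas, run on the sample rows shown further down. It is an illustration rather than the original code: it assumes the grouping column is named Master_Category (as in the sample data, not Master_Product_Category as in the snippet above), keeps only the two relevant columns, and swaps the apply/lambda lookup for groupby(...).transform("mean"), which should give the same values.

```python
import numpy as np
import pandas as pd

# Sample rows from the question; NaN marks the missing Purchase values.
# Assumption: the grouping column is called Master_Category, as in the sample data.
dataset = pd.DataFrame({
    "Master_Category": [345, 3412, 129, 8170, 480, 345, 3412, 3412, 6816, 3412],
    "Purchase": [13650, 13645, 15416, 9938, 2849, np.nan, 11051, np.nan, 16622, 8190],
})

# Group mean repeated for every row: transform("mean") skips NaN by default
# and returns a Series aligned with the original index.
group_mean = dataset.groupby("Master_Category")["Purchase"].transform("mean")

# Fill only the missing entries; observed values are left untouched.
dataset["Purchase"] = dataset["Purchase"].fillna(group_mean)
```

This is the behaviour the R code needs to reproduce: one mean per category, computed from the non-missing rows, written back only where Purchase is NA.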
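I am looking for similar code in R that imputes missing values with the mean computed within the groups of another column.
A sample of the dataset is shown below: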
User_ID Product_ID Gender Age Occupation Marital_Status Master_Category Purchase
1 1000001 P00000142 F 0-17 10 0 345 13650
2 1000001 P00004842 F 0-17 10 0 3412 13645
3 1000001 P00025442 F 0-17 10 0 129 15416
4 1000001 P00051442 F 0-17 10 0 8170 9938
5 1000001 P00051842 F 0-17 10 0 480 2849
6 1000001 P00057542 F 0-17 10 0 345 NA
7 1000001 P00058142 F 0-17 10 0 3412 11051
8 1000001 P00058242 F 0-17 10 0 3412 NA
9 1000001 P00059442 F 0-17 10 0 6816 16622
10 1000001 P00064042 F 0-17 10 0 3412 8190
I tried:
with(dataset, sapply(X = Purchase, INDEX = Master_Category, FUN = mean, na.rm = TRUE))
But it does not seem to work.
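This kind of per-group operation is usually easy to do with the tidyverse collection of packages. (Note that sapply has no INDEX argument; that argument belongs to tapply, so the attempt above never groups by Master_Category.)
First, we read in your sample data: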
txt <- 'User_ID Product_ID Gender Age Occupation Marital_Status Master_Category Purchase
1000001 P00000142 F 0-17 10 0 345 13650
1000001 P00004842 F 0-17 10 0 3412 13645
1000001 P00025442 F 0-17 10 0 129 15416
1000001 P00051442 F 0-17 10 0 8170 9938
1000001 P00051842 F 0-17 10 0 480 2849
1000001 P00057542 F 0-17 10 0 345 NA
1000001 P00058142 F 0-17 10 0 3412 11051
1000001 P00058242 F 0-17 10 0 3412 NA
1000001 P00059442 F 0-17 10 0 6816 16622
1000001 P00064042 F 0-17 10 0 3412 8190'
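# Note: a column containing only "F" is parsed as logical FALSE;
# pass colClasses = c(Gender = "character") to keep it as text.
df <- read.table(text = txt, header = TRUE)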
Then we group by "Master_Category" and use ifelse inside mutate to fill any NA values with the group mean:
library(tidyverse)
df.new <- df %>%
  group_by(Master_Category) %>%
  mutate(Purchase = ifelse(is.na(Purchase), mean(Purchase, na.rm = TRUE), Purchase))
User_ID Product_ID Gender Age Occupation Marital_Status Master_Category Purchase
<int> <fct> <lgl> <fct> <int> <int> <int> <dbl>
1 1000001 P00000142 FALSE 0-17 10 0 345 13650
2 1000001 P00004842 FALSE 0-17 10 0 3412 13645
3 1000001 P00025442 FALSE 0-17 10 0 129 15416
4 1000001 P00051442 FALSE 0-17 10 0 8170 9938
5 1000001 P00051842 FALSE 0-17 10 0 480 2849
6 1000001 P00057542 FALSE 0-17 10 0 345 13650
7 1000001 P00058142 FALSE 0-17 10 0 3412 11051
8 1000001 P00058242 FALSE 0-17 10 0 3412 10962
9 1000001 P00059442 FALSE 0-17 10 0 6816 16622
10 1000001 P00064042 FALSE 0-17 10 0 3412 8190
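The two filled values in rows 6 and 8 can be checked by hand. The short sketch below does the arithmetic in Python (the language of the question's snippet); the dictionary of observed values is simply copied from the sample data, and the names observed and fill_values are made up for this illustration.

```python
from statistics import mean

# Observed (non-missing) Purchase values for the two categories that
# contain an NA, copied from the sample data.
observed = {
    345: [13650],
    3412: [13645, 11051, 8190],
}

# Mean over the observed values only, which is what
# mean(Purchase, na.rm = TRUE) computes within each group in R.
fill_values = {category: mean(values) for category, values in observed.items()}
```

Category 345 has a single observed purchase, so its NA is replaced by that value, while category 3412 gets the average of its three observed purchases.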