如何使用 R 中的 gsub 将缺失值替换为变量的中值？

Question

我有一个从维基百科页面 table 的 html 文件中提取的数据框。我想用每个变量的中位数替换缺失值。

根据给出的提示，我知道我需要将 factor 类型转换为 numeric 值，而且我可能需要使用 as.numeric(gsub())。

renew$Hydro[grep('\s', renew$Hydro)]
as.numeric(gsub('', median(as.numeric(renew$Hydro)), renew$Hydro))
lapply(renew, function(x) as.numeric(gsub('', median(as.numeric(x)), x)))

我尝试使用 grep() 来表明 '\s' 是提取空格的模式，但空格实际上被排除在输出之外，只显示了数字。

当我尝试使用 as.numeric(gsub()) 时，输出如下所示：

[1] 5.415405e+13 5.475475e+13 5.475425e+07 5.475415e+13 5.400000e+01 5.400000e+01 5.435405e+16
[8] 5.425435e+13 5.400000e+01 5.415455e+16 5.445425e+16 5.415495e+13 5.400000e+01 5.400000e+01

它与看起来像的数据框完全不同：

[1] 1035.3   7782     72       7109                       30134.8  2351.2            15318

我希望输出看起来与原始数据框完全一样，但空格由列中位数填充。

编辑：这就是数据框开头的样子。它来自“https://en.wikipedia.org/wiki/List_of_countries_by_electricity_production_from_renewable_sources”。

> renew
                             Country    Hydro     Wind     Bio   Solar
1                        Afghanistan   1035.3      0.1            35.5
2                            Albania     7782                      1.9
3                            Algeria       72     19.4           339.1
4                             Angola     7109              155    18.3
5                           Anguilla                               2.4
6                Antigua and Barbuda                               5.5
7                          Argentina  30134.8    554.1  1820.4    14.5
8                            Armenia   2351.2      1.8             1.2
9                              Aruba             130.3     8.9     9.2
10                         Australia    15318    12199    3722    6209
11                           Austria    42919     5235    4603    1096
12                        Azerbaijan   1959.3     22.8   174.5    35.3
13                           Bahamas                               1.9
14                           Bahrain               1.2             8.3
15                        Bangladesh      946      5.1     7.7   224.3

Answer 1

由于您的数据框中有空格，因此列被转换为字符，并且使用 median 个字符列没有任何意义。我们可以先将空格替换为NA，将列转换为数字，然后将replace NAs替换为列的median。使用 dplyr 我们可以执行以下步骤。

library(dplyr)
renew[renew == ""] <- NA

renew %>%
   mutate_at(-1, as.numeric) %>% #-1 is to ignore Country column
   mutate_at(-1, ~ replace(., is.na(.), median(., na.rm = TRUE)))


#             Country   Hydro    Wind    Bio  Solar
#1        Afghanistan  1035.3     0.1  174.5   35.5
#2            Albania  7782.0    21.1  174.5    1.9
#3            Algeria    72.0    19.4  174.5  339.1
#4             Angola  7109.0    21.1  155.0   18.3
#5           Anguilla  4730.1    21.1  174.5    2.4
#6  AntiguaandBarbuda  4730.1    21.1  174.5    5.5
#7          Argentina 30134.8   554.1 1820.4   14.5
#8            Armenia  2351.2     1.8  174.5    1.2
#9              Aruba  4730.1   130.3    8.9    9.2
#10         Australia 15318.0 12199.0 3722.0 6209.0
#11           Austria 42919.0  5235.0 4603.0 1096.0
#12        Azerbaijan  1959.3    22.8  174.5   35.3
#13           Bahamas  4730.1    21.1  174.5    1.9
#14           Bahrain  4730.1     1.2  174.5    8.3
#15        Bangladesh   946.0     5.1    7.7  224.3

我们可以使用基数 R

做同样的事情

renew[renew == ""] <- NA
renew[-1] <- lapply(renew[-1], function(x) 
      as.numeric(replace(x, is.na(x), median(as.numeric(x), na.rm = TRUE))))

Answer 2

我们可以使用 zoo

中的 na.aggregate 以紧凑的方式完成此操作

library(dplyr)
library(hablar)
library(zoo)
renew %>%
    retype %>% # change the type of columns
    # replace missing value of numeric columns with median
     mutate_if(is.numeric, na.aggregate, FUN = median)
# A tibble: 15 x 5
#   Country              Hydro    Wind    Bio  Solar
#   <chr>                <dbl>   <dbl>  <dbl>  <dbl>
# 1 Afghanistan          1035.     0.1  174.    35.5
# 2 Albania              7782     21.1  174.     1.9
# 3 Algeria                72     19.4  174.   339. 
# 4 Angola               7109     21.1  155     18.3
# 5 Anguilla             4730.    21.1  174.     2.4
# 6 Antigua and Barbuda  4730.    21.1  174.     5.5
# 7 Argentina           30135.   554.  1820.    14.5
# 8 Armenia              2351.     1.8  174.     1.2
# 9 Aruba                4730.   130.     8.9    9.2
#10 Australia           15318  12199   3722   6209  
#11 Austria             42919   5235   4603   1096  
#12 Azerbaijan           1959.    22.8  174.    35.3
#13 Bahamas              4730.    21.1  174.     1.9
#14 Bahrain              4730.     1.2  174.     8.3
#15 Bangladesh            946      5.1    7.7  224.

数据

renew <- structure(list(Country = c("Afghanistan", "Albania", "Algeria", 
"Angola", "Anguilla", "Antigua and Barbuda", "Argentina", "Armenia", 
"Aruba", "Australia", "Austria", "Azerbaijan", "Bahamas", "Bahrain", 
"Bangladesh"), Hydro = c("1035.3", "7782", "72", "7109", "", 
"", "30134.8", "2351.2", "", "15318", "42919", "1959.3", "", 
"", "946"), Wind = c("0.1", "", "19.4", "", "", "", "554.1", 
"1.8", "130.3", "12199", "5235", "22.8", "", "1.2", "5.1"), Bio = c("", 
"", "", "155", "", "", "1820.4", "", "8.9", "3722", "4603", "174.5", 
"", "", "7.7"), Solar = c(35.5, 1.9, 339.1, 18.3, 2.4, 5.5, 14.5, 
1.2, 9.2, 6209, 1096, 35.3, 1.9, 8.3, 224.3)), row.names = c("1", 
"2", "3", "4", "5", "6", "7", "8", "9", "10", "11", "12", "13", 
"14", "15"), class = "data.frame")

Answer 3

我想指出，刚抓取后数据还不干净，因为 lapply(renew, function(x) grep(",", x)) 产生了一些东西。

首先用gsub清理它，以避免在将数据转换为数字时将这些值转换为NAs。这是一个一步解决方案，正确的 NAs 是自动创建的：

renew[-1] <- lapply(renew[-1], function(x) as.numeric(as.character(gsub(",", ".", x))))

之后你可以运行申请

# sapply(2:5, function(x) renew[[x]][is.na(renew[[x]])] <<- median(renew[[x]], na.rm=TRUE))

或者当然是对 @Ronak Shah 的 第二个基本 R 代码行的更短改编，这要好得多：

renew[-1] <- sapply(renew[-1], function(x) replace(x, is.na(x), median(x, na.rm=TRUE)))

结果

summary(renew)
#                      country        hydro                wind                bio              solar        
# Afghanistan        :  1   Min.   :      0.8   Min.   :     0.00   Min.   :    0.2   Min.   :    0.1  
# Albania            :  1   1st Qu.:    907.8   1st Qu.:    50.45   1st Qu.:  151.1   1st Qu.:    4.8  
# Algeria            :  1   Median :   2595.0   Median :   109.00   Median :  242.5   Median :   22.3  
# Angola             :  1   Mean   :  19989.3   Mean   :  4324.13   Mean   : 2136.3   Mean   : 1483.3  
# Anguilla           :  1   3rd Qu.:   7992.4   3rd Qu.:   293.55   3rd Qu.:  344.4   3rd Qu.:  124.5  
# Antigua and Barbuda:  1   Max.   :1193370.0   Max.   :242387.70   Max.   :69017.0   Max.   :67874.1  
# (Other)            :209

数据

library(rvest)
renew <- setNames(html_table(
  read_html(paste0("https://en.wikipedia.org/wiki/List_of_countries",
                   "_by_electricity_production_from_renewable_sources")),
  fill=TRUE, header=TRUE)[[1]][c(1, 6:9)], c("country", "hydro", "wind", "bio", "solar"))
renew$country <- factor(renew$country)

如何使用 R 中的 gsub 将缺失值替换为变量的中值？

How to replace the missing values with the median for the variable using gsub in R?

regex

r

gsub

数据