如何使用 R 中的 gsub 将缺失值替换为变量的中值?
How to replace the missing values with the median for the variable using gsub in R?
我有一个从维基百科页面 table 的 html 文件中提取的数据框。我想用每个变量的中位数替换缺失值。
根据给出的提示,我知道我需要将 factor
类型转换为 numeric
值,而且我可能需要使用 as.numeric(gsub())
。
renew$Hydro[grep('\s', renew$Hydro)]
as.numeric(gsub('', median(as.numeric(renew$Hydro)), renew$Hydro))
lapply(renew, function(x) as.numeric(gsub('', median(as.numeric(x)), x)))
我尝试使用 grep()
来表明 '\s'
是提取空格的模式,但空格实际上被排除在输出之外,只显示了数字。
当我尝试使用 as.numeric(gsub())
时,输出如下所示:
[1] 5.415405e+13 5.475475e+13 5.475425e+07 5.475415e+13 5.400000e+01 5.400000e+01 5.435405e+16
[8] 5.425435e+13 5.400000e+01 5.415455e+16 5.445425e+16 5.415495e+13 5.400000e+01 5.400000e+01
它与看起来像的数据框完全不同:
[1] 1035.3 7782 72 7109 30134.8 2351.2 15318
我希望输出看起来与原始数据框完全一样,但空格由列中位数填充。
编辑:
这就是数据框开头的样子。它来自“https://en.wikipedia.org/wiki/List_of_countries_by_electricity_production_from_renewable_sources”。
> renew
Country Hydro Wind Bio Solar
1 Afghanistan 1035.3 0.1 35.5
2 Albania 7782 1.9
3 Algeria 72 19.4 339.1
4 Angola 7109 155 18.3
5 Anguilla 2.4
6 Antigua and Barbuda 5.5
7 Argentina 30134.8 554.1 1820.4 14.5
8 Armenia 2351.2 1.8 1.2
9 Aruba 130.3 8.9 9.2
10 Australia 15318 12199 3722 6209
11 Austria 42919 5235 4603 1096
12 Azerbaijan 1959.3 22.8 174.5 35.3
13 Bahamas 1.9
14 Bahrain 1.2 8.3
15 Bangladesh 946 5.1 7.7 224.3
由于您的数据框中有空格,因此列被转换为字符,并且使用 median
个字符列没有任何意义。我们可以先将空格替换为NA
,将列转换为数字,然后将replace
NA
s替换为列的median
。使用 dplyr
我们可以执行以下步骤。
library(dplyr)
renew[renew == ""] <- NA
renew %>%
mutate_at(-1, as.numeric) %>% #-1 is to ignore Country column
mutate_at(-1, ~ replace(., is.na(.), median(., na.rm = TRUE)))
# Country Hydro Wind Bio Solar
#1 Afghanistan 1035.3 0.1 174.5 35.5
#2 Albania 7782.0 21.1 174.5 1.9
#3 Algeria 72.0 19.4 174.5 339.1
#4 Angola 7109.0 21.1 155.0 18.3
#5 Anguilla 4730.1 21.1 174.5 2.4
#6 AntiguaandBarbuda 4730.1 21.1 174.5 5.5
#7 Argentina 30134.8 554.1 1820.4 14.5
#8 Armenia 2351.2 1.8 174.5 1.2
#9 Aruba 4730.1 130.3 8.9 9.2
#10 Australia 15318.0 12199.0 3722.0 6209.0
#11 Austria 42919.0 5235.0 4603.0 1096.0
#12 Azerbaijan 1959.3 22.8 174.5 35.3
#13 Bahamas 4730.1 21.1 174.5 1.9
#14 Bahrain 4730.1 1.2 174.5 8.3
#15 Bangladesh 946.0 5.1 7.7 224.3
我们可以使用基数 R
做同样的事情
renew[renew == ""] <- NA
renew[-1] <- lapply(renew[-1], function(x)
as.numeric(replace(x, is.na(x), median(as.numeric(x), na.rm = TRUE))))
我们可以使用 zoo
中的 na.aggregate
以紧凑的方式完成此操作
library(dplyr)
library(hablar)
library(zoo)
renew %>%
retype %>% # change the type of columns
# replace missing value of numeric columns with median
mutate_if(is.numeric, na.aggregate, FUN = median)
# A tibble: 15 x 5
# Country Hydro Wind Bio Solar
# <chr> <dbl> <dbl> <dbl> <dbl>
# 1 Afghanistan 1035. 0.1 174. 35.5
# 2 Albania 7782 21.1 174. 1.9
# 3 Algeria 72 19.4 174. 339.
# 4 Angola 7109 21.1 155 18.3
# 5 Anguilla 4730. 21.1 174. 2.4
# 6 Antigua and Barbuda 4730. 21.1 174. 5.5
# 7 Argentina 30135. 554. 1820. 14.5
# 8 Armenia 2351. 1.8 174. 1.2
# 9 Aruba 4730. 130. 8.9 9.2
#10 Australia 15318 12199 3722 6209
#11 Austria 42919 5235 4603 1096
#12 Azerbaijan 1959. 22.8 174. 35.3
#13 Bahamas 4730. 21.1 174. 1.9
#14 Bahrain 4730. 1.2 174. 8.3
#15 Bangladesh 946 5.1 7.7 224.
数据
renew <- structure(list(Country = c("Afghanistan", "Albania", "Algeria",
"Angola", "Anguilla", "Antigua and Barbuda", "Argentina", "Armenia",
"Aruba", "Australia", "Austria", "Azerbaijan", "Bahamas", "Bahrain",
"Bangladesh"), Hydro = c("1035.3", "7782", "72", "7109", "",
"", "30134.8", "2351.2", "", "15318", "42919", "1959.3", "",
"", "946"), Wind = c("0.1", "", "19.4", "", "", "", "554.1",
"1.8", "130.3", "12199", "5235", "22.8", "", "1.2", "5.1"), Bio = c("",
"", "", "155", "", "", "1820.4", "", "8.9", "3722", "4603", "174.5",
"", "", "7.7"), Solar = c(35.5, 1.9, 339.1, 18.3, 2.4, 5.5, 14.5,
1.2, 9.2, 6209, 1096, 35.3, 1.9, 8.3, 224.3)), row.names = c("1",
"2", "3", "4", "5", "6", "7", "8", "9", "10", "11", "12", "13",
"14", "15"), class = "data.frame")
我想指出,刚抓取后数据还不干净,因为 lapply(renew, function(x) grep(",", x))
产生了一些东西。
首先用gsub
清理它,以避免在将数据转换为数字时将这些值转换为NA
s。这是一个一步解决方案,正确的 NA
s 是自动创建的:
renew[-1] <- lapply(renew[-1], function(x) as.numeric(as.character(gsub(",", ".", x))))
之后你可以运行申请
# sapply(2:5, function(x) renew[[x]][is.na(renew[[x]])] <<- median(renew[[x]], na.rm=TRUE))
或者当然是对 @Ronak Shah 的 第二个基本 R 代码行的更短改编,这要好得多:
renew[-1] <- sapply(renew[-1], function(x) replace(x, is.na(x), median(x, na.rm=TRUE)))
结果
summary(renew)
# country hydro wind bio solar
# Afghanistan : 1 Min. : 0.8 Min. : 0.00 Min. : 0.2 Min. : 0.1
# Albania : 1 1st Qu.: 907.8 1st Qu.: 50.45 1st Qu.: 151.1 1st Qu.: 4.8
# Algeria : 1 Median : 2595.0 Median : 109.00 Median : 242.5 Median : 22.3
# Angola : 1 Mean : 19989.3 Mean : 4324.13 Mean : 2136.3 Mean : 1483.3
# Anguilla : 1 3rd Qu.: 7992.4 3rd Qu.: 293.55 3rd Qu.: 344.4 3rd Qu.: 124.5
# Antigua and Barbuda: 1 Max. :1193370.0 Max. :242387.70 Max. :69017.0 Max. :67874.1
# (Other) :209
数据
library(rvest)
renew <- setNames(html_table(
read_html(paste0("https://en.wikipedia.org/wiki/List_of_countries",
"_by_electricity_production_from_renewable_sources")),
fill=TRUE, header=TRUE)[[1]][c(1, 6:9)], c("country", "hydro", "wind", "bio", "solar"))
renew$country <- factor(renew$country)
我有一个从维基百科页面 table 的 html 文件中提取的数据框。我想用每个变量的中位数替换缺失值。
根据给出的提示,我知道我需要将 factor
类型转换为 numeric
值,而且我可能需要使用 as.numeric(gsub())
。
renew$Hydro[grep('\s', renew$Hydro)]
as.numeric(gsub('', median(as.numeric(renew$Hydro)), renew$Hydro))
lapply(renew, function(x) as.numeric(gsub('', median(as.numeric(x)), x)))
我尝试使用 grep()
来表明 '\s'
是提取空格的模式,但空格实际上被排除在输出之外,只显示了数字。
当我尝试使用 as.numeric(gsub())
时,输出如下所示:
[1] 5.415405e+13 5.475475e+13 5.475425e+07 5.475415e+13 5.400000e+01 5.400000e+01 5.435405e+16
[8] 5.425435e+13 5.400000e+01 5.415455e+16 5.445425e+16 5.415495e+13 5.400000e+01 5.400000e+01
它与看起来像的数据框完全不同:
[1] 1035.3 7782 72 7109 30134.8 2351.2 15318
我希望输出看起来与原始数据框完全一样,但空格由列中位数填充。
编辑: 这就是数据框开头的样子。它来自“https://en.wikipedia.org/wiki/List_of_countries_by_electricity_production_from_renewable_sources”。
> renew
Country Hydro Wind Bio Solar
1 Afghanistan 1035.3 0.1 35.5
2 Albania 7782 1.9
3 Algeria 72 19.4 339.1
4 Angola 7109 155 18.3
5 Anguilla 2.4
6 Antigua and Barbuda 5.5
7 Argentina 30134.8 554.1 1820.4 14.5
8 Armenia 2351.2 1.8 1.2
9 Aruba 130.3 8.9 9.2
10 Australia 15318 12199 3722 6209
11 Austria 42919 5235 4603 1096
12 Azerbaijan 1959.3 22.8 174.5 35.3
13 Bahamas 1.9
14 Bahrain 1.2 8.3
15 Bangladesh 946 5.1 7.7 224.3
由于您的数据框中有空格,因此列被转换为字符,并且使用 median
个字符列没有任何意义。我们可以先将空格替换为NA
,将列转换为数字,然后将replace
NA
s替换为列的median
。使用 dplyr
我们可以执行以下步骤。
library(dplyr)
renew[renew == ""] <- NA
renew %>%
mutate_at(-1, as.numeric) %>% #-1 is to ignore Country column
mutate_at(-1, ~ replace(., is.na(.), median(., na.rm = TRUE)))
# Country Hydro Wind Bio Solar
#1 Afghanistan 1035.3 0.1 174.5 35.5
#2 Albania 7782.0 21.1 174.5 1.9
#3 Algeria 72.0 19.4 174.5 339.1
#4 Angola 7109.0 21.1 155.0 18.3
#5 Anguilla 4730.1 21.1 174.5 2.4
#6 AntiguaandBarbuda 4730.1 21.1 174.5 5.5
#7 Argentina 30134.8 554.1 1820.4 14.5
#8 Armenia 2351.2 1.8 174.5 1.2
#9 Aruba 4730.1 130.3 8.9 9.2
#10 Australia 15318.0 12199.0 3722.0 6209.0
#11 Austria 42919.0 5235.0 4603.0 1096.0
#12 Azerbaijan 1959.3 22.8 174.5 35.3
#13 Bahamas 4730.1 21.1 174.5 1.9
#14 Bahrain 4730.1 1.2 174.5 8.3
#15 Bangladesh 946.0 5.1 7.7 224.3
我们可以使用基数 R
做同样的事情renew[renew == ""] <- NA
renew[-1] <- lapply(renew[-1], function(x)
as.numeric(replace(x, is.na(x), median(as.numeric(x), na.rm = TRUE))))
我们可以使用 zoo
na.aggregate
以紧凑的方式完成此操作
library(dplyr)
library(hablar)
library(zoo)
renew %>%
retype %>% # change the type of columns
# replace missing value of numeric columns with median
mutate_if(is.numeric, na.aggregate, FUN = median)
# A tibble: 15 x 5
# Country Hydro Wind Bio Solar
# <chr> <dbl> <dbl> <dbl> <dbl>
# 1 Afghanistan 1035. 0.1 174. 35.5
# 2 Albania 7782 21.1 174. 1.9
# 3 Algeria 72 19.4 174. 339.
# 4 Angola 7109 21.1 155 18.3
# 5 Anguilla 4730. 21.1 174. 2.4
# 6 Antigua and Barbuda 4730. 21.1 174. 5.5
# 7 Argentina 30135. 554. 1820. 14.5
# 8 Armenia 2351. 1.8 174. 1.2
# 9 Aruba 4730. 130. 8.9 9.2
#10 Australia 15318 12199 3722 6209
#11 Austria 42919 5235 4603 1096
#12 Azerbaijan 1959. 22.8 174. 35.3
#13 Bahamas 4730. 21.1 174. 1.9
#14 Bahrain 4730. 1.2 174. 8.3
#15 Bangladesh 946 5.1 7.7 224.
数据
renew <- structure(list(Country = c("Afghanistan", "Albania", "Algeria",
"Angola", "Anguilla", "Antigua and Barbuda", "Argentina", "Armenia",
"Aruba", "Australia", "Austria", "Azerbaijan", "Bahamas", "Bahrain",
"Bangladesh"), Hydro = c("1035.3", "7782", "72", "7109", "",
"", "30134.8", "2351.2", "", "15318", "42919", "1959.3", "",
"", "946"), Wind = c("0.1", "", "19.4", "", "", "", "554.1",
"1.8", "130.3", "12199", "5235", "22.8", "", "1.2", "5.1"), Bio = c("",
"", "", "155", "", "", "1820.4", "", "8.9", "3722", "4603", "174.5",
"", "", "7.7"), Solar = c(35.5, 1.9, 339.1, 18.3, 2.4, 5.5, 14.5,
1.2, 9.2, 6209, 1096, 35.3, 1.9, 8.3, 224.3)), row.names = c("1",
"2", "3", "4", "5", "6", "7", "8", "9", "10", "11", "12", "13",
"14", "15"), class = "data.frame")
我想指出,刚抓取后数据还不干净,因为 lapply(renew, function(x) grep(",", x))
产生了一些东西。
首先用gsub
清理它,以避免在将数据转换为数字时将这些值转换为NA
s。这是一个一步解决方案,正确的 NA
s 是自动创建的:
renew[-1] <- lapply(renew[-1], function(x) as.numeric(as.character(gsub(",", ".", x))))
之后你可以运行申请
# sapply(2:5, function(x) renew[[x]][is.na(renew[[x]])] <<- median(renew[[x]], na.rm=TRUE))
或者当然是对 @Ronak Shah 的 第二个基本 R 代码行的更短改编,这要好得多:
renew[-1] <- sapply(renew[-1], function(x) replace(x, is.na(x), median(x, na.rm=TRUE)))
结果
summary(renew)
# country hydro wind bio solar
# Afghanistan : 1 Min. : 0.8 Min. : 0.00 Min. : 0.2 Min. : 0.1
# Albania : 1 1st Qu.: 907.8 1st Qu.: 50.45 1st Qu.: 151.1 1st Qu.: 4.8
# Algeria : 1 Median : 2595.0 Median : 109.00 Median : 242.5 Median : 22.3
# Angola : 1 Mean : 19989.3 Mean : 4324.13 Mean : 2136.3 Mean : 1483.3
# Anguilla : 1 3rd Qu.: 7992.4 3rd Qu.: 293.55 3rd Qu.: 344.4 3rd Qu.: 124.5
# Antigua and Barbuda: 1 Max. :1193370.0 Max. :242387.70 Max. :69017.0 Max. :67874.1
# (Other) :209
数据
library(rvest)
renew <- setNames(html_table(
read_html(paste0("https://en.wikipedia.org/wiki/List_of_countries",
"_by_electricity_production_from_renewable_sources")),
fill=TRUE, header=TRUE)[[1]][c(1, 6:9)], c("country", "hydro", "wind", "bio", "solar"))
renew$country <- factor(renew$country)