Replace values in one column based on another dataframe in R
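I have a data frame with more than 20k observations. One of the columns holds city names (df$city). There are more than 600 unique city names, and some of them are misspelled.
An example of my data frame: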
> df$city
[1] "BOSTN" "LOS ANGELOS" "NYC" "CHICAGOO"
[5] "SEATTLE" "BOSTON" "NEW YORK CITY"
I created a CSV file that lists every misspelled city name together with its correct spelling.
> head(city)
city city_incorrect
1 BOSTON BOSTN
2 LOS ANGELES LOS ANGELOS
3 NEW YORK CITY NYC
4 CHICAGO CHICAGOO
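Ideally, I would write code that replaces the values in df$city based on the "city.csv" file.
Note: I originally posted this question and someone suggested using merge. I don't think that is the most efficient way to solve my problem, because I would also have to include the 600 correctly spelled cities in my "city.csv" file. Otherwise I would need an extra step to combine two columns in the merged data frame. So I think it might be easier to replace the values in df$city based on "city.csv".
Edit:
Here is a more detailed view of my data frame: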
> df[1:5]
id owner city state
1 AAAAA BOSTN MA
2 BBBBB LOS ANGELOS CA
3 CCCCC NYC NY
4 DDDDD CHICAGOO IL
5 EEEEE BOSTON MA
6 FFFFF SEATTLE WA
7 GGGGG NEW YORK CITY NY
8 HHHHH LOS ANGELES CA
If I use merge or cbind, wouldn't it just create another column at the end of my data frame, like this?
> merge()
id owner city state city_correct
1 AAAAA BOSTN MA BOSTON
2 BBBBB LOS ANGELOS CA LOS ANGELES
3 CCCCC NYC NY NEW YORK CITY
4 DDDDD CHICAGOO IL CHICAGO
5 EEEEE BOSTON MA
6 FFFFF SEATTLE WA
7 GGGGG NEW YORK CITY NY
8 HHHHH LOS ANGELES CA
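So the misspelled cities would be corrected, but the correctly spelled cities would be left blank. What I want in the end is a single column containing all the corrected city names.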
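One approach with base::merge() is to include rows for the already-correct city values in the lookup table, and then merge that table with the original data. We'll name the column of correct cities correctedCity and merge as follows: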
cityText <- "id,owner,city,state
1,AAAAA,BOSTN,MA
2,BBBBB,LOS ANGELOS,CA
3,CCCCC,NYC,NY
4,DDDDD,CHICAGOO,IL
5,EEEEE,BOSTON,MA
6,FFFFF,SEATTLE,WA
7,GGGGG,NEW YORK CITY,NY
8,HHHHH,LOS ANGELES,CA"
cities <- read.csv(text = cityText, header = TRUE, stringsAsFactors = FALSE)
# first, find all the distinct versions of city
library(sqldf)
distinctCities <- sqldf("select city, count(*) as count from cities group by city")
# create lookup table, and include rows for items that are already correct
tableText <- "city,correctedCity
BOSTN,BOSTON
BOSTON,BOSTON
CHICAGOO,CHICAGO
LOS ANGELES,LOS ANGELES
LOS ANGELOS,LOS ANGELES
NEW YORK CITY,NEW YORK CITY
NYC,NEW YORK CITY
SEATTLE,SEATTLE"
cityTable <- read.csv(text = tableText, header = TRUE, stringsAsFactors = FALSE)
corrected <- merge(cities, cityTable, by = "city")
corrected
...and the output:
> corrected
city id owner state correctedCity
1 BOSTN 1 AAAAA MA BOSTON
2 BOSTON 5 EEEEE MA BOSTON
3 CHICAGOO 4 DDDDD IL CHICAGO
4 LOS ANGELES 8 HHHHH CA LOS ANGELES
5 LOS ANGELOS 2 BBBBB CA LOS ANGELES
6 NEW YORK CITY 7 GGGGG NY NEW YORK CITY
7 NYC 3 CCCCC NY NEW YORK CITY
8 SEATTLE 6 FFFFF WA SEATTLE
>
At this point, the original values can be dropped and the corrected version kept.
# rename & keep corrected version
library(dplyr)
corrected %>% select(-city) %>% rename(city = correctedCity)
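One caveat with this approach: merge() performs an inner join by default, so any row whose city has no entry in the lookup table is dropped from the result without a warning. A minimal sketch of a pre-merge check follows. It uses a small made-up sample rather than the data above, and the names sampleCities, sampleLookup and missingCities are hypothetical.

```r
# Hypothetical sample: SEATTLE is deliberately missing from the lookup table
sampleCities <- data.frame(
  id = 1:3,
  city = c("BOSTN", "BOSTON", "SEATTLE"),
  stringsAsFactors = FALSE
)
sampleLookup <- data.frame(
  city = c("BOSTN", "BOSTON"),
  correctedCity = c("BOSTON", "BOSTON"),
  stringsAsFactors = FALSE
)

# Cities present in the data but absent from the lookup table
missingCities <- setdiff(unique(sampleCities$city), sampleLookup$city)
missingCities

# The default (inner) merge keeps only rows with a match in the lookup table
innerRows <- nrow(merge(sampleCities, sampleLookup, by = "city"))
innerRows
```

If missingCities is not empty, either add those cities to the lookup table or switch to a merge that keeps unmatched rows.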
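An alternative, as noted in the OP's comments, is to create a lookup table that contains rows only for the misspelled city names. In that case, we use the argument all.x = TRUE in merge() to keep all rows from the main data frame, and assign the non-missing values of correctedCity to city.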
tableText <- "city,correctedCity
BOSTN,BOSTON
CHICAGOO,CHICAGO
LOS ANGELOS,LOS ANGELES
NYC,NEW YORK CITY"
cityTable <- read.csv(text = tableText, header = TRUE, stringsAsFactors = FALSE)
corrected <- merge(cities, cityTable, by = "city", all.x = TRUE)
corrected$city[!is.na(corrected$correctedCity)] <- corrected$correctedCity[!is.na(corrected$correctedCity)]
corrected
...and the output:
> corrected
city id owner state correctedCity
1 BOSTON 1 AAAAA MA BOSTON
2 BOSTON 5 EEEEE MA <NA>
3 CHICAGO 4 DDDDD IL CHICAGO
4 LOS ANGELES 8 HHHHH CA <NA>
5 LOS ANGELES 2 BBBBB CA LOS ANGELES
6 NEW YORK CITY 7 GGGGG NY <NA>
7 NEW YORK CITY 3 CCCCC NY NEW YORK CITY
8 SEATTLE 6 FFFFF WA <NA>
>
At this point, correctedCity can be dropped from the data frame.
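Because merge() sorts its result by the key column and moves that column to the front, the cleaned data no longer matches the layout of the original. The following is a rough sketch of the clean-up step, assuming an id column that records the original row order. The sample data and the names sampleCities, sampleLookup, hasFix and tidied are made up for illustration.

```r
# Hypothetical sample data, mirroring the structure used above
sampleCities <- data.frame(
  id = 1:4,
  city = c("NYC", "BOSTN", "SEATTLE", "BOSTON"),
  stringsAsFactors = FALSE
)
sampleLookup <- data.frame(
  city = c("BOSTN", "NYC"),
  correctedCity = c("BOSTON", "NEW YORK CITY"),
  stringsAsFactors = FALSE
)

merged <- merge(sampleCities, sampleLookup, by = "city", all.x = TRUE)

# Overwrite city only where a correction exists
hasFix <- !is.na(merged$correctedCity)
merged$city[hasFix] <- merged$correctedCity[hasFix]

# Drop the helper column and restore the original row and column order
merged$correctedCity <- NULL
tidied <- merged[order(merged$id), names(sampleCities)]
rownames(tidied) <- NULL
tidied
```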
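It seems to me that what you are trying to do is match the incorrect city names in one data frame and replace them with the correct city names from another. If that is right, then this dplyr solution should work.
Data:
A data frame of correct and incorrect city name pairs: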
city <- data.frame(
city_correct = c("BOSTON", "LOS ANGELES", "NEW YORK CITY", "CHICAGO"),
city_incorrect = c("BOSTN", "LOS ANGELOS", "NYC", "CHICAGOO"), stringsAsFactors = F)
A data frame with a mix of correct and incorrect city names:
set.seed(123)
df <- data.frame(town = sample(c("BOSTON", "LOS ANGELES", "NEW YORK CITY", "CHICAGO","BOSTN",
"LOS ANGELOS", "NYC", "CHICAGOO"), 20, replace = T), stringsAsFactors = F)
Solution:
library(dplyr)
df <- left_join(df, city, by = c("town" = "city_incorrect"))
df$town_correct<-ifelse(is.na(df$city_correct), df$town, df$city_correct)
df$city_correct <- NULL
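The same join-then-fallback idea can probably be written as a single pipeline with dplyr's coalesce(), which returns the first non-missing value element by element. This is a sketch under assumptions rather than part of the original answer: the sample data and the names cityPairs, towns and fixed are hypothetical, and SEATTLE is included to show that names absent from the lookup pass through unchanged.

```r
library(dplyr)

# Hypothetical sample data with the same column names as above
cityPairs <- data.frame(
  city_correct = c("BOSTON", "CHICAGO"),
  city_incorrect = c("BOSTN", "CHICAGOO"),
  stringsAsFactors = FALSE
)
towns <- data.frame(
  town = c("BOSTN", "BOSTON", "SEATTLE", "CHICAGOO"),
  stringsAsFactors = FALSE
)

fixed <- towns %>%
  left_join(cityPairs, by = c("town" = "city_incorrect")) %>%
  # Use the correction when the join found one, otherwise keep the original
  mutate(town_correct = coalesce(city_correct, town)) %>%
  select(-city_correct)
fixed
```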
Edit:
Alternatively, a base R solution looks like this:
# Look up a correction for known misspellings; keep every other value as it is
df$town_correct <- ifelse(df$town %in% city$city_incorrect,
city$city_correct[match(df$town, city$city_incorrect)],
df$town)
Result:
df
town town_correct
1 NEW YORK CITY NEW YORK CITY
2 NYC NEW YORK CITY
3 CHICAGO CHICAGO
4 CHICAGOO CHICAGO
5 CHICAGOO CHICAGO
6 BOSTON BOSTON
7 BOSTN BOSTON
8 CHICAGOO CHICAGO
9 BOSTN BOSTON
10 CHICAGO CHICAGO
11 CHICAGOO CHICAGO
12 CHICAGO CHICAGO
13 LOS ANGELOS LOS ANGELES
14 BOSTN BOSTON
15 BOSTON BOSTON
16 CHICAGOO CHICAGO
17 LOS ANGELES LOS ANGELES
18 BOSTON BOSTON
19 NEW YORK CITY NEW YORK CITY
20 CHICAGOO CHICAGO
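For repeated use, the base R lookup can be wrapped in a small function built on a named vector. This is only a sketch: fix_cities and its arguments are hypothetical names, and it assumes each misspelling maps to exactly one correct name. Values that are not known misspellings, such as SEATTLE, are returned unchanged.

```r
# Hypothetical helper: replace known misspellings, leave everything else untouched
fix_cities <- function(x, incorrect, correct) {
  lookup <- setNames(correct, incorrect)  # names are the misspellings
  fixed <- unname(lookup[x])              # NA where x is not a known misspelling
  ifelse(is.na(fixed), x, fixed)
}

result <- fix_cities(
  c("BOSTN", "SEATTLE", "NYC", "BOSTON"),
  incorrect = c("BOSTN", "LOS ANGELOS", "NYC", "CHICAGOO"),
  correct   = c("BOSTON", "LOS ANGELES", "NEW YORK CITY", "CHICAGO")
)
result
```

With the question's data, the call would look like df$city <- fix_cities(df$city, city$city_incorrect, city$city), which overwrites the column in place without adding a helper column.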