dplyr: how to group observations based on codes, count and create a summary variable, then add a new variable based on names within the groups
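I have multiple addresses that I want to group together and tally, but they are formatted inconsistently. I have geocoded the addresses and plan to group them by their geocodes. When grouping, I would like to create a new variable that keeps at least one version of the address (or several variables in wide format, one per address in the group, although I am happy with a single variable per group that keeps one address).
Here is some example data.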
address=c("big fake plaza, 12 this street,district, city",
"Green mansion, district, city",
"Block 7 of orange building district, city",
"98 main street block a blue plaza, city",
"blue red mansion, 46 pearl street, city",
"12 this street, big fake plaza, district, city",
"Green mansion, district, city",
"orange building Block 7 district, city",
"block a 98 main street blue plaza, city",
"blue red mansion, 46 pearl street, city",
"big fake plaza, district, city",
"Green mansion,city")
long =c("112.8838", "111.9154", "114.9318", "116.9318", "112.9320","111.9324",
"112.8838", "111.9154", "114.9318", "116.9318", "112.9320","111.9324",
"112.8838", "111.9154")
lat = c("21.22177", "12.22177", "26.27743", "23.17651", "23.24769", "23.24771",
"21.22177", "12.22177", "26.27743", "23.17651", "23.24769", "23.24771",
"21.22177", "12.22177")
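# cbind() returns a character matrix, which dplyr cannot group; convert it to a data frame.
# Note: address has 12 entries while lat and long have 14, so cbind() recycles address with a warning.
df <- as.data.frame(cbind(address, lat, long))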
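What I want to do is group and count, but I do not know how to mutate and create a named variable based on one of the addresses in each group.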
df_agg<- df %>%
group_by(long,lat) %>%
summarise(count = n()) %>%
mutate(bldg = ifelse(address[address==1],address, NA )) ???????
I would like it to look like this:
long lat count bldg
<dbl> <dbl> <int> <chr>
1 112. 21.2 3 "big fake plaza, 12 this street,district, city"
2 114. 12.2 3 "Green mansion, district, city"
3 116. 26.3 2 "98 main street block a blue plaza, city"
4 112. 23.5 2 "Block 7 of orange building district, city"
5 111. 23.5 2 "blue red mansion, 46 pearl street, city"
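Obviously we cannot group by address name, because the strings differ. If there is a better option, I am happy to hear any other suggestions. Creating new variables bldg1, bldg2, etc. for each distinct building name in each group would be nice, but it is not a priority.
Thanks in advance.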
You can select the first address at each location.
library(dplyr)
library(tidyr)
df %>%
  group_by(long, lat) %>%
  summarise(count = n(),
            address = first(address)) %>%
  ungroup()
# long lat count address
# <chr> <chr> <int> <chr>
#1 111.9154 12.22177 3 Green mansion, district, city
#2 111.9324 23.24771 2 12 this street, big fake plaza, district, city
#3 112.8838 21.22177 3 big fake plaza, 12 this street,district, city
#4 112.9320 23.24769 2 blue red mansion, 46 pearl street, city
#5 114.9318 26.27743 2 Block 7 of orange building district, city
#6 116.9318 23.17651 2 98 main street block a blue plaza, city
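The desired output in the question shows the coordinates as `<dbl>` and the address column named `bldg`, whereas the result above keeps them as `<chr>`. A minimal sketch of one way to get closer to that shape, assuming the coordinate strings parse cleanly as numbers; the `toy` data frame and the `agg` name are made up for illustration:

```r
library(dplyr)

# Hypothetical toy data: coordinates stored as character, as in the question
toy <- tibble(
  address = c("Green mansion, district, city",
              "Green mansion,city",
              "blue red mansion, 46 pearl street, city"),
  long = c("111.9154", "111.9154", "112.9320"),
  lat  = c("12.22177", "12.22177", "23.24769")
)

agg <- toy %>%
  mutate(across(c(long, lat), as.numeric)) %>%  # character -> double
  group_by(long, lat) %>%
  summarise(count = n(),                        # observations per location
            bldg  = first(address),             # keep the first address seen
            .groups = "drop")                   # return an ungrouped tibble

agg
```

Here `.groups = "drop"` does the same job as the trailing `ungroup()` in the code above.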
If you want to create separate columns such as bldg1, bldg2, etc., convert the data to wide format.
df %>%
  group_by(long, lat) %>%
  mutate(row = paste0('bldg', row_number()),
         count = n()) %>%
  ungroup() %>%
  pivot_wider(names_from = row, values_from = address)
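The question asks for one column per distinct building name in each group, while the wide version above creates a column for every row, repeated addresses included. A possible variation, sketched on made-up data (`toy` and `wide` are hypothetical names), is to count first and then drop repeated addresses with `distinct()` before numbering the rows:

```r
library(dplyr)
library(tidyr)

# Hypothetical toy data: two locations, the first with a repeated address
toy <- tibble(
  long    = c("112.8838", "112.8838", "112.8838", "111.9154", "111.9154"),
  lat     = c("21.22177", "21.22177", "21.22177", "12.22177", "12.22177"),
  address = c("big fake plaza, 12 this street,district, city",
              "12 this street, big fake plaza, district, city",
              "big fake plaza, 12 this street,district, city",
              "Green mansion, district, city",
              "Green mansion,city")
)

wide <- toy %>%
  group_by(long, lat) %>%
  mutate(count = n()) %>%                       # count every observation, repeats included
  distinct(address, .keep_all = TRUE) %>%       # keep one row per distinct address in each group
  mutate(row = paste0("bldg", row_number())) %>%
  ungroup() %>%
  pivot_wider(names_from = row, values_from = address)

wide
```

This only removes exact duplicates; addresses that differ in word order or punctuation still end up in separate columns.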