R: remove duplicated values based on a column and replace other column values with the mean of the duplicated rows
I am working with a data.frame that has six environmental variables of interest, georeferenced by location. The problem I have is that some locations are duplicated, but all of the environmental measurements are unique.
Unfortunately, the modelling I want to do with these data will not work if there are duplicate locations, but I don't want to throw data away arbitrarily by keeping only one row from each duplicated set.
So I am looking for a way to take the mean of each of the six variables for every set of duplicates, and then attribute that mean to each variable at that location, preserving the information from the repeated measurements.
I have tried this in a number of ways, but none of them seem quite right!
The data I am working with can be downloaded here:
(https://www.dropbox.com/sh/xnwp3zz5abnilyo/AABRVJZ0kTmWk0T9Fcp4-bVSa?dl=0/)
This is how I have tried it:
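To make the goal concrete, here is a tiny made-up example (toy numbers only, not my real data): two rows share a location, so they should collapse into a single row whose values are the means of the originals.
# Toy illustration only: x/y stand in for the coordinates, BioArea/BioVolume
# for two of the environmental variables (values are made up).
toy <- data.frame(x = c(1, 1, 2),
                  y = c(5, 5, 6),
                  BioArea = c(2, 4, 10),
                  BioVolume = c(0.2, 0.4, 1.0))
aggregate(cbind(BioArea, BioVolume) ~ x + y, data = toy, FUN = mean)
#   x y BioArea BioVolume
# 1 1 5       3       0.3
# 2 2 6      10       1.0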
library(rgdal)
library(sp)
library(maptools)
#load data
hs1 <- readOGR(".", "Hollicombe_S1_L1-5_A1.2")
#remove columns we're not interested in
hs1 <- subset(hs1, select = -c(1:16, 23:24))
So I start with hs1, an SPDF with 552 observations of 6 variables...
#check for duplicate location (present if lengths differ)
length(hs1@coords)
[1] 1104
length(unique(hs1@coords))
[1] 730
#duplicates confirmed
hs1.d <- hs1[duplicated(hs1@coords),] # creates new SPDF with only duplicated locations (?)
hs1.u <- hs1[!duplicated(hs1@coords),] # creates new SPDF with only unique locations
# coerce duplicated locations SPDF to an ordinary data frame
hs1.md<- as.data.frame(hs1.d)
# combine the X&Y into a single "location"
hs1.md <- within(hs1.md,
Location <- paste(coords.x1, coords.x2, sep = ","))
# aggregate duplicate locations and calculate a mean value for each
means_by_location<- aggregate (cbind(BioArea,BioVolume,MeanBioHei,MaxBioheig,PerArIn, PerVolIn)~Location, hs1.md, mean)
#split location back to X&Y
lat_long <- strsplit(means_by_location$Location, ",")
means_by_location$coords.x1 <- sapply(lat_long, function(x) x[1]) #adds X data back
means_by_location$coords.x2 <- sapply(lat_long, function(x) x[2])#adds Y data back
means_by_location$coords.x1 <- as.numeric (means_by_location$coords.x1) #converts to numeric
means_by_location$coords.x2 <- as.numeric (means_by_location$coords.x2)#converts to numeric
# add spatial information back in to create SPDF
coordinates(means_by_location) = ~coords.x1+coords.x2 # adds the locations
proj4string(means_by_location) = CRS(proj4string(hs1)) # sets the CRS
# hs1.md as SPDF containing single rows for previously duplicated locations
# with mean values for each variable
hs1.md <- subset(means_by_location, select = -(1))
#merge hs1.md and hs1.u to create new SPDF without duplicates
hs1 <- spRbind (hs1.u, hs1.md)
So hs1 is now an SPDF with 543 observations (i.e. 9 observations have been removed).
But duplicate locations still remain, and the number of unique locations is unchanged:
length(hs1@coords) # total number of locations
[1] 1086
length(unique(hs1@coords)) #number of unique locations
[1] 730
I suspect I have incorrectly separated the unique values from the duplicated observations somewhere, but I don't know R well enough to spot it. Can anyone see where I have gone wrong? Or does anyone know an alternative way to achieve this?
As per my comment, the answer here is a little tricky, because what counts as a duplicate may depend on the precision you are working to.
Loading your shapefile, I can see that each measurement is a line, with a start, end and centre point. The centre appears to match the coordinates given in the shapefile.
Assuming the centres are in fact the coordinates, I would use the new dplyr verbs from the sf package:
library("tidyverse")
library("sf")
hs1 = read_sf(".", "Hollicombe_S1_L1-5_A1")
nrow(hs1)
# 552
nrow(hs1[duplicated(hs1$geometry), ])
# 187
So we have 552 cases, of which 187 are duplicates (i.e. 365 locations). To get the mean of the duplicated locations, use group_by() and summarise():
hs1 = hs1 %>%
  group_by(CentrePos1, CentrePos_) %>%
  summarise(
    BioArea = mean(BioArea),
    BioVolume = mean(BioVolume),
    MeanBioHei = mean(MeanBioHei),
    MaxBioheig = mean(MaxBioheig),
    PerArIn = mean(PerArIn),
    PerVolIn = mean(PerVolIn)
  )
hs1
# Simple feature collection with 365 features and 8 fields
# geometry type: POINT
# dimension: XY
# bbox: xmin: -3.548833 ymin: 50.44483 xmax: -3.542333 ymax: 50.45167
# epsg (SRID): 4326
# proj4string: +proj=longlat +datum=WGS84 +no_defs
# A tibble: 365 x 9
# Groups: CentrePos1 [59]
# CentrePos1 CentrePos_ BioArea BioVolume MeanBioHei MaxBioheig PerArIn PerVolIn geometry
# <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <simple_feature>
# 1 -3.548833 50.44500 0.00000 0.00000 0.192 0.216 -1.000 -1.000 <POINT (-3.54...>
# 2 -3.548833 50.44533 2.27280 0.41470 0.182 0.264 91.410 2.810 <POINT (-3.54...>
# 3 -3.548744 50.44500 6.75470 1.21780 0.180 0.216 74.890 2.210 <POINT (-3.54...>
# 4 -3.548667 50.44506 5.02900 1.14660 0.228 0.228 100.000 3.720 <POINT (-3.54...>
# 5 -3.548667 50.44517 8.24895 1.86555 0.225 0.330 96.550 3.530 <POINT (-3.54...>
# 6 -3.548667 50.44532 10.31200 2.04180 0.198 0.204 100.000 3.210 <POINT (-3.54...>
# 7 -3.548667 50.44536 18.61980 3.67040 0.197 0.276 100.000 3.280 <POINT (-3.54...>
# 8 -3.548667 50.44550 3.31670 0.73700 0.222 0.300 96.150 3.550 <POINT (-3.54...>
# 9 -3.548500 50.44533 6.22370 1.74670 0.269 0.372 81.555 3.470 <POINT (-3.54...>
# 10 -3.548500 50.44550 6.00740 1.00090 0.168 0.234 80.905 2.215 <POINT (-3.54...>
# ... with 355 more rows
You can see that there are 365 rows and no duplicates:
any(duplicated(hs1$geometry))
# FALSE
The new columns hold the means based on the grouping we did above. Where an observation's location was unique, its original value is returned (well, the original value divided by 1, I suppose).
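As an aside, if you are on a recent dplyr (1.0 or later), the summarise step above can be written more compactly with across(); this is just a sketch of an equivalent form, not something the result depends on:
# Equivalent, more compact form of the grouping step (assumes dplyr >= 1.0)
hs1 = hs1 %>%
  group_by(CentrePos1, CentrePos_) %>%
  summarise(across(c(BioArea, BioVolume, MeanBioHei, MaxBioheig,
                     PerArIn, PerVolIn), mean))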
I should point out that sf is replacing sp, rgdal and rgeos in R, but if you do want to keep using those packages, you can convert your sf object to a SpatialPointsDataFrame using as_Spatial():
hs1_data = st_set_geometry(hs1, NULL)        # drop the geometry, keeping only the attribute data
hs1 = as_Spatial(hs1$geometry)               # convert the sfc geometry column to SpatialPoints
hs1 = SpatialPointsDataFrame(hs1, hs1_data)  # reattach the attributes to the points
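A quick sanity check on the converted object would be something like this (a sketch; the class shown is what I would expect rather than output copied from your data):
class(hs1)
# expected: "SpatialPointsDataFrame"
proj4string(hs1) # the CRS should carry over from the sf geometry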