Splitting a string conditional on another column
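I want to split a variable called country, conditional on whether it contains a year (Albania2009 vs. Albania). In addition, if the variable has no year (i.e. Albania), I want to copy the country name into cname and enter the year into cyear manually.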
idstd id xxx id1 country
<dbl+> <dbl> <dbl+lbl> <dbl+lbl> <chr>
1 445801 NA NA 7 Albania2009
2 542384 4616555 1163 7 Albania
3 445801 NA NA 7 Albania2009
4 542384 4616555 1163 7 Albania
I first tried it myself, using the fact that id is NA when the country has a year:
CAmerica0306P$cyear <- NA
CAmerica0306P$cname <- NA
for (i in 1:nrow(df)) {
if (df$id[i]==NA) {
df[i,] <- separate(df, country[i], into = c("cname", "cyear"), -4)
} else {
df$cyear[i,] <- 2001
df$cname[i,] <- df$country[i,]
}
}
But it split everything. After checking Stack Overflow, I tried:
df <- df %>%
extract(country, into=c("cname", "cyear"), regex="^(?=.{1,7}$)([a-zA-Z]+)([0-9].*)$", remove=FALSE)
But it did not fill the cells (they are still NA).
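Two details in the attempts above probably explain this behaviour. A comparison with == NA always yields NA rather than TRUE or FALSE, so is.na() is the usual test. And the lookahead (?=.{1,7}$) only admits strings of at most seven characters, which rules out Albania2009 (11 characters), while Albania passes the length check but has no digits for the second group. A small base-R check, using made-up values rather than the real data and assuming a Perl-compatible regex engine:

```r
# Hypothetical values standing in for the id and country columns
id      <- c(NA, 4616555)
country <- c("Albania2009", "Albania")

# `== NA` never gives TRUE/FALSE; is.na() does
eq_na <- id == NA        # NA NA
is_na <- is.na(id)       # TRUE FALSE

# The pattern from the extract() call, tested with perl = TRUE
pattern <- "^(?=.{1,7}$)([a-zA-Z]+)([0-9].*)$"
with_lookahead <- grepl(pattern, country, perl = TRUE)   # FALSE FALSE

# Without the length lookahead, the string that carries a year matches
without_lookahead <- grepl("^([a-zA-Z]+)([0-9]+)$", country)  # TRUE FALSE
```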
Desired output:
idstd id xxx id1 country cyear cname
<dbl+> <dbl> <dbl+lbl> <dbl+lbl> <chr> <dbl>
1 445801 NA NA 7 Albania 2009 Albania
2 542384 4616555 1163 7 Albania 2001 Albania
3 445801 NA NA 7 Albania 2009 Albania
4 542384 4616555 1163 7 Albania 2001 Albania
Any suggestions?
Sample data (you should provide ready-to-use data):
df1<-
data.frame(country = I(paste0("Albania",c("",2007:2012,""))) )
Code:
df1$cname <- sub("\\d+$", "", df1$country)      # remove the digits at the end
df1$cyear <- gsub("[^0-9]", "", df1$country)    # remove everything that is not a digit
df1$cyear[df1$cyear == ""] <- 2001              # where no year is present, insert 2001
df1$country <- df1$cname
Result:
# country cname cyear
#1 Albania Albania 2001
#2 Albania Albania 2007
#3 Albania Albania 2008
#4 Albania Albania 2009
#5 Albania Albania 2010
#6 Albania Albania 2011
#7 Albania Albania 2012
#8 Albania Albania 2001
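The same idea can be wrapped into a small helper that also keeps cyear numeric, since gsub() returns character strings. This is only a sketch: the function name split_country_year and its default_year argument are invented here for illustration, and it assumes that the year, when present, is a run of digits at the very end of the string.

```r
# Hypothetical helper: split "Albania2009" into name and year,
# falling back to a default year when no trailing digits are present.
split_country_year <- function(x, default_year = 2001) {
  x <- as.character(x)
  m <- regexpr("[0-9]+$", x)               # position of trailing digits, -1 if none
  has_year <- m != -1
  cname <- sub("[0-9]+$", "", x)           # drop the trailing digits, if any
  cyear <- rep(default_year, length(x))    # start from the default everywhere
  # regmatches() returns only the matched pieces, in order
  cyear[has_year] <- as.numeric(regmatches(x, m))
  data.frame(cname = cname, cyear = cyear, stringsAsFactors = FALSE)
}

df1 <- data.frame(country = paste0("Albania", c("", 2007:2012, "")),
                  stringsAsFactors = FALSE)
df1 <- cbind(df1, split_country_year(df1$country))
df1$country <- df1$cname
```

Because the split is vectorized, no loop over rows or test on the id column is needed; the presence of trailing digits in country decides which rows receive the default year.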