Eliminating more than one succeeding instance of a string down cells in R
I'm fairly new to R. I have a data frame with 5 million observations and 1 variable that looks like this:
PMID- 28524368
PMID- 28504342
PMID- 28501042
RN - 4964P6T9RB (Aldosterone)
RN - EC 3.4.23.15 (Renin)
RN - RWP5GA015D (Potassium)
MH - Adrenal Cortex Neoplasms/*diagnostic imaging/pathology/surgery
MH - Adrenocortical Adenoma/*diagnostic imaging/pathology/surgery
MH - Aldosterone/blood
MH - Humans
PMID- 28523858
PMID- 28517030
PMID- 28513869
MH - Hyperaldosteronism/*complications
MH - Hypertension/*etiology
MH - Male
MH - Middle Aged
MH - Potassium/blood
PMID- 28494487
PMID- 28493475
MH - Renin/blood
MH - Tomography, X-Ray Computed
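However, I want to keep only one PMID from each run of consecutive PMIDs, namely the first; the remaining PMIDs should be deleted, so that the data frame looks like this: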
PMID- 28524368
RN - 4964P6T9RB (Aldosterone)
RN - EC 3.4.23.15 (Renin)
RN - RWP5GA015D (Potassium)
MH - Adrenal Cortex Neoplasms/*diagnostic imaging/pathology/surgery
MH - Adrenocortical Adenoma/*diagnostic imaging/pathology/surgery
MH - Aldosterone/blood
MH - Humans
PMID- 28523858
MH - Hyperaldosteronism/*complications
MH - Hypertension/*etiology
MH - Male
MH - Middle Aged
MH - Potassium/blood
PMID- 28494487
MH - Renin/blood
MH - Tomography, X-Ray Computed
Please advise. I tried using:
# remove excessive PMIDs
for (i in nrow(original_reduced))
{
  if (substr(original_reduced[i, 1], 1, 4) == "PMID")
  {
    if (substr(original_reduced[i+1, 1], 1, 4) == "PMID" && i != nrow(original_reduced)) # if next row is also PMID
    {
      original_reduced <- original_reduced[-c(i+1), ] # delete entry after
    }
  }
}
But I get this error:
Error in if (substr(original_reduced[i + 1, 1], 1, 4) == "PMID") { : missing value where TRUE/FALSE needed
even though my data frame contains no NAs.
Thanks.
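One plausible source of the error, sketched here rather than confirmed against the asker's data: `for (i in nrow(x))` loops over a single value (the row count), and indexing one row past the end of a data frame returns `NA`, which `if` cannot evaluate. The toy data frame `df` and the variable names below are made up for illustration.

```r
# Toy data frame standing in for original_reduced (hypothetical data)
df <- data.frame(V1 = c("PMID- 1", "MH - a"), stringsAsFactors = FALSE)

# nrow(df) is a single number, so this loop body runs exactly once (with i = 2)
iterations <- 0
for (i in nrow(df)) iterations <- iterations + 1

# Indexing one row past the end yields NA, not an error
past_end <- df[nrow(df) + 1, 1]

# substr() of NA is NA, and comparing NA with a string is also NA
cond <- substr(past_end, 1, 4) == "PMID"

# if (NA) raises the error quoted in the question; capture its message
msg <- tryCatch(if (cond) "yes" else "no",
                error = function(e) conditionMessage(e))
print(msg)
```

So the message can appear even when the data contain no NA values: the NA comes from the out-of-range index. Looping with `seq_len(nrow(x))` visits every row, although deleting rows inside the loop still shifts the indices, which is why the answers avoid a loop.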
Try this:
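library(dplyr)
# Assumes the single column is called `name`.
# Flag PMID rows, number the rows within each run of equal flags,
# then keep every non-PMID row plus the first PMID row of each run.
df %>%
  mutate(is_pmid = grepl('^PMID', name),
         number = sequence(rle(is_pmid)[['lengths']])) %>%
  filter(!is_pmid | number == 1) %>%
  select(name)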
Here is a working solution. See the comments for an explanation of the code.
df<-structure(list(V1 = c("PMID- 28524368", "PMID- 28504342", "PMID- 28501042",
"RN - 4964P6T9RB (Aldosterone)", "RN - EC 3.4.23.15 (Renin)",
"RN - RWP5GA015D (Potassium)", "MH - Adrenal Cortex Neoplasms/*diagnostic imaging/pathology/surgery",
"MH - Adrenocortical Adenoma/*diagnostic imaging/pathology/surgery",
"MH - Aldosterone/blood", "MH - Humans", "PMID- 28523858", "PMID- 28517030",
"PMID- 28513869", "MH - Hyperaldosteronism/*complications", "MH - Hypertension/*etiology",
"MH - Male", "MH - Middle Aged", "MH - Potassium/blood", "PMID- 28494487",
"PMID- 28493475", "MH - Renin/blood", "MH - Tomography, X-Ray Computed"
)), .Names = "V1", row.names = c(NA, -22L), class = "data.frame")
library(dplyr)
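# Flag PMID rows
df$pmid <- grepl("^PMID", df$V1)
# TRUE where a row's flag equals the previous row's flag (NA for the first row)
matches <- df$pmid == lag(df$pmid)
# Rows that are PMID rows and directly follow another PMID row
toremove <- which(matches & df$pmid)
# Remove those rows; skip when nothing matched, because df[-integer(0), ] would drop every row
if (length(toremove) > 0) df <- df[-toremove, ]
df$pmid <- NULL  # remove the helper column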
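With 5 million rows, the same idea can also be written without dplyr. The following is a minimal base R sketch, not taken from either answer; the helper name `keep_first_pmid` and the toy vector `x` are made up for illustration.

```r
# Hypothetical helper: keep only the first row of each run of consecutive PMID rows
keep_first_pmid <- function(x) {
  is_pmid <- grepl("^PMID", x)                   # TRUE for PMID rows
  prev_is_pmid <- c(FALSE, head(is_pmid, -1))    # the same flag, shifted down one row
  x[!(is_pmid & prev_is_pmid)]                   # drop PMID rows that follow a PMID row
}

# Toy input with two runs of consecutive PMID rows
x <- c("PMID- 1", "PMID- 2", "RN - a", "MH - b", "PMID- 3", "PMID- 4", "MH - c")
result <- keep_first_pmid(x)
print(result)
```

On the data frame from this answer it could be applied as `df <- data.frame(V1 = keep_first_pmid(df$V1), stringsAsFactors = FALSE)`. Everything is vectorized, so no rows are deleted one at a time.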