R: Creating a new data frame from a text list using gsub and grep
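How can I extract the important information (name, age, illness in parentheses, and weight) from a list of text strings and create a new list or data frame?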
test<-c("James is approximately age 25 & 26 weighted 130lbs",
"Angelina is age 40 (Diabetes)",
"Harry Peterson is male with ages 27")
With the following, I was able to extract the name and the illness inside the parentheses.
> sapply(strsplit(test, "\\s+"),"[",1)
[1] "James" "Angelina" "Harry"
> gsub("[\\(\\)]","\\1", regmatches(test, gregexpr("(?<=\\().*?(?=\\))",test, perl=T)))
[1] "character0" "Diabetes" "character0"
However, I could not extract both ages 25 and 26, nor match 'ages':
> paste(grep(pattern="age", trimws(strsplit(test, " ")[[1]]), value = TRUE),as.numeric(sub(".*age.(\\d+).*", "\\1", test[[1]])) )
[1] "age 25"
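How can I extract all the numbers and symbols from the text, such as "age 25 & 26"?
How can I set up a pattern that matches both "age" and "ages"? For example, ages 27 -> "age 27", weighted -> "weight 130".
How can I grep all of the information in the order below, rather than subsetting name, age, weight, and the () content separately?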
c("James","age 25 & 26", "weight 130", "Angelina","age 40", "Diabetes", "Harry", "age 27")
and finally create a data frame like the one below:
age weight illness
James "25 & 26" "130" NA
Angelina "40" NA "Diabetes"
Harry "27" NA NA
Even a partial answer would be helpful.
Thank you.
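# Name: drop everything from the first space onward
gsub(" .*", "", test)
# [1] "James" "Angelina" "Harry"
# Age: match "age" or "ages" plus the digits, "&", "-" and spaces that follow, then strip the keyword
trimws(gsub("ages?", "", regmatches(test, gregexpr("ages?\\s*[-&0-9 ]+\\b", test, perl = TRUE))))
# [1] "25 & 26" "40" "27"
# Weight: match "weight", "weights" or "weighted" plus a number and an optional unit; NA when absent
weights <- regmatches(test, gregexpr("weight(s|ed)? [0-9]+(lb|pound|kg|g)?", test))
weights[lengths(weights) < 1] <- NA_character_
trimws(gsub("weight(s|ed)?", "", unlist(weights)))
# [1] "130lb" NA NA
# Illness: the text between parentheses; NA when absent
ill <- regmatches(test, gregexpr("(?<=\\().*(?=\\))", test, perl = TRUE))
ill[lengths(ill) < 1] <- NA_character_
unlist(ill)
# [1] NA "Diabetes" NA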
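The pieces above can be combined into the data frame the question asks for. What follows is only a minimal sketch under a few assumptions: `first_match` is a hypothetical helper (not part of base R) that returns the first regex hit per string or NA, the age pattern only allows numbers joined by "&", and the weight unit is dropped so that the column reads "130" as in the desired output.

```r
test <- c("James is approximately age 25 & 26 weighted 130lbs",
          "Angelina is age 40 (Diabetes)",
          "Harry Peterson is male with ages 27")

# Hypothetical helper: first match of `pattern` in each string, NA if none
first_match <- function(x, pattern) {
  hits <- regmatches(x, gregexpr(pattern, x, perl = TRUE))
  vapply(hits, function(h) if (length(h)) h[1] else NA_character_, character(1))
}

# Name: first word of each string
name <- sub(" .*", "", test)

# Age: "age"/"ages" followed by one or more numbers joined by "&"; keyword stripped
age <- sub("ages?\\s*", "",
           first_match(test, "ages?\\s*[0-9]+(\\s*&\\s*[0-9]+)*"),
           perl = TRUE)

# Weight: number after "weight", "weights" or "weighted"; unit deliberately ignored
weight <- sub("weight(s|ed)?\\s*", "",
              first_match(test, "weight(s|ed)?\\s*[0-9]+"),
              perl = TRUE)

# Illness: text between the first "(" and the next ")"
illness <- first_match(test, "(?<=\\().*?(?=\\))")

df <- data.frame(age = age, weight = weight, illness = illness,
                 row.names = name, stringsAsFactors = FALSE)
df
```

Using the names as row names assumes the first words are unique; if they may repeat, keeping `name` as an ordinary column would be the safer choice.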